Circuit breakers and downgrades

Not far from Little Eyes’ house there is a deli. Its two windows always have long queues: at one, a chef seasons the cold dishes you have picked; at the other, a chef debones the braised chicken you have bought on the spot. The normal customer flow is roughly: pick the dishes, pay, queue at a window for seasoning or deboning, then take the food home.

On a hot summer day, inviting a few friends over for a beer and some bragging is hard to beat. Apparently everyone agrees with Little Eyes, because the little shop keeps getting busier. One day Little Eyes picked his dishes, paid, and was about to queue for the chefs to season and debone them. The queue at each window was at least 20 minutes long, and his friends kept urging him to hurry, so he gave up on the queue and took the food straight home. The flow became: pick the dishes, pay, take the food home.

When a downstream service (seasoning, deboning) suddenly becomes unavailable or responds too slowly for some reason (buying the dishes takes 3 minutes, but the queue takes half an hour), the upstream service stops calling it in order to protect its own overall availability (it cannot afford to wait) and returns immediately, releasing resources quickly. Calls resume once the target service recovers. This is called service circuit breaking.

Because the queue was too long, Little Eyes decisively skipped the follow-up steps and served “lower quality” dishes. This is called service degradation.
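The two ideas can be sketched together in a few lines of Java. This is only an illustration under assumptions: `DegradeDemo`, `callOrDegrade`, and the `downstreamAvailable` flag are hypothetical names, and how availability is decided is left open here.

```java
import java.util.function.Supplier;

/** Hypothetical illustration: skip a slow downstream step and return a simpler result. */
final class DegradeDemo {

    /**
     * Calls the downstream step only when it is considered available.
     * Otherwise the fallback (the "lower quality" dish) is returned at once.
     */
    static <T> T callOrDegrade(boolean downstreamAvailable,
                               Supplier<T> downstream,
                               T fallback) {
        if (!downstreamAvailable) {
            // Circuit open: do not wait in the queue
            return fallback;
        }
        try {
            // Normal flow: the downstream step runs
            return downstream.get();
        } catch (RuntimeException e) {
            // A failed call is degraded as well
            return fallback;
        }
    }
}
```

A caller might write `DegradeDemo.callOrDegrade(available, () -> seasonedDish(), plainDish)`, where `seasonedDish()` and `plainDish` are likewise made-up names.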

Ways to implement circuit breaking

There are many ways to degrade a service, such as rate limiting, manual switches, and circuit breaking; circuit breaking is one type of degradation.
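Of the techniques listed, rate limiting is the easiest to show in isolation. The following is a minimal sketch of a fixed-window limiter, not the approach of Hystrix or Sentinel. The class `FixedWindowLimiter` and its parameters are hypothetical, and the clock value is passed in explicitly so the behaviour stays deterministic.

```java
/** Hypothetical fixed-window rate limiter: at most `limit` calls per window. */
final class FixedWindowLimiter {
    private final long limit;
    private final long windowMillis;
    private long windowStart; // start of the current window, in milliseconds
    private long count;       // calls admitted in the current window

    FixedWindowLimiter(long limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the call is admitted, false if it should be degraded. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            // A new window begins: reset the counter
            windowStart = nowMillis;
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false;
    }
}
```

In real use the caller would pass `System.currentTimeMillis()` as `nowMillis` and return a fallback whenever `tryAcquire` yields false.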

Circuit breaking. The Hystrix circuit-breaking and degradation library is available in Spring Cloud, and Alibaba's open-source Sentinel can also provide circuit breaking and degradation in distributed projects. Both require introducing a third-party component and understanding how it works, which makes them a poor fit for simple scenarios.

Using a handwritten circuit breaker

This article introduces a circuit breaker suitable for simple applications, with no more than 100 lines of core code. It is used roughly as follows:

// Initialize a circuit breaker: failure-rate threshold 0.1, minimum of 10 calls,
// enabled, named "serviceDemo"
private CircuitBreaker breaker = new CircuitBreaker(0.1, 10, true, "serviceDemo");

public void doSomething() {
    // Check the service status on each invocation
    breaker.checkStatus();
    // isWorked() returns true when the service is available, so continue with the logic
    if (breaker.isWorked()) {
        try {
            service.doSomething();
        } catch (Exception e) {
            e.printStackTrace();
            // Record the failed invocation
            breaker.addFailTimes();
        } finally {
            // Count every call, whether it succeeded or failed
            breaker.addInvokeTimes();
        }
    }
    // Otherwise the service is unavailable and the call is skipped
}

In this pseudo-code, the circuit breaker does three things:

  1. Check the service status and output statistics logs

  2. Return the service state via breaker.isWorked()

  3. Record the number of calls and failures as the basis for tripping
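Because every call site repeats the same check, try, and count sequence, it can be wrapped once. The sketch below is an assumption, not part of the original code: `Breaker` is a hypothetical interface that only mirrors the four methods used in the snippet above, and `BreakerTemplate.call` is a made-up helper name.

```java
import java.util.function.Supplier;

/** Hypothetical interface exposing the four calls used in the snippet above. */
interface Breaker {
    void checkStatus();
    boolean isWorked();
    void addFailTimes();
    void addInvokeTimes();
}

/** Hypothetical helper that applies the check / try / count sequence once. */
final class BreakerTemplate {

    static <T> T call(Breaker breaker, Supplier<T> action, T fallback) {
        // 1. Refresh the breaker state (and statistics) on every invocation
        breaker.checkStatus();
        // 2. Skip the call entirely while the breaker is open
        if (!breaker.isWorked()) {
            return fallback;
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            // 3a. Record the failure and degrade to the fallback
            breaker.addFailTimes();
            return fallback;
        } finally {
            // 3b. Count every call that was actually attempted
            breaker.addInvokeTimes();
        }
    }
}
```

Unlike the snippet above, this variant returns a fallback value instead of printing the stack trace; that is a design choice of the sketch, not something the original prescribes.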

Implementing the circuit breaker

The circuit breaker is implemented as follows:

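import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Assumption: the logger is SLF4J, which matches LoggerFactory.getLogger(getClass())
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CircuitBreaker {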
  /** Number of failed calls in the current statistics window */
  private AtomicLong failTimes =
          new AtomicLong(0);
  /** Number of calls in the current statistics window */
  private AtomicLong invokeTimes =
          new AtomicLong(0);
  /** Degradation threshold, such as 0.1: ratio of failed requests to total requests */
  private double failedRate = 0.1;
  /**
   * The threshold check runs only when the total number of requests is greater than this value.
   * For example, if set to 10, the check runs only once more than 10 requests have been counted.
   */
  private double minTimes;
  /** Circuit breaker switch, off by default */
  private boolean enabled;
  /** Whether to send an email alarm */
  private boolean mail;
  /** Whether to send an SMS alarm after the circuit breaker trips */
  private boolean sms;
  /** Circuit breaker name */
  private String name;
  /** Timestamp of the last statistics run, in minutes */
  private AtomicLong currentTime =
          new AtomicLong(
             System.currentTimeMillis() / 60000);
  /** Whether the service is currently unavailable */
  private AtomicBoolean isFailed =
          new AtomicBoolean(false);
  /** Per-thread flag recording that the service is down */
  private ThreadLocal<Boolean> fail =
          new ThreadLocal<Boolean>();


  private Logger log =
          LoggerFactory.getLogger(getClass());


  /**
   * Constructs a circuit breaker.
   *
   * @param failedRate circuit-breaking threshold: number of failed requests / total number of requests
   * @param minTimes   minimum condition for circuit breaking: the failure rate is evaluated,
   *                   and degradation applied, only when the total number of requests exceeds this value
   * @param enabled    whether circuit breaking is enabled
   * @param name       circuit breaker name
   */
  public CircuitBreaker(double failedRate,
                        double minTimes,
                        boolean enabled,
                        String name) {
    fail.set(false);
    this.failedRate = failedRate;
    this.minTimes = minTimes;
    this.enabled = enabled;
    this.name = name;
  }

  /**
   * Checks whether the service is in the failed state.
   *
   * @return true if the circuit breaker is currently open
   */
  public boolean isFailed() {
    return isFailed.get();
  }

  /** Increases the number of failed calls */
  public void addFailTimes() {
    fail.set(true);
    if (enabled) {
      failTimes.incrementAndGet();
    }
  }

  /** Increases the number of calls */
  public void addInvokeTimes() {
    if (enabled) {
      invokeTimes.incrementAndGet();
    }
  }

  /**
   * Checks whether the service is available.
   *
   * @return true if the call should go ahead
   */
  public boolean isWorked() {
    if (!enabled) {
      return true;
    }
    // Sacrifice 1% of traffic for probe requests when the service is unavailable
    if (isFailed.get() &&
        System.currentTimeMillis() % 100 == 0) {
      return true;
    }
    if (isFailed.get()) {
      fail.set(true);
      return false;
    }
    return true;
  }

  /** Evaluates the statistics once a new minute has started and updates the breaker state */
  public void checkStatus() {
    if (!enabled) {
      return;
    }
    long newTime =
       System.currentTimeMillis() / 60000;
    if ((newTime > currentTime.get())
       && (invokeTimes.get() > minTimes)) {

      double percent =
              failTimes.get() * 1.0 /
                      invokeTimes.get();

      if (percent > failedRate) {
        if (isFailed.get()) {
          // Still failing: log output
          if (mail) {
            // Send an email notification
          }
        } else {
          // Newly failed: log output
          isFailed.set(true);
          if (sms) {
            // Send SMS notification
          }
          if (mail) {
            // Send an email notification
          }
        }
      } else { // The service is restored
        if (isFailed.get()) {
          // Log output
          if (sms) {
            // Send SMS notification
          }
          if (mail) {
            // Send an email notification
          }
        }
        isFailed.set(false);
      }
      if (log.isInfoEnabled()) {
        // Log output
      }
      // Start a new statistics window
      currentTime.set(newTime);
      failTimes.set(0);
      invokeTimes.set(0);
    }
  }
}

General idea:

  1. If the proportion of failed requests exceeds the threshold, the circuit breaker trips

  2. Statistics are collected per minute (the threshold must be reached by the statistics gathered within one minute)

  3. If the total number of requests does not exceed minTimes within a minute, the breaker does not trip (the request frequency is too low for the statistics to be meaningful)

  4. Even after the breaker has tripped, 1% (configurable) of requests are still let through as probes

    isFailed.get()&&System.currentTimeMillis() % 100 == 0
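The decision rules in this list can be isolated as two small pure functions, which makes the arithmetic easy to check by hand. This is a sketch that mirrors the comparisons in `checkStatus()` and `isWorked()`; the class `TripRule` and its method names are hypothetical.

```java
/** Hypothetical helper that isolates the trip rule and the probe rule. */
final class TripRule {

    /**
     * Rules 1 and 3: trip only when more than minTimes calls were counted
     * and the failure ratio is strictly above failedRate.
     */
    static boolean shouldTrip(long failures, long invocations,
                              double failedRate, double minTimes) {
        if (invocations <= minTimes) {
            // Too few requests: the statistics are not meaningful
            return false;
        }
        return failures * 1.0 / invocations > failedRate;
    }

    /** Rule 4: while tripped, a call is a probe when the millisecond clock is a multiple of 100. */
    static boolean isProbe(long nowMillis) {
        return nowMillis % 100 == 0;
    }
}
```

With a threshold of 0.1 and minTimes of 10, 3 failures out of 20 calls give a ratio of 0.15 and trip the breaker, while 1 failure out of 5 calls does not, because 5 calls are too few to judge. Note that the probe rule admits every call arriving in a matching millisecond, so the share of probes is only roughly 1%.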

Advantages and disadvantages

Hystrix provides a range of service protection features, such as circuit breaking and thread isolation. The handwritten circuit breaker only offers circuit breaking that the caller must invoke manually.

Hystrix provides both thread-pool and semaphore isolation. The handwritten circuit breaker does one thing only: it decides from statistics alone, and its minute-level granularity is fairly coarse.

Hystrix uses command-style programming with registered callbacks, so code complexity is high. The handwritten circuit breaker intrudes on the business code, but it is procedural and cheap to understand.

After removing comments and blank lines, the circuit-breaking feature takes fewer than 100 lines of code. Although it has many shortcomings in large-scale service scenarios, I hope it at least offers a useful idea.
