Issues arising from architectural evolution

In a traditional client-server (CS) architecture, a server may block requests because of failures or other problems, so some clients' requests go unanswered and, after a while, a group of users goes unserved. The scope of that impact is limited and predictable. In a microservice system, however, your service may depend on a number of other services, which in turn depend on yet more services. In that situation, congestion in a single downstream service can, within seconds, cascade into resource exhaustion along the whole call chain with disastrous consequences. We call this a "service avalanche".

Several ways to solve the problem

  1. Circuit breaker (fuse) mode: As the name suggests, it works like a household circuit: if a line is overloaded, the fuse blows and prevents a fire. In a system that uses the circuit breaker pattern, if calls to an upstream service become slow or time out in large numbers, the system stops calling that service and returns immediately, releasing resources quickly. Calls resume only after the upstream service recovers.
  1. Isolation mode: Calls to different resources or services are divided into separate request pools, so exhausting the resources of one pool does not affect requests to other resources, and a single point of failure cannot consume everything. This is a very traditional disaster recovery design.
  1. Rate limiting mode: Circuit breaking and isolation are after-the-fact remedies; rate limiting reduces the probability of a problem before it occurs. In this mode you set a maximum QPS threshold for certain service requests; requests above the threshold are returned immediately and never occupy resources (see the sketch after this list). Rate limiting alone, however, does not prevent service avalanches, which are usually caused not by the sheer number of requests but by amplification across multiple cascading layers.
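
The sketch below illustrates the rate limiting idea with a minimal fixed-window limiter; the RateLimiter class, the 100 QPS value, and the handleRequest helper are illustrative assumptions, not part of the original article. Requests over the threshold are rejected immediately and never occupy downstream resources.

class RateLimiter {
  constructor(maxQps) {
    this.maxQps = maxQps;
    this.windowStart = Date.now();
    this.count = 0;
  }

  tryAcquire() {
    const now = Date.now();
    // Start a new one-second window once the current one has elapsed.
    if (now - this.windowStart >= 1000) {
      this.windowStart = now;
      this.count = 0;
    }
    // Over the threshold: reject immediately without doing any work.
    if (this.count >= this.maxQps) {
      return false;
    }
    this.count += 1;
    return true;
  }
}

// Reject requests above 100 QPS before they ever reach the downstream service.
const limiter = new RateLimiter(100);
function handleRequest(req) {
  if (!limiter.tryAcquire()) {
    return { status: 429, data: 'rate limit exceeded' };
  }
  // ... otherwise do the real work
}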

Mechanism and implementation of circuit breakers

A circuit breaker gives us a layer of assurance: when we call services or resources that are unstable or likely to fail, the breaker monitors the errors and, once a threshold is reached, fails requests fast, preventing excessive resource consumption. The circuit breaker can also detect and recover the service state automatically: when the upstream service comes back, the breaker notices and resumes normal requests.

Let's first look at a request flow with no circuit breaker: the user relies on ServiceA, and ServiceA relies on the service provided by ServiceB. Suppose ServiceB fails at this point, and for the next 10 minutes every request to it is delayed by 10 seconds before responding.

Now suppose N users request ServiceA. Within seconds, ServiceA's resources are exhausted because its requests to ServiceB are hanging, so it rejects any further requests from users. From the users' point of view, ServiceA and ServiceB have failed at the same time, and the whole service chain has collapsed.

What happens when we install a circuit breaker in ServiceA?

  1. When the number of failed requests reaches a certain threshold, the breaker concludes that requests to ServiceB are failing. ServiceA then stops calling ServiceB and instead returns the failure immediately or falls back to backup data. The circuit breaker is now open.
  2. After a period of time, the circuit breaker periodically checks whether ServiceB has recovered. At this point, the circuit breaker is half-open.
  3. Once ServiceB has recovered, the circuit breaker closes again, and ServiceA calls ServiceB normally and returns the result.

The circuit breaker state diagram is as follows:

Thus, several core points of the circuit breaker are as follows:

  1. Timeout: how long a request may take before it is counted as a failure
  2. Failure threshold: the number of failures that must occur before the circuit breaker opens the circuit
  3. Retry timeout: how long the breaker stays open before it retries a request, i.e. enters the half-open state

With this knowledge, we can try to create a circuit breaker:

// axios is used later, in the call() method, to make the actual HTTP request
const axios = require('axios');

class CircuitBreaker {
  constructor(timeout, failureThreshold, retryTimePeriod) {
    // We start in a closed state, hoping that everything is fine
    this.state = 'CLOSED';
    // Number of failures from the dependent service before we change the state to 'OPEN'
    this.failureThreshold = failureThreshold;
    // Timeout for the API request
    this.timeout = timeout;
    // Time period after which a fresh request is made to the dependent
    // service to check whether it is up again
    this.retryTimePeriod = retryTimePeriod;
    this.lastFailureTime = null;
    this.failureCount = 0;
  }
}

Next, build the circuit breaker's state machine in the call method:

async call(urlToCall) {
  // Determine the current state of the circuit.
  this.setState();
  switch (this.state) {
    case 'OPEN':
      // Return a cached (stale) response while the circuit is in the OPEN state.
      return { data: 'this is stale response' };
    // Make the API request if the circuit is not OPEN.
    case 'HALF-OPEN':
    case 'CLOSED':
      try {
        const response = await axios({
          url: urlToCall,
          timeout: this.timeout,
          method: 'get'
        });
        // Yay!! The API responded fine. Let's reset everything.
        this.reset();
        return response;
      } catch (err) {
        // Uh-oh!! The call still failed. Let's update that in our records.
        this.recordFailure();
        throw new Error(err);
      }
    default:
      console.log('This state should never be reached');
      return 'unexpected state in the state machine';
  }
}

Fill in the remaining helper methods:

  // Reset all the parameters back to the initial (CLOSED) state after a successful call.
  reset() {
    this.failureCount = 0;
    this.lastFailureTime = null;
    this.state = 'CLOSED';
  }

  // Set the current state of our circuit breaker.
  setState() {
    if (this.failureCount > this.failureThreshold) {
      if ((Date.now() - this.lastFailureTime) > this.retryTimePeriod) {
        this.state = 'HALF-OPEN';
      } else {
        this.state = 'OPEN';
      }
    } else {
      this.state = 'CLOSED';
    }
  }

  // Record a failed call so that setState() can decide when to open the circuit.
  recordFailure() {
    this.failureCount += 1;
    this.lastFailureTime = Date.now();
  }

To use the circuit breaker, simply wrap the request in the call method of a circuit breaker instance:

const circuitBreaker = new CircuitBreaker(3000, 5, 2000);

const response = await circuitBreaker.call('http://0.0.0.0:8000/flakycall');
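
To exercise the breaker locally, the flaky endpoint referenced above can be simulated. Below is a minimal sketch using Express; the route, port, failure rate, and delay are illustrative assumptions, not part of the original article.

// Hypothetical flaky upstream service used only for testing the breaker.
const express = require('express');
const app = express();

app.get('/flakycall', (req, res) => {
  if (Math.random() < 0.8) {
    // Simulate a slow, failing upstream: respond with an error after 5 seconds.
    setTimeout(() => res.status(500).send('upstream error'), 5000);
  } else {
    res.json({ data: 'fresh response' });
  }
});

app.listen(8000);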

Mature Node.js circuit breaker library

Red Hat maintains a full-fledged Node.js circuit breaker implementation called Opossum. For distributed systems, using this library can greatly improve the fault tolerance of your services and go a long way toward preventing service avalanches.
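
As a rough sketch of how Opossum is typically used (the option values and URL below are illustrative; see the library's documentation for the full API):

const CircuitBreaker = require('opossum');
const axios = require('axios');

// The protected action: any function that returns a Promise.
const breaker = new CircuitBreaker((url) => axios.get(url), {
  timeout: 3000,                 // consider the call failed if it takes longer than 3s
  errorThresholdPercentage: 50,  // open the circuit when 50% of calls fail
  resetTimeout: 2000             // after 2s, move to half-open and try the call again
});

// Data returned while the circuit is open, instead of calling the failing service.
breaker.fallback(() => ({ data: 'this is stale response' }));

const response = await breaker.fire('http://0.0.0.0:8000/flakycall');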

Author: ES2049 / In search of the Singularity

The article can be reproduced at will, but please keep this link to the original text. You are welcome to join ES2049 Studio. Please send your resume to [email protected]