This post is part of phase 3 of the Nuggets Creator Boot Camp: "write" to make a personal impact.

📖 Preface

Marquez wrote in One Hundred Years of Solitude:

"The past is a lie; memory is a road of no return. No former spring can be restored, and even the wildest and most persevering love is, in the end, only a fleeting reality. Eternal is solitude."

Let me start by saying that for a long time I conflated service circuit breaking and service degradation, treating them as the same thing!


✨ Why did I have this misunderstanding?

When Service A calls Service B and failures repeatedly reach a certain threshold, Service A stops calling Service B and instead executes a local degrade method. In Spring Cloud, this mechanism, combined with Hystrix, is called "circuit breaking and degradation"!

After all, circuit breaking and degradation happen at the same time, and the concepts are so similar! If you have a different opinion, please flame gently!


🚀 Service avalanche

What is an avalanche? Once it starts, it is unstoppable; it is a kind of disaster!

Let's take an example.

Suppose there are three services that depend on each other in a chain from A to C:

Me :” I’m ServiceA!”

You :” I’m ServiceB!”

He :” I’m ServiceC!”

All of a sudden, the traffic to Service A fluctuates wildly and often spikes. In this case, even if Service A can withstand the requests, Service B and Service C may not be able to withstand the sudden load. (Because their capacities differ! Cheeky, I know.)

Suddenly, Service C becomes unavailable because it cannot keep up with the requests. Naturally, Service B's requests to it block and slowly exhaust Service B's thread resources, so Service B also becomes unavailable. Then Service A becomes unavailable too.

Let's draw a service avalanche request sequence diagram (if you want to learn how to draw diagrams in markdown, please leave a comment):

    sequenceDiagram
        Title: Service avalanche request timing diagram
        participant ServiceA
        participant ServiceB
        participant ServiceC
        ServiceA->>ServiceB: ServiceA initiates a request
        ServiceB->>ServiceC: ServiceB initiates a request
        Note right of ServiceC: ServiceC finds itself overwhelmed
        ServiceC->>ServiceB: fails to respond and returns
        ServiceB->>ServiceA: fails to respond and returns
        ServiceA->>ServiceB: ServiceA tries again
        ServiceB->>ServiceC: ServiceB tries again
        ServiceC->>ServiceB: fails to respond and returns
        ServiceB->>ServiceA: fails to respond and returns
        Note right of ServiceA: fails after repeated attempts

The failure of one service causing the failure of the whole call chain is called a service avalanche.


💕 Service circuit breaking and service degradation can both be seen as solutions to the service avalanche.


🚀 Service circuit breaking

So, what is service circuit breaking?

Service circuit breaking: when a downstream service becomes unavailable or responds slowly for some reason, the upstream service stops calling it and returns directly, releasing resources to protect overall availability. Calls resume once the downstream service recovers.

Note that circuit breaking is usually handled at the framework level, and the industry basically follows the circuit breaker design shown in the state transition diagram given by Martin Fowler:

  • Start in the closed state; once errors reach a certain threshold, switch to the open state;
  • After a reset timeout elapses, move to the half-open state;
  • Let a portion of requests through to the backend; once they are detected to succeed, return to the closed state, i.e., the service has recovered.
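
The three-state transition above can be sketched in plain Java. This is a minimal illustration of the mechanism, not how Hystrix is actually implemented; all names (`CircuitBreaker`, `allowRequest`, the injectable clock) are mine.

```java
import java.util.function.LongSupplier;

// Minimal circuit-breaker state machine: CLOSED -> OPEN when failures
// reach a threshold, OPEN -> HALF_OPEN after a reset timeout,
// HALF_OPEN -> CLOSED again once a trial request succeeds.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMillis;
    private final LongSupplier clock;   // injectable time source, for testing
    private int failureCount = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    CircuitBreaker(int failureThreshold, long resetTimeoutMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    // Should the caller be allowed to hit the remote service right now?
    synchronized boolean allowRequest() {
        if (state == State.OPEN
                && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
            state = State.HALF_OPEN;    // timeout elapsed: let a trial through
        }
        return state != State.OPEN;     // open = fail fast, no remote call
    }

    synchronized void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED;           // trial succeeded: service recovered
    }

    synchronized void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;         // trip the breaker
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() { return state; }
}
```

While open, `allowRequest()` returns false immediately, which is exactly the "fail fast without calling the remote service" behaviour described below.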

There are many popular circuit breaker implementations in the industry, for example Alibaba's Sentinel, and the most widely used, Hystrix (which is what this blogger uses in a little hand-rolled service).

In Hystrix, the configuration is as follows:

    // Minimum number of requests in the rolling window; default is 20
    circuitBreaker.requestVolumeThreshold
    // How long the breaker stays open before checking again; default is 5000, i.e. 5s
    circuitBreaker.sleepWindowInMilliseconds
    // Error percentage threshold; default is 50%
    circuitBreaker.errorThresholdPercentage

When 50% of 20 requests fail, the breaker opens and calls to the service return failure immediately without calling the remote service. After 5 seconds, the trigger condition is checked again to decide whether to close the breaker or keep it open.

These implementations are at the framework level; we just need to implement the corresponding interfaces! Isn't that easy?


🙌 Service degradation

So, what is service degradation?

There are two scenarios:

  • When a service responds slowly for some reason, it proactively shuts down some of its unimportant features to free server resources and speed up responses.
  • When a downstream service is unavailable for some reason, the upstream proactively invokes local degrade logic to avoid blocking and return to the user quickly!

At first glance, many people still do not get the difference between circuit breaking and degradation!

It should be understood as follows:

  • There are many ways to degrade a service, such as switch downgrades, rate-limit downgrades, and circuit-breaker downgrades!
  • A service circuit breaker is one form of degradation!

Some people will not accept this, feeling that circuit breaking is circuit breaking and degradation is degradation, clearly two different things! Not really: implementation-wise, a circuit breaker and a downgrade go together. When a downstream service becomes unavailable, what reaches the end user at that point is the upstream's degradation logic. Therefore, circuit-breaker downgrade is just one way to downgrade!

Putting frameworks aside, the simplest code illustrates it! The upstream code is as follows:

    try {
        // Invoke the downstream helloWorld service
        xxRpc.helloWorld();
    } catch (Exception e) {
        // The call failed (e.g. rejected by the circuit breaker),
        // so run the local degrade logic
        doSomething();
    }

Watch out: the downstream helloWorld service fails because of the circuit breaker. The upstream service then enters the catch block, so the logic inside catch can be interpreted as the degrade logic!
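
The same catch-as-degrade pattern can be wrapped in a small helper. A sketch, where `callWithFallback` is my own name, not a framework API:

```java
import java.util.function.Supplier;

// Catch-as-degrade wrapper: try the remote call; on any failure
// (including a circuit-breaker rejection) return the local fallback.
class Degrade {
    static <T> T callWithFallback(Supplier<T> remoteCall, Supplier<T> fallback) {
        try {
            return remoteCall.get();    // e.g. xxRpc.helloWorld()
        } catch (Exception e) {
            return fallback.get();      // degrade logic: default/cached value
        }
    }
}
```

The fallback supplier is where you would return a cached value, a default, or a friendly "try again later" message.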

What, you tell me you don't catch exceptions, you just throw the page?

Then I'll give up!

Service degradation is mostly handled at the business level. Here is another kind of downgrade: the switch downgrade! This is another common way we downgrade in production!

The approach is very simple: make a switch and put it in the configuration center (a data dictionary also works)! Flip switches in the configuration center to decide which services are degraded. How an application detects the change after the configuration is modified is addressed briefly at the end of this article.
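
As a sketch of the switch idea, assume an in-memory map stands in for the configuration center; all the names here (`recommendation.enabled`, `placeOrder`) are made up for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Switch downgrade: a non-core step is guarded by a flag that would live
// in the configuration center; flipping the flag skips the step.
class DowngradeSwitch {
    // Stand-in for a configuration center / data dictionary.
    static final Map<String, Boolean> config = new ConcurrentHashMap<>();

    static String placeOrder() {
        StringBuilder result = new StringBuilder("order-created");
        // Buried point: the non-core step runs only while its switch is on.
        if (config.getOrDefault("recommendation.enabled", true)) {
            result.append("+recommendations");
        }
        return result.toString();
    }
}
```

When operations flips `recommendation.enabled` off in the configuration center, orders still go through but the recommendation step is silently skipped.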

Placing these switch checks inside the application also has an industry name: burying points!


Next comes the most critical question: which business logic needs buried points?

  1. Simplify the execution process

Sort out your core and non-core business processes. Then put switches on the non-core processes; when it becomes clear the system cannot cope, turn off the switches and cut those secondary processes short.

  2. Turn off secondary functions

A microservice surely has many functions, so distinguish the primary from the secondary ones. Then add switches to the secondary functions, and when you need to degrade, turn them off!

  3. Reduce consistency

Suppose your business's execution process cannot be simplified and there is no secondary function to turn off. Then the only option is to reduce consistency: turn core business processes from synchronous to asynchronous, from strong consistency to eventual consistency!
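
Reducing consistency usually means turning a synchronous step into an asynchronous one. A minimal sketch, where an in-memory queue stands in for a real message broker and the "award points" step is an invented example:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Reduced consistency: instead of awarding points synchronously
// (strong consistency), enqueue a task for a background worker to
// process later (eventual consistency).
class EventualConsistency {
    static final BlockingQueue<String> pointsTasks = new LinkedBlockingQueue<>();

    // The core flow returns immediately; the side effect is deferred.
    static String placeOrder(String orderId) {
        pointsTasks.offer("award-points:" + orderId);
        return "ok:" + orderId;
    }

    // A background worker would call this in a loop.
    static String processOne() {
        return pointsTasks.poll();   // null if nothing is queued yet
    }
}
```

The user gets their order confirmation immediately; the points show up a little later, which is exactly the strong-to-eventual consistency trade.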

But these are manual downgrades, is there a way to automatically downgrade?

Generally, scenarios that need degradation are predictable, such as certain promotional events. If a real emergency with abnormal traffic does occur, the monitoring system sends an email notification reminding us to downgrade.

But that does not mean automatic downgrading cannot be done. I have no production experience with it, only theory; if I had to build automatic downgrading, I would:

  1. Set a threshold yourself, such as a number of failures within a few seconds, to trigger degradation.

  2. Do your own interface monitoring (look at RxJava if interested) and push the change when the threshold is reached. How to push it? If your configuration lives in Git, use JGit to change the configuration center's config; if it lives in a database, use JDBC.

  3. After the configuration center's config changes, the application detects the change automatically and degrades! (Hot refresh from the configuration center.)

  4. If you also handle service discovery, health checks, and exception tracking, you can go further: use Consul for service discovery, and let exception tracking adjust the weights of service nodes to shift traffic. (The threshold parameters are hard to tune and need testing against your business; not my problem, haha!) First cut traffic to the failing node; if that is not enough, circuit-break and degrade. Then, every so often, let a little traffic back in, and if there are no problems slowly add more. That should achieve fully automatic downgrading.
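
The threshold trigger in step 1 can be sketched with a simple time-window failure counter. All names are illustrative, not production code; in step 2 you would push a config change where the comment says so:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Automatic downgrade trigger: if at least `maxFailures` failures occur
// within `windowMillis`, flip the degrade flag.
class AutoDowngrade {
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private final int maxFailures;
    private final long windowMillis;
    private boolean degraded = false;

    AutoDowngrade(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    synchronized void recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Evict failures that have fallen out of the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= maxFailures) {
            degraded = true;   // here you would push the config-center change
        }
    }

    synchronized boolean isDegraded() { return degraded; }
}
```

Timestamps are passed in explicitly so the trigger is easy to test; in real code you would use the system clock.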


πŸ‘ small problems

How does the application monitor the configuration after the configuration changes?

  • The first is long polling: the client keeps asking, "has the configuration changed? Has it changed?" It is simple and efficient enough.

  • The second approach is Spring Cloud Bus; it is slower and relies on a message queue (this blogger will slowly update the project using RabbitMQ).
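
As a sketch of the polling idea: the `ConfigSource` interface below is made up for illustration, and a real long poll would have the server hold the request open until the configuration actually changes, rather than returning immediately.

```java
// Polling for config changes: the client repeatedly compares a version
// number against the config source and reloads only when it moves.
interface ConfigSource {
    long version();   // monotonically increasing change counter
    String fetch();   // current configuration payload
}

class ConfigPoller {
    private long knownVersion = -1;
    private String current = null;

    // One poll round: reload only if the version changed.
    boolean pollOnce(ConfigSource source) {
        long v = source.version();
        if (v != knownVersion) {
            current = source.fetch();   // reload changed configuration
            knownVersion = v;
            return true;
        }
        return false;                   // unchanged, ask again later
    }

    String current() { return current; }
}
```

This version check is what makes repeated asking cheap: the payload is fetched only when something actually changed.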


🎉 Finally

  • For more references, see here: Chen Yongjia's blog

  • If you like this blogger's posts, give a follow and a thumbs-up; updates will keep coming, hehe!