
Introduction

In a distributed microservice scenario, business services are split apart vertically and deployed as clusters to ensure reliability. The resulting explosive growth in the number of application service nodes increases the probability of failure. As discussed in the previous article, the unavailability of a single application node can ultimately drag down the performance of the entire platform, so we need some means of handling such anomalies. Hystrix is a basic component for fault-tolerant processing in microservices. This article analyzes the design of the fault tolerance component Hystrix, and I hope it is helpful to everyone.

What is Hystrix?

With the popularity of microservice architecture, a service node in a distributed scenario is often both a dependency of upstream business and a caller of downstream business, and the dependency relationships between services form our concrete business processing flows. In a real production environment, many of those dependencies may become slow or unresponsive, whether because of code defects or resource exhaustion. Hystrix is a basic service fault tolerance component provided by Netflix. Introducing Hystrix adds latency tolerance and fault tolerance logic to an existing application, improving the service governance capability of the entire microservice architecture. Hystrix improves the overall resilience and stability of the system by isolating access points between services, preventing cascading failures across services, and providing fallback (degradation) mechanisms.

What problems does Hystrix solve?

In a complex distributed architecture, an application depends on a large number of cluster nodes and downstream services. Handling faults in time is an important measure of the stability and reliability of a microservice architecture. So what problems can Hystrix solve for us?

1. Prevent a single problematic call from exhausting all of the system's threads, limiting thread resource consumption;

2. Prevent the propagation of failures between distributed services;

3. Fail fast instead of letting requests queue up;

4. Roll back on errors and degrade services gracefully (a minimal example follows this list).
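
To make these capabilities concrete, here is a minimal sketch of wrapping a remote call in a HystrixCommand with a fallback, assuming a Hystrix 1.5.x dependency (hystrix-core); the group name, the user-service call, and the fallback value are hypothetical placeholders.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a (hypothetical) downstream user service.
public class UserServiceCommand extends HystrixCommand<String> {

    private final long userId;

    public UserServiceCommand(long userId) {
        // Commands in the same group share a thread pool by default.
        super(HystrixCommandGroupKey.Factory.asKey("UserServiceGroup"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // The actual remote call; it may time out or throw.
        return callRemoteUserService(userId);
    }

    @Override
    protected String getFallback() {
        // Degraded ("bottom-pocket") data returned when run() fails,
        // times out, is rejected, or the circuit breaker is open.
        return "default-user";
    }

    private String callRemoteUserService(long userId) {
        // Placeholder for an HTTP/RPC call to the downstream service.
        throw new RuntimeException("remote service unavailable");
    }

    public static void main(String[] args) {
        // Synchronous execution; prints the fallback here because run() throws.
        System.out.println(new UserServiceCommand(42L).execute());
    }
}
```

If run() throws, times out, or is rejected, Hystrix records the failure in its statistics and returns the result of getFallback() instead of propagating the exception to the caller.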

Hystrix principle analysis

1. Problem analysis

As shown in the figure below, application requests fan out to the back-end service clusters. If the individual service nodes are healthy, each node responds, the corresponding dependent services work, and everything looks fine.

However, reality can be harsh: a dependent service node may fail, for example when the service is stuck in a full GC and cannot respond to external requests, so user requests are blocked at that point. As shown in the figure below, a user request invokes services A, H, I and P to complete a certain business process. At this moment service I, or one node of the service I cluster, is abnormal. Although the connection is still maintained, every business request sent to it times out, the caller's worker threads are blocked, and the caller's resources remain occupied.

Under heavy traffic, the abnormality of a few back-end nodes can make the whole service provider unavailable. As shown in the following figure, the dependent service I is unavailable and requests cannot return normally, so the caller keeps retrying, which further increases the traffic; eventually the caller's thread resources are exhausted and the caller itself becomes unavailable. The caller may in turn be a provider for upstream services, so as request resources are continuously occupied, the upstream applications that call it synchronously are affected as well, and the fault eventually spreads across the whole platform.

2. Problem solving

Since the abnormality of a single node in a microservice architecture can make the entire platform unavailable, what is a good way to solve this problem? If the faulty node is like the first patient in a virus outbreak, then as long as it is found and isolated in time so that the abnormal node cannot spread its influence further, the problems caused by abnormal dependency invocation between services can be contained. Based on this analysis, we hope Hystrix implements the following core logic:

Resource isolation: limit the resources used when invoking a dependent service, so that when a downstream service fails, the entire service invocation chain is not affected.

Circuit breaking: when the failure rate reaches a threshold, the circuit breaker is triggered automatically. Once it opens, the original request path is cut off and requests no longer reach the service provider.

Service degradation: when the circuit is opened by an exception such as a timeout or insufficient resources, a preset fallback is invoked to return bottom-pocket (default) data, improving the fault tolerance of the platform.
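
As a rough illustration of how these three concerns map onto Hystrix configuration, the sketch below tunes the circuit breaker thresholds and the isolation thread pool for a single command, again assuming Hystrix 1.5.x; the group name and all numeric values are illustrative assumptions, not recommendations (defaults exist for every property).

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class OrderServiceCommand extends HystrixCommand<String> {

    public OrderServiceCommand() {
        super(Setter
            // Resource isolation: commands in this group share one bounded thread pool.
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("OrderServiceGroup"))
            .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                .withCoreSize(10)          // at most 10 concurrent calls
                .withMaxQueueSize(5))      // small queue; overflow is rejected quickly
            // Circuit breaking: open when the rolling window has seen at least 20
            // requests and 50% or more of them failed; stay open for 5 seconds.
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                .withCircuitBreakerRequestVolumeThreshold(20)
                .withCircuitBreakerErrorThresholdPercentage(50)
                .withCircuitBreakerSleepWindowInMilliseconds(5000)
                .withExecutionTimeoutInMilliseconds(1000)));
    }

    @Override
    protected String run() throws Exception {
        return "order-data";           // the real downstream call goes here
    }

    @Override
    protected String getFallback() {
        // Service degradation: bottom-pocket data when the call fails,
        // times out, is rejected, or the circuit is open.
        return "cached-order-data";
    }
}
```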

3. Implementation principle

(1) Business process

To implement circuit-breaker protection against exceptions, we first need a circuit breaker, which acts as a switch on the call flow. The general logic is shown in the figure below. When a service is invoked, Hystrix first checks whether the circuit breaker is open; if it is, the call is degraded immediately. If not, it checks whether the semaphore or thread pool rejects the request; if so, the call is also degraded. Otherwise the normal call is executed: on success the monitoring data and health status are reported, while on failure or timeout the degradation logic is executed.
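
The decision chain described above can be sketched roughly as follows. This is illustrative pseudocode of the flow, not Hystrix's actual implementation; the class name and the fixed semaphore size are made up for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Simplified sketch of the per-call decision chain (not Hystrix source code).
public class ProtectedCall {

    private volatile boolean circuitOpen = false;           // set by the breaker logic
    private final Semaphore isolation = new Semaphore(10);  // bounds concurrent calls

    public <R> R execute(Callable<R> normalCall, Supplier<R> fallback) {
        if (circuitOpen) {
            return fallback.get();          // short-circuit: degrade immediately
        }
        if (!isolation.tryAcquire()) {      // semaphore / thread-pool style rejection
            return fallback.get();
        }
        try {
            R result = normalCall.call();   // execute the real dependency call
            // on success: report monitoring data / health status here
            return result;
        } catch (Exception failureOrTimeout) {
            // on failure or timeout: record it for breaker statistics, then degrade
            return fallback.get();
        } finally {
            isolation.release();
        }
    }
}
```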

(2) Circuit breaker principle analysis

A circuit breaker acts like the fuse described in the previous article on microservice architecture and protects the system: when the number of abnormal calls to a downstream service reaches the threshold, the breaker opens and triggers the circuit-breaking operation to prevent traffic from piling up.

The design of the circuit breaker involves two main points: one is its control logic (when to open and close the breaker switch), and the other is the threshold judgment and data statistics that trigger it (the statistics serve as the basis for the breaker's decisions). On the surface, the circuit breaker is just a switch that decides whether to run the degradation logic; the real core is how to judge when it should open and when it should close.

Circuit breaker control logic

The state transition of a circuit breaker is similar to a registry deciding whether a service is online, in that both involve a state change, but the circuit breaker has an additional half-open state, because there must be a process for judging whether the downstream service has recovered. The process is shown in the figure above. The breaker starts closed and opens once the number of request failures reaches the threshold. After it opens, there must be a way to detect recovery so that the breaker can close again. However, detection does not start immediately after opening; it waits for a window period, otherwise immediate retries are bound to fail again. This window period effectively gives the downstream service time to recover.

When the window period has passed, some requests are let through to perform normal downstream calls as probes: if they succeed the breaker closes, and if not it stays open. Of course, for the probing part, a single success or failure does not necessarily change the breaker's current state; the corresponding state-change strategy can be configured here.
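
A minimal sketch of the three-state transition (closed, open, half-open) might look like the following. It is a conceptual illustration with deliberately simplified probing, not Hystrix's internal circuit breaker; the threshold and window values are supplied by the caller.

```java
// Conceptual sketch of a closed -> open -> half-open -> closed circuit breaker.
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // consecutive failures that open the breaker
    private final long sleepWindowMillis;   // recovery window before probing
    private int failureCount;
    private State state = State.CLOSED;
    private long openedAt;

    public SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    // Called before each request: may move OPEN -> HALF_OPEN once the window has passed.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN
                && System.currentTimeMillis() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN;        // let probe requests through
            return true;
        }
        return state != State.OPEN;
    }

    // A probe or normal call succeeded: close the breaker and reset statistics.
    public synchronized void recordSuccess() {
        state = State.CLOSED;
        failureCount = 0;
    }

    // Failure: re-open a half-open breaker, or open a closed one at the threshold.
    public synchronized void recordFailure() {
        if (state == State.HALF_OPEN || ++failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```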

Statistical threshold

The threshold and the statistics behind it are the basis for deciding whether to open the switch, so how the data is collected is critical to the design. If the threshold statistics are not accurate and timely, the circuit breaker cannot work properly. If it is too sensitive, occasional call anomalies (such as network jitter) will open the breaker and seriously disrupt normal business; if it is too sluggish and fails to open when it should, the anomaly will spread across the platform.

Since we cannot judge by the success or failure of a single call, we can lengthen the statistical period to span several intervals. Meanwhile, to keep the judgment timely, the statistical period must be updated continuously. As shown in the figure above, the statistical period is 0-7 at first and becomes 1-8 after one time step: the interval length stays the same, but the start and end times are updated in real time, like a sliding window that keeps moving forward as time passes, ensuring the timeliness of the statistics.
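
The sliding window can be sketched as a ring of per-interval buckets, as below. This is a conceptual illustration (the bucket count and interval length are arbitrary assumptions), not the rolling-window implementation Hystrix actually uses.

```java
// Conceptual sketch of sliding-window failure statistics using a ring of time buckets.
public class SlidingWindowMetrics {

    private static final int BUCKETS = 8;            // e.g. 8 buckets of one second each
    private static final long BUCKET_MILLIS = 1000L;

    private final int[] successes = new int[BUCKETS];
    private final int[] failures = new int[BUCKETS];
    private final long[] bucketEpoch = new long[BUCKETS];  // which interval a slot holds

    public synchronized void recordSuccess() {
        successes[currentSlot()]++;
    }

    public synchronized void recordFailure() {
        failures[currentSlot()]++;
    }

    // Error percentage over the whole window, used for the breaker threshold check.
    public synchronized int errorPercentage() {
        long epoch = System.currentTimeMillis() / BUCKET_MILLIS;
        int ok = 0, bad = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (epoch - bucketEpoch[i] < BUCKETS) {   // skip buckets older than the window
                ok += successes[i];
                bad += failures[i];
            }
        }
        int total = ok + bad;
        return total == 0 ? 0 : bad * 100 / total;
    }

    // Maps "now" to a slot in the ring; a stale slot is reset before being reused,
    // so the window always covers the most recent BUCKETS * BUCKET_MILLIS span.
    private int currentSlot() {
        long epoch = System.currentTimeMillis() / BUCKET_MILLIS;
        int slot = (int) (epoch % BUCKETS);
        if (bucketEpoch[slot] != epoch) {
            successes[slot] = 0;
            failures[slot] = 0;
            bucketEpoch[slot] = epoch;
        }
        return slot;
    }
}
```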

(3) Isolation design

Hystrix uses isolation to limit the impact that access to an abnormal node has on the platform, much like the bulkheads inside a ship's hull mentioned earlier limit the spread of flooding. Two isolation modes are provided: thread pool isolation and semaphore isolation.

Thread isolation

As shown in the figure above, thread isolation separates the user request thread from the threads that Hystrix uses to execute dependency calls, so if the service provider becomes unavailable, the threads that block are the ones allocated from the Hystrix thread pool rather than the caller's own threads. The component wraps the dependency call logic, and each call command is executed in a separate thread pool, limiting thread resource usage.

Isolating the thread that sends the request from the thread resources that execute it effectively prevents cascading failures. When the thread pool or request queue is saturated, Hystrix rejects the request so that the calling thread fails fast, avoiding the spread of anomalies caused by problems in a dependent service node.
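
The idea can be sketched in plain Java with a small bounded pool per dependency and a fast-fail fallback when that pool is saturated. This is a simplified illustration of the concept, not Hystrix's actual thread pool management; the pool size, queue size, and timeout are arbitrary assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Conceptual sketch of thread pool isolation: each downstream dependency gets its
// own small, bounded pool, so a slow dependency can only block its own threads.
public class IsolatedClient {

    private final ExecutorService dependencyPool = new ThreadPoolExecutor(
            10, 10,                                   // at most 10 concurrent calls
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(5),              // small queue; overflow is rejected
            new ThreadPoolExecutor.AbortPolicy());    // reject instead of queueing forever

    public String call(Callable<String> remoteCall, String fallback) {
        Future<String> future = null;
        try {
            // The dependency call runs on the isolated pool; the caller only waits
            // up to the timeout instead of blocking indefinitely.
            future = dependencyPool.submit(remoteCall);
            return future.get(1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException saturated) {
            return fallback;                          // pool and queue full: fail fast
        } catch (TimeoutException slow) {
            future.cancel(true);                      // free the isolated thread
            return fallback;
        } catch (InterruptedException | ExecutionException failed) {
            return fallback;                          // failed call: degrade
        }
    }
}
```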

Semaphore isolation

We all know that introducing a thread pool carries some resource overhead, because the pool has to schedule its threads, so it is appropriate only when the benefits of the thread pool outweigh that cost. In lighter-weight scenarios, semaphore isolation is more suitable: it avoids the performance overhead of thread context switching. As the first figure in the isolation design shows, with thread pools the thread that sends the request is isolated from the thread that executes the dependent service; with a semaphore, the dependency is executed on the same thread that initiated the request, and the semaphore simply limits the number of concurrent calls to the resource.
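
In Hystrix itself, switching a command to semaphore isolation is a matter of command properties. A rough sketch follows, again assuming Hystrix 1.5.x; the group name, the lookup logic, and the concurrency limit of 20 are arbitrary assumptions.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;

// Command configured for semaphore isolation: run() executes on the calling thread,
// and at most 20 callers may be inside it concurrently; extra callers are rejected
// immediately and fall back.
public class ConfigLookupCommand extends HystrixCommand<String> {

    public ConfigLookupCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ConfigServiceGroup"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                .withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
                .withExecutionIsolationSemaphoreMaxConcurrentRequests(20)));
    }

    @Override
    protected String run() throws Exception {
        return "config-value";     // a fast, in-process or local-cache style lookup
    }

    @Override
    protected String getFallback() {
        return "default-config";   // returned when the semaphore is exhausted or run() fails
    }
}
```

Semaphore isolation fits fast, in-process style dependencies where spawning a separate thread per call would cost more than it protects.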

Conclusion

This article analyzed the background problems of service fault tolerance and degradation in a microservice architecture, and described the design of the Hystrix component's fault tolerance, degradation, and circuit breaking. I believe you now have a deeper understanding of service fault tolerance. In the following articles, the author will explain the specific application of Hystrix components in microservice development, so please stay tuned.


I’m Mufeng. Thanks for your likes, favorites and comments. I’ll see you next time!

A true master always has the heart of an apprentice

WeChat search: Mufeng Technical Notes. Quality articles are updated continuously, and there is a study check-in group you can be added to so we can aim for offers from big companies together; plenty of learning and interview materials are also available for you.