Take ByteDance’s elegant retry mechanism as an example

In a microservice architecture, a large system is split into many small services, and the large number of RPC calls between them can fail due to network jitter. In such cases, a retry mechanism improves the final success rate of requests, reduces the impact of faults, and makes the system run more stably.

Risk of retry

Retry can improve service stability, yet in most cases engineers are reluctant to use it, or do not dare to, mainly because retry amplifies the risk of failure.

First, retry increases the load on the direct downstream. As shown in the figure below, suppose service A calls service B with the retry count set to r (including the first request). When B is under high load, calls are likely to fail, so A retries; each retry adds more load to B, whose load keeps rising until B may go down entirely.

To make things worse, retry also has a link-level amplification effect, as illustrated in the following figure:

Assume Backend A calls Backend B, Backend B calls DB Frontend, and the retry count at each layer is set to 3. If Backend B's requests to DB Frontend fail all three times, Backend B returns a failure to Backend A. Backend A then retries Backend B three times, and each of those attempts causes Backend B to request DB Frontend three more times, so DB Frontend ends up being requested nine times. The growth is exponential: if the normal traffic is N, the chain has m layers, and each layer retries r times, the last layer receives at most N * r^(m-1) requests. This exponential amplification can be terrifying, knocking down multiple layers of the chain and bringing the whole system down.
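
To make the arithmetic concrete, here is a tiny self-contained sketch that reproduces the N * r^(m-1) worst case; the function name worstCaseCalls is ours for illustration and is not part of any ByteDance component.

```go
package main

import "fmt"

// worstCaseCalls returns the maximum number of requests layer `layer` receives
// (1-indexed, layer 1 is the entry point) when every layer retries r times
// (including the first attempt) and the normal traffic is n.
func worstCaseCalls(n, r, layer int) int {
	calls := n
	for i := 1; i < layer; i++ {
		calls *= r // each upper layer multiplies the traffic by r
	}
	return calls
}

func main() {
	// With r = 3 and normal traffic of 1 request, the 3rd layer (DB Frontend)
	// can receive up to 1 * 3^2 = 9 requests, matching the example above.
	for layer := 1; layer <= 3; layer++ {
		fmt.Printf("layer %d: up to %d requests\n", layer, worstCaseCalls(1, 3, layer))
	}
}
```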

Usage cost of retry

The cost of using retry is also high. Before ByteDance's internal governance framework and service platform supported retry, a business that truly needed retries (for example, calls to certain third-party services that often fail) would typically write a simple for loop, with little thought given to the amplification effect. This was unsafe: the company saw several incidents caused by retries, and recovery was slow because engineers had to modify code and redeploy just to turn the retries off.

Some businesses instead used open-source retry components. These components usually protect the direct downstream, but they do not consider link-level retry amplification. They also require modifying the RPC calling code, which is fairly intrusive, and their configuration is static, so any change requires a redeploy.

Against this background, to let business teams use retry flexibly and safely, our ByteDance Live Center team designed and implemented a retry governance component with the following advantages:

  1. Retry storm protection at link level.

  2. Ease of use and low service access costs.

  3. Flexibility and the ability to adjust configurations dynamically.

The following describes the specific implementation scheme.

Dynamic configuration for retry governance

The first problem to solve is how to make access easy for business teams. A plain component library would still be intrusive to user code and hard to adjust dynamically.

ByteDance's Golang development framework supports Middleware: multiple custom Middleware can be registered and are invoked in sequence, nested one inside the next. Middleware is usually used for non-business logic such as logging and reporting metrics, effectively decoupling business and non-business code. So we decided to implement retry as a Middleware: the Middleware repeats the RPC call internally, and the retry configuration lives in ByteDance's distributed configuration center. The Middleware reads the configuration center and performs the retries, so users do not need to modify the code that makes the RPC calls; they only need to add one global Middleware to the service.
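
As an illustration of the approach, here is a minimal sketch of what such a retry Middleware could look like. The Endpoint/Middleware signatures, the RetryConfig shape, and the ConfigSource interface are assumptions made for the example; they do not reflect ByteDance's actual framework or config-center API.

```go
package retrymw

import (
	"context"
	"time"
)

// Endpoint mirrors a common Go middleware-style RPC call signature; the actual
// framework types at ByteDance differ, so treat this only as an illustration.
type Endpoint func(ctx context.Context, req, resp interface{}) error

type Middleware func(Endpoint) Endpoint

// RetryConfig is a hypothetical shape for the rules stored in the config center.
type RetryConfig struct {
	Enabled  bool
	MaxTimes int           // total attempts, including the first request
	Backoff  time.Duration // fixed backoff between attempts, for simplicity
}

// ConfigSource abstracts the distributed configuration store; the real
// component reads from ByteDance's internal configuration center.
type ConfigSource interface {
	Lookup(ctx context.Context) RetryConfig
}

// NewRetryMiddleware wraps an RPC call with retries driven by dynamic config,
// so business code only registers the middleware and never changes call sites.
func NewRetryMiddleware(cfg ConfigSource) Middleware {
	return func(next Endpoint) Endpoint {
		return func(ctx context.Context, req, resp interface{}) error {
			c := cfg.Lookup(ctx) // read the latest config on every call
			if !c.Enabled || c.MaxTimes <= 1 {
				return next(ctx, req, resp)
			}
			var err error
			for attempt := 1; attempt <= c.MaxTimes; attempt++ {
				if err = next(ctx, req, resp); err == nil {
					return nil
				}
				if attempt < c.MaxTimes {
					time.Sleep(c.Backoff) // wait before the next attempt
				}
			}
			return err
		}
	}
}
```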

As shown in the overall architecture diagram below, we provide a configuration web page and backend, so users can easily add or modify retry configuration for an RPC on the service-governance page and have it take effect automatically. The internal implementation is transparent to users, with no intrusion into business code.

The configuration dimension is a tuple derived from ByteDance's RPC call characteristics: caller service, caller cluster, callee service, and callee method. Configuration is keyed on this tuple, and the Middleware encapsulates the logic that reads it, so the configuration is fetched and applied automatically whenever an RPC call is made.
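
For example, the tuple could be modeled roughly like this. The RetryRuleKey and RuleTable names are hypothetical and reuse the RetryConfig type from the sketch above; the real key encoding in the config center may differ.

```go
package retrymw

// RetryRuleKey is a hypothetical representation of the configuration tuple
// described above: (caller service, caller cluster, callee service, callee method).
type RetryRuleKey struct {
	CallerService string
	CallerCluster string
	CalleeService string
	CalleeMethod  string
}

// RuleTable maps a tuple to its retry configuration; the middleware consults
// it on every RPC call so that edits made on the page take effect without a redeploy.
type RuleTable map[RetryRuleKey]RetryConfig

func (t RuleTable) Lookup(k RetryRuleKey) (RetryConfig, bool) {
	cfg, ok := t[k]
	return cfg, ok
}
```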

This Middleware approach makes it easy for businesses to onboard, far more convenient than a typical component library, and once onboarded they get dynamic configuration: retry settings can be adjusted or switched off at any time.

Backoff strategy

After settling on the access mode, we could start implementing the concrete features of the retry component. Besides basic configuration such as the retry count and total delay, a retry component also needs a backoff policy.

For transient errors such as network jitter, retrying immediately may simply fail again. Waiting a short time before retrying generally has a higher success rate, and it also spreads out the retry times of different upstream callers, reducing the instantaneous downstream traffic peak caused by everyone retrying at once. The policy that decides how long to wait before retrying is the backoff policy. We implemented the common backoff policies (a small code sketch follows the list), such as:

  • Linear backoff: wait a fixed amount of time before each retry.

  • Random backoff: wait a random amount of time before each retry.

  • Exponential backoff: on consecutive retries, the waiting time is a multiple of the previous wait.
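
A minimal sketch of how these backoff policies might be expressed in Go; the BackoffFunc type and constructor names are ours, not the component's actual API.

```go
package retrymw

import (
	"math/rand"
	"time"
)

// BackoffFunc returns how long to wait before the attempt-th retry (attempt starts at 1).
type BackoffFunc func(attempt int) time.Duration

// FixedBackoff waits a constant interval between retries (the "linear" policy above).
func FixedBackoff(interval time.Duration) BackoffFunc {
	return func(int) time.Duration { return interval }
}

// RandomBackoff waits a random duration in [0, max) to spread out retries from
// many callers and avoid synchronized traffic spikes downstream (max must be > 0).
func RandomBackoff(max time.Duration) BackoffFunc {
	return func(int) time.Duration { return time.Duration(rand.Int63n(int64(max))) }
}

// ExponentialBackoff doubles the wait on every consecutive retry, starting from base.
func ExponentialBackoff(base time.Duration) BackoffFunc {
	return func(attempt int) time.Duration { return base << uint(attempt-1) }
}
```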

Preventing retry storms

How to retry safely and prevent retry storms is the biggest problem we faced.

Restricting single-point retries

First, retries must be restricted at the single-node level. A service cannot retry its downstream without limit, or it can easily bring the downstream down. Besides capping the user-configured maximum number of retries, it is more important to limit retries based on the rate of failures relative to successes.

The implementation is simple and based on the idea of a circuit breaker: limit the ratio of failed to successful requests and add circuit-breaking to retry. We use the common sliding-window approach, as shown in the figure below. For each type of RPC call, a sliding window is maintained in memory; for example, the window is divided into 10 buckets, each recording the results (successes and failures) of that RPC within one second. When a new second begins, a new bucket is created and the oldest one is discarded, so only the last 10 seconds of data are kept. When a new request of this RPC type fails, a retry is issued only if the failure/success ratio over those 10 seconds does not exceed the threshold. The default threshold is 0.1, meaning the downstream receives at most 1.1 times the normal QPS. Users can adjust the circuit-breaker switch and threshold as needed.
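
Below is a simplified sketch of such a sliding-window retry circuit breaker, assuming 10 one-second buckets and a failure/success threshold. The article does not show the real data structures or edge-case handling, so treat this only as an illustration.

```go
package retrymw

import (
	"sync"
	"time"
)

// retryBreaker keeps a 10-second sliding window of results for one RPC type and
// allows retries only while failures/successes stays below the threshold (default 0.1).
type retryBreaker struct {
	mu        sync.Mutex
	buckets   [10]struct{ success, failure int64 } // one bucket per second
	lastTick  int64                                // unix second of the most recent bucket
	threshold float64
}

func newRetryBreaker(threshold float64) *retryBreaker {
	return &retryBreaker{lastTick: time.Now().Unix(), threshold: threshold}
}

// advance rotates the window so that the current second maps to a fresh bucket.
func (b *retryBreaker) advance(now int64) {
	for ; b.lastTick < now; b.lastTick++ {
		b.buckets[(b.lastTick+1)%10] = struct{ success, failure int64 }{}
	}
}

// Record adds the result of one RPC attempt to the current bucket.
func (b *retryBreaker) Record(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now().Unix()
	b.advance(now)
	if success {
		b.buckets[now%10].success++
	} else {
		b.buckets[now%10].failure++
	}
}

// AllowRetry reports whether the failure/success ratio over the last 10 seconds
// is still under the threshold, capping extra retry traffic at roughly 1.1x QPS.
func (b *retryBreaker) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.advance(time.Now().Unix())
	var succ, fail int64
	for _, bk := range b.buckets {
		succ += bk.success
		fail += bk.failure
	}
	if succ == 0 {
		return fail == 0 // no successes yet: only allow retries before any failure is seen
	}
	return float64(fail)/float64(succ) < b.threshold
}
```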

Restricting link-level retries

As noted earlier, if each layer of a multi-level call chain is configured to retry, the amplification can become exponential. With the retry circuit breaker in place, retries no longer grow exponentially (each node's amplification is capped at 1.1x), but the number of calls still grows with the depth of the chain, so retries also need to be made safe at the link level.

The key to preventing retry storms at the link level is to stop every layer from retrying; ideally, retries happen only at the lowest layer. The Google SRE book describes how Google uses special error codes internally to achieve this:

  • Agree on a special status code that means: the call failed, but do not retry.

  • When retries fail at any layer, that layer generates this status code and returns it to the layer above.

  • When the upper layer receives this status code, it stops retrying the downstream and passes the code further up.

In this mode, only the lowest layer retries, and upper layers stop retrying once they receive the error code, so the overall link amplification is r (the single-layer retry count). However, this strategy depends on the business code propagating the error code, which is somewhat intrusive; business code varies widely and RPC calling styles differ, so adopting it would require many changes, and the scheme can break if, for example, a service forgets to pass the error code from its downstream up to its upstream.

Fortunately, the RPC protocol used inside ByteDance has extension fields, so after much experimentation we encapsulated the error-code handling and propagation logic in the Middleware. A nomore_retry flag carried in the RPC Response extension tells the upstream to stop retrying. The Middleware manages the entire life cycle of this error code: generation, recognition, and propagation, so businesses do not need to modify their own RPC logic, and the error-code scheme stays transparent to them.

When every layer of the chain has the retry component plugged in, each layer recognizes this flag, stops retrying, and passes the flag on to the layer above. This protects the chain as a whole and achieves the effect of "retry only at the layer closest to where the error occurred".
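
The flag handling might look roughly like the following. The nomore_retry key comes from the article, but the respExtensions type and helper names are placeholders for whatever the framework actually exposes.

```go
package retrymw

// respExtensions stands in for the RPC response extension fields mentioned
// above; in the real framework the flag lives in the protocol's extension area.
type respExtensions map[string]string

const noMoreRetryKey = "nomore_retry"

// shouldStopRetry reports whether the downstream marked its response with
// nomore_retry, meaning its own retries already failed and we must not retry.
func shouldStopRetry(ext respExtensions) bool {
	return ext[noMoreRetryKey] == "1"
}

// markNoMoreRetry is set by the middleware once its own retries are exhausted,
// so the caller one level up stops retrying and forwards the flag further up.
func markNoMoreRetry(ext respExtensions) {
	ext[noMoreRetryKey] = "1"
}
```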

Timeout handling

While testing the error-code propagation scheme, we found that timeouts can render it ineffective.

Consider A -> B -> C. Suppose B -> C times out and B retries C. By that time, A -> B has probably also timed out, so A never receives the error code from B and retries B itself. Even though B's retries of C fail and B generates the no-retry error code, the code cannot be delivered to A, so A still retries B. If every layer of the chain times out like this, the exponential amplification effect reappears.

To handle this situation, in addition to passing the no-retry error code upstream, we implemented a "retry requests are not retried" scheme.

A retry flag is added to every retry request. On the chain A -> B -> C, B reads this flag to determine whether the incoming request is itself a retry. If it is, B does not retry C even if the call fails; otherwise, B retries C on failure. B also forwards the retry flag on its own retry requests, so those will not be retried further down either.
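
A sketch of the request-side flag, mirroring the response-side one above; the retry_request key name and reqExtensions type are our placeholders, not the component's real field names.

```go
package retrymw

// reqExtensions stands in for the RPC request extension fields; the actual
// field name and encoding used by the component are internal details.
type reqExtensions map[string]string

const retryFlagKey = "retry_request"

// isRetryRequest tells a server-side middleware that the incoming call is
// already a retry, so it must not retry its own downstream even on failure.
func isRetryRequest(ext reqExtensions) bool {
	return ext[retryFlagKey] == "1"
}

// markRetryRequest is set on the 2nd..Nth attempts before they are sent, and
// the flag is forwarded on any downstream calls made while handling them.
func markRetryRequest(ext reqExtensions) {
	ext[retryFlagKey] = "1"
}
```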

This way, even if A times out and never sees B's response, B knows the request from A is a retry and will not retry C for it. A sends at most r requests, B at most r + r - 1, and C, if there is a further level, at most r + r + r - 2. In general, layer i sends at most i * r - (i - 1) requests, so in the worst case the growth is linear in the chain depth rather than exponential. Combined with the retry circuit breaker, the actual increase is much smaller.

In summary: the retry circuit breaker limits single-point amplification, the error code passed up the chain ensures that only the layer closest to the failure retries, and the retry flag passed down the chain ensures that retry requests are never retried again. Combining these control strategies effectively reduces the amplification effect of retries.

Timeout scenario optimization

In a distributed system, an RPC request has three possible outcomes: success, failure, and timeout. Timeout is the hardest to handle, yet it is often the most common. We analyzed the distribution of RPC errors for some important services in ByteDance's live-streaming business line and found that timeout errors accounted for the largest share.

In timeout retry scenarios, adding the retry flag to retry requests prevents exponential amplification, but it does not improve the success rate. As shown in the figure below, if both A's and B's timeouts are 1000ms and C is under heavy load, B's call to C times out and B retries C; but by then more than 1000ms have passed, A's call to B has also timed out, and A has disconnected from B. Whether B's retry of C succeeds or not, from A's perspective the request has already failed.

The root cause is that timeouts along the chain are configured improperly: the upstream timeout is the same as, or even shorter than, the downstream timeout. In practice, many services configure no RPC timeout at all, so upstream and downstream may end up with the same default. To deal with this, we need a mechanism that improves stability in timeout scenarios and cuts down useless retries.

In the following figure, a normal retry waits for Resp1 (or a timeout) before issuing the second request, so the total time is t1 + t2. Service A may wait a long time after sending Req1, say 1s, even though the request's pct99 or pct999 latency is under 100ms; if a request has already taken more than 100ms, it will very likely end in a timeout anyway. So why wait? Could we retry earlier instead?

With this in mind, we introduced and implemented the Backup Requests scheme. As shown in the figure below, we set a threshold t3 in advance (smaller than the timeout, usually recommended to be the pct99 of the RPC request latency). If no response has arrived t3 after Req1 was sent, we immediately issue a retry request Req2, so two requests are effectively in flight at the same time. We then wait, and as soon as either Resp1 or Resp2 returns a successful result, the request completes. The overall time consumed is t4, the time from sending the first request to receiving the first successful response, which is far shorter than waiting for a full timeout before retrying, so this mechanism greatly reduces overall latency.
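
A minimal sketch of the Backup Requests idea using goroutines and a shared result channel. It omits the circuit-breaker check mentioned below and leaves fast failures to the normal retry path, so it is an illustration of the mechanism rather than the actual implementation.

```go
package retrymw

import (
	"context"
	"time"
)

// call is any RPC invocation; it should honor ctx cancellation.
type call func(ctx context.Context) error

// backupRequest sends the primary request and, if no reply arrives within
// threshold (e.g. the pct99 latency), fires one backup request. The first
// successful reply wins; the other in-flight call is cancelled on return.
func backupRequest(ctx context.Context, threshold time.Duration, do call) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan error, 2) // buffered so late finishers never block
	launch := func() { results <- do(ctx) }

	go launch() // primary request

	timer := time.NewTimer(threshold)
	defer timer.Stop()

	var firstErr error
	pending := 1
	fired := false
	for pending > 0 {
		select {
		case <-timer.C:
			if !fired {
				fired = true
				pending++
				go launch() // fire the backup request after the threshold
			}
		case err := <-results:
			pending--
			if err == nil {
				return nil // first success ends the whole request
			}
			firstErr = err // a fast failure falls back to the normal retry path
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return firstErr
}
```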

In essence, Backup Requests trade extra traffic for a higher success rate (or lower latency). Of course, we control how much the traffic can grow: the first request's failure is recorded, and before issuing the backup we check whether the current failure rate exceeds the circuit-breaker threshold, so the overall request rate remains under control.

Combining with DDL

The Backup Requests approach shortens overall request latency and eliminates some invalid requests, but not every business scenario is suitable for it, so we also combine retries with DDL to suppress invalid retries.

DDL is short for "Deadline Request call-chain timeout". Just as the TTL in TCP/IP decides whether a packet has been in the network too long and should be dropped, DDL is a timeout for the whole call chain and is used to decide whether the current RPC request still needs to proceed. As shown in the figure below, ByteDance's infrastructure team has implemented DDL: the RPC call chain carries a remaining-time budget, and each layer subtracts its own processing time before passing it on. If the remaining time is less than or equal to zero, the layer can return failure directly without calling the downstream.

The DDL approach effectively reduces invalid calls to the downstream. We also integrated DDL into retry governance: before each retry is issued, the component checks whether the remaining DDL budget is greater than zero, and if not, the retry is skipped because there is no point in calling the downstream. This minimizes useless retries.
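
A sketch of what the pre-retry DDL check could look like. Here the remaining budget is approximated with the Go context deadline, whereas the real DDL value is carried in the RPC call chain itself, so the helper names and approach are assumptions.

```go
package retrymw

import (
	"context"
	"errors"
	"time"
)

// errDeadlineExhausted is returned when the remaining link-level budget is gone.
var errDeadlineExhausted = errors.New("ddl: no time budget left, skip retry")

// remainingBudget is a stand-in for reading the DDL value carried on the call
// chain; for illustration we approximate it with the context deadline.
func remainingBudget(ctx context.Context) time.Duration {
	deadline, ok := ctx.Deadline()
	if !ok {
		return time.Hour // no deadline set: treat the budget as effectively unlimited
	}
	return time.Until(deadline)
}

// retryAllowedByDDL is checked before every retry attempt: if the remaining
// budget (minus the expected cost of another attempt) is not positive, the
// retry is skipped because the caller would have given up anyway.
func retryAllowedByDDL(ctx context.Context, expectedCost time.Duration) error {
	if remainingBudget(ctx)-expectedCost <= 0 {
		return errDeadlineExhausted
	}
	return nil
}
```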

Actual link amplification effect

The exponential link amplification discussed above is an idealized analysis; the real situation is far more complicated, because many factors come into play:

  • Retry circuit breaking: limits single-point retry amplification.

  • Error flag passed up the chain: once a lower layer's retries fail, the upper layer does not retry again.

  • Retry flag passed down the chain: retry requests carry a special flag, and the lower layer does not retry a retry request.

  • DDL: no retry request is issued when the remaining time budget is insufficient.

  • Framework circuit breaking: the microservice framework's own circuit breaking and overload protection also affect the retry behavior.

With all of these factors combined, the actual amplification cannot be expressed by a simple formula. We built a multi-layer call chain and tested it online, recording the actual amplification under different error types and error rates with the retry governance component enabled. The results show that the component keeps link-level retry amplification under control and significantly reduces the chance of retries causing a system avalanche.

Conclusion

As described above, we built the retry governance feature around the idea of service governance. It supports dynamic configuration, requires essentially no intrusion into business code, and combines multiple strategies to control the retry amplification effect at the link level, balancing ease of use, flexibility, and safety. Inside ByteDance it has been adopted by many services, including live-streaming access and online authentication, and has clearly improved the stability of those services. The scheme has been verified and promoted in ByteDance Live and other businesses, and will serve more ByteDance businesses in the future.