Staying up under traffic surges: service traffic limiting architecture explained

Today we'll explore another common design in distributed system architectures: service traffic limiting. So what is "service traffic limiting"?

Before explaining service traffic limiting, let's look at a joke that went viral online a while back. An engineer at Sina Weibo was at his own wedding when he got a call from the company asking him to deal urgently with a surge in production traffic. Apparently a pop star had suddenly announced a relationship on Weibo that day, and traffic jumped to several times its usual level, which made the system unstable and gave users trouble accessing it. The engineer had to leave the bride on her own, open his laptop in his wedding suit, and debug code at the wedding.

At that moment the engineer must have been devastated: why announce the relationship today, of all days? If only I had put system scaling and service traffic limiting in place earlier.

After reading that story, the purpose of service traffic limiting should be fairly clear:

Service traffic limiting is a way of restricting a system's traffic or functionality according to preset rules when its resources are insufficient for a large volume of requests, that is, when available resources and access volume are in conflict, so that the limited resources can keep providing normal service.

I. Why do we need to design for service traffic limiting

An example from everyday life: popular tourist attractions often place strict limits on the number of visitors per day. Places such as the Forbidden City in Beijing or Happy Valley sell only a fixed number of tickets each day. If you arrive late, the day's tickets may already be sold out and you cannot get in that day, and even when you do get in, the queues can be long enough to make you question your life choices.

Why do tourist attractions have such restrictions? Wouldn’t it be better to sell more tickets and make more money?

In fact, the scenic spots have little choice, because their service resources are limited and so is the number of visitors they can serve each day. If the restrictions were lifted, there would not be enough staff, sanitation could not be guaranteed, safety hazards would arise, and the extremely dense crowds would seriously hurt the visitor experience. Yet because the spot is so popular, tourists arrive in an endless stream, far beyond its carrying capacity, so it has to cap the daily number of visitors.

The same is true for system services in the IT software industry.

If your system can in theory serve 1 million users per unit of time, but 3 million users suddenly arrive today, then, because user traffic is random, without traffic limiting those 3 million users are very likely to overwhelm the system in an instant, leaving no one with service.

Therefore, to ensure that the system can still serve at least 1 million users normally, we need to design traffic limiting into it.

One might wonder: given that 3 million users are going to visit, why isn't the system designed to support that many users?

That's a good question. If the system had 3 million users over the long term, such an upgrade would certainly be needed. More often, though, daily traffic is around 1 million users, with occasional short surges that arise for unpredictable reasons. In that case, to save cost, a company usually does not scale the system up to its maximum size for an unusual spike.

II. How to limit service traffic

1. Traffic limiting modes

Traffic limiting for system services can be performed in the following modes:

Circuit breaking

This mode requires circuit breaking to be considered from the start of system design. If a fault occurs and cannot be fixed within a short time, the system automatically detects it and trips the circuit breaker, rejecting traffic so that heavy traffic does not overload the back end. The system should also monitor the back end's recovery dynamically, and once the back end is stable again, reset the breaker to resume normal service.
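
The trip-and-recover behaviour described above can be sketched roughly as follows. This is a minimal illustration under assumed names, not how any particular library implements it: `CircuitBreaker`, `failure_threshold`, `recovery_timeout` and the injectable `clock` are all hypothetical, and a single trial call after the cooldown stands in for "monitoring the repair status".

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: trips after repeated failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means traffic is allowed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: request rejected")
            # Cooldown elapsed: let this one call through as a health probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            raise
        # Success: the back end looks healthy, so reset and resume normal service.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is tripped, callers get an immediate `RuntimeError` instead of waiting on a failing back end, which is what keeps the overload from spreading.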

Service degradation

All functions and services of the system are classified by importance. When the system runs into trouble and needs emergency traffic limiting, less important functions can be degraded or stopped, freeing more resources for core functions.

On an e-commerce platform, for example, if traffic suddenly surges, non-core functions such as product reviews and reward points can be temporarily degraded: these services are stopped to free machines and CPU resources so that order placement keeps working. Once the whole system is back to normal, the degraded services can be started again and the missed operations backfilled or compensated. Besides degrading features, another temporary scheme is to stop operating on the database directly and instead serve all reads from the cache and send all writes to the cache.
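
As a rough sketch of the idea, the switch between normal and degraded operation might look like the following. Everything here is an assumption made for illustration: the `FeatureSwitches` class, the `handle` helper, the feature names and the in-memory `backlog` list that stands in for whatever store would later drive compensation.

```python
class FeatureSwitches:
    """Tracks which non-core features are currently degraded (switched off)."""

    def __init__(self, non_core):
        self.non_core = set(non_core)
        self.degraded = set()

    def degrade_non_core(self):
        # Emergency mode: switch off everything that is not core.
        self.degraded = set(self.non_core)

    def restore(self):
        # Back to normal: all features are served again.
        self.degraded.clear()

    def is_enabled(self, feature):
        return feature not in self.degraded


def handle(switches, feature, handler, backlog):
    """Run the handler, or record the request for later compensation if degraded."""
    if switches.is_enabled(feature):
        return handler()
    backlog.append(feature)
    return "temporarily unavailable"
```

After `restore()` is called, the entries collected in `backlog` are what a compensation job would replay.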

Delayed processing

This mode requires a traffic buffer pool at the front of the system; all requests are buffered there instead of being processed immediately. The real business handlers at the back end then pull requests out of the pool and process them in turn, usually with a queue. Handling requests asynchronously in this way reduces the pressure on the back end. However, when traffic is heavy and back-end capacity is limited, requests in the buffer pool may not be processed promptly, so some delay results.
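
A minimal sketch of such a buffer pool, using Python's standard `queue.Queue` and a single worker thread, is shown below. The helper names `submit` and `start_worker`, the `None` stop sentinel and the choice to reject requests when the pool is full are assumptions made for the example.

```python
import queue
import threading


def submit(buffer, request):
    """Buffer a request for later processing; return False when the pool is full."""
    try:
        buffer.put_nowait(request)
        return True
    except queue.Full:
        return False


def start_worker(buffer, handler, results):
    """Start a back-end worker that drains the buffer in FIFO order."""
    def run():
        while True:
            request = buffer.get()
            if request is None:          # sentinel: stop the worker
                buffer.task_done()
                break
            results.append(handler(request))
            buffer.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The front end only pays the cost of `submit`; how quickly results appear depends entirely on how fast the worker can drain the buffer, which is the delay the paragraph above refers to.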

Privileged processing

In this mode, users are classified. Based on the preset classification, the system gives priority to the user groups that most need guaranteed service, and requests from other user groups are delayed or simply not processed.
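
One way to picture this is a priority queue keyed by user tier. The sketch below is only illustrative: the `PriorityDispatcher` class, the tier numbers (0 is the highest priority) and the `max_tier` cut-off are hypothetical choices, not something the original scheme prescribes.

```python
import heapq
import itertools


class PriorityDispatcher:
    """Serves higher-priority user tiers first; drops tiers beyond a cut-off."""

    def __init__(self, user_tiers, default_tier=2, max_tier=2):
        self.user_tiers = user_tiers      # mapping of user -> tier, 0 is highest priority
        self.default_tier = default_tier  # tier for users not in the mapping
        self.max_tier = max_tier          # requests from tiers above this are not processed
        self._heap = []
        self._seq = itertools.count()     # tie-breaker keeps FIFO order within a tier

    def submit(self, user, request):
        tier = self.user_tiers.get(user, self.default_tier)
        if tier > self.max_tier:
            return False                  # this user group is not served right now
        heapq.heappush(self._heap, (tier, next(self._seq), request))
        return True

    def next_request(self):
        """Return the most privileged pending request, or None if there is none."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Lowering `max_tier` during a surge sheds the least privileged groups entirely, while the remaining groups are still served in tier order.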

2. Traffic limiting methods

In actual projects, the following technical methods can be used to limit access traffic:

Circuit breaking

For circuit breaking, a main reference is Hystrix, Netflix's open source component, which has three modules: the algorithm that decides when to break, the recovery mechanism, and alerting.

Hystrix reference link: https://github.com/Netflix/Hystrix

Counter method

The system maintains a counter that is incremented by one for each incoming request and decremented by one when a request completes; when the counter exceeds a specified threshold, new requests are rejected.

This simple approach can be extended with more advanced features: for example, the threshold need not be fixed and can be adjusted dynamically. There can also be multiple counters, each managing a different service, so that the services do not affect each other.
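
The counter method, including an adjustable threshold and one counter per service, could be sketched like this. The class name `ConcurrencyLimiter` and its methods are assumptions for illustration. The sketch checks the threshold before incrementing, so the count of in-flight requests never goes above it.

```python
import threading


class ConcurrencyLimiter:
    """Counts in-flight requests and rejects new ones at the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._in_flight = 0
        self._lock = threading.Lock()   # the counter is shared between request threads

    def try_acquire(self):
        """Call when a request arrives; False means the request must be rejected."""
        with self._lock:
            if self._in_flight >= self.threshold:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Call when a request completes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    def set_threshold(self, threshold):
        """Adjust the limit at runtime."""
        with self._lock:
            self.threshold = threshold


# One counter per service, so a surge on one does not starve the other.
limiters = {"orders": ConcurrencyLimiter(2), "reviews": ConcurrencyLimiter(1)}
```

A request handler would call `try_acquire()` on entry and `release()` in a `finally` block, so the counter is decremented even when the request fails.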

Queue method

This method is based on a FIFO queue: all requests enter the queue, and the back-end program takes pending requests from it in turn. The queue-based approach also allows variations, such as multiple queues with different priorities.

Token bucket method

This method is also queue-based: requests are put into a queue. In addition to the queue there is a token bucket, and a script places tokens in the bucket at a constant rate. The back-end processor must take one token from the bucket for every request it processes, and if the bucket is empty it cannot process the request. By controlling the rate at which the script adds tokens, we control the speed of back-end processing and so achieve dynamic flow control.
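
A minimal token bucket might look like the sketch below. Note one substitution: instead of a separate script that adds tokens on a timer, this version refills lazily, computing how many tokens have accrued from the elapsed time whenever a token is requested, which gives the same constant average rate. The names `TokenBucket`, `rate`, `capacity`, `try_take` and the injectable `clock` are assumptions for the example.

```python
import time


class TokenBucket:
    """Token bucket with lazy refill: tokens accrue at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so tests can fake time
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_take(self):
        """Take one token if available; False means the request cannot be processed yet."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def set_rate(self, rate):
        """Change the refill rate at runtime for dynamic flow control."""
        self._refill()              # settle tokens earned at the old rate first
        self.rate = rate
```

The back-end processor would call `try_take()` before pulling each request off the queue and leave the request queued when it returns `False`.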

3. Precautions for service traffic limiting

When implementing service traffic limiting, there are still some principles and points to keep in mind:

Real-time monitoring: The system must monitor the entire request path in real time so that problems are detected promptly and traffic limiting is applied in time.
Manual switch: In addition to automatic traffic limiting, there must be a manually controlled switch so that operators can intervene at any time.
Performance of traffic limiting: In principle, traffic limiting itself affects normal service performance to some extent, so its overhead needs to be optimized and controlled.

IV. Summary

System failures are often unpredictable and unavoidable. Therefore, as system designers, we must take various measures in advance to deal with possible system risks at any time.

Author: IVAN – JSJWK

Source: More than think subscription number (ID: bzsikao)
