1. Crash day
December 20 was the day Xi'an crashed.
There were 21 new cases on December 19 and 42 on December 20, and some cases had spread in the community…
Xi'an is under enormous epidemic-prevention pressure: every organization and company requires a nucleic acid test report from the past 48 hours before employees can go to work.
In such a severe situation, Xi'an One Code Pass, the core of the prevention and control system, unexpectedly collapsed, and collapsed thoroughly.
It was down for more than 15 hours!
Throughout the day, countless office workers were blocked at subway entrances and countless travelers were stranded midway, unable to go forward or back…
By the afternoon, the news even suggested:
To reduce pressure on the system, it is suggested that the general public not show or display their code unless necessary. If the system stalls, please wait patiently and try to avoid refreshing repeatedly. Thank you all for your understanding and cooperation.
Is this really the way to solve the problem?
If you truly need to limit traffic to prevent the system from crashing, wouldn't it be easier to do it by technical means, even just by putting an Nginx in front of the system?
Today, we will try to analyze this business and the technical problems behind it.
2. Product analysis
We will not analyze the other services of Xi'an One Code Pass. They are not the point, and they did not completely collapse that day; only the code-scanning function did.
This is in fact a very typical read-heavy, write-light business. Even without looking closely, it is safe to say that more than 90% of the traffic is queries.
Let's look at the product form of the first version. After the code is scanned, part of the user's name and ID card information is displayed, with a green, yellow, or red code below it.
This was the original Xi'an One Code Pass: the business process needs only one request, and could even be served by a single SQL query.
Since then, the interface has undergone two major revisions.
The first revision added vaccination information, shown with a border; the second added nucleic acid test information, showing the time and result of the test at the bottom.
This adds two more query services to the page. If the system uses a relational database, that may mean at least two more SQL queries.
According to statistics, Xi'an has a population of 13 million. If at most 10% of residents scan the code at the same time (I doubt there would be that many), that is about 1.3 million concurrent requests.
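The estimate above can be written out as a back-of-envelope calculation. This is only a sketch: the population and the 10% share come from the text, while the figure of three queries per scan (the original query plus the two added by the revisions) is an assumption, and the function name is made up for illustration.

```python
from typing import Tuple

POPULATION = 13_000_000      # Xi'an population cited in the text
PEAK_SHARE_PERCENT = 10      # the text's upper-bound guess for simultaneous scans
QUERIES_PER_SCAN = 3         # assumption: 1 original query + 2 added by the revisions


def estimate_peak_load(population: int, share_percent: int,
                       queries_per_scan: int) -> Tuple[int, int]:
    """Return (concurrent_users, backend_queries) for one simultaneous scan wave."""
    concurrent_users = population * share_percent // 100
    backend_queries = concurrent_users * queries_per_scan
    return concurrent_users, backend_queries


users, queries = estimate_peak_load(POPULATION, PEAK_SHARE_PERCENT, QUERIES_PER_SCAN)
print(f"{users:,} concurrent users -> {queries:,} backend queries")
```

Under these assumptions, each revision that adds a query to the page raises the backend load in proportion, which is why the later rollback to the first version matters.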
Workloads with this level of concurrency are common at Internet companies, often in scenarios more complex than this one.
So why did this one blow up?
3. Technical analysis
In the official reply that evening, we read this:
“At around 7:40 am on December 20, user traffic to Xi'an One Code Pass surged, with requests per second reaching more than 10 times the previous peak. This caused network congestion, and some applications, including One Code Pass, could not be used normally.
The One Code Pass backend monitoring raised an alarm immediately, and the on-site 24-hour communications, network, government cloud, security, and operations teams immediately began investigating. The platform application system and database were running normally, and the problem was judged to be at the network interface.”
According to this statement, the database and the platform were normal. It was the network that had a problem.
In my previous article, “A DNS cache disaster”, I drew an access diagram. Let's use the same request flow to analyze the network problem.
A typical user request starts with a domain name, which a DNS server resolves to a public IP address. The request reaches that IP, passes through the firewall and load balancer, and is forwarded to a server. The server responds, and the result is returned to the browser.
If the network really is the problem, the most common causes are a DNS resolution error or saturated external bandwidth.
A DNS resolution error cannot be the cause this time; otherwise more than one function would have failed. And if external bandwidth had been saturated, adding bandwidth would have fixed it directly; it would not have taken a whole day.
A network-side problem usually requires no change to the business logic, yet when the system was restored, everyone found that the interface had gone back to the first version described earlier in this article.
In other words, the system had been rolled back.
The vaccination and nucleic acid test information is gone from the interface, and a new nucleic acid query page has been added to the One Code Pass home page.
So was it really just the network interface? I have my doubts.
4. Personal analysis
In my experience, this is a classic case of system overload: for a short period, the number of incoming requests exceeds the number the servers can respond to.
In plain terms, external requests exceed the system's maximum processing capacity.
Of course, that maximum capacity is closely tied to the system architecture. The load the same servers can carry varies greatly from one architecture to another.
There are really only two ways to deal with such problems: limit traffic or expand capacity.
Limiting traffic means turning some users away and serving the requests that can be handled first; expanding capacity means adding servers and increasing database capacity.
The official request mentioned above, asking everyone not to refresh One Code Pass, is also a way of limiting traffic, a manual one; technical systems basically never do it that way.
There are many technical schemes for limiting traffic. The simplest is to put Nginx in front of the system and configure rate limiting there; a more elaborate option is for the access layer to implement its own algorithm.
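As an illustration of what "the access layer writes its own algorithm" could look like, here is a minimal token bucket, one common rate-limiting algorithm. It is a sketch only: the class, the `handle_scan` function, and the 200/503 status codes are made up for this example, and nothing here is known to match what the real system or any particular Nginx configuration does.

```python
import time


class TokenBucket:
    """Admit about `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.clock = clock                # injectable so the limiter can be tested
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_scan(bucket: TokenBucket) -> int:
    """Hypothetical request handler: 200 if admitted, 503 if the request is shed."""
    return 200 if bucket.allow() else 503
```

Requests that are shed fail fast with an error instead of piling up behind an overloaded backend, which is the technical counterpart of asking users not to refresh.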
Of course, limiting traffic doesn't really solve the problem; it just keeps some requests out. The real solution is to expand capacity so that all users can be served.
But judging from how the problem was resolved and from the product rollback, One Code Pass did not expand capacity as its first response; it chose to roll back.
This suggests that the system architecture was not designed with expansion in mind, so expansion could not be the first choice.
5. Ideal solution?
This is all just speculation; in reality they may have faced more practical problems, such as tight deadlines and bosses controlling budgets…
Then again, if you were the architect in charge of One Code Pass, how would you design the overall technical solution? You are welcome to leave a comment. Here is what I think.
The first step is read/write separation and caching.
The system is split into at least two parts. The read services that meet everyday needs are extracted and used to absorb the bulk of external traffic.
A separate subsystem handles business updates, such as updating vaccination information, changing nucleic acid test information, or periodically changing the code color according to business rules.
At the same time, for the large volume of single-user queries, add a cache and read from it first, so that the database behind it is not crushed.
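The read path described above can be sketched as a cache that is checked before the database. This is an in-memory stand-in under stated assumptions: a real deployment would more likely use a shared cache service, and the class name, the `load_from_db` callback, and the TTL are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class HealthCodeCache:
    """Check the cache first; only fall back to the database on a miss or expiry."""

    def __init__(self, load_from_db: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_db          # slow path, e.g. a SQL query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}  # user_id -> (expiry, record)

    def get(self, user_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]                # cache hit: the database is not touched
        record = self._load(user_id)       # cache miss: read from the database
        self._entries[user_id] = (now + self._ttl, record)
        return record

    def invalidate(self, user_id: str) -> None:
        """Called by the update subsystem after it changes a user's record."""
        self._entries.pop(user_id, None)
```

Because the update subsystem is separate, it only needs to call `invalidate` (or let the TTL expire) for readers to pick up a changed code color.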
The second step is to shard the databases and tables and to split the services.
In fact, one user's query has nothing to do with another's, so the databases and tables can be sharded by a user attribute.
For example, take the user ID modulo 64 to split the data into 64 tables; you could even split the queries across 64 subsystems, distributing traffic at the front of the interface layer to reduce the pressure on any single table or service.
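A minimal sketch of the modulo-64 routing just described, assuming numeric user IDs. The table naming scheme (`health_code_00` to `health_code_63`), the `color` column, and the `%s` placeholder style are hypothetical and only serve the illustration.

```python
from typing import Tuple

SHARD_COUNT = 64  # the number of tables suggested in the text


def shard_index(user_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Map a user ID to a shard number in the range 0..shard_count-1."""
    return user_id % shard_count


def table_for(user_id: int) -> str:
    """Hypothetical naming scheme: health_code_00 ... health_code_63."""
    return f"health_code_{shard_index(user_id):02d}"


def build_query(user_id: int) -> Tuple[str, Tuple[int]]:
    """Return a parameterized query aimed at the one table that holds this user."""
    sql = f"SELECT color FROM {table_for(user_id)} WHERE user_id = %s"
    return sql, (user_id,)


print(build_query(130)[0])
```

The same `shard_index` value could equally select one of 64 service instances at the access layer, so that each instance only ever touches its own table.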
Based on the analysis above, the failure to expand capacity in time may mean that the services were not split. If each service is a small, single-purpose service, it is easy to expand capacity under overload.
Of course, if conditions allow, a microservices architecture is better, and it has established solutions for problems like this.
The third step is a big data system and disaster recovery.
If a single page presents a lot of information, one technical solution is to consolidate it into one wide table in a NoSQL store through asynchronous data cleaning.
Services such as code scanning and queries can then read directly from the NoSQL database.
The advantage is that even if the update service goes down completely, users' scan queries are unaffected, because the two systems and their databases are completely separate.
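The wide-table idea can be sketched as follows. The dictionary-backed `ScanStore` merely stands in for a NoSQL store, and every field name and loader callback is hypothetical; the point is only that the scan path reads one pre-merged document instead of running three queries.

```python
from typing import Callable, Dict, Optional


def build_scan_document(profile: dict, vaccination: dict, nucleic_test: dict) -> dict:
    """Merge the three source records into one denormalized document."""
    return {
        "user_id": profile["user_id"],
        "masked_name": profile["masked_name"],
        "color": profile["color"],
        "vaccination": vaccination,
        "nucleic_test": nucleic_test,
    }


class ScanStore:
    """Stand-in for a NoSQL document store, keyed by user ID."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def upsert(self, doc: dict) -> None:
        self._docs[doc["user_id"]] = doc

    def get(self, user_id: str) -> Optional[dict]:
        return self._docs.get(user_id)  # the only call made on the scan path


def sync_user(store: ScanStore, user_id: str,
              load_profile: Callable[[str], dict],
              load_vaccination: Callable[[str], dict],
              load_test: Callable[[str], dict]) -> None:
    """Run by an asynchronous job, off the request path, whenever source data changes."""
    doc = build_scan_document(load_profile(user_id),
                              load_vaccination(user_id),
                              load_test(user_id))
    store.upsert(doc)
```

If the job behind `sync_user` stops, `ScanStore.get` keeps answering with the last synchronized document, which is the isolation property described above.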
Deploy services in two equipment rooms in different locations, and prepare overall disaster recovery and contingency plans for extreme situations, such as the optical cable to an equipment room being cut.
There are many more optimization details that I won't go into here. These are just some of my ideas; additions are welcome.
6. Final thoughts
No matter how you analyze it, this was a man-made disaster, not a natural one.
The system was put into production without rigorous testing and crashed under a slightly heavier load.
Many cities are bigger than Xi'an, and some have faced even more serious outbreaks. Other cities have also encountered similar problems.
For a major science and technology hub like Xi'an, such a problem really should not occur, especially after I saw the domain name used behind this mini program.
It left me feeling too weary even to mock it. The domain name has nothing to do with how the program works, but details like this really do reveal the strength of a technical team.
Hopefully we can learn a lesson this time and avoid similar problems in the future!