This article is based on a keynote delivered by HSA Senior Architect Hsu Hanbin at the Technology Salon jointly hosted by HSA and DigitalTechnologies. Centering on "high-concurrency practice under peak traffic", it introduces the high-availability architecture practice of the Tencent QQ member activity platform. Below are excerpts from the speech.
The theme I am sharing today is high-availability architecture practice for heavy-traffic activities. Before starting, let me briefly introduce myself. My name is Xu Hanbin. Before my current role I was responsible for the QQ member activity operation platform, which grew from millions of PV to the billion level; handling heavy-traffic activities there, I accumulated some experience, and today I would like to discuss this topic with you.
The content to be shared is divided into three parts:
1. Evaluation method of system capacity expansion for heavy-traffic activities
2. High availability architecture design practice
3. Practical examples of mass traffic activities
Evaluation method of system capacity expansion for heavy traffic activities
Mass traffic activities take many forms. Besides the familiar e-commerce promotions (Double 11, 618, and so on), livestreaming sales have emerged in recent years: a well-known host guides fans to place a large number of orders within a short time, a typical heavy-traffic scenario that puts great pressure on system stability. Another common form is marketing activities, such as Spring Festival red envelopes and anniversary celebrations. Traffic for these campaigns comes from internal channels, external paid advertising, and third-party cooperative referrals. The company invests a large amount of money in the activity and a large number of users participate; if something goes wrong in the middle of it, the result is not only serious damage to the product's reputation but also financial loss, and it may jeopardize the cooperation with third parties.
For example, suppose you cooperate with the Spring Festival Gala and viewers are asked to pick up their phones and scan a code during a certain segment of the show, and at that exact moment the system goes down and scanning fails. That situation is extremely embarrassing. The first call you get may not even come from inside the company but from the external third-party partner, which shows how important high availability is at critical moments.
The challenges of high-traffic activities fall into three main parts. The first challenge is that traffic is hard to estimate: if you tell operations that you want to expand capacity by 10 times, they will certainly ask why, and you need reasonable grounds for the expansion. The second challenge is that the architecture is hard to scale: even with a fairly accurate estimate of peak traffic, how do we actually expand the system? Modern IT architectures keep growing in complexity, especially the popular microservice architecture with more and more nodes; the call relationships between services are intricate, and just sorting out the links is itself a major difficulty. The third challenge is how to implement high availability: once we have done the traffic assessment, link sorting, and expansion, will the activity really be problem-free on the day the heavy traffic arrives? Of course not, so the third question is how to implement high availability.
To tackle these challenges, the dual complexity of business and system is unavoidable. The complexity of the microservice architecture, coupled with the complexity of the business, makes the whole system hard for any person to fully understand. Even the company's architects and business developers can hardly explain every detail of the whole system link and every invocation relationship between services, because by then the system is a huge, interwoven network. To solve the traffic-estimation problem, fine-grained link sorting is necessary. The first step is to draw a schematic diagram of the architecture, laying out each major module, and then decompose the service function links according to that diagram.
Once the link diagram is sorted out, we can start to estimate capacity expansion. Three important metrics are involved: promotion volume, link conversion rate, and the total number of requests per UV.

The first metric, promotion volume, is usually calculated from the number of ad exposures per second: add up the exposures from the different advertising channels to get an estimated promotion level. Another way is to use the number of peak APP logins per second, which is roughly equal to the number of promotional exposures per second. The estimated promotion volume per second is usually more than 10 times the existing capacity of the system; expanding to that level would keep the system safe, but it would waste resources badly.

That is why we need the second metric, link conversion rate, to make a more detailed assessment. Take a red envelope activity as an example: users who see the activity may not click through to the page, and users who enter the page may not claim the red envelope, so every step of the link has its own conversion rate. There are two ways to estimate it. One is to rely on historical data from past activities. The other is to rehearse in advance: push the activity to a subset of users ahead of time and quickly, accurately collect the conversion rate of every step. Companies with the resources usually take this second approach.

The third metric we need to pay special attention to is the total number of requests per UV. Why? Because a user who enters an activity page may also look up personal information, view activity records, and so on; there is usually more than one request per visit, so back-end pressure cannot be computed from the activity peak alone. If the average user issues four requests, your traffic is multiplied by four, a magnifying effect.

With these three metrics we can calculate how much pressure the back end will bear and arrive at the capacity-expansion estimate. Finally, the expansion target is set to 3 times the estimated capacity. This factor of 3 is not arbitrary; it is a rule drawn from the experience of many past expansions: the peak on the day of an activity is usually about 3 times the normal average. As a concrete example (see the sketch below): suppose we have 100 exposures per second, only 60% of users enter the claim page, and only 50% of those click the claim button; the back end then receives 30 claim requests per second. With 4 requests per UV, the whole back end must support at least 120 requests per second, and applying the 3x rule the final back-end expansion target is 360/s. The peak page traffic is 180/s (60/s times 3). With these numbers you can go to operations and request the expansion.
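To make the arithmetic concrete, here is a minimal sketch of the estimation. The numbers mirror the worked example above (100 exposures/s, 60% and 50% conversion, 4 requests per UV, the 3x peak rule); the class and variable names are purely illustrative.

```java
// Minimal capacity-estimation sketch for the worked example above.
public class CapacityEstimate {
    public static void main(String[] args) {
        double exposuresPerSec = 100;    // promotion volume (ad exposures per second)
        double pageConversion  = 0.60;   // share of users who open the activity page
        double claimConversion = 0.50;   // share of those who click the claim button
        double requestsPerUv   = 4;      // average back-end requests per user visit
        double peakFactor      = 3;      // empirical 3x rule for the activity-day peak

        double pageQps    = exposuresPerSec * pageConversion;   // 60/s
        double claimQps   = pageQps * claimConversion;          // 30/s
        double backendQps = claimQps * requestsPerUv;           // 120/s

        System.out.printf("page peak to provision:    %.0f/s%n", pageQps * peakFactor);    // 180/s
        System.out.printf("backend peak to provision: %.0f/s%n", backendQps * peakFactor); // 360/s
    }
}
```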
Suppose an activity's capacity is estimated at 9.6W/s (96,000 requests per second); by the 3x rule, the entire service link needs to be expanded to about 30W/s (300,000/s). However, some applications are hard to expand. A Web server is easy to scale horizontally: with enough machines, capacity can be added. But some interfaces handle only 2,000 to 3,000 QPS in daily operation, and expanding them to 100,000 QPS is basically impossible; there are simply not enough resources or time. This brings us to the specific practice of capacity expansion, which is also the subject of the next part: high availability architecture design practice.
High availability architecture design practice
Generally speaking, our high-availability architecture practice involves three major expansion factors: full-link QPS, machine bandwidth, and storage size.
Full-link QPS is the easiest to understand and the one we pay the most attention to, so I won't expand on it here.

Machine bandwidth problems become more likely as traffic grows. Say CGI requests reach 100,000/s (10W/s). A request usually hits storage more than once; in our normal business a single request touches storage 7 to 8 times, querying the activity configuration, the user's participation records, other associated information, and so on. So 100,000 front-end requests per second means the storage side has to withstand several hundred thousand or even a million accesses per second. If each record is 1 KB or even hundreds of KB, the data volume at this scale is huge, and a 100 Mbit/s network card will clearly run into a bandwidth bottleneck.

Next comes storage size, which usually means memory rather than disk, and it is also prone to bottlenecks. Most companies use Redis-style key-value storage, and some use their own key-value stores. Data is typically written to memory first and then synchronized to local SSDs or disks through a periodic eviction mechanism. Under heavy traffic, memory fills up easily: without prior evaluation, hundreds of gigabytes of memory can be filled almost instantly on the day of the event.

Static resources are the images, JavaScript, and CSS that web pages request, and here the bottleneck is often not the server itself. If bandwidth were not a constraint, an Nginx instance on an ordinary machine could easily be load-tested to thousands of QPS for static resources. However, a single user opening one page requests many static resources, whose total size can easily exceed 1 MB. The first thing to give out is the machine's egress bandwidth, not the performance of the static-resource server.
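As a rough, back-of-the-envelope illustration of why the network card gives out first, the following sketch plugs in the figures quoted above (100,000 requests/s, roughly 8 storage accesses per request, about 1 KB per record); the exact numbers are assumptions for illustration only.

```java
// Back-of-the-envelope bandwidth check for the figures quoted above.
public class BandwidthCheck {
    public static void main(String[] args) {
        double cgiQps = 100_000;         // 10W/s front-end requests
        double storageCallsPerReq = 8;   // roughly 7-8 storage accesses per request
        double bytesPerRecord = 1024;    // about 1 KB per stored/queried record

        double bytesPerSec = cgiQps * storageCallsPerReq * bytesPerRecord;
        double gbitPerSec  = bytesPerSec * 8 / 1e9;

        // Roughly 6.5 Gbit/s of storage traffic: far beyond a 100 Mbit/s NIC,
        // and already a significant share of a single 10 Gbit/s card.
        System.out.printf("storage traffic: %.1f Gbit/s%n", gbitPerSec);
    }
}
```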
For the static-resource problem, we have a common solution: an offline package push mechanism. Several days before the activity starts, the static resources are pushed to the phones of active users, so that more than 90% of this traffic on the day of the activity is served locally and never generates a network request at all, which relieves both the CDN pressure and the bandwidth problem. So much for static resources; the following are the key points of building a high-availability architecture. They can be summarized as: overload protection, flexible availability, disaster recovery construction, separation of light and heavy (core and non-core) operations, a monitoring system, and data reconciliation.
The monitoring system is important: we need to know what state the system is in, and we need plans ready for when something goes wrong. Earlier I mentioned a problem: some interfaces handle only 2,000 to 3,000 requests per second, and there are not enough resources or time to expand them to 100,000/s, so what do we do? There are three commonly used ways to optimize the system link architecture. The first is asynchronization, which is easy to understand: the request is accepted, put on a message queue, and digested slowly (see the sketch below). The second is a memory cache: the data to be accessed is loaded into an in-memory cache in advance. The third is service degradation, which is used very commonly and needs no further explanation. These methods can all improve the performance of the link.
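As a hedged illustration of the first technique, here is a minimal asynchronization sketch: the request path only enqueues a task, and a small worker pool drains the queue at a pace the slow downstream interface can sustain. The queue, task type, and worker count are hypothetical; in production the in-process queue would normally be replaced by message-queue middleware so the backlog survives restarts.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Asynchronization sketch: accept fast on the request path, process slowly in the background.
public class AsyncShipping {
    private static final BlockingQueue<String> tasks = new LinkedBlockingQueue<>(100_000);
    private static final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Called on the request path: enqueue and return immediately; reject (or degrade) when full.
    static boolean submit(String orderId) {
        return tasks.offer(orderId);
    }

    // Background consumers call the slow delivery interface one task at a time.
    static void startWorkers() {
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        ship(tasks.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    static void ship(String orderId) {
        System.out.println("shipped " + orderId);   // stand-in for the real delivery call
    }

    public static void main(String[] args) throws InterruptedException {
        startWorkers();
        submit("order-1");
        submit("order-2");
        workers.awaitTermination(1, TimeUnit.SECONDS);  // demo only: let workers run briefly
        workers.shutdownNow();
    }
}
```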
The core goal of overload protection is that partial unavailability is at least better than complete unavailability, and the main means of implementation is layered flow control. The layers include the promotion layer, the front-end layer, the CGI layer, and the back-end service layer. Flow control at the promotion layer is a last resort, because it means pulling advertising; companies are unwilling to use it unless forced to, and it hurts the business effect of the activity. At the front-end layer, you can pop up a random countdown that prevents the user from issuing a new network request for, say, one minute, which helps shave the traffic peak. Flow control at the other layers is implemented in a similar way (a minimal server-side sketch follows), so I won't expand on it.
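As one possible shape of flow control at the CGI or back-end layer, here is a minimal fixed-window limiter sketch; the limit and window are made-up values, and a production system would typically use a mature rate-limiting component instead.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal per-second flow-control sketch. Requests above the configured QPS limit
// are rejected immediately so the service degrades partially instead of collapsing.
public class SimpleRateLimiter {
    private final long limitPerSecond;
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis() / 1000);
    private final AtomicLong counter = new AtomicLong();

    public SimpleRateLimiter(long limitPerSecond) {
        this.limitPerSecond = limitPerSecond;
    }

    public boolean tryAcquire() {
        long nowSec = System.currentTimeMillis() / 1000;
        long start = windowStart.get();
        if (nowSec != start && windowStart.compareAndSet(start, nowSec)) {
            counter.set(0);   // a new one-second window begins
        }
        return counter.incrementAndGet() <= limitPerSecond;
    }

    public static void main(String[] args) {
        SimpleRateLimiter limiter = new SimpleRateLimiter(3);
        for (int i = 0; i < 5; i++) {
            // A rejected request would get a "try again in N seconds" countdown page.
            System.out.println("request " + i + " accepted=" + limiter.tryAcquire());
        }
    }
}
```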
In terms of deployment, we should protect the core operations that matter most, such as prize delivery in the case above, so that they are not affected by other operations. If a query operation is unavailable, the user only needs to refresh the page; but if a prize is never delivered, the user will complain directly. For this reason, the delivery operation is split out and deployed independently as an asynchronous delivery cluster.
How do we achieve flexible availability? There are generally two directions. The first is to set a super short timeout, for example 30 milliseconds for the security interface. The other is a UDP-style service that fires the request without waiting for the network packet to return.
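Here is a minimal sketch of the super-short-timeout idea, assuming the non-critical dependency is the security check mentioned above: give it 30 ms, then fall back to a default so the core flow is never blocked. The service call and the fallback policy are illustrative assumptions.

```java
import java.util.concurrent.*;

// Flexible-availability sketch: give a non-critical dependency only 30 ms,
// then fall back to a default answer instead of stalling the main flow.
public class ShortTimeoutCall {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    static boolean securityCheck(String userId) {
        Future<Boolean> future = pool.submit(() -> callSecurityService(userId));
        try {
            return future.get(30, TimeUnit.MILLISECONDS);   // super short timeout
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);
            return true;   // degrade: treat as passed so the core flow continues
        }
    }

    static boolean callSecurityService(String userId) throws InterruptedException {
        Thread.sleep(200);   // simulate a dependency that has become slow under load
        return true;
    }

    public static void main(String[] args) {
        System.out.println("check result: " + securityCheck("user-1"));
        pool.shutdownNow();
    }
}
```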
In disaster recovery construction, we should boldly make assumptions: assume that every core service and every storage component may fail, and formulate the corresponding recovery and fallback plan for each assumption. For example, against a machine-room power outage or network failure, we can deploy across machine rooms and across networks, so that even if a telecom fibre is cut, the service stays basically normal. In addition, emergency plans should be prepared in advance: turn every handling scheme for every assumed key-node failure into a switch (see the sketch below), so that on the day of the activity nobody has to dig through logs and write handling scripts on the spot.
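A minimal sketch of the "plan as a switch" idea follows. The switch names and the in-memory map are assumptions; in practice the flags would live in a configuration center so they can be flipped from a console during the activity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Emergency-plan-as-a-switch sketch: every prepared failure response is a named
// flag that an operator can flip on the day of the activity.
public class DegradationSwitches {
    // In practice this map would be backed by a config center rather than local memory.
    private static final Map<String, Boolean> switches = new ConcurrentHashMap<>();

    static boolean isOn(String name) {
        return switches.getOrDefault(name, false);
    }

    static void handleQuery() {
        if (isOn("degrade.activity.records")) {
            System.out.println("return cached/empty records");   // prepared fallback path
            return;
        }
        System.out.println("query storage as usual");
    }

    public static void main(String[] args) {
        handleQuery();                                    // normal path
        switches.put("degrade.activity.records", true);   // operator flips the switch
        handleQuery();                                    // degraded path
    }
}
```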
For monitoring and reconciliation, we must build a multi-dimensional monitoring system. Some problems only show up under peak traffic pressure; multi-dimensional monitoring helps us see the real state of the system and take the right countermeasures. As for reconciliation: if a delivery fails and the engineer on duty has to read the online logs and write a reissue script by hand, half an hour or an hour is gone before anything happens, and the people on duty are usually busy. The best approach is to build this capability in advance, so that failures found on the day are detected from the logs and reissued automatically (a sketch follows).
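A minimal sketch of automated reconciliation, under the assumption that failed deliveries are recorded somewhere queryable: a scheduled job scans them and reissues, instead of someone writing a script by hand on the day. findFailedDeliveries() and reissue() are hypothetical stand-ins for the real storage and delivery calls.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Reconciliation sketch: a scheduled job scans failed deliveries and retries them.
public class DeliveryReconciler {
    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public static void main(String[] args) {
        // Runs until the process is stopped; interval is illustrative.
        scheduler.scheduleAtFixedRate(DeliveryReconciler::reconcileOnce, 0, 1, TimeUnit.MINUTES);
    }

    static void reconcileOnce() {
        for (String orderId : findFailedDeliveries()) {
            try {
                reissue(orderId);
            } catch (Exception e) {
                // keep it in the failed set; the next round will pick it up again
            }
        }
    }

    static List<String> findFailedDeliveries() { return List.of(); }  // query the failure records
    static void reissue(String orderId) { /* call the delivery service again */ }
}
```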
Practical examples of mass traffic activities
We have talked a lot about preparing the architecture for high-traffic activities; in this part we turn to concrete practical cases. The first case is the Spring Festival red envelope activity of a certain business in a certain year. Although we made a lot of preparations beforehand, a series of problems appeared when the activity went live. Fortunately, they were all within our contingency plans, so the event passed with a few scares but no real harm.
After the event we held a review and reflected: why did so many problems appear? First, the 3x capacity-expansion rule was not followed; second, for example, the link sorting was not thorough enough. The following is a summary of what we learned.
In our past experience, the most dangerous activities are not the biggest ones: to ensure that a big event runs smoothly, the company mobilizes many key people, so problems are usually minor. Small and medium activities, on the other hand, do not bring much traffic, so they pose no threat either. It is the medium-to-large events that are most likely to go wrong. Here is an example of one. For the annual event of one game we did some evaluation and preparation, but it is impossible to treat every event like a Spring Festival red envelope campaign. At the time we sent two people to take a look, each online module team load-tested its own module, all the feedback came back fine, and the activity went online. In the early hours of the morning I was woken up by a phone call: the system was in trouble and many users could no longer participate in the event.
The Web system at the time was based on PHP and Apache, using Apache's multi-process (prefork) mode. The response time of a back-end service is normally tens of milliseconds, so one worker process should be able to handle more than ten requests per second. At peak, however, under the pressure of heavy traffic the back-end response time surged from tens of milliseconds to hundreds of milliseconds or even a second; at that point a worker process can only handle about one request per second, and requests pile up. We used the early-morning lull to do a lot of remediation, including degradation and other actions, and after a night's effort users could enter the activity normally again, although the problem was not completely solved. The next afternoon we continued at the office, finally resolved it, and got the activity safely through the crisis. This is how medium-to-large events go wrong when we do not pay enough attention to them.
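A quick back-of-the-envelope view of why the prefork model collapses when latency rises; the worker count here is made up for illustration, and only the latency figures come from the story above.

```java
// Why prefork capacity collapses when back-end latency rises:
// each worker handles requests one at a time, so per-worker throughput = 1 / latency.
public class PreforkCapacity {
    public static void main(String[] args) {
        int workers = 200;                         // hypothetical Apache prefork pool size
        double[] latenciesSec = {0.05, 0.5, 1.0};  // normal, degraded, badly degraded
        for (double latency : latenciesSec) {
            double totalQps = workers * (1.0 / latency);
            System.out.printf("latency %.2fs -> capacity %.0f req/s%n", latency, totalQps);
        }
        // 0.05s -> 4000 req/s, 1.0s -> 200 req/s: the same pool loses 95% of its
        // capacity, and any traffic above that backs up as queued requests.
    }
}
```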
To summarize the general experience and the work that needs to be done ahead of a heavy-traffic activity: first, make a reasonably accurate evaluation and sort out the plan; then optimize and adjust the architecture; and finally, sort out the emergency plans.
I took part in the Spring Festival red envelope activities for three consecutive years. Each year, about a month in advance, I would convene the core members to put together the plan. We prepared a great deal every year, yet small problems still appeared every year. Reflecting on why the system still had problems despite so much manpower and money, I think there are essentially three reasons. First, in many cases the test environment, and even the performance environment, is not representative of the real production environment. Second, the high availability of each individual part does not add up to high availability of the whole. Third, systems grow: complex business systems are constantly iterating and expanding, and having no performance problem today does not mean there will be none tomorrow.
To make sure the production environment really performs without problems, this points us toward a clear direction of exploration: regular performance testing in the production environment. All the preparation aims to ensure that the online environment meets certain performance targets, so the most direct way is to run real performance tests in production itself. For example, we pick two or three o'clock in the morning, when there is almost no traffic, to run the performance test and see whether the system reaches the target.
So what are the keys to actually running performance pressure tests in a production environment? There are three main points: generating the test traffic, identifying the test traffic, and, most importantly, isolating the test traffic. Writing a large amount of fake data into the production database is not acceptable, and the data would have to be cleaned up afterwards. As long as these three points are addressed, online performance testing becomes feasible.
This is a performance-testing platform we built for Java: through agent probes we can instrument the Java bytecode layer and add control logic without intruding on the user's business code. Production traffic is left unchanged, and only the pressure-test traffic is controlled. For example, if the system has a MySQL database, we create a shadow database online with the same structure as the original but with little or no data (only the table schema); the agent recognizes the incoming test traffic and switches its connection to the shadow database. Message queues and cache storage can be isolated in a similar way. In addition, we built something called a "baffle". What is it? Suppose there is a third-party SMS interface: it is fine for business traffic to call it, but test traffic certainly must not, because the third party cannot build a shadow SMS system for you. The baffle uses the agent to implement a mock and return the mocked response directly when the third-party interface is invoked, preventing test traffic from affecting production data.
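To make the two isolation mechanisms easier to picture, here is a simplified sketch written as plain code; the real platform injects equivalent logic at the bytecode level through the Java agent so business code stays untouched, and every name below is hypothetical.

```java
// Sketch of shadow-database routing and the "baffle" mock for test traffic.
public class TrafficIsolation {
    // Request-scoped flag carrying the "this is pressure-test traffic" mark,
    // typically propagated via an HTTP header or RPC attachment.
    static final ThreadLocal<Boolean> TEST_TRAFFIC = ThreadLocal.withInitial(() -> false);

    private static final String PRODUCTION_DB = "jdbc:mysql://db/prod";
    private static final String SHADOW_DB     = "jdbc:mysql://db/prod_shadow"; // same schema, little or no data

    // Shadow-database routing: test traffic never touches production tables.
    static String currentDbUrl() {
        return TEST_TRAFFIC.get() ? SHADOW_DB : PRODUCTION_DB;
    }

    // "Baffle": third-party calls are mocked for test traffic, real otherwise.
    static String sendSms(String phone, String text) {
        if (TEST_TRAFFIC.get()) {
            return "MOCK_OK";                 // never hit the real SMS provider
        }
        return realSmsGateway(phone, text);   // normal business path
    }

    private static String realSmsGateway(String phone, String text) {
        return "OK";                          // placeholder for the real third-party call
    }

    public static void main(String[] args) {
        System.out.println(currentDbUrl() + " / " + sendSms("13800000000", "hi"));
        TEST_TRAFFIC.set(true);               // mark the current request as test traffic
        System.out.println(currentDbUrl() + " / " + sendSms("13800000000", "hi"));
    }
}
```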
This is our full-link performance-testing solution for the production environment. On top of it we also developed automatic link sorting: for a complex business and a complex system containing many microservice and business nodes, no single person can fully understand every link in the whole chain. Through the agent probes we can trace, by technical means, how traffic flows through the system links and assist with link sorting; combined with business knowledge, the business invocation links can be worked out. With the E2E inspection platform, you can not only see the performance of a link but also locate the bottleneck in a specific segment, which helps you tune performance promptly and accurately, maintain a clear view of the service links, and achieve high availability of the system.
If you would like to join the discussion group or apply for a trial of the product, you can scan the QR code to add [Satree] on WeChat Work; we also share technical content from time to time.
Founded in 2016 by a group of senior experts from Alibaba, Serial Technology is a leading system high-availability specialist in China. Centered on solving microservice architecture governance and performance problems, it provides comprehensive guarantees for the performance and stability of enterprise systems. It has built a complete product matrix covering full-link pressure testing, E2E inspection, fault drills, and other modules, and is committed to helping enterprises raise system availability to 99.99%.