10 billion challenges: How to achieve a "sure" Spring Festival Gala shake system

This is a technical copy of the shake a shake event in early 2015 (The Year of the Sheep Spring Festival Gala). When reviewed two years later, there are many architectural trade-offs that can still be referenced for similar activities, so re-issue.

The Spring Festival Gala of the Year of the Sheep has come to an end. Now, what kind of backstage system does this interesting activity involve all the people? How is this system designed and implemented?

Spring Festival Gala shake activities

Before we understand the system, what kind of activities does the Spring Festival Gala have? Spring Festival gala shake shake reuse shake shake entrance, but provide a new interface and interactive content.

In the year of the Sheep Spring Festival Gala shake interface, users shake their phones, you can see the star New Year, family photo, friends cards and other wonderful activities page; There are also special purpose pages such as “take a break,” or “hang server,” which many mistakenly believe won the lottery.

The most anticipated event is definitely the red envelope shaking. The lucky users who win the red envelope will get a red envelope (seed red envelope) for themselves and several red envelopes to share with other friends (split red envelope).

Around these activities, the following will introduce some thoughts and practices in the process of designing and implementing this system through four milestone versions at different stages of the project, especially the origin of “certainty” mentioned in the title.

V0.1 prototype system

The prototype system is very simple, but it has basically achieved the requirements of the Spring Festival Gala shake. The architecture of the prototype system is shown below.

The related processing flow is as follows:

After the user shakes the phone, the client generates a shake request, which will be forwarded to the shake service after it is sent to the access service.
Shake a service according to the flow of live programs, through a series of logical judgment, to the client to return a result: star New Year, red envelope or other activities;
Assume that the red envelope is won. Since the red envelope is sponsored by the enterprise, the enterprise image needs to be displayed. The client will pull back the enterprise LOGO and other resources from the CDN, and finally display a complete red envelope.
Then when the user opens the red envelope, the request will enter the red envelope system, then the payment system, and finally the Tenpay system to complete a series of complex accounting processing, and finally get the red envelope;
Users can also share red packets, which are sent to friends or groups through the messaging system, and others can grab another round.
In this process, the security system guarantees the business security of the red envelope activity.

The flow of the above data can be classified as resource flow, information flow, business flow and capital flow. This article will focus on resource flow and information flow.

Challenges

The prototype system looks simple enough and its functions are basically complete. Can it be directly used in the Spring Festival Gala after slight modification? The answer is definitely no. So the question is: Why not?

Before we answer that question, what are the challenges behind this seemingly simple activity?

Massive user requests, with an estimated peak of 10 million requests per second

To get a better idea of the scale of 10 million/SEC, see the following figure:

Note: The data of train ticket snatching are quoted from public data

The Spring Festival Gala is full of interactive and uncertain factors

This system needs to interact closely with the whole process of the Spring Festival Gala of the Year of the Sheep. From the beginning to the end of the project, a series of uncertain factors will increase the complexity of the system implementation: In the development stage, the discussion on how to coordinate the program and activity forms may last until the Spring Festival Gala. How to make the system serve the changeable demands? In the live Spring Festival Gala, the number of programs, the length and even the order of the program may be adjusted, how to achieve the live program and shake shake seamless?

System depth customization, success or failure

As a system designed for the Spring Festival Gala, it can only run for a few hours after it is deployed online. Within these hours, the grayscale release advocated by the conventional system, first hold and then optimize is not very applicable. In that short time, there is only one chance: make it or break it.

The people are highly concerned and must succeed

Spring party has about 700 million viewers, we have great expectations for this activity, under the attention of the whole people, can only succeed, can not fail.

Lack of historical experience, not quite grasp

Such a large event, it is unprecedented for us, we do not have a lot of confidence. How is the peak of 10 million/s estimated? How many users will participate in each session? How many resources should be reserved for the system? There are no ready answers to these questions. They all require exploration and reflection.

It can be seen that the seemingly simple activity hides a huge challenge. The prototype system assumed before is unlikely to be competent and needs to be further optimized.

What aspects need to be optimized? There are three obvious ones:

Traffic bandwidth

The Spring Festival Gala needs to use a lot of multimedia resources, which need to be downloaded from THE CDN. After evaluation, the peak bandwidth demand is 3Tb/s, which will bring huge bandwidth pressure. Even if we had unlimited resources and bandwidth requirements could be met, the amount of time a client would have to wait to download a resource after a shake would be too much of a user experience detriment to be acceptable.

Access to quality

Access is the first level of the background system, all requests will reach access. With 350 million people expected to be online by the end of the day, how to ensure the quality of Internet access as much as possible? Even when the extranet fluctuates?

Huge amounts of request

When 10 million requests per second come in from the extranet, they are routed to shake service, which means shake service will also have 10 million requests per second. This means that you need to ensure that there are two 10 MBPS high availability in the system, which is a big problem in distributed systems.

If you had to quantify a confidence score on whether the system would ultimately succeed, the prototype system would have a confidence score of 10. This figure means that if the Spring Festival Gala adopts this system and is successful, we think it will be 90% luck. Can also be put another way: take this system to the Spring Festival Gala shake, 90% of the likelihood will die.

The system certainly cannot be based on luck. How can these problems be solved?

V0.5 beta

The goal of V0.5 is to solve several problems in the V0.1 prototype system.

Resource pre-download

The Spring Festival Gala uses more resources, but also relatively large, are basically static resources, can be downloaded to the client a few days in advance. After saving the file to the local PC, the client can load the file directly from the local PC when needed, thus saving the bandwidth required by centralized online download. In addition, users do not need to download resources when shaking, which provides better user experience. The following figure shows the resource pre-download process.

The resource push module uploads resources to the CDN and pushes the resource list to the client. The push process is based on the wechat message system, which can push the resource list to hundreds of millions of users in a very short time.

The following problems need to be solved:

Resource Delivery
Resource update
Resource download failed
Resource coverage
Offline Resource Security

Through this resource pre-download system, 65 resources were delivered from September 2015 to February 18 2015, with a cumulative flow of 3.7PB, among which the peak at idle time reached 1Tb/s.

Extranet access sorting

To ensure access quality, in addition to ensuring the stability of access services, the following requirements are required:

The ability to automatically switch to normal access services when some external network access fluctuates;
Ensure that the network and services have sufficient redundant capacity.

The wechat client already has the ability of external network automatic DISASTER recovery switchover. The following shows the external network deployment of the system.

We reorganized the deployment of the external network, and finally set up 9 TGW access clusters in Shanghai IDC and Shenzhen IDC respectively. Each IDC deployed telecom, mobile and Unicom external network access lines in three different campuses.

In addition, access services have also been temporarily expanded, with a total of 638 access servers, supporting a maximum of 1.46 billion simultaneous online services.

Access service built-in “shake”

Architecture changes

As mentioned earlier, the system needs to deal with an estimated 10 million requests per second from the external network, and also needs to forward the same 10 million requests per second from the internal network for the shake service, in addition to ensuring high availability in various abnormal situations. With so many requests, it’s hard to imagine how any fluctuation in network or service in this distributed system could be catastrophic.

The cost of addressing this head-on was too high and we chose to remove the shake service — by integrating the shake logic into the access service, we removed 10 million/SEC forward requests.

However, there is a precondition for doing so: the stability of access services should not be reduced. Because not only the Spring Festival Gala shake request, wechat message sending and receiving and other basic functions of the request also need to be transferred through the access service, if the embedded shake logic drag down the access service, it will be more than worth the loss.

The architecture of access services helps solve this problem.

As shown in the figure above, the access service has a network IO module that maintains TCP connections from the client, sends and receives data, and communicates with the client. The network IO module operates as an independent process group. The received requests are sent to the access logic module for processing through the local shared memory. The access logic module is also a set of processes that run independently, typically by request forwarding logic to other logical servers for processing via RPC. Shake logic is now integrated into the access logic module as embedded logic. The network I/O module and the access logic module are isolated from each other. Therefore, the access logic module can be upgraded independently without affecting the existing network connection, which greatly reduces the risk caused by the embedded logic.

But this is not enough, we can also simplify the rocking logic embedded in the access logic module as much as possible, and only keep the relatively simple, stable, only need to complete the single computing lightweight logic; The other more complex logic, which may need to be changed frequently, or need to be accessed across the machine, is independent as the shake Agent. As an independent running process, it communicates with the embedded shake logic through the local shared memory.

Shake the logic implementation

The modification of the access service architecture makes the built-in shake logic possible. What remains is how to implement the shake logic? Spring Festival Gala shake logic is the most important is shake red envelope, shake red envelope need to focus on solving the following problems:

How to distribute red envelopes?

All red packets are derived from the red packet system, but in order to run without relying on the red packet system, the seed red packet files generated by the red packet system are deployed to the access service in advance. To ensure that the red packets will not be distributed repeatedly, there is a file shredding program to complete the file shredding of different machines, each machine only keeps the part of the red packets that it needs to process; In addition, to ensure that the red packets do not leak, there is another program to complete all the machine red packets file merge verification.

If the users who win the red envelope choose not to share the red envelope, the red envelope will be recycled by the red envelope system, and part of the red envelope will be recycled to the access service through the shaking agent.

The shake Agent provides deliverable seed red packets to the built-in shake logic based on preset policies. The red packets are delivered at a uniform speed per second, allowing precise control of the global red packet delivery speed and keeping it within the range of normal processing by the red packet/payment system.

How to ensure safety?

The service requires that each user can receive at most3Red packets, and the most per sponsor1A. In a conventional implementation, you need to store the user’s collection record in the background, and calculate it every time you shake the request, and update the storage if you get a red envelope. However, it is dependent on storage, and the stability is difficult to be guaranteed under massive requests. We borrowed fromHTTPOf the agreementCOOKIEA similar mechanism is also implemented in the shake protocol: the background can write the user’s red envelope to receive the situationCOOKIE, stored by the client, the client in the next request to bring up, the server to judge. This can achieve the same goal without background storage.

It is always possible that the protocol can be broken, and malicious users can create automata to circumvent itCOOKIEDetection mechanism. soCOOKIEThe detection mechanism can only be used by the vast majority of users using normal clients. Malicious users can crack the protocol by storing red envelope delivery records on the machine.

Although the access service uses a long connection, malicious users can constantly reconnect to different servers to obtain red packets. In view of this situation, we designed a red packet distribution summary service, which can synchronize the users who have reached the maximum limit of receiving red packets on all access services.

After the above strategy, in turn, progressive, step by step deeper layer to solve the problem earlier, but in the design, the previous step does not depend on the step, even if the back steps to malfunction cannot as expected, also won’t influence the outcome of the front, in the worst cases, the system can guarantee the single-user red envelope number limit demand.

How to interact with the gala?

The Spring Festival Gala Shake system needs to interact with live programs of the Spring Festival Gala. When the Spring Festival Gala changes programs or the host performs oral broadcast of activities, the system needs to synchronize the change of activity configuration.

Here, the first requirement is fast, the second requirement is stable. So we designed a configuration system like this.

There is a configuration front desk, through which the field staff can send change instructions to the background. Backstage, there are two receiving points in Shanghai and Shenzhen, each of which consists of three redundant configuration services in different parks. After receiving the configuration, the configuration service synchronizes it to another destination and sends it to all access services in a file. To ensure successful synchronization, three redundant change channels RPC/RSYNC/ Change system were used to change the configuration simultaneously. The whole process can be completed within 10 seconds from the configuration front desk initiating changes to the access service loading configuration.

It looks fast and stable, isn’t that enough? Consider the following exception scenarios:

The front desk of the Spring Festival Gala cannot work or the network is faulty, and the instructions cannot be sent to the configuration service;
All configuration services cannot work or the network is faulty, and the configuration cannot be synchronized to the access service.

If these anomalies occur temporarily during the Spring Festival Gala, the impact will not be great under normal circumstances. We can change the configuration manually temporarily to win enough time to repair the system. However, if at this time the host is calling on everyone to pick up the phone to grab the red envelope, we are not enough to grasp that we can complete the configuration change of the whole system in a very short time, that is to say, it is very likely to fail, the key configuration can not be changed, the red envelope can not come out at all. There is a similar scenario, that is, after grabbing red packets starts, it cannot be finished due to configuration changes, and it cannot be switched back to other activities, which damages the user experience.

In view of the key point of snatching red packets in oral broadcasting, we adopt the following configuration change strategy: using the countdown configuration of snatching red packets, snatching red packets automatically starts after the point, and automatically ends after a preset time. In addition, due to the uncertainty of the oral broadcast time of the Spring Festival Gala program, we developed a change strategy to gradually correct the countdown during the program. Therefore, this key configuration is only related to time, machine time is relatively easy to guarantee, through NTP can achieve enough precision time synchronization, we also centralized monitoring of machine time, can find and deal with machine time abnormal machine time.

Problems with estimation

At this point,V0.1The three problems mentioned earlier in the prototype system have been solved,1000wan/Seconds of massive requests can be carried. However,1000wan/Seconds are estimated, if it eventually comes2000What about Wan?4000Wan? The numbers seem a little crazy, but they are1Hundreds of millions or even hundreds of millions of users are constantly shaking, but also possible.

For 20 million/SEC, we are not really worried, the access service is constantly optimized to provide disaster redundancy, but still has 25 million/SEC throughput capacity.

But 40 million/second is far beyond the background service capacity, how to do?

There is an overload protection in the way of mass service. If you can’t carry it hard, you can protect yourself. To put it simply, the front end protects the back end and the back end rejects the front end.

The client proactively reduces requests when service access is unavailable, service access times out, or service speed limiting occurs.

When the access service detects that a client requests too frequently, it automatically delays packet return, indirectly extending the request interval, and finally limiting the client request frequency within a reasonable range in the background. In addition, the access service calculates the CPU usage of the machine. When the CPU usage reaches different preset thresholds, the access service automatically returns rate limiting rates of different levels to the client.

With these measures, the system automatically degrades to protect itself when the volume of requests exceeds expectations.

To sum up, we can get the V0.5 beta architecture with a confidence index of 50.

V0.8 preview

What is the core experience?

Why does V0.5 beta only have a confidence score of 50? What else is missing?

On reflection, the main focus of the V0.5 beta is on how to handle the flood of shake requests so that users can happily shake out a red envelope. The V0.5 beta does not give this further thought.

Let’s take a look at the core experience of shaking — shaking a red envelope. Shaking a red envelope involves shaking a red envelope, opening a red envelope, sharing a red envelope, grabbing a red envelope with a friend, and so on.

A little analysis can be found that the first three steps are my operation, followed by friends. There is a delay between what I’m doing and what my friend is doing, which means that you can actually accept a degree of experience loss — delay.

V0.5 beta has solved the problem of shaking red envelopes, the rest is to open and share red envelopes.

How to ensure the user experience of opening and sharing red envelopes?

The opening and sharing of red packets can be abstracted. They are both composed of two parts: user operations (information flow) and the red packet processing logic in the background (business flow). For the user, what he cares about is whether the operation is successful, not the red envelope processing logic in the background. Therefore, the problem can be further simplified as: how to ensure that the user operation of opening and sharing the red envelope is successful?

We added an intermediate layer between user operations and red envelope processing logic. This middle layer includes simplified logic of red packets and asynchronous queue of red packets to realize the asynchronization and decoupling of information flow and business flow.

Red envelope simplified Logic

Through some algorithm design, the red envelope simplification logic can complete the preliminary judgment of whether the red envelope request is legitimate through local calculation, the legitimate request is put into the asynchronous queue, and then can return to the client processing success;

Red packet asynchronous queue

The red packet asynchronous queue stores requests from simplified logic. In storage, the asynchronous queue uses the three-machine scheme. If two copies are written from three machines, the queue is successfully joined. The processing logic of the asynchronous queue will fetch the request from the queue and continue to initiate the call to the red envelope system to complete the subsequent red envelope logic. In the event of a red envelope system failure, a retry policy is in place to ensure successful retries after the system is restored.

Access service, red packet simplified logic service and red packet asynchronous queue constitute the so-called “iron triangle”, which can protect the user experience in a long time when the red packet system has serious problems, and gain precious time to repair the fault.

On second thought, the “iron Triangle” can go even further: split the red packet simplification logic again into basic business logic and local passthrough queues. After the validity check of the red packet is complete, the service logic writes the request to the local file queue, and the local file queue transmits the request to the asynchronous red packet queue for subsequent processing. In this way, user operation results will not be affected even if unexpected jitter occurs in the asynchronous queue.

V0.8 preview version has been basically formed, the confidence index reached 70.

Formal edition V1.0

As we all know, design does not equal implementation. In order to ensure that the final implementation and online system operation meet the design expectation, we also do the following:

Full pressure test

Untested systems are dangerous, and there is no way to really know if the system is performing as expected under pressure. What are the limits of the system? To this end, we set up a pressure test environment, the system for continuous pressure test. After the system is deployed, several rounds of real pressure tests are performed on the live network.

Project CODE REVIEW

A multi-department joint project was madeCODE REVIEWCritical path is carefully evaluated at the code level.

Internal exercise

The client release process is longer, the full rollout takes longer, and it can’t go live as quickly as the serverfixThe problem. Therefore, in addition to testing, it is very necessary to do a real rehearsal before releasing the client.

Online preheating

In February 12, 2015 and February 15, 2015, we finally gained two opportunities to warm up and verified the system in actual combat.

	2015.2.12	2015.2.15
Total number of shakes(time)	310 million	40 million
Shake the peak(time/points)	50 million	17 million
Shake the peak(time/seconds)	1 million	430000
Red Envelope Rate(a/seconds)	50000	50000

Because of the preheating scale, the peak value of the shake did not reach the estimated value. However, the rate of red packets was maintained at the preset value of 50,000 / SEC, which fully tested the back-end “iron Triangle” and the red packet system behind it.

Review and adjustment

Every drill and warm-up is a rare opportunity, not only those abnormal modules, those seemingly normal module, but also through review and deduction, confirm whether they fully meet expectations, dig out possible hidden trouble, and then solve.

Finally, we released itV1.0The official version.

V1.0 is architecturally similar to V0.8 preview, but the confidence index has increased to 80.

We believe that 10% of the remaining 20% is reflected in the contingency plan and on-site response. With a system so large and complex, are there problems that haven’t been thought of, haven’t been exposed, haven’t been done enough, or can’t be solved with limited resources? Is it possible that these problems, if they arise, ultimately determine the whole situation? So the last 10 percent is just luck. Don’t have those problems.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

10 billion challenges: How to achieve a “sure” Spring Festival Gala shake system