From scratch

WeChat was officially released on January 21, 2011, roughly two months after the WeChat project was launched. In those two months WeChat grew from nothing, so what were the most important things the WeChat backend did during this period?

I think it’s the following three things:

1. Determining WeChat's message model

WeChat was originally positioned as a communication tool, and the core function of a communication tool is sending and receiving messages. The WeChat team grew out of the QQ Mailbox team, and WeChat's message model is very similar to the mailbox's mail model: both are store-and-forward.

Figure 1 WeChat message model

Figure 1 shows the message model: after a message is sent, it is temporarily stored on the backend; to let the receiver get the message more quickly, a notification is pushed to the receiver; finally, the client actively pulls the messages from the server.
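
The flow in Figure 1 can be sketched in a few lines of C++. This is only an illustration of the store-and-forward idea; the class and method names are invented and do not correspond to real WeChat backend interfaces.

    // Store-and-forward sketch of Figure 1: store temporarily, notify, let the
    // client pull. Purely illustrative.
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class MessageStore {
    public:
        // 1. The sender's message is temporarily stored on the backend...
        void Send(uint64_t from, uint64_t to, const std::string& body) {
            inbox_[to].push_back({from, body});
            Notify(to);  // 2. ...and a lightweight notification is pushed to the receiver.
        }
        // 3. The receiver's client actively pulls its pending messages. In the real
        // system the temporary copies are purged only after the next sync confirms
        // receipt (see the synchronization protocol below).
        std::vector<std::pair<uint64_t, std::string>> Pull(uint64_t uin) {
            auto& box = inbox_[uin];
            return {box.begin(), box.end()};
        }
    private:
        void Notify(uint64_t /*uin*/) { /* push "new message" over the long connection */ }
        std::map<uint64_t, std::deque<std::pair<uint64_t, std::string>>> inbox_;
    };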

2. Developing a data synchronization protocol

Since data such as user accounts, contacts, and messages are stored on the server, how to synchronize the data to the client becomes a critical issue. To simplify the protocol, we decided to synchronize all of the user’s basic data through a unified data synchronization protocol.

In the original scheme, the client kept a Snapshot of its local data; when it needed to synchronize, it sent the Snapshot to the server, the server computed the difference between the Snapshot and the server-side data and sent that difference to the client, and the client saved it to complete the synchronization. This scheme had two problems: first, the Snapshot grows as client data grows, creating high traffic overhead on every synchronization; second, the client has to compute the Snapshot on every synchronization, adding performance overhead and implementation complexity.

After several rounds of discussion, the scheme was changed so that the server computes the Snapshot and sends it to the client along with the data when the client synchronizes. The client does not need to understand the Snapshot; it only has to store it and bring it along on the next synchronization. The Snapshot itself was made very simple: a set of key-value pairs, where the Key denotes a type of data and the Value is the latest version number of that type of data already delivered to the client. There are three Keys: account data, contacts, and messages. An extra benefit of this protocol is that after a synchronization the client needs no additional ACK to confirm receipt, yet no data can be lost: as long as the client brings the latest Snapshot to the server for the next synchronization, the server can confirm that the previous synchronization completed successfully and perform follow-up work, such as clearing messages temporarily stored on the server.
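
A minimal sketch of this synchronization protocol, under the assumption that the Snapshot is simply a map from data type to the latest delivered version number; the type and function names are illustrative, not the real protocol.

    // A sketch of the snapshot-based sync protocol (illustrative names only).
    #include <cstdint>
    #include <initializer_list>
    #include <map>
    #include <string>
    #include <vector>

    enum DataType { kAccountData, kContact, kMessage };

    // The Snapshot is just "data type -> latest version delivered to the client".
    // The client stores it opaquely and echoes it back on the next sync.
    using Snapshot = std::map<DataType, uint64_t>;

    struct SyncResponse {
        std::vector<std::string> new_items;  // data newer than the client's snapshot
        Snapshot new_snapshot;               // computed by the server, stored by the client
    };

    class SyncServer {
    public:
        SyncResponse Sync(uint64_t uin, const Snapshot& client_snapshot) {
            SyncResponse rsp;
            for (DataType type : {kAccountData, kContact, kMessage}) {
                uint64_t acked = Version(client_snapshot, type);
                // The echoed snapshot doubles as an implicit ACK: everything up to
                // `acked` was received, so temporarily stored copies can be purged.
                PurgeUpTo(uin, type, acked);
                // Send everything newer than what the client already has.
                AppendNewerThan(uin, type, acked, &rsp.new_items);
                rsp.new_snapshot[type] = LatestVersion(uin, type);
            }
            return rsp;
        }
    private:
        static uint64_t Version(const Snapshot& s, DataType t) {
            auto it = s.find(t);
            return it == s.end() ? 0 : it->second;
        }
        // Stubs standing in for the real storage operations.
        void PurgeUpTo(uint64_t, DataType, uint64_t) {}
        void AppendNewerThan(uint64_t, DataType, uint64_t, std::vector<std::string>*) {}
        uint64_t LatestVersion(uint64_t, DataType) { return 0; }
    };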

Since then, streamlining schemes, reducing traffic overhead, pushing complex business logic to the server where possible, and keeping the client implementation simple have remained important guiding principles that continue to shape WeChat's design and development. One classic case I remember: we implemented group chat in WeChat 1.2, but to preserve the group-chat experience across old and new clients, we made version 1.0 clients able to take part in group chats through server-side adaptation.

3. Finalizing the backend architecture

Figure 2 WeChat backend system architecture

The WeChat backend uses three layers: the access layer, the logic layer, and the storage layer.

  • The access layer provides access services, including a long-connection access service and a short-connection access service. The long-connection service supports both client-initiated requests and server-initiated pushes; the short-connection service supports only client-initiated requests.
  • The logic layer includes business logic services and base logic services. Business logic services encapsulate the business logic and are the APIs the backend provides for the WeChat client to call. Base logic services abstract lower-level, generic business logic and offer it to the business logic services.
  • The storage layer includes data access services and data storage services. Data storage services persist user data on underlying storage systems such as MySQL and SDB, a Key-Table data storage system widely used in the early backend. Data access services adapt and route data access requests to the appropriate underlying data storage services, providing structured data services to the logic layer. Notably, the WeChat backend uses separate data access services and data storage services for each type of data: accounts, messages, contacts, and so on are all independent.

The WeChat backend is written mainly in C++. Backend services are built on the Svrkit framework and communicate with each other through synchronous RPC.

Figure 3 Svrkit framework

Svrkit is another high-performance RPC framework that is now widely used in the backend. It was not widely used at the time, but it has shone brilliantly in the WeChat backend. As the most important part of WeChat's backend infrastructure, Svrkit has kept evolving over the years. We use Svrkit to build thousands of service modules providing tens of thousands of service interfaces, with hundreds of trillions of RPC calls per day.

These three things had a profound impact. Even today, five years later, we still use the same architecture and protocols, which can even still serve the original 1.0 WeChat client.

There’s a lesson here — operational support systems really matter. The first version of wechat background was completed in a hurry, only the basic business functions were completed, and there was no supporting business data statistics and so on. After we registered in open, at that time had no business monitoring pages and data curve we can see, the number of registered users is the preliminary statistics from a database, online number has been extracted from the log, the data by running a script every hour (this script is temporary and that day), then automatically send mail to the email group. There are other kinds of business data are also released through email, it can be said that email is the most important data portal in the early stage of wechat.

On January 21, 2011, the peak number of concurrent online users was 491, compared with 400 million today.

Small steps, slow jog

In the four-plus months after WeChat's release, we experienced the delight of a registration rush right after launch, and then the confusion of the lukewarm period that followed.

During this period, WeChat shipped many features aimed at increasing users' friend counts and getting users chatting: integration with Tencent Weibo private messages, group chat, work email, QQ/email friend recommendations, and so on. The most important backend change was that these features created the need for asynchronous queues. For example, Weibo private messages have to interoperate with external departments, and different systems process at different speeds, so a queue can act as a buffer; group chat is a time-consuming operation, so after a message is sent to a group, the writes can be done asynchronously through an asynchronous queue.

Figure 4 Message sending flow for single chat and group chat

Figure 4 shows the use of the asynchronous queue in group chat. WeChat group chat uses write diffusion, that is, a message sent to a group is saved once for every member of the group (as a message index); a sketch of this path follows the list below. Why not read diffusion? There are two reasons:

  • Groups are small: the cap on group size was 10 at first (gradually raised to 20, 40, 100, and now 500), so the cost of write diffusion is not too high. This is unlike Weibo, where an account can have thousands of fans; saving one copy per fan for every post would be both inefficient and far more expensive in storage.
  • Once a message has been diffused into everyone's message store (message inbox), a receiver only needs to check their own inbox when synchronizing data with the backend. The synchronization logic is the same as for single-chat messages, which unifies the data synchronization path and keeps the implementation lightweight.
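
Here is the sketch referred to above: a rough illustration of write diffusion through an asynchronous queue. The queue is shown as an in-process helper purely for brevity; in reality it is a separate backend component, and all names are invented.

    // Group-chat write diffusion via an asynchronous queue (Figure 4), sketched.
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <map>
    #include <vector>

    struct GroupMessage {
        uint64_t group_id;
        uint64_t sender;
        uint64_t msg_id;
    };

    class AsyncQueue {
    public:
        void Enqueue(std::function<void()> task) { tasks_.push_back(std::move(task)); }
        void DrainOnce() {                       // normally run by background workers
            while (!tasks_.empty()) { tasks_.front()(); tasks_.pop_front(); }
        }
    private:
        std::deque<std::function<void()>> tasks_;
    };

    class GroupChat {
    public:
        explicit GroupChat(AsyncQueue* q) : queue_(q) {}
        // Sending returns quickly; fan-out into every member's inbox happens async.
        void Send(const GroupMessage& msg, const std::vector<uint64_t>& members) {
            queue_->Enqueue([this, msg, members] {
                for (uint64_t uin : members) {
                    if (uin != msg.sender) inbox_[uin].push_back(msg.msg_id);
                }
            });
        }
        // The receiver only ever reads its own inbox, exactly as for single chat,
        // so the data-sync path stays unified.
        const std::vector<uint64_t>& Inbox(uint64_t uin) { return inbox_[uin]; }
    private:
        AsyncQueue* queue_;
        std::map<uint64_t, std::vector<uint64_t>> inbox_;  // per-user message index
    };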

As an important mode of data interaction in the backend, the asynchronous queue has become a powerful complement to synchronous RPC calls and is widely used across the WeChat backend.

The rapid growth

Wechat’s meteoric rise began with version 2.0, which released voice chat. After that, the number of wechat users increased rapidly, exceeding 1 million in 2011.5, 10 million in 2011.7, and 100 million registered users in 2012.3.

Along with the happy results came a pile of happy troubles.

  • Pressure from rapid business iteration

    WeChat started out very simple, with sending messages as its main function. But in the few versions after voice messaging it quickly added phone contacts, QQ offline messages, People Nearby, Shake, Message in a Bottle, Moments, and more.

    There is a widely circulated legend about the development of Moments: it took four months and more than 30 iterations before it finally took shape. There is also a little-known story: because of a staff shortage, for a long time the Moments backend had only one developer.

  • Demands on backend stability

    More users and more features meant the number of backend modules and machines kept doubling, and all kinds of failures followed.

What helped us get through this phase was the following:

1. Minimalist design

Despite the variety of requirements, we implemented each solution meticulously. The hardest part of meeting a requirement is not designing and implementing a scheme, but choosing the simplest, most practical one from several possible schemes.

This often takes several rounds of thinking, discussing, and overturning. Planning before acting has many advantages: it avoids flashy over-design and improves efficiency, and a seemingly simple scheme reached through thorough discussion is usually the most reliable one, with its details carefully worked out.

2. Build the big system out of small parts

At first, the business logic services of the logic layer consisted of a single service module (which we call MMWeb), containing all the APIs provided for client access and even the complete official WeChat website. The module's architecture resembles Apache's: a CGI container (CGIHost) plus a number of CGIs (each CGI being one API), the difference being that each CGI is a dynamic library (.so) dynamically loaded by CGIHost.

When the number of CGIs in MMWeb was small, this architecture was perfectly adequate, but as feature iteration accelerated and the number of CGIs grew, problems began to appear:

1) Every CGI is a dynamic library. When the interface definition of logic shared by several CGIs changes, CGIs deployed at different times may be using different versions of the interface definition, which leads to strange results or process crashes at runtime and is very hard to track down;

2) With all CGIs bundled together, every major release goes through a very long pipeline from testing to gray release to full deployment, and almost all backend developers get stuck at this stage at the same time, which badly hurts efficiency;

3) Newly added, less important CGIs are sometimes less stable, and a crash in some exception branch can take down the CGIHost process, so that important CGIs such as message sending are affected and cannot run.

So we started experimenting with a new CGI architecture, Logicsvr.

Logicsvr is based on the Svrkit framework: the Svrkit framework and the CGI logic are statically compiled together to produce a Logicsvr that can be accessed directly over HTTP. We split the MMWeb module into eight different service modules. The splitting principle was: CGIs implementing different business functions go into different Logicsvrs, and CGIs with the same function but different importance are also split apart. For example, the core message send/receive logic was split into three service modules: message synchronization, sending and receiving text and voice messages, and sending and receiving picture and video messages.

Each Logicsvr is a separate binary that can be deployed and released independently. Today the WeChat backend has dozens of Logicsvrs providing hundreds of CGI services, deployed on thousands of servers and receiving hundreds of billions of client requests every day.

Beyond the API services, the other backend service modules also follow the practical principle of building the big system out of small parts. The number of WeChat backend service modules has grown rapidly from about 10 at launch to hundreds.

3. Service monitoring

During this period there were many backend failures. More troublesome than the failures themselves was that, for lack of monitoring, some failures could not be detected promptly, which magnified their impact.

On the one hand, monitoring was lacking because, under rapid iteration, feature development was prioritized over business monitoring; on the other hand, the infrastructure's support for business-logic monitoring was weak. The infrastructure provided machine resource monitoring and Svrkit service status monitoring as standard, per machine and per service, with no extra development needed, but monitoring the business logic was much trickier. At that time, business-logic monitoring went through the business-logic statistics facility, and setting up one monitor took four steps:

1) Apply for log reporting resources;

2) Add log-reporting points in the business logic; the logs are collected by the agent on each machine and uploaded to the statistics center;

3) Develop the statistics code;

4) Implement the statistical monitoring page.

As you can imagine, this time-consuming, labor-intensive process discouraged developers from adding business monitoring. So one day we went to learn from the company's benchmark, the QQ backend, and found their approach surprisingly simple and powerful:

1) Failure report

Previously, after each failure, QA led the writing of a failure report focused on assessing the failure's impact and grading it. The new approach is that for every failure, big or small, the developers thoroughly review the failure process, agree on a solution, and produce a detailed technical report. This report focuses on how to prevent the same type of failure from recurring, how to improve the ability to detect failures proactively, and how to shorten failure response and handling.

2) A business-agnostic monitoring and alarm system based on ID-Value

Figure 5 ID-Value based monitoring and alarm system

The idea behind the monitoring system is very simple: it provides two APIs that let business code set, or add to, the Value of a monitoring ID in shared memory. The agent on each machine periodically reports all ID-Value pairs to the monitoring center, which aggregates and stores the data, renders monitoring curves on a unified monitoring page, and raises alarms according to pre-configured monitoring rules.
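
A minimal sketch of this interface, using a process-local table in place of shared memory and made-up API and ID names; a real implementation would keep the table in shared memory with atomic operations so the external agent can read it.

    // ID-Value monitoring sketch: business code bumps counters by numeric ID,
    // an agent periodically ships all ID-Value pairs to the monitoring center.
    #include <cstdint>
    #include <cstdio>
    #include <map>

    namespace monitor {

    // Stand-in for the shared-memory table read by the per-machine agent.
    static std::map<uint32_t, int64_t> g_table;

    inline void AddValue(uint32_t id, int64_t delta) { g_table[id] += delta; }
    inline void SetValue(uint32_t id, int64_t value) { g_table[id] = value; }

    // Periodically invoked by the agent to report everything upstream.
    inline void ReportAll() {
        for (const auto& kv : g_table)
            std::printf("id=%u value=%lld\n", kv.first, (long long)kv.second);
    }

    }  // namespace monitor

    int main() {
        const uint32_t kMsgSendOk = 1001, kMsgSendFail = 1002;  // hypothetical IDs
        monitor::AddValue(kMsgSendOk, 1);   // call sites sprinkled through business code
        monitor::AddValue(kMsgSendFail, 0);
        monitor::ReportAll();
    }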

For business code, it is enough to call the monitoring API at the points in the business flow that need monitoring and to configure the alarm conditions. This greatly reduced the cost of building monitoring and alarms, and we quickly filled in all kinds of monitoring items so that problems could be detected promptly and proactively. New features now also get monitoring items built in beforehand, so that during the gray-release phases we can tell directly from the monitoring curves whether the business is behaving as expected.

4. KVSvr

Each type of data in the WeChat backend has its own storage service, independent of the others. Each storage service consists of a service access module and an underlying storage module. The service access layer isolates the business logic layer from the underlying storage and provides RPC-based data access interfaces; the underlying storage comes in two kinds, SDB and MySQL.

SDB is suitable for data keyed by the user UIN (uint32_t), such as message indexes and contacts. For reliability it provides a Master-Slave mode based on asynchronous log replication: when the Master fails, the Slave can still serve reads but cannot accept new writes.

Because a WeChat account is a combination of letters and digits, it cannot be used directly as an SDB Key, so account data is stored in MySQL rather than SDB. MySQL likewise uses a Master-Slave mode with asynchronous replication.

Version 1 of the account storage service used a single Master-Slave pair. The Master served reads and writes; the Slave provided no service and was used only as a backup, so when the Master failed, writes were unavailable. To improve access efficiency, memcached was added to the service access module as a cache, reducing accesses to the underlying storage.

In version 2 the account storage service was still a single Master-Slave pair; the difference was that the Slave could serve reads, though it might return stale data. Business logic that needs strong consistency, such as registration and login, therefore accessed only the Master. When the Master failed, only reads were available, not writes.

Version 3 of the account storage service used one Master with multiple Slaves, solving horizontal scaling of the read service.

Version 4 used multiple Master-Slave groups for the account service's underlying storage, each group consisting of one Master and multiple Slaves, solving horizontal scaling when write capacity became insufficient.

In the end one problem remained unsolved: within a single Master-Slave group, the Master was still a single point that could not fail over writes in real time, so the single point of failure was never eliminated. In addition, the replication lag between Master and Slave greatly affected the read service; a large lag could cause business failures. We therefore went looking for an underlying storage solution with high performance, horizontal scaling for both reads and writes, no single point of failure, read/write failover, and strong consistency guarantees, and KVSvr was born.

KVSvr uses a Quorum-based distributed data consistency algorithm to provide storage services with Key-Value/Key-Table models. Traditional Quorum algorithms do not perform well, so KVSvr creatively separates the data version from the data itself: the Quorum algorithm is applied only to negotiating the data version, while the data itself is replicated asynchronously in a pipelined fashion, which yields both strong consistency and very high write performance. KVSvr also has a built-in data cache that provides efficient reads.
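
To give a feel for quorum-based versioning, here is a toy majority-quorum key-value store. It is emphatically not the KVSvr algorithm: KVSvr runs the quorum protocol only on data versions and ships the data itself through pipelined asynchronous replication, and it handles many failure cases that a sketch like this ignores.

    // Toy majority-quorum KV store, for flavor only (not KVSvr, not production).
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    struct Versioned { uint64_t version = 0; std::string value; };

    struct Replica {
        std::map<std::string, Versioned> data;
        bool Accept(const std::string& k, const Versioned& v) {
            if (v.version <= data[k].version) return false;  // reject stale writes
            data[k] = v;
            return true;
        }
    };

    class ToyQuorumKV {
    public:
        explicit ToyQuorumKV(size_t replicas = 3) : replicas_(replicas) {}

        bool Put(const std::string& k, const std::string& value) {
            uint64_t next = HighestVersion(k) + 1;
            size_t acks = 0;
            for (auto& r : replicas_)
                if (r.Accept(k, {next, value})) ++acks;
            return acks * 2 > replicas_.size();     // success only with a majority
        }

        std::optional<std::string> Get(const std::string& k) {
            // Read all replicas and take the highest version seen; in practice a
            // majority of replies would be enough.
            Versioned best;
            for (auto& r : replicas_) {
                auto it = r.data.find(k);
                if (it != r.data.end() && it->second.version > best.version)
                    best = it->second;
            }
            if (best.version == 0) return std::nullopt;
            return best.value;
        }

    private:
        uint64_t HighestVersion(const std::string& k) {
            uint64_t v = 0;
            for (auto& r : replicas_) {
                auto it = r.data.find(k);
                if (it != r.data.end()) v = std::max(v, it->second.version);
            }
            return v;
        }
        std::vector<Replica> replicas_;
    };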

KVSvr gave us the disaster-recovery capability for single points of failure that we desperately needed at the time. Besides being used in version 5 of the account service, all SDB-based underlying storage modules and most MySQL-based underlying storage modules soon switched to KVSvr as well. As the business evolved, KVSvr kept evolving and customized versions were derived to meet business needs; it still plays an important role today as the core storage.

Platformization

In August 2011 Shenzhen hosted the Universiade, and WeChat launched the service account "WeChat Shenzhen Universiade Volunteer Service Center". WeChat users could search for "szdy" and add the account as a friend to get information about the games. The backend applied special handling to "szdy": when a user searched for it, one of the ten WeChat IDs "szdy01" through "szdy10" was returned at random, and behind each of those WeChat IDs was a volunteer providing the service.

In September 2011, "Wechengdu" came onto the WeChat platform: WeChat users could search for "Wechengdu" and add it as a friend, and Chengdu residents could also see the account in "People Nearby". We built some special logic for this account in the backend so that it could automatically reply to messages sent by users.

As this kind of demand grew, we began building a media platform. It was later separated from the WeChat backend and evolved into the WeChat public platform, which grew independently and started WeChat's platformization. Besides the public platform, a series of other platforms, such as the WeChat Pay platform and the hardware platform, have also emerged around the WeChat backend.

Figure 6 WeChat platforms

Going global

Wechat’s attempt to go abroad began with version 3.0. From this version, wechat gradually supports traditional Chinese, English and other languages. The real signature, though, was the opening of the first overseas data center.

1. Overseas data centers

The overseas data center is positioned as an autonomous system, meaning it is functionally complete and can run independently of the domestic data center.

1) Multi-data center architecture

Figure 7 Multi-data center architecture

For the stateless access layer and logic layer, autonomy is simple: it is enough to deploy a full set of all service modules in the overseas data center.

The storage layer is the big problem: we need the domestic and overseas data centers to be able to run independently, yet they must not become two separately deployed systems that cannot interoperate; they have to be one system whose business functions are fully interconnected. Our task is therefore to keep the data of the two data centers consistent, and a master-master architecture is mandatory, i.e. both data centers must accept writes.

2) Master-master storage architecture

Master-master data consistency is a hard problem. Given the high-latency network between the two data centers, seemingly ready-made solutions such as running the Paxos algorithm directly across data centers, or deploying the Quorum-based KVSvr directly across them, are not applicable.

We chose a solution similar to Yahoo!'s PNUTS system, which requires the user set to be sharded: domestic users have the Shanghai data center as their Master, and all writes of their data must go back to the domestic data center; overseas users can only write their data in the overseas data center. The storage architecture is master-master overall, but the data of any particular user is master-slave: each piece of data can be written only in the data center the user belongs to and is then asynchronously replicated to the other data centers.

Figure 8 Master-master data architecture across multiple data centers
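
The routing rule can be summarized in a small sketch, assuming a hypothetical HomeDcOf lookup and a replication queue (both invented names): writes are accepted only in the user's home data center and replicated asynchronously elsewhere.

    // Sketch of "master-master across DCs, master-slave per user".
    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <string>

    enum class Dc { kShanghai, kHongKong, kCanada };

    // Hypothetical helpers; in reality the home DC is fixed per user and the
    // replication path is the cross-DC sync queue described further below.
    Dc HomeDcOf(uint64_t uin) { return uin % 2 == 0 ? Dc::kShanghai : Dc::kHongKong; }
    void ApplyLocally(uint64_t uin, const std::string& m) {
        std::printf("apply for %llu: %s\n", (unsigned long long)uin, m.c_str());
    }
    void EnqueueForReplication(Dc, uint64_t, const std::string&) { /* async, eventual */ }

    void Write(Dc local_dc, uint64_t uin, const std::string& mutation) {
        if (HomeDcOf(uin) != local_dc) {
            // A roaming user is redirected back to the data center that owns their
            // data, so each user's own data stays strongly consistent in one DC.
            throw std::runtime_error("redirect client to its home data center");
        }
        ApplyLocally(uin, mutation);
        // Other DCs see the change eventually; that is acceptable because only
        // *other* users read it there (e.g. noticing you joined a group a bit late).
        EnqueueForReplication(local_dc, uin, mutation);
    }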

3) Data consistency among data centers

This master-master architecture only gives eventual consistency across data centers. How do we make sure the business logic is not hurt by such a weak consistency guarantee?

This problem can be broken down into two sub-problems:

  • Users access their own data

    Users travel all over the world, and letting them access a nearby data center has a big impact on the business architecture. To allow nearby access while keeping data consistency from affecting the business, either the Master of a user's data must be able to move dynamically, or every piece of business logic must be combed through to strictly distinguish requests from users of the local data center from cross-data-center requests and route each request to the right data center for processing.

    Considering the implementation and maintenance complexity this would cause, we restricted each user to accessing their home data center. If a user roams, the data center they roam into automatically redirects them back to their home data center.

    This solves the consistency problem for users accessing their own data, because all such operations are confined to the home data center, inside which the data is strongly consistent. As an added bonus, a user's own data (such as messages and contacts) does not need to be synchronized across data centers, which greatly reduces the bandwidth required for cross-data-center synchronization.

  • Users access data of other users

    Because the business in different data centers needs to interconnect, users also consume data created by users in other data centers, for example joining a group chat created by a user of another data center, or viewing the Moments of friends who belong to another data center.

    On careful analysis, most of these scenarios turn out not to need strong consistency. It is not a problem if a user notices a little late that they have been added to a group created in another data center, or sees a friend's Moments post a little later. In these scenarios, the business logic simply reads the data held in the local data center.

    Of course, a few scenarios do require strong consistency, for example setting one's WeChat ID, which must be unique across the entire WeChat account system. For this we provide a globally unique WeChat ID allocation service, and every data center applies to it for WeChat IDs. Scenarios that need this kind of special handling are rare and do not pose much of a problem.

4) Reliable data synchronization

A large amount of data has to be synchronized between data centers, and data consistency depends on the reliability of that synchronization. To guarantee reliability and improve availability, we developed a queue component based on a Quorum algorithm, each group of which consists of three storage services. Unlike an ordinary queue, this component greatly simplifies the write path: the three storage machines never communicate with each other, data is written sequentially on each machine, and if a write fails the writer simply retries on another group. This queue therefore achieves extremely high availability and write performance. Each data center writes the data to be synchronized into its local synchronization queues, and the data replay services of the other data centers pull the data and replay it, achieving cross-data-center synchronization.
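
A rough sketch of this queue's write path, under the stated design (three independent, non-communicating stores per group, majority acknowledgement, fall back to another group on failure); all names are illustrative.

    // Cross-DC sync queue sketch: append to 2-of-3 stores, else try another group.
    #include <string>
    #include <vector>

    struct LogStore {
        bool healthy = true;
        std::vector<std::string> log;            // strictly append-only, in order
        bool Append(const std::string& item) {
            if (!healthy) return false;
            log.push_back(item);
            return true;
        }
    };

    struct QueueGroup {
        LogStore stores[3];
        bool Write(const std::string& item) {
            int acks = 0;
            for (auto& s : stores) acks += s.Append(item) ? 1 : 0;
            return acks >= 2;                    // quorum of the three stores
        }
    };

    class SyncQueue {
    public:
        explicit SyncQueue(size_t groups) : groups_(groups) {}
        bool Write(const std::string& item) {
            for (auto& g : groups_)              // if one group fails, try another
                if (g.Write(item)) return true;
            return false;
        }
    private:
        std::vector<QueueGroup> groups_;
        // A replay service in the peer data center reads these logs back in order
        // and re-applies the mutations to achieve cross-DC data synchronization.
    };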

2. Network acceleration

Building overseas data centers takes a long time and a large investment, and WeChat has only two of them, in Hong Kong and Canada. But the world is large, and even these two data centers cannot cover it all and give users in every corner a smooth experience.

Real-world comparison tests overseas showed a big gap between the WeChat client and its main competitors in key scenarios such as sending messages. To close it, we worked with sister departments such as the Architecture Platform Department, the Network Platform Department, and the International Business Department to build dozens of POP points (including signaling acceleration points and a picture CDN) around the overseas data centers. We also analyzed mobile networks in depth and heavily optimized WeChat's communication protocol. In the comparison tests WeChat eventually caught up with and surpassed the major competitors.

Intensive cultivation

1. Three-campus disaster recovery

WeChat suffered its largest-ever outage on July 22, 2013, when messaging and Moments were disrupted for five hours and message volume dropped by half. The failure was caused by a major fiber cut at one campus of the Shanghai data center: nearly 2,000 servers became unreachable, paralyzing the entire Shanghai data center (the only one in China at the time).

During the failure we tried to shift users away from the failed campus, with little success. Although the hundreds of online modules were all designed with disaster-recovery redundancy, and no individual service module appeared to have a single point of failure, taken as a whole the countless service instances were scattered across more than 8,000 servers in the data center's machine rooms, the RPC calls between services formed a complicated mesh, and there was no system-level disaster-recovery planning or verification, so the failure could not be recovered from quickly. Until then we had only known that a single failure of a single service would not affect the system; nobody knew what would happen to the whole system when 2,000 servers went down at the same time.

In fact, we had run into this problem three months before the outage. An abnormal switch on the Shanghai data center's internal network triggered an unexpected WeChat glitch in which sending and receiving messages was almost completely unavailable for 13 minutes. When analyzing that failure, we found that a core module of the messaging system had its three mutually redundant service instances all deployed in the same machine room; a switch failure in that room made the service unavailable and message volume dropped to zero. The module was one of the earliest core modules (from the days when the WeChat backend was small and most backend services were deployed in a single data campus), and with its three-machine redundancy design it had run reliably year after year, so everyone had completely overlooked the problem.

To solve this class of problem, three-campus disaster recovery was born. The goal is to distribute the Shanghai data center's services evenly across three physically isolated data campuses, so that WeChat can still provide lossless service even if any single campus fails entirely.

1) Simultaneous service

The traditional data-center-level disaster recovery solution is "two sites, three centers": two data centers in the same city backing each other up, plus a disaster recovery center in a different city, with perhaps only one of the three actually serving online traffic in normal times. The main problems are that the disaster recovery center carries no real business traffic, so when the primary data center fails the switchover may well not work, and that a large amount of backup capacity sits idle in normal times, wasting resources.

The core of three-campus disaster recovery is that all three data campuses serve traffic at the same time, so if one campus fails, the traffic of the other two rises by only 50%. On the other hand, each campus must run at no more than 2/3 of its capacity and keep 1/3 in reserve to provide lossless disaster recovery, although the servers in a traditional "two sites, three centers" setup sit far more idle. And because all three campuses serve external traffic in normal times, the question to answer during a failure is "how do we shift the traffic to the other data campuses?" rather than "can we shift the traffic to another data campus at all?". The former is clearly the easier problem.
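
The capacity arithmetic behind these numbers is easy to check; a short derivation, assuming the three campuses share the total load equally:

    % Total load T, three campuses, each at utilization u of capacity C.
    \[
      \text{normal load per campus} = \frac{T}{3}, \qquad
      \text{after one campus fails: } \frac{T}{2} = 1.5 \cdot \frac{T}{3}
      \ (\text{a } 50\% \text{ increase}).
    \]
    \[
      \text{Lossless absorption requires } 1.5\,u\,C \le C
      \;\Longrightarrow\; u \le \tfrac{2}{3},
    \]
    % i.e. run each campus at no more than two-thirds of capacity and keep
    % one-third in reserve.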

2) Strong consistency of data

The key to three-campus disaster recovery is that the storage modules distribute data evenly across the three data campuses, with at least two consistent copies of every piece of data in different campuses. That way, if any one campus is destroyed, lossless service can continue without interruption. Since most backend storage modules already use KVSvr, the solution is simple and efficient: deploy each KVSvr group's machines evenly across the three campuses.

3) Automatic switchover in case of failure

Another hard part of three-campus disaster recovery is automatically shielding and switching away from failed services: a business logic service module must be able to accurately recognize that certain downstream service instances can no longer be reached, and quickly cut over to other instances, to avoid being dragged down. We wanted every business logic service to decide for itself, without external auxiliary information (such as a central node that distributes the health status of every service), to quickly shield problematic service instances and automatically spread its traffic across the remaining instances. On top of that, we built a manually operated global shielding system, so that in a large-scale network failure a human can intervene to shield all the machines in one campus and quickly shift the traffic to the other two data campuses.
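
One simple way to realize this kind of self-contained shielding is a per-caller failure counter over downstream instances, as in the sketch below; the real Svrkit routing and thresholds are certainly more sophisticated, and all names here are invented.

    // Per-caller shielding sketch: each business logic service tracks its own
    // downstream instances and stops routing to ones that keep failing.
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    struct InstanceStats {
        uint32_t recent_failures = 0;
        bool Shielded() const { return recent_failures >= 3; }   // assumed threshold
    };

    class DownstreamRouter {
    public:
        explicit DownstreamRouter(size_t instances) : stats_(instances) {}

        // Pick any instance that is not currently shielded.
        int Pick() {
            for (size_t tries = 0; tries < stats_.size(); ++tries) {
                int i = std::rand() % static_cast<int>(stats_.size());
                if (!stats_[i].Shielded()) return i;
            }
            return -1;                          // every instance looks bad
        }
        void ReportSuccess(int i) { stats_[i].recent_failures = 0; }
        void ReportFailure(int i) { ++stats_[i].recent_failures; }

    private:
        std::vector<InstanceStats> stats_;
    };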

4) Test the disaster recovery effect

Whether three-campus disaster recovery actually works still has to be verified. After completing its construction for the Shanghai data center and the overseas Hong Kong data center, we ran several live drills, shielding thousands of services in a single campus at once to check whether the disaster recovery behaved as expected. In particular, to prevent a core service module from quietly losing three-campus disaster recovery support as it is updated over time, we also built a continuous checking system: every day it actively shields the instances of every service module in one chosen campus and automatically checks how the overall failure counts change, providing continuous verification of the three-campus disaster recovery capability.

2. Performance optimization

Previously, while the business was growing fast, we prioritized supporting rapid iteration of business features and had no time for performance problems, extensively taking the brute-force approach of "make it work first, optimize later". From 2014, as drastically cutting operating costs became a goal, performance optimization was put on the agenda.

We went back over the design and implementation of most service modules and made targeted optimizations, which saved a lot of machine resources. But the more effective optimizations were to the infrastructure, specifically to the Svrkit framework. Svrkit is used by almost every service module, so if the framework itself can squeeze the most out of the machines, the effort pays off many times over.

And it did. We added coroutine support to the infrastructure. The key point was that the coroutine component could be used without changing the structure of the existing business logic code, so code written with synchronous RPC calls could be made asynchronous directly, without any modification. The Svrkit framework integrated this coroutine component, and then the nice thing happened: a service instance that used to handle at most a few hundred concurrent requests could suddenly handle thousands once the coroutine-enabled build went live. The underlying implementation of Svrkit was also completely rewritten during this period, greatly increasing service throughput.
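
To make the idea concrete, here is a bare-bones user-level coroutine built on ucontext, showing how an RPC call that looks blocking to the business code can yield to the framework and be resumed when the response arrives. This only illustrates the general mechanism behind such coroutine libraries, not the actual Svrkit coroutine component.

    // Minimal user-level coroutine: "synchronous" business code suspends inside
    // the RPC call and the scheduler resumes it later. Linux ucontext APIs.
    #include <cstdio>
    #include <ucontext.h>

    static ucontext_t g_sched_ctx, g_task_ctx;
    static char g_stack[64 * 1024];

    // Looks like a blocking RPC to the business code, but actually just parks the
    // coroutine and hands control back to the framework's event loop.
    static void RpcCall() {
        std::printf("  [task] request sent, yielding while waiting for response\n");
        swapcontext(&g_task_ctx, &g_sched_ctx);      // suspend here
        std::printf("  [task] response arrived, resuming business logic\n");
    }

    static void Handler() {                          // unchanged "synchronous" code
        std::printf("  [task] handler start\n");
        RpcCall();
        std::printf("  [task] handler done\n");
    }

    int main() {
        getcontext(&g_task_ctx);
        g_task_ctx.uc_stack.ss_sp = g_stack;
        g_task_ctx.uc_stack.ss_size = sizeof(g_stack);
        g_task_ctx.uc_link = &g_sched_ctx;           // return here when Handler ends
        makecontext(&g_task_ctx, Handler, 0);

        std::printf("[sched] start handler\n");
        swapcontext(&g_sched_ctx, &g_task_ctx);      // run until the RPC yields
        std::printf("[sched] handler parked; event loop would await the reply here\n");
        swapcontext(&g_sched_ctx, &g_task_ctx);      // deliver the reply: resume handler
        std::printf("[sched] handler finished\n");
    }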

3. The avalanche

We have never worried much about a single service instance failing outright and becoming completely unable to serve; the backend's disaster-recovery mechanisms handle that well. What worries us most is an avalanche: a service becomes overloaded for some reason, its request processing time stretches out, its throughput drops, and large numbers of requests sit in its queue so long that the upstream services calling it time out. To make things worse, upstream services usually retry, so the overloaded service spends what little capacity it has left on requests whose callers have already given up by the time the results are returned, and finally it collapses completely. In the worst case, the per-request time of the upstream services grows so long that the avalanche spreads up the RPC call chain, and the overload of a single service module triggers an avalanche across many service modules.

As we tightened our belts to conserve machine resources and retire low-load machines, service overload became a regular occurrence. The powerful weapon against it is the FastReject mechanism with QoS guarantees in the Svrkit framework: it quickly rejects requests that exceed the service's own processing capacity, so that even when overloaded the service keeps producing useful output stably.
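
A common way to implement this kind of fast rejection is to drop requests whose time in queue already exceeds what the caller will wait for; a simplified sketch along those lines (not the actual FastReject implementation, and the timeout value is assumed):

    // Overload protection sketch: reject work the caller has already given up on,
    // so remaining capacity goes to requests that can still succeed.
    #include <chrono>
    #include <cstdio>

    using Clock = std::chrono::steady_clock;

    struct Request {
        Clock::time_point enqueue_time;
        std::chrono::milliseconds caller_timeout{800};   // assumed caller deadline
    };

    bool ShouldFastReject(const Request& req) {
        auto waited = Clock::now() - req.enqueue_time;
        // By the time we finished, the upstream caller would already have timed
        // out (and probably retried), so processing this request is wasted work.
        return waited >= req.caller_timeout;
    }

    int main() {
        Request fresh{Clock::now()};
        Request stale{Clock::now() - std::chrono::seconds(2)};
        std::printf("fresh rejected? %d\n", ShouldFastReject(fresh));   // 0
        std::printf("stale rejected? %d\n", ShouldFastReject(stale));   // 1
    }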

4. Security hardening

In recent years Internet security incidents have occurred frequently and database-dumping attacks keep emerging. To protect users' private data, we built a data protection system: the full-link ticket system. Its core idea is that after a user logs in, the backend issues a ticket to the client; the client carries the ticket on every request; and along the entire request chain in the backend, every service that accesses core data checks whether the ticket is legitimate and rejects illegal requests, ensuring that a user's private data can only be accessed through operations initiated by that user's own client.
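
One plausible shape of such a ticket check, sketched under the assumption that the ticket is an opaque token bound to the user at login; the real ticket format, issuance, and cryptography are not described in the text and are not shown here.

    // Full-link ticket sketch: issued at login, carried on every request,
    // verified by every service that touches core user data. Illustrative only.
    #include <cstdint>
    #include <map>
    #include <string>

    class TicketAuthority {
    public:
        // Issued to the client after a successful login.
        std::string Issue(uint64_t uin) {
            std::string ticket = "t-" + std::to_string(uin) + "-" + std::to_string(++seq_);
            valid_[ticket] = uin;
            return ticket;
        }
        // Called by every core data service along the request chain.
        bool Verify(uint64_t uin, const std::string& ticket) const {
            auto it = valid_.find(ticket);
            return it != valid_.end() && it->second == uin;
        }
    private:
        uint64_t seq_ = 0;
        std::map<std::string, uint64_t> valid_;
    };

    bool HandleReadContacts(const TicketAuthority& auth,
                            uint64_t uin, const std::string& ticket) {
        if (!auth.Verify(uin, ticket)) return false;   // illegal request: reject
        // ... only now may the contact data of `uin` be read ...
        return true;
    }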

A new challenge

1. Resource scheduling system

The WeChat backend has thousands of service modules deployed on tens of thousands of servers around the world, and it has long relied on manual management. Moreover, the backend mainly provides real-time online services, whose server resource usage differs greatly between the daily business peak and trough, so computing resources are wasted during the trough; at the same time, many offline big-data computations are constrained by insufficient computing resources and are hard to complete efficiently.

We are experimenting with and deploying a resource scheduling system (Yard) that automates machine resource allocation, service deployment, and the scheduling of offline tasks, in order to allocate resources optimally; when a service's resource needs change with the business, its resources can be reallocated and redeployed automatically, more promptly and more flexibly.

2. Highly available storage

KVSvr, based on a Quorum algorithm, already provides strongly consistent, highly available, high-performance Key-Value/Key-Table storage. Recently, another family of storage systems based on the Paxos algorithm has been born in the WeChat backend; the first is PhxSQL, an SQL storage system that supports complete MySQL functionality while being strongly consistent, highly available, and high-performance.