Introduction: The Baidu Feed information-flow recommendation system serves the vast majority of the company's feed business scenarios, including Baidu App (Shoubai), Haokan, Quanmin, and Tieba. With the rapid growth of the business, total system traffic has reached billions of requests, supported behind the scenes by hundreds of microservices and tens of thousands of machines. Keeping the whole system highly available to the outside is therefore central to its capacity building, and a core direction for our team. Based on our hands-on experience, this article describes how we built the high-availability architecture of the Baidu Feed online recommendation system to meet a steady-state availability target of five nines.

1. Background

The Baidu Feed information-flow recommendation system serves the vast majority of the company's feed business scenarios and currently carries the feed recommendation traffic of Baidu App (Shoubai), Haokan, Quanmin, Tieba, and other client products. The figure below is a simplified view of the system from the perspective of the resource funnel.

△ Figure 1: Simplified system diagram

The system is briefly described as follows:

  • Resource index libraries: provide index queries over tens of millions of resources, based on inverted or vector indexes. There are usually dozens or even hundreds of different index libraries, each providing a different type of resource index.

  • Recall layer: recalls hundreds of thousands of user-relevant resources out of tens of millions. It is usually tightly bound to the resource index libraries; for example, an explicit recall path may look up user-relevant documents in an inverted index based on the user's tags.

  • Ranking layer: large recommendation systems balance compute cost and effect with a two-stage ranking system of coarse ranking and fine ranking. Based on a unified scoring mechanism, it selects the hundreds of resources the user is most interested in and passes them to the mixed-ranking layer.

  • Mixed-ranking layer: from those hundreds of resources, it picks the dozens of resources the end user actually sees, based on the user's context, and is usually also coupled with product intervention capabilities such as resource-diversity control.

  • User model: the user-model service provides user-level feature retrieval, such as the set of tags the user is interested in.

  • Document features: the document-feature service provides full document-level features, usually retrieved by document ID.
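To make the funnel concrete, here is a minimal, purely illustrative Python sketch of the stage-by-stage narrowing described above. All function names, counts, and scoring callables are assumptions for the example, not the actual service interfaces.

```python
from typing import Callable, Dict, List, Tuple

DocID = str
Scored = Tuple[DocID, float]


def recommend(user_id: str,
              recall_paths: Dict[str, Callable[[str], List[DocID]]],
              coarse_score: Callable[[str, DocID], float],
              fine_score: Callable[[str, DocID], float]) -> List[DocID]:
    # Recall layer: tens of millions in the indexes -> hundreds of thousands of candidates.
    candidates: List[DocID] = []
    for path in recall_paths.values():
        candidates.extend(path(user_id))

    # Coarse ranking: cheap scores over all candidates, keep ~thousands.
    coarse: List[Scored] = sorted(((d, coarse_score(user_id, d)) for d in set(candidates)),
                                  key=lambda x: x[1], reverse=True)[:1000]

    # Fine ranking: more expensive model over ~thousands, keep hundreds.
    fine: List[Scored] = sorted(((d, fine_score(user_id, d)) for d, _ in coarse),
                                key=lambda x: x[1], reverse=True)[:100]

    # Mixed-ranking layer: diversity control and interventions would go here;
    # return the final dozens shown to the user.
    return [d for d, _ in fine[:30]]
```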

With the rapid development of the business, the Baidu Feed recommendation system now carries billions of online requests, supported by hundreds of microservices and tens of thousands of machines. Keeping the system highly available is one of the key capabilities of the overall architecture design, and a core direction of our team's work.

2. Overall approach

To meet the online recommendation system's high-availability target, we have gradually abstracted and distilled a set of multi-level flexible (graceful-degradation) processing capabilities from our day-to-day work.

△ Figure 2: Multi-level flexible processing capability

As shown in Figure 2, from long-tail timeouts on individual requests all the way to large-scale IDC-level failures, we have distilled a corresponding architecture design for each class of problem, based on our practical experience. They are described in detail below.

3. Specific solutions

1. Instance-level fault solution

The online recommendation system consists of hundreds of microservices and is co-located (mixed-deployed) at large scale on the private cloud. Two factors have the greatest impact on the steady-state availability metric:

  • Availability jitter caused by single-instance faults, which may come from machine crashes, full disks, or CPU overload due to co-located workloads.

  • Long-tail request timeouts, usually caused by random factors in the system whose causes are diverse and hard to normalize, such as resource contention, momentary imbalance of incoming traffic, periodic memory cleanup, and offline log operations.

The usual solutions are a retry mechanism plus masking and probing of abnormal instances. Our approach is similar, but the implementation evolves further to address the problems these mechanisms can introduce.

1.1 Dynamic retry scheduling

The biggest problems with retries are choosing the retry timeout and avoiding avalanches:

  • If the retry timeout is short, a relatively large share of traffic is retried even under normal conditions, wasting resources and risking an avalanche in downstream services: once latency degrades across a large part of the downstream, overall traffic doubles and drags the whole downstream service down.

  • If the retry timeout is long, either the overall service timeout has to grow with it, or the retry fires too late to be effective, and it easily causes timeout inversion on complex microservice call chains.

  • Of course, we can weigh these trade-offs for a given state of the system and pick a reasonable value that avoids both problems. But rapid business iteration means this value has to be re-tuned constantly as conditions change, which brings a large operation and maintenance cost.

With these considerations in mind, we designed and implemented a dynamic retry scheduling mechanism. Its core idea is to strictly cap the proportion of retried traffic (for example, only 3% of traffic is allowed to trigger a retry) and to allocate the retry budget to the requests that actually need it, based on real-time collection of latency quantiles. This also means we no longer need to maintain timeout settings by hand.

Here is an implementation comparison diagram:

△ Figure 3: Dynamic retry scheduling

Its core implementations include:

  • Backend quantile latency statistics: time is split into periodic windows (for example, 20 seconds per window); latency quantiles are computed from the previous window's traffic; the backup_request_ms value is derived dynamically from the configured retry ratio; and a request-level retry timeout mechanism in the RPC framework turns this into dynamic timeout scheduling.

  • Circuit-breaker control: the quantile statistics lag by one window, but in practice this has little impact, because service latency rarely changes drastically between adjacent windows; what must be handled is the extreme case where the lag lets retry traffic surge (which could trigger an avalanche). That is exactly what the circuit breaker is for: it counts retry requests in real time over a smaller window and caps their proportion, using a smoothing mechanism to compensate for the statistical noise of such a small window. (A combined sketch of both mechanisms follows this list.)
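The following is a minimal sketch, under illustrative parameters, of how the two mechanisms above can fit together: the backup-request timeout is re-derived every window from the latency quantile implied by the allowed retry ratio, and a circuit breaker caps the realized retry ratio over a smaller, smoothed window. The class and its fields are assumptions for the example, not the production implementation.

```python
import time
from bisect import insort
from collections import deque


class DynamicBackupRequest:
    """Illustrative sketch of dynamic retry (backup-request) scheduling."""

    def __init__(self, retry_ratio=0.03, window_s=20.0, breaker_window=2000):
        self.retry_ratio = retry_ratio               # e.g. only 3% of traffic may retry
        self.window_s = window_s                     # quantile window, e.g. 20 s
        self.backup_request_ms = 100.0               # bootstrap value before the first window
        self._latencies = []                         # sorted latencies of the current window
        self._window_start = time.monotonic()
        self._recent = deque(maxlen=breaker_window)  # 1 = request retried, 0 = not

    def finish_request(self, latency_ms: float, retried: bool) -> None:
        """Feed back every finished request so both mechanisms stay current."""
        self._recent.append(1 if retried else 0)
        now = time.monotonic()
        if now - self._window_start >= self.window_s and self._latencies:
            # Next window's timeout = (1 - retry_ratio) latency quantile of this window.
            idx = min(int(len(self._latencies) * (1 - self.retry_ratio)),
                      len(self._latencies) - 1)
            self.backup_request_ms = self._latencies[idx]
            self._latencies = []
            self._window_start = now
        insort(self._latencies, latency_ms)

    def should_send_backup(self, elapsed_ms: float) -> bool:
        """Ask this while a request is still in flight."""
        if elapsed_ms < self.backup_request_ms:
            return False
        # Circuit breaker: refuse the retry if the recent retry ratio already exceeds
        # a slightly smoothed budget, preventing avalanches when downstream latency
        # degrades across the whole fleet.
        realized = sum(self._recent) / max(len(self._recent), 1)
        return realized < self.retry_ratio * 1.5
```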

After being applied in the recommendation system, the scheme improves availability under normal conditions by more than 90%.

1.2 Single-instance fault handling

Masking and probing are the common solutions to single-instance failures, but a high-availability architecture still faces the following challenges:

  • Detection capability: this covers three dimensions, namely precision, recall, and timeliness. Detection based on a single client's local view can react in real time, but its local perspective makes it inaccurate and prone to false positives; detection based on globally aggregated information is more accurate but much less timely, and collecting the data costs some performance.

  • Handling strategy: probing traffic causes a small amount of traffic loss, while masking instances from a non-global perspective risks overloading the remaining healthy capacity, which can make the overall situation worse.

To address these problems, we evolved the design further on top of dynamic retry scheduling. The core idea is to cut the loss as much as possible in real time on the request path, while handling exceptions globally in an asynchronous, near-real-time control loop.

  • Real-time dynamic stop-loss: the dynamic retry mechanism handles long-tail requests that show no instance-level differences. For the specific scenario of a single-instance failure, two mechanisms are added. The first is weight adjustment driven by availability feedback: it senses abnormal instances, quickly lowers their weights, and steers retries away from them. The second is smooth adjustment driven by latency feedback: it adaptively adjusts the weights of healthy instances according to load pressure, so that their availability is not destroyed by capacity problems.

  • Global near-real-time stop-loss: based on components we integrated into the bRPC framework, instance-level information from the single-machine perspective is collected quickly and reported periodically to a unified aggregation and control layer. Thanks to an efficient implementation and a two-tier aggregation mechanism, client performance is essentially unaffected. More importantly, the control strategy for abnormal instances is coordinated with the PaaS platform, so instances can be fully masked while available capacity is preserved. (A minimal sketch of the weight-feedback idea follows below.)
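As a rough illustration of the real-time weight-feedback idea (not the actual bRPC integration), the sketch below adjusts per-instance weights from availability and latency feedback and keeps retries away from the instance that already failed. All parameter values are assumptions.

```python
import random
from collections import defaultdict


class WeightedInstancePicker:
    """Illustrative per-instance weight adjustment driven by availability
    and latency feedback (EWMA); an abnormal instance quickly loses traffic
    while healthy instances are loaded according to their latency."""

    def __init__(self, instances, alpha=0.2, floor=0.01):
        self.alpha = alpha                        # EWMA smoothing factor
        self.floor = floor                        # never drop availability weight to 0
        self.err = defaultdict(float)             # smoothed error rate per instance
        self.lat = defaultdict(lambda: 50.0)      # smoothed latency (ms) per instance
        self.instances = list(instances)

    def report(self, inst, ok: bool, latency_ms: float) -> None:
        """Feed back the outcome of every finished call."""
        self.err[inst] = (1 - self.alpha) * self.err[inst] + self.alpha * (0 if ok else 1)
        self.lat[inst] = (1 - self.alpha) * self.lat[inst] + self.alpha * latency_ms

    def _weight(self, inst) -> float:
        # Availability feedback dominates; latency feedback smooths the rest.
        avail = max(1.0 - self.err[inst], self.floor)
        return avail * avail / self.lat[inst]

    def pick(self, exclude=()):
        """Weighted random choice; retries pass the first instance via `exclude`
        so a backup request never lands on the instance already suspected."""
        candidates = [i for i in self.instances if i not in exclude] or self.instances
        weights = [self._weight(i) for i in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]
```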

△ Figure 4: Global near-real-time stop-loss scheme

With this scheme in place, the system can handle single-instance failures essentially automatically. Compared with the plain dynamic retry scheme, the jitter window is smaller, basically converging within 5 s, and the availability drop during the jitter window is cut by a further 50%.

2. Service-level fault solution

The service-level faults discussed here are large-scale outages of a sub-service in the microservice system, where it basically cannot provide any service externally. Cloud-native microservice architecture is the industry-wide trend; a well-designed microservice system greatly improves development iteration efficiency and also leaves more room to explore in the direction of stability. On top of the microservice architecture, we combine a flexible-architecture design philosophy with multi-level disaster recovery to handle such large-scale service failures. The key points are:

  • Differentiated handling of anomalies: for core services we need mechanisms that actively cope with potential anomalies; for non-core services it may be enough to keep their exceptions from spreading to the core path.

  • Multi-level disaster recovery (DR): when a large-scale fault does occur, the user experience is protected for a certain period of time.

Of course, handling service-level faults has to be designed around the concrete business scenario. Below are several examples of how we deal with them in the Feed recommendation system.

2.1 Multi-recall scheduling framework

In a recommendation system, because recall methods differ, a multi-path recall framework is usually designed, with each recall path responsible for a specific recall capability. For example, from the resource perspective we can split recall into text-and-image recall and video recall; from the algorithm perspective, into ItemCF recall and UserCF recall. When a single recall path fails, its impact on the overall service depends on its functional role:

  • Level-1 recall: e.g., recall tied to operational controls, which directly affects the perceived product experience; failure is not allowed.

  • Level-2 recall: determines the distribution performance of the feed. Failure of these paths causes a noticeable drop in overall distribution; short-term failure of a single path is tolerable, but multiple paths must not fail at the same time.

  • Level-3 recall: supplementary recall paths, such as interest-exploration recall, which help long-term effect but have little short-term impact on the product; short-term failure is allowed.

For level-3 recall paths, although a call failure has no significant short-term product impact, large-scale timeouts delay and degrade overall performance and thus hurt overall service availability. A crude fix is to simply lower the timeouts of these recall paths so they cannot affect the overall latency, but that permanently reduces their steady-state availability, which is not what we want. To solve this in a unified, abstract way, we designed the multi-recall scheduling framework, as follows:

△ Figure 5: Multi-recall scheduling framework

Its core design includes:

  • Recall classification: level-1 recall may never be proactively discarded. Level-2 recall paths are split into groups; currently we use a text-and-image recall group, a video recall group, and others. Each group consists of the recall paths that contribute most to distribution, and within a group a configurable proportion (40% in our setup) of paths may be proactively discarded. Level-3 recall paths may always be proactively discarded.

  • Layer-dropping mechanism: once all the required recall paths have returned, the remaining paths that have not yet returned are proactively discarded instead of being waited for.

  • Compensation mechanism: discarding always has some impact on product quality, even if small. To minimize it, we built a bypass cache system: when a proactive discard happens, the cached result of the previous request is used as this request's return, reducing the loss to a certain extent. (A sketch of the full scheduling flow follows this list.)
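The sketch below shows one possible shape of such a scheduling loop, using asyncio for concision: level-1 paths must return, each level-2 group waits only for a quorum of its paths, level-3 results are taken only if already finished, and anything not returned in time is proactively discarded and, where possible, compensated from a bypass cache. The function, its arguments, and the 60% quorum (i.e., up to 40% discarded per group) are assumptions for the example.

```python
import asyncio


async def multi_recall(user_id, paths, level1, level2_groups,
                       group_quorum=0.6, budget_s=0.15, cache=None):
    """paths: name -> async recall function; level1: set of must-return names;
    level2_groups: list of name sets, each needing only a quorum; everything
    else is treated as level 3."""
    tasks = {name: asyncio.create_task(fn(user_id)) for name, fn in paths.items()}

    def satisfied() -> bool:
        if not all(tasks[n].done() for n in level1):
            return False
        return all(sum(tasks[n].done() for n in group) >= group_quorum * len(group)
                   for group in level2_groups)

    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    while not satisfied() and loop.time() < deadline:
        await asyncio.sleep(0.005)            # coarse polling keeps the sketch short

    results = {}
    for name, task in tasks.items():
        if task.done() and not task.cancelled() and task.exception() is None:
            results[name] = task.result()
        else:
            task.cancel()                     # layer-dropping: discard, don't wait
            if cache and name in cache:
                results[name] = cache[name]   # compensation from the bypass cache
    if cache is not None:
        cache.update(results)                 # refresh the cache with this round
    return results
```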

With this scheme in place, and as verified in our drills, the system can automatically cope with service-level faults in level-2 and level-3 recall paths.

2.2 Ranking-layer fault solution

The ranking service sits on top of recall and provides a unified scoring mechanism for the data returned by all recall paths, producing the overall ordering of resources. The industry usually adopts a two-stage coarse-ranking plus fine-ranking system to resolve the tension between compute cost and effect.

  • Coarse ranking, thanks to its two-tower model design, can score tens of thousands of resources in a single user request, but its effect is relatively weak (see the sketch after this list);

  • Fine ranking provides better ranking quality through more complex network models, but can only score on the order of thousands of candidates;

  • The recall, coarse-ranking, fine-ranking funnel greatly improves recommendation quality under a fixed compute budget.
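To illustrate why the two-tower design lets coarse ranking score tens of thousands of candidates per request, the sketch below assumes item embeddings are precomputed offline, so the online cost is one user-tower forward pass plus a single matrix-vector product. The callables and data layout are assumptions, not the production model.

```python
import numpy as np


def coarse_rank_two_tower(user_features, candidate_ids, item_embeddings,
                          user_tower, keep=1000):
    """item_embeddings: dict doc_id -> precomputed vector of shape (d,);
    user_tower: callable producing the user vector of shape (d,)."""
    user_vec = user_tower(user_features)                                 # one forward pass
    cand_matrix = np.stack([item_embeddings[i] for i in candidate_ids])  # (N, d), precomputed
    scores = cand_matrix @ user_vec                                      # (N,) dot-product scores
    top = np.argsort(-scores)[:keep]
    return [(candidate_ids[i], float(scores[i])) for i in top]
```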

Clearly, the ranking service is crucial to overall recommendation quality: if it fails at scale, the output is essentially a random pick from tens of thousands of resources. To prevent that, we took failure handling into account from the start of the design. The core idea is a hierarchical degradation mechanism.

△ Figure 6: Ranking-layer degradation scheme design

Its core design includes:

  • Build a stable intermediate proxy layer, Router: this layer has simple logic, iterates rarely, and is highly stable, and it performs the degradation when exceptions occur;

  • Coarse-ranking fault handling: a bypass ranking path is built from resource-level statistics such as click-through rate and dwell time, mined offline. When the coarse-ranking service fails at scale, this path provides an emergency ranking capability;

  • Fine-ranking fault handling: coarse ranking is weaker than fine ranking, but its output can serve as the degraded data for the system as a whole. When fine ranking fails at scale, the Router falls back directly to the coarse-ranking result. (A sketch of this degradation order follows below.)
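A minimal sketch of the Router's degradation order described above, with hypothetical callables standing in for the real services:

```python
def rank_with_degradation(request, fine_rank, coarse_rank, offline_scores):
    """Degradation order: fine ranking first; fall back to the coarse-ranking
    result if fine ranking fails; fall back to bypass-mined per-resource
    statistics (e.g. CTR / dwell time) if coarse ranking fails too.
    `request`, the two callables, and `offline_scores` are illustrative."""
    try:
        candidates = coarse_rank(request)
    except Exception:
        # Emergency path: order candidates by offline per-resource statistics.
        return sorted(request.candidates,
                      key=lambda doc_id: offline_scores.get(doc_id, 0.0),
                      reverse=True)
    try:
        return fine_rank(request, candidates)
    except Exception:
        # Fine ranking down: degrade to the coarse-ranking order.
        return candidates
```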

Compared with random recommendation, this ranking-layer fault solution greatly reduces the impact of a failure on recommendation quality.

2.3 Elastic disaster recovery (DR) mechanism

Evolving further from the ranking-layer fault solution, we built an elastic disaster recovery mechanism for the whole system. The goal is a unified architecture that supports exception handling and, when exceptions do occur, reduces the impact on both perceived user experience and policy effect.

△ Figure 7: New architecture for elastic DR

Its core design includes:

  • A stable entry module for the recommendation system, Front: this module is the entry layer of the recommendation system; it is not coupled with any business logic and only carries the elastic disaster recovery capability.

  • Hierarchical DR data: personalized DR data plus global DR data. Personalized DR data is usually built from caches of data across refreshes, while global DR data is mined by a bypass job.

  • A unified anomaly-sensing mechanism: anomalies are transparently propagated across services. When the Front layer detects a core-service anomaly, it fetches personalized DR data for recovery; if none exists, it falls back to the global DR data.

We illustrate this with the exception handling of the feed's top (pinned) data.

  • Global DR data mining: a bypass job periodically writes all available top data into the global DR cache.

  • Personalized DR data construction: when a user refreshes, the top-recall service selects the best set of top data matching that user and writes it into the personalized DR cache.

  • Failure sensing: when the top-recall call fails, or any link between Front and the top-recall service fails, the response packet received by Front carries no flag indicating that the top data succeeded.

  • DR control: Front inspects the returned data; if the success flag is missing, it starts DR handling, reading the personalized top DR data first and falling back to the global DR data otherwise. (A minimal sketch of this control flow follows below.)
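A minimal sketch of this DR control flow at the Front layer, with hypothetical cache objects and flag names; in the real flow the top-recall service fills the personalized cache, but the sketch keeps everything in one function for brevity.

```python
def front_top_dr(response, user_id, personalized_dr_cache, global_dr_cache):
    """Illustrative Front-layer DR control for top data. `top_ok`,
    `top_items`, and the two cache objects are assumed names."""
    if response.get("top_ok"):
        # Normal path: keep the personalized DR cache fresh for future failures.
        personalized_dr_cache[user_id] = response["top_items"]
        return response["top_items"]
    # No success flag: the top-recall call or some link in between failed.
    if user_id in personalized_dr_cache:
        return personalized_dr_cache[user_id]       # personalized DR data first
    return global_dr_cache.get("top_items", [])     # then bypass-mined global DR data
```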

This mechanism provides a unified elastic DR capability: personalized data minimizes the policy loss during DR, and global DR data ensures the user does not perceive the failure.

3. IDC-level fault solution

IDC-level faults are usually handled with a geo-distributed multi-master architecture, i.e., traffic is switched away from the failing IDC to stop the loss promptly. The Baidu recommendation system uses the same approach. Here we describe the multi-master design of the delivery-history storage service.

The delivery-history storage service stores, per user, the resources recently delivered, displayed, and clicked. It supports cross-request policies such as diversity control and, most importantly, prevents duplicate resources from being delivered. If this service goes down, the whole recommendation service becomes unavailable.

Its overall design is as follows:

△ Figure 8: Multi-master architecture of the delivery-history storage service

Its core design includes:

  • Each region maintains a full copy of the storage;

  • Read requests are served only from the local IDC;

  • Write requests are written simultaneously to the local data center and across IDCs; if the cross-IDC write fails, the operation is written to a message queue, from which the cross-IDC write is replayed asynchronously. (A sketch of this write path follows below.)
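A minimal sketch of this read/write path, with hypothetical store, client, and queue objects standing in for the real storage service:

```python
def read_history(user_id, local_store):
    # Reads are served only from the local IDC replica.
    return local_store.get(user_id, [])


def write_history(user_id, items, local_store, remote_client, message_queue):
    local_store.setdefault(user_id, []).extend(items)   # write the local replica
    try:
        remote_client.write(user_id, items)             # synchronous cross-IDC write
    except Exception:
        # On failure, hand the operation to a message queue; a bypass consumer
        # replays it against the remote IDC asynchronously until it succeeds.
        message_queue.put({"user_id": user_id, "items": items})
```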

When an IDC-level failure occurs, this geo-distributed multi-master architecture lets us quickly switch traffic and stop the loss.

4. Summary

The Baidu Feed recommendation architecture supports rapid, iterative business development. Throughout the architecture's evolution, availability has always been a key indicator of its capability. Through the flexible architecture described here, we have secured the steady-state five-nines availability target for external traffic and gained the ability to handle conventional large-scale failures.

△ Figure 9: Exit availability metric over the last month

Authors: windmill & smallbird

