Overview: To give every user personalized content recommendations, information-flow products rely on a large, complex recommendation system on the back end. The system comprises hundreds of modules and hundreds of strategy algorithms and models, and it iterates extremely quickly, with nearly 100 requirements going online every day. How can such a large system iterate so quickly and still remain stable? The answer lies in the intelligent delivery system built jointly by PM, RD, QA and other roles.

Introduction

This article describes how Baidu develops its large-scale information-flow recommendation system and the practices it uses to improve delivery efficiency. It covers the R&D, testing, release and online deployment stages. By combining data and algorithms, each stage becomes more efficient and the process flows intelligently from one stage to the next, with the final goal of an unattended end-to-end process.

I. Background introduction

On the surface, information-flow products present each user with personalized content recommendations. Behind them, the back end is a large, complex system that includes hundreds of modules and hundreds of strategy algorithms and models. Its strategies and architecture iterate continuously at very high speed, with about one hundred online releases per day on average. How can such a large system iterate so quickly and still remain stable? The answer lies in the intelligent delivery system built jointly by PM, RD, QA and other roles.

△ Figure 1: Simplified recommendation system


The intelligent delivery system covers every link from R&D and testing to release and launch. The first step is to design an efficient delivery model, and then to solve the problems in each link of the current delivery model one by one:

△ Figure 2: Schematic delivery model

1. R&D and self-testing stage: The microservice framework and the business framework & execution engine empower developers, reducing development cost and improving R&D efficiency. This stage also explores a new R&D model driven by independent testing: QA provides testing as a service, while RD strengthens its own quality awareness and the testability of its code. The result is the integration of R&D and testing (testing shifts left by providing test services to R&D and driving testability refactoring; R&D shifts right by providing more infrastructure capabilities and quality awareness), piloted as a way to improve overall efficiency;

2. Testing stage: Testing is split into four links: test input, test execution, test analysis and test localization. Once a complete test system is in place, data, algorithms and other intelligent means empower each link, improving overall test efficiency and recall and freeing up the manpower spent on troubleshooting, localization, and result verification and analysis;

3. Release stage: Evaluate, from the perspective of test evaluation, whether a requirement can be released and go online. Requirements that meet the go-live criteria flow intelligently into the online deployment stage, which makes the process unattended;

4. Online deployment stage: Work with OP, EE and RD to improve go-live efficiency in four areas: compilation optimization & deployment package trimming, intelligent monitoring, dynamic concurrency adjustment, and service restart optimization;

5. Foundation layer: The pipeline relies on strong central platform capabilities, including configuration data management, the build central platform and the strategy central platform, to run efficiently.

△ Figure 3: Diagram of intelligent delivery system

= = =

II. Core issues & solutions

2.1 R&D and self-testing stage: this stage mainly addresses R&D efficiency and self-testing efficiency

2.1.1 Business framework & Execution Engine Construction

1. Background

Currently, much of the strategy and business logic in the convergence layer is scattered. This shows up mainly in two ways:

  • Code architecture: strategy logic is not cohesive, data dependencies are scattered, and the code lacks generality;

  • R&D efficiency: developing a single weight-adjustment feature requires modifying functions in N files. The data each location depends on, and how it uses that data, may differ, so developers must be familiar with the code at every location before starting. It is easy to miss a location during development, which stretches the debugging and testing cycle.

Given this background, we set out to implement an operator execution framework with the following main objectives:

  • Standardize operator interfaces and data dependencies to improve the generality and iteration efficiency of strategy code;

  • Keep the framework's external interface as simple as possible and its internal implementation as lightweight as possible.

2. Implementation

The framework implementation is split into two parts: core execution and execution strategy, as shown in the figure below:

△ Figure 4: Business framework & execution engine design diagram

(1) Core execution

Main function: run the operators according to a given “execution mode”. The core details are:

  • Concurrent multiplexing and short-circuit execution are supported; input data can be organized as either a streaming container or a random-access container;

  • Operators support both stateful and stateless modes, which makes it possible to collect and feed back operator information.
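As a rough illustration of short-circuit execution under an execution mode, here is a minimal Python sketch. It is not the framework's actual interface: the names `Context`, `Operator` and `run_operators`, and the example operators, are all hypothetical. The `enabled` flag stands in for the "execution mode", and `trace` stands in for the run-time information that operators feed back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical and only illustrate the idea.


@dataclass
class Context:
    """Shared data passed between operators (a standardized data dependency)."""
    data: Dict[str, float] = field(default_factory=dict)
    stop: bool = False                               # set by an operator to short-circuit
    trace: List[str] = field(default_factory=list)   # names of operators that ran


@dataclass
class Operator:
    name: str
    fn: Callable[[Context], None]
    enabled: bool = True    # stands in for the current execution mode


def run_operators(ops: List[Operator], ctx: Context) -> Context:
    """Run enabled operators in order and stop early once one sets ctx.stop."""
    for op in ops:
        if not op.enabled:
            continue
        op.fn(ctx)
        ctx.trace.append(op.name)
        if ctx.stop:
            break
    return ctx


def boost(ctx: Context) -> None:
    ctx.data["score"] = ctx.data.get("score", 0.0) * 2


def filter_low(ctx: Context) -> None:
    # Short-circuit: low-scoring items skip all later operators.
    if ctx.data["score"] < 1.0:
        ctx.stop = True


def rerank(ctx: Context) -> None:
    ctx.data["score"] += 10


OPS = [Operator("boost", boost), Operator("filter_low", filter_low), Operator("rerank", rerank)]
```

With an input score of 0.25, `filter_low` short-circuits the chain and `rerank` never runs; with an input score of 1.0, all three operators run.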

(2) Execution strategy

Main function: generate the execution mode from the information collected by core execution. Concretely, the run-time information of each operator in core execution is collected and fed into the execution analysis strategy module, which generates the execution mode for the next round of core execution. The core details are:

  • An independent thread periodically analyzes run-time conditions and generates the execution mode;

  • Execution mode updates and reads go through a 0-1 buffer to reduce read-write contention;

  • The analysis strategy module uses a plug-in design (similar to iptables routing rules). When the execution mode is generated, each strategy is traversed in sequence, and different analysis strategy modules can be customized.
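The following Python sketch shows one way these pieces could fit together. It assumes that the 0-1 buffer is a double buffer (two slots indexed 0 and 1, with readers on the active slot and the writer filling the other), and that each analysis strategy is a function applied in sequence. `ModeBuffer`, `generate_mode`, `disable_slow` and `always_keep` are hypothetical names, and the periodic analysis thread is omitted to keep the example short.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical types: per-operator average latency (ms) and per-operator on/off mode.
Stats = Dict[str, float]
Mode = Dict[str, bool]
Policy = Callable[[Stats, Mode], Mode]


class ModeBuffer:
    """Two slots indexed 0 and 1. Readers use the active slot; the writer
    fills the inactive slot and then flips the index."""

    def __init__(self, initial: Mode) -> None:
        self._slots: List[Mode] = [dict(initial), dict(initial)]
        self._active = 0
        self._write_lock = threading.Lock()   # serializes writers only

    def read(self) -> Mode:
        # Readers take no lock; a slot is replaced wholesale, never mutated.
        return self._slots[self._active]

    def publish(self, new_mode: Mode) -> None:
        with self._write_lock:
            inactive = 1 - self._active
            self._slots[inactive] = dict(new_mode)
            self._active = inactive           # flip readers to the new mode


def disable_slow(threshold_ms: float) -> Policy:
    """Plug-in policy: switch off operators slower than the threshold."""
    def policy(stats: Stats, mode: Mode) -> Mode:
        return {name: on and stats.get(name, 0.0) <= threshold_ms
                for name, on in mode.items()}
    return policy


def always_keep(required: str) -> Policy:
    """Plug-in policy: a required operator is never switched off."""
    def policy(stats: Stats, mode: Mode) -> Mode:
        return {**mode, required: True}
    return policy


def generate_mode(stats: Stats, current: Mode, policies: List[Policy]) -> Mode:
    """Traverse the policies in sequence; later policies can override earlier ones."""
    mode = dict(current)
    for policy in policies:
        mode = policy(stats, mode)
    return mode
```

In a real deployment, a background thread would call `generate_mode` on a timer and then `publish` the result, while worker threads call `read` before each run. Because policies are applied in order, placing `always_keep` after `disable_slow` lets it override the latency rule.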

2.1.2 Pilot new R&D mode

1. Background

Independent testing means that RD uses high-quality test services to assure quality during development, and then produces a deliverable conclusion directly on the strength of complete and effective automation. Overall, delivery efficiency improves by building a best-in-class pipeline and test service capabilities, bringing R&D deep into testing, and putting intelligent testing into practice.

2. Implementation

  • Process change: Previously, after code was submitted, QA added the corresponding cases to cover the changed scenarios. This work now moves to RD and happens during development;

  • RD work: While developing requirement and strategy code, RD writes cases by configuring them or by writing custom verification functions, on top of the test service capabilities provided by QA. Writing the cases takes less than 30 minutes overall;

  • QA work: QA provides low-cost test service capabilities and reduces RD's case-writing cost by building an automated test framework in which modules are onboarded through configuration and cases are added through configuration. In addition, the test service capabilities cover more than 90% of core function points, such as basic checks and strategy checks, and the pipeline covers P0 test scenarios such as function, performance and stability, so that requirements in this category go online without risk.

△ Figure 5: New R&D mode pilot
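To make "configuring a case or writing a custom verification function" concrete, here is a small Python sketch of what a configurable case might look like. The configuration schema, the check names (`min_items`, `no_duplicates`) and `run_case` are assumptions made for illustration; the article does not describe the real test service's format.

```python
from typing import Any, Callable, Dict, List, Optional

Response = Dict[str, Any]
Check = Callable[[Response], bool]


def min_items(n: int) -> Check:
    """Basic check: the response contains at least n recommended items."""
    return lambda resp: len(resp["items"]) >= n


def no_duplicates() -> Check:
    """Basic check: no item is recommended twice."""
    return lambda resp: len(set(resp["items"])) == len(resp["items"])


# Checks offered by the (hypothetical) test service; RD refers to them by name.
BUILTIN_CHECKS: Dict[str, Callable[..., Check]] = {
    "min_items": min_items,
    "no_duplicates": no_duplicates,
}

# A case written purely as configuration.
CASE = {
    "name": "feed_basic",
    "checks": [
        {"type": "min_items", "args": [3]},
        {"type": "no_duplicates", "args": []},
    ],
}


def run_case(case: Dict[str, Any], response: Response,
             custom_checks: Optional[Dict[str, Check]] = None) -> List[str]:
    """Return the names of the failed checks (an empty list means the case passed)."""
    failed = []
    for spec in case["checks"]:
        check = BUILTIN_CHECKS[spec["type"]](*spec.get("args", []))
        if not check(response):
            failed.append(spec["type"])
    for name, check in (custom_checks or {}).items():
        if not check(response):
            failed.append(name)
    return failed
```

A custom verification function is passed as an extra named check, for example `run_case(CASE, response, {"first_is_n1": lambda r: r["items"][0] == "n1"})`.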

2.2 Access & test stage: this stage mainly addresses test efficiency

We have built a complete pipeline that integrates “automatic test”, “performance test”, “stability test” and other engineering capabilities as the access system. Even with complete engineering capabilities in place, we still ran into the following problems:

  • Automated testing is mostly a functional regression capability. How can new functions be covered quickly, so as to raise the share of independent testing?

  • A performance diff test report contains hundreds of metrics. How should they be analyzed, and how can we tell fluctuation caused by the system from an increase caused by a code change, so as to reduce analysis cost?

  • How can an access system with so many capabilities run more efficiently and give RD and QA a smooth pipeline experience?

Therefore, in the intelligent delivery period, we build on the complete test capabilities constructed earlier and, with the support of the central platforms and data, apply strategy algorithms to improve the quality and efficiency of the whole access system. Intelligent analysis, localization and evaluation, together with intelligent process flow, free up the manpower previously invested.

△ Figure 6: Schematic diagram of the intelligent unattended pipeline

2.2.1 Test input: intelligent case generation

1. Background

Automated testing is mostly a functional regression capability. How can new functions be covered quickly, and how can the functions of projects that are not independently tested be covered with high quality and in high volume, so as to raise the share of independent testing?

2. Implementation

Test cases are generated from the white-box analysis results of the incremental code, combined with business strategy, so as to cover the new functions as fully as possible.

△ Figure 7: Intelligent case generation scheme

2.2.2 Test Execution: Intelligent build

1. Background

Does every requirement need to run all tasks in the access stage? If a change only touches logging functions or similar scenarios, is such a heavy build necessary? The answer is clearly no. But how do we know which tasks to run and which to skip?

2. Implementation

Based on the capabilities of the intelligent build central platform, combined with business characteristics, white-box analysis results and historical task characteristics, strategies decide intelligently whether each task needs to run. This breaks the pipeline's habit of mechanically repeating every task: data and algorithms, rather than people, tailor the pipeline tasks, which improves the pipeline's running efficiency.

△ Figure 8: Intelligent build system
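The decision "which tasks to run and which to skip" can be sketched as a small rule table in Python. This is a deliberately simplified stand-in: the real system draws on white-box analysis and historical task characteristics, whereas the sketch classifies a change only by file path. The task names, change kinds and path rules are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical mapping: which kinds of change require which pipeline task.
TASK_TRIGGERS: Dict[str, Set[str]] = {
    "unit_test": {"logic", "logging", "config"},
    "performance_diff": {"logic", "config"},
    "stability_test": {"logic"},
}


def classify_change(path: str) -> str:
    """Crude stand-in for white-box analysis: classify a changed file by its path."""
    if path.endswith((".conf", ".yaml")):
        return "config"
    if "/log/" in path:
        return "logging"
    return "logic"


def select_tasks(changed_files: List[str]) -> List[str]:
    """Return the tasks triggered by at least one kind of change, sorted by name."""
    kinds = {classify_change(path) for path in changed_files}
    return sorted(task for task, triggers in TASK_TRIGGERS.items() if triggers & kinds)
```

Under these rules, a change that only touches `src/log/logger.cc` runs `unit_test` alone, while a change to ranking logic runs all three tasks.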

2.2.3 Test analysis: Performance white-box analysis

1. Background

To prevent speed regressions, performance diff testing has become an essential testing capability. The engineering capabilities are complete, but analyzing performance test results is still time-consuming and labor-intensive:

  • Performance test reports contain hundreds of metrics. How should they be analyzed?

  • System-level long-tail latency fluctuation has been a long-standing problem. It is difficult to determine whether the 99.9th percentile of a single stage's latency is abnormal. How can long-tail degradation be intercepted effectively?

  • When a module-level latency metric is abnormal, how can we determine whether it is fluctuation caused by the system or an increase caused by a code change, so as to reduce analysis cost?

2. Implementation

(1) Long-tail degradation interception based on Dapper: The global performance analysis system built by RD (the Dapper system) already makes system performance observable. Using Dapper data combined with offline performance tests as the data basis, and making decisions with business strategy algorithms, we gain the ability to intercept long-tail degradation.

△ Figure 9: Performance test white-box analysis – Long Tail Interception

(2) Fluctuation elimination based on white-box code analysis: Based on the results of analyzing Dapper latency logs, combined with function call chain analysis, we estimate the latency impact of the incremental code and the latency stages it affects, and then filter and correct the results to eliminate abnormal fluctuations.

△ Figure 10: Performance test white-box analysis – fluctuation elimination
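As a simple illustration of the long-tail check, the Python sketch below compares the 99.9th percentile of per-request latency between a baseline build and a candidate build, and flags degradation only when the increase exceeds a tolerance meant to absorb ordinary fluctuation. The nearest-rank percentile, the 10% tolerance and the function names are assumptions; the article's actual decision strategy is not described in this detail.

```python
from typing import List


def percentile_per_mille(samples: List[float], per_mille: int = 999) -> float:
    """Nearest-rank percentile; per_mille=999 means the 99.9th percentile.

    Integer arithmetic avoids floating-point rounding when computing the rank.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, (per_mille * len(ordered) + 999) // 1000)   # ceil(per_mille * n / 1000)
    return ordered[rank - 1]


def long_tail_degraded(baseline: List[float], candidate: List[float],
                       per_mille: int = 999, tolerance: float = 0.10) -> bool:
    """Flag degradation when the candidate's tail latency exceeds the
    baseline's by more than `tolerance` (an assumed noise allowance)."""
    base = percentile_per_mille(baseline, per_mille)
    cand = percentile_per_mille(candidate, per_mille)
    return cand > base * (1.0 + tolerance)
```

For example, with 1,000 baseline samples whose slowest ten take 50 ms, a candidate whose slowest ten take 52 ms stays within the tolerance, while one whose slowest ten take 80 ms is flagged.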

2.2.4 Unattended: intelligent process flow

1. Background

The sections above focus on improving efficiency at each stage, from R&D self-testing through access testing. Even with those gains, the pipeline still relies on people and their experience to make judgments and move the process along. Can data and algorithms replace people in making these decisions, so that requirements flow smoothly between pipeline stages?

2. Implementation

Starting from the requirements process, the flow at each node is guided by a quality model and risk assessment. At the start of a stage, we mine the risks introduced by the change, the probability of each risk occurring and its impact, and combine them into a risk assessment using a risk matrix. At the end of a stage, we combine the stage's data, characteristics, degradation and other factors to assess the risk and decide whether the requirement can move to the next stage. Finally, each requirement receives a comprehensive go-live risk assessment, which makes the process unattended.
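A risk matrix of this kind can be sketched in a few lines of Python. The 1 to 3 scales, the probability × impact score and the threshold of 4 are illustrative assumptions, not values from the article, and `Risk` and `can_advance` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int   # assumed scale: 1 (rare) to 3 (likely)
    impact: int        # assumed scale: 1 (minor) to 3 (severe)

    @property
    def score(self) -> int:
        """Cell value in the risk matrix: probability times impact."""
        return self.probability * self.impact


def can_advance(risks: List[Risk], stage_passed: bool, max_score: int = 4) -> bool:
    """Let a requirement flow to the next stage without human review only if
    the stage's own checks passed and no risk exceeds the threshold."""
    if not stage_passed:
        return False
    return all(risk.score <= max_score for risk in risks)
```

A logging-only change rated (1, 1) flows through automatically, while a ranking change rated (3, 3) is held for manual review even if the stage's tests pass.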

2.3 Release & Online Deployment phase: This phase mainly improves deployment efficiency

1. Background

The efficiency of the online deployment stage sets the upper limit on the release frequency of the whole product, and improving it reduces the time requirements wait to go online. For this reason, RD, OP and EE jointly carried out a dedicated optimization of online deployment efficiency in Q3 2020.

2. Implementation

The work covers three dimensions (process specification, platform optimization and engineering capability) and includes deployment package trimming, dynamic concurrency adjustment, restart time optimization and intelligent monitoring, with corresponding efficiency optimizations for each stage: packaging, deployment, post-check and manual check.

△ Figure 11: Optimization plan and effect on online deployment time

= = =

III. Summary & results

Through the construction of the R&D business framework & execution engine, the integration of R&D and testing, the intelligent pipeline, intelligent process flow, go-live efficiency improvements and so on, efficiency in the recommendation technology area has improved significantly: more than 50% of requirements are now developed, tested and delivered at day level, and online quality is stable and improving.

  1. Mode innovation: With R&D and testing integrated, RD uses high-quality test services to assure quality during development (the independent testing mode). The independent testing rate has increased substantially, which improves requirement delivery efficiency;

  2. Day-level delivery & higher throughput: 400+ requirements are delivered per week, and 50%+ of them are developed, tested and delivered at day level;

  3. Higher human efficiency: Improvements to pipeline stability, automatic labeling, intelligent customer service and so on have freed up much of the manpower QA invested in pipeline operation and maintenance. Through test evaluation and intelligent process flow, some projects need no QA input and run unattended, which improves QA efficiency;

  4. Stable quality: While iteration efficiency has improved greatly, quality is stable and improving, and the number of online problems is falling steadily.
