Foreword

Quality is the foundation of an enterprise's long-term survival and its trump card in competition. As a member of the quality control team, ensuring and improving the quality of the systems I am responsible for is the core of my job, and thorough test coverage is an effective means of ensuring quality.

Tests can be divided by type into functional tests and performance tests. According to the test pyramid model, functional testing falls into three categories: unit testing, interface testing, and UI testing. Unit tests are method-level tests and form the basis for ensuring code quality. Interface tests and UI tests are end-to-end tests that need to cover complete business scenarios; they are generally automated by testers and added to continuous integration to ensure that submitted code does not break existing product functionality.

However, interface testing and UI testing cannot cover all testing requirements, algorithms being one example. Algorithms are the basis of machine learning and artificial intelligence, so their effectiveness matters greatly, especially amid the group's push toward intelligent operations and maintenance, where new algorithms emerge constantly. Finding effective ways to evaluate the strengths and weaknesses of these algorithms has become the test team's responsibility. An algorithm needs neither interface verification nor UI testing; instead, it requires a set of targeted evaluation metrics and a way to obtain those metric values for the method under test.

Algorithm testing

The algorithm testing process is actually very simple, consisting of only three steps:

  • Construct the input
  • Run the algorithm on the constructed input
  • Collect the output and use it to calculate each metric value, then evaluate the algorithm

Treating the algorithm as a black box, all the testing needs to do is complete steps 1 and 3. The most important step is the first, because once the input is determined, the output is essentially determined; the difference lies only in how you analyze it. So how do you construct the input? There are two methods. One is to construct the data set manually: the advantage is that it is relatively simple and data can be constructed at will, while the disadvantage is that it cannot reflect the real situation online, so many scenarios will be missed. The other is to use online data directly, which offers comprehensive scenario coverage but makes data collection time-consuming. If a testing system can be built that turns online data collection, algorithm execution, and output evaluation into a fully automated process, the efficiency and effectiveness of algorithm testing can be greatly improved.
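To make the idea concrete, here is a minimal sketch in Python of what such an automated run could look like. The function names (collect_online_data, run_algorithm, evaluate_output) and the toy data are hypothetical placeholders for the real data source, the algorithm under test, and the metric calculation; they do not correspond to any real system API.

```python
# Hypothetical sketch of a fully automated run:
# online data collection -> algorithm execution -> output evaluation.

def collect_online_data():
    """Step 1: build the input data set from recorded online data."""
    # Stand-in: a real system would query recorded monitoring data instead.
    return [{"plan_id": 101, "metrics": {"error_rate": 0.35}},
            {"plan_id": 102, "metrics": {"error_rate": 0.01}}]

def run_algorithm(sample):
    """Step 2: run the algorithm under test as a black box."""
    # Toy stand-in: flag a sample whose error rate looks abnormal.
    return sample["metrics"]["error_rate"] > 0.2

def evaluate_output(samples, outputs):
    """Step 3: turn raw outputs into values the team can evaluate."""
    flagged = [s["plan_id"] for s, out in zip(samples, outputs) if out]
    return {"flagged_plans": flagged, "flagged_ratio": len(flagged) / len(samples)}

if __name__ == "__main__":
    data = collect_online_data()
    results = [run_algorithm(sample) for sample in data]
    print(evaluate_output(data, results))  # {'flagged_plans': [101], 'flagged_ratio': 0.5}
```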

The following takes the algorithm testing of the unattended release system as an example to introduce one way of implementing the testing ideas above.

Unattended release

Unattended Release (RiskFree) focuses on quickly analyzing the metrics of a newly released application version to identify anomalies, intercept problematic releases, and reduce the failure rate caused by releases. The unattended release system has three main inputs: system and basic monitoring data from ALI360, business monitoring data from Sunfire, and log analysis data from A3. The algorithm analyzes the data from these three systems to produce an anomaly score, and an application with a high enough anomaly score triggers interception.
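The article treats the scoring algorithm itself as a black box, but the decision it feeds can be pictured roughly as below. This is a hypothetical sketch in which the three monitoring inputs are reduced to a single anomaly score and interception is triggered above a threshold; the weights, field names, and threshold are illustrative assumptions, not RiskFree's actual logic.

```python
# Hypothetical sketch of the interception decision driven by an anomaly score.
# The weights, field names, and threshold are illustrative assumptions only.

ANOMALY_THRESHOLD = 0.8  # assumed value

def anomaly_score(ali360, sunfire, a3):
    """Combine the three monitoring inputs into a single score in [0, 1]."""
    return 0.5 * ali360["anomaly"] + 0.3 * sunfire["anomaly"] + 0.2 * a3["anomaly"]

def should_intercept(ali360, sunfire, a3):
    """An application whose score is high enough triggers interception."""
    return anomaly_score(ali360, sunfire, a3) >= ANOMALY_THRESHOLD

print(should_intercept({"anomaly": 0.9}, {"anomaly": 0.8}, {"anomaly": 0.7}))  # True
```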

Failure playback test

To test the unattended release system, beyond verifying its basic functions, the most important task is to verify the effectiveness of its algorithm, which comes down to two metrics: accuracy rate and recall rate:

  • Accuracy rate = valid interceptions (potential failures) / all interceptions
  • Recall rate = valid interceptions / all releases that should have been intercepted

The test needs to construct a data set such that the outputs produced from its inputs accurately reflect the algorithm's accuracy and recall rates, giving developers a reliable reference for verifying that their optimizations work.
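Once each replayed plan is labeled, computing the two metrics is mechanical. Below is a minimal sketch, assuming each plan carries an intercepted flag (the algorithm's decision) and a should_intercept label like the one produced by the marking step described later; the field names are assumptions for illustration.

```python
# Sketch of the metric calculation over labeled playback results.
# `intercepted` is the algorithm's decision, `should_intercept` the label.

def accuracy_and_recall(plans):
    interceptions = [p for p in plans if p["intercepted"]]
    valid = [p for p in interceptions if p["should_intercept"]]
    should = [p for p in plans if p["should_intercept"]]
    accuracy = len(valid) / len(interceptions) if interceptions else 0.0  # valid / all interceptions
    recall = len(valid) / len(should) if should else 0.0                  # valid / all that should be intercepted
    return accuracy, recall

plans = [
    {"plan_id": 101, "intercepted": True,  "should_intercept": True},   # valid interception
    {"plan_id": 102, "intercepted": True,  "should_intercept": False},  # false interception
    {"plan_id": 103, "intercepted": False, "should_intercept": True},   # missed interception
]
print(accuracy_and_recall(plans))  # (0.5, 0.5)
```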

The input data set consists of the output of each monitoring system. That output is large and complex, so constructing the data manually is inefficient and labor-intensive. Even if a data set is successfully constructed by hand, its validity and coverage cannot be guaranteed: high accuracy and recall rates during testing would not mean that failures can be effectively intercepted online, which reduces the value of the testing.

Therefore, the most effective method is to record online monitoring data directly and play that data set back to verify the effect of the algorithm. To improve test efficiency and free up hands, data selection, collection, playback, and result display need to be made into an automated process, so that developers can select any desired data set and trigger a playback with a single click.

In the unattended release system, one release corresponds to one plan. The basic idea is therefore to record all the data produced by the three monitoring systems while a plan is running and store it in three separate tables. Playback then calls the playback interface provided by RiskFree to trigger analysis by planId and returns the corresponding recorded data to complete the playback. Finally, the results are collected and displayed after the plan analysis finishes. The recording and playback modules are explained separately below.
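As a rough sketch of that data model: three tables keyed by planId, one per monitoring source, so that a playback can pull back everything a plan produced. The schema, table names, and the use of SQLite are assumptions for illustration, not the real storage layer.

```python
# Hypothetical sketch of the recording storage: one table per monitoring source,
# all keyed by planId so a playback can look up everything a plan produced.
import sqlite3

conn = sqlite3.connect("record.db")
for table in ("ali360_record", "sunfire_record", "a3_record"):
    conn.execute(
        f"""CREATE TABLE IF NOT EXISTS {table} (
                plan_id     TEXT NOT NULL,
                request_key TEXT NOT NULL,   -- identifies the original monitoring query
                payload     TEXT NOT NULL,   -- raw monitoring response, stored as JSON
                recorded_at TEXT NOT NULL,
                PRIMARY KEY (plan_id, request_key)
            )"""
    )
conn.commit()
```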

Recording process

The first step is to select the plans to record. In this case, all plans that triggered interception during release were selected. To calculate accuracy and recall after playback, these plans need to be labeled as either valid interceptions or false interceptions. The labeling criterion is: interception triggered by unattended release && release order manually closed or rolled back = valid interception; everything else is a false interception. In theory this rule guarantees accurate labeling. The task is handled by a scheduled job: every day at midnight it filters the previous day's release orders, analyzes the unattended-release and sewolf databases to determine which plans to select and how to label them, and stores the results in the local database.
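A sketch of that nightly labeling rule follows. Only the boolean rule itself comes from the text above; the two lookup helpers are hypothetical stand-ins for queries against the unattended-release and sewolf databases.

```python
# Sketch of the nightly labeling rule. The two lookup helpers are hypothetical
# stand-ins for queries against the unattended-release and sewolf databases.

def unattended_triggered_interception(plan_id):
    # Placeholder: would check whether unattended release intercepted this plan.
    return plan_id in {"plan-101"}

def release_order_closed_or_rolled_back(plan_id):
    # Placeholder: would check whether the release order was manually closed or rolled back.
    return plan_id in {"plan-101"}

def label_plan(plan_id):
    """Valid interception = intercepted AND release order closed or rolled back."""
    if unattended_triggered_interception(plan_id) and release_order_closed_or_rolled_back(plan_id):
        return "valid_interception"
    return "false_interception"

print(label_plan("plan-101"))  # valid_interception
print(label_plan("plan-202"))  # false_interception
```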

After selecting the plans, the data obtained from the three monitoring systems during plan analysis needs to be pulled down and saved. The process is simple: the monitoring data retrieved during analysis just has to be persisted. RiskFree provides a recording interface that re-triggers analysis by planId; during the analysis, the monitoring data it obtains is stored in the recording module's DB through the API the recording module provides. Since the monitoring systems keep roughly a month of historical data, all the required monitoring data can be obtained as long as recording happens in time. The same interface can be used for both recording and playback, with a configuration item deciding which mode applies. Of course, the recording module must be idempotent to ensure that no duplicate data is inserted into the data set.
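A minimal sketch of the mode switch and the idempotent write, building on the hypothetical table layout sketched earlier; the RECORD_MODE flag and function names are assumptions, not RiskFree's real configuration or API.

```python
# Sketch of the record/playback switch and an idempotent write.
# RECORD_MODE and the table layout are assumptions, not RiskFree's real config.
import sqlite3

RECORD_MODE = True  # configuration item: True = record, False = play back

def save_monitoring_data(conn, table, plan_id, request_key, payload):
    """Insert recorded data; INSERT OR IGNORE keeps the operation idempotent."""
    conn.execute(
        f"INSERT OR IGNORE INTO {table} (plan_id, request_key, payload, recorded_at) "
        f"VALUES (?, ?, ?, datetime('now'))",
        (plan_id, request_key, payload),
    )
    conn.commit()

def load_monitoring_data(conn, table, plan_id, request_key):
    """During playback, return exactly what was recorded for the same request."""
    row = conn.execute(
        f"SELECT payload FROM {table} WHERE plan_id = ? AND request_key = ?",
        (plan_id, request_key),
    ).fetchone()
    return row[0] if row else None
```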

The recording module provides two ways to trigger recording. The first is a scheduled task: the monitoring data of the plans screened in the first step is collected at 1:00 a.m. every day. The second is triggered through an interface: specific plans, or all plans in a given time period, can be recorded on demand, which also makes it possible to re-record plans that failed to record in the scheduled task.
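A sketch of the two triggers; record_plan() and find_plans_between() are hypothetical placeholders for the recording module's real entry points.

```python
# Sketch of the two recording triggers. record_plan() and find_plans_between()
# are hypothetical placeholders for the recording module's real entry points.

def record_plan(plan_id):
    # Placeholder: would pull and persist the three monitoring data sets for one plan.
    print(f"recording {plan_id}")
    return True

def find_plans_between(start, end):
    # Placeholder: would query the screened plans inside the time window [start, end).
    return []

def scheduled_recording(selected_plan_ids):
    """Trigger 1: daily job (1:00 a.m.) over the plans screened the day before."""
    return [p for p in selected_plan_ids if not record_plan(p)]  # plans that failed

def on_demand_recording(plan_ids=None, start=None, end=None):
    """Trigger 2: interface-driven recording of specific plans or a time window,
    also usable to re-record plans that failed in the scheduled run."""
    targets = plan_ids if plan_ids is not None else find_plans_between(start, end)
    return {p: record_plan(p) for p in targets}
```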

Playback process

1. After recording, the local DB contains all the monitoring data corresponding to each plan. During playback, this data needs to be returned accurately to the RiskFree system. To faithfully reproduce how the monitoring systems are called, a mock layer is needed that mocks the calls to the basic monitoring, business monitoring, and log monitoring interfaces. For the same input, each mock interface must return exactly the same data as the corresponding real online interface. For an actual playback, the URLs of the monitoring systems are replaced with the URL of the mock layer in the aone configuration item (a minimal sketch of such a mock endpoint follows after this list).

2. The next layer up is the running layer, which encapsulates the various playback modes: playback by planId, playback by monitoring type, playback by time period, fast playback, and so on. Developers can trigger any of these playback modes with one click through its interface. Beneath the running layer sits a concurrency layer in which the number of concurrently replayed plans is configurable. Concurrency not only compresses playback time and improves test efficiency; it also allows the algorithm's performance to be verified under high concurrency.

3. The top layer is the display layer, which includes DingTalk notifications, the test report, and trend charts. A DingTalk notification is sent at the start and end of each playback, and the one sent at the end contains a link to the test report. The test report has two parts: an overview and details. The overview shows six metrics: the total number of replayed work orders, valid interceptions, false interceptions, missed interceptions, accuracy rate, and recall rate. Each metric is an anchor that jumps directly to the corresponding position in the details. The details consist of five parts: missed plans, falsely intercepted plans, plans whose results differ from the previous playback, plans whose results differ from the online run, and all replayed plans. Each part is a table containing the online planId, the offline planId generated by the playback, the current result, the previous result, the current playback time, the previous playback time, the state of the corresponding release order, and so on. The comparison fields in the overview and details let developers quickly and accurately see the effect of an optimization and locate problems. Below is a partial screenshot of a playback result.

4. To intuitively show the effect of successive algorithm optimizations, a trend chart is automatically generated for playback results that share the same data set and monitoring type, and a corresponding link is added to the test report. The figure below shows the playback trend for A3 log analysis data from November 3 to November 10.
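As item 1 mentions, the key property of the mock layer is that, for the same request, it returns exactly the data recorded from the real monitoring interface. Below is a minimal sketch of one such mock endpoint in Python with Flask, reading from the hypothetical recorded tables sketched earlier; the route, query parameters, and table names are assumptions for illustration, not the monitoring systems' real APIs.

```python
# Hypothetical sketch of a mock-layer endpoint that replays recorded monitoring
# data. During playback, the monitoring URLs in the aone configuration would
# point at this service instead of the real monitoring systems.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/mock/<table>")
def mock_monitoring(table):
    if table not in {"ali360_record", "sunfire_record", "a3_record"}:
        return jsonify({"error": "unknown monitoring source"}), 404
    plan_id = request.args.get("planId")
    request_key = request.args.get("requestKey")
    conn = sqlite3.connect("record.db")
    row = conn.execute(
        f"SELECT payload FROM {table} WHERE plan_id = ? AND request_key = ?",
        (plan_id, request_key),
    ).fetchone()
    conn.close()
    # Return exactly what the real interface returned when the data was recorded.
    return (jsonify({"payload": row[0]}), 200) if row else (jsonify({}), 404)

if __name__ == "__main__":
    app.run(port=8080)
```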

Afterword

The most important part of algorithm testing is constructing data sets, and real online data sets are usually more representative than manually constructed ones. With a little adaptation, the recording + playback method described above can be applied to any algorithm evaluation project: string recording and playback together into an automated process, and you no longer have to worry about building and updating the data set.

About the author

Li Chaoyang, senior test development engineer in the operations and maintenance domain of Alibaba's R&D Efficiency Division. He joined Alibaba in 2016 and, as a member of the quality control team, is responsible for quality assurance of hybrid cloud, the group-wide Docker full link, and the unattended release system.