We released the first alpha of the Benchmark library at Google I/O 2019, and we have been improving it since then to help you accurately evaluate performance while tuning your code. Jetpack Benchmark tests are standard JUnit instrumentation tests that run on an Android device, using a Rule provided by the Benchmark library to handle measurement and reporting:

@get:Rule
val benchmarkRule = BenchmarkRule()

@UiThreadTest
@Test
fun simpleScroll() {
    benchmarkRule.measureRepeated {
        // Scroll RecyclerView by one item
        recyclerView.scrollBy(0, recyclerView.getLastChild().height)
    }
}

△ Sample project on GitHub

Android Studio output from a run of multiple benchmarks

The Benchmark library handles warm-up, detects configuration problems, and measures your code's performance through its JUnit Rule API.

This works well on a local development machine, but much of the value of benchmarking comes from tracking regressions in Continuous Integration (CI). So what do we do with benchmark data in CI?

Benchmarking vs correctness testing

Even when a project has thousands of correctness tests, they can be displayed compactly on a dashboard by collapsing information. Here is what our test dashboard looks like in Jetpack:

Nothing special here, but two common techniques are used to reduce the visual load. First, the list of thousands of tests is collapsed along package and class dimensions; then, packages in which every test passes are hidden by default. Just like that, nearly 20,000 test results from dozens of libraries are condensed into a few lines. A correctness-test dashboard has good control over how much data it presents.

But what about benchmarks? A benchmark doesn't simply output pass/fail; each test produces a scalar result, so we can't just collapse away the passes. We could chart the data and try to spot patterns by eye; after all, there are usually far fewer benchmarks than correctness tests…

But all you see is visual noise. Even with the results reduced from thousands to hundreds, staring at the chart directly won't help you analyze the data. Benchmarks that hold their previous performance occupy the same visual space as benchmarks that regress, so we need to filter out the data without regressions so that the regressions can stand out.

Simple regression detection method

We can start with something simple and try to get back to the pass/fail world of correctness tests. For example, we could declare a benchmark failed when its result worsens by more than a certain percentage between two runs. Because of variance, however, this approach doesn't work.
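As a rough illustration, that naive check might look like the minimal sketch below; the function and parameter names are hypothetical, and previousNs/currentNs stand for the reported results of two consecutive CI runs.

// Naive check (sketch): flag a regression when the new result is more than
// thresholdPercent slower than the previous one. Names here are illustrative.
fun looksLikeRegression(
    previousNs: Double,
    currentNs: Double,
    thresholdPercent: Double = 10.0
): Boolean {
    val percentChange = (currentNs - previousNs) / previousNs * 100.0
    return percentChange > thresholdPercent
}

fun main() {
    // A noisy benchmark can cross a fixed threshold without any real regression.
    println(looksLikeRegression(previousNs = 1_000.0, currentNs = 1_150.0)) // true (15% "slower")
}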

Benchmark data for view inflation is prone to large variance, but still provides useful signal

Although we always try to produce stable, consistent benchmark results, variance can differ significantly depending on the workload and the device the benchmark runs on. For example, we found view inflation results to be far less consistent than our other CPU-bound benchmarks. A 100% threshold would not give desirable results for every test, and we don't want to push the burden of picking thresholds (or baselines) onto benchmark authors either, since that is tedious and doesn't scale as the amount of analysis grows.

Variance can also show up as large, low-frequency spikes, when a test device produces unusually slow results across several consecutive benchmarks. We can fix some of these causes (for example, by not running tests while cores are disabled due to low battery), but it's hard to avoid all of the variance.

Spikes across RecyclerView, Ads-Identifier, and Room benchmarks; we don't want to report these as regressions

In summary, we can't identify a regression just by looking at the results of the Nth and (N-1)th builds; we need more context to make the call.

Step fitting, a scalable solution

The step fitting approach we use in Jetpack CI is provided by the Skia Perf application.

This approach looks for step functions in the benchmark data. As we walk through the sequence of results for each benchmark, we look for a step up or down in the data points as a signal that a particular build changed the benchmark's performance. But we also look at a few more data points on either side, to make sure we're seeing a consistent trend across multiple results rather than a fluke:

▷ Context can reveal that an apparently large performance drop is just a swing in a noisy benchmark's results

So how do we pick out such a step? We need to look at multiple results before and after the change:

We then calculate a weight for the regression.
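What follows is a minimal sketch of that calculation, assuming the weight is the difference between the window means divided by the pooled standard error of the samples; the names are illustrative rather than our CI's actual source.

import kotlin.math.sqrt

// Sketch: weight a potential step by how large the change in means is relative
// to the noise of the samples around those means. A larger |weight| means a
// more confident step; positive = slower after the build, negative = faster.
fun regressionWeight(before: List<Double>, after: List<Double>): Double {
    val meanBefore = before.average()
    val meanAfter = after.average()
    // Pooled sum of squared errors of each sample around its own window mean.
    val sse = before.sumOf { (it - meanBefore) * (it - meanBefore) } +
        after.sumOf { (it - meanAfter) * (it - meanAfter) }
    val stdError = sqrt(sse / (before.size + after.size))
    return (meanAfter - meanBefore) / stdError
}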

The idea is that by measuring the error of the results before and after the change, and weighting the difference between the means by that error, the lower a benchmark's variance, the more confidently we can detect subtle regressions. This lets the same system handle both nanosecond-resolution microbenchmarks and larger (by mobile standards), higher-variance benchmarks such as our database benchmarks.

You can also try it yourself! Click the Run button to try the algorithm our CI uses on data produced by a WorkManager benchmark. It outputs two links: one to the build that introduced the regression, and one to the later, related fix (click "View Changes" to see the details of each commit). These match the regression and improvement you can see when plotting the data:

With our algorithm configuration, all of the smaller noise in the graph is ignored. Once it's running, you can experiment with the two parameters that control the algorithm:

  1. WIDTH – how many commits' worth of results to consider
  2. THRESHOLD – at what point a change is reported as a regression on the dashboard

Increasing the width reduces flakiness, but it also makes it harder to catch regressions when results change frequently; we currently use a width of 5. The threshold controls overall sensitivity; we currently use 25. Lowering the threshold catches more regressions, but also produces more false positives. (See the sketch below for how the two parameters fit together.)
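Here is a sketch of a detection pass over one benchmark's result history; it reuses the regressionWeight function from the sketch above, and whether our threshold of 25 maps exactly onto this formulation is an assumption.

const val WIDTH = 5        // results to consider on each side of a build
const val THRESHOLD = 25.0 // minimum |weight| before a step is reported

// Sketch: scan a benchmark's history and return the indices of builds whose
// step weight crosses the threshold, in either direction.
fun findSteps(results: List<Double>): List<Int> {
    val flagged = mutableListOf<Int>()
    for (i in WIDTH..results.size - WIDTH) {
        val before = results.subList(i - WIDTH, i)
        val after = results.subList(i, i + WIDTH)
        val weight = regressionWeight(before, after)
        // |weight| above THRESHOLD: report it; the sign says regression vs. improvement.
        if (kotlin.math.abs(weight) > THRESHOLD) flagged += i
    }
    return flagged
}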

If you want to set this up in your own CI, you need to:

  1. Write benchmarks
  2. Run them in CI on real devices, ideally with sustained performance support
  3. Collect the output metrics from the JSON results (a sketch follows below)
  4. When a new result is ready, examine the results within a window twice the WIDTH around it

If a regression or improvement is detected, raise an alert (an email, an issue, whatever is useful to you) to inspect the performance of the builds covered by the current WIDTH.
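For step 3, the sketch below shows one way to pull per-benchmark medians out of the JSON a benchmark run produces. The field names used here ("benchmarks", "name", "metrics", "timeNs", "median") are assumptions for illustration, so check the schema emitted by the library version you use; the parsing relies on kotlinx-serialization-json.

import java.io.File
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.double
import kotlinx.serialization.json.jsonArray
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive

// Sketch: read one CI run's JSON output and map benchmark name -> median time.
// Field names are assumed for illustration; adjust to your version's schema.
fun readMedians(jsonFile: File): Map<String, Double> =
    Json.parseToJsonElement(jsonFile.readText())
        .jsonObject.getValue("benchmarks").jsonArray
        .associate { bench ->
            val obj = bench.jsonObject
            val name = obj.getValue("name").jsonPrimitive.content
            val median = obj.getValue("metrics").jsonObject
                .getValue("timeNs").jsonObject
                .getValue("median").jsonPrimitive.double
            name to median
        }

Each new median is then appended to that benchmark's history and handed to the step detection above; a flagged build triggers the alert.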

Presubmit

So what about presubmit? If you don't want regressions landing in your build at all, you can try to catch them in presubmit. Running benchmarks before a change is submitted can be a great way to prevent regressions entirely, but first remember: benchmarks, like flaky tests, need infrastructure such as the algorithm above to deal with instability.

Because presubmit checks can block the patch delivery workflow, you need much greater confidence in the regression checks you use there.

Since a single benchmark run doesn't give us enough confidence, the step fitting approach above is necessary. Here, again, we can increase confidence by getting more data: simply run the benchmark multiple times to detect whether the patch introduces a regression.

If you can live with the resource cost of running benchmarks multiple times for every code change, presubmit benchmarking can work well.

Full disclosure: we don't currently run benchmarks in presubmit for Jetpack, but if you're willing to give it a try, here are our recommendations:

  • Run the benchmark more than 5 times with and without the patch (the latter can usually be cached, or retrieved from postsubmit results); see the sketch after this list
  • Consider skipping extremely slow benchmarks
  • Don't block patch submission based on the results; just take them into account during code review. A regression is sometimes an acceptable part of improving the codebase!
  • Account for previous results that may not exist; presubmit cannot check newly added benchmarks against anything
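A minimal sketch of the first recommendation, assuming you already have the measured times for both variants; the helper names are illustrative.

// Sketch: compare medians of >= 5 runs with and without the patch, and report
// the delta for reviewers rather than blocking submission on it.
fun median(values: List<Double>): Double {
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

fun presubmitNote(baselineNs: List<Double>, patchedNs: List<Double>): String {
    val deltaPercent = (median(patchedNs) - median(baselineNs)) / median(baselineNs) * 100.0
    // Informational only: surfaced in code review, never used to reject the patch.
    return "Benchmark median changed by %+.1f%% with this patch".format(deltaPercent)
}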

Conclusion

Jetpack Benchmark provides an easy way to get accurate performance measurements from code running on an Android device. Combined with the step fitting algorithm above, you can work around instability and detect regressions before performance problems reach your users, just as we do in Jetpack CI.

Some notes on where to start:

  • Benchmark scrolling in your key UI screens
  • Add performance tests for critical code paths, CPU-heavy tasks, and interactions with third-party libraries
  • Treat improvements the same way you treat regressions; they're worth digging into too

Further reading

If you would like to read more, please refer to our presentation at the Android Developer Summit 2019:

To learn more about how Jetpack Benchmark works, check out our Google I/O talk on using benchmarks to improve app performance.

We use the Skia Perf application to track the performance of the AndroidX libraries; the benchmark results can be found at AndroidX perf.skia.org. Since it runs in our CI, you can also look at the actual source of the step fitting algorithm described here. If you want to learn more, Joe Gregorio wrote another blog post about their more advanced K-means clustering detection algorithm, explaining the specific problems and solutions the Skia project developed for handling its many configurations (different operating systems and OS versions, CPU/GPU chip and driver variants, compilers, and so on).