preface

JMH stands for the Java Microbenchmark Harness, a benchmarking framework first released in 2013. Its distinctive advantage over many other testing frameworks is that it was developed by the same people who implement the JIT compiler at Oracle. I would especially like to mention Aleksey Shipilev (JMH author and evangelist) and his excellent blog. I spent a weekend reading Aleksey's blog, especially his articles related to JMH, together with his public talk video "The Lesser of Two Evils", and summarized my findings in this article. Many of the images in this article are from Aleksey's public lecture videos.

Before reading this article

This article doesn't have a section devoted to JMH syntax, which is fine if you've already used it; and if you've never heard of it, don't worry either (a week ago, neither had I). From a Java developer's perspective, I'll look at some common pitfalls in benchmarking code, analyze how they relate to the operating system and the Java runtime underneath, and use JMH to help you escape these pitfalls.

To read through this article, you need some basic knowledge of operating systems and of JIT compilation. If a topic is unfamiliar, check out the Wikipedia links in the relevant section and the blogs I recommend.

Due to my limited ability, I cannot claim to fully understand every problem JMH solves. If there are any mistakes or omissions, please leave a comment and discuss them with me.

Getting to know JMH

Accuracy of the test

The figure above shows the orders of magnitude of time involved in different kinds of tests, and makes the point that JMH can achieve microsecond accuracy.

The challenges of testing at different orders of magnitude also differ.

  • Testing at the millisecond level is not too difficult
  • Testing at the microsecond level is challenging, but not impossible, as JMH demonstrates
  • At the nanosecond level, there is no way to measure with complete accuracy
  • Picosecond scale... Holy Shit

Illustration:

Linpack: the Linpack benchmark, a basic test that measures a system's floating-point computing power

SPEC: the Standard Performance Evaluation Corporation

Pipelining: the time consumed by system bus communication

Benchmark classification

Testing can be divided into many categories along different dimensions: integration testing, unit testing, API testing, stress testing... "Benchmark" is usually translated as "performance test". You can find benchmark modules in the package hierarchy of many open source frameworks; they document the framework's benchmark results and thereby quantify its performance.

Benchmarks can be further broken down into microbenchmarks, kernels, synthetic benchmarks and application benchmarks. The protagonist of this article is the microbenchmark. A detailed classification of basic tests can be found here.

Why you need benchmarks

If you cannot measure it, you cannot improve it.

–Lord Kelvin

As the saying goes, without practice there is no right to speak. A benchmark provides the data to back up an application and serves as the yardstick for evaluating and comparing implementations, which makes the accuracy and diversity of benchmarks particularly important.

Benchmarks act as a performance portrait of frameworks and products, so a unified standard is needed; otherwise every project grades itself on its own terms and the numbers cannot be compared. It is certainly not advisable for a framework to pick an evaluation method that favors its own scenarios. The Standard Performance Evaluation Corporation (SPEC), which appeared under "test accuracy" above, is one of the industry's standards organizations, and JMH author Aleksey is a member of it.

What JMH looks like

@Benchmark
public void measure() {
    // this method was intentionally left blank.
}

As simple to use as unit tests
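To actually launch a benchmark from code, you typically use JMH's Runner API. A minimal sketch; the include pattern matches the sample class in the report below:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LaunchBenchmark {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("JMHSample_HelloWorld") // regex selecting the benchmarks to run
                .forks(1)                        // run in one forked JVM
                .build();
        new Runner(opt).run();                   // run the benchmarks and print the report
    }
}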

Its evaluation results

Benchmark                                Mode  Cnt           Score           Error  Units
JMHSample_HelloWorld.measure  thrpt    5  3126699413.430 ± 179167212.838  ops/s

Why is JMH needed

You might be thinking, what’s wrong with me testing this way?

long start = System.currentTimeMillis();
measure();
System.out.println(System.currentTimeMillis()-start);

Isn’t that how JMH tests?

@Benchmark
public void measure() {
}

This is in fact the central question of this article, and I recommend you keep it in mind throughout: why not just use the first method? In the following sections I'll list a number of testing traps, each of which is a piece of evidence for the answer, and which should be eye-opening even for developers who are not interested in "testing".

warm-up

To close out the getting to know JMH section, a small subsection gets a head start on the topics JMH deals with by introducing a well-worn subject in Java testing: warm-up, which plays a part in all of the tests that follow.

«Warmup» = waiting for the transient responses to settle down

When writing Java tests in particular, warm-up is always an integral part of making the results believable.

The figure above shows the execution-time curve of a sample benchmark as the number of iterations increases. Performance stabilizes only after about 120 iterations, which means at least 120 warm-up iterations are needed before an accurate benchmark report can be obtained. (JVM initialization work and JIT optimization are the main reasons, though not the only ones.) Note that because each benchmark task is preceded by its own warm-up, JMH runs take a relatively long time.
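In JMH the warm-up phase is declared with annotations (or the corresponding command-line options). A minimal sketch; the class name and iteration counts are purely illustrative:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)      // discarded warm-up iterations
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS) // measured iterations
public class WarmupSketch {
    private double x = Math.PI;

    @Benchmark
    public double measure() {
        return Math.log(x);
    }
}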

Solve 12 test traps using JMH

Trap 1: Dead code elimination

measureWrong attempts to measure the performance of Math.log, yet its result is indistinguishable from the empty baseline, while measureRight differs from measureWrong only by one extra return of the computed value.
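For reference, here is a sketch of the three methods being compared, modeled on JMH's dead-code sample (JMHSample_08_DeadCode):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class DeadCodeSketch {
    private double x = Math.PI;

    @Benchmark
    public void baseline() {
        // intentionally empty: measures harness overhead only
    }

    @Benchmark
    public void measureWrong() {
        Math.log(x); // result is never used, so the JIT may eliminate the whole call
    }

    @Benchmark
    public double measureRight() {
        return Math.log(x); // returning the result keeps the computation alive
    }
}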

This is because the JIT is very good at removing "dead" code, which can quietly invalidate a measurement. Once you are aware of DCE, you should consciously consume otherwise-unused results, for example by returning them; JMH cannot automatically stop redundant code inside your method body from being eliminated.

The concept of dead code elimination is familiar to many: commented-out code, unreachable blocks, code that is reachable but whose results go unused, and so on. Here are some points Aleksey raises to illustrate why dead code elimination on reference objects is hard to avoid in hand-rolled test methods:

  1. Fast object combinator.
  2. Need to escape object to limit thread-local optimizations.
  3. Publishing the object ⇒ reference heap write ⇒ store barrier.

I'll be honest: this is beyond me, and I haven't fully worked out these points, so I can only pass them on to you verbatim.

JMH provides a dedicated API, Blackhole, to avoid the dead code elimination problem.

@Benchmark
public void measureRight(Blackhole bh) {
    bh.consume(Math.log(PI));
}

Trap 2: Constant folding and constant propagation

Constant folding is the simplification of constant expressions at compile time. A constant is a representation of a fixed value, such as the integer 2. A variable can effectively become a constant if it is never modified, or it can be explicitly marked as one, as in the following example:

  i = 320 * 200 * 32;

Most modern compilers do not actually emit the two multiply instructions and the store. Instead, they recognize the structure of the statement and compute the value at compile time (in this case, 2,048,000).

In some compilers, constant folding happens at an early stage; in Java, for example, variables marked with the final keyword get special treatment. Compilers that fold constants at a later stage are also quite common.

private double x = Math.PI;

// The compiler gives special treatment to final variables
private final double wrongX = Math.PI;

@Benchmark
public double baseline() { // 2.220 ± 0.352 ns/op
    return Math.PI;
}

@Benchmark
public double measureWrong_1() { // 2.220 ± 0.352 ns/op
    // Error, the result can be predicted, constant folding occurs
    return Math.log(Math.PI);
}

@Benchmark
public double measureWrong_2() { // 2.220 ± 0.352 ns/op
    // Error, the result can be predicted, constant folding occurs
    return Math.log(wrongX);
}

@Benchmark
public double measureRight() { // 22.590 ± 2.636 ns/op
    return Math.log(x);
}

JMH verifies this: only the last method, measureRight, correctly measures Math.log; measureWrong_1 and measureWrong_2 are both affected by constant folding.

Constant propagation is the process of substituting the known values of constants into expressions, also carried out at compile time, and it includes constants defined as above. It applies to built-in functions called on constants as well, as described below:

  int x = 14;
  int y = 7 - x / 2;
  return y * (28 / x + 2);

Propagation can be understood as substituting in the values of variables. Applying it repeatedly (x = 14, so y = 7 - 14/2 = 0, and the return value is 0 * (28/14 + 2) = 0), the code above becomes:

  int x = 14;
  int y = 0;
  return 0;

Trap 3: Never write loops in tests

This pitfall has such a huge impact on everyday testing that I made it the title of this section: never write loops in tests!

This section touches on many topics: loop unrolling, and the JIT & OSR optimizations applied to loops. For the definition of loop unrolling, I recommend going straight to the Wikipedia entry; for the JIT & OSR loop optimizations, two answers by RednaxelaFX ("R大") are recommended:

Two for loops with the same length and the same body code are 100 times different in execution time?

What is the mechanism of On-Stack Replacement (OSR)?

For the first, I recommend going directly to the answer rather than the question; the second explains what OSR does to loops.

To test a short method, an entry-level programmer (one who doesn't know about dynamic compilation) would write it like this: loop over it many times and take the average.

public class BadMicrobenchmark {
    public static void main(String[] args) {
        long startTime = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            reps();
        }
        long endTime = System.nanoTime();
        System.out.println("ns/op : " + (endTime - startTime));
    }
}

In practice, the results of this code are unpredictable: too many confounding factors interfere with them. Benchmarks iterate the reps method many times in the hope of getting an accurate picture of its performance. (Note that using loops inside JMH benchmarks is also a bad idea; unless you are a benchmarking expert, don't write loops in tests at all.)

int x = 1;
int y = 2;

@Benchmark
public int measureRight() {
    return (x + y);
}

private int reps(int reps) {
    int s = 0;
    for (int i = 0; i < reps; i++) {
        s += (x + y);
    }
    return s;
}

@Benchmark
@OperationsPerInvocation(1)
public int measureWrong_1() {
    return reps(1);
}

@Benchmark
@OperationsPerInvocation(10)
public int measureWrong_10() {
    return reps(10);
}

@Benchmark
@OperationsPerInvocation(100)
public int measureWrong_100() {
    return reps(100);
}

@Benchmark
@OperationsPerInvocation(1000)
public int measureWrong_1000() {
    return reps(1000);
}

@Benchmark
@OperationsPerInvocation(10000)
public int measureWrong_10000() {
    return reps(10000);
}

@Benchmark
@OperationsPerInvocation(100000)
public int measureWrong_100000() {
    return reps(100000);
}

The results are as follows:

Benchmark                               Mode  Cnt  Score   Error  Units
JMHSample_11_Loops.measureRight         avgt    5  2.343 ± 0.199  ns/op
JMHSample_11_Loops.measureWrong_1       avgt    5  2.358 ± 0.166  ns/op
JMHSample_11_Loops.measureWrong_10      avgt    5  0.326 ± 0.354  ns/op
JMHSample_11_Loops.measureWrong_100     avgt    5  0.032 ± 0.011  ns/op
JMHSample_11_Loops.measureWrong_1000    avgt    5  0.025 ± 0.002  ns/op
JMHSample_11_Loops.measureWrong_10000   avgt    5  0.022 ± 0.005  ns/op
JMHSample_11_Loops.measureWrong_100000  avgt    5  0.019 ± 0.001  ns/op

Without the Wrong/Right hints in the method names, which of these results would you believe? In fact, the reported time falls from 2.358 to 0.019 ns/op as the iteration count grows. BadMicrobenchmark, the hand-written test loop, has the same problem, except that it doesn't even warm up, which makes it strictly less reliable than the JMH loops above.

Aleksey's conclusion in the video: assume a single iteration takes M ns. Under the combined effects of JIT compilation, OSR, loop unrolling and other factors, the measured time of the iterated version is αM ns, where α ∈ [0, +∞).

The correct way to benchmark a loop can be seen here.
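For reference, the pattern recommended by JMH's safe-looping sample is to hand each per-element result to a Blackhole instead of accumulating it yourself. A sketch under that assumption (xs and work() are illustrative stand-ins):

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class SafeLoopSketch {
    private int[] xs = new int[1000];

    private int work(int x) {
        return x * x; // stand-in for the method under test
    }

    @Benchmark
    @OperationsPerInvocation(1000)
    public void measureRight(Blackhole bh) {
        for (int x : xs) {
            bh.consume(work(x)); // consuming every result defeats DCE and loop optimizations
        }
    }
}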

Trap 4: Use forks to isolate multiple test methods

Believe me, the example in this trap is easily the weirdest one in the JMH samples, and I have not found a rigorous explanation for it (to be honest, I replayed this part of the video several times and still didn't get it; forgive me).

We define a Counter interface and implement two identical implementation classes: Counter1 and Counter2

public interface Counter {
    int inc();
}

public class Counter1 implements Counter {
    private int x;

    @Override
    public int inc() {
        return x++;
    }
}

public class Counter2 implements Counter {
    private int x;

    @Override
    public int inc() {
        return x++;
    }
}

Then we benchmark them in the same VM, one after another:

public int measure(Counter c) {
    int s = 0;
    for (int i = 0; i < 10; i++) {
        s += c.inc();
    }
    return s;
}

/* These are the two counters. */
Counter c1 = new Counter1();
Counter c2 = new Counter2();

/*
 * We first measure Counter1 alone...
 * Fork(0) makes the benchmark run in the same JVM.
 */
@Benchmark
@Fork(0)
public int measure_1_c1() {
    return measure(c1);
}

/* Then Counter2... */
@Benchmark
@Fork(0)
public int measure_2_c2() {
    return measure(c2);
}

/* Then Counter1 again... */
@Benchmark
@Fork(0)
public int measure_3_c1_again() {
    return measure(c1);
}

@Benchmark
@Fork(1)
public int measure_4_forked_c1() {
    return measure(c1);
}

@Benchmark
@Fork(1)
public int measure_5_forked_c2() {
    return measure(c2);
}

This example uses the Fork annotation, so let me briefly introduce it. Fork, as the name suggests, replicates the run environment: by default JMH runs every test in a separate, freshly forked JVM process, which is why our earlier tests silently enjoyed isolated, identical environments, and it also lets you pass additional JVM parameters. Fork(0) disables that isolation and is used here precisely to illustrate the drawbacks of running everything in one JVM: picture the sequence c1, then c2, then c1 again.

Benchmark                                 Mode  Cnt   Score   Error  Units
JMHSample_12_Forking.measure_1_c1         avgt    5   2.518 ± 0.622  ns/op
JMHSample_12_Forking.measure_2_c2         avgt    5  14.080 ± 0.283  ns/op
JMHSample_12_Forking.measure_3_c1_again   avgt    5  13.462 ± 0.164  ns/op
JMHSample_12_Forking.measure_4_forked_c1  avgt    5   3.861 ± 0.712  ns/op
JMHSample_12_Forking.measure_5_forked_c2  avgt    5   3.574 ± 0.220  ns/op

Would you be surprised that the very first run of c1 takes the least time? Intuitively the JIT has barely warmed up at that point, and there is no obvious reason the method that runs first should be so much faster than the ones after it! Yet this matches what Aleksey describes in the video.

The point the JMH samples want to make is that benchmarks run in the same JVM affect each other: c1, c2 and c1_again share an identical implementation yet score differently, because they share one JVM. (A common explanation is profile pollution: while only Counter1 has been exercised, the c.inc() call site has a single known target and can be inlined aggressively; once Counter2 has been loaded and profiled, the call site no longer has a single target and is optimized less aggressively.) forked_c1 and forked_c2, by contrast, show consistent performance. So, barring a specific reason, the value of Fork should be set > 0.
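As an aside, and continuing the example above, @Fork can also pin down the JVM configuration per benchmark. A small illustrative sketch; the heap flags here are arbitrary:

@Benchmark
@Fork(value = 2, jvmArgsAppend = {"-Xms1g", "-Xmx1g"}) // two fresh JVMs, each with a fixed heap
public int measure_forked_with_args() {
    return measure(c1);
}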

Trap 5: Method inlining

Those familiar with C/C++ will know method inlining: the body of the target method is "copied" into the calling method, avoiding an actual method call (and its instruction-cycle cost). In Java you cannot hand-write inline methods, but the JVM automatically identifies hot methods and applies inlining to them. How many executions it takes for code to be treated as hot is usually set by the -XX:CompileThreshold parameter:

  • When using the Client compiler, the default is 1500.
  • When using the Server compiler, the default is 10000.

But even when the JVM flags a method as hot, it does not necessarily inline it. One of the more common reasons is that the method body is too large, which splits into two cases:

  • Frequently executed methods are inlined by default if the method is smaller than 325 bytes of bytecode (use -XX:MaxFreqInlineSize=N to set the size)
  • Infrequently executed methods are inlined by default only if the method is smaller than 35 bytes of bytecode (use -XX:MaxInlineSize=N to set the size)

We can raise these thresholds so that more methods are inlined, but that is not recommended unless it yields a significant, measured improvement: larger inlined methods mean more code-cache usage, fewer hot methods fit in the cache, and the net result is not necessarily better.

If you want to know whether a method was inlined, you can print it with the following JVM parameters:

-XX:+PrintCompilation           // print JIT compilation activity to the console
-XX:+UnlockDiagnosticVMOptions  // unlock diagnostic JVM options (off by default; enables certain diagnostic flags)
-XX:+PrintInlining              // print inlining decisions

Other implied conditions for method inlining

  • Although the JIT optimizes with a view of the whole program, a method that can be overridden still requires a type check before it can be inlined
  • If you want hot methods to benefit from inlining, modify them with final, private or static wherever possible, so they cannot be overridden and do not pay for the extra type check

Method inlining can likewise affect a benchmark. Conversely, when we deliberately trigger inlining to optimize code, we can use JMH to compare it against the non-inlined version:

public void target_blank() {
    // this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void target_dontInline() {
    // this method was intentionally left blank
}

@CompilerControl(CompilerControl.Mode.INLINE)
public void target_inline() {
    // this method was intentionally left blank
}
Benchmark                                Mode  Cnt   Score    Error  Units
JMHSample_16_CompilerControl.blank       avgt    3   0.323 ±  0.544  ns/op
JMHSample_16_CompilerControl.dontinline  avgt    3   2.099 ±  7.515  ns/op
JMHSample_16_CompilerControl.inline      avgt    3   0.308 ±  0.264  ns/op

As you can see, the performance difference between inlined and non-inlined methods is huge, a classic trade of space for time; in JMH, CompilerControl.Mode controls whether inlining is applied.

Trap 6: False sharing and cache lines

Once again we meet our old friends: CPU caches and cache line padding. I covered this concurrency killer in a previous article; if you haven't read it, it's here: JAVA pickups: CPU Caches and Cache Lines. In a benchmark, you sometimes cannot afford to ignore the effect of cache lines on your measurements.

For reasons of space, I won't explore the false sharing pitfalls here. A complete example can be found in JMHSample_22_FalseSharing.

JMH's @State annotation takes care of false sharing between benchmark states, but it cannot pad individual fields inside a single object. When that is needed, the JDK's @Contended annotation can handle it.
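A minimal sketch of what that padding looks like, assuming JDK 8's sun.misc.Contended (the annotation moved to jdk.internal.vm.annotation in JDK 9). Note that outside the JDK itself, the annotation is only honored when the JVM is started with -XX:-RestrictContended:

public class ContendedSketch {
    @sun.misc.Contended        // isolate this hot field on its own cache line(s)
    volatile long writerIndex;

    volatile long readerIndex; // unpadded: may share a cache line with its neighbors
}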

Aleksey has contributed a number of optimizations to Java itself, including the @Contended annotation.

Trap 7: Branch prediction

Branch prediction is the "trickster" behind the final code benchmark in this article. As usual, let's start from the results of a concrete benchmark. The code below traverses two arrays of equal length, one sorted and one unsorted, with a conditional inside the loop that is the key to branch prediction: if (v > 0).

private static final int COUNT = 1024 * 1024;

private byte[] sorted;
private byte[] unsorted;

@Setup
public void setup() {
    sorted = new byte[COUNT];
    unsorted = new byte[COUNT];
    Random random = new Random(1234);
    random.nextBytes(sorted);
    random.nextBytes(unsorted);
    Arrays.sort(sorted);
}

@Benchmark
@OperationsPerInvocation(COUNT)
public void sorted(Blackhole bh1, Blackhole bh2) {
    for (byte v : sorted) {
        if (v > 0) { // key
            bh1.consume(v);
        } else {
            bh2.consume(v);
        }
    }
}

@Benchmark
@OperationsPerInvocation(COUNT)
public void unsorted(Blackhole bh1, Blackhole bh2) {
    for (byte v : unsorted) {
        if (v > 0) { // key
            bh1.consume(v);
        } else {
            bh2.consume(v);
        }
    }
}

Benchmark                               Mode  Cnt  Score   Error  Units
JMHSample_36_BranchPrediction.sorted    avgt   25  2.752 ± 0.154  ns/op
JMHSample_36_BranchPrediction.unsorted  avgt   25  8.175 ± 0.883  ns/op

The result: traversing the sorted array is roughly three times faster than traversing the unsorted one. The best explanation comes from a famous Stack Overflow answer: Why is processing a sorted array faster than processing an unsorted array?

Imagine it's the 19th century and you operate a railroad junction, long before telephones and mobile phones, so you don't know which way an approaching train needs to go. So you stop the train, ask the driver which direction they are headed, and set the switch accordingly.

Note also that trains carry enormous inertia: the driver has to start braking far in advance, and once you have set the track correctly, the train takes a long time to get back up to speed.

So is there a better way to reduce train waiting times?

There is a very simple one: set the switch in a chosen direction in advance. Which direction? You guess:

  • If you guessed right, the train passes straight through, losing no time.
  • If you guessed wrong, the train stops, backs up, you flip the switch the other way, and the train restarts, accelerates, and continues on.

If you are lucky and guess right every time, the train never stops and just keeps going! If you guess wrong, you waste a lot of time.

Though not exact, the same reasoning applies to CPU branch prediction. A sorted array makes the prediction right most of the time, which is why traversing the sorted array with a conditional is faster than traversing the unsorted one.

The practical lesson is to avoid piles of conditionals inside large-scale loop logic (can they be hoisted out of the loop? See the sketch below).
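As an illustration (data, useFastPath, fast() and slow() are hypothetical), a loop-invariant condition can be hoisted out of the loop:

static void process(int[] data, boolean useFastPath) {
    // Checking a loop-invariant flag on every iteration exercises the branch predictor needlessly:
    for (int i = 0; i < data.length; i++) {
        if (useFastPath) { fast(data[i]); } else { slow(data[i]); }
    }
}

static void processHoisted(int[] data, boolean useFastPath) {
    // Hoisting the branch out of the loop removes it from the hot path entirely:
    if (useFastPath) {
        for (int i = 0; i < data.length; i++) fast(data[i]);
    } else {
        for (int i = 0; i < data.length; i++) slow(data[i]);
    }
}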

Trap 8: Multithreaded testing

Running a test method on a 4-core system yields the results shown above: ops/nsec is the number of operations per unit time, and Scale is the throughput of the 2- and 4-thread runs relative to the 1-thread run.

This graph raises two questions:

  1. Why is there barely any change from 2 threads to 4 threads?
  2. Why are 2 threads only 1.87 times faster than 1 thread, rather than 2 times?

1. Power management

The first factor: multithreaded tests are affected by the operating system's power management. Many systems manage the trade-off between power consumption and performance (e.g. cpufreq, SpeedStep, Cool'n'Quiet, TurboBoost).

When we deliberately lowered the machine's clock frequency, overall performance dropped, but scaling from 1 thread to 2 became an exact 2x.

This problem is avoidable: the remedy is to disable power management and pin the CPU clock frequency.

JMH mitigates it by running for a long time and by making sure its threads never enter the park (timed waiting) state.

2. Operating system scheduling and the time-sharing model

The second source of multithreaded-testing traps is best understood through the two thread scheduling models: time-sharing and preemptive scheduling.

Under the time-sharing model, all threads take turns getting the CPU, each receiving an equal time slice, which is easy to understand. Under the preemptive model, the runnable thread with the highest priority gets the CPU first; if several runnable threads share that priority, one of them is chosen at random. A running thread then keeps the CPU until it has to give it up, for example because it blocks or is preempted.

Note that thread scheduling is not cross-platform: it depends not only on the Java virtual machine but also on the operating system. On some operating systems, a running thread keeps the CPU until it blocks; on others, even a thread that never blocks is descheduled after a while to give other threads a chance to run.

Under either model, thread context switches cost time. So far, though, this only answers the second question: why 2 threads are only 1.87 times faster than 1, rather than 2 times.

Since the two figures above are taken from Aleksey's video, I don't know his actual test case. The negligible difference between 2 and 4 threads can only be read as the system being saturated; in principle, on a 4-core machine, 4 threads should not be barely faster than 2.

JMH introduces the concept of bogus iterations to deal with the instability caused by time-sharing and thread scheduling; it ensures that only threads in the busy state are measured during a multithreaded test.

The term "bogus iterations" deserves a mention: I read it as "pseudo-iterations". It only appears in JMH's notes and a few of Aleksey's blog posts, and is best understood as a term of art for JMH's inner workings.

conclusion

This article has spent most of its length explaining why JMH exists, through the pitfalls demonstrated in the JMH samples, all of which are easily triggered by non-standard benchmarking procedures. As a Java developer you should at least know that these phenomena exist; after all, JMH solves most of them for you: you don't have to manage warm-up yourself, you don't have to hand-roll a timing loop, and avoiding these testing traps becomes comparatively easy.

In fact, the topics covered in this article are only the tip of the iceberg of Aleksey's blog and the 38 JMH samples. If you are interested, you can click here to see all of the JMH samples.

The traps' inner monologue: there are 30 more of us!

Excellent open source frameworks such as Kafka ship a dedicated module of JMH benchmarks. Try using JMH as your own benchmarking tool.

Welcome to follow my WeChat official account, "Kirito technology sharing". Any questions about this article will be answered there, along with more Java-related technology sharing.