Stop The World and SafePoint (OpenJDK 11)

This article is based on OpenJDK 11

When analyzing online JVM performance issues, we might encounter the following scenarios:

1. The GC itself did not take long, but the JVM paused for a long time, such as the following:2. The JVM does not have GC, but the program pauses for a long time, and this happens from time to time.

These questions are usually related to SafePoint and Stop the World.

What is SafePoint? What is Stop the World? What’s the relationship between them?

Consider the following scenario:

When GC is required, you need to know which objects are still in use or are no longer in use and can be reclaimed, which requires object usage per thread.
For Biased locking, the desire to unbiased at high concurrency requires precise information about the thread state and the thread that acquired the Lock.
Just-in-time compilation optimization (OSR stack replacement) or de-optimization (bailout stack de-optimization) of a method requires information about where the thread actually runs to the method.

All of these operations require various information about the thread, such as what is in the register, the heap usage, the stack method code, etc., and to do these operations, the thread needs to pause until the operation is complete, otherwise there will be concurrency problems. That’s where SafePoint comes in.

Safepoint can be understood as a special place in the execution of code where a thread can pause when it is executing. SafePoint holds some running information about the current thread that is not available elsewhere for other threads to read. This information includes any information about the thread context, such as internal Pointers to objects or non-objects, and so on. We generally understand SafePoint as follows: Only when the thread runs to the SafePoint location, all its state information is determined, and only then can we know what memory is used and what is not used by the thread. In addition, only if the thread is in a SafePoint position will it be aware of changes to the JVM stack, such as the retaking of unused memory, and then continue running. Each thread has a snapshot of its own memory usage, and is unaware of changes made by other threads. It’s only going to be sensed when you go to SafePoint.

Therefore, GC must require all threads to enter SafePoint at the same time and stay there until GC finishes processing memory before all threads continue. For example, when all threads enter SafePoint and wait, it is Stop the world. .

Why SafePoint and Stop The World?

The SafePoint location holds anything in the thread context, including objects, internal Pointers to objects or non-objects, which the thread is aware of when it is in SafePoint. Therefore, only when the thread is in SafePoint can GC be performed On the memory used by the thread and changes to the code being executed, such as OSR (On Stack Replacement, Replace existing code on stack as JIT-optimized code) or Bailout (replace JIT over-optimized code on stack as un-optimized code). Also, an important Java threading feature that is implemented with SafePoint is Thread.interrupt(), which means the Thread is not aware of interrupted until SafePoint.

Why do we need to Stop The World? Sometimes we need to have all of The threads in SafePoint globally so that we can figure out how much memory we can reclaim for GC, recycle code we don’t need anymore to clean up The CodeCache, And executing some Java Instrument command or JDK tool, such as jStack print stack, requires stopping the World to take a snapshot of all current threads.

How is SafePoint implemented?

SafePoint can be inserted at certain points in the code. Each thread running SafePoint code will proactively check whether it needs to start SafePoint. This proactive checking process is called Polling

In theory, you could place a machine command checking Safepoint at the boundary of every Java compiled bytecode. At this point in the thread’s execution, Polling will be executed to ask the JVM whether it needs to enter SafePoint. This query has a performance penalty, so the JIT will optimize to minimize SafePoint.

Jit-compiled optimized code will place a SafePoint before all methods return and before all non-counted loops (unbounded loops) jump back. In case GC needs to Stop the world, the thread cannot be paused. But for explicitly bounded loops, to reduce SafePoint, it is not possible to place a SafePoint before jumping back, i.e. :

for (int i = 0; i < 100000000; i++) {
    ...
}
Copy the code

SafePoint is not in there, which leads to some of the performance optimization issues I’ll discuss later. Note that for int bounded loops, for example, if int I is changed to long I, there will still be SafePoint;

SafePoint implementation related source code: SafePoint. CPP

As you can see, there are five scenarios for SafePoint threads; Suppose there is now an operation that triggers a VM thread and all threads need to enter SafePoint (for example, GC is now required). If the other threads are now:

Running bytecode: When running bytecode, the interpreter will see if the thread is marked as poll armed, if so, the VM thread callsSafepointSynchronize::block(JavaThread *thread)To carry on the block.
Run native code: When running native code, the VM thread will skip this thread, but set the thread to poll armed so that after executing native code, it will check whether it is poll armed and block directly if it needs to stop at SafePoint.
Running jIT-compiled code: Since you are running compiled machine code, check whether the local local Polling page is dirty. If it is dirty, you need a block. This feature was introduced after Java 10 introduced JEP 312: Thread-Local Handshakes, which allowed you to simply check whether a Local Local page was dirty.
In BLOCK state: Do not leave the BLOCK state until the operation that requires all threads to enter SafePoint is complete
In the thread switching state or the VM running state: polls the thread state until it is in the blocked state (the thread must be in any of the four states mentioned above, and any of them will block).

In what cases does Stop the World occur when all threads enter SafePoint?

Enter SafePoint periodically: each-XX:GuaranteedSafepointIntervalThe time is configured to allow all threads to enter Safepoint, and once all threads have entered Safepoint, it will immediately recover from Safepoint. This timing is for tasks that don’t necessarily Stop the World immediately, and can be set-XX:GuaranteedSafepointInterval=0Turn off this timer. I recommend turning it off.
Because commands such as JStack, jmap, and jstat, most of the commands processed by the Signal Dispatcher thread, cause Stop the world: These commands all collect stack information, so all threads need to go into Safepoint and pause.
Biased lock cancellation (this does not necessarily cause an overall Stop the world, see JEP 312: Thread-local Handshakes: Java believes that locks are mostly non-competitive (a synchronized block will most likely not have multiple threads competing at the same time), so you can use bias to improve performance. That is, in the case of non-contention, when the thread that acquired the lock obtains the lock again, it will determine whether the lock is biased to me, so that the thread will not need to acquire the lock again, and can directly enter the synchronized block. However, in the case of high concurrency, the bias lock will often fail, so you need to cancel the bias lock. When you cancel the bias lock, you need to Stop the world, because you need to obtain the status of each thread using the lock and the running status.
Agent loading and class redefinition caused by Java Instrument: Stop the World is required to modify the information related to this class on the stack due to class redefinition
Java Code Cache: When JIT compilation optimizations occur, OSR, Bailout, or cleanup of the Code Cache, you need to Stop the world because you need to read the method of thread execution and change the method of thread execution
GC: This needs to Stop the World because it requires object usage information for each thread, and it needs to reclaim some objects, free some heap memory or direct memory
Some events for JFR: If collection of OldObject is enabled for JFR, this is a periodic collection of longer-lived objects, so you need to Stop the World. In addition, when JFR dumps, each thread has a BUFFER of JFR events and needs to collect the events in the buffer. Therefore, the world needs to be stopped.

Other events, not often encountered, can refer to the source vmoperations.hpp

#define VM_OPS_DO(template) \template(None) \ template(Cleanup) \ template(ThreadDump) \ template(PrintThreads) \ template(FindDeadlocks) \ template(ClearICs) \ template(ForceSafepoint) \ template(ForceAsyncSafepoint) \ template(DeoptimizeFrame) \ template(DeoptimizeAll) \ template(ZombieAll) \ template(Verify) \ template(PrintJNI) \ template(HeapDumper) \ template(DeoptimizeTheWorld) \ template(CollectForMetadataAllocation) \ template(GC_HeapInspection) \ template(GenCollectFull) \ template(GenCollectFullConcurrent) \ template(GenCollectForAllocation) \ template(ParallelGCFailedAllocation) \ template(ParallelGCSystemGC) \ template(G1CollectForAllocation) \ template(G1CollectFull) \ template(G1Concurrent) \ template(G1TryInitiateConcMark) \ template(ZMarkStart) \ template(ZMarkEnd) \ template(ZRelocateStart) \ template(ZVerify) \ template(HandshakeOneThread) \ template(HandshakeAllThreads) \ template(HandshakeFallback) \ template(EnableBiasedLocking) \ template(BulkRevokeBias) \  template(PopulateDumpSharedSpace) \ template(JNIFunctionTableCopier) \ template(RedefineClasses) \ template(UpdateForPopTopFrame) \ template(SetFramePop) \ template(GetObjectMonitorUsage) \ template(GetAllStackTraces) \  template(GetThreadListStackTraces) \ template(GetFrameCount) \ template(GetFrameLocation) \ template(ChangeBreakpoints)  \ template(GetOrSetLocal) \ template(GetCurrentLocation) \ template(ChangeSingleStep) \ template(HeapWalkOperation) \ template(HeapIterateOperation) \ template(ReportJavaOutOfMemory) \ template(JFRCheckpoint) \ template(ShenandoahFullGC) \ template(ShenandoahInitMark) \ template(ShenandoahFinalMarkStartEvac) \ template(ShenandoahInitUpdateRefs) \ template(ShenandoahFinalUpdateRefs) \ template(ShenandoahDegeneratedGC) \ template(Exit) \ template(LinuxDllLoad) \ template(RotateGCLog) \ template(WhiteBoxOperation) \ template(JVMCIResizeCounters) \ template(ClassLoaderStatsOperation) \ template(ClassLoaderHierarchyOperation) \ template(DumpHashtable) \ template(DumpTouchedMethods) \ template(PrintCompileQueue) \ template(PrintClassHierarchy) \ template(ThreadSuspend) \ template(ThreadsSuspendJVMTI) \ template(ICBufferFull) \ template(ScavengeMonitors) \ template(PrintMetadata) \ template(GTestExecuteAtSafepoint) \ template(JFROldObject) \Copy the code

What causes Stop the World to take too long?

The Stop the World phase can be simply broken down (during this period, the JVM is in a Safepoint block state for all threads) :

An action that needs to Stop the world (i.e., the above mentioned cases where all threads enter SafePoint, i.e., Stop the World)
Send a Safepoint synchronization Signal to the JVM daemon thread Signal Dispatcher and send it to the corresponding module for execution.
Corresponding module, collect all thread information, and do different operations and tags for each thread according to the state (according to the description of the previous source code section, there are 5 cases)
All threads enter Safepoint and block.
Do the action that requires you to initiate Stop the world.
The operation is complete and all threads are restored from Safepoint.

Based on these stages, the reasons why Stop the World takes too long are as follows:

Phase 4 is too long, that is, waiting too long for some of the threads out of all to enter Safepoint, which is likely to be trueBig bound loops and JIT optimizationRelated, and most likely the class that OpenJDK 11 introduced to get the call stackStackWalkerIt may also be caused by CPU resource problems, too many dirty pages in the system memory, or swap.
Phase 5 takes too long, we need to see what operations are responsible for it, such as biased lock undo too much, GC takes too long, etc. We need to find ways to reduce the time spent by these operations, or we need to directly close these events (such as biased lock closing, OldObjectSample event collection of JFR) to reduce entry. This is not relevant to this post and will not be covered here.
Phase 2 and Phase 3 took too long. Since the Signal Dispatcher is single-threaded, you can see what the Signal Dispatcher thread is doing at that time, which may be caused by the Signal Dispatcher doing other operations. It may also be caused by CPU resource problems, too many dirty pages in the system memory, or swap.

What are the problems with bounded loops and JIT optimization for SafePoint?

Known: The current state of Thread.intterupted() is not known until the Thread has executed Safepoint code, not the Thread’s local cache.

Let’s look at the following code:

static int algorithm(int n) {
    int bestSoFar = 0;
    for (int i=0; i<n; ++i) {
        if (Thread.interrupted()) {
            System.out.println("broken by interrupted");
            break; } bestSoFar = (int) math.pow (I, 0.3);} bestSoFar = (int) math.pow (I, 0.3); }return bestSoFar;
}
public static void main(String[] args) throws InterruptedException {
    Runnable task = () -> {
        Instant start = Instant.now();
        int bestSoFar = algorithm(1000000000);
        double durationInMillis = Duration.between(start, Instant.now()).toMillis();
        System.out.println("after "+durationInMillis+" ms, the result is "+bestSoFar); }; // Interrupt Thread t = new Thread(task); t.start(); Thread.sleep(1); t.interrupt(); // Interrupt = new Thread(task); t.start(); Thread.sleep(10); t.interrupt(); // Interrupt = new Thread(task); t.start(); Thread.sleep(100); t.interrupt(); // Interrupt the algorithm after 1sforThe number of loop calls should be enough to cause code just-in-time compilation optimization and OSR t = new Thread(task); t.start(); Thread.sleep(1000); Tinnterrupt (); // Discover that the thread will no longer react to interrupt(); }Copy the code

After using the JVM parameter – Xlog: + jit compilation = debug: file % = jit_compile t.l og: uptime, level, tags: filecount = 10, filesize = 100 m to print the jit Compile the log to another file for easy viewing. Finally, the console output:

Broken by interrupted broken by interrupted after 10.0 ms, the result is 27 after 1.0 ms, The result is 10 broken by interrupted after 99.0 ms, the result is 69 after 29114.0 ms, the result is 501Copy the code

As you can see, the last run of the loop ended without seeing that the thread had interrupted. And the JIT compilation log shows that code just-in-time compilation optimization and OSR occurred while the last thread was executing the loop:

[s] 0.782 [debug] [jit compilation] 460% of 3 com. Test. TypeTest: : algorithm @ 4 bytes (44) [s] 0.784 [debug] [jit compilation] 468 3 com.. Test TypeTest: : algorithm (44 bytes) [s] 0.794 [debug] [jit compilation] 486% 4 com. Test TypeTest: : algorithm @ 4 (44 bytes) [s] 0.797 [debug] [jit compilation] 460% of 3 com. Test TypeTest: : algorithm @ 4 bytes (44) made not entrant [s] 0.799 [debug] [jit compilation] 503 4 com. Test TypeTest: : algorithm (44 bytes)Copy the code

3 and 4 represent the compilation level, and % represents the OSR stack replacement method, which is the machine code replacement of the executing code while the for loop is still executing. After this, the thread will not see that the thread has interrupted, which means that the Safepoint in the for loop will be removed **. The problem with this is obvious: when you need to Stop the World, all threads will wait for the loop to complete, because the thread can’t enter Safepoint until it completes the loop.

So, how do you optimize?

The first way is to change the code to make a for int loop of type for long:

for (long i=0; i<n; ++i) {
    if (Thread.interrupted()) {
        System.out.println("broken by interrupted");
        break; } bestSoFar = (int) math.pow (I, 0.3);} bestSoFar = (int) math.pow (I, 0.3); }Copy the code

The second is to use the -xx :+UseCountedLoopSafepoints argument to allow the JIT to optimize the code without removing safePoints from bounded loops

Console output after one of these two ways:

Broken by interrupted broken by interrupted after 0.0 ms, the result is 0 after 10.0 ms, The result is 29 broken by interrupted after 100.0 ms, the result is 73 broken by interrupted after 998.0 ms, the result is 170Copy the code

How do I analyze SafePoint through logs?

Currently, in OpenJDK 11, there are two major safePoint-related logs. One is basically only used for development, and the other can be used online using continuous collection.

XX: the first is – + PrintSafepointStatistics – XX: PrintSafepointStatisticsCount = 1, the timing acquisition, but acquisition will trigger all threads into Safepoint, so, Threads are generally not open (even less so since we had to close all threads to enter Safepoint on time). Also, it was removed in Java 12, and the following logging configuration basically replaces this, so we won’t go into this here.

Another is through – Xlog: safepoint = trace: stdout: utctime, level, tags, for its logging configuration, you can refer to my another article parsing the configuration format, here we use directly.

We have configured all SafePoint related JVM logs to be output to the console. When stopping the World, it will output something like this:

[2020-07-14t07:08:26.197 +0000][debug][safePoint] SafePoint synchronization. (112 Threads) [the 2020-07-14 T07:08:26. 197 + 0000] [info] [safepoint] Application time: 12.4565068 seconds [2020-07-14t07:08:26.197 +0000][trace][SafePoint] Setting threadlocal yield flag forThreads [2020-07-14t07:08:26.197 +0000][trace][safePoint] Thread: 0x0000022C7C494b30 [0x61dc] State: _at_poll_safePoint 0 [2020-07-14T07:08:26.197+0000][trace][SafePoint] Thread 0x0000022c7c497f30 [0x4ff8] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 ...... [2020-07-14T07:08:26.197+0000][trace][safePoint] Thread: 0x0000022c10c010b0 [0x5878] State: _call_back _has_called_back 0 _at_poll_safePoint 0 [2020-07-14T07:08:26.348+0000][trace][SafePoint] Thread: 0x0000022c10bfe560 [0x5038] State: _at_poll_safePoint 0 [2020-07-14t07:08:26.197 +0000][debug][safePoint] Waitingfor1 Thread (s) to block [2020-07-14t07:08:29.348 +0000][info][SafePoint] Entering safepoint region: G1CollectForAllocation [2020-07-14t07:08:29.350 +0000][info][SafePoint] Leaving SafePoint Region [the 2020-07-14 T07:08:29. 350 + 0000] [info] [safepoint] Total timefor whichStopping Threads took: 3.1467255 seconds For Stopping ThreadsCopy the code

First, phase 1 will print a log, which is debug level. This indicates that Safepoint is about to start for all threads globally. At this point, the JVM will start to fail to respond to requests.

[2020-07-14t07:08:29.347 +0000][debug][safePoint] SafePoint synchronization. (112 Threads)Copy the code

Phase 2 does not print logs, and Phase 3 prints:

[the 2020-07-14 T07:08:26. 197 + 0000] [info] [safepoint] Application time: 12.4565068 seconds [2020-07-14t07:08:26.197 +0000][trace][SafePoint] Setting threadlocal yield flag forThreads [2020-07-14t07:08:26.197 +0000][trace][safePoint] Thread: 0x0000022C7C494b30 [0x61dc] State: _at_poll_safePoint 0 [2020-07-14T07:08:26.197+0000][trace][SafePoint] Thread 0x0000022c7c497f30 [0x4ff8] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 ...... [2020-07-14T07:08:26.197+0000][trace][safePoint] Thread: 0x0000022c10c010b0 [0x5878] State: _call_back _has_called_back 0 _at_poll_safePoint 0 [2020-07-14T07:08:26.348+0000][trace][SafePoint] Thread: 0x0000022c10bfe560 [0x5038] State: _at_poll_safePoint 0 [2020-07-14t07:08:26.197 +0000][debug][safePoint] Waitingfor 1 thread(s) to block
Copy the code

Application Time: 12.4565068 seconds Indicates the length of time since the last Safepoint. The following trace indicates the status of each thread, only one of which is not in Safepoint:

Thread: 0x0000022c10c010b0  [0x5878] State: _call_back _has_called_back 0 _at_poll_safepoint 0
Copy the code

There is a detailed thread number, so you can know what the thread is doing through jStack. Waiting for 1 thread(s) to block also means how many threads need to wait to reach Safepoint.

At the end of phase 4, start phase 5 by printing:

[2020-07-14T07:08:29.348+0000][info][SafePoint] Entering safePoint region: G1CollectForAllocationCopy the code

After phase 5 is executed, print:

[2020-07-14T07:08:29.350+0000][info][SafePoint] Leaving SafePoint RegionCopy the code

When final phase 6 begins, it will print:

[the 2020-07-14 T07:08:29. 350 + 0000] [info] [safepoint] Total timefor whichStopping Threads took: 3.1467255 seconds For Stopping ThreadsCopy the code

The Total time for which application threads were stopped was from Phase 1 to Phase 6. Stopping Threads took for Stopping at Safepoint. Stopping threads took for Stopping at Safepoint. Stopping Threads took for Stopping at Safepoint. This depends on which trace threads are not in Safepoint, what they are doing, whether they have large loops, or whether they are using the StackWalker class.

How to analyze SafePoint through JFR?

For jFR-related configuration and instructions and how to analyze SafePoint-related events through JFR, see my other series of JFR Full Solutions series

Common SafePoint tuning parameters and explanations

1. It is recommended that all threads enter Safepoint periodically

For micro service highly concurrent applications, it is not necessary to enter Safepoint regularly, so close – XX: XX: + UnlockDiagnosticVMOptions – GuaranteedSafepointInterval = 0

2. It is recommended to cancel biased locking

In highly concurrent applications, bias locking does not lead to performance gains, but rather to unnecessary threads entering Safepoint or Stop the World because of bias locking. Therefore, you are advised to disable: -xx: -usebiasedlocking

3. It is recommended to open the loop and add the Safepoint parameter

-xx :+UseCountedLoopSafepoints to prevent large loop JIT compilation that causes internal SafePoints to be optimised and omitted, resulting in longer Safepoint entry time: -xx :+UseCountedLoopSafepoints

4. It is recommended to enable SafePoint logging at debug level (and the fifth option)

The debug level does not see which threads are waiting to enter Safepoint each time, but it is clear how long each phase takes. At the trace level, you can see which threads are there each time, but this increases the safePoint entry time by a few milliseconds.

-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M
Copy the code

5. It is recommended to open JFR collection for SafePoint (and choose a fourth)

Modify or create a JFC file:

<event name="jdk.SafepointBegin">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.SafepointStateSynchronization">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.SafepointWaitBlocked">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.SafepointCleanup">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.SafepointCleanupTask">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.SafepointEnd">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>

<event name="jdk.ExecuteVMOperation">
  <setting name="enabled">true</setting>
  <setting name="threshold">10 ms</setting>
</event>
Copy the code