This article was first published on the WeChat official account BaronTalk. Welcome to follow!

Efficient concurrency is the final article in the JVM series. It focuses on how the virtual machine implements multithreading, how threads share and compete for data, and the problems and solutions that arise from that sharing and competition.

1. Java Memory Model and Threads

It is not simply that processors have become more powerful; rather, a computer's computing speed differs so much from the speed of its storage and communication subsystems that a great deal of time is spent on disk I/O, network communication, and database access. To avoid wasting processor time waiting on other resources, we must have the computer perform multiple tasks at once and exploit the processor's full performance; this is also how servers cope with the demands of high concurrency. The design of the Java memory model and the existence of threads exist precisely to implement multitasking better and more efficiently.

1.1 Hardware efficiency and consistency

The vast majority of tasks in a computer cannot be completed by the processor alone; the processor must at least interact with memory, reading data and storing results, and this I/O is hard to eliminate. Because there is a gap of several orders of magnitude between the speed of memory and the speed of the processor, computers insert a layer of cache, whose read and write speed is as close as possible to the processor's, as a buffer between the two: the data an operation needs is copied into the cache so the operation can proceed quickly, and the results are synchronized from the cache back to memory when the operation completes, so the processor never has to wait on slow memory reads and writes.

Cache-based storage interaction resolves the speed mismatch between processor and memory well, but it also adds complexity to the computer system because it introduces a new problem: cache consistency. In a multiprocessor system, each processor has its own cache while all of them share the same main memory. When the computation tasks of several processors involve the same region of main memory, the cached copies held by each processor may become inconsistent. To resolve this, each processor must follow a cache-coherence protocol when accessing its cache, reading and writing according to that protocol.

In addition to adding caches, the processor may optimize the input code with out-of-order execution so that its internal execution units are utilized as fully as possible. The processor reorders the results of out-of-order execution afterward, guaranteeing that the final result is consistent with sequential execution, but it does not guarantee that every statement in the program is computed in the order it appears in the input code. Therefore, if one computation task depends on the intermediate result of another, that ordering cannot be guaranteed by code order alone. The JIT compiler performs a similar instruction-reordering optimization.

1.2 Java Memory Model

The Java Virtual Machine specification defines the Java memory model, which masks the differences in memory access across hardware and operating systems so that Java programs achieve consistent memory-access behavior on every platform. Languages such as C/C++ use the memory model of the physical hardware and operating system directly, and because memory models differ across platforms, code often has to be written separately for each one.

Main memory vs. working memory

The main goal of the Java memory model is to define the access rules for variables in a program: the low-level details of storing variables to and reading variables from memory in the virtual machine. "Variables" here differ from variables in Java source code: they include instance fields, static fields, and the elements that make up array objects, but not local variables and method parameters, which are thread-private and never shared. To achieve better execution performance, the Java memory model does not restrict the execution engine to particular processor registers or caches when interacting with main memory, nor does it prevent the JIT compiler from optimizations such as adjusting code-execution order.

The Java memory model stipulates that all variables are stored in main memory and that each thread has its own working memory, which holds copies of the variables the thread uses. All of a thread's operations on variables must be performed in working memory rather than reading and writing main memory directly; the transfer of variable values between threads is done through main memory.

Inter-memory operations

As for the specific protocol of interaction between main memory and working memory, that is, the details of how a variable is copied from main memory into working memory and how it is synchronized from working memory back to main memory, the Java memory model defines the following eight operations; a virtual machine implementation must guarantee that each of them is atomic and indivisible.

The eight operations are lock, unlock, read, load, use, assign, store, and write.
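As a rough illustration (a conceptual sketch only: these eight operations are virtual machine internals, not a Java API, and the class below is hypothetical), the comments map a simple increment of a shared field onto the operations the model defines:

```java
// Conceptual sketch: annotating what the JMM says must happen when a
// thread executes "sum = sum + 1" on a shared instance field.
public class MemoryOps {
    private int sum = 0;          // the master copy lives in main memory

    public void increment() {
        // read:   transfer the value of sum out of main memory...
        // load:   ...and place it into this thread's working-memory copy
        // use:    hand the working copy to the execution engine for "+ 1"
        // assign: write the engine's result back into the working copy
        // store:  transfer the new value toward main memory...
        // write:  ...and put it into the main-memory variable
        sum = sum + 1;
        // lock/unlock only come into play when synchronization is used.
    }
}
```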

Special rules for volatile variables

Volatile is the lightest-weight synchronization mechanism provided by the Java virtual machine. When a variable is defined as volatile, it has two properties:

The first is that the variable is visible to all threads, where "visibility" means that when one thread changes the variable's value, the new value is immediately visible to other threads. Ordinary variables cannot do this; their values are passed between threads through main memory. For example, if thread A modifies an ordinary variable and writes it back to main memory, thread B cannot see the new value until after A's write-back to main memory and B's subsequent read from main memory have both completed.

The second is that it forbids instruction-reordering optimization. Ordinary variables only guarantee correct results at the points in a method that depend on the result of an assignment; they do not guarantee that assignments happen in the same order as in the program code. Within a single thread this cannot be observed, which is what the Java memory model describes as "within-thread as-if-serial semantics."
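A minimal sketch of the visibility property (the class and method names are mine, not from the original article). Note that visibility alone does not make compound operations such as ++ atomic; volatile only guarantees that each individual read and write goes against main memory:

```java
// Without volatile, the worker may cache "running" in its working memory
// and never observe the update, so the loop could spin forever.
public class VolatileVisibility {
    private volatile boolean running = true;

    public void worker() {
        while (running) {
            // do work; each volatile read sees the latest written value
        }
    }

    public void shutdown() {
        running = false; // immediately visible to the worker thread
    }

    public static void main(String[] args) throws InterruptedException {
        VolatileVisibility demo = new VolatileVisibility();
        Thread t = new Thread(demo::worker);
        t.start();
        Thread.sleep(100);
        demo.shutdown();
        t.join(); // terminates promptly because running is volatile
    }
}
```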

Special rules for long and double variables

The Java memory model requires that the lock, unlock, read, load, assign, use, store, and write operations all be atomic. For 64-bit data types (long and double), however, it defines one relatively loose exception: the virtual machine is allowed to split reads and writes of non-volatile 64-bit values into two 32-bit operations, that is, it need not guarantee the atomicity of load, store, read, and write for such values. This is the so-called non-atomic treatment of long and double.

If multiple threads share a long or double that is not declared volatile and read and modify it concurrently, some threads may read a "half-written" value that is neither the old value nor the new one. Fortunately, this is very rare: mainstream commercial virtual machines treat operations on long and double as atomic, so in everyday development there is usually no need to declare such variables volatile for this reason alone.

Atomicity, visibility, and ordering

The Java memory model is built around how atomicity, visibility, and ordering are handled during concurrent execution.

  1. Atomicity: Access operations on basic data types, namely read, load, assign, use, store, and write, are directly guaranteed to be atomic by the Java memory model. If a wider atomicity guarantee is needed, the model also provides the lock and unlock operations. The virtual machine does not expose these to the user directly, but it does provide the higher-level bytecode instructions monitorenter and monitorexit that use them implicitly, and these surface in Java code as the synchronized keyword. Operations inside methods or blocks modified by synchronized are therefore atomic (see the counter sketch after this list).
  2. Visibility: Visibility means that when one thread changes the value of a shared variable, other threads immediately learn of the change. The Java memory model provides visibility by using main memory as the transfer medium: a new value is synchronized back to main memory after a variable is modified, and the value is refreshed from main memory before the variable is read. This applies to ordinary and volatile variables alike; the difference is that volatile's special rules guarantee that new values are synchronized to main memory immediately and that every read is refreshed from main memory immediately beforehand. Volatile therefore guarantees multithreaded visibility in a way that ordinary variables do not. Besides volatile, Java has two other keywords that provide visibility: synchronized and final. The visibility of synchronized blocks comes from the rule that "before unlocking a variable, its value must be synchronized back to main memory (store, write)". The visibility of a final field means that once the constructor has initialized it, and the constructor has not leaked a reference to "this", the field's value is visible to other threads.
  3. Ordering: The natural ordering in a Java program can be summarized as follows: observed within a single thread, all operations appear ordered; observed from one thread onto another, all operations appear unordered. The first half of that sentence refers to "within-thread as-if-serial semantics"; the second half refers to the phenomena of instruction reordering and of delay in synchronizing working memory with main memory. The Java language provides the keywords volatile and synchronized to guarantee ordering between threads: volatile carries semantics that forbid instruction reordering, while synchronized relies on the rule that "a variable may be locked by only one thread at a time", which ensures that two synchronized blocks holding the same lock can only be entered serially.
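A sketch of the atomicity point in item 1 (a hypothetical class, not from the original article): count++ is a read-modify-write sequence, so it is not atomic unless the whole sequence sits inside a monitor:

```java
// count++ compiles to load, add, store; concurrent updates can be lost
// unless the sequence is guarded by monitorenter/monitorexit.
public class Counter {
    private int count = 0;

    public void unsafeIncrement() {
        count++; // three steps, not atomic: races can lose updates
    }

    public synchronized void safeIncrement() {
        count++; // the monitor makes the whole read-modify-write atomic
    }

    public synchronized int get() {
        return count;
    }
}
```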

The happens-before principle

If all ordering in the Java memory model had to be established with volatile and synchronized alone, many operations would become very cumbersome; yet we do not feel this when writing Java concurrent code. That is because the Java language has a "happens-before" principle. This principle is very important: it is the primary basis for judging whether data races exist and whether threads are safe, and it lets us resolve all questions about whether two operations might conflict in a concurrent environment with a small package of rules.

Happens-before is a partial-order relation between two operations defined in the Java memory model. If operation A happens-before operation B, then the influence of A can be observed by B when B actually occurs, where "influence" includes modifying the values of shared variables in memory, sending messages, calling methods, and so on.

The Java memory model has some natural happens-before relationships that hold without the assistance of any synchronizer and can be relied on directly in code. If the relationship between two operations is not in the list below and cannot be deduced from it, they have no ordering guarantee and the virtual machine may reorder them at will.

  • Program order rule: within a thread, operations written earlier in the code happen-before operations written later. More precisely, this is control-flow order rather than program-text order, since branches, loops, and the like must be taken into account;

  • Monitor lock rule: an unlock operation happens-before any subsequent lock operation on the same lock;

  • Volatile variable rule: a write to a volatile variable happens-before any subsequent read of that variable. Understanding this rule explains why DCL singletons must mark the instance field volatile (see the sketch after this list);

  • Thread start rule: a thread's start() call happens-before every other action of that thread;

  • Thread termination rule: all operations in a thread happen-before the detection that the thread has terminated;

  • Thread interruption rule: a call to interrupt() happens-before the point where code in the interrupted thread detects the interruption;

  • Object finalization rule: the completion of an object's initialization (the end of its constructor) happens-before the start of its finalize() method;

  • Transitivity: if operation A happens-before operation B, and B happens-before C, then A happens-before C.
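Here is the double-checked-locking singleton mentioned under the volatile rule, as a sketch. Without volatile on the instance field, reordering of the allocation, constructor call, and reference assignment could let another thread observe a non-null but not-yet-constructed object:

```java
public class Singleton {
    private static volatile Singleton instance;

    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                  // first check, no lock
            synchronized (Singleton.class) {
                if (instance == null) {          // second check, with lock
                    instance = new Singleton();  // volatile forbids reordering
                }                                // of allocate/construct/assign
            }
        }
        return instance;
    }
}
```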

1.3 Java and Threads

When you talk about concurrency in Java, you’re usually talking about multithreading. In this section we will discuss the implementation of Java threads in virtual machines.

Implementation of threads

All major operating systems provide threading implementations, and the Java language provides a uniform treatment of thread operations across different hardware and operating-system platforms. Every instance of the Thread class that has had start() called and has not yet terminated represents a thread. All of Thread's key methods are native. In the Java APIs, a native method usually means the method is not, or cannot be, implemented by platform-independent means (it may also be native for execution efficiency, though the most efficient means are usually platform-specific).

There are three main ways to implement threads: using kernel threads (a 1:1 implementation), using user threads (1:N), and using a hybrid of user threads and lightweight processes (N:M).

Java thread implementation

Before JDK 1.2, Java threads were implemented with user threads known as "green threads." In JDK 1.2, the threading model was replaced with one based on the operating system's native threads. Consequently, in current JDK versions, the thread model the operating system supports largely determines how Java virtual machine threads are mapped; there is no agreement across platforms on this, and the virtual machine specification does not mandate which thread model Java threads must use. The threading model affects only the concurrency scale and the cost of thread operations, and these differences are transparent to Java programs when coding and running.

1.4 Java Thread scheduling

Thread scheduling is the process by which the system allocates processor time to threads. There are two main approaches: cooperative thread scheduling and preemptive thread scheduling.

Cooperative thread scheduling

In a multithreaded system with cooperative scheduling, each thread's execution time is controlled by the thread itself: when a thread finishes its work, it actively notifies the system to switch to another thread. The biggest advantage of cooperative multithreading is simplicity of implementation, and because a thread switches only after finishing its own work, the switch points are known to the thread itself, so there are no thread-synchronization problems. The downside is obvious: execution time is uncontrollable, and if a thread is written badly and never tells the operating system to switch, the whole program blocks forever. Long ago, Windows 3.x used cooperative scheduling to multitask between processes, and it was so unstable that a single process refusing to yield CPU time could hang the entire system.

Preemptive thread scheduling

In a multithreaded system with preemptive scheduling, the system allocates execution time to each thread, and thread switching is not decided by the threads themselves. Thread scheduling is thus under the system's control, and no single thread can block the whole process. Java uses preemptive thread scheduling. In contrast to the Windows 3.x example, the Windows 9x/NT kernels schedule processes preemptively, so when one process misbehaves we can use the task manager to "kill" it without crashing the system.

1.5 State Transition

The Java language defines six thread states. At any point in time, a thread is in exactly one of them:

  • New: a thread that has been created but not yet started is in this state;
  • Runnable: includes the operating-system thread states Running and Ready; the thread may be executing or may be waiting for the CPU to allocate execution time;
  • Waiting: threads in this state are allocated no CPU time and wait to be explicitly woken by another thread. Three methods put a thread into indefinite waiting:
    • Object.wait() without a timeout;
    • Thread.join() without a timeout;
    • LockSupport.park().
  • Timed Waiting: threads in this state are likewise allocated no CPU time, but instead of waiting to be explicitly woken by another thread, they are woken automatically by the system after a set time. The following methods put a thread into timed waiting:
    • Thread.sleep();
    • Object.wait() with a timeout;
    • Thread.join() with a timeout;
    • LockSupport.parkNanos();
    • LockSupport.parkUntil().
  • Blocked: the thread is blocked. The difference from the waiting states is that a blocked thread is waiting to acquire an exclusive lock, which becomes available only when another thread gives it up, whereas a waiting thread is waiting for a period of time to pass or for a wake-up action. A thread enters this state while waiting to enter a synchronized region;
  • Terminated: the thread has finished executing.

These six states transition into one another when particular events occur. The sketch below probes a few of the transitions:
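This is a small probe using java.lang.Thread.State, which mirrors the six states above (the class name and timings are mine; the 100 ms sleep merely gives the worker time to reach the monitor):

```java
public class StateDemo {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();
        Thread t = new Thread(() -> {
            synchronized (lock) { /* blocks while main holds the lock */ }
        });

        System.out.println(t.getState());     // NEW: created but not started

        synchronized (lock) {
            t.start();
            Thread.sleep(100);                // let t reach the monitor
            System.out.println(t.getState()); // BLOCKED: waiting for the lock
        }

        t.join();                             // main WAITS until t finishes
        System.out.println(t.getState());     // TERMINATED
    }
}
```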

2. Thread Safety and Lock Optimization

The topic of this article is efficient concurrency, but efficiency is premised on correctness and safety, so in this section we begin with how to guarantee thread safety under concurrency.

2.1 Java thread safety

So what is thread safety? It can be understood simply as follows: when multiple threads operate on the same memory area, the changes to the values stored there remain predictable; the values never become uncontrolled as a result of several threads accessing and modifying the same area.

Thread safety in the Java language

If we do not treat thread safety as an all-or-nothing concept (either thread-safe or thread-unsafe), we can rank it by the strength of its guarantees, from strongest to weakest, into the following five categories:

  1. Immutable;

  2. Absolute thread safety;

  3. Relative thread safety (most classes commonly called "thread-safe", such as Vector and Hashtable, belong here; see the sketch after this list);

  4. Thread compatibility;

  5. Thread hostility.
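As a sketch of relative thread safety (category 3), in the spirit of the book's Vector example: each Vector method is synchronized on its own, but a check-then-act sequence spanning two calls is not, so sooner or later one of the worker threads below dies with an ArrayIndexOutOfBoundsException:

```java
import java.util.Vector;

public class VectorRace {
    private static final Vector<Integer> vector = new Vector<>();

    public static void main(String[] args) {
        while (true) {
            for (int i = 0; i < 10; i++) {
                vector.add(i);
            }
            Thread removeThread = new Thread(() -> {
                for (int i = 0; i < vector.size(); i++) {
                    vector.remove(i);          // the index may already be gone
                }
            });
            Thread printThread = new Thread(() -> {
                for (int i = 0; i < vector.size(); i++) {
                    System.out.println(vector.get(i)); // may race with remove
                }
            });
            removeThread.start();
            printThread.start();
            // Crude throttle so we don't spawn an unbounded number of threads.
            while (Thread.activeCount() > 20);
        }
    }
}
```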

Thread-safe implementation

Although thread safety is strongly tied to how the code is written, the synchronization and locking mechanisms provided by the virtual machine also play an important role. Let's look at thread safety from the virtual-machine level.

  1. Mutual-exclusion synchronization

Mutual-exclusion synchronization is a common means of guaranteeing correctness under concurrency. Synchronization means that when multiple threads access shared data concurrently, the data is used by only one thread at a time; mutual exclusion is a means of achieving synchronization. The most basic mutual-exclusion primitive in Java is the synchronized keyword, which after compilation places two bytecode instructions, monitorenter and monitorexit, before and after the synchronized block. Both instructions take a reference-type operand specifying the object to lock and unlock. If the synchronized statement in a Java program explicitly specifies an object argument, that object is the lock; if not, the lock is either the object instance or the Class object, depending on whether synchronized modifies an instance method or a class (static) method (see the sketch below).
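A minimal sketch of the three lock targets (the class and method names are mine):

```java
public class SyncTargets {
    private final Object explicitLock = new Object();

    public void block() {
        synchronized (explicitLock) {   // locks the explicitly named object
            // critical section
        }
    }

    public synchronized void instanceMethod() {
        // locks "this", the current instance
    }

    public static synchronized void classMethod() {
        // locks SyncTargets.class, the Class object
    }
}
```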

As the virtual machine specification requires, when monitorenter executes, the thread first tries to acquire the lock of the object. If the object is not locked, or the current thread already owns that object's lock, the lock counter is incremented by one; correspondingly, monitorexit decrements the counter by one, and when it reaches zero the lock is released. If acquiring the lock fails, the current thread blocks and waits until another thread releases it.

Another point to note is that a synchronized block blocks subsequent threads until the thread inside has finished executing. Since Java threads map to native operating-system threads, blocking or waking a thread requires the operating system's help, which means transitioning from user mode to kernel mode, a transition that costs a great deal of processor time. For simple synchronized blocks, such as getters and setters marked synchronized, the state transitions can take longer than the user code itself. Synchronized is therefore a heavyweight operation in Java and should be used only when truly necessary. Of course, the virtual machine applies optimizations of its own, such as spin-waiting before asking the operating system to block a thread, to avoid frequent user-to-kernel transitions; we discuss these below under lock optimization.

  2. Non-blocking synchronization

    The biggest problem with mutual-exclusion synchronization is the performance cost of blocking and waking threads, so it is also called blocking synchronization. In its approach to the problem, mutual exclusion is a pessimistic concurrency strategy: it assumes that trouble is bound to occur unless correct synchronization measures (such as locking) are taken, so it locks whether or not the shared data is actually contended (though the virtual machine does optimize away some unnecessary locks). As hardware instruction sets have evolved, we have gained another option: an optimistic concurrency strategy based on conflict detection. In plain terms, perform the operation first; if no other thread contends for the shared data, it succeeds, and if a conflict does arise, apply a remedy, most commonly retrying until success. Because many implementations of this optimistic strategy never need to suspend a thread, this kind of synchronization is called non-blocking synchronization.

    The reason hardware instruction-set support is needed is that the two steps, performing the operation and detecting conflicts, must together be atomic.

    What guarantees that atomicity? Using mutual-exclusion synchronization to guarantee it would defeat the purpose, so we must rely on the hardware to complete what is semantically a multi-step operation with a single processor instruction. Common instructions of this kind include:

    • Test-and-Set
    • Fetch-and-Increment
    • Swap
    • Compare-and-Swap (CAS)
    • Load-Linked/Store-Conditional (LL/SC)

    The first three are instructions that already existed in older processor instruction sets; the latter two are newer additions.

    The CAS instruction takes three operands: the memory location (in Java, think of it simply as a variable's memory address, denoted V), the old expected value (denoted A), and the new value (denoted B). When CAS executes, the processor updates V with the new value B if and only if V still holds the expected value A; otherwise it performs no update. In either case it returns the old value of V. The entire sequence is a single atomic operation.

    CAS was not available to Java programs until JDK 1.5, where it is wrapped by methods such as compareAndSwapInt() and compareAndSwapLong() in the sun.misc.Unsafe class. The virtual machine treats these methods specially: the just-in-time compiled result is a platform-specific processor CAS instruction with no method call at all, or, if you prefer, they are unconditionally inlined.

    Since the Unsafe class is not intended to be called by user programs, it can only be used indirectly (short of reflection) through other Java APIs, such as the integer atomic classes in the J.U.C package, whose methods such as compareAndSet() and getAndIncrement() use the CAS operations of the Unsafe class (see the sketch below).
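A small sketch of that indirect use of CAS via AtomicInteger (the class name CasCounter is mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public void increment() {
        // getAndIncrement() retries a CAS loop internally until it wins
        value.getAndIncrement();
    }

    public boolean setIfUnchanged(int expected, int update) {
        // compareAndSet succeeds only if the current value equals `expected`
        return value.compareAndSet(expected, update);
    }
}
```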

    As elegant as CAS looks, it does not cover every scenario of mutual-exclusion synchronization, and it is not semantically perfect. If a variable V was first read holding value A, and still holds A when we are about to assign to it, can we conclude that no other thread modified it in the meantime? No: if it was changed to B during that time and later changed back to A, CAS will believe it was never touched. This loophole is called the “ABA” problem of CAS operations.

    To address the “ABA” problem, the J.U.C package provides the AtomicStampedReference class, which guarantees the correctness of CAS by versioning the variable's value (see the sketch below). In most cases, though, ABA does not affect a program's correctness, and when it must be addressed, switching to traditional mutual-exclusion synchronization may be more efficient than the atomic classes.
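A sketch of the stamped CAS (the class name is mine). Small Integer values are used deliberately, because compareAndSet compares references and Integer.valueOf() caches only values in [-128, 127]:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        AtomicStampedReference<Integer> ref =
                new AtomicStampedReference<>(1, 0); // value 1, stamp 0

        int stamp = ref.getStamp(); // observe version 0 alongside value 1

        // Simulate another thread doing A -> B -> A, bumping the stamp twice.
        ref.compareAndSet(1, 2, stamp, stamp + 1);
        ref.compareAndSet(2, 1, stamp + 1, stamp + 2);

        // A plain CAS on the value alone would succeed here; the stamped CAS
        // fails because the version moved from 0 to 2, exposing the change.
        boolean ok = ref.compareAndSet(1, 1, stamp, stamp + 1);
        System.out.println(ok); // prints false
    }
}
```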

  3. No-synchronization schemes

    Synchronization is not a prerequisite for thread safety. If a method inherently involves no shared data, it naturally needs no synchronization; some code is thus inherently thread-safe, including the reentrant code and thread-local storage described below.

    Reentrant code: also called pure code. Such code can be interrupted at any point in its execution so that control switches elsewhere (including recursive calls into itself), and once control returns, the original program continues without error. Reentrant code shares some recognizable traits: it does not rely on data stored on the heap or on shared system resources, it uses only state passed in through its parameters, and it does not call non-reentrant methods. A simple test: if a method's result is predictable, returning the same output for the same input, then it is reentrant and therefore thread-safe (see the sketch below).
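A tiny sketch of reentrant, pure code (a hypothetical example): the result depends only on the arguments, so it is inherently thread-safe:

```java
public class PureMath {
    // Same inputs always give the same output; no heap state or shared
    // resources are touched, so it is safe to interrupt and re-enter.
    public static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }
}
```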

    Thread-local storage: data confined to a single thread. In Java, the ThreadLocal class implements thread-local storage (see the sketch below).
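A minimal ThreadLocal sketch (the class name is mine): each thread increments and reads only its own copy, so both threads print 1:

```java
public class ThreadLocalDemo {
    private static final ThreadLocal<Integer> COUNTER =
            ThreadLocal.withInitial(() -> 0);

    public static void main(String[] args) {
        Runnable task = () -> {
            COUNTER.set(COUNTER.get() + 1);  // touches only this thread's copy
            System.out.println(Thread.currentThread().getName()
                    + " -> " + COUNTER.get());
        };
        new Thread(task, "t1").start();
        new Thread(task, "t2").start();
    }
}
```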

2.2 Lock optimization

The HotSpot virtual machine development team has invested a great deal of effort in lock optimizations such as spin locks and adaptive spinning, lock elimination, lock coarsening, lightweight locks, and biased locks.

Spin locks and adaptive spinning

As mentioned in the discussion of mutual-exclusion synchronization, its biggest performance cost comes from the implementation of blocking: suspending and resuming threads both require transitions from user mode to kernel mode, which puts great pressure on the system's concurrent performance. In most scenarios, however, shared data stays locked only briefly, and it is not worth suspending and resuming threads for such a short time. If the physical machine has more than one processor, so that two or more threads can execute in parallel, we can tell a later thread requesting the lock to "hang on a moment", without giving up its processor time, and see whether the lock holder releases the lock soon. To make the thread wait, we simply have it execute a busy loop (spin): this is the spin lock.

Spin waiting avoids the overhead of thread switching, but it occupies processor time. If locks are held only briefly, spinning works well; if they are held for long periods, the spinning thread burns processor resources for nothing, producing a net loss. Spin waiting must therefore be bounded, but a fixed bound is not the best choice, so the virtual machine team introduced adaptive spin locks: the spin duration is no longer fixed but is decided by the previous spin time on the same lock and the state of the lock's owner. If a spin recently succeeded in acquiring a given lock and the holding thread is running, the virtual machine judges that this spin is also likely to succeed and allows it to spin longer. If spinning rarely succeeds for a given lock, later acquisitions of that lock may skip spinning altogether. With adaptive spinning, as execution time accumulates and profiling information improves, the virtual machine predicts the program's lock behavior more and more accurately.

Lock elimination

Lock elimination means that, at runtime, the just-in-time compiler removes locks that the code requires but that it can prove cannot possibly contend over shared data. The main judgment is supported by data from escape analysis: if no data on the heap in a piece of code can escape and be accessed by other threads, it can be treated as stack-confined, thread-private data, and locking it is unnecessary (see the sketch below).
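A sketch of lock-elimination-friendly code, modeled on the book's string-concatenation example (on older compilers, javac itself turns string "+" into StringBuffer appends; it is written out explicitly here to make the locks visible):

```java
public class LockEliminationDemo {
    public static String concat(String s1, String s2, String s3) {
        StringBuffer sb = new StringBuffer();
        sb.append(s1); // each append() synchronizes on sb...
        sb.append(s2); // ...but escape analysis proves sb never leaves
        sb.append(s3); // this method, so the JIT can eliminate the locks
        return sb.toString();
    }
}
```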

Lock coarsening

When coding, we generally recommend keeping the scope of synchronized blocks as small as possible, synchronizing only over the actual extent of the shared data. This keeps the number of operations performed while holding the lock to a minimum, so that contending threads can acquire the lock as soon as possible. In general this is right. However, if a series of consecutive operations repeatedly locks and unlocks the same object, even with no thread contention at all, the frequent mutex operations cause needless performance loss. When such locking appears inside a loop body, for example, the virtual machine will widen (coarsen) the synchronization scope to outside the loop, so the lock is taken only once. This is lock coarsening (see the sketch below).
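A sketch of a coarsening candidate (a hypothetical method): each append() locks and unlocks the shared buffer, and the JIT may widen one lock to cover the whole loop:

```java
public class LockCoarseningDemo {
    // StringBuffer.append is synchronized, so this loop locks and unlocks
    // sb on every iteration; the JIT may coarsen this to a single lock
    // held across the entire loop.
    public static void appendAll(StringBuffer sb, String[] parts) {
        for (String part : parts) {
            sb.append(part);
        }
    }
}
```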

Lightweight locks and biased locks are not covered here; if you are interested, leave a comment and I will introduce them in a separate article.

Conclusion

This brings the JVM series to a close. The series is essentially a compilation of my reading notes, and I hope you have found it helpful. Owing to limits of space and of my own ability, the essence of the book cannot all be presented here; readers who want to go deeper into the Java virtual machine are encouraged to read Zhou Zhiming's original work.

References:

  • Zhou Zhiming, Understanding the Java Virtual Machine in Depth: Advanced JVM Features and Best Practices, 2nd edition

If you like my articles, follow my WeChat public account BaronTalk, my Zhihu column, or star my projects on GitHub.

  • WeChat official account: BaronTalk
  • Zhihu column: zhuanlan.zhihu.com/baron
  • GitHub: github.com/BaronZ88