Original author: Doug Lea: gee.cs.oswego.edu/dl/papers/a… Reference: www.jianshu.com/p/c5a18eb6d… Related article: www.sohu.com/a/232590588… (Locks are not slow; lock contention is slow)

The "[number]" marks below correspond to the numbered references in Doug Lea's original paper.

Abstract

Most of the synchronizers (locks, barriers, etc.) in J2SE 1.5's java.util.concurrent package are built on the lightweight framework class AbstractQueuedSynchronizer. This framework provides a general mechanism for atomically managing synchronization state, blocking and waking threads, and queuing. This paper describes the rationale, design, implementation, usage, and performance of the framework.

Categories and subject descriptors

D.1.3 [Programming Techniques]: Concurrent Programming -- Parallel Programming

Keywords

Synchronization, Java

1. Introduction

The J2SE 1.5 release introduces the package java.util.concurrent, a collection of medium-level concurrency support classes created via JCP (Java Community Process) JSR (Java Specification Request) 166.
Among these components is a set of synchronizers: abstract data type (ADT) classes that maintain an internal synchronization state (for example, representing whether a lock is locked or unlocked), operations to update and inspect that state, and at least one method that causes a calling thread to block if the state requires it, resuming when some other thread changes the synchronization state to permit it. Examples include various forms of mutual-exclusion locks, read-write locks, semaphores, barriers, event indicators, and handoff queues.

ADT: here the ADT base is the abstract class AbstractQueuedSynchronizer.

It is well known (see [2]) that synchronizers can almost always be implemented in terms of one another; for example, it is possible to build semaphores from reentrant locks, and vice versa. However, doing so usually entails enough complexity, overhead, and inflexibility to be at best a second-rate engineering option. It is also conceptually unattractive: if none of these constructs is intrinsically more primitive than the others, developers should not be forced to arbitrarily pick one as the basis for building the others. As a result, JSR 166 establishes a framework centered on the class AbstractQueuedSynchronizer (AQS), which provides the common mechanics needed by most synchronizers; each synchronizer only needs to implement its own specific behavior on top of it.
The rest of this article discusses the requirements for this framework, the main ideas behind its design and implementation, sample usage, and some measurements showing its performance characteristics.

2. Requirements

2.1 Functionality

Synchronizers have two kinds of methods [7]: an "acquire" operation blocks the calling thread unless or until the synchronization state allows it to proceed, and a "release" operation changes the synchronization state so that one or more blocked threads may be unblocked. JUC does not define a single unified API for synchronizers; some use a common interface (e.g. Lock), while others have only specialized versions, so acquire and release operations appear under different method names in different classes. For example, Lock.lock, Semaphore.acquire, CountDownLatch.await, and FutureTask.get all map to acquire operations in the framework.

Note: in JDK 1.8, FutureTask no longer uses AQS acquire/release internally.

Where applicable, the framework also supports: non-blocking synchronization attempts (e.g. tryLock) alongside blocking versions; optional timeouts (e.g. tryAcquireNanos); cancellability via thread interruption, usually as a separate interruptible acquire method (e.g. acquireInterruptibly); and both exclusive mode and shared mode. JUC additionally defines the interface Condition to support monitor-style await/signal operations.

Shared mode: multiple threads may hold the synchronizer at the same time (e.g. read locks, semaphores). Monitor: every Java object has a monitor, acquired with synchronized(obj) { ... }.
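As an illustration of these variants (an example of mine using the public Lock API, not code from the paper), ReentrantLock exposes non-blocking, timed, and interruptible acquires that delegate to the corresponding AQS operations:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    public class AcquireVariants {
        public static void main(String[] args) throws InterruptedException {
            Lock lock = new ReentrantLock();

            if (lock.tryLock()) {                              // non-blocking attempt
                try { /* critical section */ } finally { lock.unlock(); }
            }

            if (lock.tryLock(50, TimeUnit.MILLISECONDS)) {     // attempt with timeout
                try { /* critical section */ } finally { lock.unlock(); }
            }

            lock.lockInterruptibly();                          // blocking acquire that responds to interrupt
            try { /* critical section */ } finally { lock.unlock(); }
        }
    }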

2.2 Performance Objectives

Java's built-in locks (accessed via synchronized methods and blocks) have long been a performance concern ([1], [3]). Optimization of built-in locks has focused on minimizing space overhead (since any Java object can serve as a monitor) and minimizing time overhead in mostly single-threaded contexts. Neither of these is central for synchronizers: 1. programmers construct synchronizers only when they need them, so there is no wasted space to compact; 2. synchronizers are used almost exclusively in multithreaded, increasingly multi-core programs, where at least occasional contention is expected. The usual JVM strategy of optimizing locks primarily for the zero-contention case, leaving other cases to less predictable "slow paths" [12], is therefore not the right approach here.

Synchronized: since JDK 1.6, built-in locks have been optimized with biased locking, lightweight locks, adaptive spinning, and heavyweight locks.

The main performance goal of JUC synchronizers is scalability: to predictably maintain efficiency even, and especially, when they are contended. Ideally, the cost of passing a synchronization point should be constant no matter how many threads are trying to pass it. Among the primary goals is to minimize the total time during which some thread is permitted to pass a synchronization point but has not yet done so (i.e. to keep the "critical region" short). This must be balanced against other resource costs: total CPU time, memory traffic, and thread-scheduling overhead. For example, a spin lock usually has lower acquisition latency than a blocking lock, but it wastes CPU cycles and generates memory contention, so blocking is generally preferable except for very short waits. These goals lead to two styles of use: most applications should maximize aggregate throughput while tolerating, at best, probabilistic guarantees against starvation; in applications such as resource control, it is more important to maintain fairness of access across threads, tolerating poorer aggregate throughput. No framework can make this trade-off on behalf of its users; the fairness policy must be left to the user to choose. Finally, no matter how well a synchronizer is engineered internally, it will become a performance bottleneck in some applications, so the framework provides monitoring and inspection of basic operations to let users discover and alleviate bottlenecks, minimally by providing a way to determine how many threads are blocked.

Permitted but not yet passed: the synchronization state would allow a thread to proceed, but no thread has yet acquired it. Key words: resource balance, memory contention, fair/unfair, monitoring and inspection.

3. Design and implementation

The basic ideas behind a synchronizer are quite straightforward. An acquire operation proceeds as:

    while (synchronization state does not allow acquire) {
        enqueue current thread if not already queued;
        possibly block current thread;
    }
    dequeue current thread if it was queued;

And a release operation:

    update synchronization state;
    if (state may permit a blocked thread to acquire)
        unblock one or more queued threads;

Support for these operations requires the coordination of three basic components: atomically managing synchronization state; blocking and unblocking threads; and maintaining queues. It might be possible to create a framework that lets these three components vary independently, but that would be neither efficient nor very usable; for example, the information kept in queue nodes must mesh with what is needed to unblock threads, and the exported methods depend on the nature of the synchronization state. The central design decision of the synchronizer framework was choosing a concrete implementation of each of these three components.

3.1 Synchronization State

Class AbstractQueuedSynchronizer maintains synchronization state in a single (32-bit) int, and exports getState, setState, and compareAndSetState to access and update that state. These methods rely on the volatile read/write semantics guaranteed by JSR-133 (the Java Memory Model); compareAndSetState is implemented via a native compare-and-swap or load-linked/store-conditional instruction and atomically sets the state to a given new value only if it currently holds the given expected value.
Restricting the synchronization state to a 32-bit int was a pragmatic decision. Although JSR-166 also provides atomic operations on 64-bit long fields, these must still be emulated with internal locks on enough platforms that the resulting synchronizers would perform poorly. A 64-bit base class may be added in the future, but 32 bits are sufficient for now; among the JUC classes only CyclicBarrier would need more bits, so it uses locks internally instead (as do most higher-level utilities in the package).
Concrete classes based on AbstractQueuedSynchronizer must define tryAcquire and tryRelease in terms of this exported state in order to control the acquire and release operations. tryAcquire must return true if synchronization was acquired, and tryRelease must return true if the new synchronization state may allow other threads to acquire. These methods take a single int argument that can be used to communicate the desired state; for example, in a reentrant lock it re-establishes the recursion count when the lock is re-acquired. Many synchronizers do not need such an argument and simply ignore it.
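As a hedged sketch of how the int argument and the state can carry such a recursion count (a simplified illustration, not the actual ReentrantLock source, which adds fair/non-fair variants and overflow checks), a reentrant exclusive synchronizer might implement tryAcquire/tryRelease like this:

    import java.util.concurrent.locks.AbstractQueuedSynchronizer;

    // Simplified sketch of a reentrant exclusive sync.
    class ReentrantSyncSketch extends AbstractQueuedSynchronizer {
        protected boolean tryAcquire(int acquires) {
            Thread current = Thread.currentThread();
            int c = getState();
            if (c == 0) {                                  // free: try to grab it
                if (compareAndSetState(0, acquires)) {
                    setExclusiveOwnerThread(current);      // remember the owner
                    return true;
                }
            } else if (current == getExclusiveOwnerThread()) {
                setState(c + acquires);                    // reentrant acquire: bump the hold count
                return true;
            }
            return false;
        }

        protected boolean tryRelease(int releases) {
            if (Thread.currentThread() != getExclusiveOwnerThread())
                throw new IllegalMonitorStateException();
            int c = getState() - releases;
            boolean free = (c == 0);
            if (free)
                setExclusiveOwnerThread(null);
            setState(c);                                   // only the owner writes here, so a plain write suffices
            return free;
        }
    }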

3.2 Blocking/waking up

Prior to JSR-166 (JDK 1.5), the only Java API available for blocking and waking threads when building synchronizers was the monitor-based one. Thread.suspend and Thread.resume are unusable for two reasons: 1. suspending a thread does not release the locks it holds; 2. they have an unsolvable race: if a resume is issued before the corresponding suspend, the suspend will block the thread forever.

Note: Object.wait/Object.notify require holding the object's monitor; LockSupport.park/unpark (added in Java 5) fit better, because AQS does not use monitor-based locks internally.

The java.util.concurrent.locks.LockSupport class addresses this with LockSupport.park() and LockSupport.unpark(). park blocks the current thread unless or until an unpark has been issued (spurious wakeups are also permitted). Calls to unpark are not counted: no matter how many unparks precede a park, they will unblock only that single park.
This simple mechanism is similar, at some level, to those used in the Solaris-9 thread library [11], in WIN32 "consumable events", and in the Linux NPTL thread library, and so maps efficiently onto the most common platforms Java runs on, giving good runtime performance. (The current Sun HotSpot JVM on Solaris and Linux actually uses a pthread condvar in order to fit into the existing runtime design.) park also supports optional relative and absolute timeouts, and is integrated with JVM Thread.interrupt: interrupting a thread unparks it (a thread blocked in an interruptible method then throws an exception).

park source: hg.openjdk.java.net/aarch32-por… On Linux, park/unpark is implemented on top of POSIX pthread mutexes and condition variables.
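A small, self-contained illustration of these semantics (mine, not from the paper): permits issued by unpark do not accumulate beyond one.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.LockSupport;

    public class ParkDemo {
        public static void main(String[] args) {
            Thread current = Thread.currentThread();

            LockSupport.unpark(current);     // issue two unparks before any park:
            LockSupport.unpark(current);     // at most one permit is retained, they do not count up

            LockSupport.park();              // consumes the single stored permit, returns immediately
            System.out.println("first park returned immediately");

            LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(200)); // no permit left: waits out the timeout
            System.out.println("second park blocked until the timeout");
        }
    }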

3.3 Queues

The heart of the framework is the maintenance of queues of blocked threads, which are restricted here to FIFO queues; thus, AQS does not support priority-based synchronization. There is little dispute that the best choices for synchronization queues are non-blocking data structures that do not themselves require lower-level locks to construct. There are two main candidates: MCS locks [9] and CLH locks [5][8][10]. Historically, CLH locks have only been used in spin locks, but CLH is easier to adapt to the synchronizer framework because it handles cancellation and timeouts more readily, so it was chosen as the basis of the synchronization queue. The final implementation is far removed from the original CLH structure, which requires some explanation. A CLH lock is not very queue-like, because its enqueue and dequeue operations are intimately tied to its use as a lock. It is a linked queue accessed via two atomically updatable fields, head and tail, both initially pointing to a dummy node.

A new node, node, is enqueued using an atomic operation:

    do {
        pred = tail;
    } while (!tail.compareAndSet(pred, node));

Whether a node may acquire the lock is determined by the status kept in its predecessor node, so the "spin" of the spin lock looks like:

    while (pred.status != RELEASED); // spin

Dequeuing after this spin simply consists of setting the head field to the node that just obtained the lock:

    head = node;
Among the advantages of CLH locks are that enqueuing and dequeuing are fast, lock-free, and obstruction-free (even under contention, one thread will always win the insertion race and be able to make progress); that checking whether any threads are waiting is also fast (just check whether head is the same as tail); and that the RELEASED status is decentralized, reducing memory contention. In the original versions of CLH locks the nodes were not even linked: in a spin lock, pred can be held as a local variable. However, Scott and Scherer [10] showed that by explicitly maintaining predecessor fields within nodes, CLH locks can handle timeouts and other forms of cancellation (if a node's predecessor cancels, the node can slide up to use the previous node's status field).
The main change needed to use a CLH queue for blocking synchronizers is to give one node a way to locate its successor. In a spin lock, a node only needs to change its own status, which its successor will notice on its next spin, so no successor link is needed. But in a blocking synchronizer, a node must explicitly wake up (unpark) its successor. An AQS node therefore has a next field linking it to its successor. However, there is no known technique for lock-free, atomic insertion of a node into a doubly linked list, so this link is not part of the atomic insertion (prev is established atomically; next is simply assigned afterwards):

    pred.next = node;

The next link is used only as an optimized access path. If a node's successor does not appear to exist (or appears cancelled) via its next field, it is always possible to start from tail and traverse backwards through each node's pred field to find the actual successor.
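A hedged sketch of that fallback traversal (modeled loosely on the unparkSuccessor logic inside the JDK's AbstractQueuedSynchronizer; the Node type here is a stand-in for the framework's internal node class):

    import java.util.concurrent.locks.LockSupport;

    // Stand-in for the framework's internal queue node.
    final class Node {
        volatile Node prev, next;
        volatile int waitStatus;     // > 0 means cancelled; <= 0 means waiting/normal
        volatile Thread thread;
    }

    final class QueueSketch {
        volatile Node head, tail;

        // Prefer the next link; if it is missing or cancelled, walk backwards from
        // tail over the always-consistent prev links to find the real successor.
        void unparkSuccessor(Node node) {
            Node s = node.next;
            if (s == null || s.waitStatus > 0) {
                s = null;
                for (Node t = tail; t != null && t != node; t = t.prev)
                    if (t.waitStatus <= 0)
                        s = t;                       // closest non-cancelled node after 'node'
            }
            if (s != null)
                LockSupport.unpark(s.thread);        // wake it
        }
    }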
The second change is the use of a status field in each node to control blocking rather than spinning. In the synchronizer framework, a queued thread can only return from an acquire operation by passing the tryAcquire method defined in a concrete subclass; a single RELEASED bit no longer suffices. 1. Control is needed to ensure that an active thread only invokes tryAcquire when it is at the head of the queue; even then it may fail to acquire and (re)block. 2. A node does not need a special status bit for this, because permission can be determined by checking whether the node's predecessor is head. Compared with a spin lock, which re-reads shared state in a loop, head is read only once per wakeup, reducing memory contention. However, cancellation status must still be kept in the node.
The node status field is also used to avoid needless calls to park and unpark. Although these methods perform about as well as blocking primitives can (they are thin wrappers over OS-level primitives), crossing the Java/JVM/OS boundary still carries unavoidable overhead. Before calling park, a thread sets a "signal me" bit on its predecessor and then rechecks the synchronization and node status once more; if the releasing thread has in the meantime called unpark and reset the state, an unnecessary block (and the cost of waking up again) is avoided. The AQS implementation of the CLH queue also relies on garbage collection to reclaim node storage, avoiding complexity and overhead; fields known to be no longer needed are nulled out when a node leaves the queue, to help GC. There are some further minor optimizations, including lazy initialization of head and tail on the first contended acquire.
Omitting such details, the general form of the basic acquire operation (exclusive, non-interruptible, untimed) is:

    if (!tryAcquire(arg)) {
        node = create and enqueue new node;
        pred = node's effective predecessor;
        while (pred is not head node || !tryAcquire(arg)) {
            if (pred's signal bit is set)
                park();                                    // block
            else
                compareAndSet pred's signal bit to true;   // ask pred to signal us
            pred = node's effective predecessor;
        }
        head = node;
    }

And the release operation:

    if (tryRelease(arg) && head node's signal bit is set) {
        compareAndSet head's signal bit to false;
        unpark head's successor, if one exists;
    }
Although the main acquire loop may invoke tryAcquire multiple times, each acquire and release operation amortizes to constant time O(1) per thread, disregarding cancelled nodes and each platform's park implementation. Cancellation support mainly entails checking for timeout or interrupt on each return from park. A thread cancelled by timeout or interrupt sets its node's status and wakes its successor. Determining the effective predecessor and successor and resetting status in the presence of cancelled nodes may require O(n) traversals (where n is the queue length); but because the cancelling thread never blocks again for this operation, links and status are rebuilt quickly.

3.4 Condition Queues

The synchronizer framework provides a ConditionObject class for use by synchronizers that maintain exclusive synchronization and conform to the Lock interface. Any number of condition objects may be attached to a lock, providing classic monitor-style await, signal, and signalAll operations, including timed versions, along with some inspection and monitoring methods. The ConditionObject class lets conditions integrate efficiently with other synchronization operations. It supports only Java-style monitor access rules, in which condition operations are legal only when the lock owning the condition is held by the current thread (see [4] for alternatives); a condition attached to a ReentrantLock thus behaves like a built-in monitor, differing only in method names, extra functionality, and the ability for users to declare multiple conditions per lock. A ConditionObject uses the same internal queue nodes as the synchronizer, but maintains them on a separate condition queue. The signal operation moves a node from the condition queue to the lock queue, rather than necessarily waking the waiting thread before it has re-acquired its lock. The basic await operation is: create a node and add it to the condition queue; release the lock; block until the node has been transferred to the lock queue; re-acquire the lock. The signal operation is: transfer the first node from the condition queue to the lock queue.
Because these operations are performed only while the lock is held, they can maintain the condition queue with sequential linked-list operations (a nextWaiter field in nodes). The transfer simply unlinks the first node from the condition queue and then attaches it to the lock queue using the CLH insertion. The main complication in implementing these operations is dealing with cancellation of condition waits caused by timeouts and Thread.interrupt. A cancellation and a signal occurring at nearly the same time form a race whose outcome must conform to the specification for built-in monitors: as revised in JSR-133, if an interrupt occurs before a signal, await must (after re-acquiring the lock) throw InterruptedException; if the interrupt occurs after a signal, await must return without an exception, but with the thread's interrupt status set.
To maintain proper ordering, a bit in the node status records whether the node has been (or is being) transferred. Both the signalling code and the cancelling code try to compareAndSet this status. If a signal operation loses this race, it instead transfers the next node in the queue, if any. If a cancellation loses, it must abort the transfer and then await re-acquisition of the lock. This latter case introduces a potentially unbounded spin: a cancelled wait cannot begin re-acquiring the lock until its node has been successfully inserted into the lock queue, so it must spin waiting for the CLH insertion performed by the signalling thread to complete. The need to spin here is rare; the spinning thread uses Thread.yield to give up its time slice, hinting that some other thread (ideally the signalling one) should run, after which the spin will obviously end very quickly. It would be possible to implement a helping strategy for spin cancellation, but the case is far too rare to justify the added overhead. In all other contexts there are no spins or yields, which preserves good performance even on a single core.
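A standard usage illustration (mine, not from the paper) of a lock with two conditions: a minimal bounded buffer in which await releases the lock and blocks, and signal transfers one waiter back to the lock queue.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Minimal bounded buffer showing monitor-style await/signal on Conditions.
    class BoundedBuffer<E> {
        private final Deque<E> items = new ArrayDeque<>();
        private final int capacity;
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notFull  = lock.newCondition();   // multiple conditions per lock
        private final Condition notEmpty = lock.newCondition();

        BoundedBuffer(int capacity) { this.capacity = capacity; }

        public void put(E e) throws InterruptedException {
            lock.lock();
            try {
                while (items.size() == capacity)
                    notFull.await();          // releases the lock, blocks, re-acquires on return
                items.addLast(e);
                notEmpty.signal();            // transfer one waiting consumer to the lock queue
            } finally {
                lock.unlock();
            }
        }

        public E take() throws InterruptedException {
            lock.lock();
            try {
                while (items.isEmpty())
                    notEmpty.await();
                E e = items.removeFirst();
                notFull.signal();             // transfer one waiting producer
                return e;
            } finally {
                lock.unlock();
            }
        }
    }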

4. Usage

    Class "AbstractQueuedSynchronizer"Using the template method, you try to put the above functionality together as a base class for the synchronizer, and subclasses just need to implement it"state"To control acquire and Release. However, these subclasses do not inherit directly from AQS, and AQS cannot be used as synchronizer ADTs. Aqs opens up internal control methods of acquire and release, and should not make these classes visible to users. All JUC synchronizers declare private internal AQS subclasses and delegate all synchronization methods to it.Copy the code
For example, here is a minimal Mutex class, where synchronization state 0 means unlocked and 1 means locked:

    class Mutex {
      class Sync extends AbstractQueuedSynchronizer {
        public boolean tryAcquire(int ignore) {
          return compareAndSetState(0, 1);
        }
        public boolean tryRelease(int ignore) {
          setState(0);
          return true;
        }
      }
      private final Sync sync = new Sync();
      public void lock()   { sync.acquire(0); }
      public void unlock() { sync.release(0); }
    }

A fuller version of this example, along with additional usage guidance, can be found in the J2SE documentation. Many variants are of course possible; for example, tryAcquire could employ "test-and-test-and-set" by checking the state value before attempting the CAS.

Test-and-test-and-set: TTAS spinlock algorithm
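As a hedged sketch of that variation (my illustration, not code from the paper), a test-and-test-and-set form of the Mutex.Sync.tryAcquire above reads the state first and attempts the CAS only when the lock looks free:

    // Inside the Mutex.Sync class above: read before CAS to avoid a doomed
    // (cache-line-invalidating) compareAndSetState while the lock is clearly held.
    public boolean tryAcquire(int ignore) {
        return getState() == 0 && compareAndSetState(0, 1);
    }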

It may be surprising that a construct as performance-sensitive as a mutex is best defined using a combination of delegation and virtual methods. However, these are exactly the kinds of OO design constructions that modern dynamic compilers have long focused on; they tend to be good at optimizing away this overhead, at least in code where synchronizers are invoked frequently.

OO: object-oriented. Delegation: one object forwards an operation to another object that carries out the work (here, Mutex delegates to its inner Sync). Virtual methods: the Java counterpart of C++ virtual functions, i.e. dynamically dispatched methods such as the tryAcquire that subclasses override. Performance-sensitive: performance-critical code paths.

AQS also supplies a number of methods that help synchronizers control policy. For example, the basic acquire method has timeout and interruptible versions. And while discussion so far has focused on exclusive mode, AQS also contains a parallel set of shared-mode methods (such as acquireShared), whose tryAcquireShared/tryReleaseShared methods can inform the framework, via their return values, that further acquires may be possible, ultimately causing it to wake up multiple waiting threads by cascading signals.

Cascading signals: each awakened shared-mode waiter may in turn wake its successor, so a single release can propagate down the queue. The node status constant static final int PROPAGATE = -3 marks that a shared release should be propagated onward, so the next waiting node is woken as well.
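A shared-mode sketch along the lines of the latch example in the AbstractQueuedSynchronizer class documentation (simplified and illustrative): a one-shot boolean latch whose release propagates to every waiting thread.

    import java.util.concurrent.locks.AbstractQueuedSynchronizer;

    // One-shot latch: state 0 = closed, 1 = open. All awaiting threads are
    // released (via cascading signals) once signal() is called.
    class BooleanLatch {
        private static final class Sync extends AbstractQueuedSynchronizer {
            protected int tryAcquireShared(int ignore) {
                return getState() != 0 ? 1 : -1;    // >= 0 means "acquired", < 0 means "block"
            }
            protected boolean tryReleaseShared(int ignore) {
                setState(1);                        // open the gate
                return true;                        // tell AQS to propagate the wakeup
            }
        }

        private final Sync sync = new Sync();

        public void signal() { sync.releaseShared(1); }
        public void await() throws InterruptedException {
            sync.acquireSharedInterruptibly(1);
        }
    }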

Although it is generally unwise to serialize (persistently store or transmit) synchronizers, they are often used in other classes, such as thread-safe collections, that are commonly serialized. The AQS and ConditionObject classes therefore support serialization of the synchronization state only, not the blocked-thread queues or other transient bookkeeping. Most synchronizer classes simply reset the synchronization state to its initial value on deserialization, consistent with the implicit policy of built-in monitors, which always deserialize to an unlocked state. This amounts to a no-op, but must still be explicitly supported so that final fields can be deserialized.

No-op: no operation is performed; originally an assembly instruction that takes up space but does nothing.

4.1 Controlling Fairness

Resources: m.aliyun.com/yunqi/artic…

Even though the synchronizer is based on a FIFO queue, it is not necessarily fair. Notice that in the basic acquire algorithm (Section 3.3), tryAcquire is checked, and may succeed, before the thread joins the queue. Thus a newly arriving thread whose tryAcquire happens to find the synchronizer available can "steal" (barge in on) access that, in fairness, belongs to the thread at the head of the queue.

Barging (stealing access) keeps the critical section busier, but it can starve queued threads.

This "barging FIFO" strategy generally provides higher aggregate throughput than other techniques: it allows an acquire to succeed while a release is still in progress, reducing the lock's idle time, while release still wakes only the first queued thread, avoiding excessive, unproductive contention.

If incoming threads arrive faster than an unparked thread can wake up, the first queued thread will rarely win the race and will almost always re-block. For briefly-held synchronizers, multiple bargings and releases commonly occur on multiprocessors during the time it takes the first queued thread to wake up. The net effect is that the barging policy keeps one or more threads progressing at a high rate while still, probabilistically, avoiding starvation.

Net effect: the balance of the positive and negative effects taken together. Briefly held: the synchronizer is held only for very short intervals.

When greater fairness is required, it is easy to arrange. Programmers requiring strict fairness can define tryAcquire to fail immediately (return false) if the current thread is not at the head of the queue, which can be checked with the getFirstQueuedThread method, one of the few inspection methods the framework provides. A faster but less strict variant also allows tryAcquire to succeed when the queue is momentarily empty; in that case, multiple threads encountering an empty queue race to be the first to acquire, normally without at least one of them enqueuing. All JUC synchronizers that support a "fair" mode use this strategy.
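A hedged sketch of the strict policy just described (my illustration, reusing the Mutex.Sync example from Section 4): fail fast unless the current thread is the longest-waiting queued thread or the queue is empty.

    // Fair variant of Mutex.Sync.tryAcquire: refuse to barge past queued threads.
    public boolean tryAcquire(int ignore) {
        Thread first = getFirstQueuedThread();       // longest-waiting queued thread, or null
        if (first != null && first != Thread.currentThread())
            return false;                            // someone has waited longer: fail immediately
        return compareAndSetState(0, 1);             // queue is empty or we are at its head
    }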
The JLS provides no scheduling fairness guarantees, so fairness can only be approximated at the library level. Even with a strictly fair synchronizer, if a group of threads never needs to block, the JVM runs them according to its own scheduling policy ("JVM thread scheduling": on a single-core processor, context is switched after a thread exhausts its time slice). A thread that has finished its work and no longer needs the lock, but has not yet released it, may be switched out; under a fair policy the other waiters must then wait until that thread runs again, releases the lock, and wakes the next node in the queue. Thread scheduling therefore implicitly increases a lock's idle time. Synchronizer fairness settings tend to have an even greater impact on multi-core systems, which produce more thread interleavings and hence more opportunities for one thread to barge in on another.

JLS: the Java Language Specification, docs.oracle.com/javase/spec… JVM thread scheduling: preemptive scheduling in which the system allocates execution time to each thread; thread switching is not decided by the threads themselves (in Java a thread can give up its time slice with Thread.yield(), but it cannot demand more time). Because execution time is controlled by the system, no single thread can block the entire process.

Although fair locks perform poorly in briefly-held, high-contention scenarios, there are cases where they work well, for example when they protect relatively long code bodies and/or have long intervals between lock acquisitions; in such cases barging provides little benefit while increasing the risk of starvation. The appropriate kind of synchronizer can be chosen based on the specific scenario.

Briefly held: the lock is held only for very short periods (high-frequency locking). High contention: see the related article www.sohu.com/a/232590588… (Locks are not slow; lock contention is slow).

4.2 Synchronizers

This section outlines how the synchronizers in java.util.concurrent use the AQS framework:

ReentrantLock: uses the synchronization state to hold the (recursive) lock hold count. When the lock is acquired it also records the owning thread, both to maintain the recursion count and to check that the unlocking thread is the owner. The class also uses a ConditionObject and exports the framework's other monitoring and inspection methods, and provides fair and non-fair (default) modes via two different internal synchronizer subclasses.
ReentrantReadWriteLock: splits the synchronization state, using 16 bits for the write-lock count and 16 bits for the read-lock count. The WriteLock is structured like ReentrantLock; the ReadLock uses the acquireShared methods to allow multiple concurrent readers.
Semaphore: uses the synchronization state as the permit count. acquireShared decrements the count and blocks the thread if the result would be negative; tryReleaseShared increments the count and, if permits have become available, wakes waiting nodes.
CountDownLatch: uses the synchronization state as the remaining count. When the count reaches zero, all waiting threads are released at once.
FutureTask (no longer AQS-based in JDK 1.8): used the synchronization state to represent the run state of the future (initial, running, cancelled, completed). A call to get() blocks the calling thread via park; when the task reaches the cancelled or completed state, blocked threads are woken.
SynchronousQueue: uses a CSP-style handoff to synchronize the exchange of objects between threads. It keeps internal wait nodes that match up producers and consumers, and uses the synchronization state to allow a producer to proceed once a consumer has taken the item, and vice versa.
Users can also build their own synchronizers on top of j.u.c. For example, among the classes that were considered but not accepted into the package were ones providing semantics similar to WIN32 events, binary latches, centrally managed locks, and tree-based barriers.
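Finally, a hedged sketch (simplified from, but not identical to, what java.util.concurrent.Semaphore's internal Sync does) of a counting shared-mode synchronizer that keeps the permit count in the state:

    import java.util.concurrent.locks.AbstractQueuedSynchronizer;

    // Simplified non-fair counting sync: state = number of available permits.
    class CountingSyncSketch extends AbstractQueuedSynchronizer {
        CountingSyncSketch(int permits) { setState(permits); }

        protected int tryAcquireShared(int acquires) {
            for (;;) {
                int available = getState();
                int remaining = available - acquires;
                if (remaining < 0 || compareAndSetState(available, remaining))
                    return remaining;        // a negative return tells AQS the caller must block
            }
        }

        protected boolean tryReleaseShared(int releases) {
            for (;;) {
                int current = getState();
                if (compareAndSetState(current, current + releases))
                    return true;             // a shared release may wake waiting threads
            }
        }
    }

A semaphore built on this would expose acquire() as sync.acquireSharedInterruptibly(1) and release() as sync.releaseShared(1).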