There are plenty of books and articles on Java concurrency, but in my own learning I kept running into seemingly simple points that the experts and article authors skip over. Those points matter: without them it is hard to truly understand and digest the rest, and that almost-but-not-quite feeling bothered me enough to write this article. It may not contain any impressive technology or profound theory, but these ideas took my understanding of Java concurrency a step further, and I hope they do the same for you.

Let me first mention a few questions that used to trouble me. They look simple, yet many readers still hold misconceptions about them. Take a look:

  • Question 1: We often hear about "main memory" and "working memory". What do they mean, and in what form do they exist?
  • Question 2: We also hear about "visibility" a lot. What is visibility? Why would something be "invisible"?
  • Question 3: You must have heard of atomicity. What is it, and which operations can be considered atomic?
  • Question 4: "Ordering": does code really execute in the order we wrote it? What is going on behind the scenes?

If these questions have ever bothered you, then this article is for you. Let’s take a look at the world of Java.

What are main memory and working memory?

These two concepts come from the Java Memory Model (JMM). A detailed introduction to the memory model is beyond this article; for now we only need to know that it shields us from the details of the underlying hardware: programmers just write code that follows its rules, and the program runs correctly across platforms. A very clever design.

Getting back to the memory model, we know that the JVM divides memory into the following chunks

  • Heap (shared by all threads in the process)
  • Method area (shared by all threads in the process)
  • Virtual machine stack (private to each thread)
  • Native method stack (private to each thread)
  • PC counter (private to each thread)

So what is the relationship between main memory and working memory? Let me go straight to the conclusion:

  • The main memory is the heap + method area
  • Working memory is the virtual machine stack + native method stack + PC counter

This may seem like a trivial point, but it is important: it is the conclusion that lets us connect the theory to the actual code examples later. Without it, the theory and the practice feel out of step.
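To make the mapping concrete, here is a minimal sketch (the class and field names are mine, purely illustrative): fields live in main memory and are shared between threads, while local variables live in each thread's own stack frame, i.e. working memory.

public class MemoryRegions {
    private int sharedField;        // instance field: on the heap -> main memory, visible to all threads
    private static int staticField; // static field: in the method area -> main memory

    public void compute() {
        int local = sharedField + 1; // 'local' lives in this thread's stack frame -> working memory
        sharedField = local;         // writing back publishes the value to main memory
    }
}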

An example

The requirement: we have a Counter class with an int member variable count that records the current total, defined as follows.

public class Counter {
    private int count;

    public void increment() {
        this.count++;
    }

    public int getCount() {
        return this.count;
    }
}

Our task now is to call increment 100 million times and print the count, so obviously the correct output should be 100 million.

public static void main(String[] args) {
    Counter counter = new Counter();
    int loopCount = 100000000;
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < loopCount; i++) {
        counter.increment();
    }
    System.out.println("count:" + counter.getCount());
    System.out.println("take time:" + (System.currentTimeMillis() - startTime) + "ms");
}

// Output
count:100000000
take time:577ms

So easy! This code is all too familiar, but this time I want to dig into how the program works by looking at how the code executes in the virtual machine stack. Let's check the bytecode of the Counter class's increment method using the javac and javap commands.

javac Counter.java
javap -c Counter.class

public void increment();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: aload_0
         1: dup
         2: getfield      #2 // Field count:I
         5: iconst_1
         6: iadd
         7: putfield      #2 // Field count:I
        10: return

We know that method calls in Java are implemented with stack frames; each frame consists of an operand stack, a reference to the run-time constant pool of the class being used, and a local variable table. I have drawn a diagram to help you follow the process, breaking one count++ operation into its bytecode steps on the operand stack (taking count = 10 as the example). Note that everything in the diagram happens in working memory (in the main thread's virtual machine stack).
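In case the diagram does not survive here, the same breakdown in text form: my sketch of the operand stack during one count++, assuming count is currently 10 (the top of the stack is on the right).

 0: aload_0      // push 'this'                          stack: [this]
 1: dup          // duplicate the top value              stack: [this, this]
 2: getfield #2  // pop 'this', push its count field     stack: [this, 10]
 5: iconst_1     // push the constant 1                  stack: [this, 10, 1]
 6: iadd         // pop 10 and 1, push their sum         stack: [this, 11]
 7: putfield #2  // pop 11 and 'this', store into count  stack: []
10: return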

Now, let me sum up the methodology! The code above works correctly in a single thread because the following three conditions all hold:

  • The loop from 0 to 100 million executes strictly in order (ordering)
  • During the loop, every earlier change to count is visible to the later iterations (visibility)
  • Because execution is strictly sequential, count++ operations never interleave, so in a single-threaded environment they can effectively be considered atomic (atomicity)

If any of these conditions is broken, the result may be wrong. This is exactly why concurrency problems are so common in multithreaded Java programs: there, none of the three conditions holds automatically.

Why does multithreading have concurrency problems?

As mentioned above, our single-threaded use of the Counter class satisfies ordering, visibility, and atomicity all at once. With multiple threads we usually expect interleaving, so without synchronization, ordering and atomicity obviously cannot be guaranteed. Visibility, though, is the hardest of the three to understand, so let's pick the toughest bone to chew first and talk about visibility.

What is visibility? Why is it invisible?

Again, use our example above to illustrate. We already know that

counter.increment();

Compiles to bytecode as

getfield      #2
iconst_1
iadd
putfield      #2

As mentioned above, this bytecode executes in working memory, but the getfield and putfield instructions actually interact with main memory. Again, take the increment method of the Counter class as an example:

  • getfield reads the value of count, nominally from main memory, but not always: because of the CPU cache, the value may come from the cache instead, so the read may not see the latest value
  • putfield writes the new count back to main memory, but the write does not take effect everywhere at once: the copies of count in other CPUs' caches are not updated immediately, and the CPUs synchronize them through a cache coherence protocol, which is transparent to us

It is precisely because of CPU caches that visibility problems exist in multi-core environments. Modern CPUs are far faster than memory; if every write had to lock the bus and go to main memory, execution would slow down dramatically, which is unacceptable. I have drawn another picture here to help you visualize this.
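To see the visibility problem with your own eyes, here is a minimal, self-contained demo of mine (not part of the original example). Without volatile on stop, the reader thread may spin forever because it keeps using a cached value; whether it actually hangs depends on your JVM and JIT, so treat it as an experiment rather than a guarantee.

public class VisibilityDemo {
    private static boolean stop = false; // try adding 'volatile' and the program reliably exits

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait; the read of 'stop' may be hoisted out of the loop
            }
            System.out.println("reader saw stop == true");
        });
        reader.start();
        Thread.sleep(1000);
        stop = true; // without volatile, this write may never become visible to 'reader'
        reader.join();
    }
}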

volatile

Volatile variables have the following properties

  • A volatile write carries monitor-release semantics: it invalidates the variable's cached copies on other CPUs and publishes the latest value
  • A volatile read carries monitor-acquire semantics: it invalidates the variable's entry in the current CPU's cache, so the latest value is read from main memory
  • volatile also carries semantics that forbid instruction reordering

Monitor release is what happens when a lock is released, and monitor acquire when one is taken. In effect, reads and writes of a volatile variable go straight to main memory, sacrificing some performance for visibility. That sacrifice is usually negligible; just be aware that it exists.

How volatile works

The only change at the bytecode level is that count gains the ACC_VOLATILE flag. At run time, memory barriers are automatically inserted to enforce volatile's visibility semantics. There are four types of memory barriers:

  • LoadLoad
  • LoadStore
  • StoreStore
  • StoreLoad

There is a document that covers memory barriers authoritatively and in detail; you can dig into it on your own. Below is an example from that documentation that visually illustrates how the barriers are inserted.
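As a rough sketch of the kind of rules that document describes (this is a simplified, conceptual placement of my own, following JSR-133 cookbook conventions, not actual JIT output; real compilers may merge or strengthen barriers, e.g. into a single lock-prefixed instruction on x86):

// conceptual barrier placement around accesses to a volatile field 'v'

// volatile write:
//   StoreStore barrier  -- earlier normal stores may not sink below the volatile store
//   v = x;              -- the volatile store itself
//   StoreLoad barrier   -- the store may not be reordered with subsequent loads

// volatile read:
//   y = v;              -- the volatile load itself
//   LoadLoad barrier    -- later loads may not float above the volatile load
//   LoadStore barrier   -- later stores may not float above the volatile load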

Returning to the earlier example: if we add the volatile modifier to count, will multiple threads produce the correct sum? Let's try it. For simplicity, we open just two threads, each doing half the work.

// Counter.java
private volatile int count;

// main method
Counter counter = new Counter();
int loopCount = 100000000;
int halfCount = loopCount / 2;
long startTime = System.currentTimeMillis();
Thread thread1 = new Thread(() -> {
    for (int i = 0; i < halfCount; i++) {
        counter.increment();
    }
});
Thread thread2 = new Thread(() -> {
    for (int i = halfCount; i < loopCount; i++) {
        counter.increment();
    }
});
thread1.start();
thread2.start();
try {
    thread1.join();
    thread2.join();
} catch (InterruptedException e) {
    e.printStackTrace();
}
System.out.println("count:" + counter.getCount());
System.out.println("take time:" + (System.currentTimeMillis() - startTime) + "ms");

// Result
count:51743664
take time:2335ms

The result is clearly still wrong, and the program took several times longer to run. This is because volatile guarantees only visibility, not atomicity, as the following case shows.

Between t1 and t6, count starts at 0 and two ++ operations run, yet the final count is still 1. In our example above, 50 million iterations per thread produce a huge number of similar lost updates. Note that volatile's semantics are not violated here: Thread1's write to count is visible to Thread2 at t5, meaning that if Thread2 executed getfield at that point it would read the 1 that Thread1 just wrote. But Thread2 knows nothing about that; it simply writes its own, now stale, 1 to count at its own pace.
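Spelled out as a timeline (my sketch of one possible interleaving):

// count starts at 0; both threads execute count++
// time   Thread1                       Thread2
// t1     getfield  -> reads 0
// t2                                   getfield  -> reads 0
// t3     iadd      -> computes 1
// t4                                   iadd      -> computes 1
// t5     putfield  -> writes count=1
// t6                                   putfield  -> writes count=1   (Thread1's increment is lost)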

A natural idea follows: before writing, check whether the value we read is still the latest; if it is not, read it again and retry, and only when it is still current do we write the updated value. That would prevent the erroneous overwrites above. The idea is sound, but note that the whole check-then-write sequence must itself be atomic, or the concurrency problem simply moves. In fact, the CAS (compare-and-swap) methods in the JDK's Unsafe class do exactly this, retrying over and over in a loop, a process called spinning. The underlying implementation relies on the cmpxchgl and cmpxchgq assembly instructions; CPUs on different platforms implement this differently, but the code is similar. Here I will take OpenJDK 8 as an example and look at the CAS source; there is a lot of it, so I will only post the key blocks.
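Before we go native, here is what that read-check-retry loop looks like at the Java level: a minimal sketch of mine using AtomicInteger, whose compareAndSet ultimately delegates to the same Unsafe CAS (we will build our own Unsafe-based version later):

import java.util.concurrent.atomic.AtomicInteger;

public class SpinCounter {
    private final AtomicInteger count = new AtomicInteger();

    public void increment() {
        int cur;
        do {
            cur = count.get();                        // read the current value
        } while (!count.compareAndSet(cur, cur + 1)); // spin: retry if another thread won the race
    }

    public int getCount() {
        return count.get();
    }
}

In production code you would simply call count.incrementAndGet(), which wraps this same loop.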

// Unsafe.class has three CAS methods
public final native boolean compareAndSwapObject(Object var1, long var2, Object var4, Object var5);
public final native boolean compareAndSwapInt(Object var1, long var2, int var4, int var5);
public final native boolean compareAndSwapLong(Object var1, long var2, long var4, long var6);

These correspond to the native implementations in hotspot/src/share/vm/prims/unsafe.cpp

UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapInt(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jint e, jint x))
  UnsafeWrapper("Unsafe_CompareAndSwapInt");
  oop p = JNIHandles::resolve(obj);
  jint* addr = (jint *) index_oop_from_field_offset_long(p, offset);
  return (jint)(Atomic::cmpxchg(x, addr, e)) == e;
UNSAFE_END

I will only post the implementation of compareAndSwapInt here. You can see that it in turn calls Atomic::cmpxchg:

unsigned Atomic::cmpxchg(unsigned int exchange_value,
                         volatile unsigned int* dest, unsigned int compare_value) {
  assert(sizeof(unsigned int) == sizeof(jint), "more work to do");
  return (unsigned int)Atomic::cmpxchg((jint)exchange_value, (volatile jint*)dest,
                                       (jint)compare_value);
}

jbyte Atomic::cmpxchg(jbyte exchange_value, volatile jbyte* dest, jbyte compare_value) {
  assert(sizeof(jbyte) == 1, "assumption.");
  uintptr_t dest_addr = (uintptr_t)dest;
  uintptr_t offset = dest_addr % sizeof(jint);
  volatile jint* dest_int = (volatile jint*)(dest_addr - offset);
  jint cur = *dest_int;
  jbyte* cur_as_bytes = (jbyte*)(&cur);
  jint new_val = cur;
  jbyte* new_val_as_bytes = (jbyte*)(&new_val);
  new_val_as_bytes[offset] = exchange_value;
  while (cur_as_bytes[offset] == compare_value) {
    jint res = cmpxchg(new_val, dest_int, cur);
    if (res == cur) break;
    cur = res;
    new_val = cur;
    new_val_as_bytes[offset] = exchange_value;
  }
  return cur_as_bytes[offset];
}

Tracing the cmpxchg call further, we find it is not defined in atomic.cpp. Looking in atomic.hpp, we see the corresponding inline function declarations:

inline static jint     cmpxchg    (jint     exchange_value, volatile jint*     dest, jint     compare_value);
// See comment above about using jlong atomics on 32-bit platforms
inline static jlong    cmpxchg    (jlong    exchange_value, volatile jlong*    dest, jlong    compare_value);

cmpxchg is defined in hotspot/src/os_cpu/solaris_x86/vm/atomic_solaris_x86.inline.hpp

#define LOCK_IF_MP(mp) "cmp $0, " #mp "; je 1f; lock; 1:"
inline jint _Atomic_cmpxchg(jint exchange_value, volatile jint* dest, jint compare_value, int mp) {
    __asm__ volatile (LOCK_IF_MP(%4) "cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
                    : "cc", "memory");
    return exchange_value;
  }

This is inline assembly. To be honest, I have returned most of my assembly knowledge to my teachers, but the LOCK_IF_MP macro checks whether the machine is multi-core and, if so, adds the lock prefix. That lock works at the CPU bus level, and its cost is far smaller than an application-level lock. We also see the key instruction, cmpxchgl. For application developers, I think reaching this level of depth is enough. Now that we understand the underlying implementation, let's put what we've learned to use.

Modifying our Counter accumulator with CAS to make it thread-safe

To use CAS, we definitely need the Unsafe class. Application code cannot obtain the Unsafe instance directly, so we grab it via reflection (here, from AtomicInteger's unsafe field); take a look at the UnsafeUtil implementation:

    // UnsafeUtil.java
    public static Unsafe getUnsafeObject() {
        Class clazz = AtomicInteger.class;
        try {
            Field uField = clazz.getDeclaredField("unsafe");
            uField.setAccessible(true);
            return (Unsafe) uField.get(null);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    public static long getVariableOffset(Object target, String variableName) {
        Object unsafeObject = getUnsafeObject();
        if (unsafeObject != null) {
            try {
                Method method = unsafeObject.getClass().getDeclaredMethod("objectFieldOffset", Field.class);
                method.setAccessible(true);
                Field targetField = target.getClass().getDeclaredField(variableName);
                return (long) method.invoke(unsafeObject, targetField);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return -1;
    }

Let’s look at the implementation of the Counter class

public class Counter {
    private volatile int count;
    private Unsafe mUnsafe;
    private long countOffset;

    public Counter() {
        mUnsafe = UnsafeUtil.getUnsafeObject();
        countOffset = UnsafeUtil.getVariableOffset(this, "count");
    }

    public void increment() {
        int cur = getCount();
        while (!mUnsafe.compareAndSwapInt(this, countOffset, cur, cur + 1)) {
            cur = getCount();
        }
    }

    public int getCount() {
        return this.count;
    }
}

Start two more threads and execute our accumulator

// Result
count:100000000
take time:5781ms

You can see that we now get the correct sum. It runs longer, though still within the same order of magnitude. Note that in the example above I kept volatile on the count variable. CAS would actually still work correctly without volatile, just slightly less efficiently; in my tests, roughly 5% slower. Think about why dropping volatile would make it slower.

The volatile keyword also carries semantics that forbid instruction reordering. The classic use of that property is the DCL (double-checked locking) singleton pattern, sketched below.
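For reference, here is the standard DCL singleton. volatile is what makes it safe: it forbids reordering between the construction of the object and the write of its reference into instance; without it, another thread could observe a non-null but half-constructed object.

public class Singleton {
    private static volatile Singleton instance;

    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                 // first check, without locking
            synchronized (Singleton.class) {
                if (instance == null) {         // second check, under the lock
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}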

Now that we’re done talking about visibility, let’s talk about atomicity.

Atomicity: how do we guarantee it?

We’ve already mentioned some of them, for example CAS itself is atomic, so what else is atomic? I’m going to use the increment method of the Counter class as an example.

     0: aload_0
     1: dup
     2: getfield      #2 // Field count:I
     5: iconst_1
     6: iadd
     7: putfield      #2 // Field count:I
    10: return

Each of these seven bytecode instructions is atomic; no problem there. But if we push further, a question remains: is a single bytecode instruction really atomic? If it were not, then all of our earlier conclusions would collapse, because the theoretical foundation we reasoned from would be broken. The fact is that a single bytecode instruction is atomic. Who guarantees that? The Java memory model, the JMM; programmers do not need to worry about the details.

Although a single bytecode instruction is atomic, a combination of several is not, and that is the source of many concurrency problems. So what can we programmers do to ensure atomicity? There are roughly three options:

  • CAS + spin
  • The synchronized keyword
  • A Lock provided by the Concurrent package, with specific implementation classes such as ReentrantLock

CAS we’ve already discussed above, but we won’t repeat it here. Let’s look at synchronized

synchronized

synchronized is the most common means of synchronization. There are three main ways to use it:

// synchronized instance method
public synchronized void invoke() {}

// synchronized static method
public static synchronized void invoke() {}

// synchronized block
synchronized (object) {}

The difference between the three forms is what object we synchronize on: an ordinary synchronized method locks the instance itself, a static synchronized method locks the Class object, and a synchronized block locks whatever object we put in the parentheses. In essence the principle is the same: there is a monitor associated with the object being synchronized, and while one thread holds the monitor's lock, other threads must wait until it is released before they can acquire it. The corresponding native-layer implementation is in ObjectMonitor.hpp, which maintains many synchronization-related fields internally. We focus on two of them:

void *  volatile _owner;          // pointer to owning thread OR BasicLock
ObjectWaiter * volatile _WaitSet; // LL of threads wait()ing on the monitor

_owner points to the thread currently holding the lock, and _WaitSet is the queue of threads waiting for it. Judging from the ObjectWaiter data structure, you can take it to be a doubly linked circular list.

Going back to our example, this time we implement our Counter class with synchronized.

public void increment() {
    synchronized (Counter.class) {
        this.count++;
    }
}

// Result
count:100000000
take time:4881ms

The output is correct, and from the printed run times synchronized is even faster than our CAS version, which may feel counterintuitive. Since biased locking and lightweight locking were introduced in Java 6, synchronized performance has improved greatly, so there is no need to carry a psychological burden when using it. In general, unless the performance requirements are extreme, there is no problem with using synchronized.

Here, again, is the bytecode-level view of the increment method once synchronized is added.

synchronized (Counter.class) {
    this.count++;
}

// after compilation, roughly equivalent to the following pseudo-code
monitorenter Counter.class;
try {
    this.count++;
    monitorexit Counter.class;
} finally {
    monitorexit Counter.class;
}

In addition to the synchronized keyword, another common class used for synchronization is ReentrantLock.

ReentrantLock

ReentrantLock, as its name suggests, is a reentrant lock (more on reentrancy below). Its standard usage looks like this:

ReentrantLock reentrantLock = new ReentrantLock(false);

// the standard way to use ReentrantLock
reentrantLock.lock();
try {
    // do something
} finally {
    reentrantLock.unlock();
}

The important thing to note here is that lock() must be called before the try block, not inside it. If lock() were inside the try and something failed before or during the lock acquisition, the finally block would call unlock() on a lock we never acquired.
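A quick sketch of the anti-pattern next to the correct form:

// WRONG: if something throws before the lock is actually acquired,
// the finally block calls unlock() on a lock we do not hold,
// which throws IllegalMonitorStateException
try {
    reentrantLock.lock();
    // do something
} finally {
    reentrantLock.unlock();
}

// RIGHT: acquire before the try, release in finally
reentrantLock.lock();
try {
    // do something
} finally {
    reentrantLock.unlock();
}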

For comparison, let’s use ReentrantLock to do Counter accumulation again.

        Lock lock = new ReentrantLock(false);
        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < halfCount; i++) {
                lock.lock();
                try {
                    counter.increment();
                } finally {
                    lock.unlock();
                }
            }
        });
        Thread thread2 = new Thread(() -> {
            for (int i = halfCount; i < loopCount; i++) {
                lock.lock();
                try {
                    counter.increment();
                } finally {
                    lock.unlock();
                }
            }
        });

// Result
count:100000000
take time:12754ms

The output is correct, but the run time is the longest of the three, almost three times that of synchronized. Of course, we are not running a rigorous benchmark here, and the times carry no real reference value; I show them only to give you the intuition that CAS is not necessarily fast and synchronized is not necessarily slow. Analyze each case on its own merits, and keep a healthy sense of skepticism.

Since we have brought up Lock, let's also scratch the surface of its implementation source, using the non-fair lock as the example.

    static final class NonfairSync extends Sync {
        private static final long serialVersionUID = 7316153563782823691L;

        /**
         * Performs lock.  Try immediate barge, backing up to normal
         * acquire on failure.
         */
        final void lock() {
            if (compareAndSetState(0, 1))
                setExclusiveOwnerThread(Thread.currentThread());
            else
                acquire(1);
        }

        protected final boolean tryAcquire(int acquires) {
            return nonfairTryAcquire(acquires);
        }
    }
    
    abstract static class Sync extends AbstractQueuedSynchronizer {
        // ......
    }

    // acquire() itself is inherited from AbstractQueuedSynchronizer:
    public final void acquire(int arg) {
        if (!tryAcquire(arg) &&
            acquireQueued(addWaiter(Node.EXCLUSIVE), arg))
            selfInterrupt();
    }

lock() first tries to set the state from 0 to 1 with CAS. If that succeeds, it sets exclusiveOwnerThread to the current thread; if it fails, it calls acquire(), which in turn calls tryAcquire(). Subclasses can override tryAcquire to customize the locking logic; the default non-fair implementation in ReentrantLock is as follows.

        final boolean nonfairTryAcquire(int acquires) {
            final Thread current = Thread.currentThread();
            int c = getState();
            if (c == 0) {
                if (compareAndSetState(0, acquires)) {
                    setExclusiveOwnerThread(current);
                    return true;
                }
            }
            else if (current == getExclusiveOwnerThread()) {
                int nextc = c + acquires;
                if (nextc < 0) // overflow
                    throw new Error("Maximum lock count exceeded");
                setState(nextc);
                return true;
            }
            return false;
        }

It first reads the state. If the state is 0, it tries to take the lock with CAS; if the state is nonzero, it checks whether the current thread already owns the lock, and if so adds to state and returns true, otherwise returns false. This is exactly why it is called a reentrant lock: locking may be nested, and if the current thread already holds the lock, state is simply incremented and true is returned.

// reentrancy in action
lock.lock();
lock.lock();
// do something ...
lock.unlock();
lock.unlock();

To summarize, ReentrantLock is built on the queued synchronizer AbstractQueuedSynchronizer (AQS), which internally also wraps CAS. There is much more to dig into; it is fair to say the entire concurrent package is built on the two cornerstones of CAS + AQS. We will not discuss more implementation details here; you can refer to the many source-analysis articles for a deeper understanding.

This concludes our discussion of atomicity in Java and the three common ways to guarantee it. Let's move on to the final topic: ordering in Java.

Ordering

To illustrate, let’s start with a simple example

int a = 1; // ①
int b = 2; // ②
int c = a; // ③

We would expect the execution order to be ①②③, but due to instruction reordering the actual order could be ②①③ or ①③②. There are basically two kinds of reordering:

  • CPU level instruction reordering
  • Instruction reordering at the compiler level

The purpose of reordering is to speed up program execution, under the constraint that it must not break as-if-serial semantics; for example, ③ depends on the result of ①, so it always executes after ①. Ordering is relatively easy to guarantee within a single thread, but becomes complicated across threads. So the JMM abstracts a set of happens-before rules: a promise the JMM makes to us developers so that we can reason correctly about ordering when writing multithreaded code. It includes the following five rules:

  • Within a single thread, code earlier in program order happens-before code later in program order
  • An unlock of a monitor happens-before every subsequent lock of that same monitor
  • A write to a volatile variable happens-before every subsequent read of that variable
  • A call to a thread's start() method happens-before every action within that thread
  • If thread A calls thread B's join(), every action in thread B happens-before the operations in thread A that follow the join

Of course, happens-before is transitive: if A happens-before B and B happens-before C, then A happens-before C. Note that happens-before is not strictly equivalent to "executes earlier in time". In the example above, by the first rule, int a = 1; happens-before int b = 2; yet the JVM is free to actually execute int b = 2; first. That is legal and violates no happens-before rule, because no one can observe the difference.

Once you understand these happens-before rules, much of the concurrent code we routinely write gains its theoretical footing. For example, by the second rule, unlocking happens-before the next locking, so code within the synchronized region is guaranteed atomicity and ordering; and since locking and unlocking also insert memory barriers, visibility is guaranteed too. That is why properly locked code is thread-safe. By the third rule, a volatile write happens-before subsequent volatile reads, which is what guarantees the visibility of volatile-modified shared variables across threads.
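Here is a minimal sketch of mine showing the volatile rule plus transitivity at work, the classic safe-publication pattern:

public class Publication {
    private int data;               // plain, non-volatile field
    private volatile boolean ready; // publication flag

    public void writer() {          // runs on thread A
        data = 42;                  // (1) program order: (1) happens-before (2)
        ready = true;               // (2) volatile write
    }

    public void reader() {          // runs on thread B
        if (ready) {                  // (3) volatile read that sees (2)
            System.out.println(data); // prints 42: (1) hb (2) hb (3), so (1) hb (3) by transitivity
        }
    }
}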

The remaining rules are easier to understand; I suggest checking them against real code to deepen your understanding, so I won't repeat them here.

conclusion

Java concurrency is an advanced topic, but it is one that senior engineers must master, however hard the bone is to chew. I hope the summaries in this article help those of you who want a thorough understanding of Java concurrency; if even one point made something click while you were reading, my goal is reached.

Finally, a shot of motivation: "Fear not that the truth is endless; every inch of progress brings its inch of joy."

Writing all this up was not easy. If you liked it, please give it a thumbs-up!