This article was compiled from: Oracle JRockit: The Definitive Guide — Understanding the JVM in Depth

By Marcus Hirt and Marcus Lagergren

Published time: December 10, 2018


Migrate the application to JRockit

Command line options

In the JRockit JVM, there are three main types of command-line options: system properties (starting with -D), standard options (starting with -X), and non-standard options (starting with -XX).

1. System properties

There are several ways to set JVM startup parameters. Parameters starting with -D are passed as system properties, which provide configuration information to Java class libraries such as RMI. For example, JRockit Mission Control prints debugging information at startup if the -Dcom.jrockit.mc.debug=true parameter is set. However, JRockit JVM versions from R28 on deprecate many of the previously used system properties in favor of non-standard options and VM flags similar to those found in HotSpot.
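As a minimal sketch of how a system property is passed on the command line and read from Java code (the property name com.example.debug is hypothetical; the Mission Control flag mentioned above is passed the same way):

// java -Dcom.example.debug=true DebugFlagDemo
public class DebugFlagDemo {
	public static void main(String[] args) {
		// Boolean.getBoolean reads the named -D system property and parses it as a boolean
		boolean debug = Boolean.getBoolean("com.example.debug");
		System.out.println("debug enabled: " + debug);
	}
}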

2. Standard options

Options starting with -X are common settings supported by most JVM vendors. For example, the option -Xmx for setting the maximum heap size is the same in most JVMs, including JRockit. There are exceptions, of course, such as the -Xverbose option in JRockit, which enables optional logging for submodules, whereas in HotSpot the similar (but actually more limited) option is -verbose.

3. Non-standard options

Command-line options starting with -XX are specific to each JVM vendor. These options may be deprecated or modified in a future release. If the JVM's parameter configuration contains command-line options starting with -XX, these non-standard options should be removed when migrating a Java application from one JVM to another, and the options appropriate to the new JVM determined before starting the application.

Adaptive code generation

The Java virtual machine

Bytecode format

Opcodes for the Java Virtual Machine

Constant pool

A program consists of data and code, where the data is used as operands. In bytecode programs, if an operand is very small or very common (such as the constant 0), it is embedded directly in the bytecode instruction.

Larger pieces of data, such as constant strings or large numbers, are stored in the constant pool at the beginning of the class file. When such data is used as operands, the index position of the data in the constant pool is used, not the actual data itself.

In addition, metadata for methods, fields, and classes in Java programs is stored in the constant pool as part of the class file.
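The constant pool of a compiled class can be inspected with the JDK's javap tool. A minimal sketch follows; the class is illustrative and the constant pool indices shown in the comments vary from compile to compile.

// ConstantPoolDemo.java
public class ConstantPoolDemo {
	static final String GREETING = "Hello, constant pool";
	public static void main(String[] args) {
		System.out.println(GREETING);
	}
}

// Compile and dump the class file, including its constant pool:
//   javac ConstantPoolDemo.java
//   javap -v ConstantPoolDemo
// The output contains entries such as (indices are illustrative):
//   #2 = String   #3        // Hello, constant pool
//   #3 = Utf8     Hello, constant pool
// and bytecode that refers to them by index, for example: ldc #2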

Adaptive code generation

Optimized dynamic program

In assembly code, method calls are made with the call instruction. The specific form of the call instruction varies from platform to platform, and the format differs between different kinds of call instructions.

In object-oriented languages, virtual method dispatch is typically compiled as an indirect call (that is, the actual call address must be read from memory) through an address in a dispatch table. This is because a virtual call may have multiple possible receivers, depending on the class inheritance structure. Each class has a dispatch table that contains information about the receivers of its virtual calls. Static methods, and virtual methods known to have only one receiver, can be compiled as direct calls to a fixed address, which in general greatly speeds up execution.

Assuming the application is developed in C++, all the structural information about the program is already available to the code generator at compile time. For example, because the code does not change as the program runs, it can be determined at compile time whether a virtual method has only one implementation. Because of this, the compiler needs no bookkeeping for code that might later be discarded, and it can turn virtual methods that have only one implementation into static calls.

If the application is developed in Java, there may initially be only one implementation of a virtual method, but Java allows new implementations to be loaded while the program runs. When the JIT compiler compiles a virtual method, it would prefer methods for which only one implementation will ever exist, so that it could make many of the same optimizations as the C++ compiler mentioned above, such as converting a virtual call into a direct call. However, because Java allows code to change while the program is running, a method that is not declared final can in principle be overridden at run time, even if that currently looks all but impossible, so the compiler cannot simply optimize it into a direct call.
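A minimal sketch of this situation is shown below; the class names are hypothetical, and the comments describe what a JIT compiler may do under the single-implementation assumption, not what any particular JVM is guaranteed to do.

class Widget {                        // no subclass has been loaded yet
	int value() { return 42; }        // virtual, but only one implementation exists
}

public class DevirtDemo {
	static int sum(Widget[] widgets) {
		int total = 0;
		for (Widget w : widgets) {
			// As long as no subclass of Widget overrides value(), the JIT may "gamble"
			// and compile this virtual call as a direct call, or even inline it.
			// If a subclass overriding value() is loaded later, the generated code
			// must be discarded and regenerated.
			total += w.value();
		}
		return total;
	}
	public static void main(String[] args) {
		System.out.println(sum(new Widget[] { new Widget(), new Widget() }));
	}
}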

In the Java world, there are scenarios where everything looks fine right now and the compiler could optimize the code heavily, but if the program changes one day, all of those optimizations have to be undone. For Java to be able to match the speed of C++ programs, some special optimizations are required.

The strategy used by the JVM is to "gamble". The code generation strategy assumes that the running code never changes, and most of the time that is true. But if the running code changes in a way that violates the assumptions behind an optimization, a callback in the runtime's bookkeeping system is triggered. At that point, code generated under the original assumptions must be discarded and regenerated, for example for a virtual call that had been converted into a direct call. The cost of "losing" is therefore high, but if the probability of "winning" is very high, the performance gained from the bet is large enough to be worth it.

In general, typical assumptions made by JVMs and JIT compilers include the following:

  • Virtual methods are not overridden. Since there is only one implementation of the virtual method, the call can be optimized into a direct call.
  • The value of a floating-point number is never NaN. In most cases, hardware instructions can be used instead of calls to the native floating-point library.
  • Few exceptions are thrown in a given try block. Therefore, the code in its catch block can be treated as cold code.
  • For most trigonometric functions, the hardware instruction fsin is accurate enough. If it is not, an exception is raised and the native floating-point library completes the calculation.
  • Lock contention will not be too fierce, so a spinlock can be used initially.
  • A lock is likely to be acquired and released repeatedly by the same thread, so the repeated acquisition and release of the lock can be elided.

Drill into the JIT compiler

Optimized bytecode

Sometimes optimizing Java source code by hand can backfire. Much hard-to-read code is claimed to be written "for performance", based on the conclusions of some benchmark report; but such benchmarks often measure only bytecode interpretation, without JIT compiler optimization, and therefore do not represent the application's real performance at runtime. For example, a server application contained a large number of iterations over array elements. Following such a report, the programmer did not write a loop condition, but instead wrote an infinite for loop inside a try block and caught the ArrayIndexOutOfBoundsException in a catch block to terminate it. Not only does this style make the code extremely unreadable, it is also much less efficient than a normal loop once the code has been optimized by the JIT compiler, as sketched below. The reason is that one of the underlying assumptions of the JVM is that exceptions are rare; based on that assumption the JVM applies optimizations that make exceptions expensive when they do occur.
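A minimal sketch of the anti-pattern described above, next to the plain loop it should have been. The method names and data are illustrative; with JIT compilation the exception-terminated version is typically the slower one.

public class LoopStyleDemo {
	// The anti-pattern: iterate "forever" and let ArrayIndexOutOfBoundsException end the loop.
	static long sumWithException(int[] data) {
		long sum = 0;
		try {
			for (int i = 0; ; i++) {    // no loop condition
				sum += data[i];
			}
		} catch (ArrayIndexOutOfBoundsException e) {
			// reached the end of the array
		}
		return sum;
	}

	// The normal version: a bounded loop, which the JIT can optimize well
	// (for example by hoisting or eliminating bounds checks).
	static long sumNormally(int[] data) {
		long sum = 0;
		for (int i = 0; i < data.length; i++) {
			sum += data[i];
		}
		return sum;
	}

	public static void main(String[] args) {
		int[] data = { 1, 2, 3, 4, 5 };
		System.out.println(sumWithException(data) + " " + sumNormally(data));
	}
}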

Code pipelining

Code Generation Overview

How registers are allocated is very important when generating optimized code. Compiler textbooks treat register allocation as a graph coloring problem, because two variables that are live at the same time cannot share the same register, which is exactly the constraint in graph coloring. Variables that are live at the same time are represented by connected nodes in a graph, so register allocation can be abstracted as "how do we color the nodes of the graph so that connected nodes get different colors", where the number of available colors equals the number of registers on the target platform. Unfortunately, graph coloring is NP-hard, meaning there is no known efficient algorithm that solves it exactly. However, approximate solutions can be computed in roughly linearithmic time, so most compilers use some variation of a graph coloring heuristic to handle register allocation, as in the sketch below.
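The following is a minimal sketch of the greedy idea behind such coloring heuristics, not JRockit's actual allocator: interfering variables are edges in a graph, colors stand for registers, and each node receives the lowest color not used by an already-colored neighbor.

import java.util.*;

public class GreedyColoringSketch {
	// adjacency.get(i) lists the variables that are live at the same time as variable i
	static int[] color(List<List<Integer>> adjacency, int numRegisters) {
		int n = adjacency.size();
		int[] color = new int[n];
		Arrays.fill(color, -1);
		for (int v = 0; v < n; v++) {
			boolean[] used = new boolean[numRegisters];
			for (int neighbor : adjacency.get(v)) {
				if (color[neighbor] >= 0) {
					used[color[neighbor]] = true;   // neighbor already owns this register
				}
			}
			for (int c = 0; c < numRegisters; c++) {
				if (!used[c]) { color[v] = c; break; }
			}
			// color[v] == -1 means no register is free: a real allocator would
			// spill this variable to memory instead.
		}
		return color;
	}

	public static void main(String[] args) {
		// Three variables: 0 interferes with 1 and 2, but 1 and 2 never overlap.
		List<List<Integer>> adj = List.of(List.of(1, 2), List.of(0), List.of(0));
		System.out.println(Arrays.toString(color(adj, 2)));   // e.g. [0, 1, 1]
	}
}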

Adaptive memory management

Fundamentals of Heap Management

Object allocation and release

In general, when memory is allocated for objects, it is not allocated directly on the heap; instead, objects are first allocated in a thread-local buffer or a similar structure. As the application runs, new objects are allocated and garbage collections are performed, and these objects may eventually be promoted to the heap or reclaimed as garbage.

In order to find a suitable place in the heap for newly created objects, the memory management system must know which parts of the heap are free, that is, not occupied by live objects. Memory management systems use free lists, lists that chain together the available memory blocks, to manage the free areas of memory and order them by some particular criterion.

When searching the free list for a free block large enough to hold a new object, one can either pick the block whose size fits best or take the first free block that fits. Several different algorithms exist for this, each with its own advantages and disadvantages, and they are discussed in more detail later; a sketch follows below.
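A minimal sketch of the two policies just mentioned, not JRockit's implementation: first fit takes the first block that is large enough, best fit scans for the smallest block that still fits. Block sizes are illustrative.

import java.util.*;

public class FreeListSketch {
	// In this sketch the free list is simply a list of available block sizes.
	static Integer firstFit(List<Integer> freeBlocks, int requested) {
		for (Integer block : freeBlocks) {
			if (block >= requested) return block;   // first block that fits
		}
		return null;                                // no block is large enough
	}

	static Integer bestFit(List<Integer> freeBlocks, int requested) {
		Integer best = null;
		for (Integer block : freeBlocks) {
			if (block >= requested && (best == null || block < best)) {
				best = block;                       // smallest block that still fits
			}
		}
		return best;
	}

	public static void main(String[] args) {
		List<Integer> freeBlocks = Arrays.asList(64, 512, 128, 256);
		System.out.println(firstFit(freeBlocks, 100));   // 512: first block that fits
		System.out.println(bestFit(freeBlocks, 100));    // 128: smallest block that fits
	}
}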

Garbage collection algorithm

In what follows, the root set refers to the initial input set of the search algorithm described above, that is, the set of objects known to be live when reference tracing begins. In general, the root set contains all objects on the stack frames of the application threads suspended for garbage collection, together with everything reachable from the user stack and the registers in the current thread contexts. In addition, the root set contains global data, such as static fields of classes. Simply put, the root set contains all objects that can be reached without tracing any references.

Java uses an exact garbage collector: the collector can distinguish object pointers from other kinds of data because it is given metadata describing them, typically derived from the compiled Java method code.

The use of signals to suspend threads has been controversial in recent years. In practice it has been found that on some operating systems, especially Linux, applications do not use and handle signals properly, and some third-party native libraries do not comply with the signal conventions, leading to signal conflicts and similar problems. Signal-related external dependencies are therefore no longer reliable.

Generational garbage collection

In fact, garbage collection can be made more efficient by dividing the heap into two or more spaces, called generations, that store objects with different life cycles. In JRockit, newly created objects are stored in a space called the nursery, which is generally much smaller than the old space; longer-lived objects are promoted to the old space as garbage collections are repeated. This gives rise to two different kinds of garbage collection, young collection and old collection, which collect the objects in their respective spaces.

A young collection is orders of magnitude faster than an old collection. Even though young collections run more frequently, they are still more efficient overall, because the life cycle of most objects is very short and they never need to be promoted to the old space. Ideally, young collections can greatly improve system throughput and help eliminate potential memory fragmentation.

Write barriers

When implementing generational garbage collection, most JVMs use a technique called write barriers to keep track of which parts of the heap need to be traversed during a collection. When object A is made to point to object B, that is, when object B becomes the value of a field of object A, the write barrier is triggered and a small amount of extra work is performed after the field assignment completes.

The traditional implementation of a write barrier divides the heap into small contiguous spaces called cards (for example, 512 bytes each), so that the heap is mapped onto a coarse-grained card table. When a Java application writes an object into an object reference field, the write barrier sets the dirty bit, marking the card on which the modified object resides as dirty.

In this way, the time needed to find references from the old generation to the young generation is shortened: to perform a young collection, the garbage collector only needs to check the memory regions of the old generation that correspond to cards marked dirty, as in the sketch below.
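A minimal sketch of the card-marking idea: the heap is divided into 512-byte cards, a byte array serves as the card table, and the barrier that runs after a reference store marks the card of the modified object as dirty. Addresses and constants are illustrative, not JRockit's actual implementation.

public class CardTableSketch {
	static final int CARD_SIZE = 512;        // bytes covered by one card
	final byte[] cardTable;                  // one byte per card: 0 = clean, 1 = dirty

	CardTableSketch(long heapSizeBytes) {
		cardTable = new byte[(int) (heapSizeBytes / CARD_SIZE)];
	}

	// Conceptual write barrier, run after a store like "holder.field = value":
	// mark the card that contains the modified object as dirty.
	void writeBarrier(long modifiedObjectOffset) {
		cardTable[(int) (modifiedObjectOffset / CARD_SIZE)] = 1;
	}

	// During a young collection, only regions whose cards are dirty need to be
	// scanned for old-to-young references.
	void scanDirtyCards() {
		for (int card = 0; card < cardTable.length; card++) {
			if (cardTable[card] == 1) {
				long regionStart = (long) card * CARD_SIZE;
				System.out.println("scan heap region starting at offset " + regionStart);
				cardTable[card] = 0;         // the card is clean again
			}
		}
	}

	public static void main(String[] args) {
		CardTableSketch sketch = new CardTableSketch(64 * 1024);
		sketch.writeBarrier(12_345);         // pretend an old object at offset 12345 was updated
		sketch.scanDirtyCards();
	}
}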

Garbage collection in JRockit

Old generation garbage collection

JRockit uses card tables not only for generational garbage collection, but also for the cleanup at the end of the concurrent mark phase, to avoid searching the entire graph of live objects. This is because JRockit needs to find out which objects were created or modified by the application while the concurrent mark was in progress. When a reference is modified, the card table is updated through the write barrier. Each region of the live object graph is represented by a card in the card table, and a card's state can be clean or dirty. At the end of the concurrent mark phase, the garbage collector only needs to examine the heap regions corresponding to cards marked dirty in order to find objects that were newly created or updated during the concurrent mark.

Performance and Scalability

Thread local allocation

In JRockit, a technique called thread local allocation is used to dramatically speed up object allocation. Normally, allocating memory for an object in a thread-local buffer is much faster than allocating it directly on the heap, where synchronization is required. Allocating directly on the heap requires locking the entire heap, which can be a disaster for multi-threaded applications. If each Java thread has a local object buffer, most object allocations can be done simply by bumping a pointer, which on most hardware platforms takes only one assembly instruction. The area reserved for such allocation is called the thread local area (TLA).

In order to make better use of CPU caches for higher performance, TLA sizes are typically between 16 KB and 128 KB, although they can also be specified explicitly with command-line arguments. When a TLA fills up, its contents are promoted to the heap. A TLA can therefore be thought of as a thread-local piece of young-generation space.

For a new operator in Java source code, after the JIT compiler has applied its optimizations for memory allocation, the pseudocode for allocating an object looks roughly like this:

Object allocateNewObject(Class objectClass) {
	Thread current = getCurrentThread();
	int objectSize = alignedSize(objectClass);
	if (current.nextTLAOffset + objectSize > TLA_SIZE) {
		current.promoteTLAToHeap();    // slow and synchronized
		current.nextTLAOffset = 0;
	}
	Object ptr = current.TLAStart + current.nextTLAOffset;
	current.nextTLAOffset += objectSize;
	return ptr;
}

To focus on the allocation path, many related operations have been omitted from the pseudocode above. For example, if the object to be allocated is too large, that is, it exceeds a certain threshold or simply cannot fit in a TLA, memory is allocated for it directly on the heap.

The NUMA architecture

The emergence of the non-uniform memory access (NUMA) architecture has brought new challenges for garbage collection. Under NUMA, different processor cores typically access their own memory address spaces, avoiding the bus delays caused by multiple CPU cores accessing the same memory. Each CPU core has its own dedicated memory and bus, so a core accesses its own memory quickly but accesses the memory of neighboring cores more slowly; the farther apart the cores are, the slower the access (depending on the configuration). Traditionally, multi-core CPUs have followed the uniform memory access (UMA) architecture, in which all CPU cores access all memory uniformly, with no such distinction.

To take advantage of the NUMA architecture, the organization of garbage collector threads should be adjusted accordingly. If a CPU core is running a marking thread, the portion of heap memory that the thread accesses is best placed in that core's local memory to get the most out of NUMA. In the worst case, when the objects a marking thread accesses are in the local memory of another NUMA node, the garbage collector usually needs heuristic object-moving algorithms to ensure that objects used together end up stored in the same place. If these heuristics work well, they can give a significant performance improvement. The main problem is preventing objects from being moved back and forth between the local memories of different NUMA nodes. In theory, an adaptive runtime should be able to handle this well.

Large pages

Memory allocation is backed by the operating system and the page tables it maintains. Operating systems manage physical memory by dividing it into pages, the smallest units of memory actually allocated at the operating system level. Traditionally, pages are 4 KB, page handling is transparent to processes, and processes work with virtual address spaces rather than real physical addresses. To speed up the translation of virtual pages into physical memory addresses, a cache called the translation lookaside buffer (TLB) is used. If pages are very small, the result can be frequent TLB misses.

One way to fix this problem is to increase the page size by several orders of magnitude, for example to sizes measured in MB. Modern operating systems generally support such large-page mechanisms.

Obviously, when multiple processes allocate memory in their own address spaces and pages are large, fragmentation becomes more of a problem: a lot of storage is wasted if, for example, a process allocates only slightly more memory than one page. For a runtime that manages its own memory allocation and reclamation within the process and has plenty of memory available, this is not a problem, because the runtime can abstract the large pages into virtual pages of varying sizes.

Typically, using large pages improves overall system performance by at least 10% for applications that allocate and reclaim memory frequently. JRockit has good support for large pages.

Near real-time garbage collection

JRockit Real Time

The trade-off for low latency is increased overall garbage collection time. Collecting garbage concurrently while the program is running is harder than parallel stop-the-world collection, and having to interrupt the collection frequently adds further overhead. In practice this is not a big deal, as most users of JRockit Real Time care more about the predictability of the system than about reducing the total time spent in garbage collection. Most users consider a sudden spike in pause times more harmful than an increase in total garbage collection time.

Soft real-time effectiveness

Soft real-time is the core mechanism of JRockit Real Time. But how does a nondeterministic system provide a specified degree of determinism; for example, how can something like a garbage collector guarantee that an application's pause time never exceeds a certain threshold? Strictly speaking, no such guarantee can be given, but since the extreme cases are rare, in practice it does not matter much.

Of course, there is no foolproof solution, and there are certainly scenarios where the pause-time target cannot be met. But it turns out that for applications where roughly 30 to 50 percent of the heap is live, JRockit Real Time performs as well as the service needs; this 30 to 50 percent threshold has increased with each release of JRockit Real Time, while the achievable pause-time threshold has decreased.

How it works

  • Efficient parallel execution
  • Splitting the garbage collection work into small work packets that can be aborted and rolled back
  • Efficient heuristics

In fact, the key to achieving low latency is still to keep the Java application running as much as possible, while keeping heap usage and fragmentation low. JRockit Real Time uses a greedy strategy: delay stop-the-world garbage collection as long as possible, in the hope that the application itself will resolve the situation, and reduce the number of STW operations that must be performed, preferably touching as few objects as possible.

In JRockit Real Time, the garbage collector's work is divided into several subtasks. If the application's pause exceeds a threshold while one of these subtasks is being performed (such as compacting part of the heap), the subtask is abandoned and the application resumes execution. Users specify, according to their business needs, the total time the garbage collector may use. In some cases some subtasks complete but there is not enough time to finish the whole collection; to keep the application running, the remaining subtasks must then be abandoned and carried out again at the next garbage collection. The shorter the specified response time, the more subtasks are abandoned.

The mark phase, described earlier, is easy to adapt in this way and can be executed concurrently with the application. But the sweep and compaction phases require suspending the application threads (STW). Fortunately, the mark phase accounts for about 90% of the total garbage collection time. If the application is paused for too long, the current garbage collection subtask has to be aborted and re-executed later, in the expectation that the situation will resolve itself. Garbage collection is divided into subtasks precisely to make this possible.

APIs for memory operations

Finalizers

The design of finalizers in Java is a mistake and they should be avoided.

This is not only our opinion, but also the consensus of the Java community.

Differences in JVM behavior

For the JVM, keep in mind that the programming language can only hint to the garbage collector; Java itself is not designed to give precise control over the memory system. For example, it would be unrealistic to assume that soft references implemented by two different JVM vendors have the same lifetime in a cache.

Another problem is the misuse of the System.gc() method by many users. System.gc() is merely a hint to the runtime that now may be a good time to do garbage collection. In some JVM implementations, frequent calls to this method result in frequent garbage collections; in others, the call is ignored most of the time.

In my past work as a performance consultant, I have seen this method abused many times. Often, just removing a few calls to System.gc() improved performance significantly, which is why JRockit has the command-line argument -XX:AllowSystemGC=false to disable the System.gc() method.

Traps and pseudo-optimizations

Some developers sometimes write "optimized" code in the hope that it will help the garbage collector, but in reality this is just an illusion. Remember, premature optimization is the root of all evil. In Java it is difficult to control the behavior of garbage collection at the language level. The main problem is that developers mistakenly assume the garbage collector works in one fixed way and try to control it.

In addition to explicit garbage collection, object pooling is another common form of pseudo-optimization in Java. Some people think that keeping a pool of live objects to reuse already-created objects improves garbage collection performance, but in reality object pooling not only adds complexity to the application, it is also error-prone. For modern garbage collectors, use the java.lang.ref.Reference family of classes to implement caches, or simply null out references to objects that are no longer needed, as in the sketch below.
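A minimal sketch of the alternative suggested above: instead of pooling objects, hold cached values through java.lang.ref.SoftReference so that the garbage collector may reclaim them under memory pressure. The cache class itself is hypothetical, not an API from the book.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch<K, V> {
	private final Map<K, SoftReference<V>> cache = new HashMap<>();

	public void put(K key, V value) {
		cache.put(key, new SoftReference<>(value));
	}

	// Returns the cached value, or null if it was never cached or has been
	// reclaimed by the garbage collector under memory pressure.
	public V get(K key) {
		SoftReference<V> ref = cache.get(key);
		return (ref == null) ? null : ref.get();
	}

	public static void main(String[] args) {
		SoftCacheSketch<String, byte[]> cache = new SoftCacheSketch<>();
		cache.put("page-1", new byte[1024]);
		System.out.println(cache.get("page-1") != null);   // true, unless already collected
	}
}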

In fact, with modern VMs it is possible to write applications that run well by simply applying textbook techniques, such as using the java.lang.ref.Reference classes correctly and paying attention to the dynamic nature of Java. If an application truly has hard real-time requirements, it should not have been written in Java in the first place, but rather in a static language that allows the programmer to control memory manually.

Memory management in JRockit

It's important to note that tinkering with JVM parameters will not necessarily improve application performance, and may actually hurt it.

Threads and Synchronization

The basic concept

Each object holds information related to synchronization, such as whether the object is currently used as a lock and the specific implementation of that lock. Typically, this information is stored in a lock word in each object's header for quick access. JRockit also uses bits of the lock word to store garbage collection state, but it is still referred to in this book as the lock word even though it contains garbage collection information.

Object headers also contain a pointer to type information, which in JRockit is called the class block. To save memory and speed up dereferencing, all words in the JRockit object header are 32 bits long on all CPU platforms. The class block is a 32-bit pointer to an external structure that contains information such as the object's type and its virtual dispatch table.

An atomic operation is a native instruction that either executes completely or not at all. Once an atomic instruction has executed, its result must be visible to all potential observers.

Atomic operations, used to read and write lock words, are exclusive and form the basis for implementing synchronized blocks in the JVM, as the example below shows.
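A minimal example of the kind of atomic compare-and-set operation referred to above, using java.util.concurrent.atomic.AtomicInteger: the update either happens completely or not at all, and the result is visible to all threads.

import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
	public static void main(String[] args) {
		AtomicInteger counter = new AtomicInteger(0);

		// compareAndSet(expected, newValue) succeeds only if the current value
		// still equals "expected"; the whole check-and-update is one atomic step.
		boolean first = counter.compareAndSet(0, 1);    // true: 0 -> 1
		boolean second = counter.compareAndSet(0, 2);   // false: value is already 1

		System.out.println(first + " " + second + " " + counter.get());   // true false 1
	}
}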

Difficult to debug

A deadlock occurs when two threads are each waiting for the other to release a resource it needs, so both go to sleep; clearly, neither will ever wake up. A livelock is similar, except that the threads keep actively doing something in the contest yet still cannot acquire the lock. It is like two people meeting face to face in a narrow corridor: each steps sideways to let the other pass, but they keep mirroring each other and neither gets through.

Java API

The synchronized keyword

In Java, the keyword synchronized is used to define a critical section, which can be either a block of code or a complete method, as shown below:

public synchronized void setGadget(Gadget g){
	this.gadget = g;
}

The method definition above contains the synchronized keyword, so only one thread at a time can modify the gadget field for a given object.

In a synchronized instance method, the monitor object is implicit: it is the current object, this. For a static synchronized method, the monitor object is the Class object of the declaring class. The example above is equivalent to the following:

public void setGadget(Gadget g){
	synchronized(this){
		this.gadget = g;
	}
}

The java.lang.Thread class

Threads in Java also have a priority, but how it is honored depends on the JVM implementation. The setPriority method sets the priority of a thread, hinting to the JVM that the thread is more or less important. Of course, for most JVMs, explicitly changing thread priorities does not help much. The JRockit JVM may even ignore Java thread priorities when the runtime decides "there is a better way".

A running thread can voluntarily give up the rest of its time slice by calling yield so that another thread can run, put itself to sleep by calling sleep, or wait for another thread to finish before continuing by calling join.

The volatile keyword

In a multithreaded environment, a write to a field or memory address is not necessarily immediately visible to other running threads. For scenarios where all threads must see the most recent value of a field, Java provides the volatile keyword.

Marking a field volatile ensures that writes to it go all the way to memory. Otherwise, a write may initially reach only the CPU cache and be flushed to memory later, and because of this, different threads may see different values for the same field. Currently, JVMs implement the volatile keyword by inserting memory barrier code after the write to the field, which carries a performance penalty.

It is often difficult to see why different threads can observe different values for the same field. In general, the memory model of today's machines is strong enough, or the application is structured in a way that hides the problem. However, given that the optimizing JIT compiler may transform the program heavily, problems can still occur if the developer is not careful. The following example shows why memory semantics matter in Java programs, even when the problem is not immediately apparent.

public class MyThread extends Thread {
	private volatile boolean finished;
	public void run() {
		while (!finished) {
			// do work
		}
	}
	public void signalDone() {
		this.finished = true;
	}
}

If the finished field were not declared volatile, the JIT compiler could, in principle, optimize the loop by loading finished only once before the loop begins, which would change the meaning of the code: if finished were false at that point, the program would fall into an infinite loop even after another thread called signalDone. The Java language specification permits the compiler, if it sees fit, to keep thread-local copies of non-volatile variables for later use.

Caution should be exercised, because memory barriers are commonly used to implement the semantics of the volatile keyword, and they can invalidate CPU caches and reduce overall application performance.

Java thread and synchronization mechanism implementation

Java memory model

Currently, data caches are used throughout CPU architectures to greatly improve the speed at which the CPU reads and writes data and to reduce contention on the processor bus. As with all caching systems, there are consistency issues, which are particularly important for multiprocessor systems, since multiple processors may access the same memory location at the same time. The memory model defines whether different CPUs will see the same value when they access the same memory location at the same time.

A strong memory model (such as on the x86 platform) means that after one CPU changes the value of a memory location, the other CPUs more or less automatically see the new value, and writes to memory become visible in the same order in which they appear in the code. A weak memory model (such as on the IA-64 platform) means that after one CPU modifies a memory location, the other CPUs will not necessarily see the new value unless the writing CPU executes special memory barrier instructions. More generally, all memory accesses performed by a Java program should eventually be visible to all other CPUs, but there is no guarantee of immediate visibility.

Implementation of synchronization

Underlying mechanism

At the level of the underlying CPU, synchronization is implemented with atomic instructions, which vary from platform to platform. On x86, for example, a dedicated lock prefix is used to make instructions atomic in a multiprocessor environment.

On most CPU architectures, ordinary instructions (such as addition and subtraction) can be made atomic in this way.

At the micro-architecture level, atomic instructions execute differently on different platforms. Typically, an atomic instruction pauses the dispatch of instructions in the CPU pipeline until all in-flight instructions have finished and their results have been flushed to memory. The CPU also blocks other CPUs from accessing the relevant cache line until the atomic instruction completes. On modern x86 hardware, if a fence instruction interrupts the execution of more complex instructions, an atomic instruction may take many clock cycles to complete. Therefore it is not only large numbers of critical sections that hurt performance: when small critical sections are locked and unlocked frequently, the performance loss is also large.

Implementation of synchronization in bytecode

Java bytecode has two instructions for synchronization, monitorenter and monitorexit, each of which pops an object off the execution stack as its operand. When source code is compiled with javac, monitorenter and monitorexit instructions are generated for synchronized blocks that explicitly use a monitor object, as in the sketch below.
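A minimal sketch of where these instructions come from: a synchronized block compiled with javac and disassembled with javap -c contains a monitorenter before the protected code and a monitorexit on both the normal and the exceptional exit path. The class is illustrative and the disassembly in the comments is abridged; exact offsets vary.

public class MonitorDemo {
	private int counter;
	private final Object lock = new Object();

	public void increment() {
		synchronized (lock) {   // javac emits monitorenter/monitorexit for this block
			counter++;
		}
	}
}

// javac MonitorDemo.java && javap -c MonitorDemo
// The bytecode of increment() contains, among other instructions (abridged):
//   ...
//   monitorenter      // acquire the monitor of the object on the stack
//   ...
//   monitorexit       // release on the normal path
//   ...
//   monitorexit       // release again on the exception path
//   athrow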

Optimization for threading and synchronization

Lock expansion and lock contraction

By default, JRockit uses a small spinlock as the first stage of a newly inflated fat lock, for a short period of time. At first glance this may seem counter-intuitive, but it can be very helpful. If lock contention is really intense and causes threads to spin for a long time, this behavior can be disabled with the command-line argument -XX:UseFatSpin=false. As part of the fat lock, the spinlock can also take advantage of feedback from the adaptive runtime; this feature is disabled by default and can be enabled with the command-line argument -XX:UseAdaptiveFatSpin=true.

Lazy unlocking

What if most unlock operations are followed by the same thread immediately locking the object again, so that the repeated thread-local unlocking and re-locking only reduces execution efficiency? Is this a common pattern in running programs? Can the runtime assume that an individual unlock operation is actually unnecessary?

The runtime can make this assumption if every time a lock is released, it is soon acquired again by the same thread. The assumption breaks down as soon as another thread tries to acquire the monitor object that appears to be unlocked, which must remain semantically possible; in that case the thread that originally held the monitor must be forced to release the lock. This implementation is called lazy unlocking (described elsewhere as biased locking).

Even if a lock is completely uncontended, the overhead of performing lock and unlock operations is still greater than doing nothing at all. Using atomic instructions imposes extra execution overhead on all the Java code around them.

From the above it follows that it makes sense to assume that most locks are only used thread-locally, without contention. In that case, lazy unlocking improves system performance. Of course, there is no such thing as a free lunch: if a thread tries to acquire a monitor object that has been lazily kept locked, the overhead is higher than acquiring an ordinary monitor object, because the seemingly unlocked monitor must first be forcibly released. Therefore, the runtime cannot always assume that unlock operations are unnecessary; it needs to adapt its behavior to how the program actually behaves.

1. Implementation

Implementing the semantics of delayed unlock is actually quite simple.

For the monitorenter instruction:

  • If the object is unlocked, the locking thread takes the lock and keeps it, and the object is marked as lazily locked.
  • If the object is already marked as lazily locked:
    • If it is locked by the same thread, do nothing (this is in effect a recursive lock).
    • If it is locked by another thread, the holding thread must be suspended so that the runtime can determine whether the object is actually locked or merely appears locked, an expensive step that involves walking the holding thread's call stack. If the object is genuinely locked, the lock is converted to a thin lock; otherwise it is forcibly released so that the new thread can acquire it.

For the monitorexit instruction: if the object is lazily locked, do nothing and leave it in its locked state, that is, perform a lazy unlock.

In order to revoke a thread's ownership of a lazily held lock, the thread must first be suspended, which is expensive. The actual state of the lock is then determined by examining the lock tokens on the suspended thread's stack. Lazy unlocking uses its own lock token value to indicate "this object is lazily locked".

If a lazily locked object never has to be revoked, that is, all its locking really is thread-local, lazy unlocking can greatly improve system performance. In practice, however, if the assumption does not hold, the runtime has to revoke lazily locked objects over and over again, which is an unacceptable performance cost. The runtime therefore keeps track of how often a monitor object is acquired by different threads; this information is stored in the lock word of the monitor object and is called the transfer bit.

If a monitor object is transferred between threads too many times, that object, its class, or even all instances of its class may have lazy unlocking disabled, falling back to the standard thin and fat locks for locking and unlocking.

Consider an object that starts out unlocked: thread T1 executes monitorenter on it, putting it into the lazily locked state. When thread T1 later executes monitorexit on the object, it pretends to be unlocked but actually remains locked, with the lock word still containing thread T1's ID. If thread T1 locks the object again after this point, no work is needed.

If another thread T2 then tries to acquire the same lock, the assumption that "the lock is mostly used by thread T1" no longer holds and a performance penalty is paid: thread T1's ID in the lock word is replaced with thread T2's. If this happens frequently, the object may have lazy unlocking disabled and be treated as an ordinary thin lock.

Traps and pseudo-optimizations

Thread.stop, Thread.resume, and Thread.suspend

Never use the Thread.stop, Thread.resume, or Thread.suspend methods and be careful with legacy code that uses them.

It is generally recommended to use wait, notify, or volatile variables for synchronization between threads.

Double check the lock

Even ordinary-looking code can run into problems if one lacks an understanding of the memory model and the CPU architecture. The following code, for example, is meant to implement the singleton pattern.

public class GadgetHolder {
	private Gadget theGadget;
	public synchronized Gadget getGadget() {
		if (this.theGadget == null) {
			this.theGadget = new Gadget();
		}
		return this.theGadget;
	}
}

The code above is thread-safe because the getGadget method is synchronized. However, once the Gadget constructor has run, the synchronization is pure overhead, so it is tempting to "optimize" the method into the code below.

public Gadget getGadget() {
	if (this.theGadget == null) {
		synchronized (this) {
			if (this.theGadget == null) {
				this.theGadget = new Gadget();
			}
		}
	}
	return this.theGadget;
}

The code above uses a seemingly "clever" trick: if the object already exists, it is returned directly without synchronizing; only if the object has not yet been created does the code enter the synchronized block. This appears to keep things "thread-safe".

The code above is known as double-checked locking. Here is what can go wrong with it. Suppose one thread has passed the inner null check and starts initializing the theGadget field: it needs to allocate memory for the new object and assign the reference to the field. However, this sequence of operations is not atomic and its ordering is not guaranteed. If a thread context switch happens at this point, the theGadget value seen by another thread may refer to an object that is not yet fully initialized; the outer check then passes and the partially initialized object is returned. It is not only object creation that is problematic; other data types need care as well. For example, on 32-bit platforms writing a long typically requires two 32-bit writes, whereas writing an int has no such issue.

This problem can be solved by declaring the theGadget field volatile (note that this only works with the newer version of the Java memory model), which adds execution overhead, though less than synchronized. If you are not sure that the memory model is implemented correctly on the platforms you target, avoid double-checked locking altogether. There are many articles explaining why double-checked locking should not be used, not only in Java but in other languages as well; a volatile-based variant is sketched below.
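For completeness, a minimal sketch of the volatile variant discussed above, assuming a Gadget class as in the surrounding examples; it is only valid under the newer Java memory model (Java 5 and later).

public class GadgetHolder {
	private volatile Gadget theGadget;   // volatile is what makes this variant safe

	public Gadget getGadget() {
		if (this.theGadget == null) {            // first check, without the lock
			synchronized (this) {
				if (this.theGadget == null) {    // second check, under the lock
					this.theGadget = new Gadget();
				}
			}
		}
		return this.theGadget;
	}
}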

The danger of double-checked locking is that on a strong memory model it rarely breaks a program. The Intel IA-64 platform, with its notoriously weak memory model, is a classic example: Java applications that ran well elsewhere failed there. If an application runs well on x86 but fails on IA-64, it is easy to suspect a bug in the JVM and overlook the possibility that the Java application itself is at fault.

Using static attributes to implement the singleton pattern can achieve the same semantics without using double-checked locks, as shown below:

public class GadgetMaker{
	public static Gadget theGadget= new Gadget();
}

The Java language guarantees that class initialization is atomic, and since the GadgetMaker class has no other fields, an instance of Gadget is automatically created and assigned to theGadget the first time the class is actively used. This approach works correctly under both the old and the new memory model.

In summary, there are a number of pitfalls in parallel programming with Java, and most of them can be avoided by understanding the Java memory model correctly. Developers tend not to think much about the underlying hardware architecture, but failing to understand the Java memory model will sooner or later come back to bite them.

Benchmarking and performance tuning

Wait, notify, and fat lock

Java is not a panacea

Java is a powerful general-purpose programming language, with friendly semantics and good development productivity, but Java is not a panacea. Here are some scenarios that should not be addressed with Java:

  • To develop a telecom application with near real-time requirements, and in which there will be tens of thousands of threads executing concurrently.
  • The data returned by the database layer of an application is often 20MB byte arrays.
  • Deterministic application performance and behavior are completely dependent on the underlying operating system’s scheduler, and even small changes in the scheduler can have a large impact on application performance.
  • Develop device drivers.
  • There is so much legacy code developed in languages like C/Fortran/COBOL that the team doesn’t have a handy tool to convert it to Java code.

In addition to the examples above, there are many other scenarios where Java is not appropriate. The "write once, run anywhere" promise, achieved by having the JVM abstract away the underlying operating system, also gets a lot of attention. To exaggerate a little, ANSI C can achieve the same thing; it just takes much more effort to address portability when writing the source code. Therefore, choose the appropriate tool for the actual scenario. Java is easy to use, but do not abuse it.

blogger

Personal WeChat official account:

Personal GitHub:

github.com/jiankunking

Personal Blog:

jiankunking.com