Abstract: BiSheng JDK is Huawei's customized, open-source build of OpenJDK. It is a high-performance OpenJDK distribution that can be used in production environments.

This article is shared from the Huawei Cloud community post "[Cloud In Co-creation] BiSheng JDK: the legend returns. How does Huawei create the best JDK on ARM?", by White deer is the first handsome.

Preface

Have you heard of or used BiSheng JDK? Do you work with Java, or on low-level JVM development? The vast majority of Java developers use Oracle JDK or OpenJDK. This article introduces Huawei's BiSheng JDK and its technical optimizations, in the hope of giving you another choice besides those two.

1. What is BiSheng JDK?

1.1 Development history of BiSheng JDK

BiSheng JDK is Huawei's high-performance OpenJDK distribution that can be used in production environments. It is widely used inside Huawei, where the team has accumulated rich development experience and solved many difficult problems encountered in real business workloads, such as crashes.

1.2 Architectures supported by BiSheng JDK

  • Currently, only the Linux/AArch64 platform is supported. Developers are welcome to download and use it.
  • Currently, BiSheng JDK supports the LTS versions 8 and 11, and both are fully open source.

1.3 Differences between BiSheng JDK, OpenJDK and Oracle JDK

We compare BiSheng JDK, OpenJDK and Oracle JDK to help you choose the JDK that suits you best.

As shown in the figure below, OpenJDK is represented in blue, Oracle JDK in light yellow and BiSheng JDK in red.

From the figure we can see:

  • BiSheng JDK, like Oracle JDK, is customized from OpenJDK, but the two differ in their commercial features. For example, OpenJDK 12 added a new garbage collector (GC) called Shenandoah, but it was not shipped in the Oracle JDK release.
  • On top of OpenJDK, BiSheng JDK differs mainly in enhancements to product features, bug fixes and features taken from upstream.

2. Why build BiSheng JDK?

2.1 The Oracle JDK license has changed

  • Aside from the “well known” reasons, you may not know that Oracle JDK requires a paid license after 8u212. For a company, weighing the security vulnerabilities of the JDK itself together with commercial factors, the conclusion is to develop a JDK that fits its own needs.

Note: The above data are from Oracle’s official website.

2.2 Demand for valuable newer features on older JDKs

A new JDK version is released every six months, so there are many JDK versions, each with different features. Developers want to use as many valuable features from newer versions as possible on the JDK they are most familiar with. For example, G1 GC gained a feature in JDK 12 that returns unused memory to the operating system, which is very valuable in cloud scenarios, yet JDK 8 is still the mainstream version in use.

2.3 Application customization and optimization demands

Applications have special demands tied to the hardware and scenarios they run on, but such demands are hard to get into the community in the short term. For example, big data applications are heavy on mathematical computation, so compiler optimizations in the JDK such as loop unrolling and instruction optimization are needed to accelerate it.

3. Current status of BiSheng JDK

3.1 Development status of BiSheng JDK

  • BiSheng JDK, like Oracle JDK, is customized from the open-source OpenJDK. At the same time, the team contributes many valuable patches to the upstream community, covering garbage collection, the JIT, the runtime and more.
  • BiSheng JDK is open source under the GPLv2 license, and the official binaries can be downloaded for free.
  • BiSheng JDK is developed and operated as a community with a biweekly meeting; partners such as ARM, Boland and Kirin currently participate. The community is not limited to the ARM platform: any question about the JDK can be raised there and will get a prompt response.
  • In the upstream community, the team currently has 10 members (1 Reviewer, 1 Committer, 8 Authors) submitting code.
  • BiSheng JDK has excellent performance and stability on ARM.

3.2 Example of BiSheng JDK performance improvement

We evaluated BiSheng JDK in the following test environment:

  • Model: TaiShan 2280 V2
  • OS: openEuler 20.09
  • HW: Kunpeng 920-6426, 2600 MHz, 128 cores
  • JDK: JDK 8u262

On SPECjbb, BiSheng JDK showed significant improvements in both metrics: 55% in critical-jOPS and 16% in max-jOPS.

The SPECjvm results are less striking, but still show an average improvement of 4.6%.

4. GC algorithm optimization in BiSheng JDK

4.1 The parallel copying algorithm

Copying is an important part of GC algorithms, especially in young-generation collection, where live objects are copied from the From space to the To space. Serial copying is done by a single thread, which is not fast enough, so a parallel copying algorithm is used instead. What is a parallel copying algorithm?

  • In the parallel copying algorithm, objects A and B may be copied by different threads, for two possible reasons: A and B are reached through different paths that are processed by different threads, or, for load balancing, a thread steals copy tasks from another thread.
  • For example, if two threads T1 and T2 copy objects A and B respectively: T1: A → A´; T2: B → B´.
  • Besides copying the contents of the object, a forwarding pointer records the object's new address after it is moved, which prevents the object from being copied more than once.

4.2 Influence of the architecture on the parallel copying algorithm

  • Multithreaded parallel work has to take the memory model of each architecture into account. x86 has a strong memory ordering, while ARM has a weak memory ordering. Their memory ordering is shown in the following table:

  • With the parallel copying algorithm on a weakly ordered architecture, another thread may observe that the forwarding pointer has been updated before the object contents have been copied. To guarantee consistency, a memory barrier (membar) has to be inserted between copying the object and updating the object header; in the JVM this is abstracted into the CAS function.
  • CAS is implemented differently on different architectures: x86 uses the cmpxchgl instruction, while ARM uses the LDAXR/STLXR instructions.

4.3 Flow of the parallel copying algorithm

The flow chart of the parallel copying algorithm is as follows:

  • (1) Copy object obj to new_obj.
  • (2) Insert a memory barrier and set the forwarding pointer of obj through CAS. If the CAS succeeds, go to (3); if it fails, go to (4).
  • (3) Push a reference to new_obj onto the stack and return new_obj.
  • (4) Undo the allocation just made and return the new_obj installed by the thread whose CAS succeeded.
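The four steps above can be sketched in plain Java. This is only an illustration of the control flow, not HotSpot code: HeapObject, its forwardee field and copyToSurvivor are hypothetical names, the forwarding pointer is modeled with an AtomicReference, and "undoing the allocation" is modeled by simply dropping the losing copy.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical heap object: a payload plus a forwarding pointer in its "header".
final class HeapObject {
    final int[] payload;
    // null means "not yet copied"; otherwise it points at the surviving copy.
    final AtomicReference<HeapObject> forwardee = new AtomicReference<>();

    HeapObject(int[] payload) {
        this.payload = payload;
    }
}

final class ParallelCopySketch {
    // Returns the single surviving copy of obj, whichever thread created it.
    static HeapObject copyToSurvivor(HeapObject obj) {
        // Step 1: copy obj to new_obj.
        HeapObject newObj = new HeapObject(obj.payload.clone());
        // Step 2: install the forwarding pointer with CAS. compareAndSet has
        // volatile semantics, so the copied payload is published before the pointer.
        if (obj.forwardee.compareAndSet(null, newObj)) {
            // Step 3: this thread won; a real collector would also push newObj
            // onto its task stack here so that its fields get scanned.
            return newObj;
        }
        // Step 4: another thread won; drop our copy and return the winner's.
        return obj.forwardee.get();
    }

    public static void main(String[] args) throws InterruptedException {
        HeapObject obj = new HeapObject(new int[] {1, 2, 3});
        HeapObject[] results = new HeapObject[2];
        Thread t1 = new Thread(() -> results[0] = copyToSurvivor(obj));
        Thread t2 = new Thread(() -> results[1] = copyToSurvivor(obj));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("both threads see the same copy: " + (results[0] == results[1]));
    }
}
```

Whichever thread loses the race still returns the winner's copy, so every reference to obj ends up pointing at one object.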

Hotspot analysis showed that 60% of the CPU time spent in the copy operation went to the inserted memory barrier.

4.4 Optimizing the algorithm to reduce memory barriers: Q&A

Q: If no memory barrier is inserted and several threads observe memory inconsistently, under what circumstances does a problem arise?

A:

  • T1: the object copy is not yet complete, but the object has already been pushed.
  • T2: steals the object from T1's thread stack, then copies and updates the member variables of an object whose copy is not yet complete, resulting in inconsistent data.

Q: Is the memory barrier still needed for objects whose member variables do not need to be copied (for example, objects whose member variables are all of non-reference types, objects whose reference-typed member variables are all NULL, and objects that are themselves primitive arrays)?

A: No!

Q: How do you identify these objects?

A:

  • Static analysis: objects whose member variables are all of non-reference types, and arrays of primitive types, can be identified statically. This part is open source.
  • Dynamic analysis: such objects are identified with barrier techniques.
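To make the static criterion concrete, here is a minimal sketch that applies it with Java reflection. The JVM makes this decision from its internal class layout rather than through reflection, so ReferenceFreeCheck, isReferenceFree, Point and Node are hypothetical names used only to illustrate the rule "no reference-typed instance fields, or an array of primitives".

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class ReferenceFreeCheck {
    // True when instances of the given type cannot hold references to other
    // heap objects: primitive arrays, or classes with only primitive instance fields.
    static boolean isReferenceFree(Class<?> type) {
        if (type.isArray()) {
            return type.getComponentType().isPrimitive();
        }
        // Walk the class and all of its superclasses.
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                boolean instanceField = !Modifier.isStatic(f.getModifiers());
                if (instanceField && !f.getType().isPrimitive()) {
                    return false;
                }
            }
        }
        return true;
    }

    // Illustration types (assumptions, not from the article).
    static final class Point { int x; int y; }
    static final class Node { int value; Node next; }

    public static void main(String[] args) {
        System.out.println("int[]    -> " + isReferenceFree(int[].class));
        System.out.println("Object[] -> " + isReferenceFree(Object[].class));
        System.out.println("Point    -> " + isReferenceFree(Point.class));
        System.out.println("Node     -> " + isReferenceFree(Node.class));
    }
}
```

Objects whose reference fields all happen to be NULL cannot be found this way, because that depends on runtime values; that case belongs to the dynamic analysis.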

By optimizing the parallel copying algorithm, we achieved the expected improvements on both SPECjbb and SPECjvm, as shown in the figure below:

4.5 Optimization of G1 Full GC

G1 Full GC is divided into four phases:

  • Mark: marks the live objects in the whole heap and records them.
  • Prepare: computes the address each live object will have after in-place compaction.
  • Adjust: updates the references held in object member variables to the objects' new addresses.
  • Compact: copies the objects' memory to their new locations.

The Compact phase is generally the most time-consuming because it moves memory. So, if a certain amount of wasted space is acceptable, can regions with many live objects be left unmoved, or moved less, to make the algorithm more efficient? We plotted the proportion of live objects per region in the following figure:

We can find:

  • The proportion of live objects per region follows a U-shaped distribution.
  • In the benchmark, 41.27% of the regions have live objects accounting for 98% of the region.
  • Reducing object movement is also consistent, to some extent, with the strong generational hypothesis.
  • Tests showed a 3% to 5% performance improvement for applications of this kind.
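The idea of leaving densely populated regions in place can be sketched as a simple filter over per-region statistics. This is an illustration under assumptions, not the contributed G1 code: Region, regionsToCompact and the 0.98 threshold are hypothetical, with the threshold chosen only to echo the 98% figure above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompactionPlanSketch {
    // Hypothetical per-region statistics gathered during the Mark phase.
    static final class Region {
        final int id;
        final long liveBytes;
        final long capacityBytes;

        Region(int id, long liveBytes, long capacityBytes) {
            this.id = id;
            this.liveBytes = liveBytes;
            this.capacityBytes = capacityBytes;
        }

        double liveRatio() {
            return (double) liveBytes / capacityBytes;
        }
    }

    // Regions at or above the threshold are left in place (accepting some
    // wasted space); only the remaining regions are compacted.
    static List<Region> regionsToCompact(List<Region> regions, double skipThreshold) {
        List<Region> result = new ArrayList<>();
        for (Region r : regions) {
            if (r.liveRatio() < skipThreshold) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Region> heap = Arrays.asList(
                new Region(1, 990, 1000),  // almost full: not worth moving
                new Region(2, 100, 1000),  // mostly garbage
                new Region(3, 500, 1000)); // half full
        for (Region r : regionsToCompact(heap, 0.98)) {
            System.out.println("compact region " + r.id);
        }
    }
}
```

With a U-shaped distribution, a high threshold skips a large share of the regions while wasting little space, because the skipped regions are almost entirely live.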

We have contributed the relevant code to the community; you are welcome to check it out.

4.6 Optimization of ZGC

  • BiSheng JDK 11 is the first JDK to support ZGC on the ARM architecture.
  • The goal of ZGC is to manage terabytes of memory with garbage collection pauses of no more than 10 milliseconds. A ZGC cycle consists of three steps: concurrent Mark, concurrent Relocate and concurrent Remap. To make relocation efficient, a page takes part in relocation only when the proportion of reclaimable garbage in it reaches a certain percentage. In the current implementation this percentage is controlled by the parameter ZFragmentationLimit, which defaults to 25.
  • How should ZFragmentationLimit be set? If it is too large, memory is wasted; if it is too small, reclamation becomes inefficient.
  • The optimization gathers relocation statistics during GC execution (memory relocation rate, relocation time), predicts how much memory the next GC can relocate, and uses the predicted value to control which pages take part in relocation. As shown below:

  • Compute the memory relocation rate:

  • Predict the relocation rate of the GC:

  • A normal distribution is used, with 99% confidence.
  • Predict the relocation time of the GC:

  • Predict the number of bytes relocated by this GC:

  • Benchmark tests show a 3% to 5% improvement. The code is open source and is being synchronized to the community.
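The formulas behind these steps are not reproduced here, so the following is only one plausible reading, stated as an assumption: the rate is bytes relocated divided by relocation time, each prediction is a conservative bound (mean minus 2.326 standard deviations, the one-sided 99% quantile of a normal distribution), and the predicted bytes are the predicted rate times the predicted time. All names in the sketch are hypothetical.

```java
final class RelocationPredictorSketch {
    // One-sided 99% quantile of the standard normal distribution.
    static final double Z_99 = 2.326;

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation; 0 when fewer than two samples exist.
    static double stddev(double[] xs) {
        if (xs.length < 2) {
            return 0.0;
        }
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Conservative estimate under a normal assumption, floored at zero.
    static double conservative(double[] xs) {
        return Math.max(0.0, mean(xs) - Z_99 * stddev(xs));
    }

    // Assumed model: predicted bytes = predicted rate * predicted time.
    static double predictRelocatableBytes(double[] bytesPerCycle, double[] secondsPerCycle) {
        double[] rates = new double[bytesPerCycle.length];
        for (int i = 0; i < rates.length; i++) {
            rates[i] = bytesPerCycle[i] / secondsPerCycle[i];
        }
        return conservative(rates) * conservative(secondsPerCycle);
    }

    public static void main(String[] args) {
        double[] bytes = {4.0e8, 4.4e8, 3.9e8, 4.2e8};   // bytes relocated in past cycles
        double[] seconds = {0.40, 0.42, 0.39, 0.41};     // relocation time of those cycles
        System.out.println("predicted relocatable bytes: "
                + predictRelocatableBytes(bytes, seconds));
    }
}
```

A collector could then admit pages into the relocation set, most fragmented first, until their live bytes reach the predicted budget, instead of relying on a fixed percentage.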

5. JIT optimization: SVE optimization

5.1 Introduction to the SVE optimization

SVE (Scalable Vector Extension) is the next-generation SIMD instruction set of the ARM AArch64 architecture.

  • The SVE1 instruction set is supported.
  • The JIT automatically detects and adapts to SVE1 or NEON.
  • Registers Z0 to Z31 are supported.
  • All SVE register sizes from 128 to 2048 bits are supported.
  • Predicate registers P0 to P7 are supported.
  • Most auto-vectorization (SuperWord) nodes are supported.

5.2 Results of the SVE optimization

The new Vector API nodes have all been contributed to the upstream community and are not yet incorporated into BiSheng JDK. So far, the SVE work has submitted a total of 11 patches to the upstream community, with over 3,000 lines of code. The following loop serves as the example:

public static float sumReductionImplement(float[] a, float[] b, float[] c, float[] d, float total) {
    for (int i = 0; i < a.length; i++) {
        d[i] = (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
        total += d[i];
    }
    return total;
}

The optimized NEON machine code looks like this:

The optimized SVE machine code is shown below:

6. Hardware-software co-design: hardware acceleration with Kunpeng KAE

  • KAE (Kunpeng Accelerator Engine) is a hardware accelerator provided by Huawei Kunpeng servers. The Kunpeng chip contains an independent I/O die used for encryption and decryption.
  • BiSheng JDK provides KAEProvider to make full use of this hardware. Applications need only simple adaptation, with no code development, to use the hardware capabilities of Kunpeng servers and run more efficiently.
  • The latest version of BiSheng JDK ships four groups of cryptographic algorithms (AES, Digest, HMAC, RSA). In benchmark tests some algorithms are 40% faster, which saves a great deal of running time in security workloads. The work is under joint development with Boland, and support for a second batch of algorithms will be released in Q2.
  • The encryption and decryption scheme is based on JCA (Java Cryptography Architecture), an important part of the Java platform. KAE provides its encryption and decryption services through JCA, under the name KAEProvider in BiSheng JDK. The process is as follows:

  • JCA provides two ways to select a provider: specifying it in code or in a configuration file, as follows:

Method 1: Use the Security API to add the KAE Provider and set its priority.

Method 2: Modify the jre/lib/security/java.security file to add the KAE Provider and set its priority.
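Method 1 can be sketched with the standard java.security.Security API. So that the example runs on any JDK, it registers a placeholder provider: DemoProvider and the name "DemoKAE" are assumptions standing in for the real KAEProvider class, which is only available on BiSheng JDK.

```java
import java.security.Provider;
import java.security.Security;

final class ProviderPrioritySketch {
    // Placeholder provider (an assumption for illustration). On BiSheng JDK
    // the KAEProvider instance would be passed to installFirst instead.
    @SuppressWarnings("deprecation")
    static final class DemoProvider extends Provider {
        private static final long serialVersionUID = 1L;

        DemoProvider() {
            super("DemoKAE", 1.0, "Placeholder provider for illustration only");
        }
    }

    // Position 1 is the highest priority: JCA consults providers in order, so
    // algorithms offered by this provider are chosen ahead of the defaults.
    static int installFirst(Provider provider) {
        return Security.insertProviderAt(provider, 1);
    }

    public static void main(String[] args) {
        installFirst(new DemoProvider());
        for (Provider p : Security.getProviders()) {
            System.out.println(p.getName());
        }
    }
}
```

Method 2 achieves the same result without code: the java.security file lists providers as `security.provider.N=<provider class>` entries, where a smaller N means a higher priority.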

7. What other value does BiSheng JDK add?

  • After evaluation and testing, BiSheng JDK now carries a number of valuable features built on community features.
  • G1 NUMA-Aware: this feature takes full advantage of NUMA hardware and performs better on multi-core platforms. BiSheng JDK also fixes some issues on top of the community version: for example, operating system scheduling can migrate threads between nodes, which can prevent some memory regions from being reclaimed effectively by the NUMA feature. It also enhances NUMA-Aware support for large objects. The improvement is shown below:

  • The AppCDS feature in JDK 10 stores String objects and class metadata in a shared file so that multiple JVM processes can share them, reducing the loading and parsing of class metadata.
  • BiSheng JDK ported this feature, and tests show good results: some big data scenarios improve by nearly 10%.
  • G1 Uncommit: when memory usage is low, a GC is triggered periodically and the reclaimed memory is returned to the operating system. This feature can significantly reduce the memory occupied in cloud scenarios. On top of the community version, BiSheng JDK changes the serial memory release to a concurrent one (the same approach was adopted in the latest JDK 16).

After G1 Uncommit is enabled, the figure below shows memory usage declining steadily:

In real business scenarios the effect is even more pronounced, as shown in the figure below:

  • The parallel task stealing mechanism is optimized. In some applications task stealing accounts for a very high share of the work. Google contributed a valuable design to the community that greatly improves parallel task stealing, and in BiSheng JDK the PS, ParNew, G1 and Shenandoah collectors all benefit from it.

  • We are currently optimizing task stealing for many-core servers and will open-source the work when it matures.
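For readers unfamiliar with task stealing, the JDK's own ForkJoinPool uses the same general idea and makes a convenient analogy. It is not the GC implementation discussed above: each worker keeps a deque of tasks, and an idle worker steals from another worker's deque. SumTask, parallelSum and the threshold of 1,000 elements are hypothetical choices for this sketch.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class WorkStealingSketch {
    // Sums a slice of an array, splitting large slices into subtasks.
    static final class SumTask extends RecursiveTask<Long> {
        private static final long serialVersionUID = 1L;
        private static final int THRESHOLD = 1_000;
        private final int[] data;
        private final int from;
        private final int to;

        SumTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                        // queued on this worker's deque; idle workers may steal it
            long rightSum = right.compute();    // keep working on the other half ourselves
            return left.join() + rightSum;
        }
    }

    static long parallelSum(int[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println("sum = " + parallelSum(data, 4));
    }
}
```

How often workers steal, and how cheaply they find a victim, is exactly the kind of cost that grows on many-core machines.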

8. Future development of BiSheng JDK

8.1 Upcoming features

  • Improved KAE hardware acceleration algorithms, expected to be released in Q2.
  • G1 parallel Full GC and NUMA-aware support in BiSheng JDK 8, planned for Q2.
  • jmap enhancement: parallel dump for CMS.

8.2 Future direction

  • Actively take part in the development and evolution of the SVE and Vector API features in the community. Over 3,000 lines of code have been submitted so far.
  • Memory management optimization, in progress: generational ZGC, Thread Local GC, AOT and other projects.

9. How to get BiSheng JDK and help?

Download JDK 8 and JDK 11: kunpeng.huawei.com/#/developer…

9.1 JDK 8 code repository

Gitee.com/openeuler/b…

9.2 JDK 11 code repository

Gitee.com/openeuler/b…

Conclusion

This article covered the development history of BiSheng JDK, the circumstances that led Huawei to build it, and the low-level optimizations it contains, as well as the further value still worth exploring. As Peng Chenghan, senior technical expert for Huawei compilers, put it: our pursuit is to bring the digital world to every person, every family and every organization, and to build an intelligent world where everything is connected.
