Exploration and practice of new generation garbage collector ZGC

Reprinted from: Exploration and Practice of ZGC – Meituan Technical Team (meituan.com)

ZGC (The Z Garbage Collector) is a low-latency Garbage Collector released in JDK 11. Its design goals include:

  • The pause time does not exceed 10ms;
  • The pause time does not increase with the size of the heap, or with the size of the active objects;
  • Supports 8MB to 4TB heaps (16TB in the future).

Given these design goals, ZGC is well suited to memory management and reclamation for services with large heaps and low-latency requirements. This article describes how ZGC is applied in low-latency scenarios and how well it performs there. It is divided into four parts:

  • GC pain points: introduces the GC pain points encountered in real business, and analyzes the pause time bottlenecks of the CMS and G1 collectors;
  • ZGC principle: analyzes the essential reason why ZGC pause times are shorter than those of G1 or CMS, and the technical principles behind it;
  • ZGC tuning practice: shares our understanding of ZGC tuning and analyzes several practical tuning cases;
  • ZGC upgrade results: shows the results of applying ZGC in a production environment.

The pain of GC

The availability of many low-latency, high-availability Java services is often plagued by GC pauses. A GC pause is the Stop The World (STW) period during garbage collection, when all application threads stop and wait for the pause to end. Take Meituan’s risk control service as an example: some upstream services require it to return results within 65ms, with availability of 99.99%. Because of GC pauses, we failed to meet this availability goal. At that time the service used the CMS garbage collector; a single Young GC took 40ms and occurred 10 times per minute, and the average interface response time was 30ms. By calculation, (40ms + 30ms) * 10 times / 60000ms ≈ 1.17% of requests see their response time increase by 0 to 40ms, of which 30ms * 10 times / 60000ms = 0.5% see it increase by the full 40ms. GC pauses therefore have a large impact on response time. To reduce their impact on availability, we tuned the service to shorten each GC and to lower the GC frequency, and we also tested the G1 garbage collector, but none of these three measures reduced the impact of GC on service availability.
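The arithmetic above can be reproduced with a short sketch. It assumes requests arrive uniformly over time; the class and method names are illustrative and do not come from the original article.

```java
final class PauseImpact {
    /** Fraction of requests delayed at all: those arriving during a pause or already in flight when it starts. */
    static double affectedFraction(double pauseMs, double avgResponseMs, int gcsPerMinute) {
        return (pauseMs + avgResponseMs) * gcsPerMinute / 60_000.0;
    }

    /** Fraction of requests already in flight when a pause starts; these absorb the full pause. */
    static double fullyDelayedFraction(double avgResponseMs, int gcsPerMinute) {
        return avgResponseMs * gcsPerMinute / 60_000.0;
    }

    public static void main(String[] args) {
        // Numbers from the text: 40ms Young GC, 30ms average response time, 10 GCs per minute.
        System.out.printf("affected: %.2f%%%n", 100 * affectedFraction(40, 30, 10));
        System.out.printf("delayed by the full pause: %.2f%%%n", 100 * fullyDelayedFraction(30, 10));
    }
}
```

With an availability target of 99.99%, even a share of roughly one percent of slowed requests is far above the allowed budget, which is why pause time matters so much here.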

CMS and G1 pause time bottleneck

Before introducing ZGC, let us review the GC process of CMS and G1 and their pause time bottlenecks. The Young GC of the CMS young generation, G1, and ZGC are all based on the mark-copy algorithm, but differences in how the algorithm is implemented lead to huge performance differences.

The mark-copy algorithm is used in the CMS young generation (ParNew is the default young generation collector for CMS) and in the G1 garbage collector. It can be divided into three phases:

  • Marking phase: starting from the GC Roots set, mark the live objects;
  • Transfer phase: copy the live objects to new memory addresses;
  • Relocation phase: because the objects have moved, adjust all pointers that refer to an object’s old address so that they point to its new address.

The following takes G1 as an example and walks through its mark-copy process (used by both Young GC and Mixed GC) to analyze the main bottleneck of G1 pause time. The G1 garbage collection cycle is shown in the figure below:

G1’s mixed collection process can be divided into a marking phase, a cleanup phase, and a copy phase.

Marking phase pause analysis

  • Initial mark: marks all direct children of the GC Roots. This phase is STW. Because the number of GC Roots is small, it usually takes very little time.
  • Concurrent mark: performs reachability analysis on heap objects starting from the GC Roots to find the live objects. This phase is concurrent, meaning that application threads and GC threads can be active at the same time. Concurrent marking takes much longer, but because it is not STW, its duration is not a major concern.
  • Re-mark: re-marks objects that changed during the concurrent mark phase. This phase is STW.

Analysis of cleanup phase pauses

  • The cleanup phase counts the partitions that contain live objects and those that do not. It does not clean up garbage objects or copy live objects. This phase is STW.

Copy phase pause analysis

  • The transfer phase of the copying algorithm has to allocate new memory and copy the objects’ member variables. The transfer phase is STW. Memory allocation usually takes very little time, but copying member variables can take a long time, because the copy time is proportional to the number of live objects and their complexity: the more complex the object, the longer the copy takes.

Of the four STW steps, initial mark is short because it only marks the GC Roots; re-mark is short because few objects are involved; and cleanup is short because the number of memory partitions is small. The transfer phase, which has to process all live objects, takes a long time. The bottleneck for G1 pause time is therefore mainly the STW transfer phase of mark-copy. Why can the transfer phase not run concurrently, as the marking phase does? The main reason is that G1 cannot accurately locate an object’s address while objects are being transferred.

In G1’s Young GC and CMS’s Young GC, the entire mark-copy process is STW, so it is not elaborated here.

ZGC principle

Fully concurrent ZGC

Like ParNew in CMS and like G1, ZGC uses the mark-copy algorithm, but with a major improvement: ZGC is almost entirely concurrent in the marking, transfer, and relocation phases. This is the most important reason ZGC can achieve its goal of pause times under 10ms.

The ZGC garbage collection cycle is shown below:

ZGC has only three STW phases: initial mark, re-mark, and initial transfer. Initial mark and initial transfer each only need to scan all GC Roots, so their processing time is proportional to the number of GC Roots and is generally very short. The STW time of the re-mark phase is very short, 1ms at most; if it exceeds 1ms, ZGC enters the concurrent marking phase again. In other words, almost all ZGC pauses depend only on the size of the GC Roots set, and pause time does not grow with the heap size or the size of the live object set. By contrast, G1’s transfer phase is entirely STW, and its pause time grows with the size of the live object set.

ZGC key technology

ZGC uses colored pointers and read barriers to solve the problem of accessing objects accurately while they are being transferred, which makes concurrent transfer possible. The general principle is as follows. “Concurrent” in concurrent transfer means that application threads keep accessing objects while GC threads are moving them. If an object has been moved but its address has not been updated in time, an application thread may access the old address and cause an error. In ZGC, an application thread that accesses an object triggers a “read barrier”. If the object has been moved, the read barrier updates the pointer that was read to the object’s new address, so the application thread always accesses the new address. So how does the JVM know that an object has been moved? It uses the address held in the object reference, that is, the colored pointer. The technical details of colored pointers and read barriers are described below.

Colored pointers

A colored pointer is a technique for storing information in the pointer itself.

ZGC supports only 64-bit systems and divides the 64-bit virtual address space into multiple subspaces, as shown in the following figure:

[0, 4TB) corresponds to the Java heap, [4TB, 8TB) is called the M0 address space, [8TB, 12TB) is called the M1 address space, [12TB, 16TB) is reserved, and [16TB, 20TB) is called the Remapped space.

When an application creates an object, it first requests a virtual address in the heap space, but that virtual address is not mapped to a real physical address. ZGC also requests a virtual address for the object in each of the M0, M1, and Remapped address spaces. These three virtual addresses map to the same physical address, but only one of the three spaces is valid at any given time. ZGC sets up three virtual address spaces because it trades space for time to reduce GC pause times. The “space” being traded is virtual address space, not real physical memory. The process of switching between these three spaces is described in detail later.

Corresponding to the address space partition above, ZGC actually uses only bits 0 to 41 of the 64-bit address space, while bits 42 to 45 store metadata and bits 47 to 63 are fixed at 0.

ZGC stores object liveness information in bits 42 to 45, unlike traditional garbage collectors, which place it in the object header.
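The bit layout can be illustrated with a small sketch that packs an address and a view into one 64-bit value. The assignment of one bit to each view is an assumption derived from the address ranges above (4TB = 2^42, 8TB = 2^43, 16TB = 2^44); bit 45 is not modeled, and the class and method names are illustrative only.

```java
final class ColoredPointer {
    // From the text: bits 0 to 41 hold the object address.
    static final long ADDRESS_MASK = (1L << 42) - 1;
    // Assumed bit per view, chosen to match the address ranges given in the text.
    static final long M0 = 1L << 42;        // [4TB, 8TB)
    static final long M1 = 1L << 43;        // [8TB, 12TB)
    static final long REMAPPED = 1L << 44;  // [16TB, 20TB)
    static final long COLOR_MASK = M0 | M1 | REMAPPED;

    static long address(long pointer) { return pointer & ADDRESS_MASK; }

    static long color(long pointer) { return pointer & COLOR_MASK; }

    /** Keeps the address bits and replaces the view bits. */
    static long recolor(long pointer, long newColor) { return address(pointer) | newColor; }

    static String viewName(long pointer) {
        long c = color(pointer);
        if (c == M0) return "M0";
        if (c == M1) return "M1";
        if (c == REMAPPED) return "Remapped";
        return "unknown";
    }

    public static void main(String[] args) {
        long p = recolor(0x1234L, REMAPPED);
        System.out.printf("0x%x is in view %s%n", p, viewName(p));
        long q = recolor(p, M0);
        System.out.printf("0x%x is in view %s, same address 0x%x%n", q, viewName(q), address(q));
    }
}
```

Because the view is part of the pointer value, changing an object’s view never touches the object itself, only the references to it.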

Read barrier

Read barriers are a technique by which the JVM inserts a small piece of code into the application code. This code is executed when the application thread reads the object reference from the heap. Note that this code is triggered only by “reading the object reference from the heap”.

Examples of read barriers:

Object o = obj.FieldA   // A reference is read from the heap, so a barrier is needed
<Load barrier>
Object p = o            // No barrier needed: the reference is not read from the heap
o.dosomething()         // No barrier needed: the reference is not read from the heap
int i = obj.FieldB      // No barrier needed: this is not an object reference

The job of the read barrier code in ZGC: during object marking and transfer, it checks whether the object’s reference address meets certain conditions and takes the corresponding action.

ZGC concurrent processing demonstration

The following details the switching process of the address view in a ZGC garbage collection cycle:

  • Initialization: after ZGC is initialized, the address view of the entire memory space is set to Remapped. The program runs normally and allocates objects in memory. When certain conditions are met, garbage collection starts and the marking phase begins.
  • Concurrent marking phase: the first time the marking phase is entered, the view is M0. If an object is accessed by a GC marking thread or an application thread, its address view is switched from Remapped to M0. After the marking phase, an object’s address is therefore in either the M0 view or the Remapped view. If it is in the M0 view, the object is live; if it is in the Remapped view, the object is not live.
  • Concurrent transfer phase: after marking finishes, the transfer phase begins and the address view is set to Remapped again. If an object is accessed by a GC transfer thread or an application thread, its address view is switched from M0 to Remapped.

In fact, the marking phase has two address views, M0 and M1, and the procedure above shows only one of them. Two are needed to distinguish the previous marking from the current one: in the second concurrent marking phase, the address view is switched to M1 rather than M0.

Colored pointers and read barriers are used not only in the concurrent transfer phase but also in the concurrent marking phase. To mark an object, a traditional garbage collector needs a memory access to store the liveness information in the object header. In ZGC, marking only requires setting bits 42 to 45 of the pointer address, and because this is a register access, it is faster than a memory access.
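The view switching described above can be simulated with a toy read barrier. This is a heavily simplified model and not the HotSpot implementation: the `marked` set, the `forwarding` map, and all names are hypothetical, and the bit assigned to each view is the same assumption as in the address ranges given earlier.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LoadBarrierSketch {
    static final long ADDRESS_MASK = (1L << 42) - 1;
    static final long M0 = 1L << 42, M1 = 1L << 43, REMAPPED = 1L << 44;
    static final long COLOR_MASK = M0 | M1 | REMAPPED;

    long goodColor = REMAPPED;                           // the currently valid address view
    final Set<Long> marked = new HashSet<>();            // addresses found live in this cycle
    final Map<Long, Long> forwarding = new HashMap<>();  // old address -> new address

    /** Runs on every reference read from the heap; returns a reference in the current view. */
    long loadBarrier(long ref) {
        if ((ref & COLOR_MASK) == goodColor) {
            return ref;                                  // fast path: already in the valid view
        }
        long address = ref & ADDRESS_MASK;
        if (goodColor == REMAPPED) {
            // Transfer phase: follow the forwarding entry if the object has been moved.
            address = forwarding.getOrDefault(address, address);
        } else {
            // Marking phase: an accessed object is live.
            marked.add(address);
        }
        return address | goodColor;                      // the caller stores this back into the field
    }

    public static void main(String[] args) {
        LoadBarrierSketch gc = new LoadBarrierSketch();
        long field = 0x1000L | REMAPPED;                 // object allocated before the cycle

        gc.goodColor = M0;                               // marking starts, view becomes M0
        field = gc.loadBarrier(field);
        System.out.printf("after marking:  0x%x%n", field);

        gc.forwarding.put(0x1000L, 0x2000L);             // the GC moves the object
        gc.goodColor = REMAPPED;                         // transfer starts, view becomes Remapped
        field = gc.loadBarrier(field);
        System.out.printf("after transfer: 0x%x%n", field);
    }
}
```

Only references whose view differs from the current one take the slow path, so once a field has been corrected, later reads of it cost a single comparison.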

ZGC tuning practices

ZGC is not a “silver bullet”; it has to be tuned to the specific characteristics of each service. Little practical experience could be found online, so we had to work out the tuning theory ourselves. We spent a lot of time on this stage before finally reaching the desired performance. One purpose of this article is to list some common problems with ZGC, to help you use it to improve service availability.

Tuning basics

Understand important ZGC configuration parameters

The following uses the ZGC parameter configuration of our production service as an example to explain what each parameter does:

Example of important parameter settings:

-Xms10G -Xmx10G 
-XX:ReservedCodeCacheSize=256m -XX:InitialCodeCacheSize=256m 
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC 
-XX:ConcGCThreads=2 -XX:ParallelGCThreads=6 
-XX:ZCollectionInterval=120 -XX:ZAllocationSpikeTolerance=5 
-XX:+UnlockDiagnosticVMOptions -XX:-ZProactive 
-Xlog:safepoint,classhisto*=trace,age*,gc*=info:file=/opt/logs/logs/gc-%t.log:time,tid,tags:filecount=5,filesize=50m 

  • -Xms -Xmx: the minimum and maximum heap size. Both are set to 10G here, so the heap stays at 10G.
  • -XX:ReservedCodeCacheSize -XX:InitialCodeCacheSize: the size of the CodeCache, where JIT-compiled code is stored. 64 MB or 128 MB is usually sufficient. Our service has particular characteristics, so the value is larger, as described in detail later.
  • -XX:+UnlockExperimentalVMOptions -XX:+UseZGC: enables ZGC.
  • -XX:ConcGCThreads: the number of threads that collect garbage concurrently. The default is 12.5% of the total number of cores, which is 1 on an 8-core CPU. Increasing it makes GC faster, but the threads take CPU resources from the running program, so throughput is affected.
  • -XX:ParallelGCThreads: the number of threads used in the STW phases. The default is 60% of the total number of cores.
  • -XX:ZCollectionInterval: the time interval for timer-triggered ZGC cycles, in seconds.
  • -XX:ZAllocationSpikeTolerance: the correction coefficient of ZGC’s adaptive trigger algorithm. The default is 2; the larger the value, the earlier ZGC is triggered.
  • -XX:+UnlockDiagnosticVMOptions -XX:-ZProactive: whether proactive collection is enabled. It is on by default; this configuration turns it off.
  • -Xlog: sets the content, format, location, and size of the GC log files.

Understand when the ZGC triggers

The GC triggering mechanism in ZGC differs considerably from that of CMS and G1. The core feature of ZGC is concurrency: new objects keep being created while a GC is running. The first major goal of ZGC parameter tuning is therefore to ensure that newly created objects do not fill the heap before the GC completes, because in ZGC, when garbage is not collected in time and the heap fills up, running threads can pause for seconds.

ZGC has multiple GC triggering mechanisms, summarized as follows:

  • Blocked memory allocation request: when garbage is not collected in time, it fills the heap and some threads block. This kind of trigger should be avoided. The keyword in the log is “Allocation Stall”.
  • Adaptive algorithm based on allocation rate: the most important GC trigger. Its principle can be summarized as follows: ZGC uses the recent object allocation rate and GC time to calculate the memory usage threshold at which the next GC should be triggered. For the detailed theory, see the book “Design and Implementation of New Generation Garbage Collector ZGC” by Peng Chenghan. The ZAllocationSpikeTolerance parameter controls the threshold; its default value is 2, and the larger the value, the earlier GC is triggered. We solved some problems by adjusting this parameter. The keyword in the log is “Allocation Rate”.
  • Fixed time interval: controlled by ZCollectionInterval, and suited to sudden traffic surges. When traffic changes smoothly, the adaptive algorithm may not trigger GC until heap utilization exceeds 95%. When traffic surges, the adaptive algorithm may trigger too late, causing some threads to block. Adjusting this parameter addresses traffic surge scenarios such as scheduled promotions and flash sales (seckill). The keyword in the log is “Timer”.
  • Proactive trigger: similar to the fixed interval rule, except that the interval is not fixed but calculated by ZGC itself. Because our service already uses the fixed interval trigger, we turn this feature off with -XX:-ZProactive to keep overly frequent GC from affecting service availability. The keyword in the log is “Proactive”.
  • Warm-up rule: occurs when the service has just started and generally needs no attention. The keyword in the log is “Warmup”.
  • External trigger: System.gc() is called explicitly in the code. The keyword in the log is “System.gc()”.
  • Metadata allocation trigger: occurs when the metadata area is insufficient. The keyword in the log is “Metadata GC Threshold”.
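The idea behind the allocation-rate trigger can be sketched as follows. This is a simplified model under stated assumptions, not ZGC’s actual rule: it only multiplies the recent average allocation rate by the spike tolerance and compares the estimated time until the heap is full with the expected GC duration. All names are hypothetical.

```java
final class AllocationRateRule {
    /**
     * @param freeBytes           heap memory that is still free
     * @param avgAllocBytesPerSec recent average allocation rate
     * @param spikeTolerance      plays the role of ZAllocationSpikeTolerance (default 2 per the text)
     * @param avgGcSeconds        recent average duration of one GC cycle
     * @return true if a GC should start now so that it can finish before the heap fills up
     */
    static boolean shouldStartGc(double freeBytes, double avgAllocBytesPerSec,
                                 double spikeTolerance, double avgGcSeconds) {
        double pessimisticRate = avgAllocBytesPerSec * spikeTolerance;
        double secondsUntilFull = freeBytes / (pessimisticRate + 1.0); // +1 avoids division by zero
        return secondsUntilFull <= avgGcSeconds;
    }

    public static void main(String[] args) {
        double free = 4.0 * (1L << 30);    // 4 GB free
        double rate = 500.0 * (1L << 20);  // 500 MB/s allocated on average
        double gcSeconds = 3.0;            // one GC cycle takes about 3 seconds
        System.out.println("tolerance 2: " + shouldStartGc(free, rate, 2, gcSeconds));
        System.out.println("tolerance 5: " + shouldStartGc(free, rate, 5, gcSeconds));
    }
}
```

With the same heap state, the larger tolerance assumes a higher future allocation rate and therefore starts the GC earlier, which matches the behavior described for ZAllocationSpikeTolerance.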

Understand ZGC logs

The figure shows a complete GC process, with the points that need attention marked.

Note: safepoint entries have been filtered out of this log. Normally, a GC is interspersed with safepoint entries.

Each line in the GC log contains information about the GC process. The key information is as follows:

  • Start: the GC starts; the line also gives the reason the GC was triggered. In the figure above, the trigger is the adaptive algorithm.
  • Phase-Pause Mark Start: the initial mark, which is STW.
  • Phase-Pause Mark End: the re-mark, which is STW.
  • Phase-Pause Relocate Start: the initial transfer, which is STW.
  • Heap information: records how the heap size changes before and after Mark and Relocate during the GC. High and Low record the maximum and minimum values; we generally watch the Used value under High. If it reaches 100%, memory must have run short during the GC, and the GC trigger timing needs to be adjusted so that GC runs earlier or faster.
  • GC statistics: garbage collection statistics can be printed periodically, covering the last 10 seconds, 10 minutes, and 10 hours, as well as the whole time since startup. These statistics help locate anomalies.

The log contains a lot of content. The key points are marked with red lines and are fairly easy to understand; more detailed explanations can be found online.

Understand the reason for the ZGC pause

In practice, we found six situations that cause the program to pause:

  • Initial mark during GC: “Pause Mark Start” in the log.
  • Re-mark during GC: “Pause Mark End” in the log.
  • Initial transfer during GC: “Pause Relocate Start” in the log.
  • Memory allocation stall: when memory is insufficient, threads block and wait for the GC to complete. The keyword is “Allocation Stall”.
  • Safepoint: GC can proceed only after all threads have entered the safepoint, and ZGC periodically enters a safepoint to decide whether a GC is needed. Threads that reach the safepoint first must wait for the threads that reach it later, until all threads are suspended.
  • Dumping threads or memory: for example, the jstack and jmap commands.
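A quick way to see which of these pause causes dominate is to count the log keywords listed above. The sketch below does only that; the sample lines in `main` are made up for illustration and do not reproduce the exact ZGC log format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PauseKeywordCounter {
    // Keywords taken from the list of pause causes in the text.
    static final List<String> KEYWORDS = List.of(
            "Pause Mark Start", "Pause Mark End", "Pause Relocate Start", "Allocation Stall");

    /** Counts how many log lines contain each keyword. */
    static Map<String, Integer> count(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            counts.put(keyword, 0);
        }
        for (String line : logLines) {
            for (String keyword : KEYWORDS) {
                if (line.contains(keyword)) {
                    counts.merge(keyword, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines; real logs carry timestamps, tags, and durations.
        List<String> sample = List.of(
                "GC(7) Pause Mark Start 0.4ms",
                "GC(7) Pause Mark End 0.3ms",
                "GC(7) Pause Relocate Start 0.5ms",
                "Allocation Stall (worker-1) 120ms",
                "Allocation Stall (worker-2) 95ms");
        count(sample).forEach((keyword, n) -> System.out.println(keyword + ": " + n));
    }
}
```

In practice the input would come from the GC log file configured with -Xlog; a non-zero “Allocation Stall” count is the signal that needs attention first.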

Tuning case

The service we maintain is called Zeus. It is Meituan’s rules platform and is often used for rule management in risk control scenarios. Rule execution is based on Aviator, an open source expression engine. Internally, Aviator converts each expression into a Java class and implements the expression logic by calling that class’s interface.

Zeus serves more than 10,000 rules, and each machine handles several million requests per day. Under these conditions, the classes and methods that Aviator generates create a large number of ClassLoaders and a large amount of CodeCache, which became a performance bottleneck for GC when using ZGC. The tuning cases fall into two types.

Memory allocation stalls, with system pauses reaching the level of seconds

Case 1: Latency spikes caused by a sudden traffic increase during a flash-sale (seckill) activity

Log information: comparing the GC log and the service log at the time of the latency spike shows that the JVM paused for a long time and that the GC log contains many “Allocation Stall” entries.

Analysis: this kind of case occurs when the adaptive algorithm is the main GC triggering mechanism. ZGC is a concurrent garbage collector: GC threads and application threads are active at the same time, so new objects are created while a GC is running. If the newly created objects fill the heap before the GC completes, application threads may block because memory allocation fails. When the flash sale starts, a large number of requests enter the system, but the GC trigger interval calculated by the adaptive algorithm is still long, so the GC is triggered too late, which leads to allocation stalls and pauses.

Solutions:

(1) Enable the GC triggering mechanism based on a fixed time interval: -XX:ZCollectionInterval. For example, set it to 5 seconds or even less.
(2) Increase the correction coefficient -XX:ZAllocationSpikeTolerance so that GC is triggered earlier. ZGC predicts the memory allocation rate with a normal distribution model. The model’s correction coefficient, ZAllocationSpikeTolerance, defaults to 2; the larger the value, the earlier GC is triggered. All Zeus clusters set it to 5.

Case 2: Latency spikes when traffic gradually rises to a certain level during stress testing

Log information: GC occurs about once per second on average, with almost no gap between consecutive GCs.

Analysis: GC is triggered in time, but memory marking and reclamation are too slow, which leads to allocation stalls and pauses.

Solution: increase -XX:ConcGCThreads to speed up concurrent marking and reclamation. ConcGCThreads defaults to 1/8 of the number of cores, which is 1 on an 8-core machine. This parameter affects system throughput; if the interval between GCs is longer than the GC cycle, changing it is not recommended.
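To judge how far a setting departs from the defaults, the percentages stated in this article (12.5% of cores for ConcGCThreads, 60% for ParallelGCThreads) can be turned into a small calculation. Rounding up and the floor of 1 are assumptions made for this sketch; the JVM’s exact rounding is not stated in the text.

```java
final class ZgcThreadDefaults {
    /** 12.5% of the cores, per the text; rounding up with a minimum of 1 is an assumption. */
    static int defaultConcGcThreads(int cores) {
        return Math.max(1, (int) Math.ceil(cores * 0.125));
    }

    /** 60% of the cores, per the text; rounding up with a minimum of 1 is an assumption. */
    static int defaultParallelGcThreads(int cores) {
        return Math.max(1, (int) Math.ceil(cores * 0.60));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores: " + cores);
        System.out.println("estimated default ConcGCThreads: " + defaultConcGcThreads(cores));
        System.out.println("estimated default ParallelGCThreads: " + defaultParallelGcThreads(cores));
    }
}
```

On an 8-core machine this gives 1 concurrent thread, so the ConcGCThreads=2 used in the production configuration shown earlier doubles the concurrent GC capacity at the cost of application CPU time.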

A large number of GC Roots, with long single GC pauses

Case 3: A single GC pause takes 30ms, far from the expected pause time of about 10ms

Log information: the ZGC log statistics show that “Pause Roots ClassLoaderDataGraph” takes a long time.

Analysis: a memory dump showed tens of thousands of ClassLoader instances in the system. ClassLoaders are part of the GC Roots, and ZGC pause time is proportional to the GC Roots: the more GC Roots, the longer the pause. The class names of the ClassLoaders showed that they were all created by the Aviator component. Reading the Aviator source code, we found that Aviator created a new ClassLoader every time it generated a class for an expression, which led to the huge number of ClassLoaders. Newer Aviator versions fix this problem by creating a single ClassLoader that generates the classes for all expressions.

Solution: upgrade the Aviator component to avoid creating redundant ClassLoaders.

Case 4: After the service starts, single GC pauses grow longer the longer it runs

Log information: the ZGC log statistics show that the “Pause Roots CodeCache” time grows with the service’s running time.

Analysis: the CodeCache stores the JIT compilation results of hot Java code, and it is also part of the GC Roots. Adding the -XX:+PrintCodeCacheOnCompilation parameter and printing the optimized methods in the CodeCache revealed a large amount of code from Aviator expressions. The root cause is that each expression is a method in a class. As running time and the number of executions increase, these methods are JIT-compiled into the CodeCache, so the CodeCache keeps growing.
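CodeCache growth of this kind can also be watched from inside the process through the standard `java.lang.management` API. The sketch below sums the used bytes of the code cache memory pools; matching pools by name is an assumption about how HotSpot names them, and the class name is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class CodeCacheUsage {
    /** Sums the used bytes of all memory pools whose name looks like a code cache pool. */
    static long usedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumed HotSpot pool names: "CodeCache", "Code Cache", or "CodeHeap '...'".
            boolean isCodePool = name.contains("CodeCache")
                    || name.contains("Code Cache")
                    || name.contains("CodeHeap");
            MemoryUsage usage = pool.getUsage();
            if (isCodePool && usage != null) {
                used += usage.getUsed();
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.printf("CodeCache used: %.1f MB%n", usedBytes() / (1024.0 * 1024.0));
    }
}
```

Sampling this value periodically and plotting it against uptime would show the same steady growth that the “Pause Roots CodeCache” statistic revealed in the GC log.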

Solution: the JIT has some parameters that adjust compilation conditions, but none of them suited our problem. We finally solved it through business optimization: we removed the Aviator expressions that did not need to be executed, which kept a large number of Aviator methods out of the CodeCache.

It is worth noting that we did not wait for all of these issues to be resolved before deploying to all clusters. Even with the various early latency spikes, our calculations showed that ZGC with these problems still affected service availability less than the previous CMS did. So it took only about two weeks from starting preparations to full deployment. Over the following three months, we followed up on these issues alongside our business work and solved them one by one, which improved ZGC performance on every cluster.

Results of upgrading to ZGC

Reduced latency

TP (Top Percentile) is a metric of system latency. TP999 is the minimum time within which 99.9% of requests are answered, and TP99 is the minimum time within which 99% of requests are answered.
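As a concrete reading of this definition, the sketch below computes TP values from a set of latency samples with the nearest-rank method. The method choice and all names are assumptions for illustration; monitoring systems may interpolate differently.

```java
import java.util.Arrays;

final class TopPercentile {
    /**
     * Smallest sample such that at least perThousand/1000 of all samples are less than or equal to it.
     * Use 990 for TP99 and 999 for TP999.
     */
    static long tp(long[] latenciesMs, int perThousand) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of n * perThousand / 1000 avoids floating point rounding issues.
        int rank = (int) (((long) sorted.length * perThousand + 999) / 1000);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 1000 requests: 997 answered in 30ms, 3 delayed to 70ms by a pause.
        long[] latencies = new long[1000];
        Arrays.fill(latencies, 30);
        latencies[10] = 70;
        latencies[500] = 70;
        latencies[900] = 70;
        System.out.println("TP99  = " + tp(latencies, 990) + "ms");
        System.out.println("TP999 = " + tp(latencies, 999) + "ms");
    }
}
```

In this example a handful of paused requests leaves TP99 untouched but more than doubles TP999, which is why GC pauses show up most clearly in the high percentiles.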

Across the different clusters of the Zeus service, ZGC brings the largest benefit in low-latency scenarios (TP999 < 200ms):

  • TP999: decreased by 12 to 142ms, or 18% to 74%.
  • TP99: decreased by 5 to 28ms, or 10% to 47%.

Services with ultra-low latency (TP999 < 20ms) or high latency (TP999 > 200ms) benefit little, because their response time bottleneck is not GC but the performance of external dependencies.

Reduced throughput

ZGC may not be suitable for throughput-first scenarios. For example, one offline Zeus cluster that previously used CMS saw a significant drop in system throughput after upgrading to ZGC. There are two reasons. First, ZGC is a single-generation garbage collector, whereas CMS is generational; a single-generation collector processes more objects in each collection and consumes more CPU. Second, ZGC uses read barriers, which require additional computation.

Conclusion

As a next-generation garbage collector, ZGC performs very well. Its garbage collection process is almost entirely concurrent, and actual STW pauses are extremely short, under 10ms. This is thanks to its use of colored pointers and read barriers.

Zeus upgraded to JDK 11 with ZGC by sorting out the risks and problems and then tackling them one by one. Since the upgrade, GC pauses have had almost no impact on system availability.

Finally, we recommend upgrading to ZGC. Zeus ran into a number of problems because of its business characteristics, while the upgrades of other risk control teams went very smoothly. You are welcome to join the “ZGC Usage Communication” group.

References

  • ZGC website
  • Peng Chenghan. Design and Implementation of New Generation Garbage Collector ZGC. China Machine Press, 2019.
  • Let’s talk about GC optimization for Java applications from a practical example
  • Some key technologies for the Java Hotspot G1 GC

Appendix

How to use a new technology

When upgrading to JDK 11 and adopting ZGC in a production environment, the biggest concern is probably not how effective it will be, but whether the new version is reliable and stable, given how few people use it and how little online practice exists. The second concern is whether the upgrade will be costly and whether the time will be wasted if it does not succeed. Therefore, the first thing to do before adopting a new technology is to evaluate its benefits, costs, and risks.

Evaluate benefits

For software that receives worldwide attention, such as the JDK, the new technology introduced by a major release has usually already been proven in theory. What we need to do is determine whether the current system bottleneck is a problem that the new JDK version can solve, and not act before diagnosing the problem. After evaluating the benefits, evaluate the costs and risks. If the benefits are very large or very small, the other two items carry much less weight.

Using the example from the beginning of this article, assume that the number of GCs stays the same (10 per minute) and that the single GC time drops from 40ms to 10ms. By calculation, GC runs for 100/60000 = 0.17% of each minute, and every affected request pauses for only 10ms, so both the number of requests affected by GC and the latency added by GC decrease.

Evaluate cost

This mainly refers to the labor cost of the upgrade. The approach here is fairly mature: determine the required changes from the new technology’s documentation. It is not very different from other projects, so it is not described in detail.

In our practice, it took two weeks to complete online deployment and reach safe, stable operation. Follow-up iteration continued for 3 months, during which ZGC was tuned and adapted to our business scenarios.

Evaluate risk

The risks of upgrading the JDK fall into three categories:

  • Compatibility risk: a Java program depends on many JAR packages, and the question is whether the program still runs after the JDK upgrade. For example, our service was upgraded from JDK 7 to JDK 11, and we had to resolve many JAR incompatibilities.
  • Functional risk: whether, once the program runs, changes in component logic affect the behavior of existing features.
  • Performance risk: even if the features work, whether performance is stable and the service runs stably online.

Once classified, each type of risk turns into an ordinary testing problem and is no longer an unknown risk. Risk means uncertainty; if something uncertain can be turned into something certain, the risk has been eliminated.

Upgrading to JDK 11

We chose JDK 11 because it is the first JDK version to support ZGC, and because it is a Long Term Support (LTS) release that will be maintained for at least three years. Regular releases (such as JDK 12, JDK 13, and JDK 14) have only a six-month maintenance cycle and are not recommended.

Local test environment installation

JDK 11 can be downloaded from two sources, OpenJDK and Oracle JDK. The main difference between them is whether long-term use is free or paid; in the short term, both are free. Note that ZGC in JDK 11 does not support macOS, so on macOS JDK 11 can only be used with other garbage collectors, such as G1.

Installation in production environment

Upgrading to JDK 11 involves more than the JDK version of the project itself. It also requires support from the tools used for compilation, release, deployment, running, monitoring, and performance and memory analysis. Meituan’s internal practice:

Compilation and packaging: Meituan’s release system supports choosing JDK 11 for compilation and packaging.

Online running and full deployment: the online machines need JDK 11 installed. There are three ways to do this:

1. Apply for new VMs that have JDK 11 installed by default: this is a good way to try out JDK 11. For full deployment, applying for too many new machines may run into insufficient machine resources.
2. Install JDK 11 on existing VMs with hand-written scripts: not recommended, because it involves business engineers too deeply in operations work.
3. Use the image deployment feature provided by containers and install JDK 11 when building the image: recommended, because no new resources need to be requested.

Monitoring metrics: mainly GC time and frequency. We collect ZGC data through Meituan’s CAT monitoring system (CAT is open source).

Performance and memory analysis: profiling tools may be needed when performance problems occur online. Scalpel, Meituan’s performance diagnostics and optimization platform, supports performance and memory analysis on JDK 11. If your company has no such tool, JProfiler is recommended.
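Where no monitoring system is available yet, GC count and accumulated GC time can be read directly from the JVM through the standard `GarbageCollectorMXBean` API. The sketch below is a minimal, generic example and is not how CAT collects its data; what exactly the reported time covers depends on the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    /** Total number of collections across all collectors; undefined counts (-1) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Accumulated collection time in milliseconds; undefined times (-1) are skipped. */
    static long totalCollectionTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
        System.out.println("total time (ms): " + totalCollectionTimeMs());
    }
}
```

Reporting the difference between two samples taken a minute apart yields the per-minute GC frequency and time that the monitoring described above focuses on.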

Resolving Component Compatibility

Our project has more than 200,000 lines of code and numerous dependencies, and it had to be upgraded from JDK 7 to JDK 11. The upgrade looked as though it would be complicated, but the compatibility issues were resolved in just two days. The process was as follows:

1. Modify the build configuration in the POM file and fix the compilation errors. There are two main kinds of changes:

A. Some classes have been removed, for example “sun.misc.BASE64Encoder”. Replace them with another class, such as java.util.Base64.

B. A component’s dependency version is incompatible with JDK 11. Find the dependency and look for its latest version, which generally supports JDK 11.

2. After compilation succeeds, start and run the program. At this point, components may still have dependency version problems.
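For the removed-class case in step 1.A, the replacement with `java.util.Base64` might look like the sketch below. The wrapper class and method names are illustrative only; the `Base64` calls are the standard JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Migration {
    /** Replaces code that previously went through sun.misc.BASE64Encoder. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Replaces code that previously went through sun.misc.BASE64Decoder. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("ZGC");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

If existing consumers expect line-wrapped output, `Base64.getMimeEncoder()` is the wrapping variant of the same API, so it is worth comparing the output of old and new code on long inputs before switching.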

Dependencies modified by the upgrade:

<dependency>
    <groupId>javax.annotation</groupId>
    <artifactId>javax.annotation-api</artifactId>
    <version>1.3.2</version>
</dependency>
<dependency>
    <groupId>javax.validation</groupId>
    <artifactId>validation-api</artifactId>
    <version>2.0.1.Final</version>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.4</version>
</dependency>
<dependency>
    <groupId>org.hibernate.validator</groupId>
    <artifactId>hibernate-validator-parent</artifactId>
    <version>6.0.16.Final</version>
</dependency>
<dependency>
    <groupId>com.sankuai.inf</groupId>
    <artifactId>patriot-sdk</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.6</version>
</dependency>
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>4.1.39.Final</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>

JDK 11 has been out for two years, and common dependencies all have compatible versions available. However, components provided internally at company level may be incompatible with JDK 11 and need to be upgraded. If that upgrade is difficult, consider splitting the functionality: deploy the features that rely on these components separately and keep running them on the older JDK version. As JDK 11’s performance becomes better known, more teams will use it to solve GC problems, and the more users there are, the stronger the incentive to upgrade each component.

Verifying functional correctness

Ensure functional correctness through thorough unit tests, integration tests, and regression tests.

About the authors

  • Wang Dong, senior engineer, Meituan Information Security.
  • Wang Wei, technical expert, Meituan Information Security.
