1. Background

After more than ten services had passed verification in the test environment, we were ready to run drills in the pre-release environment. One of the services kept getting stuck during startup.

2. Preliminary analysis

This service is a bit special. It plays a supporting role in our overall architecture and bundles scheduled tasks, workflows, OSS storage, and other capabilities. At first I suspected that the heap was running out of memory and the JVM was garbage collecting constantly. After all, the service includes many resource-intensive capabilities, such as scheduled tasks and workflows, so it seemed plausible that GC was running all the way through startup.

Based on this, I went into the container and looked at the GC log. Sure enough, the JVM was collecting constantly, and a single Young GC took as long as 14 s.
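Besides reading the GC log by hand, the same symptom can be checked from inside the JVM. The following is a minimal sketch, not what was done in this incident: it uses the standard java.lang.management beans to estimate how much of the JVM's uptime has gone to garbage collection. The class name GcTimeProbe and its methods are hypothetical, and the reported collection time is only an approximate indicator.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: rough view of how much time the JVM has spent in GC. */
final class GcTimeProbe {

    // Sums the accumulated collection time (ms) over all collectors.
    // getCollectionTime() returns -1 when undefined, which is counted as 0 here.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    // Approximate fraction of JVM uptime spent collecting so far.
    static double gcTimeFraction() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMs;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("GC share of uptime: " + gcTimeFraction());
    }
}
```

A share close to 1 during startup would point in the same direction as the GC log did here.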

Then I began to wonder: this service runs on a 2C2G instance in the pre-release environment, and even if memory were tight, GC should not take that long. So I first checked the container configuration, which was indeed 2C2G, and then the startup script. To my surprise, the heap settings in the startup script had been changed to -Xmx5120m -Xms5120m, whereas our default is -Xmx1024m -Xms1024m. It looks like someone copied the production configuration into the pre-release environment and forgot to adjust it.

3. In-depth analysis

I had assumed startup was stuck because there was not enough heap memory, but in reality the configured heap was much larger than physical memory. After the configuration was corrected, the service started normally, and I began to reason about how the JVM behaves in this situation.

3.1 Analyzing memory

In the configuration, -XX:NewRatio=1 means the young generation and the old generation each take half of the heap. With a 5120 MB heap, the young generation can therefore be as large as 2.5 GB, while the machine has only 2 GB of physical memory. At this point I had a good idea of the cause. To verify my guess, I checked the machine's memory usage and found, to my surprise, that swap was completely full.
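The arithmetic behind this can be written out as a small sketch. It assumes the usual meaning of -XX:NewRatio as the ratio of old generation to young generation, and it ignores the alignment and rounding a real JVM applies, so the numbers are estimates. The class and method names are hypothetical.

```java
/** Hypothetical helper: estimates generation sizes from -Xmx and -XX:NewRatio. */
final class YoungGenSizing {

    // NewRatio is old/young, so young = heap / (NewRatio + 1).
    static long youngGenMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    // The old generation takes the rest of the heap.
    static long oldGenMb(long heapMb, int newRatio) {
        return heapMb - youngGenMb(heapMb, newRatio);
    }

    public static void main(String[] args) {
        long heapMb = 5120;      // -Xmx5120m from the startup script
        int newRatio = 1;        // -XX:NewRatio=1
        long physicalMb = 2048;  // 2C2G container

        long young = youngGenMb(heapMb, newRatio);
        System.out.println("young = " + young + " MB, old = "
                + oldGenMb(heapMb, newRatio) + " MB");
        System.out.println("young generation exceeds physical memory: "
                + (young > physicalMb));
    }
}
```

With the values from this incident, the young generation alone (2560 MB) is larger than the 2048 MB of physical memory.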

3.2 Virtual Memory

Linux uses virtual memory to cope with memory shortages. When physical memory runs low, the kernel moves some data from physical memory to swap space (stored on disk) to free physical memory, and swaps the data back in when it is needed again. The downside is that swapping data between memory and disk is very expensive: disk I/O is far slower than memory access, and the swapping itself costs CPU. For this reason, services such as Elasticsearch (ES) generally disable swap.
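To see how much swap a Linux machine is actually using, the SwapTotal and SwapFree fields of /proc/meminfo can be read. The sketch below is one way to do that from Java. The class and method names are hypothetical, and it assumes the standard "Name:   value kB" line format of /proc/meminfo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper: computes swap usage from /proc/meminfo lines. */
final class SwapUsage {

    // Returns the value in kB of a field such as "SwapTotal", or -1 if absent.
    static long fieldKb(List<String> meminfoLines, String field) {
        for (String line : meminfoLines) {
            if (line.startsWith(field + ":")) {
                // Line format: "SwapTotal:       2097148 kB"
                String[] parts = line.substring(field.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    // Swap in use = SwapTotal - SwapFree, or -1 if either field is missing.
    static long swapUsedKb(List<String> meminfoLines) {
        long total = fieldKb(meminfoLines, "SwapTotal");
        long free = fieldKb(meminfoLines, "SwapFree");
        return (total < 0 || free < 0) ? -1 : total - free;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/meminfo"));
        System.out.println("swap used: " + swapUsedKb(lines) + " kB of "
                + fieldKb(lines, "SwapTotal") + " kB");
    }
}
```

When the used value is close to SwapTotal, as it was on this machine, swap is effectively full.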

3.3 Is memory allocated by the JVM physical memory?

Virtual memory gives each process its own virtual address space, and real physical memory is allocated only when the virtual memory is actually used. With virtual memory plus swap, every process gets a virtual address space of the same size, and that space can exceed the actual physical memory. When physical memory runs short, the kernel swaps some of the data out to disk to free physical memory.

To Linux, the JVM is an ordinary process like any other, so what the JVM is given at startup is virtual memory, just as for any other process. The metaspace, the young generation, and the old generation are all allocated as virtual memory. Physical memory is mapped only when the memory is actually used, and swap comes into play when physical memory is insufficient.

3.4 Why does Young GC take so long?

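In this scenario, the young generation is allocated 2.5 GB of virtual memory while the machine has only 2 GB of physical memory, so part of the young generation ends up in swap. A Young GC has to touch the objects in the young generation, and every access to a page that has been swapped out must first read it back from disk, which makes the Young GC very slow.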

3.5 Solutions

1. Correct the JVM startup configuration.
2. Upgrade the machine configuration.

Based on the actual situation, we upgraded the machine to 2C4G and changed the JVM configuration to -Xmx3g -Xms3g. With the original configuration, the service started slowly and performed poorly, which affected testing in the pre-release environment, so the machine was upgraded accordingly.
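A copied configuration like this one could be caught before it reaches an environment. The sketch below is a hypothetical pre-start check, not something our startup script actually contains: it parses the size in an -Xmx flag and compares it with the physical memory of the machine. The class and method names are assumptions made for illustration.

```java
/** Hypothetical pre-start check: does the configured heap fit in physical memory? */
final class HeapConfigCheck {

    // Parses the size part of -Xmx/-Xms (e.g. "5120m", "3g", "524288k") into bytes.
    static long parseSizeBytes(String size) {
        String s = size.trim().toLowerCase();
        char unit = s.charAt(s.length() - 1);
        long multiplier;
        switch (unit) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024; break;
            case 'g': multiplier = 1024L * 1024 * 1024; break;
            default:  return Long.parseLong(s); // no suffix: plain bytes
        }
        return Long.parseLong(s.substring(0, s.length() - 1)) * multiplier;
    }

    // True when the max heap alone would not fit in physical memory.
    static boolean heapExceedsPhysical(String xmxFlag, long physicalBytes) {
        if (!xmxFlag.startsWith("-Xmx")) {
            throw new IllegalArgumentException("expected an -Xmx flag: " + xmxFlag);
        }
        return parseSizeBytes(xmxFlag.substring(4)) > physicalBytes;
    }

    public static void main(String[] args) {
        long twoGb = 2L * 1024 * 1024 * 1024;   // 2C2G
        long fourGb = 4L * 1024 * 1024 * 1024;  // 2C4G
        System.out.println("-Xmx5120m on 2G: " + heapExceedsPhysical("-Xmx5120m", twoGb));
        System.out.println("-Xmx3g on 4G:    " + heapExceedsPhysical("-Xmx3g", fourGb));
    }
}
```

The check only compares the heap with physical memory. In practice the JVM also needs room for metaspace, thread stacks, and other native memory, so the heap should stay well below the machine's total.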

4. Summary

Before this, I had only read the theory of virtual memory in the book “Computer Operating System”, but it was exactly that theory that let me work out the cause of this incident. Reading and thinking always pay off. You may need it one day!