A key service in a recent project ran into recurring GC problems because of the specific nature of its business. After a long stretch of tracking and experimenting, we finally arrived at a solution that works well. The process and the lessons learned are recorded below.

Background introduction

The service ranks goods. The business requirements are as follows:

  1. Goods are partitioned by country, and each country has its own set of goods.
  2. Each item has a primary key field, goodsId, and a feature vector stored as a one-dimensional float array of length 128.
  3. When ranking, the caller provides a query feature vector A and a set of candidate goodsIds (5,000 at most); the service multiplies A with the feature vector of each candidate to obtain a matching score per item and returns the scores.
  4. The goods set needs to be updated regularly, and each country's goods set is updated separately.

You can see how unusual this service is: up to 5,000 float[128] arrays are looked up and multiplied per request! How to store this data is a real problem.

The solution we used was to build a large map in memory with the structure Map<String, Map<String, float[]>>. The outer map holds the mapping from country to goods set, and the inner map holds the mapping from goodsId to its feature vector. We estimated the data volume: a single inner map takes roughly 350 MB of memory, and the whole outer map about 2 GB.

To ensure understanding, a simple map is shown as follows:

```
Nation1:
  goodsId1: feature vector 1
  goodsId2: feature vector 2
  ...
Nation2:
  goodsId1: feature vector 1
  goodsId2: feature vector 2
  ...
```
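In Java terms, a minimal sketch of this structure looks like the following (names are illustrative, not the actual project code):

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureStoreSketch {
    public static void main(String[] args) {
        // Outer key: country; inner key: goodsId; value: 128-dimensional feature vector.
        Map<String, Map<String, float[]>> featureStore = new HashMap<>();

        Map<String, float[]> countryGoods = new HashMap<>();
        countryGoods.put("goods-1", new float[128]);
        featureStore.put("Nation1", countryGoods);

        // Lookup path used at query time: country -> goodsId -> feature vector.
        float[] features = featureStore.get("Nation1").get("goods-1");
        System.out.println("vector length = " + features.length);
    }
}
```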

The question is: why not use a centralized cache like Redis instead of putting the data directly in memory?

Well, I ran a round of load tests before writing this article and found that Redis really isn't that fast for this workload. For example, the officially quoted 100,000+ OPS for a single instance is achievable, but what does that number mean? It means one GET takes about 0.01 ms, so an MGET of 1,000 keys takes about 10 ms, and that is before any network latency. I measured MGETs of 5,000 keys locally (server and client on the same physical machine and a VM on it, respectively), and the latency was between 40 and 60 ms (in this scenario the values are not large, about 1 KB each, so they did not cause significant extra degradation). Here's a related article: Redis performance, fantasy vs. harsh reality.

Another idea is Redis plus a local cache. But in this scenario, with millions of items and no hot keys, a local cache would not be effective.

Back to the point. With this map, the main interface of the service is ready:

  • Input: a nation parameter, a group of goodsIds, and a query feature vector A, a float[128].
  • Look up the goods' feature vectors by nation and goodsId, multiply each with A, and return the matching scores. (A minimal sketch of this computation is shown below.)
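A minimal sketch of that lookup-and-score step, assuming the in-heap map described above (method and variable names are illustrative, not the actual project code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankingSketch {

    // featureStore: country -> (goodsId -> float[128]), as described above.
    static Map<String, Float> score(Map<String, Map<String, float[]>> featureStore,
                                    String nation, List<String> goodsIds, float[] queryVector) {
        Map<String, float[]> countryGoods = featureStore.get(nation);
        Map<String, Float> scores = new HashMap<>();
        for (String goodsId : goodsIds) {
            float[] features = countryGoods.get(goodsId);
            if (features == null) {
                continue; // unknown goodsId, skip it
            }
            // Dot product of the query vector with this item's feature vector.
            float dot = 0f;
            for (int i = 0; i < queryVector.length; i++) {
                dot += queryVector[i] * features[i];
            }
            scores.put(goodsId, dot);
        }
        return scores;
    }
}
```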

Results of the first version: under normal QPS, the average latency stays within 10 ms.

That's the background. The nightmare is about to begin.

GC problems occur

It went live and everything was perfect. However, after running for a while, the upstream service began to see timeouts and even circuit-breaker trips from time to time, each episode lasting only a short while. After some investigation, we noticed that the service's TP99 spiked whenever the problem occurred, as shown in the figure below:

Response times sometimes soared to nearly a second! Since there was nothing abnormal in the application logs, the only remaining suspect was GC, so I dug into the GC logs to find out.

Observing GC Logs

The JDK in use is OpenJDK 1.8.0_212. Here is the Full GC log captured when the problem occurred:

```
{Heap before GC invocations=393 (full 5):
 PSYoungGen      total 1191936K, used 191168K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 986112K, 0% used [0x000000076ab00000, 0x000000076ab00000, 0x00000007a6e00000)
  from space 205824K, 92% used [0x00000007b3700000, 0x00000007bf1b0000, 0x00000007c0000000)
  to   space 205824K, 0% used [0x00000007a6e00000, 0x00000007a6e00000, 0x00000007b3700000)
 ParOldGen       total 2796544K, used 2791929K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
  object space 2796544K, 99% used [0x00000006c0000000, 0x000000076a67e750, 0x000000076ab00000)
 Metaspace       used 70873K, capacity 73514K, committed 73600K, reserved 1114112K
  class space    used 8549K, capacity 9083K, committed 9088K, reserved 1048576K
4542.168: [Full GC (Ergonomics) [PSYoungGen: 191168K->167781K(1191936K)] [ParOldGen: 2791929K->2796093K(2796544K)] 2983097K->2963875K(3988480K), [Metaspace: 70873K->70638K(1114112K)], 2.9853595 secs] [Times: user=11.28 sys=0.00, real=2.99 secs]
Heap after GC invocations=393 (full 5):
 PSYoungGen      total 1191936K, used 167781K [0x000000076ab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 986112K, 0% used [0x000000076ab00000, 0x000000076ab00000, 0x00000007a6e00000)
  from space 205824K, 81% used [0x00000007b3700000, 0x00000007bdad95e8, 0x00000007c0000000)
  to   space 205824K, 0% used [0x00000007a6e00000, 0x00000007a6e00000, 0x00000007b3700000)
 ParOldGen       total 2796544K, used 2796093K [0x00000006c0000000, 0x000000076ab00000, 0x000000076ab00000)
  object space 2796544K, 99% used [0x00000006c0000000, 0x000000076aa8f6d8, 0x000000076ab00000)
 Metaspace       used 70638K, capacity 73140K, committed 73600K, reserved 1114112K
  class space    used 8514K, capacity 9016K, committed 9088K, reserved 1048576K
}
```

Some information that can be gleaned from the logs:

  • On JDK 8, if no collector is specified, the default is Parallel Scavenge + Parallel Old, not CMS.
  • The cause of this Full GC is that the old generation is full, and the stop-the-world pause lasts about 3 seconds…

Adjust GC policy

Since there is a GC problem, some tuning is in order. Here are some of the things I tried:

  1. Switch the garbage collector to CMS.
  2. Since the old generation doesn't have enough space, give it more: enlarge the whole heap, and the old generation along with it. (An illustrative set of flags is sketched after this list.)
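For reference, the two experiments roughly correspond to JVM options like the following. These are illustrative values only, not our exact production settings; the GC-logging flags are simply the ones that typically produce logs of the kind shown in this article:

```
# Experiment 1: switch to CMS (ParNew + CMS on JDK 8)
-XX:+UseConcMarkSweepGC

# Experiment 2: enlarge the heap (and with it the old generation); sizes are made up
-Xms6g -Xmx6g -Xmn1g

# Handy for confirming which collector is actually in effect
-XX:+PrintCommandLineFlags

# GC logging of the kind quoted above
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps
```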

The following are the experimental results and conclusions:

  1. Switching to CMS didn't help; it actually made things worse. My guess is that CMS depends more on the number of CPU cores, and we set a very low CPU limit in the Docker environment, so the concurrent phases of CMS brought little benefit. Worse, the old generation sometimes ran out of space mid-cycle, triggering a Concurrent Mode Failure and falling back to the single-threaded Full GC safety net, which takes even longer.
  2. With a larger heap, Full GCs happen less often, but each one takes longer. So this doesn't solve the problem either.

Here is a log from one of the CMS incidents:

```
[GC (CMS Initial Mark) [1 CMS-initial-mark: 4793583K, 0.0075637 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[CMS-concurrent-mark-start]
03:05:50.594 INFO  [XNIO-2 task-8] c.s.hein.srchvecsort.filter.LogFilter - GET /prometheus?null took 3ms and returned 200
{Heap before GC invocations=240 (full 7):
 par new generation   total 737280K, used 737280K [0x0000000640000000, 0x0000000672000000, 0x0000000672000000)
  eden space 655360K, 100% used [0x0000000640000000, 0x0000000668000000, 0x0000000668000000)
  from space 81920K, 100% used [0x0000000668000000, 0x000000066d000000, 0x000000066d000000)
  to   space 81920K,   0% used [0x000000066d000000, 0x000000066d000000, 0x0000000672000000)
 concurrent mark-sweep generation total 5472256K, used 4793583K [0x0000000672000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 66901K, capacity 69393K, committed 69556K, reserved 1110016K
  class space    used 8346K, capacity 8805K, committed 8884K, reserved 1048576K
[GC (Allocation Failure) [ParNew: 737280K->737280K(737280K), 0.0000229 secs][CMS[CMS-concurrent-mark: ... secs] [Times: user=1.36 sys=0.05, real=1.05 secs]
 (concurrent mode failure): 4793583K->3662044K(5472256K), 3.8206326 secs] 5530863K->3662044K(6209536K), [Metaspace: 66901K->66901K(1110016K)], 3.8207144 secs]
Heap after GC invocations=240 (full 8):
 par new generation   total 737280K, used 0K [0x0000000640000000, 0x0000000672000000, 0x0000000672000000)
  eden space 655360K,   0% used [0x0000000640000000, 0x0000000640000000, 0x0000000668000000)
  from space 81920K,   0% used [0x0000000668000000, 0x0000000668000000, 0x000000066d000000)
  to   space 81920K,   0% used [0x000000066d000000, 0x000000066d000000, 0x0000000672000000)
 concurrent mark-sweep generation total 5472256K, used 3662044K [0x0000000672000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 66901K, capacity 69393K, committed 69556K, reserved 1110016K
  class space    used 8346K, capacity 8805K, committed 8884K, reserved 1048576K
}
```

Incidentally, there were some pitfalls along the way, mainly limitations of the Docker/K8s environment, some of which are still unresolved. I'll write more about them when I get the chance.

  • jstat could not target the process, because inside the container the Java process ID shows up as 0.
  • An OOM crash dump seems hard to collect.
  • VisualVM could not connect properly.

How to avoid the GC problem

Think: What’s the problem?

The cause of the GC problem is obvious: due to the nature of the business, we keep several huge Map objects in memory, and they inevitably end up in the old generation. Yet these objects are not immortal! Every so often, when the data needs to be updated, new maps are built, and the old maps lose their references and must be reclaimed by GC. Because so much old-generation memory has to be freed at once in a Major GC, it is hard to push that pause time down very far.

Since the problem lies with the large maps, the natural solution is to avoid using them, or more precisely, to stop putting such large data into heap memory.

Data that is not in heap memory goes either off-heap (in-process direct memory) or out of process.

The out-of-process option is obviously a database; what about in-process? There are in-process databases (such as Berkeley DB) and off-heap caches. A database is too heavy for my needs; all I really want is a map. So I decided to take a closer look at off-heap caching.

The general landscape of Java caching won't be covered here; refer to the caching section of "Learning architecture with Kaitao":

Java cache types

  • Heap cache: objects are stored in Java heap memory. The advantage is no serialization/deserialization, so it is fast; the disadvantage is that it is affected by GC. Can be implemented with Guava Cache, Ehcache 3.x, or MapDB.
  • Off-heap cache: cached data is stored outside the heap, escaping the GC's grip, but reads require serialization/deserialization, which makes it slower than a heap cache. Can be implemented with Ehcache 3.x or MapDB.
  • Disk cache: data survives a JVM restart, whereas heap/off-heap cache data is lost and must be reloaded. Can be implemented with Ehcache 3.x or MapDB.
  • Distributed cache: nothing more to say, Redis…

Off-heap caching

Here are two articles:

  • What you don’t know about out-of-heap caching
  • Introduction to and use of JAVA off-heap memory

A quick summary:

  1. Caching large amounts of data in the heap causes GC performance problems; an off-heap cache can solve this.
  2. Off-heap caches use the Unsafe class to operate directly on process memory, so they must manage memory reclamation themselves and serialize/deserialize Java objects, because off-heap memory only holds bytes, not Java objects.
  3. Functionality this useful and this hard to implement is best left to a framework. Options include MapDB, OHC, Ehcache 3, etc. Ehcache 3 charges for this, and OHC is the fastest.

In summary, I decided to adopt OHC. The official website is here.
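To get a feel for the API, here is a minimal OHC usage sketch, assuming String keys (goodsId) and float[128] values. The serializers are simplified stand-ins and the capacity is an arbitrary example; treat this as an orientation sketch, not the project's actual code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.caffinitas.ohc.CacheSerializer;
import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class OhcQuickStart {

    // Simplified String serializer for goodsId keys: raw UTF-8 bytes, relying on the fact
    // that OHC hands deserialize() a buffer limited to exactly serializedSize() bytes.
    static final CacheSerializer<String> STRING_SERIALIZER = new CacheSerializer<String>() {
        public void serialize(String value, ByteBuffer buf) {
            buf.put(value.getBytes(StandardCharsets.UTF_8));
        }
        public String deserialize(ByteBuffer buf) {
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
        public int serializedSize(String value) {
            return value.getBytes(StandardCharsets.UTF_8).length;
        }
    };

    // Fixed-length float[128] value serializer, writing the floats straight into the buffer.
    static final CacheSerializer<float[]> VECTOR_SERIALIZER = new CacheSerializer<float[]>() {
        public void serialize(float[] value, ByteBuffer buf) {
            for (float f : value) buf.putFloat(f);
        }
        public float[] deserialize(ByteBuffer buf) {
            float[] value = new float[buf.remaining() / Float.BYTES];
            for (int i = 0; i < value.length; i++) value[i] = buf.getFloat();
            return value;
        }
        public int serializedSize(float[] value) {
            return value.length * Float.BYTES;
        }
    };

    public static void main(String[] args) throws IOException {
        try (OHCache<String, float[]> cache = OHCacheBuilder.<String, float[]>newBuilder()
                .keySerializer(STRING_SERIALIZER)
                .valueSerializer(VECTOR_SERIALIZER)
                .capacity(512L * 1024 * 1024)   // 512 MB of off-heap memory, illustrative only
                .build()) {
            cache.put("goods-1", new float[128]);
            float[] vector = cache.get("goods-1");
            System.out.println("vector length = " + (vector == null ? 0 : vector.length));
        }
    }
}
```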

Code design

Ideas:

  1. For flexibility, use the strategy pattern. See the end of the off-heap cache article linked above.
  2. Since it is still used as a map, the wrapping utility class implements the Map interface. One OHCache object in OHC represents one off-heap cache, and I wrap it as a map that holds one country's data. Naturally, the application uses multiple OHCache instances.
  3. Also, note that the OHCache class itself implements the Closeable interface; calling its close() method releases its resources, i.e., reclaims the off-heap memory. Therefore the wrapping utility class also implements Closeable, and when a country's data is refreshed it calls close() on the old map object to free the memory. Testing confirmed this works. (A sketch of this wrapper follows the list.)
  4. Since this runs in Spring Boot and is a caching framework, it is natural to want to adapt it to Spring's caching abstraction. That has not been done yet.
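A minimal sketch of that wrapper idea, focusing only on the per-country OHCache handling and the close-on-refresh behavior (class and method names here are mine, not the project's actual code):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.caffinitas.ohc.OHCache;

// Holds one OHCache per country and closes the old cache when a country's data is refreshed,
// so that its off-heap memory is released immediately instead of lingering.
public class CountryFeatureStore implements Closeable {

    private final Map<String, OHCache<String, float[]>> cachesByNation = new ConcurrentHashMap<>();

    public float[] get(String nation, String goodsId) {
        OHCache<String, float[]> cache = cachesByNation.get(nation);
        return cache == null ? null : cache.get(goodsId);
    }

    // Swap in a freshly built cache for the given country and free the old one.
    public void replace(String nation, OHCache<String, float[]> freshCache) throws IOException {
        OHCache<String, float[]> old = cachesByNation.put(nation, freshCache);
        if (old != null) {
            old.close(); // releases the off-heap memory held by the previous snapshot
        }
    }

    @Override
    public void close() throws IOException {
        for (OHCache<String, float[]> cache : cachesByNation.values()) {
            cache.close();
        }
        cachesByNation.clear();
    }
}
```

In a real service you would also have to make sure no request is still reading from the old cache at the moment it is closed; that concern is outside the scope of this sketch.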

The effect

The following figure shows the GC behavior before the improvement, when refreshing data and replacing the in-heap map:

After switching to the off-heap version, I shrank the JVM heap to leave enough memory for the off-heap cache.

  • The effect was the same… there were still Major GC spikes! I won't post the chart.
  • And the average latency went from 10 ms to 40 ms…

Going further

Let's talk about the causes of these two problems.

Pit 1: reading large files

One other change had been made at the same time: the data source used to rebuild the map was switched from a database to files on S3, each several hundred megabytes in size, and we read them with Commons IO's readLines() method. Well, that loads the entire file content into the heap, so of course it triggers GC!

After switching to a line iterator, the GC problem finally disappeared: no more Major GCs. (A before/after sketch of the file-reading code follows the chart below.)

The following figure shows a map replacement in progress:

The Major GC count is zero.
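For reference, the change from readLines() to a line iterator looks roughly like this with Commons IO (a sketch: the S3 download is omitted and the file format and parsing are placeholders):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class FeatureFileLoader {

    // Before: pulls the whole multi-hundred-MB file into the heap as a List<String>.
    static void loadAllAtOnce(File featureFile) throws IOException {
        List<String> lines = FileUtils.readLines(featureFile, StandardCharsets.UTF_8);
        for (String line : lines) {
            parseAndStore(line);
        }
    }

    // After: streams the file line by line, so only one line lives on the heap at a time.
    static void loadStreaming(File featureFile) throws IOException {
        LineIterator it = FileUtils.lineIterator(featureFile, "UTF-8");
        try {
            while (it.hasNext()) {
                parseAndStore(it.nextLine());
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }

    // Placeholder: parse one line (e.g. goodsId plus 128 floats) and put it into the cache.
    private static void parseAndStore(String line) {
        // details omitted
    }
}
```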

Pit 2: serialization

OHC requires you to provide serializers for keys and values, which are handed a ByteBuffer. Being naive, I once again reached for Apache's serialization utility: it converts the object into an in-heap byte array the way JDK serialization does, and I then copied that array into the ByteBuffer.

The solution is to operate on the ByteBuffer directly with a custom serialization format. After this change, the latency problem was solved as well. (A sketch of both versions follows.)
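Here is a sketch of the two value serializers side by side, assuming the value is a fixed-length float[128]. The naive version uses commons-lang3 SerializationUtils as a stand-in for the utility I originally reached for; the exact helper in the project may differ:

```java
import java.nio.ByteBuffer;

import org.apache.commons.lang3.SerializationUtils;
import org.caffinitas.ohc.CacheSerializer;

public class VectorSerializers {

    // Naive version: JDK-style serialization produces an on-heap byte[] (with object-stream
    // overhead), which is then copied into the ByteBuffer. Lots of garbage and extra copying.
    static final CacheSerializer<float[]> NAIVE = new CacheSerializer<float[]>() {
        public void serialize(float[] value, ByteBuffer buf) {
            buf.put(SerializationUtils.serialize(value));
        }
        public float[] deserialize(ByteBuffer buf) {
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return SerializationUtils.deserialize(bytes);
        }
        public int serializedSize(float[] value) {
            return SerializationUtils.serialize(value).length; // serializes a second time, even worse
        }
    };

    // Direct version: write the 128 floats straight into the buffer, no intermediate objects.
    static final CacheSerializer<float[]> DIRECT = new CacheSerializer<float[]>() {
        public void serialize(float[] value, ByteBuffer buf) {
            for (float f : value) buf.putFloat(f);
        }
        public float[] deserialize(ByteBuffer buf) {
            float[] value = new float[buf.remaining() / Float.BYTES];
            for (int i = 0; i < value.length; i++) value[i] = buf.getFloat();
            return value;
        }
        public int serializedSize(float[] value) {
            return value.length * Float.BYTES;
        }
    };
}
```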

The final result

Finally, here is the change in the process's total memory while the map is being replaced:

Thanks for reading!

Afterword

OHC ran flawlessly in production for several months. Then a problem did surface: I use a fixed capacity with the default loadFactor of 0.75, and as more and more items were indexed (the total number of goods kept growing), my preset capacity was exceeded, and then… some entries were overwritten. This is a notable pit, recorded here for the record. Ultimately, using a component in production means trying to understand every aspect of it.