Netty memory leak problems on a Java server

Memory leaks tend to be rare, but they can be very troublesome. Here I summarize the lessons from two production incidents, in the hope that the experience will be useful in the future.

The conclusion first: do not hand-roll Netty yourself, and when a memory problem appears, first ask whether someone has hand-rolled Netty.

Two production environment accidents

  1. Memory leak caused by file uploads

Because of the platform's high performance requirements, the stress-test target was more than 10,000 concurrent connections, which an ordinary Spring Boot Tomcat container could not reach. We therefore put a hand-written Netty layer in front of Spring Boot (newer Spring Boot versions already support Netty natively) to raise the concurrency of the network connection layer, dispatching requests to Spring beans reflectively:

    Object result = ReflectionUtils.invokeMethod(method, bean, paramObjs);
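The platform's actual dispatch code is not shown beyond the line above, so here is a hypothetical sketch, assuming the bean and method were resolved from the Spring ApplicationContext at startup and that parameter binding is simplified to passing the raw request body; the class name and wiring are illustrative only:

    import io.netty.channel.ChannelFutureListener;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.SimpleChannelInboundHandler;
    import io.netty.handler.codec.http.DefaultFullHttpResponse;
    import io.netty.handler.codec.http.FullHttpRequest;
    import io.netty.handler.codec.http.FullHttpResponse;
    import io.netty.handler.codec.http.HttpHeaderNames;
    import io.netty.handler.codec.http.HttpResponseStatus;
    import io.netty.handler.codec.http.HttpVersion;
    import org.springframework.util.ReflectionUtils;

    import java.lang.reflect.Method;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch: one handler instance per resolved (bean, method) pair.
    public class SpringDispatchHandler extends SimpleChannelInboundHandler<FullHttpRequest> {

        private final Object bean;   // Spring bean looked up from the ApplicationContext
        private final Method method; // target method resolved at startup

        public SpringDispatchHandler(Object bean, Method method) {
            this.bean = bean;
            this.method = method;
        }

        @Override
        protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
            // Real parameter binding is omitted; the raw body is passed as the only argument.
            Object[] paramObjs = { request.content().toString(StandardCharsets.UTF_8) };
            Object result = ReflectionUtils.invokeMethod(method, bean, paramObjs);

            byte[] body = String.valueOf(result).getBytes(StandardCharsets.UTF_8);
            FullHttpResponse response = new DefaultFullHttpResponse(
                    HttpVersion.HTTP_1_1, HttpResponseStatus.OK,
                    ctx.alloc().buffer(body.length).writeBytes(body));
            response.headers().set(HttpHeaderNames.CONTENT_LENGTH, body.length);
            // SimpleChannelInboundHandler releases the inbound request after this method returns.
            ctx.writeAndFlush(response).addListener(ChannelFutureListener.CLOSE);
        }
    }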

File uploads were also handled directly in this hand-rolled Netty layer, roughly as follows:

    HttpPostRequestDecoder decoder = new HttpPostRequestDecoder(factory, request);
    try {
        Map<String, String> attrs = Maps.newHashMap();
        List<File> files = Lists.newArrayList();
        while (decoder.hasNext()) {
            InterfaceHttpData data = decoder.next();
            try {
                switch (data.getHttpDataType()) {
                case FileUpload:
                    FileUpload fileUpload = (FileUpload) data;
                    if (fileUpload.isCompleted()) {
                        File file = new File("somedir", fileUpload.getFilename());
                        fileUpload.renameTo(file);
                        files.add(file);
                    }
                    break;
                case Attribute:
                    Attribute attribute = (Attribute) data;
                    attrs.put(attribute.getName(), attribute.getValue());
                    break;
                default:
                    break;
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Release each data item explicitly once it has been consumed.
                decoder.removeHttpDataFromClean(data);
                data.release();
            }
        }
    } finally {
        // Delete any remaining temp files and release the decoder's internal buffers.
        decoder.cleanFiles();
        decoder.destroy();
    }
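The factory and request used by the snippet above are not shown. What follows is a minimal sketch of one way they might be wired up, assuming the request is a FullHttpRequest aggregated earlier in the pipeline; a disk-backed DefaultHttpDataFactory keeps large uploads out of memory, which matters when direct buffers are scarce:

    import io.netty.handler.codec.http.FullHttpRequest;
    import io.netty.handler.codec.http.multipart.DefaultHttpDataFactory;
    import io.netty.handler.codec.http.multipart.HttpDataFactory;
    import io.netty.handler.codec.http.multipart.HttpPostRequestDecoder;

    public final class UploadDecoders {

        // Data above MINSIZE (16 KB) is spilled to temporary files on disk
        // instead of being buffered in memory.
        private static final HttpDataFactory FACTORY =
                new DefaultHttpDataFactory(DefaultHttpDataFactory.MINSIZE);

        private UploadDecoders() {
        }

        static HttpPostRequestDecoder newDecoder(FullHttpRequest request) {
            return new HttpPostRequestDecoder(FACTORY, request);
        }
    }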

JMeter stress tests showed the service easily supporting up to 28,000 concurrent sessions.

After the service went to production, memory was observed to keep growing over time. Various memory analysis tools were tried in turn; the heap looked normal, and nothing conclusive was found.

Later, some users reported that image uploads were failing. The logs showed repeated out-of-heap (direct) memory exhaustion. Increasing the memory did not help, so I finally suspected the upload path itself, removed file uploading from the application layer, and switched it to the Nginx upload module. After that, the out-of-heap memory problem was gone.

  2. spring-cloud-gateway memory leak

This was another platform, built uniformly on the spring-cloud technical stack. After running in production for a long time, it also showed slow memory growth.

Based on past experience, I suspected that earlier developers had rewritten Netty-related logic, so I reviewed the code. Only custom org.springframework.cloud.gateway.filter.GlobalFilter implementations were found; nothing operated on the underlying Netty layer directly. JVM analysis showed that heap memory was normal and that out-of-heap memory usage was too large, so Netty's memory leak detection was turned on:

  -Dio.netty.leakDetection.level=advanced
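For reference, the same setting can also be applied programmatically through Netty's ResourceLeakDetector; a small sketch (ADVANCED samples buffers and records their recent access points, PARANOID tracks every buffer at much higher cost):

    import io.netty.util.ResourceLeakDetector;

    public class LeakDetectionConfig {

        // Equivalent to -Dio.netty.leakDetection.level=advanced; safest to call
        // early, before the application allocates its first ByteBuf.
        public static void enableAdvancedLeakDetection() {
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED);
        }
    }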

After tracing during a stress test, the following error appeared in the log:

    LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
    Recent access records:
    #2:
        io.netty.buffer.AdvancedLeakAwareByteBuf.nioBuffer(AdvancedLeakAwareByteBuf.java:712)
        org.springframework.cloud.gateway.filter.NettyWriteResponseFilter.wrap(NettyWriteResponseFilter.java:115)
        org.springframework.cloud.gateway.filter.NettyWriteResponseFilter.lambda$null$1(NettyWriteResponseFilter.java:87)
        reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100)
        org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:90)
        reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:192)
        org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:90)
        reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114)
        reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:256)
        reactor.netty.channel.FluxReceive.lambda$request$1(FluxReceive.java:135)
        io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        java.base/java.lang.Thread.run(Thread.java:834)

Searching the official spring-cloud-gateway issues one by one, I found that there is indeed a reported problem pointing to NettyWriteResponseFilter:

‘NettyWriteResponseFilter.wrap never releases the original pooled buffer in case of DefaultDataBufferFactory.’

Github.com/spring-clou…
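The leak record above points at a nioBuffer() call inside NettyWriteResponseFilter.wrap. Without reproducing the gateway's actual code, the general bug class it belongs to can be illustrated as follows: a view of a pooled ByteBuf escapes, the buffer itself is dropped without release(), and the pooled direct memory behind it is never returned.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class PooledBufferReleaseDemo {

        // Leaky pattern: the NIO view escapes, the pooled ByteBuf is forgotten,
        // and its reference count never reaches zero.
        static ByteBuffer leakyWrap(ByteBuf pooled) {
            return pooled.nioBuffer();
        }

        // Safer pattern: copy out what is needed, then release the pooled buffer.
        static ByteBuffer safeWrap(ByteBuf pooled) {
            try {
                ByteBuffer copy = ByteBuffer.allocate(pooled.readableBytes());
                pooled.readBytes(copy);
                copy.flip();
                return copy;
            } finally {
                pooled.release();
            }
        }

        public static void main(String[] args) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer();
            buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(safeWrap(buf).remaining()); // prints 5
        }
    }

According to the issue quoted above, the gateway's wrap method hit this kind of pattern when a DefaultDataBufferFactory was in use, which matches the leak record.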

The Spring Cloud Hoxton.SR8 release train still brings in the affected gateway code:

    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-gateway-core</artifactId>
        <version>2.2.5.RELEASE</version>
    </dependency>

To avoid affecting other modules, the release train version was left unchanged and only spring-cloud-gateway-core was upgraded from 2.2.5.RELEASE to 2.2.6.RELEASE, where a check of the source code confirmed that the problem had been fixed. A final stress test verified that the out-of-heap memory problem was solved.

Accident summary

Both accidents share the same lesson: these problems could have been caught in the test phase, as long as the stress-test scenarios were well designed and the tests ran long enough. So it is essentially a problem of the management process.

Accident by-products

Troubleshooting out-of-heap memory problems is completely different from troubleshooting heap memory problems: a heap dump snapshot can hardly reveal them.

The traditional heap memory structure diagram is not much help here: out-of-heap memory has almost nothing to do with it.
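A small sketch of why that is: a direct (off-heap) buffer leaves only a tiny wrapper object on the heap, so a heap dump shows almost nothing even while hundreds of megabytes are reserved outside the heap (the total reservable amount is capped by -XX:MaxDirectMemorySize):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class DirectMemoryDemo {

        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long heapBefore = rt.totalMemory() - rt.freeMemory();

            // Reserve 256 MB off-heap; a heap dump would only show 256 small
            // DirectByteBuffer wrapper objects, not the 256 MB itself.
            List<ByteBuffer> buffers = new ArrayList<>();
            for (int i = 0; i < 256; i++) {
                buffers.add(ByteBuffer.allocateDirect(1024 * 1024));
            }

            long heapAfter = rt.totalMemory() - rt.freeMemory();
            System.out.printf("heap grew by roughly %d KB while 256 MB sits off-heap (%d buffers)%n",
                    (heapAfter - heapBefore) / 1024, buffers.size());
        }
    }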

  1. Enable NativeMemoryTracking monitoring
  # JVM startup flag
  -XX:NativeMemoryTracking=[off | summary | detail]
  # Record a baseline (PID 1 here)
  jcmd 1 VM.native_memory baseline
  # Diff current native memory usage against the baseline
  jcmd 1 VM.native_memory summary.diff scale=MB
  2. Enable io.netty.leakDetection.level monitoring
  # JVM startup flag
  -Dio.netty.leakDetection.level=paranoid
  3. JXRay

This third-party tool can also track out-of-heap memory usage.

  4. Low-level memory debugging tools

gperftools, BTrace, jemalloc, pmap, and so on. These tools analyze memory problems at the operating-system level, so some knowledge of C is needed to use them effectively.

The above is a summary of personal experience; discussion and corrections are welcome.