1. What are the difficulties in troubleshooting online problems?

  1. We’ve learned a lot of theory about troubleshooting online problems, but never had a real chance to apply it?
  2. A problem appears online only occasionally, there’s no time to preserve the scene, and afterwards it is very hard to reproduce?
  3. Even when the problem reproduces reliably, the analysis is hard to start: without hands-on experience it is difficult to apply the theory flexibly?

2. What are the advantages of this article?

  1. The analysis is based directly on the open source project dubbo-admin; there is no complex business logic, so anyone can follow the source code
  2. The problem is reproducible (we can simulate the number of services manually)
  3. The dubbo-admin source code is simple and many companies use it, so studying it can help you solve practical problems

Memory leak problem

The new dubbo-admin is compatible with Dubbo 2.6 and also supports the new Dubbo 2.7 features. With the Dubbo 2.7 metadata center we can do things like service testing, which is already supported in the current version of dubbo-admin.

Conclusion

To state the conclusion first: the code that causes the memory leak is in org.apache.dubbo.admin.service.RegistryServerSync#notify, and the core part is this snippet:

// Reuse the id if this full URL string has been seen before
if (URL_IDS_MAPPER.containsKey(url.toFullString())) {
    ids.put(URL_IDS_MAPPER.get(url.toFullString()), url);
} else {
    // Otherwise generate a 16-bit MD5 id and remember the fullString -> id mapping.
    // Note: entries are only ever added to URL_IDS_MAPPER, never removed.
    String md5 = CoderUtil.MD5_16bit(url.toFullString());
    ids.put(md5, url);
    URL_IDS_MAPPER.putIfAbsent(url.toFullString(), md5);
}

To put it simply, URL_IDS_MAPPER keeps growing, taking up more and more memory and eventually causing non-stop full GC.
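To make this concrete, here is a minimal, self-contained sketch of the growth pattern (it is not the dubbo-admin code itself). It assumes, as Dubbo registration URLs do in practice, that the full URL string carries a timestamp parameter that changes on every re-registration, and it uses a simple stand-in for CoderUtil.MD5_16bit:

import java.security.MessageDigest;
import java.util.concurrent.ConcurrentHashMap;

// Minimal simulation of the leak: every re-registration produces a slightly different
// fullString (the timestamp parameter changes), so a new entry is added to
// URL_IDS_MAPPER and none of the old entries are ever removed.
public class UrlIdsMapperLeakDemo {

    private static final ConcurrentHashMap<String, String> URL_IDS_MAPPER = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        String base = "dubbo://10.0.0.1:20880/com.example.DemoService?side=provider&timestamp=";
        // Simulate one provider going online -> offline -> online 100 times
        for (int i = 0; i < 100; i++) {
            String fullString = base + (System.currentTimeMillis() + i);
            URL_IDS_MAPPER.putIfAbsent(fullString, md5_16bit(fullString));
        }
        // One logical service instance, but 100 cached keys
        System.out.println("URL_IDS_MAPPER size: " + URL_IDS_MAPPER.size());
    }

    // Stand-in for CoderUtil.MD5_16bit: the middle 16 hex characters of the MD5 digest
    private static String md5_16bit(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.substring(8, 24);
    }
}

One logical service instance ends up occupying 100 cache keys here; with a real registry and constant churn this is how URL_IDS_MAPPER can grow to the million-plus entries seen in the MAT analysis below.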

Analysis

1. When is this method executed? Whenever a node under /dubbo changes

2. The intention of URL_IDS_MAPPER is to maintain the mapping between an MD5 id and the full URL, but because its growth is never controlled, its size keeps increasing; arguably URL_IDS_MAPPER is not needed at all

3. What does "never controlled" mean? For example, every time a provider or consumer goes online -> offline -> online again, its URL's fullString changes slightly, so even though there is still only one instance of the service, a new MD5 entry is generated each time. If this happens frequently, URL_IDS_MAPPER grows larger and larger (the sketch above simulates exactly this)

4. Besides URL_IDS_MAPPER, this method also maintains a registryCache. Why doesn't registryCache leak memory? Because the same method also cleans registryCache up (a simplified sketch of this contrast follows the troubleshooting steps below)

5. How was this problem discovered? If the number of services is small and services rarely change, the problem may go unnoticed. But with a large number of services and frequent online/offline churn the problem is obvious: the application takes up more and more memory and spends more and more time in full GC

6. How to troubleshoot?

  1. top: find the problematic Java process
  2. top -p pid -H: find the busiest threads within that process
  3. jstack pid | grep 0xxx: locate those threads in the thread dump by their hex id
  4. Check the GC status
  5. Take a memory dump
  6. MAT analysis: look at the dominating objects and find that URL_IDS_MAPPER contains 1 million elements
  7. Analyze the code
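As a side note to point 4 above, here is a heavily simplified contrast between the two caches (illustrative only, not the real RegistryServerSync code; Dubbo signals an offline service by notifying with an empty:// URL, which is modeled here as a boolean flag, and the MD5 id is replaced with a hash stand-in):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified contrast: registryCache entries are removed when a service goes offline,
// while URL_IDS_MAPPER only ever gains entries.
public class NotifyContrastSketch {

    // category -> (serviceKey -> full URL string), a flattened stand-in for registryCache
    private static final Map<String, Map<String, String>> registryCache = new ConcurrentHashMap<>();
    private static final Map<String, String> URL_IDS_MAPPER = new ConcurrentHashMap<>();

    static void onNotify(String category, String serviceKey, String fullUrl, boolean offline) {
        if (offline) {
            // Offline notification: clean up registryCache, so it does not grow without bound
            Map<String, String> services = registryCache.get(category);
            if (services != null) {
                services.remove(serviceKey);
            }
        } else {
            registryCache.computeIfAbsent(category, c -> new ConcurrentHashMap<>()).put(serviceKey, fullUrl);
            // ...but nothing is ever removed from URL_IDS_MAPPER
            URL_IDS_MAPPER.putIfAbsent(fullUrl, Integer.toHexString(fullUrl.hashCode()));
        }
    }
}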

Frequent YGC

Background

The memory leak was solved, but a new problem appeared: YGC was far too frequent. This showed up in two ways:

  1. Application startup: 300-500 YGCs occur
  2. Application runtime: irregular bursts of frequent YGC, with stable periods of no GC followed by sudden periods of frequent GC

Conclusion

1. dubbo-admin is essentially a tool for querying and modifying registry information. It uses ZooKeeper's watch mechanism to detect changes in the registry in a timely manner, but instead of the native ZooKeeper API it uses the ZookeeperRegistry and NotifyListener provided by Dubbo to listen on the nodes under /dubbo (a sketch of this pattern follows this list).

2. Dubbo has a service information caching mechanism. Its purpose is fault tolerance: if the registry goes down, the cached service instances can still be called, although newly added instances cannot be discovered. The cache file has to be updated whenever a service node changes, i.e. in the AbstractRegistry#notify method. This is what causes the frequent YGC.

3. Why don't normal Dubbo applications have this problem while dubbo-admin does? A normal application only listens on /dubbo/<interface name> for the interfaces it actually uses, whereas dubbo-admin listens on the /dubbo node itself, which is equivalent to subscribing to every service.
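As mentioned in point 1, here is a rough sketch of the subscription pattern (it assumes Dubbo 2.7's Registry and NotifyListener API; the ZooKeeper address is a placeholder and the parameters on the subscribe URL are only an approximation of dubbo-admin's real SUBSCRIBE URL):

import java.util.List;

import org.apache.dubbo.common.URL;
import org.apache.dubbo.common.extension.ExtensionLoader;
import org.apache.dubbo.registry.NotifyListener;
import org.apache.dubbo.registry.Registry;
import org.apache.dubbo.registry.RegistryFactory;

// Instead of the raw ZooKeeper client API, obtain a Registry from Dubbo and subscribe
// a NotifyListener with interface=*, so every change under /dubbo ends up in notify().
public class AllServicesSubscriber implements NotifyListener {

    @Override
    public void notify(List<URL> urls) {
        // Called whenever providers/consumers/routers/configurators change
        System.out.println("registry change, " + urls.size() + " urls");
    }

    public static void main(String[] args) {
        RegistryFactory registryFactory =
                ExtensionLoader.getExtensionLoader(RegistryFactory.class).getAdaptiveExtension();
        Registry registry = registryFactory.getRegistry(URL.valueOf("zookeeper://127.0.0.1:2181"));

        // interface=* roughly means "all services", which is what dubbo-admin subscribes to
        URL subscribeUrl = URL.valueOf("admin://127.0.0.1?interface=*&group=*&version=*"
                + "&category=providers,consumers,routers,configurators&check=false");
        registry.subscribe(subscribeUrl, new AllServicesSubscriber());
    }
}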

The analysis process

If YGC is too frequent, you may not find what you are looking for in a memory dump, because the offending objects are collected too quickly

  1. Check the GC status: full GC is stable, but YGC is frequent
  2. Check the heap: the old generation is barely used, while the young generation fills up quickly and triggers YGC
  3. Try enlarging the young generation: YGC becomes less frequent but is still frequent, so the problem is not the heap sizing
  4. Take several memory dumps, but they contain little, because the objects are allocated and collected quickly
  5. Use JVisualVM, its GC plugin and its profiling features to see how much memory each thread is allocating
  6. The DubboSaveRegistryCache-thread-1 thread allocates a large number of objects in a short time; see AbstractRegistry for details (a simplified sketch follows this list)
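Item 6 refers to the cache-file mechanism in AbstractRegistry. Below is a simplified simulation of it (the class and field names are made up for illustration; it is not the real Dubbo code). The point is that every notify rewrites the whole cache, which creates a lot of short-lived objects when tens of thousands of services are being watched:

import java.io.File;
import java.io.FileOutputStream;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Simplified simulation of the registry cache file: on every notify, all known service
// URLs are joined into strings, copied into a Properties object and rewritten to disk.
// The temporary strings and Properties entries are exactly the short-lived garbage
// that keeps filling the young generation.
public class RegistryCacheSaveDemo {

    private final Map<String, List<String>> notified = new ConcurrentHashMap<>();
    private final File cacheFile = new File(System.getProperty("java.io.tmpdir"), "dubbo-registry-demo.cache");

    public void notify(String serviceKey, List<String> urls) {
        notified.put(serviceKey, urls);
        saveProperties(); // the whole cache is rewritten on every change
    }

    private void saveProperties() {
        Properties properties = new Properties();
        for (Map.Entry<String, List<String>> entry : notified.entrySet()) {
            // Re-creates the joined URL string for every service on every save
            properties.setProperty(entry.getKey(), String.join(" ", entry.getValue()));
        }
        try (FileOutputStream out = new FileOutputStream(cacheFile)) {
            properties.store(out, "demo registry cache");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}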

1. GC at the beginning (1200M young generation): frequent GC

2. GC after optimization (1200M young generation): the young generation grows slowly; with a large young generation GC is less frequent, but each GC takes longer

3. GC after optimization (512M young generation): the young generation is smaller, GC is slightly more frequent, but each GC takes less time

4. GC after optimization (512M young generation): manually simulating requests, the GC frequency speeds up, which matches expectations

5. GC after optimization (512M young generation): at this point everyone had presumably gone to dinner and no service information was being changed
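For reference, the young generation sizes compared above would be set with the standard -Xmn flag; the jar name and the remaining JVM options below are placeholders, since the article does not list the full command line:

# 1200M vs 512M young generation, as compared in the observations above
java -Xmn1200m ... -jar dubbo-admin-xxx.jar
java -Xmn512m ... -jar dubbo-admin-xxx.jar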

Solution: this caching mechanism exists purely for fault tolerance, and dubbo-admin does not need it at all, so simply turn it off.