Preface

This happened a long time ago, and if I do not write it down now I will forget it.

Cause

The gateway service was released around 20:00 and began to see a large number of timeouts by 22:00. The release looked like the obvious culprit: it integrated Sentinel for real-time traffic monitoring, degradation and circuit breaking.

Investigation

Run top and confirm that CPU usage has surged. First suspicion: Sentinel's heavy time-window computation plus network transmission.
Run top -Hp PID to find the specific thread that is consuming the CPU.
Run jstack PID to print the thread stacks.
The busy thread turned out to be a GC thread. Is Sentinel keeping a lot of data in memory?
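
One detail the steps above gloss over: top -Hp prints thread IDs in decimal, while jstack labels each thread with a hexadecimal nid. A minimal sketch of that lookup follows; the class and method names are made up for illustration, and the dump lines you feed it would come from your own jstack output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for matching a busy thread from top -Hp against a jstack dump. */
final class NidLookup {

    /** top -Hp prints thread IDs in decimal; jstack prints them as nid=0x<hex>. */
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    /** Returns the lines of the dump that carry the nid of the given thread ID. */
    static List<String> findThread(List<String> jstackLines, long tid) {
        String nid = toNid(tid);
        List<String> hits = new ArrayList<>();
        for (String line : jstackLines) {
            // Compare whole tokens so that nid=0x3039 does not also match nid=0x30390.
            for (String token : line.split("\\s+")) {
                if (token.equals(nid)) {
                    hits.add(line);
                    break;
                }
            }
        }
        return hits;
    }
}
```

For example, a thread shown by top -Hp as 12345 appears in the dump as nid=0x3039. If the matching line names a GC thread, the CPU is going to garbage collection rather than to application code.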

Analysis

If Sentinel were the cause, the integration would have to be put off for some time, which I did not want. Sentinel lets me watch the system's traffic change in real time and keep a model of how the system is running in my head. So I was betting that it was not Sentinel: I was hardly the first person to integrate it, and it is a mature product. I went to look at the machines' historical monitoring data, and the problem was obvious at a glance.

The gateway machines had been showing CPU and memory spikes at 22:00 every day. That regularity reminded me of the scheduled job that loads about 2 GB of location (attribution) data once a day. Sentinel was just the last straw.

Solution

Load the large file through memory mapping (direct memory address mapping) instead of reading it into memory, which saves one CPU copy and lowers CPU usage.
Move the scheduled load to 3:00 am plus a random offset, so that the data is not read while traffic is heavy and the nodes of the cluster do not all peak at the same moment.
Change the original "large young generation, small old generation" configuration: shrink the young generation, enlarge the old generation, and lower the tenuring (age) threshold.
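
The first two changes can be sketched in Java roughly as follows. This is an illustration under assumptions, not the code from the incident: the class name, the method names and the jitter window are hypothetical, and the original post does not say how the file was mapped or how the job was scheduled.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.Random;

/** Hypothetical helper: memory-mapped load plus a jittered 03:00 schedule. */
final class LocationDataReload {

    /** Maps the file read-only instead of copying it into a heap byte[]. */
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // One mapping is limited to Integer.MAX_VALUE bytes; a larger file needs several.
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Delay from now until the next 03:00 plus a random offset in [0, maxJitter). */
    static Duration delayUntilNextRun(LocalDateTime now, Duration maxJitter, Random rnd) {
        LocalDateTime next = now.toLocalDate().atTime(LocalTime.of(3, 0));
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // 03:00 has already passed today
        }
        long jitterSeconds = (long) (rnd.nextDouble() * maxJitter.getSeconds());
        return Duration.between(now, next).plusSeconds(jitterSeconds);
    }
}
```

The returned delay could be handed to a ScheduledExecutorService. Because each node draws its own random offset, the nodes of the cluster stop reloading at the same instant.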

The idea behind the JVM parameter tuning

The data in the service falls into three categories: transaction data, Sentinel data, and a large volume of location (attribution) data.

The young generation was configured too large and the tenuring threshold too high, so a large amount of location data stayed in the young generation for a long time and went through minor GC after minor GC, each with a stop-the-world (STW) pause. When that data finally reached the threshold and was promoted, the old generation could not find enough space for it.

The system produces about 20 MB of in-memory data per second. This data is short-lived, so it should never get the chance to enter the old generation.

So here’s the idea:

Large old generation: let the location data move into the old generation quickly, because it has to be kept for 24 hours; once there it simply waits for a full GC.
Small young generation: transaction data will not enter the old generation as long as half of the survivor space is larger than 20 MB.
Lower tenuring threshold: shorten the time the location data spends in the young generation so that it is promoted to the old generation sooner.
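
The "half of the survivor space" rule can be checked with a little arithmetic. The sketch below assumes the HotSpot layout in which the young generation is eden plus two equal survivor spaces, sized by -XX:SurvivorRatio, with -XX:TargetSurvivorRatio as the occupancy target. It also takes the 20 MB figure as the live transaction data at a minor GC, which is this article's simplification. The class name and the heap sizes are hypothetical, and collectors with adaptive sizing may resize the survivor spaces at run time.

```java
/** Hypothetical sizing check for the survivor-space rule described above. */
final class SurvivorSizing {

    /** One survivor space: young = eden + 2 survivors, and eden = survivorRatio * survivor. */
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    /** True if the live data stays below the target occupancy of one survivor space. */
    static boolean fitsInSurvivor(long youngBytes, int survivorRatio,
                                  int targetSurvivorPercent, long liveBytes) {
        long target = survivorBytes(youngBytes, survivorRatio) * targetSurvivorPercent / 100;
        return liveBytes < target;
    }
}
```

With an assumed 512 MiB young generation, a survivor ratio of 8 and a 50% target, one survivor space is about 51 MiB and the target about 25.6 MiB, so 20 MiB of live data fits. Shrinking the young generation to 256 MiB with the same ratios drops the target to about 12.8 MiB, and the same data would be promoted early. The young generation therefore should be reduced, but not below the point where this check still holds.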

Note the troubleshooting process and solution for CPU surge
