Preface

This happened a long time ago, and I am writing it down before I forget it entirely.

Cause

The gateway service was released around 20:00, and by 22:00 it was seeing a large number of timeouts. The release looked like the obvious culprit: it integrated Sentinel for real-time traffic monitoring, degradation and circuit breaking.

Troubleshooting

  • Run top and see that CPU usage has surged. First guess: Sentinel's heavy time-window computation plus network transmission?
  • Run top -Hp PID to find the specific thread that is burning CPU
  • Run jstack PID to print the thread stacks
  • Locate that thread in the dump: it turns out to be a GC thread. Is Sentinel keeping a lot of data in memory?
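
One step the list above leaves implicit: top -Hp reports thread IDs in decimal, while jstack prints each thread's native ID as a hexadecimal nid=0x... field, so the ID has to be converted before searching the dump. A minimal Java sketch of that conversion follows; the class name and the example ID 12345 are made up for illustration and do not come from the incident.

```java
// Hypothetical helper: turns the decimal thread ID shown by `top -Hp`
// into the hexadecimal form used by jstack's `nid=0x...` field.
final class ThreadIdToNid {

    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // 12345 is an invented example; pass the real thread ID as the first argument.
        long tid = args.length > 0 ? Long.parseLong(args[0]) : 12345L;
        System.out.println("search the jstack output for nid=" + toNid(tid));
    }
}
```

Searching the jstack output for the resulting value shows which thread top was reporting.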

Analysis

If Sentinel was the cause, we would not be able to integrate it any time soon, which I did not want. Sentinel lets me watch the system's traffic change in real time, so I can keep a model of how the system is running in my head. I was therefore betting that it was not Sentinel: I was hardly the first person to integrate it, and it is a mature product. So I started looking at the machines' historical data, and the problem was visible at a glance.

The gateway services had been hitting CPU and memory spikes at 22:00 every day. Something that regular reminded me of the scheduled job that loads 2 GB of location data every day. Sentinel was just the straw that broke the camel's back.

Solution

  • Load the large file through memory mapping (direct memory address mapping) instead of copying it into memory, which removes one CPU copy and lowers CPU usage
  • Change the schedule to 3:00 am plus a random offset, so the data is not read while traffic is heavy and the nodes of the cluster do not all spike at the same moment
  • The original configuration had a large young generation and a small old generation; shrink the young generation appropriately, enlarge the old generation, and lower the tenuring age threshold
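
The first two points can be sketched in Java as follows. This is an illustration under assumptions, not the code that ran in the gateway: it reads "direct memory address mapping" as a memory-mapped file obtained through FileChannel.map, and the class name, method names and jitter window are all hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical loader for the daily location file.
final class LocationDataLoader {

    /**
     * Maps the file read-only. The contents are served from the OS page cache
     * rather than being copied into a heap array, so they add no GC pressure.
     * A single mapping is limited to Integer.MAX_VALUE bytes, so a file larger
     * than that would have to be mapped in several regions.
     */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /**
     * Delay from now until the next 03:00, plus a random offset of up to
     * maxJitterMinutes, so that nodes do not all reload at the same instant.
     */
    static Duration delayUntilNextLoad(LocalDateTime now, int maxJitterMinutes) {
        LocalDateTime next = now.toLocalDate().atTime(LocalTime.of(3, 0));
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        long jitterSeconds = ThreadLocalRandom.current().nextLong(maxJitterMinutes * 60L + 1);
        return Duration.between(now, next).plusSeconds(jitterSeconds);
    }
}
```

The delay could then be handed to something like ScheduledExecutorService.schedule; because each node draws its own offset, the reloads spread across the jitter window.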

The reasoning behind the JVM parameter tuning

The data in the service falls into three categories: transaction data, Sentinel data, and a huge amount of location data.

The current young generation is configured too large and the age threshold is too high, so a large amount of location data stays in the young generation for a long time and goes through minor GC again and again, each one with a stop-the-world (STW) pause. When that data finally reaches the age threshold and is promoted, the old generation cannot find suitable space for it.
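
One way to confirm this kind of pattern from inside the process is to read the collector counters that the JVM exposes through the standard java.lang.management API. The sketch below is only an illustration (the class name is hypothetical, and the collector names it prints depend on which GC the JVM is running):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic: prints how often each collector has run
// and how much time it has accumulated since the JVM started.
final class GcSnapshot {

    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Taking two snapshots a few seconds apart shows how fast the young-generation count is climbing and whether the old-generation collector is also firing.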

The system produces about 20 MB of in-memory data per second, and this data is short-lived, so it should never get the chance to enter the old generation.

So here’s the idea:

  • Large old generation: let the location data enter the old generation quickly; it has to be kept for 24 hours, so it can sit there and wait for a full GC
  • Small young generation: transaction data will not enter the old generation as long as half of a survivor space is larger than 20 MB
  • Lower the age threshold: shorten the time the location data spends in the young generation so that it is promoted to the old generation sooner
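
The "half of a survivor space" rule can be checked with a little arithmetic. The sketch below assumes HotSpot's classic generational layout, where the young generation is one eden plus two survivor spaces, -XX:SurvivorRatio is the eden-to-survivor size ratio, and -XX:TargetSurvivorRatio (50 by default) is the survivor occupancy above which objects start being promoted early. The class name and the 512 MB and 256 MB young-generation sizes are hypothetical; only the 20 MB figure comes from the text.

```java
// Hypothetical sizing check for the young generation.
final class YoungGenSizing {

    /** Size of one survivor space: young = eden + 2 * survivor, eden = survivorRatio * survivor. */
    static double survivorMb(double youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    /** True if the target share of one survivor space exceeds the expected surviving data. */
    static boolean survivorsFit(double youngMb, int survivorRatio,
                                int targetSurvivorPercent, double survivingMb) {
        return survivorMb(youngMb, survivorRatio) * targetSurvivorPercent / 100.0 > survivingMb;
    }

    public static void main(String[] args) {
        // Assumed flags: -Xmn512m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=50
        System.out.println("survivor MB: " + survivorMb(512, 8));
        System.out.println("20 MB fits at 512 MB young gen: " + survivorsFit(512, 8, 50, 20));
        // Halving the young generation breaks the rule with the same ratios.
        System.out.println("20 MB fits at 256 MB young gen: " + survivorsFit(256, 8, 50, 20));
    }
}
```

With these assumed numbers, a 512 MB young generation gives survivor spaces of about 51 MB, so half of one clears the 20 MB mark, while a 256 MB young generation does not. The age threshold itself is set separately with -XX:MaxTenuringThreshold, and collectors that resize generations adaptively may not keep these ratios fixed.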