Background

I had just arrived at the office in the morning and was about to start the day's slacking off when I received an email from the monitoring center.

My heart sank, because the monitoring system never writes to tell me that the application is running perfectly with no bugs; in fact, it is almost unsettlingly honest about problems.

When I opened the email, I was told that an application’s thread pool queue had reached its threshold and triggered an alarm.

Since a problem in this application directly affects the user experience, we immediately asked O&M to take thread and heap dumps to preserve the scene and then restart the application. Fortunately, the application recovered after the restart, so we started troubleshooting.

Analysis

First, let’s take a look at what this app does.

In short, it pulls data from MQ and hands it off to a downstream business thread pool for the actual business processing.

And the queue that triggered the alarm is exactly the queue of this thread pool.

Tracing the code, we find the thread pool is constructed as follows:

ThreadPoolExecutor executor = new ThreadPoolExecutor(coreSize, maxSize,
        0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
put(poolName, executor);

A LinkedBlockingQueue is used without specifying a capacity (another pitfall), so the queue's default capacity is Integer.MAX_VALUE.
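For comparison, here is a minimal sketch of a bounded construction (the queue capacity and rejection policy below are illustrative choices, not what this application actually used):

ThreadPoolExecutor executor = new ThreadPoolExecutor(coreSize, maxSize,
        0L, TimeUnit.MILLISECONDS,
        // Bounded queue: at most 1000 pending tasks instead of Integer.MAX_VALUE
        new LinkedBlockingQueue<Runnable>(1000),
        // Explicit rejection policy: when the queue is full, the submitting thread runs the task itself,
        // which throttles the MQ consumer instead of letting tasks pile up silently
        new ThreadPoolExecutor.CallerRunsPolicy());

Passing in a custom ThreadFactory here would also give the worker threads a meaningful name, which comes up again later.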

Because the application had already been restarted, only the thread snapshot and memory snapshot were available for analysis.

Memory analysis

First, we used MAT to analyze the heap dump and got the following report.

There are two large objects: one is the LinkedBlockingQueue in which the thread pool holds its pending tasks, and the other is a HashSet.

That the queue takes up a lot of memory is expected; we will come back to the HashSet in a moment.

Because the queue is so large, we can conclude, at least for now, that task processing in the thread pool is slow and tasks are piling up in the queue.

Thread analysis

Next, the thread dump, analyzed with fastthread.io.

Since the tasks in the thread pool seem to be processed slowly, we focus on what those threads are doing.

They are all RUNNABLE, and their stacks look like this:

The stack shows they are checking whether a key exists, which line 312 of the business code confirms.

The thread pool's name here is another pitfall; it took me quite a while to find the corresponding code.

Locating the problem

After analyzing the memory and thread stack, you can already guess what the problem is.

There’s actually a premise I forgot to mention:

The alarm email was sent at 3 a.m., but there was no phone call or any other notification, so nobody noticed it.

I only discovered it when I got to work in the morning and immediately took the dumps.

So there is one important fact: these business threads had been querying the HashSet for six or seven hours without returning.

It can also be seen from the previous monitoring graph:

The operating system stayed under high load until we saw the alarm and restarted the application this morning.

Meanwhile, we found that the application runs JDK 1.7 in production, so my initial guess was the classic HashMap circular linked list: concurrent resizing forms a cycle, driving CPU load up and sending the key lookup into an infinite loop.

The code was reviewed again to verify this problem.

The simplified pseudo-code is as follows:

// Thread pool
private ExecutorService executor;

private Set<String> set = new HashSet<>();

private void execute() {
    while (true) {
        // Pull data from MQ
        String key = subMQ();
        executor.execute(new Worker(key));
    }
}

public class Worker extends Thread {
    private String key;

    public Worker(String key) {
        this.key = key;
    }

    @Override
    public void run() {
        if (!set.contains(key)) {
            // Query the database
            if (queryDB(key)) {
                set.add(key);
                return;
            }
        }
        // Clear the set when a certain condition is reached
        if (flag) {
            set = null;
        }
    }
}

The general process is as follows:

  • An endless stream of data comes in from MQ.
  • Each piece of data is thrown into the business thread pool.
  • The worker checks whether the data is already in the Set.
  • If not, it queries the database.
  • It then writes the key into the Set.

The obvious problem here is that the Set, a shared resource, is accessed without any synchronization.

Multiple threads operate on it concurrently, and since a HashSet is essentially backed by a HashMap, it is not thread-safe, so two problems arise:

  • Data written concurrently to the Set can be overwritten, making the data inaccurate.
  • The underlying HashMap can form a circular linked list during resizing.

The first problem is tolerable compared with the second.

We already know from the memory analysis above that this Set holds a lot of data. And because no initial capacity was specified, it starts at the default size, so heavy concurrent writes cause frequent resizing, and under JDK 1.7 a concurrent resize can produce a circular linked list.
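To recall why a resize can do that: JDK 1.7's HashMap moves entries to the new table with head insertion in its transfer() method, roughly as follows (paraphrased from memory, not copied verbatim from the JDK source):

void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {
        while (null != e) {
            Entry<K,V> next = e.next;      // a thread resizing concurrently may already have relinked 'next'
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i];          // head insertion reverses the order of each bucket
            newTable[i] = e;
            e = next;
        }
    }
}

When two threads resize the same map at once, one of them can see a bucket whose order was already reversed by the other, and the repeated head insertions end up making two entries point at each other, forming exactly this kind of circular linked list.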

Unfortunately, this code also queries the set (contains()). Look at the stack above again:

It is spinning at line 465 of HashMap. Let's see what is going on there in 1.7:
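For readers without the 1.7 source at hand, the lookup path (HashSet.contains() ends up in HashMap.getEntry()) walks the bucket's linked list roughly like this (paraphrased; the exact line number varies between 1.7 builds):

final Entry<K,V> getEntry(Object key) {
    int hash = (key == null) ? 0 : hash(key);
    // Walk the linked list of the bucket this key hashes to
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}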

It is already obvious: the lookup traverses the bucket's linked list, and because the list is circular, e.next is never null, so the loop never exits.
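This failure mode is not hard to reproduce. Below is a minimal, self-contained sketch (the class name, thread count, and sizes are illustrative, and the outcome is not deterministic): run it on JDK 1.7 or earlier and the final lookup may spin forever.

import java.util.HashMap;
import java.util.Map;

public class HashMapDeadLoopDemo {

    public static void main(String[] args) throws InterruptedException {
        // Tiny initial capacity to force frequent, overlapping resizes
        final Map<String, String> map = new HashMap<String, String>(2);

        for (int t = 0; t < 8; t++) {
            final int id = t;
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 100000; i++) {
                        map.put(id + "-" + i, "v");   // unsynchronized concurrent writes
                    }
                }
            }).start();
        }

        Thread.sleep(3000);
        // If a cycle has formed, this get() may never return on JDK 1.7
        System.out.println(map.get("no-such-key"));
    }
}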

That explains the infinite loop, but there is still the question of why tasks piled up in the thread pool's queue. My first instinct was that the tasks were simply too slow.

A closer look at the code revealed only one place that could be slow: the database query.

Running this SQL in the production environment confirmed that it was indeed slow, even though the execution plan showed every index being hit.

But when I looked at the table, I found it held nearly 70 million rows of data. We also learned from O&M that the MySQL server was under fairly heavy I/O pressure.

So the reason is obvious:

Every message consumed requires a database query; MySQL itself is under heavy pressure and the table is huge, so the I/O response is slow and every task ends up taking a long time.

But there is another reason that should not be overlooked: once all the business threads fell into the infinite loop, no task could ever complete, while data kept flowing in, so the queue simply kept growing!

This is actually an old application, so some people might ask why it didn’t happen before.

That is because the data volume used to be small: even with concurrent writes, there was never a concurrent resize that formed a circular linked list. The recent surge in traffic set off this hidden landmine. So we really do have to take Murphy's law at its word.

Conclusion

The investigation is now over, and our follow-up measures are as follows:

  • Replace the HashSet, which is not thread-safe, with a ConcurrentHashMap; hard-coding the value gives the same Set semantics (see the sketch after this list).
  • Based on later monitoring, initialize the ConcurrentHashMap with as large a capacity as reasonable to avoid frequent resizing.
  • Archive the MySQL data that is no longer used into cold storage, keeping the single table as small as possible; consider splitting the table later.
  • Move the existence check from the database to a cache to improve query efficiency.
  • Give thread pools meaningful names; otherwise you only make trouble for yourself.
  • Set the thread pool's queue size to a concrete value based on monitoring, and configure a rejection policy.
  • Upgrade to JDK 1.8.
  • In addition to the alarm email, add phone notification 😂.
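A minimal sketch of the first two items (the expected capacity is an illustrative figure; the real value should come from the monitoring data):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sized up front so the map rarely, if ever, needs to resize (figure is illustrative)
private static final int EXPECTED_KEYS = 1_000_000;

// A ConcurrentHashMap with a hard-coded value plays the role of the old HashSet
private final Map<String, Boolean> seen = new ConcurrentHashMap<String, Boolean>(EXPECTED_KEYS);

// Inside Worker.run(), replacing set.contains(key) / set.add(key):
if (!seen.containsKey(key)) {
    if (queryDB(key)) {
        seen.put(key, Boolean.TRUE);   // only the key matters; the value is a placeholder
    }
}

If a real Set view is preferred, Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>(EXPECTED_KEYS)) wraps the same backing map behind the Set interface.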

The HashMap infinite-loop bug has been written about all over the web, and now I have actually run into it myself. The conditions are rarely met these days; most people, for instance, are no longer running a JDK below 1.8. It is just another instance of Murphy's law.

I have also collected the related articles here for easy reading and reference:

crossoverjie.top/JCSprout/

Your likes and shares are the biggest support for me