The symptom
A data migration job stopped responding for a long time during local testing.
Analyzing the thread states with jstack showed a large number of business threads in the WAITING state.
A look at the corresponding code quickly located the problem. The rest of this article analyzes its main cause.
Had this happened in production on a high-traffic consumer-facing (ToC) interface, it would probably have caused a large number of response timeouts, with serious consequences (you might really get fired).
Thread Pool Review
Purpose of a thread pool: to manage thread resources and task execution in a unified way so that threads can be reused.
How a thread pool processes a submitted task:
- If the current number of threads is below the core pool size, a new thread is created to execute the task
- If the core pool size has been reached, the pool tries to put the task into the task queue, where it waits for an idle thread
- If the task queue is also full, the pool tries to create a non-core thread to execute the task
- If the number of threads has reached the maximum pool size, the rejection policy is triggered
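The four steps above can be observed directly with `java.util.concurrent.ThreadPoolExecutor`. The following is a minimal sketch, not code from the migration job: the class name `PoolFlowDemo` and the pool sizes (1 core thread, 2 threads at most, queue capacity 1) are assumptions chosen only to make each step visible.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolFlowDemo {

    /** Submits four blocking tasks and records the pool state after each submission. */
    static String run() {
        // Hypothetical sizes: 1 core thread, 2 threads maximum, queue capacity 1.
        // The default rejection policy (AbortPolicy) throws RejectedExecutionException.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await(); // keep the worker busy until the demo releases it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        StringBuilder log = new StringBuilder();
        for (int i = 1; i <= 4; i++) {
            try {
                pool.execute(blocker);
                log.append("task").append(i)
                   .append(": threads=").append(pool.getPoolSize())
                   .append(" queued=").append(pool.getQueue().size())
                   .append('\n');
            } catch (RejectedExecutionException e) {
                log.append("task").append(i).append(": rejected\n");
            }
        }

        // Let the blocked tasks finish and stop the pool.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

With these sizes the first task gets the core thread, the second waits in the queue, the third gets a non-core thread, and the fourth is rejected:

```
task1: threads=1 queued=0
task2: threads=1 queued=1
task3: threads=2 queued=1
task4: rejected
```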
Problem analysis
Core cause: the main tasks and the sub-tasks were executed in the same thread pool, which deadlocked the pool. The problem is caused by improper use of the thread pool.
This looks like a basic mistake, but it is very easy to make if you are not careful. We have found code written this way during code reviews in our department.
The business scenario
Migrating merchant store data involves a large amount of data, so the work is split as follows:
- All stores are divided into batches of 100 stores each, and processing one batch is defined as a main task
- Within a batch, each store is handled as one sub-task
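The batching step could look roughly like the sketch below. It is only an illustration of the split described above: `StoreBatcher`, `partition`, and the use of `Long` store IDs are hypothetical names and types, not taken from the migration job.

```java
import java.util.ArrayList;
import java.util.List;

class StoreBatcher {

    // Batch size from the scenario above: 100 stores per main task.
    static final int BATCH_SIZE = 100;

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> storeIds = new ArrayList<>();
        for (long id = 1; id <= 250; id++) {
            storeIds.add(id);
        }
        // Each batch would become one main task; each ID in it one sub-task.
        for (List<Long> batch : partition(storeIds, BATCH_SIZE)) {
            System.out.println("batch of " + batch.size() + " stores");
        }
    }
}
```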
The deadlock problem
There are four necessary conditions for a deadlock to occur:
- Mutual exclusion: a resource can be used by only one process at a time
- Hold and wait: a process that is blocked while requesting a resource keeps the resources it has already acquired
- No preemption: a resource acquired by a process cannot be forcibly taken away before the process has finished with it
- Circular wait: several processes form a cycle in which each waits for a resource held by the next
The following analyzes how the threads deadlocked. The premise is that the main tasks fill up all the core threads at once, so the sub-tasks have no available thread and can only wait in the task queue.
- A thread can execute (be used by) only one task at a time, which satisfies mutual exclusion
- When a thread blocks while executing a task, the thread is not released, which satisfies hold and wait
- A thread cannot be taken over by another task before its current task completes, which satisfies no preemption
- The sub-tasks cannot run because no thread is available, while the main tasks are blocked waiting for their sub-tasks and so cannot release the threads they hold, which satisfies circular wait
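The situation analyzed above can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the migration job's code: `SamePoolDeadlockDemo` and `stalls` are hypothetical names, the sub-task is empty, and a latch is used to force the premise that the main tasks occupy every thread before any sub-task is submitted (so `mainTasks` must not exceed `poolSize`). A timeout is used to detect the stall so that the demo itself terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SamePoolDeadlockDemo {

    /**
     * Runs mainTasks main tasks on a fixed pool of poolSize threads; each main
     * task submits one sub-task to the SAME pool and blocks on its result.
     * Returns true if the main tasks have not all finished within the timeout.
     */
    static boolean stalls(int poolSize, int mainTasks, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(mainTasks);
        CountDownLatch done = new CountDownLatch(mainTasks);
        try {
            for (int i = 0; i < mainTasks; i++) {
                pool.submit(() -> {
                    try {
                        // Premise: every main task holds a thread before
                        // any sub-task is submitted.
                        allStarted.countDown();
                        allStarted.await();
                        // The sub-task goes to the same pool. If all threads
                        // are held by main tasks, it can only wait in the queue.
                        Future<?> sub = pool.submit(() -> { });
                        sub.get(); // hold the thread while waiting for the sub-task
                        done.countDown();
                    } catch (Exception e) {
                        // Interrupted by shutdownNow() when the demo gives up.
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return !done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt any threads still blocked in get()
        }
    }

    public static void main(String[] args) {
        System.out.println("2 threads, 2 main tasks stalls: " + stalls(2, 2, 500));
        System.out.println("4 threads, 2 main tasks stalls: " + stalls(4, 2, 5000));
    }
}
```

With two threads and two main tasks, both threads block in `get()` while both sub-tasks sit in the queue, so the call reports a stall. With four threads, two remain free to run the sub-tasks and the main tasks complete. In a thread dump, the stalled threads would typically appear as WAITING inside `Future.get`.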
The sample code
The core code is as follows. The outer task:
The code in task A that caused the problem:
The solution
- Use a SynchronousQueue as the pool's work queue instead of a queue that buffers tasks (this may cause tasks to be executed sequentially rather than concurrently)
- Use different thread pools for parent and child tasks (the solution adopted here)
- Keep the number of concurrent parent tasks below the number of core threads
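The second option, separate pools for parent and child tasks, might be sketched as follows. This is an illustration only: `SeparatePoolsDemo`, `migrate`, the pool sizes, and the counter that stands in for the real per-store migration are all assumptions, not the actual fix that was deployed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SeparatePoolsDemo {

    /** Runs one main task per batch and one sub-task per store; returns stores processed. */
    static int migrate(int batches, int storesPerBatch) {
        // Hypothetical sizes. Main tasks only ever block threads of mainPool,
        // so subPool always has threads free to make progress.
        ExecutorService mainPool = Executors.newFixedThreadPool(2);
        ExecutorService subPool = Executors.newFixedThreadPool(4);
        AtomicInteger migrated = new AtomicInteger();
        try {
            List<Future<?>> mainFutures = new ArrayList<>();
            for (int b = 0; b < batches; b++) {
                mainFutures.add(mainPool.submit(() -> {
                    List<Future<?>> subFutures = new ArrayList<>();
                    for (int s = 0; s < storesPerBatch; s++) {
                        // Placeholder for migrating a single store.
                        subFutures.add(subPool.submit(() -> { migrated.incrementAndGet(); }));
                    }
                    for (Future<?> f : subFutures) {
                        try {
                            f.get(); // safe: the sub-task runs on a different pool
                        } catch (Exception e) {
                            throw new IllegalStateException(e);
                        }
                    }
                }));
            }
            for (Future<?> f : mainFutures) {
                f.get(30, TimeUnit.SECONDS);
            }
            return migrated.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            mainPool.shutdownNow();
            subPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("stores migrated: " + migrate(5, 100));
    }
}
```

Because the sub-tasks never wait on anything in `mainPool`, the circular wait condition can no longer be satisfied, regardless of how many main tasks are queued.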
This incident is a reminder to think about what each line of code means while writing it, rather than simply copying and pasting (Ctrl+C, Ctrl+V), and to take responsibility for every line.