As usual let’s take a look at the accident first

【 Accident Description 】

From 6:32 a.m., a small number of users would visit the APP, and the home page service would be unavailable at 7:20 p.m., and the problem would be solved at 7:36 p.m.

[Overall experience]

  • 6:58 found the alarm, and found that the feedback home page of the group appeared busy network, considering that the store list service was released online a few nights ago, so consider rolling back the code to deal with the problem in an emergency.
  • From 7:07, contact XXX successively to check and solve the problem.
  • 7:36 The code is rolled back and the service is restored.

[Root cause of accident – Accident Code Simulation]

public static void test() throws InterruptedException, ExecutionException { Executor executor = Executors.newFixedThreadPool(3); CompletionService<String> service = new ExecutorCompletionService<>(executor); service.submit(new Callable<String>() { @Override public String call() throws Exception { return "HelloWorld--" + Thread.currentThread().getName(); }}); }Copy the code

Throw out the question, and we’ll talk about it later. The root of the problem lies in the ExecutorCompletionService results didn’t call take, polling method. The correct way to write it is as follows:

public static void test() throws InterruptedException, ExecutionException { Executor executor = Executors.newFixedThreadPool(3); CompletionService<String> service = new ExecutorCompletionService<>(executor); service.submit(new Callable<String>() { @Override public String call() throws Exception { return "HelloWorld--" + Thread.currentThread().getName(); }}); service.take().get(); }Copy the code

A line of code caused by the murder, and not easy to be found, because OOM is a process of slow memory growth, a little careless will be ignored, if this code block is less, it is likely to be a few days or even months after the explosion.

The operator rollback or restart the server is indeed the fastest way, but if it is not a quick analysis of oom code, and unfortunately the version of the rollback is also with OOM code, it is sad, as just said, the traffic is small, rollback or restart can release memory; However, in the case of heavy traffic, only the rollback to the normal version is displayed, GG is displayed.

Now let’s get to the root of the problem

In order to better understand the “routine” we use the ExecutorService ExecutorCompletionService as a contrast, can make us know better, what use ExecutorCompletionService scenarios.

Don’t read the following code while eating, it tastes a little heavy! But it’s easy to understand.

public static void test1() throws Exception{ ExecutorService executorService = Executors.newCachedThreadPool(); ArrayList<Future<String>> futureArrayList = new ArrayList<>(); System.out.println(" The company wants you to tell everyone you're going to dinner "); Future<String> future10 = executorService.submit(() -> {system.out.println (" executorService.executorService.submit "); TimeUnit.SECONDS.sleep(10); System.out.println(" PRESIDENT: I took a dump in an hour. You get it "); Return "The CEO took a dump "; }); futureArrayList.add(future10); Future<String> future3 = executorService.submit(() -> {system.out.println (" executorService.executorService.submit () -> {system.out.println (" "); TimeUnit.SECONDS.sleep(3); System.out.println(" R & D: I took a dump in 3 minutes. You get it "); Return "R&d has gone big "; }); futureArrayList.add(future3); Future<String> future6 = executorService.submit(() -> {system.out.println (" middle management: I am at home, I can do 10 minutes, you can come to fetch me later "); TimeUnit.SECONDS.sleep(6); Println (" Middle Management: I took a dump in 10 minutes. You get it "); Return "Middle management has taken a dump." }); futureArrayList.add(future6); TimeUnit.SECONDS.sleep(1); System.out.println(" All notifications are finished, wait to receive." ); try { for (Future<String> future : futureArrayList) { String returnStr = future.get(); System.out.println(returnStr + ", you pick him up "); } Thread.currentThread().join(); } catch (Exception e) { e.printStackTrace(); }}Copy the code

Three tasks are executed in 10s, 3s, and 6s respectively. Submit the three Callable tasks via SUBMIT in the JDK thread pool.

  • Step1 the main thread submits the three tasks to the thread pool, stores the corresponding Future in the List, and executes “all notifications are finished, waiting to be answered.” This line of output statement.
  • Step2 execute future.get() inside the loop and block the wait.

The final result is as follows:

After receiving the president, I went to pick up r&d and middle management, although they had already finished their work and had to wait for the president to finish

The -10s asynchronous task that takes the longest is executed in the list first. Therefore, when obtaining the result of the 10s asynchronous task during the loop, the GET operation will be blocked until the asynchronous task is executed. Even if the 3s and 5s tasks have finished long ago, they must wait for the 10s tasks to finish.

This may resonate with those who do gateway business. Generally speaking, gateway RPC will call more than N downstream interfaces, as shown in the following figure

If you follow the ExecutorService model, and the first few tasks call interfaces that take a long time and block waiting, that’s a bad idea. So the ExecutorCompletionService occasion. As a reasonable controller of the task thread, it deserves the title of “task planner”.

The same scenario ExecutorCompletionService code

public static void test2() throws Exception { ExecutorService executorService = Executors.newCachedThreadPool(); ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executorService); System.out.println(" The company wants you to tell everyone you're going to dinner "); Completionservice.submit (() -> {system.out.println (" president: I am at home, I have a slow stomach, I have to sit for 1 hours, you can pick me up later "); TimeUnit.SECONDS.sleep(10); System.out.println(" PRESIDENT: I took a dump in an hour. You get it "); Return "The CEO took a dump "; }); Completionservice.submit (() -> {system.out.println (" r: I am at home, I am in 3 minutes, you can come to pick me up later "); TimeUnit.SECONDS.sleep(3); System.out.println(" R & D: I took a dump in 3 minutes. You get it "); Return "R&d has gone big "; }); Completionservice.submit (() -> {system.out.println (" Middle management: I have to sit for 10 minutes at home, you can pick me up later "); TimeUnit.SECONDS.sleep(6); Println (" Middle Management: I took a dump in 10 minutes. You get it "); Return "Middle management has taken a dump." }); TimeUnit.SECONDS.sleep(1); System.out.println(" All notifications are finished, wait to receive." ); For (int I = 0; i < 3; i++) { String returnStr = completionService.take().get(); System.out.println(returnStr + ", you pick him up "); } Thread.currentThread().join(); }Copy the code

The results are as follows:

This time is relatively efficient, although the president of the first notice, but according to the speed of everyone on the number of the first pull to pick up who, do not have to wait for the number of the longest president.

Put it all together and compare the output:

Two pieces of code very small differences between ExecutorCompletionService used to obtain the results

completionService.take().get();
Copy the code

Why take() and then get() ???? Let’s look at the source code

The CompletionService interface and its implementation class

1, ExecutorCompletionService is CompletionService interface implementation class

2, then with ExecutorCompletionService constructor, you can see the need to preach a thread pool objects, use the default queue is LinkedBlockingQueue, but there is another method can specify queue type structure, the following two pictures, two construction methods.

Constructor of the default LinkedBlockingQueue

Constructor of an optional queue type

3, Submit two methods of task submission, both have return value, our example is used in the first type of Callable method.

4, compare the ExecutorService and ExecutorCompletionService submit method can tell the difference

(1) the ExecutorService

(2) ExecutorCompletionService

QueueingFuture: QueueingFuture: QueueingFuture: QueueingFuture: QueueingFuture: QueueingFuture

  • QueueingFuture inherits from FutureTask and overrides the done() method for the position indicated in the red line.
  • Put tasks in the completionQueue, and when the task is finished, it will be put in the queue.
  • At this point in the completionQueue all tasks are tasks that have already been done(), and that task is the future result that we get.
  • If the task method of the completionQueue is called, the waiting task is blocked. When the future is definitely finished, we call the.get() method to get the result immediately.

See here believe big guy should understand how much point

  • After submitting tasks using ExecutorService Submit, we need to pay attention to the future returned by each task, but CompletionService keeps track of these futures and overrides the Done method. The completionQueue that’s waiting for you must have completed tasks in it.
  • As the gateway RPC layer, we don’t have to slow down all the requests because of one interface, so we can use CompletionService in business scenarios that handle the fastest response.

But attention, attention, attention is also at the heart of this accident

When only invoked the ExecutorCompletionService when any of the following three methods, the blocking queue task execution results will only be removed from the queue, release the heap memory, because this business do not need to use the return value of the task, not take to calls, polling method. As a result, heap memory is not freed, and the heap memory grows as the call usage increases.

So, if you don’t need to use the return value of a task in a business scenario, don’t use CompletionService. If you do use CompletionService, be sure to remove the result of the task from the blocking queue to avoid OOM!

Know the cause of the accident, we sum up the methodology, after all, Confucius he old people said: self-examination of my body, often think of their own, good repair its body!

Online before:

  • Strict code review habits, must be handed over to back people to see, after all, their own written code can not see the problem, I believe that every program ape has this confidence (this follow-up accident may be repeatedly mentioned! Is very important)
  • Live log – Note the last package version that can be rolled back (give yourself a fallback)
  • Before going online, check whether services can be degraded after the rollback. If not, strictly extend the monitoring period for the rollback

After the online:

  • Keep an eye on memory growth (this part is easily overlooked, as memory is not as important as CPU usage)
  • Keep an eye on CPU usage growth
  • Gc status, whether the number of threads is growing, whether there are frequent FULLGCs, etc
  • Pay attention to the service performance alarm, tp99, 999, Max whether there is a significant increase