A lesson paid for in blood: sharing TransmittableThreadLocal with Hystrix

To stop Feign from retrying too many times on timeout, I turned on the feign.hystrix.enabled switch. A string of related problems followed, and I came close to losing my job. I stayed up all night working from the symptoms, one at a time, until I finally solved them and kept my job.

Symptom 1: The service invoked by Feign does not get the value added to the Header by the FeignInterceptor

The FeignInterceptor was already configured to put all the information that needs to be passed along into the header. It worked fine before Hystrix was enabled and failed once it was enabled, so Hystrix had to be the cause. I searched for Hystrix and Feign and found plenty of answers.

Each FeignClient has its own thread pool with 20 core threads. When a HystrixCommand performs a Feign call, the call runs on a separate thread taken from that (reusable) pool. The FeignInterceptor reads values from the RequestContextHolder context and puts them in the FeignClient request headers. However, the Hystrix execution thread does not receive the parent thread's RequestContextHolder contents, so the code in the FeignInterceptor does not work as intended.
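The effect described above can be reproduced with nothing but the JDK. The following is a minimal sketch, not the Hystrix or Spring code itself: REQUEST_CONTEXT is a hypothetical stand-in for the per-thread storage behind RequestContextHolder, and the single-thread pool stands in for the Hystrix pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalVisibilityDemo {
    // Hypothetical stand-in for the per-request context (not a Spring class).
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    /** Sets a value on the calling thread, then reads the same ThreadLocal from a pool thread. */
    static String readFromPoolThread(String value) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            REQUEST_CONTEXT.set(value);
            // The task runs on the pool thread, which has its own (empty) slot.
            Callable<String> readContext = REQUEST_CONTEXT::get;
            return pool.submit(readContext).get();
        } finally {
            REQUEST_CONTEXT.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // The pool thread does not see the caller's value.
        System.out.println("pool thread sees: " + readFromPoolThread("trace-123"));
    }
}
```

The pool thread reads null because a plain ThreadLocal is scoped to the thread that set it, which is the same reason the interceptor found nothing to copy once it ran on a Hystrix thread.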

The fix is simple: use HystrixConcurrencyStrategy, the concurrency strategy extension point that Hystrix provides, to customize how tasks are executed. Read the RequestContextHolder information before the task is handed to the Hystrix thread, then set it on that thread's RequestContextHolder at execution time. The code is as follows:

    import java.util.concurrent.Callable;

    import com.netflix.hystrix.strategy.HystrixPlugins;
    import com.netflix.hystrix.strategy.concurrency.HystrixConcurrencyStrategy;
    import com.netflix.hystrix.strategy.eventnotifier.HystrixEventNotifier;
    import com.netflix.hystrix.strategy.executionhook.HystrixCommandExecutionHook;
    import com.netflix.hystrix.strategy.metrics.HystrixMetricsPublisher;
    import com.netflix.hystrix.strategy.properties.HystrixPropertiesStrategy;
    import lombok.extern.slf4j.Slf4j;
    import org.springframework.stereotype.Component;
    import org.springframework.web.context.request.RequestAttributes;
    import org.springframework.web.context.request.RequestContextHolder;

    /**
     * Feign Hystrix concurrency strategy
     *
     * @author luodongseu
     */
    @Component
    @Slf4j
    public class FeignHystrixConcurrencyStrategy extends HystrixConcurrencyStrategy {

        public FeignHystrixConcurrencyStrategy() {
            try {
                HystrixConcurrencyStrategy delegate = HystrixPlugins.getInstance().getConcurrencyStrategy();
                if (delegate instanceof FeignHystrixConcurrencyStrategy) {
                    // Already registered, nothing to do.
                    return;
                }
                // Keep the other plugins so they can be re-registered after the reset.
                HystrixCommandExecutionHook commandExecutionHook = HystrixPlugins.getInstance().getCommandExecutionHook();
                HystrixEventNotifier eventNotifier = HystrixPlugins.getInstance().getEventNotifier();
                HystrixMetricsPublisher metricsPublisher = HystrixPlugins.getInstance().getMetricsPublisher();
                HystrixPropertiesStrategy propertiesStrategy = HystrixPlugins.getInstance().getPropertiesStrategy();
                if (log.isDebugEnabled()) {
                    log.debug("Current Hystrix plugins configuration is [concurrencyStrategy [{}], eventNotifier [{}], metricPublisher [{}], propertiesStrategy [{}]]",
                            delegate, eventNotifier, metricsPublisher, propertiesStrategy);
                    log.debug("Registering FeignWithHystrix Concurrency Strategy.");
                }
                HystrixPlugins.reset();
                HystrixPlugins instance = HystrixPlugins.getInstance();
                instance.registerConcurrencyStrategy(this);
                instance.registerCommandExecutionHook(commandExecutionHook);
                instance.registerEventNotifier(eventNotifier);
                instance.registerMetricsPublisher(metricsPublisher);
                instance.registerPropertiesStrategy(propertiesStrategy);
            } catch (Exception e) {
                log.error("Failed to register FeignWithHystrix Concurrency Strategy", e);
            }
        }

        @Override
        public <T> Callable<T> wrapCallable(Callable<T> callable) {
            // This part runs on the request (parent) thread: capture its context here.
            RequestAttributes requestAttributes = RequestContextHolder.getRequestAttributes();
            // AppContext is the project's own class; traceId is not used further in this snippet.
            String traceId = AppContext.getTraceId();
            return () -> {
                try {
                    // Add the request context information in the child thread.
                    RequestContextHolder.setRequestAttributes(requestAttributes);
                    return callable.call();
                } finally {
                    RequestContextHolder.resetRequestAttributes();
                }
            };
        }
    }

You think this is the end?

Symptom 2: Thread reuse and passing information between threads cause a fatal transaction problem

Passing information between threads brings in a small toolkit called TransmittableThreadLocal, which we had already introduced for log trace linking. The trace ID and the identity of the calling user were therefore both stored in a TransmittableThreadLocal. I thought I could go back to listening to music and happily tapping out code. However, less than a day after the release, customer A complained that their account would not reconcile. After checking the logs, I found that customer A's money had been transferred to customer B's account. I felt sick…

Suddenly, I felt the fear of losing my job 😱.

I skipped my meal and stared at the code looking for the problem, digging through hundreds of thousands of log lines for a few key entries. After several hours of comparing and searching, I found that, with some probability, Feign returned B's information when it was supposed to retrieve A's while querying customer information. I ran a load test immediately. The pattern was obvious: the first few requests returned correct data, after which the data was intermittently wrong, and most requests got no data at all. For a while I suspected I had seen a ghost and had no idea what was wrong.

I then narrowed it down: the customer information that the FeignInterceptor copied out of the TransmittableThreadLocal was already wrong by the time it was read. I found no problem in the code that used TransmittableThreadLocal. So I wondered whether RequestContextHolder would behave any better than TransmittableThreadLocal, and tested passing the same information through RequestContextHolder. Sure enough, the value in the RequestContextHolder was passed correctly. With that, the answer was almost clear.

Presumably, because Hystrix reuses threads, a pooled thread was using data left over from another thread. In code terms, wrapCallable in the HystrixConcurrencyStrategy did not update the customer information in the TransmittableThreadLocal. Sure enough, after a simple change the error did not occur again. The code is as follows:

    @Override
    public <T> Callable<T> wrapCallable(Callable<T> callable) {
        RequestAttributes requestAttributes = RequestContextHolder.getRequestAttributes();
        String traceId = AppContext.getTraceId();
        return () -> {
            try {
                // Add the request context information in the child thread.
                // "B" stands for the user information of the current request.
                TransmittableThreadLocalContext.set("userInfo", "B");
                RequestContextHolder.setRequestAttributes(requestAttributes);
                return callable.call();
            } finally {
                RequestContextHolder.resetRequestAttributes();
            }
        };
    }
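The capture-then-restore idea behind this fix can be sketched with plain JDK classes. This is an illustration under assumptions, not the project's code: USER_INFO is a hypothetical holder standing in for the TransmittableThreadLocal context, wrap plays the role of wrapCallable, and the value is captured from the submitting thread rather than written as a literal.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextCapturingCallableDemo {
    // Hypothetical holder for the current user (stand-in for the article's context class).
    static final ThreadLocal<String> USER_INFO = new ThreadLocal<>();

    /** Captures the caller's value at wrap time and installs it on the worker for one call. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = USER_INFO.get();     // runs on the submitting thread
        return () -> {
            String previous = USER_INFO.get(); // runs on the worker thread
            USER_INFO.set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker back the way it was so nothing leaks into the next task.
                if (previous == null) {
                    USER_INFO.remove();
                } else {
                    USER_INFO.set(previous);
                }
            }
        };
    }

    /** Simulates one request: set the user, then read it back through the pool. */
    static String callAs(ExecutorService pool, String user) throws Exception {
        USER_INFO.set(user);
        try {
            Callable<String> readUser = USER_INFO::get;
            return pool.submit(wrap(readUser)).get();
        } finally {
            USER_INFO.remove();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one worker, always reused
        try {
            System.out.println(callAs(pool, "A")); // prints "A"
            System.out.println(callAs(pool, "B")); // prints "B" even though the worker is reused
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the value is re-captured on every submission and cleared afterwards, a reused worker never sees what an earlier request left behind.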

At this point, the answer is clear.

The Hystrix thread pool is a thin wrapper around the JDK's ThreadPoolExecutor. TransmittableThreadLocal, however, needs tasks to be wrapped in TtlRunnable to pass values safely to pooled threads. Without that wrapping, a value is copied only when a thread is created.

During the first few Feign calls the ThreadPoolExecutor has no threads yet, so the request thread RequestThread_1 causes a new thread, HystrixThread_1, to be created. The TransmittableThreadLocal value ("A") in RequestThread_1 is copied to HystrixThread_1 and stays cached there for later reads. When the next request arrives with the TransmittableThreadLocal value "B", there are two cases. If Hystrix creates a new thread, HystrixThread_X, the value "B" is copied and there is no wrong or null data. If Hystrix reuses HystrixThread_1, that thread never reads "B": if RequestThread_1 has already finished, the TransmittableThreadLocal value no longer exists, and if RequestThread_1 is still running, the value is still "A".
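The stale-value case can be reproduced with the JDK alone. The sketch below uses InheritableThreadLocal, which TransmittableThreadLocal extends, to approximate how a TransmittableThreadLocal behaves when tasks are not wrapped; USER_INFO and the single-thread pool are illustrative assumptions, and the main thread plays both request threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class StaleInheritedValueDemo {
    // Hypothetical holder; the value is copied to a child thread only when that thread is created.
    static final InheritableThreadLocal<String> USER_INFO = new InheritableThreadLocal<>();

    /** Returns what the single pooled worker sees for two consecutive "requests". */
    static List<String> observe() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Callable<String> readUser = USER_INFO::get;
        try {
            USER_INFO.set("A");
            String first = pool.submit(readUser).get();  // worker is created here and inherits "A"
            USER_INFO.set("B");
            String second = pool.submit(readUser).get(); // worker is reused and still holds "A"
            return Arrays.asList(first, second);
        } finally {
            USER_INFO.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(observe()); // prints [A, A]: the second request reads the wrong user
    }
}
```

The second read returns "A" because the worker was created during the first submission and nothing refreshes its copy afterwards. Re-capturing the value on every submission, which is what wrapping tasks with TtlRunnable is for, avoids this.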

A lesson truly paid for in blood. Enough said; I'm off to sell a kidney to compensate the customer 💰.