Preface

This article walks through a problem we ran into during a load test some time ago. Let's get straight to the point.


For readers who don't recognize forceTransactionTemplate: there are three ways to start a transaction in a Java (Spring) application:

  • XML configuration: an aspect is configured to open transactions based on service and method names.
  • The @Transactional annotation (the most frequently used).
  • Spring's transaction template (as shown in the screenshot; hardly anyone uses it).

We won't go into why the third approach is used here; I'll cover that later when we talk about transaction propagation. For now, stay on topic: all you need to know is that this code starts a transaction. I deliberately circled the logging code in red and blue: one log line is printed when the method is entered, and another when the transaction has started. During a round of load testing, the interface timed out frequently and the throughput would not go up. The logs looked like this:


The gap between the two log lines was almost 5 seconds! Why would it take 5 seconds to open a transaction? When something behaves this abnormally, something fishy is going on.

How to approach the problem

We ran into this problem online under high concurrency. Because high-concurrency problems are generally hard to reproduce, I usually analyze them by reading the source code statically. For details, see the earlier article "Runs locally, crashes online? Panic!". However, since some readers have not yet mastered the skills for analyzing such problems, this article goes through a few common ways to approach them, so that you don't panic when you face one.


Fortunately, this concurrency problem is not too difficult, and the case is well suited to beginners. We can reproduce the scenario locally and narrow down the scope step by step until we locate the problem.

Reproducing it locally

We start by preparing a concurrency utility class that simulates a concurrent scenario in a local environment. Code is not pleasant to read on a phone, but that's fine: the code below is meant to be copied into your project to reproduce the problem, not to be read on a phone. As for why this utility class can simulate concurrency: it uses only JDK classes, and its core is CountDownLatch. You can look up how it works in your favorite search engine using that keyword.
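The idea behind a CountDownLatch "start gate" can be shown in isolation. The following is a minimal sketch, not part of the original project: the class name StartGateDemo and the extra ready latch are assumptions added for illustration. Workers block on await() until the main thread calls countDown(), so they are all released at the same moment.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class StartGateDemo {

    /** Returns {workers past the gate before it opened, workers past the gate at the end}. */
    static int[] run(int workers) {
        CountDownLatch ready = new CountDownLatch(workers); // every worker thread has started
        CountDownLatch start = new CountDownLatch(1);       // the gate
        CountDownLatch end = new CountDownLatch(workers);   // every worker has finished
        AtomicInteger passed = new AtomicInteger();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                ready.countDown();
                try {
                    start.await();            // block until the gate opens
                    passed.incrementAndGet(); // the concurrent work would go here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    end.countDown();
                }
            }).start();
        }

        try {
            ready.await();
            int before = passed.get(); // gate still closed, so nobody has passed it
            start.countDown();         // open the gate: all waiting workers proceed together
            end.await();               // wait for every worker to finish
            return new int[] {before, passed.get()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = run(10);
        System.out.println("before gate opened: " + result[0] + ", after: " + result[1]);
    }
}
```

With 10 workers this prints "before gate opened: 0, after: 10". The utility class below applies the same start and end latches, with a thread pool instead of raw threads.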

CountDownLatchUtil.java

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CountDownLatchUtil {

    // Start gate: released once so that all workers begin together
    private CountDownLatch start;
    // End gate: counts down as each worker finishes
    private CountDownLatch end;
    private int pollSize = 10;

    public CountDownLatchUtil() {
        this(10);
    }

    public CountDownLatchUtil(int pollSize) {
        this.pollSize = pollSize;
        start = new CountDownLatch(1);
        end = new CountDownLatch(pollSize);
    }

    public void latch(MyFunctionalInterface functionalInterface) throws InterruptedException {
        ExecutorService executorService = Executors.newFixedThreadPool(pollSize);
        for (int i = 0; i < pollSize; i++) {
            Runnable run = new Runnable() {
                @Override
                public void run() {
                    try {
                        // Wait until the start gate opens
                        start.await();
                        functionalInterface.run();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    } finally {
                        end.countDown();
                    }
                }
            };
            executorService.submit(run);
        }

        // Open the start gate, then wait for every worker to finish
        start.countDown();
        end.await();
        executorService.shutdown();
    }

    @FunctionalInterface
    public interface MyFunctionalInterface {
        void run();
    }
}

HelloService.java

public interface HelloService {

    void sayHello(long timeMillis);

}

HelloServiceImpl.java

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class HelloServiceImpl implements HelloService {

    private final Logger log = LoggerFactory.getLogger(HelloServiceImpl.class);

    @Transactional
    @Override
    public void sayHello(long timeMillis) {
        // Time elapsed between the start of the test and entering this method
        long time = System.currentTimeMillis() - timeMillis;
        if (time > 5000) {
            // Log calls that took more than 5 seconds to get here
            log.warn("time : {}", time);
        }
        try {
            // Simulate business logic that takes 1 second
            Thread.sleep(1000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

HelloServiceTest.java

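import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest
public class HelloServiceTest {

    @Autowired
    private HelloService helloService;

    @Test
    public void testSayHello() throws Exception {
        long currentTimeMillis = System.currentTimeMillis();
        // Simulate 1000 concurrent requests
        CountDownLatchUtil countDownLatchUtil = new CountDownLatchUtil(1000);
        countDownLatchUtil.latch(() -> {
            helloService.sayHello(currentTimeMillis);
        });
    }

}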

In the local debug log we found a large number of calls that took more than 5s, and there is a pattern to them. I have marked the groups with boxes of different colors.


Why do these times come in groups of five, with each group about 1s apart from the next?

The truth

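The core code behind @Transactional is shown below (I'll analyze the source code in detail in a follow-up series, so stay tuned). Put simply, the database connection is obtained before the target method runs: createTransactionIfNecessary starts the transaction, which with a typical DataSource-based transaction manager is where a connection is taken from the pool, and retVal = invocation.proceedWithInvocation() then runs the business method while that connection is held.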

if (txAttr == null || !(tm instanceof CallbackPreferringPlatformTransactionManager)) {
    // Standard transaction demarcation with getTransaction and commit/rollback calls.
    TransactionInfo txInfo = createTransactionIfNecessary(tm, txAttr, joinpointIdentification);
    Object retVal = null;
    try {
        // This is an around advice: Invoke the next interceptor in the chain.
        // This will normally result in a target object being invoked.
        retVal = invocation.proceedWithInvocation();
    }
    catch (Throwable ex) {
        // target invocation exception
        completeTransactionAfterThrowing(txInfo, ex);
        throw ex;
    }
    finally {
        cleanupTransactionInfo(txInfo);
    }
    commitTransactionAfterReturning(txInfo);
    return retVal;
}

To make the problem easier to demonstrate, the Druid database connection pool is configured as follows:

# Initial number of connections
spring.datasource.initialSize=1
# Maximum number of connections
spring.datasource.maxActive=5

Since the maximum number of connections is 5, when 1000 threads arrive at once you can picture a queue of 1000 people. The first 5 get a connection and spend 1 second on their business, while the other 995 wait outside the door. When those five finish, five connections are released and the next five come in for their own 1-second operation. With elementary-school math you can work out how long the last five have to wait. This also explains why the log output above comes in groups of five, with each group 1s apart.
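The queueing arithmetic above can be written down in a few lines. This is a simplified model rather than a measurement: it assumes every call holds its connection for exactly 1 second and that waiting threads are served in strict batches of five. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope model of threads queueing for a fixed-size connection pool. */
final class PoolQueueModel {

    /**
     * Time in milliseconds that the thread at 0-based queue position {@code index}
     * waits for a connection, assuming threads are served in batches of
     * {@code poolSize} and each batch holds its connections for {@code holdMillis}.
     */
    static long waitMillis(int index, int poolSize, long holdMillis) {
        return (long) (index / poolSize) * holdMillis;
    }

    public static void main(String[] args) {
        int threads = 1000;     // concurrent callers in the test
        int poolSize = 5;       // maxActive
        long holdMillis = 1000; // Thread.sleep(1000) in sayHello

        System.out.println("1st thread waits    " + waitMillis(0, poolSize, holdMillis) + " ms");
        System.out.println("26th thread waits   " + waitMillis(25, poolSize, holdMillis) + " ms");
        System.out.println("1000th thread waits " + waitMillis(threads - 1, poolSize, holdMillis) + " ms");
    }
}
```

In this model the first batch does not wait at all, the 26th thread waits 5000 ms, and the last five threads wait 199000 ms. Consecutive batches are exactly one hold time apart, which is the grouping seen in the log.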

How to solve it

Regular readers of my hands-on source-code series know that I never raise a problem without offering a corresponding solution. Of course, there is no best solution, only a better one!

Some of you may say: your maximum number of connections is set far too low (about as small as the number of likes I usually get), and if you set it a little higher the problem goes away. The maximum was set to 5 only to make the problem easy to demonstrate. In production, the number of connections should be derived from the characteristics of the business and from repeated load testing. I also understand that the machines at some readers' companies are specced lower than a budget mobile phone!

In fact, the maximum number of database connections was set to 200 during the load test, and the load was not heavy at the time. So why was there still a problem? Take a closer look at the code shown earlier.

The validation code makes an RPC call. The colleague who owns that interface is not as reliable as I am, so the call took a long time, and as a result subsequent threads waited a long time to obtain a database connection. Do the same elementary-school math and it is easy to see why this load-test problem arose.
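The same arithmetic gives a rough ceiling on throughput. The sketch below is only an illustration: the pool size of 200 comes from the load test, but the two hold times (20 ms and 2000 ms) are hypothetical values chosen for the example, and the class name is made up.

```java
/** Rough upper bound on throughput when every request holds one pooled connection. */
final class PoolThroughput {

    /** Requests per second the pool can sustain if each request holds a connection for holdMillis. */
    static double maxRequestsPerSecond(int poolSize, long holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        int poolSize = 200;   // maximum connections during the load test
        long fastHold = 20;   // hypothetical: the transaction only performs a quick write
        long slowHold = 2000; // hypothetical: the transaction also waits on a slow RPC

        System.out.println("quick transaction: " + maxRequestsPerSecond(poolSize, fastHold) + " req/s");
        System.out.println("slow RPC inside:   " + maxRequestsPerSecond(poolSize, slowHold) + " req/s");
    }
}
```

Under these assumed numbers the ceiling drops from 10000 to 100 requests per second. Once the load exceeds that ceiling, requests queue for a connection even though the pool looks generously sized.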

Key points

As I have said many times, when you run into a problem, think it through in depth. What further lessons can we draw from this one? Let's look at an interview experience a reader shared earlier.

The question he was asked in the interview was essentially the same as our load-test problem, but the interviewer's conclusion was not accurate enough. Let's see what Alibaba's development manual says.


So what counts as abuse? Even if a method is called frequently, if it only performs a single-table insert or update, its execution time is very short and high concurrency is not a problem. The key is whether every call inside the transaction belongs there, that is, whether it actually needs the guarantees a transaction provides. Some developers, especially at more traditional companies where the work is mostly CRUD, put the transaction annotation on a service method right away and then fill the transaction with time-consuming operations that have nothing to do with it, such as file I/O and validation queries. The business validation in this article, for example, has no need to run inside the transaction. Without real-world scenarios like this at work, and without following my public account, people know little about the principles, the source code, or how they play out in practice. When the interviewer probes the principles, the candidate struggles, and the interviewer has to change direction to keep digging.
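The effect of moving validation out of the transaction can be illustrated without a database. The sketch below is an assumption-laden stand-in, not the project's code: a Semaphore with 5 permits plays the role of the connection pool, and the 50 ms and 5 ms sleeps are arbitrary values standing in for slow validation and a quick write.

```java
import java.util.concurrent.Semaphore;

final class ConnectionHoldDemo {

    // Stand-in for a connection pool with 5 connections.
    private final Semaphore pool = new Semaphore(5);

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Slow work that needs no transaction (validation, RPC, file I/O). */
    private static void validate() { sleep(50); }

    /** Quick database write that does need the transaction. */
    private static void write() { sleep(5); }

    /** Anti-pattern: the connection is held while validating. Returns the hold time in nanoseconds. */
    long validateInsideTransaction() {
        pool.acquireUninterruptibly();
        long begin = System.nanoTime();
        try {
            validate();
            write();
            return System.nanoTime() - begin;
        } finally {
            pool.release();
        }
    }

    /** Validation runs first; the connection is held only for the write. Returns the hold time in nanoseconds. */
    long validateBeforeTransaction() {
        validate();
        pool.acquireUninterruptibly();
        long begin = System.nanoTime();
        try {
            write();
            return System.nanoTime() - begin;
        } finally {
            pool.release();
        }
    }

    public static void main(String[] args) {
        ConnectionHoldDemo demo = new ConnectionHoldDemo();
        System.out.println("held while validating (ns): " + demo.validateInsideTransaction());
        System.out.println("held for write only (ns):   " + demo.validateBeforeTransaction());
    }
}
```

Both methods do the same total work, but the second one occupies a pooled connection for a fraction of the time, so the same pool can serve many more callers. In a Spring service the equivalent change would be to run the validation before entering the transactional method or template callback.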

What further thinking does this experience prompt? Problems never run out, but by continuing to think about each one we can squeeze more value out of it. Let's go back to the Alibaba manual.


In plain English: keep the granularity of locks as small as possible, and avoid calling RPC methods while holding a lock. An RPC call depends on the network, so its duration is hard to control, and it can easily cause the lock to be held for too long.
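The lock advice can be sketched as follows. This is a minimal illustration, not code from the manual: StockCache and fetchRemoteStock are hypothetical names, and the remote call is replaced by a trivial local computation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

final class StockCache {

    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, Integer> stock = new HashMap<>();

    /** Hypothetical stand-in for an RPC; a real call would add unpredictable network latency. */
    private static int fetchRemoteStock(String sku) {
        return sku.length() * 10;
    }

    /** Anti-pattern: other threads wait on the lock for as long as the remote call takes. */
    int refreshInsideLock(String sku) {
        lock.lock();
        try {
            int value = fetchRemoteStock(sku);
            stock.put(sku, value);
            return value;
        } finally {
            lock.unlock();
        }
    }

    /** Narrower lock: the remote call runs first, the lock only guards the shared map. */
    int refreshOutsideLock(String sku) {
        int value = fetchRemoteStock(sku);
        lock.lock();
        try {
            stock.put(sku, value);
        } finally {
            lock.unlock();
        }
        return value;
    }

    /** Returns the cached value, or -1 if the key has not been refreshed yet. */
    int cached(String sku) {
        lock.lock();
        try {
            return stock.getOrDefault(sku, -1);
        } finally {
            lock.unlock();
        }
    }
}
```

The narrower version has a trade-off: two threads may fetch the same key at the same time and the last write wins. That is acceptable for a cache like this one, but it has to be checked case by case.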

This is really the same problem we hit in the load test. Calling RPC inside a local transaction gains nothing from the transaction (an RPC call needs a distributed transaction to be covered), yet the unpredictable duration of the RPC means the database connection is held for too long, and the interface times out. We can also use APM tools to map out where the time goes inside the interface and expose such problems before load testing.

In closing

For more themed source-code analysis series and hands-on walkthroughs of real-world scenarios, scan the QR code below to follow me, so that you are ready to build rockets in the interview and are not stuck tightening screws on the job!