Recently I started learning backend development. Although I used Java for Android development before, the technology stack is completely different. There are many necessary “new” concepts to learn, and until I fully understand them and the code written by others, I am likely to run into the proverbial “a cup of tea, a cigarette, and a bug that takes all day to fix”.

I stumbled on this bug by accident while fixing a leftover issue in the project. In a nutshell:

Service A receives a push of data from the outside world, inserts the data into the database, and then pushes a message to service B through MQ. Service B does some processing based on the received message, including a remote call to a method of service A that queries this data. In the test environment, however, that query never found the data.

After encountering problems, we first conducted some investigations:

  • Suspected that the parameters were wrong or that the insert had failed, so I printed the query parameters and ran the query by hand against the database: the data was there;

  • Suspected that the SQL actually executed was wrong, so I asked a colleague to help configure MyBatis to log the SQL, copied the logged SQL and ran it against the database: the data was there;

  • Connected locally to the test environment database and stepped through the code with breakpoints: the data came back normally;

After puzzling over it for a while, I kept investigating:

  • Suspected a problem with the database connection of the program in the test environment, so I tested some other features that read from the database: the data was fine;

  • Suspected a problem with the package deployed in the test environment, so I asked an ops colleague to copy the JAR out of the container and checked its configuration: no problem;

  • Suspected that the remote call was failing in the test environment, so I added logging around the remote call: no exception;

  • Checked Dubbo Admin: the number of nodes and their network segments were both normal;

  • Suspected that one of service A's nodes in the test environment was deployed incorrectly, so I asked the ops colleague to invoke the remote method manually on each node via Telnet: the data came back normally every time;

  • Right after a failed case, manually pushed the same message to service B again: the data could be retrieved;

Eventually I noticed in the logs that only 1 millisecond passed between service A's database insert and service B's remote call to service A's method. Could everything be happening so fast that the database could not yet find the data it had just written? Or had the insert not yet taken effect when the query ran?

With that in mind, I finally took a closer look at the code that inserts the data and sends the message, and saw this:


     

    @Override
    @Transactional(...)
    public boolean doSomething() {
        // ...
        // Insert data
        // Send a message
        // ...
    }   // the transaction is only committed after the method returns


Yes, inserting the data and sending the message are written inside the same transaction. I don't know much about databases, but I do know a little about how transactions behave: sending the message is not held back by the transaction, whereas the inserted data only becomes visible once the transaction is committed. In other words, when service B receives the message and calls back into service A to look up the data that was just inserted, whether it finds anything is down to luck, depending on whether the transaction has been committed by then.
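
To make the race concrete, here is a minimal plain-Java sketch; it is not the project's code. `ToyDatabase`, `SendInsideTransactionDemo`, and the key `order-1` are made-up names. The toy store imitates only the one property that matters here, that uncommitted writes are invisible to other readers, and the consumer is invoked synchronously to stand in for a very fast service B.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Consumer;

/** Toy stand-in for a database: writes stay invisible to other readers until commit(). */
class ToyDatabase {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();

    void insert(String key, String value) {
        pending.put(key, value);
    }

    void commit() {
        committed.putAll(pending);
        pending.clear();
    }

    /** What another connection (service B's remote query) would see. */
    Optional<String> query(String key) {
        return Optional.ofNullable(committed.get(key));
    }
}

class SendInsideTransactionDemo {
    /** Mirrors the buggy ordering: insert, send, then commit. Returns what the consumer saw. */
    static boolean consumerFindsData(ToyDatabase db) {
        boolean[] found = new boolean[1];
        // Hypothetical consumer (service B): queries as soon as the message arrives.
        Consumer<String> serviceB = key -> found[0] = db.query(key).isPresent();

        db.insert("order-1", "payload"); // inside the transaction, not yet committed
        serviceB.accept("order-1");      // message delivered before the commit
        db.commit();                     // happens only after the method body is done
        return found[0];
    }

    public static void main(String[] args) {
        ToyDatabase db = new ToyDatabase();
        System.out.println("found while handling the message: " + consumerFindsData(db));
        System.out.println("found on manual retry: " + db.query("order-1").isPresent());
    }
}
```

Running `main` prints `false` for the lookup made while handling the message and `true` for the later retry, which matches what I saw: the query from service B came back empty, but pushing the same message again by hand found the data.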

[Figure: timing diagram of the problem]

The fix is simple: narrow the transaction so that it covers only the insert, and send the message only after the insert has succeeded. That guarantees the data is already in the database by the time the message is sent.


     

    @Override
    public boolean doSomething() {
        // ...
        // Transaction starts
        // Insert data
        // Transaction ends (committed)
        if (insertSucceeded) { // placeholder for the result of the insert
            // Send a message
        }
        // ...
    }
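
The corrected ordering can be sketched in plain Java as well. This is an illustration under assumptions, not the actual change made in the project: `ToyTransaction`, `runAfterCommit`, and `SendAfterCommitDemo` are hypothetical names, and "sending the message" is reduced to a callback that checks whether the data is visible. In a Spring project, `TransactionTemplate` or a `TransactionSynchronization` with `afterCommit` would be the real counterparts worth looking at.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy transaction: buffered writes become visible on commit(), then after-commit hooks run. */
class ToyTransaction {
    private final Map<String, String> store; // committed data, visible to everyone
    private final Map<String, String> pending = new HashMap<>();
    private final List<Runnable> afterCommit = new ArrayList<>();

    ToyTransaction(Map<String, String> store) {
        this.store = store;
    }

    void insert(String key, String value) {
        pending.put(key, value);
    }

    void runAfterCommit(Runnable hook) {
        afterCommit.add(hook);
    }

    void commit() {
        store.putAll(pending);
        pending.clear();
        afterCommit.forEach(Runnable::run); // hooks run only once the data is visible
        afterCommit.clear();
    }

    void rollback() {
        pending.clear();
        afterCommit.clear(); // nothing was written, so nothing is sent
    }
}

class SendAfterCommitDemo {
    /** Returns true if the simulated consumer was notified and found the data. */
    static boolean insertThenSend(Map<String, String> store, boolean insertSucceeds) {
        boolean[] found = new boolean[1];
        ToyTransaction tx = new ToyTransaction(store);
        tx.insert("order-1", "payload");
        // Deferred send: the hypothetical consumer queries only after the commit.
        tx.runAfterCommit(() -> found[0] = store.containsKey("order-1"));
        if (insertSucceeds) {
            tx.commit();
        } else {
            tx.rollback();
        }
        return found[0];
    }

    public static void main(String[] args) {
        System.out.println("insert committed: " + insertThenSend(new HashMap<>(), true));
        System.out.println("insert rolled back: " + insertThenSend(new HashMap<>(), false));
    }
}
```

Because the send is tied to the commit rather than to its position in the method body, the consumer can never observe the "message arrived, data missing" state, and a failed insert sends nothing at all.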


[Figure: timing diagram after the fix]

Conclusion:

  1. Don’t make assumptions when reading logic written by others. You may rule out the real cause outright simply because you can’t believe anyone would make such an elementary mistake;

  2. When troubleshooting problems that may be caused by timing, rely less on breakpoint debugging and more on logging;

  3. When debugging locally, reproduce the scenario as completely as possible. Starting from some point in the middle of the flow may mask the condition that triggers the problem, so that it cannot be reproduced.