Summary: With the right tool, locating online problems yields twice the result with half the effort. This article follows programmer Xiao Zhang through the problems he ran into during troubleshooting and how he solved them.

Hello everyone, my name is Liu Xuan, and I work on the development of the Yunxiao (Cloud Effect) pipeline. Programmers often run into online problems in their daily work, and the protagonist of this article, programmer Xiao Zhang, is no exception. The troubleshooting process, however, often frustrates him. Let's look at the problems he encountered and how he solved them.

It was a hell of a day

It was a sunny morning. Xiao Zhang arrived at his desk, turned on his computer, made a coffee, and started work full of energy. He was clattering away on his keyboard, deep in code, when a ping from a DingTalk group broke the peace. Customer service reported that a customer had hit a problem that needed a developer to investigate. Xiao Zhang checked the online logs but found so many user requests and log entries that he could not pin down the relevant ones. He had to ask customer service to get more specific information from the user. After several rounds of communication with the user, it took Zhang more than half an hour to locate the problem.

The busy day was almost over. Just as Xiao Zhang was getting ready to leave and thinking about his evening plans, a phone alert pulled him back to reality. The alert reported a high response time (RT) on a back-end service, so Zhang checked the monitoring data and logs of several back-end applications. He quickly found the problematic request in the Nginx log, but it was hard to find the application logs that corresponded to that request, so locating the cause took a long time: a third-party service was failing and affecting some functions. Once the cause was found, he degraded the affected functionality and the system returned to normal.

Looking for a solution

After a hectic day, Xiao Zhang felt that he was handling problems far too inefficiently and spending most of the time just locating them. Troubleshooting was slow because the system uses a microservice architecture: a single request passes through multiple services, and each service calls databases, caches, and third-party services. The approximate call link is as follows:

Zhang figured there should be a mature technical solution that identifies the entire request link and clearly marks the abnormal service nodes. A quick search turned up exactly such a solution, link tracing, which fit his scenario well.

A link tracing tool can reconstruct a distributed request into a complete call link and show how the whole request was handled: the time spent in each service, the request status at each service, and the specific machine on which each service ran.
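To make the idea of a reconstructed call link concrete, here is a minimal sketch of the kind of data a tracing tool keeps per request. It is only an illustration: the `Span` class, its fields, and the service and host names in the usage note are hypothetical and are not taken from any particular tracing product.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one node in a call link: which service handled the
// call, on which machine, how long it took, and whether it succeeded.
final class Span {
    final String service;
    final String host;
    final long durationMs;
    final boolean ok;
    final List<Span> children = new ArrayList<>();

    Span(String service, String host, long durationMs, boolean ok) {
        this.service = service;
        this.host = host;
        this.durationMs = durationMs;
        this.ok = ok;
    }

    // Attaches a downstream call and returns this span so calls can be chained.
    Span child(Span downstream) {
        children.add(downstream);
        return this;
    }

    // Renders the call link as an indented tree, one span per line.
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(service).append(" @").append(host)
          .append(' ').append(durationMs).append("ms ")
          .append(ok ? "OK" : "ERROR").append('\n');
        for (Span c : children) {
            c.render(sb, depth + 1);
        }
    }

    // Collects every failed span in this subtree, in depth-first order.
    List<Span> failed() {
        List<Span> result = new ArrayList<>();
        collectFailed(result);
        return result;
    }

    private void collectFailed(List<Span> out) {
        if (!ok) {
            out.add(this);
        }
        for (Span c : children) {
            c.collectFailed(out);
        }
    }
}
```

Building a tree such as gateway → order-service → (mysql, payment-api) and calling `render()` gives an indented view of the request, while `failed()` points directly at the abnormal nodes, which is the information Zhang was missing when he had to search logs by hand.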

Retrofitting the system

Desired effect

Based on the two problems above, Xiao Zhang expected the following:

  1. When a user runs into a problem with a request, the user can obtain a traceId. Given that traceId, Zhang can view the request's call path across services and also use it to query the logs of all applications.
  2. When an RT alarm arrives, the traceId can also be found in the Nginx logs.

Integrating link tracing

After evaluating the options, Xiao Zhang chose Alibaba Cloud's Tracing Analysis product as his link tracing server.

Alibaba Cloud Tracing Analysis provides complete call link reconstruction, request volume statistics, link topology, application dependency analysis, and other tools. These help users quickly analyze and diagnose performance bottlenecks in distributed application architectures and improve development and diagnosis efficiency in the microservice era.

Alibaba Cloud Tracing Analysis supports a variety of common link tracing tools, such as Zipkin, SkyWalking, and Jaeger. Zhang chose SkyWalking to instrument his applications and collect the tracing data.

After activating Tracing Analysis on Alibaba Cloud, you can obtain the SkyWalking access point in the cluster configuration. For more detailed integration guidelines, refer to the official Alibaba Cloud documentation.

Since Xiao Zhang's system is based on Spring Boot, all that is needed is to add the following to the startup command.

After the application restarts, the instrumentation data is collected and reported to Tracing Analysis.

Showing the traceId in logs

To search all logs by traceId, the traceId must appear in the application's logs. This is done as follows:

Add the following dependencies to the application:

Modify the logback configuration file. In the example, tid is the SkyWalking traceId.

The traceId now appears in the log. In the following figure, the TID values marked in red are traceIds.

In addition, when the system shows an error, Zhang returns the traceId to the user, so that users only need to provide the traceId when they report a problem. To do this, the code writes the traceId into the response body as follows:
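A minimal sketch of this idea, not Zhang's actual code: an error body that always carries the current traceId. The `ErrorBodyFactory` class and the field names `success`, `message`, and `traceId` are hypothetical. The traceId is supplied through a plain `Supplier<String>` so the sketch stays self-contained; in an application instrumented with SkyWalking, the supplier would presumably delegate to the toolkit's `TraceContext.traceId()`, which is an assumption about the setup rather than something shown in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper that builds error response bodies carrying the traceId.
final class ErrorBodyFactory {
    private final Supplier<String> traceIdSupplier;

    // The supplier is an assumption: it stands in for whatever call returns
    // the traceId of the request currently being handled.
    ErrorBodyFactory(Supplier<String> traceIdSupplier) {
        this.traceIdSupplier = traceIdSupplier;
    }

    // Builds a body that a JSON serializer (for example the one used by a
    // Spring Boot controller) can write to the response.
    Map<String, Object> errorBody(String message) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("success", false);
        body.put("message", message);
        body.put("traceId", traceIdSupplier.get());
        return body;
    }
}
```

With a body like this, the user-facing error page can show the traceId next to the error message, and customer service can pass it straight to the developer.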

Printing the traceId in Nginx logs

To obtain the traceId when an RT alarm arrives, the Nginx configuration needs to be modified.

Once SkyWalking is integrated, requests between systems carry a header named sw6 (where 6 is the corresponding SkyWalking version number). The header value has the format 1-TRACEID-SEGMENTID-3-PARENT_SERVICE-PARENT_INSTANCE-PARENT_ENDPOINT-IPPORT. TRACEID is the part between the first and second hyphens, and Base64-decoding it yields the traceId.
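The extraction rule described above can be sketched in a few lines of Java. This is an illustration under stated assumptions: the `Sw6TraceId` class is hypothetical, the header layout is taken as described in this article, and the trace ID `1.2.3` in `main` is a made-up value. Splitting on hyphens is safe here because the hyphen is not part of the standard Base64 alphabet.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical helper that pulls the traceId out of an sw6 header value.
final class Sw6TraceId {
    private Sw6TraceId() {}

    // Returns the decoded traceId, or empty if the value is missing,
    // has no second field, or the field is not valid Base64.
    static Optional<String> extract(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String[] parts = headerValue.split("-");
        if (parts.length < 2 || parts[1].isEmpty()) {
            return Optional.empty();
        }
        try {
            byte[] decoded = Base64.getDecoder().decode(parts[1]);
            return Optional.of(new String(decoded, StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            // Thrown by the decoder when the field is not valid Base64.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Build a sample header around a made-up trace ID.
        String encoded = Base64.getEncoder()
                .encodeToString("1.2.3".getBytes(StandardCharsets.UTF_8));
        String header = "1-" + encoded
                + "-SEGMENTID-3-PARENT_SERVICE-PARENT_INSTANCE-PARENT_ENDPOINT-IPPORT";
        System.out.println(extract(header).orElse("N/A")); // prints 1.2.3
    }
}
```

The same decoding can be applied to the header value once it is written to the Nginx log, turning a logged sw6 value back into a traceId that can be searched in the tracing console and in application logs.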

Then add the corresponding header to the Nginx log_format configuration as follows:

The corresponding value then appears in the Nginx log, as shown in the figure below:

Solving traceId loss caused by multithreading in link tracing

While testing link tracing, Zhang found that in multithreaded scenarios only one child thread could obtain the traceId correctly, while the traceId was lost in the other threads. To solve this, Zhang wrapped Callable and Runnable in classes annotated with the @TraceCrossThread annotation that SkyWalking provides. SkyWalking enhances classes that carry this annotation so that the traceId is passed across threads. The sample code is as follows.

Using the resulting TraceableCallable and TraceableRunnable resolves the traceId loss problem.
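Why a wrapper fixes the loss can be illustrated without SkyWalking at all. The sketch below is not SkyWalking's implementation: it uses a plain `ThreadLocal` as a hypothetical stand-in for the agent's per-thread trace context and captures the value by hand, whereas with SkyWalking the agent performs the equivalent capture and restore for classes annotated with `@TraceCrossThread`. The class name `CrossThreadDemo` and the ID `trace-123` are made up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CrossThreadDemo {
    // Hypothetical stand-in for the tracing agent's per-thread context.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Captures the traceId on the submitting thread (in the constructor)
    // and restores it on whichever worker thread runs the task.
    static final class TraceableCallable<V> implements Callable<V> {
        private final Callable<V> delegate;
        private final String capturedTraceId;

        TraceableCallable(Callable<V> delegate) {
            this.delegate = delegate;
            this.capturedTraceId = TRACE_ID.get();
        }

        @Override
        public V call() throws Exception {
            String previous = TRACE_ID.get();
            TRACE_ID.set(capturedTraceId);
            try {
                return delegate.call();
            } finally {
                // Leave the pooled worker thread as it was found.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        }
    }

    // Returns { traceId seen by a plain task, traceId seen by a wrapped task }.
    static String[] runDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            TRACE_ID.set("trace-123");
            Callable<String> readTraceId = TRACE_ID::get;
            Future<String> plain = pool.submit(readTraceId);
            Future<String> wrapped = pool.submit(new TraceableCallable<>(readTraceId));
            return new String[] { plain.get(), wrapped.get() };
        } catch (Exception e) {
            throw new IllegalStateException("demo task failed", e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = runDemo();
        System.out.println("plain task saw:   " + seen[0]);
        System.out.println("wrapped task saw: " + seen[1]);
    }
}
```

The plain task sees no traceId because thread-local state does not follow a task into a pool thread, while the wrapped task sees the ID that was current when the wrapper was created. A TraceableRunnable would follow the same pattern with `run()` in place of `call()`.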

After these changes, each user request has a single traceId, as shown in the following figure, which makes it easy to display the entire request link.

Coping with online problems easily

On another sunny morning, Xiao Zhang was at work when customer service relayed a problem reported by a user. This time Zhang took the traceId the user provided, looked up the call link in the Alibaba Cloud Tracing Analysis console (tracing.console.aliyun.com/), and soon…

After a day's work, just before leaving, Xiao Zhang received an alarm that an application's RT was too high. Because the traceId is printed in the Nginx logs, the slow requests could be located quickly, as shown in the following example:

The slow nodes in the figure were caused by calls to a third-party service. Zhang degraded the service accordingly and resolved the high RT quickly, preventing the problem from spreading.

With link tracing in place, the time Xiao Zhang spends locating online problems has dropped considerably, leaving him more time for other work.

That is how link tracing took Xiao Zhang from scrambling to troubleshoot online problems to locating them calmly. We hope it helps readers who have not yet adopted link tracing. This story is purely fictional; any resemblance to real events is coincidental.

This article is original content from Alibaba Cloud (Aliyun) and may not be reproduced without permission.