Background
In the era of services, thinking in terms of services has gradually become the default mindset for programmers. However, most projects keep adding services without managing them properly. When an interface fails, it is hard to find the root cause in the intricate web of service calls, and the best window for limiting the damage is missed.
Link tracing was developed to solve this problem. It helps locate faults in complex service invocations, and it lets newcomers to the backend team see exactly which part of the system their service belongs to.
In addition, if the latency of an interface suddenly increases, you no longer need to check the latency of each service one by one. The trace shows the performance bottleneck directly, which makes it easier to scale capacity accurately and reasonably when traffic surges.
Link tracing
The term “link tracing” dates to 2010, when Google published its Dapper paper. The paper explained how Google implemented its own distributed link tracing and how it kept the tracing transparent to applications at a low cost.
In fact, Dapper started out as a standalone system for tracing call links and gradually evolved into a monitoring platform. Many tools grew out of that platform, such as real-time alerting, overload protection, and metrics queries.
Besides Google’s Dapper, there are other well-known products, such as EagleEye from Alibaba, CAT from Dianping, Zipkin from Twitter, Pinpoint from Naver (the parent company of LINE), and SkyWalking, an open-source project from China.
Basic Implementation Principles
If you want to know where an interface is failing, you need to know which services the interface calls and in what order. Strung together, these services look like a chain, which we call a call chain.
To build a call chain, give each call an identifier and arrange the services in ascending order of that identifier, which makes the call order clear. We will call this identifier spanID for now.
In practice, we want to see the calls made by one particular request, so spanID alone is not enough. Each request also needs a unique identifier, so that all the services invoked by that request can be found through it. We will call this identifier traceID.
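The two identifiers introduced so far can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names `trace_id`, `span_id`, and `service`, and the sample values are all assumptions, not part of any real tracing system. The sketch groups records by trace and sorts each group by span ID to recover the call order of one request.

```python
from collections import defaultdict

# Hypothetical call records; field names and values are illustrative only.
records = [
    {"trace_id": "t1", "span_id": 2, "service": "order"},
    {"trace_id": "t2", "span_id": 1, "service": "gateway"},
    {"trace_id": "t1", "span_id": 1, "service": "gateway"},
    {"trace_id": "t1", "span_id": 3, "service": "payment"},
]


def call_order_by_trace(records):
    """Group records by trace_id, then sort each group by span_id."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["trace_id"]].append(rec)
    return {
        trace_id: [r["service"] for r in sorted(recs, key=lambda r: r["span_id"])]
        for trace_id, recs in grouped.items()
    }


chains = call_order_by_trace(records)
print(chains["t1"])  # ['gateway', 'order', 'payment']
```

Records from different requests can arrive interleaved, as in the sample list, and the trace ID is what lets them be separated again.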
With spanID it is easy to see the order in which services are called, but not the hierarchy of the calls. As the figure below shows, several services may form one chain of calls, or they may all be called by the same service at the same time.
So each call should also record who made it. We will call this identifier parentID.
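As a rough sketch of how a parent identifier restores the hierarchy, the Python below rebuilds a call tree from a flat list of spans. The span layout, the service names, and the convention that the entry point has `parent_id` set to `None` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical spans of one request; parent_id is None for the entry point.
spans = [
    {"span_id": 1, "parent_id": None, "service": "gateway"},
    {"span_id": 2, "parent_id": 1, "service": "order"},
    {"span_id": 3, "parent_id": 1, "service": "user"},
    {"span_id": 4, "parent_id": 2, "service": "payment"},
]


def render_call_tree(spans):
    """Return the call hierarchy as indented lines, one line per span."""
    # Index the spans by the ID of the span that called them.
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit the callees of parent_id in span_id order, depth first.
        for span in sorted(children[parent_id], key=lambda s: s["span_id"]):
            lines.append("  " * depth + span["service"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


tree_lines = render_call_tree(spans)
print("\n".join(tree_lines))
```

Here `order` and `user` are both called by `gateway`, while `payment` is called by `order`. Span IDs alone (1, 2, 3, 4) could not express that difference.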
At this point we know the call order and the hierarchy, but when an interface has a problem we still cannot tell which link in the chain is at fault. A faulty service usually makes the calls that go through it take a long time. To calculate how long each call takes, the three identifiers above are not enough; we also need a timestamp. The timestamp can be fine-grained, for example accurate to the microsecond.
A duration can only be calculated if a timestamp is recorded both when the call is made and when the service returns. Each timestamp should be recorded together with the three identifiers above; otherwise it is impossible to tell which call the timestamp belongs to.
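One possible shape for such a record is sketched below in Python. The `CallRecord` class, its field names, the `"call"` and `"return"` event labels, and the injectable clock are all hypothetical and exist only for this illustration. Timestamps are stored in microseconds, derived from the standard library's `time.time_ns()`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    trace_id: str
    span_id: int
    parent_id: Optional[int]
    event: str          # "call" or "return" (hypothetical labels)
    timestamp_us: int   # microseconds


def make_record(trace_id, span_id, parent_id, event,
                clock_ns: Callable[[], int] = time.time_ns) -> CallRecord:
    """Stamp a record with the current time, truncated to microseconds."""
    return CallRecord(trace_id, span_id, parent_id, event, clock_ns() // 1_000)


def elapsed_us(call: CallRecord, ret: CallRecord) -> int:
    """Duration of one call; the IDs prove both records describe the same call."""
    if (call.trace_id, call.span_id) != (ret.trace_id, ret.span_id):
        raise ValueError("records belong to different calls")
    return ret.timestamp_us - call.timestamp_us


# A fake clock (values in nanoseconds) keeps the example deterministic.
fake_ticks = iter([5_000_000, 8_500_000])
fake_clock = lambda: next(fake_ticks)

start = make_record("t1", 2, 1, "call", fake_clock)
end = make_record("t1", 2, 1, "return", fake_clock)
print(elapsed_us(start, end))  # 3500
```

The check inside `elapsed_us` is the reason the identifiers travel with the timestamp: without them, two timestamps from different calls could be subtracted by mistake.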
Although we can now calculate the total time from service invocation to service return, that time includes both the service execution time and the network latency. Sometimes the two need to be separated so that each can be optimized in a targeted way. So how do we calculate the network latency? We can divide the call and return process into the following four events.
- Client Sent (CS): the client sends a call request to the server.
- Server Received (SR): the server receives the call request from the client.
- Server Sent (SS): the server finishes processing and is ready to send the response back to the client.
- Client Received (CR): the client receives the response from the server.
If a timestamp is recorded at each of the four events, the durations are easy to calculate: SR minus CS is the network delay of the request, SS minus SR is the service execution time, CR minus SS is the network delay of the response, and CR minus CS is the total time of the call.
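The four subtractions can be written out as a small Python function. This is a sketch only: the function name, the dictionary keys, and the sample timestamps are made up for illustration, and the timestamps are assumed to be in microseconds.

```python
def span_timings(cs, sr, ss, cr):
    """Derive the four durations from the four event timestamps (microseconds)."""
    return {
        "request_network_delay": sr - cs,   # SR - CS
        "server_execution_time": ss - sr,   # SS - SR
        "response_network_delay": cr - ss,  # CR - SS
        "total_call_time": cr - cs,         # CR - CS
    }


# Hypothetical timestamps in microseconds.
timings = span_timings(cs=1_000, sr=1_200, ss=4_200, cr=4_350)
print(timings)
# {'request_network_delay': 200, 'server_execution_time': 3000,
#  'response_network_delay': 150, 'total_call_time': 3350}
```

The first three durations always add up to the total. Note that SR minus CS and CR minus SS compare timestamps taken on two different machines, so they are only as accurate as the clock synchronization between client and server.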
Besides these parameters, a span block can record other information, such as the name of the service that initiates the call, the name of the service that is called, the returned result, and the IP address. Finally, the records that share the same spanID are merged into one larger span block, and together the span blocks make up a complete call chain. Readers who are interested can explore link tracing in more depth. I hope this article is helpful to you.