What is a Trace?
In a broad sense, a trace represents the execution of a transaction or process in a (distributed) system. In the OpenTracing standard, a trace is a directed acyclic graph (DAG) composed of multiple spans, each of which represents a named, timed, continuous segment of execution within the trace. Each component that takes part in a distributed trace contributes one or more spans of its own. For example, for an ordinary RPC call, OpenTracing recommends at least one span on the client side and one on the server side.
(Figure: a trace represented as a tree.)
(Figure: a trace represented in chronological order.)
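The two views can also be sketched in code. The following is a minimal illustration, not the OpenTracing API: the `Span` class, its field names, and the sample operations are all hypothetical, and the trace is simplified to a tree in which each span has at most one parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record; real tracers carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    operation: str
    start_ms: int
    end_ms: int


def tree_lines(spans):
    """Render spans as an indented tree by following parent references."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append("  " * depth + s.operation)
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


def timeline_lines(spans):
    """Render the same spans ordered by start time."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    return [f"{s.start_ms:>4}-{s.end_ms:<4} {s.operation}" for s in ordered]


# One RPC with a span on the client side and one on the server side,
# plus two outgoing calls made by the server.
spans = [
    Span("t1", "a", None, "client: GET /order", 0, 100),
    Span("t1", "b", "a", "server: GET /order", 10, 90),
    Span("t1", "c", "b", "client: db.query", 20, 40),
    Span("t1", "d", "b", "client: rpc inventory", 45, 80),
]

print("\n".join(tree_lines(spans)))
print("\n".join(timeline_lines(spans)))
```

The tree view shows which span caused which, while the timeline view shows where the time was spent.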
The OpenTracing concepts are documented on GitHub: github.com/opentracing…
Both Zipkin and Jaeger support OpenTracing in multiple programming languages.
What problems does tracing solve?
Development and engineering teams are increasingly replacing old stand-alone systems with microservice architectures, driven by the need for horizontal scaling of system components, smaller development teams, containerized services, and decoupling. When a production system faces truly high concurrency, or is split into a large number of microservices, tasks that used to be easy become difficult. In practice we have to deal with a series of problems, such as optimizing user experience, analyzing the cause of backend errors, and understanding how components in the distributed system call each other. Modern distributed tracing systems (e.g., Zipkin, Dapper, HTrace, X-Trace) aim to address these issues, but they use incompatible APIs to meet their respective application requirements. OpenTracing emerged to solve this problem: by providing platform-independent, vendor-independent APIs, it makes it easy for developers to add (or replace) a tracing system implementation.
In microservices, the right way to use tracing is to combine traces between services with the logs of each service. Traces show whether something went wrong in the calls between services, while logs show what went wrong inside a service.
A trace only needs to record the server and client spans of each service. Intermediate steps inside a service can be written to logs that include the trace ID. When a microservice call fails, for example with an error or a timeout, you can obtain the trace ID from the request, locate the responsible service in the tracing system, and then query that service's logs by trace ID. Traces are indexed for search, while logs are kept in low-cost storage. Of course, if server resources are not a constraint, all call data between and within services can be put into the trace.
From the traces between services we can extract the service call graph, the number of calls between services, the error rate, and so on, and use them to monitor the health of the entire microservice cluster dynamically.
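As a rough sketch of printing logs that carry the trace ID, the example below uses Python's standard `logging` and `contextvars` modules. The `x-trace-id` header name, the `order-service` logger name, and the `handle_request` function are assumptions made for illustration; a real tracer defines its own propagation headers.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("order-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, headers):
    # "x-trace-id" is an assumed header name, not a standard one.
    token = current_trace_id.set(headers.get("x-trace-id", "-"))
    try:
        logger.info("loading order")
        logger.error("inventory call timed out")
    finally:
        current_trace_id.reset(token)


buffer = io.StringIO()
handle_request(build_logger(buffer), {"x-trace-id": "abc123"})
log_lines = buffer.getvalue().splitlines()
print("\n".join(log_lines))
```

With every log line tagged this way, a search for `trace_id=abc123` in the log store returns the in-service detail for exactly the request that the tracing system flagged.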
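To make the idea concrete, here is a small sketch of deriving call counts and error rates per service pair from client-side spans. The `ClientSpan` record and the service names are hypothetical; a real system would read these fields from its span storage.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClientSpan:
    # Hypothetical record: one client-side span per outgoing call.
    caller: str
    callee: str
    error: bool


def edge_stats(spans):
    """Aggregate spans into call-graph edges with call count and error rate."""
    calls = defaultdict(int)
    errors = defaultdict(int)
    for s in spans:
        edge = (s.caller, s.callee)
        calls[edge] += 1
        if s.error:
            errors[edge] += 1
    return {
        edge: {"calls": n, "error_rate": errors[edge] / n}
        for edge, n in calls.items()
    }


sample = [
    ClientSpan("gateway", "order", False),
    ClientSpan("gateway", "order", False),
    ClientSpan("gateway", "order", True),
    ClientSpan("gateway", "order", False),
    ClientSpan("order", "inventory", True),
    ClientSpan("order", "inventory", False),
]

stats = edge_stats(sample)
for (caller, callee), s in stats.items():
    print(f"{caller} -> {callee}: {s['calls']} calls, error rate {s['error_rate']:.0%}")
```

The keys of the result are the edges of the service call graph, so the same aggregation yields both the topology and the per-edge health figures.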
3. Problems with tracing systems
Most traces are normal, and storing them takes up a lot of resources. Because the spans of a trace are emitted by different microservices, a sampling decision made in advance cannot meet the requirements: it is made before anyone knows whether a later service will fail.
1. When an intermediate node in the trace decides that sampling is needed, the sampling flag has to be forwarded downstream in the request header and back upstream in the response header. With asynchronous calls, passing the flag back through the response header does not work.
2. Such a flag only takes effect along a single call path. If the trace forks before the abnormal node, the other branch cannot be marked with the sampling flag.
In this case, Flink can be used to defer the sampling decision until the entire trace has been collected:
The trace ID is used as the key and a session window (with gap sessionTime) as the window, so that all spans of a single trace arriving from the Flink source within sessionTime are collected together.
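The deferred decision can be illustrated without Flink. The sketch below is a plain-Python simulation of session windows keyed by trace ID, not Flink code: the `SpanEvent` record, the `sample_after_session` function, and the rule "keep a trace if any span has an error" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SpanEvent:
    # Hypothetical span record as it might arrive from the source.
    trace_id: str
    service: str
    timestamp_ms: int
    error: bool = False


def sample_after_session(events, gap_ms):
    """Group spans per trace ID into session windows, then keep only
    the windows that contain at least one error span."""
    sessions = {}  # trace_id -> list of windows, each a list of spans
    for e in sorted(events, key=lambda e: e.timestamp_ms):
        windows = sessions.setdefault(e.trace_id, [])
        # A span joins the open window if it arrives before the gap expires.
        if windows and e.timestamp_ms - windows[-1][-1].timestamp_ms < gap_ms:
            windows[-1].append(e)
        else:
            windows.append([e])
    kept = []
    for trace_id, windows in sessions.items():
        for window in windows:
            if any(s.error for s in window):
                kept.append((trace_id, [s.service for s in window]))
    return kept


events = [
    SpanEvent("t1", "gateway", 0),
    SpanEvent("t2", "gateway", 10),
    SpanEvent("t1", "order", 30),
    SpanEvent("t1", "payment", 40),  # forked branch without an error
    SpanEvent("t2", "order", 50),
    SpanEvent("t1", "inventory", 60, error=True),
]

kept = sample_after_session(events, gap_ms=100)
print(kept)
```

Because the decision is taken on the whole window, the healthy `payment` branch of trace `t1` is kept together with the failing `inventory` span, which a flag forwarded along a single call path could not achieve, and the fully normal trace `t2` is dropped.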
The analysis of Flink + Trace is described in the third article in this series.