Link Tracing

Link tracing (distributed tracing) is, at its core, a concept and a strategy: the ability to pass specific information between two associated calls. From a component-design perspective, the features of interest are:

  • Generality: how broad a scope the solution covers, and where it does not apply
  • Completeness: whether the data model is comprehensive enough to include what should be included and discard what should not
  • Cost: the cost and risk of implementation, and the complexity of integration

There are many different implementation strategies, but broadly they fall into three categories.

Solutions based on a specific language

The typical example is the Java ecosystem. Java is nominally a compiled language, but thanks to the virtual machine and its bytecode format, Java actually has many characteristics of a dynamic language.

The basic idea of this kind of implementation is to use a Java agent to intercept the class-loading process and inject custom code during loading, thereby adding trace capability.

The main disadvantage of this type of solution is that it only works for Java. But for a company standardized on the Java stack it works very well, since nearly everything can be instrumented. The integration cost is also very low: you only need to add a parameter to the startup command, which is easy to do in deployment scripts or container images. The other downside is that rewriting bytecode is inherently risky, although mature implementations keep overall stability assured. A representative project is SkyWalking.

Solutions based on coding conventions within the organization

What if the company is not all-Java, other languages cannot modify bytecode, or the bytecode approach is simply considered too crude?

In this case, the basic idea is to abstract the concept at the protocol level, so that each component implements link tracing internally and also takes care of collecting its trace logs. If a service team has special needs when integrating with the tracing system, it interacts with trace according to the same convention. Compatibility with various open-source components must also be considered.

With this approach, the protocol-level design is critical: it needs to be recognized and understood by all parties, and the completeness of the protocol itself matters a great deal. By far the most famous specification is OpenTracing, which can reasonably be regarded as the de facto standard in the tracing field.

Personally, I think this is the better strategy, and large companies need standardized programming conventions anyway. However, it is a double-edged sword: the effectiveness of tracing depends on the degree of standardization, because a link that is broken even once no longer yields a complete trace.

Another problem is that even standardization has its limits, such as passing information across threads; most in-house conventions struggle with this, and the result is not as simple and effective as bytecode enhancement. Representative implementations include Jaeger, CAT, SOFATracer, Zipkin, and Dapper.

Solutions based on Service Mesh

Of course, with the advance of containerization and Service Mesh in recent years, mesh-based solutions can also do link tracing. By hijacking traffic through the sidecar, we can build a tracing system that does not depend on a specific language or RPC framework, which is indeed a more ideal model conceptually. However, making the program aware of the tracing context at runtime remains a problem, and the various mesh solutions are still in an exploratory stage with no perfect answer yet (see e.g. Jaeger).

Model and protocol design

The data model

Most existing link tracing models follow Dapper's implementation, and the OpenTracing specification has also had a great influence on model design, particularly its semantics section. Most of what follows is taken from these two sources.

To put it simply, one external call may fan out into multiple internal requests. The process can be described as a tree, each call is defined as a span, and the entire call tree is called a trace, identified by a unique ID.

A span is the smallest unit of the call tree. It is usually identified by a unique ID and includes the following parts:

  • An operation name, usually a pattern such as the called method name or a URL.
  • Start and end timestamps.
  • Zero or more tags: string key/value pairs such as IP, application name, database name, or URL.
  • Zero or more logs: timestamped events such as error codes or call stacks.
  • Zero or more references to other spans (ChildOf, FollowsFrom).
  • A SpanContext (traceId, spanId, sampleFlag, baggage). This is a concept or interface-level object: it can be serialized and deserialized, provides a thin API for middleware and business applications, and is inherently immutable. Baggage lets users pass through their own key/value pairs.
Causal relationships between spans in a single trace:

            [Span A]  ←←← (the root span)
                |
         +------+------+
         |             |
     [Span B]      [Span C]  ←←← (Span C is a `ChildOf` Span A)
         |             |
     [Span D]      +---+------+
                   |          |
               [Span E]   [Span F] >>> [Span G] >>> [Span H]
                                          ↑
                            (Span G `FollowsFrom` Span F)
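The span structure described above can be sketched with a few minimal classes (hypothetical names, not the OpenTracing API): a trace is just a tree of spans sharing one traceId, with each child recording its parent's spanId.

```java
import java.util.*;

// Minimal span model (illustrative only): a trace is a tree of spans that
// share one traceId; each child records its parent's spanId.
class Span {
    final String traceId;
    final String spanId = UUID.randomUUID().toString().substring(0, 8);
    final String parentSpanId;                    // null for the root span
    final String operationName;
    final Map<String, String> tags = new HashMap<>();
    long startMicros, finishMicros;

    Span(String traceId, String parentSpanId, String operationName) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.operationName = operationName;
        this.startMicros = System.nanoTime() / 1000;
    }

    // Synchronous child: same trace, this span as parent (a ChildOf reference).
    Span childOf(String operationName) {
        return new Span(traceId, spanId, operationName);
    }

    void finish() { finishMicros = System.nanoTime() / 1000; }
}

public class TraceDemo {
    public static void main(String[] args) {
        Span a = new Span(UUID.randomUUID().toString(), null, "GET /checkout");
        Span b = a.childOf("inventory.check");
        b.tags.put("peer.ip", "10.0.0.12");       // a tag, as in the list above
        b.finish();
        a.finish();

        System.out.println(a.traceId.equals(b.traceId));     // true: same trace
        System.out.println(b.parentSpanId.equals(a.spanId)); // true: B is a child of A
    }
}
```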

The call tree above is fairly intuitive; only two points need explanation:

The relationship between spans

One point is the references inside a span. In my view these exist mainly to express the difference between relation types; most implementations are unlikely to record references to all sub-spans, nor is that necessary. In most cases the simpler concept of a parent-span-id is used. There are two reference types:

  • ChildOf is mainly used in synchronous scenarios, where the child span must finish before the parent span ends.
  • FollowsFrom mainly covers asynchronous scenarios. Here the span and its parent are only logically related, with no timing constraint between them. This poses no problem for building the link, but it complicates some data-processing scenarios: for example, you cannot determine when the trace ends. In practice I tend not to chain asynchronous calls under the same trace, for two reasons:

  • First, traces are mostly used for performance analysis or dependency analysis. In both of these typical scenarios, chaining asynchronous calls into the series adds little meaning and is not conducive to subsequent data analysis.

  • Second, even without chaining via the spanId structure itself, the relevant information is not lost. Because trace inherently supports transparent pass-through, we can construct something like a logId, or any business-meaningful datum, and pass it through transparently even when the downstream link is no longer under the same trace.
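The timing difference between the two reference types can be made concrete with a small sketch (hypothetical classes, not a real tracer API): a FollowsFrom successor is free to finish long after its predecessor, which is exactly what makes "when does the trace end?" hard to answer.

```java
import java.util.*;

// Illustrative reference types: ChildOf implies the parent outlives the
// child; FollowsFrom only records causality, with no timing constraint.
enum RefType { CHILD_OF, FOLLOWS_FROM }

record Ref(RefType type, String referencedSpanId) {}

class TimedSpan {
    final String spanId;
    final List<Ref> refs;
    long finishMicros = -1;                 // -1 means "still running"

    TimedSpan(String spanId, Ref... refs) {
        this.spanId = spanId;
        this.refs = List.of(refs);
    }

    void finish(long atMicros) { finishMicros = atMicros; }
}

public class RefDemo {
    public static void main(String[] args) {
        TimedSpan f = new TimedSpan("F");
        TimedSpan g = new TimedSpan("G", new Ref(RefType.FOLLOWS_FROM, "F"));

        f.finish(100);   // the predecessor returns...
        g.finish(500);   // ...while the fire-and-forget successor runs on

        // Legal under FollowsFrom, impossible under ChildOf:
        System.out.println(g.finishMicros > f.finishMicros); // true
    }
}
```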

How transparent pass-through is implemented

Pass-through essentially reconstructs a SpanContext between two execution contexts. In general there are two things to do: first, pass the SpanContext along in as non-invasive a way as possible so that the span can be rebuilt; second, depending on the situation, create a new span in the span wrapper of the current context and push it onto the top of the stack. The following figure shows an example of cross-thread delivery.
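Cross-thread pass-through can be illustrated with a minimal sketch under assumed types (the SpanContext record and ThreadLocal holder are hypothetical, not a real tracer's API): the submitting thread snapshots its context, and the wrapped task rebuilds it on the worker thread so the trace is not broken at the thread boundary.

```java
import java.util.concurrent.*;

// Cross-thread pass-through sketch (illustrative minimal types): the
// submitting thread snapshots its SpanContext, the wrapped task carries
// the snapshot, and the worker thread rebuilds the context before running.
record SpanContext(String traceId, String spanId, boolean sampled) {}

public class CrossThreadDemo {
    static final ThreadLocal<SpanContext> CURRENT = new ThreadLocal<>();

    // Wrap at submit time so the caller's context travels with the task.
    static Runnable wrap(Runnable task) {
        SpanContext captured = CURRENT.get();   // snapshot on the caller's thread
        return () -> {
            CURRENT.set(captured);              // rebuild on the worker thread
            try {
                task.run();
            } finally {
                CURRENT.remove();               // avoid leaking into pooled threads
            }
        };
    }

    public static void main(String[] args) throws Exception {
        CURRENT.set(new SpanContext("trace-1", "span-1", true));

        String[] seenByWorker = new String[1];
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(wrap(() -> seenByWorker[0] = CURRENT.get().traceId())).get();
        pool.shutdown();

        System.out.println(seenByWorker[0]); // trace-1
    }
}
```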

Span log composition, pass-through, and collection

Another thing that needs discussion is the SpanContext. It actually means slightly different things in different scenarios; here we mainly mean the part that lets the downstream rebuild the link, plus the parameters that need to be passed through. From the above it is easy to see that a span is really completed by two log records, one on each side of the call. These logs are collected and uploaded separately by each instance; only a small number of parameters are carried downstream over RPC for link reconstruction or parameter pass-through. In general, pass-through parameters fall into three groups:

  • Business parameters that must be passed through, such as pressure-test markers and logId. These must be carried in every scenario.
  • Parameters used to construct the span. Only three are strictly required: traceId, spanId, and the sample flag. Whether they are passed depends on the scenario: generally they are passed in synchronous calls and not in asynchronous ones. If not passed, the downstream typically constructs a new traceId and spanId, and only single-ended logs can be produced for those spans.
  • Parameters for information sharing. Everything beyond the above can travel over an out-of-band channel, in most cases an ELK-like stack. Concretely, there are several implementation strategies for pass-through, as shown below:

  • One option is to send both the business information and the span-construction information with the request, and have the downstream echo the same information back in the response. The advantage of this model is a clear concept and a clear data model, but it raises a question: should the upstream log be written at the moment the request is sent? If nothing is written until the response arrives, a merge is needed once the response comes back; alternatively, write the log twice and merge once at the storage layer.
  • Another option is to keep the temporary data on a stack in the current thread-local variable, pop the current object when the response returns, and attach the response parameters to it. However, this approach cannot handle asynchronous-callback scenarios.
  • The last option is to rebuild the span asynchronously. The sample parameter is special here. Most tracing systems need sampling, and the common mode is head-based, i.e. the sampling decision is made once at the head (root) node. As a result, in some asynchronous scenarios the sample flag still needs to be passed even though neither traceId nor spanId is.
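The head-based sampling rule in the last bullet can be sketched as follows (the 5% rate and class name are illustrative): only the head node rolls the dice, and every downstream node honors the propagated flag.

```java
import java.util.concurrent.ThreadLocalRandom;

// Head-based sampling sketch: the root decides once; downstream nodes
// copy the propagated flag instead of deciding again.
public class HeadSampler {
    static final double SAMPLE_RATE = 0.05;  // 5%, within the typical 1-5% range

    // Called when a request arrives. If the upstream already decided
    // (upstreamSampledFlag != null), honor it; otherwise we are the root.
    static boolean decide(Boolean upstreamSampledFlag) {
        if (upstreamSampledFlag != null) return upstreamSampledFlag;
        return ThreadLocalRandom.current().nextDouble() < SAMPLE_RATE;
    }

    public static void main(String[] args) {
        boolean rootDecision = decide(null);        // root node: random roll
        boolean childDecision = decide(rootDecision); // downstream: follows the flag
        System.out.println(childDecision == rootDecision); // true
    }
}
```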

Module split and design boundaries

System module split

This section describes a common module layout for a distributed link tracing system. Different systems make trade-offs in different places, but they generally follow this shape.

Client function split and implementation

  • Agent: code modules packaged as the Java agent
    • Transform: code-enhancement logic
    • Classloader: loads agent code in isolation to prevent class pollution
    • Static-config: interception logic and static configuration
  • Plugin: logic for specific business-level features
  • Client: the component applications depend on via POM
    • OpenTracing interface: constructs spans
    • Processor: general processing logic
    • Weavepoint: enhancement-point type definitions, used to define scope
    • Dynamic-config: fetches dynamic configuration
    • Log: prints logs
    • Util: utilities
  • Core: data-model dependencies

Class and interface design

There are many ways to implement link tracing. In my understanding, two layers of abstraction are needed, the first at the enhancement-point level. The implementation splits the code into several parts:

  • One layer of abstraction serves enhancement, or code implanting. Because different media have different entry points, a Wrapper handles enhancement points uniformly and serves as the single place where code is woven in. The Wrapper layer determines the entry type of the current code and selects the appropriate AbstractWeaveProcessor. The code can provide abstract templates (client and server base classes), and different insertion points extend the interfaces the base code exposes.
  • Generally speaking, the layer above suffices for basic tracing. But there are often many business features to serve inside a company: pressure testing, custom parameter pass-through, environment staining, logId, and so on. These features form a unified business-abstraction layer that must be reused at every enhancement point, so it makes sense to abstract them into a plugin layer.
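A minimal sketch of the two layers, under assumed names (AbstractWeaveProcessor, TracePlugin, and the event strings are illustrative, not the real module's API): the base class runs the shared start-span/plugin/finish-span template, subclasses only declare their entry type, and plugins hook business features such as logId injection into every enhancement point.

```java
import java.util.*;

// Illustrative two-layer abstraction: a processor per entry type handles
// span start/finish uniformly; plugins add business features on top.
interface TracePlugin {
    void beforeSpan(Map<String, String> baggage);   // e.g. inject a logId
}

abstract class AbstractWeaveProcessor {
    final List<TracePlugin> plugins;

    AbstractWeaveProcessor(List<TracePlugin> plugins) { this.plugins = plugins; }

    abstract String entryType();                    // e.g. "http-server", "rpc-client"

    // Template method: the shared sequence reused by every enhancement point.
    final List<String> invoke(Runnable business) {
        List<String> events = new ArrayList<>();
        events.add("startSpan:" + entryType());
        Map<String, String> baggage = new HashMap<>();
        plugins.forEach(p -> p.beforeSpan(baggage)); // plugin layer runs here
        business.run();
        events.add("finishSpan:" + entryType());
        return events;
    }
}

class HttpServerProcessor extends AbstractWeaveProcessor {
    HttpServerProcessor(List<TracePlugin> plugins) { super(plugins); }
    String entryType() { return "http-server"; }
}

public class WeaveDemo {
    public static void main(String[] args) {
        TracePlugin logIdPlugin = baggage -> baggage.put("logId", "L-123");
        List<String> events =
                new HttpServerProcessor(List.of(logIdPlugin)).invoke(() -> {});
        System.out.println(events); // [startSpan:http-server, finishSpan:http-server]
    }
}
```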

Interesting design details

Trace-id vs log-id

Is logId the same as traceId? What is the difference? Strictly speaking, log-id is not trace-id, though it could be. In essence, trace provides the pass-through capability, while logId is commonly used to stitch log records together; so in most scenarios, the logId is passed between systems on top of trace's pass-through capability. One interesting point, discussed earlier, is chaining in asynchronous scenarios. As argued above, although trace itself can span asynchronous scenarios despite the start/end-time constraints, doing so brings a lot of trouble to data analysis. In practice, I prefer to scope a traceId to synchronous calls and, in asynchronous scenarios such as async RPC or message queues, rebuild the trace while carrying the logId through.
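To make that preference concrete, here is a sketch under assumed minimal types (the Context record and helper are hypothetical): the async hop rebuilds the traceId, but the logId survives in baggage, so logs on both sides of a queue remain joinable.

```java
import java.util.*;

// Illustrative: an async boundary starts a fresh trace but copies the
// baggage forward, so the logId keeps stitching logs together.
record Context(String traceId, Map<String, String> baggage) {}

public class LogIdDemo {
    // Async boundary (e.g. a message queue hop): new traceId, same baggage.
    static Context crossAsyncBoundary(Context upstream) {
        return new Context(UUID.randomUUID().toString(),
                           new HashMap<>(upstream.baggage()));
    }

    public static void main(String[] args) {
        Context sync = new Context("trace-A",
                new HashMap<>(Map.of("logId", "L-42")));
        Context afterQueue = crossAsyncBoundary(sync);

        System.out.println(sync.traceId().equals(afterQueue.traceId())); // false
        System.out.println(afterQueue.baggage().get("logId"));           // L-42
    }
}
```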

Java agent and bytecode enhancement

One thing not yet explained is how bytecode enhancement is implemented; the basic mechanism is the Java agent. Although Java is a compiled language, the JVM and ClassLoader give it some dynamic properties: what Java actually executes is the bytecode inside the JVM. Take the transaction-management annotations most Java developers know: essentially they mark certain methods or fields, and during execution a proxy class is generated that wraps the existing methods with a transaction-management template. Dynamic proxying is achieved either by generating a new proxy class or by changing the bytecode of the original class.

Back to trace: we also want to enhance bytecode. In theory a dynamic proxy based on custom class loading could work, but the main problem is that we have no way to control all the class loaders; what trace wants is to wrap a method regardless of how the class is loaded. Java itself provides the java-agent mechanism, which can intercept the entire class-loading process, or serve as a back door for reloading a class at runtime, and this is clearly better suited to our scenario. To use it, you implement the corresponding interface and package it into a JAR. There are two modes: start the agent together with the JVM, or, after a JVM instance is already running, run the agent as a separate process and attach it to the JVM process. I won't go into the code details here, since there are many examples on the web; what I want to describe is a common java-agent design, which also helps in understanding other open-source designs.
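For reference, a minimal premain skeleton (the package prefix in the allow-list is illustrative). The JVM invokes premain before main() when started with -javaagent:agent.jar, and the agent JAR's manifest must declare Premain-Class. The transformer below only observes class loads; a real trace agent would rewrite the bytecode here (e.g. with ASM or ByteBuddy) instead of returning null.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal java-agent skeleton. Run with: java -javaagent:agent.jar -jar app.jar
// (agent.jar's MANIFEST.MF must contain: Premain-Class: TraceAgent)
public class TraceAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new NamingTransformer());
    }

    static class NamingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (shouldWeave(className)) {
                System.out.println("would enhance: " + className);
            }
            return null; // null means "leave the class unchanged"
        }
    }

    // Limit enhancement to a small allow-list, as discussed above
    // (JVM-internal names use '/' separators; the prefix is made up).
    static boolean shouldWeave(String jvmInternalName) {
        return jvmInternalName != null
                && jvmInternalName.startsWith("com/example/service/");
    }
}
```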

A typical java-agent module often includes the following sections:

  • First, the agent entrance: the class that implements the java-agent interface. It is somewhat like a main method, from which the main logic unfolds.
  • A section that connects to external interfaces and provides an observable, operable entry point, such as an HTTP server or an etcd/ZooKeeper client.
  • The main function of many java agents is bytecode enhancement. This is a dangerous operation, equivalent to changing code logic without stopping the process. Therefore SPI interfaces are usually defined, the enhancement points are limited to a few areas, and the modifications are loaded by independently compiling JARs that implement the specified interfaces. Beyond safety, this also decouples the code. The process is somewhat like writing an API service, where you design the URL before writing its implementation.
  • Another common module is a separate ClassLoader. Because a java agent often runs in premain mode, before the main method, any dependencies it pulls in may pollute the classes the application loads during normal startup. Most agents therefore write a dedicated ClassLoader to load the agent's classes in isolation and avoid pollution; this loader usually does not follow the parent-delegation model.
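One minimal form of that isolation (a sketch, not a full child-first loader): constructing a URLClassLoader with a null parent detaches it from the application class loader, so agent classes resolve only against the agent's own JARs and the bootstrap loader.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Isolation sketch: a null parent skips the application/system loader, so
// classes from agentJars cannot collide with the application's classes.
// Real agents often go further and implement a fully child-first loadClass().
public class AgentClassLoader extends URLClassLoader {
    public AgentClassLoader(URL[] agentJars) {
        super(agentJars, null); // null parent: only agentJars + bootstrap classes
    }
}
```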

  • Open-source trace implementation: skywalking.apache.org/
  • Chaos engineering: github.com/alibaba/jvm… , github.com/chaosblade-…
  • Java analysis and diagnosis tools: github.com/alibaba/art…

Head-based vs. tail-based sampling

Each RPC generates two logs, so without sampling the volume of trace logs is very large. Most trace logs are therefore sampled, and the sampling rate is very low, typically 1-5%. This does not impair the function of trace: as mentioned before, sampling only governs whether logs are dropped or uploaded; it does not affect span generation or parameter pass-through, only whether a generated span is uploaded. Sampling itself is also interesting: most traces are head-based, meaning only the root node decides whether to sample, and subsequent nodes follow the sample flag passed down from upstream.

Typical application scenarios

Request staining (environment routing)

First, the concept of environment isolation. Most development workflows are based on GitFlow. With only one set of environments, teams suffer the pain of merging multiple branches; yet standing up multiple full environments (separate databases, Nginx, registries, Redis, and numerous dependent services) is expensive. Is there a way to get the benefits of multiple environments without the pain?

The environment-staining scheme works as follows: trace parses the environment information on each inbound and outbound middleware request and passes it through as a parameter, and each middleware component routes to the corresponding cluster based on this information. A simple example follows:

For RPC, during routing the client first looks for a cluster in the specific sub-environment based on the environment information; if none exists, it calls the base environment. Applications in the base environment can likewise call into the sub-environment by the same rule. Even when the call passes through middleware such as a message queue, the message carries the environment information, and trace can parse and pass it through by the same rules. However, to prevent messages from being consumed by base-environment applications, you need to build sub-environment-specific topics.
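The routing rule just described can be sketched as a small selection function (cluster names and addresses are made up): prefer the sub-environment's cluster, and fall back to the base environment when it does not exist.

```java
import java.util.*;

// Environment-staining route sketch: match the propagated env tag first,
// fall back to "base" when the sub-environment has no cluster.
public class EnvRouter {
    // clusters maps an environment name to its list of instance addresses.
    static String pick(Map<String, List<String>> clusters, String requestEnv) {
        List<String> candidates = clusters.getOrDefault(requestEnv, List.of());
        if (candidates.isEmpty()) {
            candidates = clusters.getOrDefault("base", List.of()); // fallback
        }
        if (candidates.isEmpty()) throw new IllegalStateException("no cluster");
        return candidates.get(new Random().nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> clusters = Map.of(
                "base", List.of("10.0.0.1"),
                "feature-x", List.of("10.0.1.1"));

        System.out.println(pick(clusters, "feature-x")); // 10.0.1.1
        System.out.println(pick(clusters, "feature-y")); // 10.0.0.1 (base fallback)
    }
}
```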

Other Application Scenarios

  • Full-link pressure testing
  • Application-layer analysis
  • Failure drills

WeChat official account: Magic Programmer

