0 Background

With the rise of microservices architectures, services are split along different dimensions, and a single request often involves multiple services. Internet applications are built on different sets of software modules, which may be developed by different teams, implemented in different programming languages, and spread across thousands of servers in multiple data centers. We therefore need tools that help us understand system behavior and analyze performance problems, so that faults can be located and fixed quickly.

Full-link monitoring (distributed tracing) components emerged against this background. The best-known is Google Dapper, described in Google’s published paper. To understand the behavior of a distributed system in this context, you need to monitor the related actions across different applications and different servers.

In a complex microservices architecture, almost every front-end request forms a complex distributed invocation chain. A complete call chain for one request might look like this:

A complete call chain for one request

As business scale grows, the number of services increases, and changes become frequent, complex call chains bring a series of problems:

  1. How can we find problems quickly?

  2. How do we determine the scope of a fault?

  3. How do we sort out service dependencies and judge whether they are reasonable?

  4. How do we analyze link performance and plan capacity in real time?

We also want to see performance metrics for each invocation during request processing, such as throughput (TPS), response time, and error counts:

  1. Throughput: the real-time throughput of the relevant components, platforms, and physical devices, calculated from the topology.

  2. Response time: both the overall response time of a call and the response time of each individual service.

  3. Error count: the number of exceptions per unit time, based on the values returned by each service.

Full-link performance monitoring presents metrics from the global level down to the local level. It displays the performance of all call chains across applications in one place, making it easy to measure global and local performance, locate the source of faults, and greatly shorten troubleshooting time in production.

With a full-link monitoring tool, we can achieve:

  1. Request link tracing for fast fault location: use the call chain together with service logs to quickly locate faults.

  2. Visualization: show the time spent in each stage and support performance analysis.

  3. Dependency optimization: show the availability of each invocation, sort out service dependencies, and optimize them.

  4. Data analysis and link optimization: obtain users' behavior paths and aggregate and analyze them across many business scenarios.

1 Target Requirements

Given the above, what should we look for when selecting a full-link monitoring component? The Google Dapper paper also touches on these points; in summary:

  1. Probe performance overhead

  2. Code invasiveness

  3. Scalability

  4. Data analysis capability

2 Function Modules

A typical full-link monitoring system can be roughly divided into four functional modules:

  1. Instrumentation (burying points) and log generation

  2. Log collection and storage

  3. Analysis and aggregation of the call-chain data, with attention to timeliness

  4. Presentation and decision support

3 Google Dapper

3.1 Span

A span is the basic unit of work. Each link call (an RPC, a DB access, etc.) creates a span, identified by a 64-bit ID (a UUID is convenient to use). A span also carries other data, such as description information, timestamps, key-value tag information, and a parent_id; the parent_id indicates which part of the call chain the span came from.

The figure above illustrates what spans look like within a larger trace. Dapper records a name for each span, together with each span's ID and parent ID, so that the relationships between the different spans of a single trace can be reconstructed. A span without a parent ID is called a root span. All spans belong to one specific trace and share that trace's ID.

Span data structure:

type Span struct {
    TraceID    int64        // ID marking one complete request
    Name       string       // span name
    ID         int64        // span_id
    ParentID   int64        // span_id of the upper-layer (calling) service; null for a root span
    Annotation []Annotation // annotations recording timestamped events
    Debug      bool
}

3.2 Trace

A trace is a collection of spans organized in a tree-like structure. It represents one complete trace, starting when the request reaches the server and ending with the server's response, and it tracks the time spent on each RPC call along the way. A trace is uniquely identified by a trace_id. For example, if you run a distributed big-data store, a trace consists of the processing of one request you issue.

Each color in the figure corresponds to one span. A call chain is uniquely identified by its TraceID, and each span identifies the request information at one point in that chain. The tree nodes are the basic building blocks of the whole structure, and each node is a reference to a span. The lines between nodes represent the direct relationship between a span and its parent span. Although, in the log files, a span only records its start and end times, the spans are relatively independent nodes within the overall tree structure.
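
To make the tree structure concrete, here is a minimal Go sketch, not taken from the Dapper paper, of how a flat list of spans sharing one TraceID could be reassembled into a trace tree via their parent IDs. It mirrors the Span structure from section 3.1, with an added Children field that is purely illustrative.

package main

import "fmt"

// Span follows the structure shown in 3.1; Children is added here only to
// hold the reconstructed tree and is not part of the original definition.
type Span struct {
    TraceID  int64
    Name     string
    ID       int64
    ParentID int64 // 0 means "no parent", i.e. a root span
    Children []*Span
}

// buildTrace links the spans of one trace into a tree and returns the root span.
func buildTrace(spans []*Span) *Span {
    byID := make(map[int64]*Span, len(spans))
    for _, s := range spans {
        byID[s.ID] = s
    }
    var root *Span
    for _, s := range spans {
        if parent, ok := byID[s.ParentID]; ok {
            parent.Children = append(parent.Children, s)
        } else {
            root = s // a span without a parent ID is the root span
        }
    }
    return root
}

// printTree prints the call chain with indentation reflecting call depth.
func printTree(s *Span, depth int) {
    fmt.Printf("%*s%s (id=%d)\n", depth*2, "", s.Name, s.ID)
    for _, c := range s.Children {
        printTree(c, depth+1)
    }
}

func main() {
    spans := []*Span{
        {TraceID: 1, Name: "frontend A", ID: 1, ParentID: 0},
        {TraceID: 1, Name: "service B", ID: 2, ParentID: 1},
        {TraceID: 1, Name: "service C", ID: 3, ParentID: 1},
        {TraceID: 1, Name: "service D", ID: 4, ParentID: 3},
    }
    printTree(buildTrace(spans), 0)
}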

3.3 Annotation

An annotation records information about a particular event in the request (such as its timestamp). A span can carry multiple annotations, and there are usually four of them:

(1) CS (Client Start): the client initiates the request

(2) SR (Server Receive): the server receives the request

(3) SS (Server Send): the server finishes processing and sends the result back to the client

(4) CR (Client Received): the client receives the response from the server

Annotation data structure:

type Annotation struct {
    Timestamp int64    // time at which the event occurred
    Value     string   // event type, e.g. cs, sr, ss, cr
    Host      Endpoint // endpoint (service/host) that recorded the annotation
    Duration  int32    // duration of the event, where applicable
}
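
The four annotations can be combined into basic timings: the total round trip is cr - cs, the server-side processing time is ss - sr, and the remainder is network overhead. The following Go sketch is purely illustrative; the Endpoint definition and the millisecond timestamps are assumptions for the example rather than part of the original structure.

package main

import "fmt"

// Endpoint is assumed here for completeness; real tracing systems define richer fields.
type Endpoint struct {
    ServiceName string
    IPv4        string
    Port        int16
}

type Annotation struct {
    Timestamp int64 // event time in milliseconds (assumed unit for this sketch)
    Value     string
    Host      Endpoint
    Duration  int32
}

// findTs returns the timestamp of the annotation with the given value (cs, sr, ss, cr).
func findTs(anns []Annotation, value string) int64 {
    for _, a := range anns {
        if a.Value == value {
            return a.Timestamp
        }
    }
    return 0
}

func main() {
    anns := []Annotation{
        {Timestamp: 1000, Value: "cs"},
        {Timestamp: 1010, Value: "sr"},
        {Timestamp: 1050, Value: "ss"},
        {Timestamp: 1065, Value: "cr"},
    }
    total := findTs(anns, "cr") - findTs(anns, "cs")  // 65 ms round trip as seen by the client
    server := findTs(anns, "ss") - findTs(anns, "sr") // 40 ms spent processing on the server
    network := total - server                         // 25 ms spent on the network
    fmt.Println(total, server, network)
}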

3.4 Invocation Example

  1. Request invocation example
  • When a user initiates a request, it first reaches front-end service A, which then makes RPC calls to service B and service C respectively.

  • Service B responds to service A after processing, while service C must first interact with back-end services D and E before returning its result to service A. Finally, service A responds to the user's request.

  2. Call process tracing
  • A global TraceID is generated when the request arrives; the whole invocation chain can be stitched together by this TraceID. One TraceID represents one request.

  • In addition to the TraceID, a SpanID is needed to record the parent-child call relationships. Each service records the parent ID and span ID, from which the parent-child relationships can be assembled into a complete call chain.

  • A span without a parent ID is the root span, which can be regarded as the entry point of the call chain.

  • All of these IDs can be represented as globally unique 64-bit integers.

  • The TraceID and SpanID are passed along with every request throughout the whole call.

  • Each service records the TraceID that came with the request, the SpanID that came with it (used as the parent ID), and the SpanID it generates itself.

  • To view a complete call chain, gather all call records by TraceID and assemble the whole parent-child relationship from the parent IDs and span IDs, as the sketch below illustrates.
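
Below is a minimal, library-agnostic Go sketch of this propagation over HTTP: the caller injects its TraceID and its own SpanID into request headers, and the callee treats the incoming SpanID as its parent ID and generates a fresh SpanID for itself. The header names (X-Trace-Id, X-Span-Id) and the downstream URL are illustrative assumptions, not a standard.

package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "strconv"
)

// newID stands in for a globally unique 64-bit ID generator.
func newID() int64 { return rand.Int63() }

// inject puts the current trace context into an outgoing request's headers.
func inject(req *http.Request, traceID, spanID int64) {
    req.Header.Set("X-Trace-Id", strconv.FormatInt(traceID, 10))
    req.Header.Set("X-Span-Id", strconv.FormatInt(spanID, 10))
}

// handler extracts the incoming context: the caller's SpanID becomes this
// service's parent ID, and a fresh SpanID is generated for its own span.
func handler(w http.ResponseWriter, r *http.Request) {
    traceID, _ := strconv.ParseInt(r.Header.Get("X-Trace-Id"), 10, 64) // same TraceID for the whole request
    parentID := r.Header.Get("X-Span-Id")                              // the caller's span is our parent
    spanID := newID()                                                  // our own span ID
    fmt.Fprintf(w, "trace=%d parent=%s span=%d\n", traceID, parentID, spanID)

    // For any downstream RPC, this span becomes the parent of the next hop.
    if out, err := http.NewRequest("GET", "http://service-b.local/api", nil); err == nil {
        inject(out, traceID, spanID)
    }
}

func main() {
    http.HandleFunc("/", handler)
    _ = http.ListenAndServe(":8080", nil)
}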

  3. Call chain core work
  • Call chain data generation: instrument all applications along the call path and have them output logs.

  • Call chain data collection: collect the log data from each application.

  • Call chain data storage and query: store the collected data; because the volume of log data is usually very large, the storage must not only hold it but also support fast queries.

  • Metric calculation, storage, and query: compute various metrics over the collected log data and save the results.

  • Alerting: provide various threshold-based alarm functions (a rough sketch of metric aggregation and threshold checking follows this list).
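
The following Go sketch is a minimal illustration of those last two steps, not tied to any of the products discussed here: it aggregates error counts per service per minute from collected call records and raises an alarm message when a configured threshold is exceeded.

package main

import "fmt"

// SpanRecord is a simplified view of one collected call record.
type SpanRecord struct {
    Service string
    Minute  int64 // minute bucket derived from the span timestamp
    IsError bool
}

// errorCountsPerMinute aggregates error counts by (service, minute bucket).
func errorCountsPerMinute(records []SpanRecord) map[string]map[int64]int {
    counts := make(map[string]map[int64]int)
    for _, r := range records {
        if !r.IsError {
            continue
        }
        if counts[r.Service] == nil {
            counts[r.Service] = make(map[int64]int)
        }
        counts[r.Service][r.Minute]++
    }
    return counts
}

// checkThresholds returns alarm messages for every bucket exceeding the threshold.
func checkThresholds(counts map[string]map[int64]int, threshold int) []string {
    var alarms []string
    for svc, byMinute := range counts {
        for minute, n := range byMinute {
            if n > threshold {
                alarms = append(alarms, fmt.Sprintf("service %s: %d errors in minute %d", svc, n, minute))
            }
        }
    }
    return alarms
}

func main() {
    records := []SpanRecord{
        {Service: "order", Minute: 100, IsError: true},
        {Service: "order", Minute: 100, IsError: true},
        {Service: "order", Minute: 100, IsError: false},
    }
    for _, a := range checkThresholds(errorCountsPerMinute(records), 1) {
        fmt.Println(a)
    }
}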

  4. Overall deployment architecture

  • The AGENT generates the call chain logs.

  • Logstash collects the logs into Kafka.

  • Kafka provides the data for downstream consumers.

  • Storm computes aggregated metrics and writes them to ES.

  • Storm also extracts the trace data and writes it to ES to support more complex queries. For example, the call-chain view can quickly find all the TraceIDs matching a time range, and those TraceIDs can then be used to look up the data in HBase quickly.

  • Logstash also pulls the raw data from Kafka into HBase, using the TraceID as the HBase rowkey, so queries by TraceID are fast.

  5. Non-invasive AGENT deployment

Deploying the AGENT non-invasively separates performance measurement completely from the business logic and can measure the execution time of any method of any class, which greatly improves collection efficiency and reduces operation and maintenance cost. Based on the span of the services they cover, agents fall into two categories:

  • In-service agents, which use the Java Agent mechanism to collect method-level information inside a service, such as method execution time and input/output parameters.

  • Cross-service agents, which need to support mainstream RPC frameworks seamlessly in the form of plug-ins and accommodate custom RPC frameworks by providing a standard data specification:

(1) Dubbo support; (2) REST support; (3) custom RPC support.

  6. Call chain monitoring benefits
  • Accurately understand how first-line applications are deployed in production;

  • Identify the key call chains from the end-to-end performance of each chain and optimize them;

  • Provide traceable performance data to quantify the business value of the IT operations department;

  • Quickly locate code-level performance problems and help developers continuously optimize their code;

  • Help developers perform white-box testing and shorten the stabilization period after a system goes live.

4 Scheme Comparison

Most full-link monitoring products on the market base their theoretical model on the Google Dapper paper. This article focuses on the following three APM components:

  1. Zipkin: an open-source distributed tracing system developed by Twitter. It collects timing data from services to troubleshoot latency problems in a microservices architecture, and covers data collection, storage, lookup, and presentation.

  2. Pinpoint: an APM tool for large-scale distributed systems written in Java, open-sourced by a Korean team.

  3. Skywalking: an excellent Chinese-developed APM component; a system for business performance tracing, alerting, and analysis of distributed Java application clusters.

The comparison criteria extracted for the three full-link monitoring solutions are:

  1. Probe performance

  2. Collector scalability

  3. Comprehensiveness of call-chain data analysis

  4. Transparency to development and ease of switching on and off

  5. Completeness of the call chain application topology

4.1 Performance of the probe

After all, an APM is still just a tool: if enabling the link monitoring component cut throughput in half, that would be unacceptable. We ran load tests on Skywalking, Zipkin, and Pinpoint and compared the results with a baseline (no probe).

We chose a typical Spring-based application that includes Spring Boot, Spring MVC, a Redis client, and MySQL. When the application is monitored, the probe captures 5 spans per trace (1 Tomcat, 1 Spring MVC, 2 Jedis, 1 MySQL). This is essentially the same test application as SkywalkingTest.

Three levels of concurrent users were simulated: 500, 750, and 1000. Using JMeter, each thread sends 30 requests with a think time of 10 ms. The sampling rate is set to 1, i.e. 100%, which may differ from production. Pinpoint's default sampling rate setting is 20; it was set to 100% via the agent configuration file, and Zipkin likewise defaults to 1. Combining the setups (baseline plus three probes, at three concurrency levels), there are 12 test scenarios in total. Here is the summary table:

Comparison of probe performance

As the table above shows, among the three link-monitoring components, Skywalking's probe has the least impact on throughput, and Zipkin's sits in the middle. Pinpoint's probe affects throughput most noticeably: at 500 concurrent users, the test service's throughput drops from 1385 to 774, which is a significant impact. As for CPU and memory, in our load tests on internal servers the impact of all three probes was roughly within 10%.

4.2 Collector scalability

The collector should be able to scale horizontally to support large server clusters.

  1. zipkin

  2. skywalking

Skywalking's collector supports two deployment modes, standalone and cluster, and the collector and agent communicate over gRPC.

  3. pinpoint

Similarly, Pinpoint supports both cluster and standalone deployment. The Pinpoint agent sends the link data to the collector via the Thrift communication framework.

4.3 Comprehensive call link data analysis

Comprehensive call-chain data analysis provides code-level visibility, making it easy to locate failures and bottlenecks.

  1. zipkin

Zipkin's link monitoring granularity is relatively coarse. As the figure above shows, the call chain only goes down to the interface level; deeper call information is not captured.

  2. skywalking

Skywalking also supports more than 20 middleware products, frameworks, and class libraries, such as the mainstream Dubbo and OkHttp, as well as databases and messaging middleware. The Skywalking link call analysis captured in the figure above is relatively simple: a gateway calling the User service. Thanks to its support for numerous middleware, Skywalking's link call analysis is more complete than Zipkin's.

  3. pinpoint

Pinpoint should be the most complete of the three APM components in terms of data analysis. It provides code-level visibility, making it easy to locate failures and bottlenecks; in the figure above you can see that the executed SQL statements are all recorded. You can also configure alarm rules and assign a person in charge to each application, so that alerts are sent according to the configured rules. Its set of supported middleware and frameworks is also relatively complete.

4.4 Transparency to development and ease of switching

The component should be transparent to development and easy to switch on and off: new capabilities should be added without modifying code, and enabling or disabling them should be simple. In other words, we expect the functionality to work without code changes while still providing code-level visibility.

To this end, Zipkin uses modified class libraries and its own container (Finagle) to provide distributed transaction tracing, which means the code has to be changed where needed. Skywalking and Pinpoint are both based on bytecode enhancement: developers do not need to modify their code, and because bytecode carries more information, the collected data can be more precise.

4.5 Complete call chain application topology

Automatically detect the application topology to help you understand the application architecture.

Pinpoint link topology

Skywalking link topology

Zipkin link topology

The three figures above show the call topologies produced by each APM component; all of them can build a complete application topology from the call chain. Comparatively, Pinpoint's interface shows richer detail, down to the DB name, whereas Zipkin's topology is limited to the service-to-service level.

4.6 Detailed comparison of Pinpoint and Zipkin

4.6.1 Differences between Pinpoint and Zipkin

  1. Pinpoint is a complete performance monitoring solution: a full system covering the probe, collector, storage, and web UI. Zipkin, by contrast, focuses only on the collector and the storage service; although it does have a user interface, its capabilities are not in the same league as Pinpoint's. Instead, Zipkin exposes a Query interface, on which more powerful user interfaces and system integrations can be built through secondary development.

  2. Zipkin officially provides instrumentation based on the Finagle framework (Scala); instrumentation for other frameworks is contributed by the community. Zipkin currently supports mainstream development languages and frameworks such as Java, Scala, Node, Go, Python, Ruby, and C#. Pinpoint, however, currently has only the official Java agent probe; the rest still depends on community support (see issues #1759 and #1760).

  3. Pinpoint provides a Java agent probe that intercepts calls and collects data through bytecode injection, achieving true non-invasiveness: you only need to add a few startup parameters to the server to deploy the probe. Zipkin's Java instrumentation, Brave, only provides a basic operational API; if you need to integrate it with a framework or project, you have to add configuration files or code manually.

  4. Pinpoint's back-end storage is based on HBase, while Zipkin's is based on Cassandra.

4.6.2 Similarities between Pinpoint and Zipkin

Both Pinpoint and Zipkin are based on the Google Dapper paper, so their theoretical foundations are roughly the same. Both divide a service call into several spans with cascading relationships, link the call relationships through a SpanId and a ParentSpanId, and finally aggregate all the spans flowing through the whole call chain into one trace, which is reported to the server-side collector for collection and storage.

Even so, the concepts Pinpoint adopts are not entirely consistent with the paper. For example, it uses a TransactionId in place of a TraceId, and what it calls a TraceId is actually a structure containing the TransactionId, SpanId, and ParentSpanId. Pinpoint also adds a SpanEvent structure under the Span to record the call details inside a span (such as specific method calls), so by default it records more fine-grained trace data than Zipkin. In theory, however, there is no limit to the granularity of a span: a service call can be a span, and so can each method call within a service, so Brave could in principle also trace at the method-call level; its implementation simply does not.

4.6.3 Bytecode injection vs API calls

Pinpoint implements its Java agent probe through bytecode injection, while Zipkin's Brave framework only provides an application-layer API, but the problem is far from that simple. Bytecode injection is a blunt but simple solution: in theory any method call can be intercepted by injecting code, so nothing is impossible, only not yet implemented. Brave, on the other hand, offers application-level APIs whose interception depends on support from the framework's underlying drivers. For example, the MySQL JDBC driver provides a way to inject an interceptor, so you only need to implement the StatementInterceptor interface and configure it in the connection string. By contrast, older MongoDB drivers or Spring Data MongoDB implementations have no such interface, so intercepting the query statements there is difficult.

So Brave is weak on this point: however difficult bytecode injection may be, it is at least always achievable, whereas Brave sometimes simply has no way in, and whether and to what extent it can instrument depends more on the framework's API than on its own capability.

4.6.4 Difficulty and Cost

After a quick read through the Pinpoint and Brave plug-in code, you will find that the implementation difficulty of the two differs enormously. Brave is easier to use than Pinpoint even without any development documentation. Brave's code base is very small, and its core functionality is concentrated in the brave-core module; an average developer can read it in a day and come away with a very clear understanding of the structure of the API.

Pinpoint's code is also well organized, especially the upper-layer API wrapped around bytecode injection, but it still requires the reader to understand something about bytecode injection. Although the core injection API is not large, understanding it thoroughly probably means digging into the Agent-related code; for example, it is hard to tell the difference between addInterceptor and addScopedInterceptor when both methods sit inside the Agent's related types.

Because Brave's instrumentation relies on interfaces exposed by the underlying framework, it does not require a complete understanding of that framework, only of where it can be instrumented and what data is available at that point. As in the example above, we do not need to know how the MySQL JDBC driver is implemented in order to intercept SQL. Pinpoint is different: because it can inject almost any code anywhere, its plug-in developers need a very deep understanding of the implementation of the library being instrumented, which you can see by looking at its MySQL and HTTP client plug-ins. Of course, this also shows from another angle that Pinpoint's capability really is very powerful, and many of its default plug-ins perform very fine-grained interception.

When the underlying framework exposes no API, Brave is not completely helpless: we can take an AOP approach and still inject the relevant interception into specified code, and AOP is clearly much simpler than bytecode injection.

These points translate directly into the cost of implementing monitoring. Pinpoint's official technical documentation gives a reference figure: to integrate one system, developing a Pinpoint plug-in costs 100 while integrating that plug-in into the system costs 0; for Brave, plug-in development costs only 20 but integration costs 10 per system. On this basis the official development-cost ratio is 5:1. The documentation stresses, however, that if 10 systems need to be integrated, Brave's total cost becomes 10 × 10 + 20 = 120 while Pinpoint's development cost remains 100, and the more services need integrating, the wider the gap grows.

4.6.5 Versatility and Extensibility

Here Pinpoint is clearly at a disadvantage, as can be seen from the integration interfaces developed by the community.

Pinpoint's data interface lacks documentation and is not very standard (see the forum discussion threads), so implementing your own probe (for Node or PHP, say) requires reading a great deal of code. In addition, the team chose Thrift as the data transfer protocol for performance reasons, which is much harder to work with than HTTP and JSON.

4.6.6 Community Support

Needless to say, Zipkin, developed by Twitter, comes from a star team, while Naver's team is relatively little known (as the #1759 discussion also shows). Although the Pinpoint project is unlikely to disappear or stop being updated in the short term, it is not as safe a choice as the former. And without more community-developed plug-ins, it is difficult for Pinpoint to complete integrations with so many frameworks on the strength of its own team alone, whose current effort is still focused on improving performance and stability.

4.6.7 Other

Pinpoint was implemented with performance concerns in mind from the start. Some of the back-end services of www.naver.com handle more than 20 billion requests a day, so the team chose Thrift's binary variable-length encoding and UDP as the transport, and tries to use reference dictionaries when passing constants, sending a number instead of a string, and so on. These optimizations also add complexity to the system, including the difficulty of using the Thrift interface, UDP transport issues, and the need to register constants in the data dictionary.

Zipkin, by contrast, uses the familiar RESTful interface plus JSON, with almost no learning cost or integration difficulty; as long as you know the data transfer structure, you can easily develop instrumentation for a new framework.

Another point is request sampling. Under heavy production traffic it is obviously not feasible to record every request, so requests must be sampled to decide which ones to record. Both Pinpoint and Brave support a sampling percentage, i.e. the proportion of requests that get recorded. Beyond that, Brave also provides a Sampler interface for custom sampling strategies, which is especially useful when doing A/B testing; the sketch below illustrates the idea.
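
As a rough, generic illustration of what such a sampling strategy amounts to (a Go sketch of the concept, not Brave's actual Sampler API), a percentage sampler simply decides per trace whether to record it, and a custom strategy can layer extra rules on top:

package main

import (
    "fmt"
    "math/rand"
)

// Sampler decides whether a given trace should be recorded.
// The interface is an illustration of the concept only.
type Sampler interface {
    IsSampled(traceID int64) bool
}

// percentageSampler records roughly `rate` (0.0 to 1.0) of all traces.
type percentageSampler struct{ rate float64 }

func (s percentageSampler) IsSampled(traceID int64) bool {
    return rand.Float64() < s.rate
}

// abTestSampler is a custom strategy: always record traces flagged as part of
// an experiment, and fall back to percentage sampling for everything else.
// The experiment set here is a hypothetical stand-in for real A/B-test tagging.
type abTestSampler struct {
    experiment map[int64]bool
    fallback   Sampler
}

func (s abTestSampler) IsSampled(traceID int64) bool {
    if s.experiment[traceID] {
        return true
    }
    return s.fallback.IsSampled(traceID)
}

func main() {
    s := abTestSampler{
        experiment: map[int64]bool{42: true},
        fallback:   percentageSampler{rate: 0.1}, // record ~10% of ordinary traffic
    }
    fmt.Println(s.IsSampled(42), s.IsSampled(7))
}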

4.6.8 Summary

In the short term, Pinpoint does have an overwhelming advantage: probes can be deployed without changing a line of project code, trace data is fine-grained down to the method-call level, the user interface is powerful, and its Java framework support is nearly comprehensive. In the long run, however, the cost of learning Pinpoint's development interfaces, and of implementing interfaces for other frameworks in the future, is still unknown. By contrast, Brave is relatively easy to master, and Zipkin's community is stronger and more likely to produce more instrumentation in the future. In the worst case, we can add our own monitoring code in an AOP style that suits us without introducing too many new technologies and concepts. Moreover, as the business changes in the future, it is hard to say whether Pinpoint's official plug-ins will keep meeting our requirements, and adding new ones would bring unpredictable difficulty and workload.

5 How Tracing Differs from Monitoring

Monitoring can be divided into system monitoring and application monitoring. System monitoring covers overall system load data such as CPU, memory, network, and disk, and can be refined down to the data of each individual process; this kind of information can be read directly from the system. Application monitoring requires support from the application itself, which must expose the corresponding data, such as the QPS of requests inside the application, request-processing latency, the number of request-processing errors, message queue lengths, crashes, process garbage collection, and so on. The main goal of monitoring is to detect anomalies and raise alarms in time.

Tracing has the call chain as its base and core; most of the related metrics are analyzed around the call chain. The main goal of tracing is system analysis: finding problems ahead of time beats solving them afterwards.

Tracing and application-level monitoring share much of their technology stack: both involve data collection, analysis, storage, and presentation. The dimensions of the data they collect differ, however, and so the analysis process differs as well.