In the theory part, we introduced the research work in the field of call chain tracing. This part continues with Banyu's call chain tracing practice. Before the formal introduction, a bit of background: in 2015, when Banyu's server side was first built, the technical team decided to use the Go language uniformly. The impact of this decision is mainly reflected in:

  1. Internal infrastructure does not require cross-language support
  2. Technology selection carries a certain language bias

1. Early practice

1.1 Integrating Jaeger

In 2019, the number of microservices in the company gradually increased and the call relationships became increasingly complex, making it harder for engineers to do performance analysis and troubleshooting. There was an urgent need for a call chain tracing system to help us build an overall picture of the server side. After investigation, we decided to adopt Jaeger, a project incubated by CNCF and also built in Go. At that time, our service development and governance framework had not yet introduced context, and there was no context propagation either within or across processes. Therefore, the focus of the early introduction of call chain tracing fell on retrofitting the services and the service governance framework, including:

  • Inject context on both the server and client sides of HTTP, Thrift, and gRPC (a minimal sketch of the HTTP server side follows this list)
  • Add instrumentation at database, cache, and message queue access points
  • Provide a command-line tool to quickly inject context into existing project repositories
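To give a concrete picture of the retrofit, here is a minimal sketch of the server-side HTTP piece using the OpenTracing Go API. The actual in-house governance framework middleware differs, and the package and function names here are illustrative only:

```go
package middleware

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// Tracing wraps an http.Handler: it extracts the upstream span context from
// the request headers (if any), starts a server-side span, and puts it into
// the request context so downstream instrumentation can create child spans.
func Tracing(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tracer := opentracing.GlobalTracer()

		var opts []opentracing.StartSpanOption
		if spanCtx, err := tracer.Extract(
			opentracing.HTTPHeaders,
			opentracing.HTTPHeadersCarrier(r.Header),
		); err == nil {
			opts = append(opts, ext.RPCServerOption(spanCtx))
		}

		span := tracer.StartSpan(r.Method+" "+r.URL.Path, opts...)
		defer span.Finish()

		ctx := opentracing.ContextWithSpan(r.Context(), span)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```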

For deployment, the all-in-one setup is used in the test environment and direct-to-storage in the production environment. The whole process took about a month, and we launched the first version of the call chain tracing system in Q3 2019. Together with the already widely adopted Prometheus + Grafana and ELK stacks, the observability of our microservice cluster finally had all three elements: call chains, logs, and metrics.

The figure below shows the data reporting path of the first version of the call chain tracing system. Services run in containers and are instrumented through the OpenTracing API with Jaeger's Go SDK; the SDK reports spans to the jaeger-agent on the host, which forwards the data to jaeger-collector, and finally the call chain data is written into ES and indexed by the Jaeger backend.

1.2 Problems encountered

In the theory part, we introduced that Jaeger supports three sampling strategies:

  • Const: sample everything or nothing
  • Probabilistic: sample with a fixed probability
  • Rate Limiting: ensure that each process collects at most K traces in a fixed time window (a configuration sketch follows this list)
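For reference, below is a minimal sketch of how such a rate-limiting sampler might be configured with jaeger-client-go. The service name, agent address, and parameter values are illustrative, not our production configuration:

```go
package main

import (
	"io"
	"log"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// initTracer builds a tracer with a rate-limiting sampler:
// at most 1 trace per second per process.
func initTracer(service string) (opentracing.Tracer, io.Closer) {
	cfg := jaegercfg.Configuration{
		ServiceName: service,
		Sampler: &jaegercfg.SamplerConfig{
			Type:  jaeger.SamplerTypeRateLimiting, // "ratelimiting"
			Param: 1,                              // traces per second
		},
		Reporter: &jaegercfg.ReporterConfig{
			LocalAgentHostPort: "127.0.0.1:6831", // jaeger-agent on the host
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	opentracing.SetGlobalTracer(tracer)
	return tracer, closer
}

func main() {
	_, closer := initTracer("demo-service")
	defer closer.Close()
}
```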

These sampling strategies all belong to head-based coherent sampling, and we discussed their advantages and disadvantages in the theory part. In Banyu's production environment, the rate-limiting strategy is used: each process collects at most 1 trace per second. Although this strategy saves resources, its disadvantages gradually became apparent during online troubleshooting:

  1. A single process serves multiple interfaces: with either fixed-probability or rate-limited sampling, low-traffic interfaces almost never get their call chains collected and are effectively starved.
  2. Online service errors are low-probability events, and the probability that the exact request causing an error is sampled is even smaller, so the call chains we most need for troubleshooting are usually lost, while the information actually collected is of little value.

2. Call chain path transformation

2.1 Application Scenarios

In 2020, we kept receiving the same feedback from business R&D: can we collect traces in full?

This prompted us to rethink how to improve the call chain tracing system, and we started with a simple capacity estimate: Jaeger currently writes close to 100GB/day into ES. If trace data were collected in full, conservatively assuming a total QPS of 100 across all HTTP API services, storing the full data would take about 10TB/day. Optimistically assuming that each of our 100 server-side developers looks at one trace per day, with an average trace size of 1KB, the overall signal-to-noise ratio would be roughly one in ten million. The ROI of full collection is therefore very low, and since the business will keep growing, the value of storing all this data will only keep shrinking, so the full-collection scheme was abandoned. Taking a step back: is full collection really the essential requirement? Not really. What we actually want is for **"interesting"** traces to be collected and **"boring"** traces to be dropped.

According to the application scenarios introduced in the theory part, the first version of the call chain tracing system in fact only supports steady-state analysis, while business R&D urgently needs anomaly detection. Supporting both scenarios requires tail-based coherent sampling: rather than deciding whether to sample at the first span, tail sampling makes the judgment after (nearly) complete trace information has been gathered. Ideally, as long as the sampling judgment logic is formulated reasonably, the call chain tracing system can collect all of the "interesting" traces and reject the "boring" ones.

2.2 Architecture Design

Jaeger's team has been discussing the possibility of introducing tail-based sampling since 2017, but no conclusion has been reached so far. In a one-on-one conversation with Jaeger engineer jpkrohling, he confirmed that Jaeger has no plan to support it, so we had to find another way. After some research we found OpenTelemetry, which had just entered the CNCF Sandbox; its opentelemetry-collector subproject made it feasible to introduce tail-based coherent sampling on top of our existing architecture.

2.2.1 OpenTelemetry Collector

The OpenTelemetry project as a whole aims to unify the telemetry data standards on the market and to provide components and tools that drive the adoption of those standards. opentelemetry-collector is an important component of this ecosystem, and its architecture is shown below:

There are four core components inside the Collector:

  • Receivers: receive telemetry data in different formats; for traces these include Zipkin, Jaeger, OpenCensus, and OpenTelemetry's own OTLP. Data in the above formats can also be consumed from Kafka
  • Processors: implement processing logic such as batching, filtering, transformation, and sampling; tail sampling logic can be implemented here
  • Exporters: output telemetry data in a specified format to backend services such as Zipkin, Jaeger, OpenCensus backends, Kafka, or another tier of collectors
  • Extensions: provide plug-ins outside the core pipeline, such as pprof for performance analysis, health check endpoints for health monitoring, and so on

We can assemble one or more Pipelines using different Receivers, Processors, Exporters.

The project is split into two repositories: the main project opentelemetry-collector and the community contribution project opentelemetry-collector-contrib. The former hosts the core logic, data structures, and common Receivers, Processors, Exporters, and Extensions; the latter receives contributions from the community, including observability SaaS providers such as DataDog, Splunk, and LightStep, as well as some public cloud vendors. The plug-in style component model keeps the cost of developing a custom Receiver, Processor, or Exporter very low; during our proof of concept we could usually develop and test a component within one or two hours. In addition, opentelemetry-collector-contrib provides an out-of-the-box tailsamplingprocessor.

opentelemetry-collector is an important component in the OpenTelemetry team's effort to drive adoption of the standard. OpenTelemetry itself does not provide an independent data storage or indexing solution, so a lot of work has gone into compatibility with the call chain tracing frameworks already popular on the market, such as Zipkin, Jaeger, and OpenCensus. With the appropriate Receivers and Exporters, we can use it to replace jaeger-agent, or place it between jaeger-agent and jaeger-collector; if necessary, multiple layers of collectors can be deployed between them. Moreover, if one day we want to replace the Jaeger backend, for example with the newly released Grafana Tempo, we can switch over easily by configuring additional Pipelines and Exporters.

2.2.2 Reporting path

Based on the above research and our existing architecture, we designed the second version of the call chain tracing architecture, as shown in the figure below:

We replace jaeger-agent with opentelemetry-collector instances deployed on each host (otel-agent in the figure), and add another group of opentelemetry-collectors between otel-agent and jaeger-collector, namely otel-collector in the figure. otel-agent collects the trace data reported by the services on its host and sends it in batches to otel-collector, which performs tail-based coherent sampling and forwards only the **"interesting"** traces to the jaeger-collector of the original architecture; the latter writes the data into ElasticSearch and builds the index.

There is one more problem to solve here: to make the whole architecture highly available, multiple otel-collector instances must be deployed. With a naive load balancing strategy, spans reported by different otel-agents, or by the same otel-agent at different times, could land on any otel-collector instance. Spans belonging to the same trace would then be scattered across instances, so the sampling decision would not be coherent, and each instance would only see incomplete data when making the tail sampling judgment. The solution is simple: have otel-agent load-balance by traceID, so that all spans of a trace reach the same otel-collector instance.
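The idea behind traceID-aware load balancing can be illustrated with a small sketch. This is a conceptual model only, not how the loadbalancingexporter is actually implemented, and the backend addresses are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickCollector hashes the traceID and always routes all spans of that trace
// to the same otel-collector backend, which is what makes the downstream
// tail sampling decision coherent.
func pickCollector(traceID string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"otel-collector-0:4317", "otel-collector-1:4317", "otel-collector-2:4317"}
	for _, id := range []string{"7c3e0d2a", "7c3e0d2a", "b91f44aa"} {
		// identical traceIDs always map to the same backend
		fmt.Println(id, "->", pickCollector(id, backends))
	}
}
```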

During the research phase we happened to notice that the opentelemetry-collector community had plans for exactly this, and the aforementioned engineer jpkrohling was addressing it by adding a loadbalancingexporter. Although going straight to a bleeding-edge version was risky, we decided to give it a try. During the proof-of-concept phase we did find several problems with the new functionality, but we solved them one by one through feedback (#1621) and contributions (#1677, #1690), and ended up with a call chain tracing system that performs tail-based coherent sampling as expected.

2.3 Sampling Rules

With a tail-coherent sampling data path in place, the next step is to define and enforce sampling rules.

2.3.1 “Interesting” call chain

What is an "interesting" call chain? Any call chain that a developer wants to look up during analysis and troubleshooting is an "interesting" call chain. However, the rules implemented in code must be deterministic. Based on day-to-day troubleshooting experience, we first settled on the following three conditions:

  • An ERROR-level log was printed somewhere on the call chain
  • A database query slower than 200ms occurred on the call chain
  • The overall request on the call chain took more than 1s

A call chain is considered "interesting" if any one of these conditions is met. At Banyu, any ERROR-level log printed by a service triggers an alarm, and the R&D engineers receive an IM message or a phone call. If the call chain data behind every alarm is guaranteed to be available, their troubleshooting experience improves greatly. Our DBA team treats every query over 200ms as a slow query; guaranteeing the call chains of these requests makes it much easier for R&D to locate the requests causing slow queries. For online services, excessive latency degrades user experience, but we do not yet have data on exactly where the degradation becomes significant, so we set the threshold at 1s and made it adjustable at any time.

Of course, the conditions above are not cast in stone. We can adjust them and add new rules based on feedback from future practice, for example when the number of database or cache queries issued by a single request exceeds a certain threshold.
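As a sketch of how these rules can be expressed in code, a predicate might look like the following. The span type and field names are simplified stand-ins, not the real SDK or collector data model, and the thresholds are the ones from the text:

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified stand-in for a reported span.
type Span struct {
	Operation string
	Kind      string // e.g. "db", "http"
	Duration  time.Duration
	Tags      map[string]interface{}
}

// isInteresting encodes the three rules above: an error tag on any span,
// a database query slower than 200ms, or a whole request slower than 1s.
func isInteresting(root Span, spans []Span) bool {
	if root.Duration > time.Second {
		return true
	}
	for _, s := range spans {
		if v, ok := s.Tags["error"]; ok && v == true {
			return true
		}
		if s.Kind == "db" && s.Duration > 200*time.Millisecond {
			return true
		}
	}
	return false
}

func main() {
	spans := []Span{{Operation: "SELECT users", Kind: "db", Duration: 350 * time.Millisecond, Tags: map[string]interface{}{}}}
	root := Span{Operation: "GET /api/v1/user", Duration: 600 * time.Millisecond}
	fmt.Println("interesting:", isInteresting(root, spans)) // true: slow DB query
}
```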

2.3.2 Sampling pipeline

In the second version of the system, we expected to support steady-state analysis and anomaly detection at the same time. Therefore, part of the traces should be collected probabilistically or via rate limiting, and part should be collected according to the "interesting" rules formulated above. As of this writing, tailsamplingprocessor supports four policies:

  • always_sample: sample all traces
  • numeric_attribute: a numeric attribute falls within [min_value, max_value]
  • string_attribute: a string attribute is in the set [value1, value2, ...]
  • rate_limiting: limits traffic by the number of spans per second, controlled by the spans_per_second parameter

Of these, numeric_attribute, string_attribute, and rate_limiting are potentially useful to us.

Could rate_limiting be used to "collect part of the traces"? rate_limiting can only limit by the number of spans passed per second, but that number differs between peak and off-peak hours and across business phases, and the average number of spans per trace also changes as the microservice cluster and its dependencies evolve. Setting spans_per_second therefore makes it hard to predict the final effect of the parameter, so the idea of using rate_limiting directly was rejected.

Could "collecting another part of the traces by rule" be implemented directly with numeric_attribute and string_attribute? Our instrumentation marks spans with SetTag("error", true), but tailsamplingprocessor does not support a bool_attribute policy; in addition, we may face more complex composite conditions in the future that numeric_attribute and string_attribute alone cannot express.

After repeated analysis, we finally decided to use the chained structure of Processors and combine multiple Processors to complete the sampling. The pipeline is as follows:

probattr is responsible for probabilistic sampling at the trace level, and anomaly is responsible for checking whether each trace matches the "interesting" rules; if either of the two hits, the trace is marked with a sampling attribute. Finally, a rule like the following is configured on tailsamplingprocessor:

Here sampling.priority is an integer that currently takes the values 0 and 1. With the above configuration, all traces with sampling.priority = 1 are collected. More priority levels can be added later, and upsampling or downsampling can be applied when necessary.
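For illustration, here is a minimal, self-contained Go sketch of the decision logic described above; probattr, anomaly, and the numeric_attribute policy are modeled as plain functions, not real collector processors:

```go
package main

import (
	"fmt"
	"math/rand"
)

// markSamplingPriority mimics the combined effect of the probattr and anomaly
// processors: a trace gets sampling.priority = 1 if it wins the probabilistic
// draw or matches the "interesting" rules, and 0 otherwise.
func markSamplingPriority(prob float64, interesting bool) int {
	if rand.Float64() < prob || interesting {
		return 1
	}
	return 0
}

// keepTrace mimics tailsamplingprocessor's numeric_attribute policy applied
// to sampling.priority with [min_value, max_value] = [1, 1].
func keepTrace(priority int) bool {
	return priority >= 1 && priority <= 1
}

func main() {
	p := markSamplingPriority(0.001, true) // an "interesting" trace
	fmt.Println("sampling.priority =", p, "kept:", keepTrace(p))
}
```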

3. Deployment and implementation

With sampling rules in place, the entire solution is ready to go, and the next step is deployment.

3.1 Preparations before rollout

3.1.1 Base library transformation

Dynamically update the Tracer

In the first version, each process read a sampling configuration from Apollo at startup and passed it to the Jaeger SDK, which used it to initialize the GlobalTracer. The GlobalTracer decides whether to sample when the first span of a trace is created and propagates that decision downstream, i.e. head-based coherent sampling. Under the new architecture we need the Jaeger SDK to report all trace data. To make this transition smoother, we made two changes to how the Jaeger SDK is configured:

  • Support a separate sampling configuration for each service, to facilitate grayscale release
  • Support dynamically updating the sampling configuration, so that sampling policy changes do not require a new release (a minimal sketch of the tracer swap follows this list)
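Below is a minimal sketch of the dynamic update, assuming a hypothetical configuration struct; the Apollo watch API and the handling of in-flight spans are deliberately omitted:

```go
package main

import (
	"io"
	"log"
	"sync"

	opentracing "github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// SamplerConf is a hypothetical stand-in for the per-service sampling
// configuration kept in the config center; it is not a real Apollo SDK type.
type SamplerConf struct {
	Type  string  // "const", "probabilistic", "ratelimiting"
	Param float64 // meaning depends on Type
}

var (
	mu         sync.Mutex
	prevCloser io.Closer
)

// applySamplerConf rebuilds the tracer from the latest sampling configuration
// and swaps it in as the global tracer, closing the previous one.
// NOTE: a real implementation must also take care of spans still in flight.
func applySamplerConf(service string, c SamplerConf) {
	cfg := jaegercfg.Configuration{
		ServiceName: service,
		Sampler:     &jaegercfg.SamplerConfig{Type: c.Type, Param: c.Param},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Printf("rebuild tracer: %v", err)
		return
	}
	opentracing.SetGlobalTracer(tracer)

	mu.Lock()
	defer mu.Unlock()
	if prevCloser != nil {
		prevCloser.Close()
	}
	prevCloser = closer
}

func main() {
	applySamplerConf("demo-service", SamplerConf{Type: "ratelimiting", Param: 1})
	// in practice this would be registered as a config-change callback, e.g.
	// watchSamplerConfig("demo-service", applySamplerConf) // hypothetical
}
```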
Log library transformation

To ensure that call chains on which ERROR-level logs were printed can hit the sampling rule, we also tag the span with error at the appropriate place in the common logging library.
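A minimal sketch of what the logging library's error path might additionally do, using the OpenTracing Go API; the function name is illustrative and the actual log writing is omitted:

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	otlog "github.com/opentracing/opentracing-go/log"
)

// Error sketches the extra work the common logging library does besides
// writing the ERROR-level log line: it marks the current span as an error so
// that the trace matches the "printed an ERROR-level log" sampling rule.
func Error(ctx context.Context, msg string) {
	// ... write the ERROR-level log line as before ...

	if span := opentracing.SpanFromContext(ctx); span != nil {
		ext.Error.Set(span, true) // sets the standard "error" tag
		span.LogFields(
			otlog.String("event", "error"),
			otlog.String("message", msg),
		)
	}
}

func main() {
	// the default GlobalTracer is a no-op tracer, enough to exercise the helper
	span, ctx := opentracing.StartSpanFromContext(context.Background(), "demo")
	defer span.Finish()
	Error(ctx, "something went wrong")
}
```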

3.1.2 Monitoring dashboard configuration

opentelemetry-collector internally instruments many useful metrics with the OpenCensus SDK and exposes them following the OpenMetrics specification. Since there are not many of them, we configured most into Grafana dashboards, including:

  • Receivers: obsreport_receiver
  • Processors: obsreport_processor tailsamplingprocessor
  • Exporters: obsreport_exporter

Here are a few metrics that we consider important in practice:

xxx_receiver_accepted/refused_spans

Here xxx refers to any receiver used in the pipeline. These are actually two metrics: the number of spans the receiver accepted and the number of spans it refused. Combined with other metrics, they help analyze whether ingress traffic is the system's current bottleneck.

xxx_exporter_send(failed)_spans

Here xxx refers to any exporter used in the pipeline. These are actually two metrics: the number of spans the exporter sent successfully and the number it failed to send. Combined with other metrics, they help analyze whether egress traffic is the system's current bottleneck.

otelcol_processor_tail_sampling_sampling_trace_dropped_too_early

To understand this metric, you need a rough idea of how tailsamplingprocessor works. In a distributed environment, tailsamplingprocessor can never be sure that all spans of a trace have arrived at a given point in time, so it uses a timeout, decision_wait; assume decision_wait = 5s. After trace data enters the processor, it is put into two data structures:

a fixed-size queue and a hash table, which together implement an LRU cache of trace data. At the same time, the processor groups all incoming traces into one batch per second and keeps decision_wait = 5 batches internally. Every second it takes out the oldest batch, checks whether its traces match the sampling rules, and passes the matching ones on to the processors that follow:

If, when the sampling decision is made, the corresponding trace has already been evicted from the LRU cache, it is counted as "trace dropped too early", which means the tailsamplingprocessor is overloaded. In theory, if this metric is non-zero, the tail-based coherent sampling is in an abnormal state.
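To make the mechanism concrete, below is a highly simplified Go model of this buffering: an LRU cache of traces, per-second batches, and a dropped-too-early counter. All types, names, and numbers are illustrative and do not mirror the real tailsamplingprocessor code:

```go
package main

import (
	"container/list"
	"fmt"
)

type trace struct {
	id    string
	spans []string
}

type tailSampler struct {
	maxTraces       int
	lru             *list.List               // front = most recently used trace
	byID            map[string]*list.Element // traceID -> LRU node
	batches         [][]string               // one batch of traceIDs per second
	droppedTooEarly int                      // the metric discussed above
	sample          func(trace) bool         // the sampling policies
}

func newTailSampler(decisionWaitSeconds, maxTraces int, sample func(trace) bool) *tailSampler {
	return &tailSampler{
		maxTraces: maxTraces,
		lru:       list.New(),
		byID:      make(map[string]*list.Element),
		batches:   make([][]string, decisionWaitSeconds),
		sample:    sample,
	}
}

// consume adds incoming spans to the LRU cache and registers new traces in the newest batch.
func (t *tailSampler) consume(tr trace) {
	if el, ok := t.byID[tr.id]; ok {
		el.Value.(*trace).spans = append(el.Value.(*trace).spans, tr.spans...)
		t.lru.MoveToFront(el)
		return
	}
	if t.lru.Len() >= t.maxTraces { // cache full: evict the least recently used trace
		oldest := t.lru.Back()
		delete(t.byID, oldest.Value.(*trace).id)
		t.lru.Remove(oldest)
	}
	t.byID[tr.id] = t.lru.PushFront(&tr)
	t.batches[len(t.batches)-1] = append(t.batches[len(t.batches)-1], tr.id)
}

// tick runs once per second: decide on the oldest batch, then rotate batches.
func (t *tailSampler) tick(forward func(trace)) {
	for _, id := range t.batches[0] {
		el, ok := t.byID[id]
		if !ok {
			t.droppedTooEarly++ // evicted before its decision time
			continue
		}
		if tr := el.Value.(*trace); t.sample(*tr) {
			forward(*tr)
		}
		t.lru.Remove(el)
		delete(t.byID, id)
	}
	t.batches = append(t.batches[1:], nil) // drop the oldest, add a fresh batch
}

func main() {
	s := newTailSampler(5, 1000, func(tr trace) bool { return len(tr.spans) > 1 })
	s.consume(trace{id: "t1", spans: []string{"spanA", "spanB"}})
	for i := 0; i < 5; i++ { // after decision_wait ticks the trace is decided
		s.tick(func(tr trace) { fmt.Println("sampled:", tr.id) })
	}
	fmt.Println("dropped too early:", s.droppedTooEarly)
}
```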

3.2 Grayscale release plan

As mentioned above, the change requires the Jaeger SDK to report traces in full. The decision to "report in full" is made at the request entry service (i.e. the HTTP service) and is propagated to downstream services along with cross-process calls. Since Banyu's server-side entry services are already split by business line, the grayscale rollout could proceed entry by entry: start with low-traffic, low-criticality entry services and observe them online, then gradually add higher-traffic, higher-criticality entries, and finally turn on full sampling by default, discovering and solving potential problems along the way.

3.3 Optimization of resource consumption

The resources required for the new architecture are not much different from the old one:

  • Storage: ES index
  • Computing: otel-agent deployed on each K8s node, plus the (new) otel-collector cluster

We did a full risk assessment before gradually switching on all entry services. After full collection is enabled, the main increase is in host network I/O; with gigabit network adapters (about 300MB/s), the added I/O volume is far from the bottleneck. The rollout was indeed transparent to the business teams, but during the grayscale process we did find and solve a few problems.

3.3.1 Hotspot Service Problems

Different services handle different request volumes, and the excessive reporting of a few services made traffic unbalanced across otel-agents; during peak hours the CPU usage of some otel-agents often exceeded the warning threshold. We balanced the traffic carried by each otel-agent indirectly, by adding instances of the hotspot services and thus reducing the reporting volume of any single instance.

3.3.2 Filter push-down

In the production environment we retain the last 7 days of trace data by default. Analyzing the jaeger-span-* indices in ES, we see, not surprisingly, a power-law distribution:

Careful analysis reveals that more than 50% of the spans belong to apolloConfigCenter.*. Anyone familiar with Apollo development knows that the Apollo SDK usually pulls configuration via long polling and caches it locally, while the application reads configuration from the local cache. So all of these apolloConfigCenter.* spans are in fact local accesses rather than cross-process calls, and their value is low enough to be ignored. We therefore developed a processor that filters spans by matching spanName with a regular expression and deployed it on otel-agent; we call this filter push-down. After it went online, the ES index volume dropped by more than 50%, to 40-50GB per day, and the CPU consumption of otel-collector and otel-agent also dropped by nearly 50%.
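Conceptually, the filter we pushed down to otel-agent does something like the following; the span type and function are simplified stand-ins, since the real processor works on the collector's internal data model:

```go
package main

import (
	"fmt"
	"regexp"
)

// span is a minimal stand-in for the collector's span type.
type span struct {
	Name string
}

// filterSpans drops every span whose name matches the given pattern, e.g.
// the apolloConfigCenter.* spans that turned out to be pure local cache hits.
func filterSpans(spans []span, drop *regexp.Regexp) []span {
	kept := spans[:0] // filter in place, reusing the backing array
	for _, s := range spans {
		if !drop.MatchString(s.Name) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	drop := regexp.MustCompile(`^apolloConfigCenter\.`)
	spans := []span{
		{Name: "apolloConfigCenter.getConfig"},
		{Name: "mysql.Query"},
	}
	for _, s := range filterSpans(spans, drop) {
		fmt.Println("kept:", s.Name) // only mysql.Query remains
	}
}
```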

3.4 Defining SLOs

At Banyu, the first entry point for investigating online problems is usually an IM message. The alarm platform injects the traceID and the log that triggered the alarm into the message, and provides a link to a page where R&D can quickly check the call chain related to the alarm and the logs printed by every service on that chain. On this basis, we defined the SLO of the new call chain tracing system:

At present we have only just added the instrumentation for this SLI in the alarm platform, and some services have not yet upgraded their dependencies, so the metric does not reflect the situation of all services. We will continue to push for a comprehensive upgrade of all services and hold the new system to the SLO above.

4. Summary

With the help of open source projects, we were able to meet the needs of Banyu's internal call chain tracing users for both steady-state analysis and anomaly detection with very little manpower, while also making small contributions back to the open source projects and the community. Call chain tracing is an important component of an observability platform, and we will continue to invest part of our effort in telemetry data integration, to provide R&D with a more comprehensive and consistent service observation and analysis experience.

5. References

  • Call chain tracing at Banyu: the theory part
  • Building your own OpenTelemetry Collector distribution
  • A brief history of OpenTelemetry (So Far)
  • OpenTelemetry Agent and Collector: Telemetry Built-in into All Services
  • Github: OpenTelemetry-CNCF
    • opentelemetry-collector
    • opentelemetry-collector-contrib