Distributed Tracing Matters

By Tobias Schmidt

The Nuggets Translation Project

Permanent link to this article: github.com/xitu/gold-m…

Translator: ItzMiracleOwO

Proofread by: jaredliw, KimYangOfCat


Problems in monolithic applications are mostly easy to troubleshoot. You can follow the execution path and investigate errors and bottlenecks in detail. Sometimes a quick look at a stack trace is enough to figure out what's wrong with your business process.

This is not the case for distributed systems, especially serverless cloud architectures. A single process often invokes several functions or orchestrates messages generated by an event-driven process. Debugging such a complex distributed system can be difficult. Correlating requests and actions is not an easy task, but you often need to do so to find errors, inconsistencies, or bottlenecks.

This is why you need distributed tracing that covers your whole team structure and ecosystem. It helps enormously by giving you a detailed look at all the code executed for a specific request or triggered business process.

In this article, I want to introduce you to the OpenTracing API and the W3C Trace Context specification, and show how to build your own custom integration that creates and extends trace contexts for the tracing tool of your choice.

Argument: why distributed tracing is important
OpenTracing API: a vendor-independent framework
Concepts: spans, scopes, and threading
Explore further: the traceparent and tracestate headers
Practice: initialize your tracer and open and close spans
Key takeaways

Why is distributed tracing essential?

Modern software architectures are often composed of multiple systems. A single request to one system can cause many operations and business processes across the ecosystem. Sometimes they are not even synchronous, but event-driven and completely decoupled.

This makes debugging a tedious, complex task. Moreover, it is not easy to observe a single transaction. You can’t rely on a stack trace for a single system because the code executes on multiple systems.

Looking at some examples of fictional architectures on AWS, we see that a single external trigger can cause a lot of code to execute in different systems.

Data is written to DynamoDB, and the changes are forwarded through Streams.
A message is added to an SQS queue and is later processed by another Lambda function.
An external service is invoked, which in turn generates an incoming request for the next step.

Although the operations appear to be decoupled, they are usually associated with the same trigger and are therefore coupled from a business perspective. You usually need to correlate each operation involved in the entire business process to solve the problem.

This is especially true if you are working on a multi-team project in which each product is developed by an independent team. Typically, each team monitors only the resources and microservices it is responsible for. But what if a customer complains about slow requests while no team reports any performance bottleneck? Without correlating requests across teams, you will not be able to investigate such issues in detail.

The bottom line: there are all sorts of reasons why you want to use distributed tracing.

An overview

If you are tracing a request across multiple systems, you need to collect data for each step along the route.

In our example, the browser submits the first request and thereby starts the tracing context for the entire operation. It needs to send the request itself and attach more information about the context so that your trace collector can correlate the requests later. All subsequent systems can extend the context by adding details about the code they execute.

Each system involved in the process needs to send the details about the code it executed to a central processing instance, which we can use later to analyze the results.

OpenTracing API

To quote the introduction:

OpenTracing consists of an API specification, frameworks and libraries that implement the specification, and project documentation. OpenTracing allows developers to add instrumentation to their application code using APIs that are not specific to any particular product or vendor.

Although it is not a standard, it is widely used by many frameworks and services, and the OpenTracing API allows you to create custom implementations that follow the guidelines of the specification.

Concepts

This section introduces the core concepts and terminology. We'll take a detailed look at spans, scopes, and threading.

Spans

Spans are the basic building block of distributed tracing. A span represents a specific unit of executed code or work. According to the OpenTracing specification, a span contains:

An operation name
A start and a finish timestamp
A set of tags (for querying) and logs (for span-specific messages)
A span context

This means that every component in our ecosystem that is involved in a request should contribute at least one span. Since spans can reference other spans, we can use them to build a complete stack trace that covers all operations of a single request. There is no limit to the granularity of a span. We can use one anywhere, whether for an entire complex process or for a single function or operation.
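As a rough illustration of the fields listed above, a span can be modelled as a plain object. This is only a sketch: the `createSpan` and `finishSpan` helpers and all field names are made up for this example and are not part of the OpenTracing API.

```javascript
const { randomBytes } = require('crypto');

// Hypothetical helper: creates a plain-object span with the fields listed above.
function createSpan(operationName, parentContext = null) {
  return {
    operationName,
    startTime: Date.now(),
    finishTime: null,
    tags: {},   // key/value pairs used for querying
    logs: [],   // timestamped, span-specific messages
    context: {
      // A child span stays in the trace of its parent.
      traceId: parentContext ? parentContext.traceId : randomBytes(16).toString('hex'),
      spanId: randomBytes(8).toString('hex'),
    },
    // Reference to the parent span, or null for a root span.
    parentSpanId: parentContext ? parentContext.spanId : null,
  };
}

// Hypothetical helper: records the finish timestamp.
function finishSpan(span) {
  span.finishTime = Date.now();
  return span;
}

const root = createSpan('checkout');
const child = createSpan('charge-credit-card', root.context);
child.tags['http.method'] = 'POST';
child.logs.push({ timestamp: Date.now(), message: 'payment provider called' });
finishSpan(child);
finishSpan(root);
```

Because the child references the span ID of its parent and shares its trace ID, a collector can rebuild the whole tree from the individual spans.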

Scopes and Threading

A dedicated thread in an application can have only one active span at a time, called the ActiveSpan. This does not mean that we cannot have multiple spans, but the other spans are blocked or waiting in the meantime.

When a new span is generated, the currently active span automatically becomes its parent span unless otherwise specified.
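The parent rule can be sketched with a simple stack, assuming single-threaded execution. The `SpanStack` class below is a made-up illustration and not the scope management API of any OpenTracing library.

```javascript
// Hypothetical sketch: the span on top of the stack is the active span.
class SpanStack {
  constructor() {
    this.stack = [];
    this.nextId = 1;
  }

  get active() {
    return this.stack.length > 0 ? this.stack[this.stack.length - 1] : null;
  }

  // The currently active span becomes the parent unless one is given explicitly.
  start(name, parent = this.active) {
    const span = { id: this.nextId++, name, parentId: parent ? parent.id : null };
    this.stack.push(span);
    return span;
  }

  // Finishing the active span makes its predecessor active again.
  finish() {
    return this.stack.pop();
  }
}

const spans = new SpanStack();
const request = spans.start('handle-request'); // no active span yet, so no parent
const query = spans.start('db-query');         // parent: handle-request
spans.finish();                                // handle-request is active again
const render = spans.start('render');          // parent: handle-request
```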

Explore further

The context is propagated between systems via two HTTP headers: traceparent and tracestate. Together they contain all the information needed to correlate the related spans. Both are explained in detail in the W3C Trace Context specification.

traceparent: identifies the request in a tracing system, in a vendor-independent format.
tracestate: carries vendor-specific information about the request.

The traceparent header

The traceparent header carries four pieces of information: version, trace identifier, parent identifier, and flags.

Version – Identifies the version, currently 00.
TraceID – Unique identifier of the distributed trace.
ParentID – Identifier of the request as known by the caller.
Trace flags – Used to specify options such as sampling or trace levels.

The tracing system needs the traceparent header to correlate our requests and aggregate them into a multi-span trace.
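To make the four fields concrete, here is a minimal sketch of a parser for version 00 of the header format. The `parseTraceparent` function is a name chosen for this example, and the sample value is the one commonly used in the W3C specification. A production parser would also need to follow the specification's rules for future versions.

```javascript
// Sketch of a traceparent parser for the version 00 layout:
// version-traceid-parentid-flags, all in lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // The specification treats version ff and all-zero identifiers as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    flags,
    // The lowest bit of the flags field is the sampled flag.
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
```

Returning `null` for anything malformed lets the caller fall back to starting a new trace instead of continuing a broken one.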

The tracestate header

The tracestate header accompanies the traceparent header and adds vendor-specific trace information.

Take a look at an example from New Relic:

We can identify the parent, the span's timestamp, and our vendor. Apart from the general shape of the header, a comma-separated list of key=value entries, there are no set rules about what the tracestate header carries, so its content can vary greatly depending on the tracing tool you use.
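Whatever a vendor puts into it, the header itself is a comma-separated list of key=value members according to the W3C Trace Context specification. The following sketch splits such a header into its members without interpreting the values. The `parseTracestate` function is a name chosen for this example, and `rojo` and `congo` are the placeholder vendor keys used in the specification's examples, not New Relic values.

```javascript
// Sketch: split a tracestate header into [key, value] pairs, keeping their order.
// Values are opaque vendor data, so they are not interpreted here.
function parseTracestate(header) {
  return (header || '')
    .split(',')
    .map((member) => member.trim())
    .filter((member) => member.includes('='))
    .map((member) => {
      const index = member.indexOf('=');
      return [member.slice(0, index), member.slice(index + 1)];
    });
}

const members = parseTracestate('rojo=00f067aa0ba902b7,congo=t61rcWkgMzE');
```

The order is kept because the specification expects the most recently updated vendor entry to come first.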

Practice

Now that we've introduced these concepts, let's put them into practice and see how you can initialize a tracer and extend a trace by manually opening and closing spans.

Although distributed tracing tools provide agents for many different languages and frameworks, the following information may be important if you need to extend the tracing manually.

Let's look at a basic scenario in which the system we want to add to a distributed trace executes only minor business logic and calls another external service.

Our example is very simple, and so are the basic steps we must take:

If there is no trace yet, we initialize a new one (the root span).
We create a new span for each process boundary (such as an external call) on each system that handles the request or operation.
We close each span we opened when we're done.
We submit the span details to our trace collector system.

We can open as many spans as needed in a single system. We just need to make sure we nest them properly and close each of them. So we need to keep track of the span stack manually.

Set up our tracer

If we want to do this manually, what do we need first? We need a stack of the spans that our system has opened and therefore needs to close.

If no traceparent header arrives at our system, we can easily start a new trace by generating the identifiers ourselves. If we already have one, we extend the trace by opening a new span.

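const { randomBytes } = require('crypto');

// Helper (one possible implementation): returns `length` random hex characters.
const randHex = (length) => randomBytes(length / 2).toString('hex');

// Read the incoming traceparent header, if there is one.
const traceParent = RequestContext.getHeader('traceparent');
let version = "00";
let traceId;
let spanId;
let flags = "00";

if (!traceParent) {
  // No incoming trace: start a new one with a fresh 32-character trace ID.
  traceId = randHex(32);
  spanId = "";
} else {
  // Continue the existing trace: version-traceid-parentid-flags.
  const split = traceParent.split("-");
  version = split[0];
  traceId = split[1];
  spanId = split[2];
  flags = split[3];
}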

Now that we have the root span, we can proceed to create spans for operations or processes at a custom granularity.

Create a new span

As mentioned earlier, in order to track our spans across a single system, we need a stack to hold the spans we have opened.

I prefer to use a request context with two dedicated objects:

A root span object that holds the details of the trace itself: the version, the trace identifier, and the flags
A stack of objects that each hold the necessary details of an open span: the timestamp when it was opened and the span's identifier

Our root span is only used to keep track of our trace identifier. Most importantly, we keep track of all the spans we have opened and remove each span when its operation or process is complete.
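The request context itself is not shown in this article. A minimal in-memory sketch, assuming one context per request and a single thread of execution, could look like the following. Apart from the method names that appear in the article's snippets, everything here (for example `setRootSpan` and the internal fields) is an assumption made for illustration.

```javascript
// Hypothetical in-memory request context; a real one would be scoped to a
// single request instead of being a module-level object.
const RequestContext = {
  headers: {},
  root: null,   // trace-level details: version, traceId, spanId, flags
  stack: [],    // spans opened by this system, most recent last

  getHeader(name) { return this.headers[name.toLowerCase()]; },
  setHeader(name, value) { this.headers[name.toLowerCase()] = value; },

  setRootSpan(root) { this.root = root; },
  getRootSpan() { return this.root; },

  addSpan(spanName, details) { this.stack.push({ spanName, ...details }); },
  popSpan() { return this.stack.pop(); },

  // Falls back to the root span when no other span is open.
  getTopSpan() {
    return this.stack.length > 0 ? this.stack[this.stack.length - 1] : this.root;
  },
};

RequestContext.setRootSpan({ version: '00', traceId: 'a'.repeat(32), spanId: '', flags: '00' });
RequestContext.addSpan('outer', { spanId: '1'.repeat(16), startTime: Date.now() });
RequestContext.addSpan('inner', { spanId: '2'.repeat(16), startTime: Date.now() });
const closed = RequestContext.popSpan();    // removes 'inner'
const parent = RequestContext.getTopSpan(); // 'outer' is the parent of 'inner'
```

Keeping the root span outside the stack but returning it from `getTopSpan` when the stack is empty means the first span opened in this system gets the incoming parent identifier as its parent.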

const openNewSpan = (spanName) => {
  // Span identifiers are 16 hex characters long.
  const spanId = randHex(16);
  const startTime = Date.now();
  // Push the span onto the stack of open spans.
  RequestContext.addSpan(spanName, { spanId, startTime });
};

Close the span and submit the trace information

When we need to close the span at the end of an operation or process, we pop it from the top of the stack. We can then calculate everything we need from the information saved in the span and the information left on the stack.

The parent span identifier: this is the identifier of the span that is now at the top of the stack.
The duration of the operation: the difference between the current timestamp and the timestamp we saved when opening the span.
The trace identifier: it is stored in our root span, which is always at the bottom of the stack.

In this example, we submit the span information to New Relic in their own format. This may vary depending on the tracing tool you use.

const axios = require('axios');

const closeSpan = (spanName) => {
  // The span being closed is on top of the stack.
  const closed = RequestContext.popSpan();
  const spanId = closed.spanId;
  const duration = Date.now() - closed.startTime;
  // After the pop, the span now on top of the stack is the parent.
  const parentId = RequestContext.getTopSpan().spanId;
  // The trace identifier is stored in the root span.
  const traceId = RequestContext.getRootSpan().traceId;
  // Point the traceparent header at the span that is now on top of the stack.
  updateTraceParent();
  submitTraceInformation(traceId, spanName, spanId, parentId, duration);
};

const updateTraceParent = () => {
  const top = RequestContext.getTopSpan();
  const root = RequestContext.getRootSpan();
  const spanId = top.spanId;
  const traceId = root.traceId;
  const version = root.version;
  const flags = root.flags;
  // Header format: version-traceid-parentid-flags
  const traceParent = [version, traceId, spanId, flags].join("-");
  RequestContext.setHeader("traceparent", traceParent);
};

const submitTraceInformation = (traceId, spanName, spanId, parentId, duration) => {
  // Payload in New Relic's own trace format.
  const data = JSON.stringify([
    {
      common: {
        attributes: {
          "service.name": "register-service",
          host: "mydomain.com"
        }
      },
      spans: [
        {
          "trace.id": traceId,
          id: spanId,
          attributes: {
            "parent.id": parentId,
            "duration.ms": duration,
            name: spanName
          }
        }
      ]
    }
  ]);

  const config = {
    method: 'post',
    url: 'https://trace-api.newrelic.com/trace/v1',
    headers: {
      'Api-Key': 'NRII-xc......... 77m5P6O',
      'Content-Type': 'application/json',
      'Data-Format': 'newrelic',
      'Data-Format-Version': '1'
    },
    data
  };

  axios(config)
    .then(res => console.log(JSON.stringify(res.data)))
    .catch(err => console.log(err));
};

That's it. All you need to do now is call the openNewSpan and closeSpan functions wherever you need them. It makes sense to wrap this in annotations so that you only need to annotate the method or process you want to trace, and the open and close operations are invoked automatically.
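JavaScript has no built-in annotations, but a higher-order function gives a similar effect. The following is a minimal sketch: `traced` is a hypothetical wrapper, and `openNewSpan` and `closeSpan` are replaced by stand-ins that only record their calls so that the example runs on its own.

```javascript
// Stand-ins for the openNewSpan and closeSpan functions, recording calls only.
const events = [];
const openNewSpan = (spanName) => events.push(`open:${spanName}`);
const closeSpan = (spanName) => events.push(`close:${spanName}`);

// Hypothetical wrapper: opens a span, runs the function, and always closes
// the span, even when the function throws or its promise rejects.
const traced = (spanName, fn) => async (...args) => {
  openNewSpan(spanName);
  try {
    return await fn(...args);
  } finally {
    closeSpan(spanName);
  }
};

// Usage: wrap the function once, then call it as usual.
const registerUser = traced('register-user', async (email) => {
  return { email, registered: true };
});
```

The `finally` block is the important design choice here: a span that is never closed would stay on the stack and become the wrong parent for every span opened afterwards.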

Key takeaways

Building distributed systems is a complex task, but it can be done quickly with cloud providers such as AWS, Azure, or GCP and with advanced infrastructure-as-code tools such as CloudFormation or the Serverless Framework. Keep in mind that you need distributed tracing to analyze the performance of your system and to debug problems methodically.

This article introduced you to the Trace Context standard and how it can be used to trace requests across systems and services.

Thank you for reading.

