Author | Ya Hai

The value of full-link tracing

The value of link tracing lies in association. End users, back-end applications, and cloud components (such as databases and messaging) together form the trace topology. The wider this topology, the more value link tracing delivers. Full-link tracing is the best practice of covering all associated IT systems, so that the call path and status of each user action are recorded completely as it passes between systems.

Complete full-link tracing brings three core benefits to the business: end-to-end problem diagnosis, inter-system dependency mapping, and custom tag passthrough.

  • End-to-end problem diagnosis: A VIP customer fails to place an order, or an internal beta user's requests time out. Many end-user experience problems like these can be traced back to anomalies in back-end applications or cloud components, and full-link tracing is the most effective way to diagnose them end to end.
  • Inter-system dependency mapping: When new services go online, old services are retired, a data center is migrated, or the architecture is upgraded, the dependencies between IT systems are too complex to map by hand. Topology discovery based on full-link tracing makes decisions in these scenarios more agile and reliable.
  • Custom tag passthrough: Full-link stress testing, user-level canary (grayscale) release, order tracing, and traffic isolation all depend on it. Tiered processing and data association based on custom tags have spawned a thriving full-link ecosystem. However, if the link breaks and tags are lost, unpredictable logic failures can follow.

Challenges and solutions for full-link tracing

The value of full-link tracing is proportional to its coverage, and so are its challenges. To maximize link integrity, front-end applications and cloud components, Java and Go applications, and public clouds and self-built data centers must all follow the same link specification and interconnect their data. Unifying the multi-language protocol stack, linking the front end, back end, and cloud (multi-terminal linkage), and fusing data across clouds are the three major challenges of full-link tracing, as shown in the following figure:

1. Unified multi-language protocol stack

In the cloud-native era, multi-language application architectures are increasingly common, and using the strengths of different languages to get the best performance and R&D experience has become a trend. However, because languages differ in maturity, full-link tracing cannot offer fully consistent capabilities across them. The mainstream industry approach today is to unify the format at the remote call protocol layer and let each language implement call interception and context passthrough (transparent transmission) internally, which ensures the integrity of basic link data.

However, basic link tracing alone is not enough to locate and solve the vast majority of online problems. Because online systems are complex, an excellent Trace product must provide more comprehensive and effective diagnostic capabilities, such as code-level diagnosis, memory analysis, thread pool analysis, and lossless statistics. Making full use of the diagnostic interfaces each language provides, and getting the most out of each language's capabilities, is the foundation for Trace's continued development.

  • Standardized transparent transmission protocol: All applications on a full link must follow the same transparent transmission protocol standard so that the link context passes intact between applications written in different languages, with no broken links or lost context. The mainstream open source transparent transmission protocols currently include Jaeger, SkyWalking, and Zipkin.
  • Maximizing each language's capabilities: Beyond the basic call chain, link tracing has gradually grown higher-level capabilities such as application/service monitoring, method stack tracing, and performance profiling. However, differences in language maturity lead to large differences in product capabilities. For example, Java probes can implement many advanced edge-side diagnostics based on JVMTI. An excellent full-link tracing solution makes the most of each language's distinctive technical advantages rather than settling for uniform mediocrity. For more on this, see the article “Open Source/Hosted vs. Commercial Trace: How to Choose”.

2. Front-end, back-end, and cloud (multi-terminal) linkage

At present, open source link tracing focuses mainly on the back-end business application layer and has no effective instrumentation for user terminals or cloud components (such as cloud databases). The main reason is that the latter two are usually provided by cloud service providers or third-party vendors, so coverage depends on how well those vendors support open source standards, and it is difficult for the business side to take part in development directly.

The direct result is that when a front-end page responds slowly, it is hard to pinpoint the responsible back-end application or service, so no definite root cause can be given. Similarly, a cloud component anomaly cannot be directly tied to an anomaly in a specific business application. In particular, when multiple applications share the same database instance, more indirect verification methods are required, and troubleshooting is very inefficient.

To solve such problems, cloud service providers first need to support open source link standards better: instrument core methods, and support passthrough of open source protocol stacks and data backflow (for example, Alibaba Cloud ARMS front-end monitoring supports Jaeger protocol passthrough and method stack tracing).

Second, because different systems may have different owners, it may be impossible to unify the protocol stack across the whole link. To achieve multi-terminal linkage, the Trace system therefore needs to provide a way to bridge heterogeneous protocol stacks.

Bridging heterogeneous protocol stacks

To bridge heterogeneous protocol stacks (Jaeger, SkyWalking, and Zipkin), the Trace system needs two capabilities. The first is protocol stack conversion with dynamic configuration. For example, suppose the front end passes the Jaeger protocol downstream, while a newly connected external downstream system uses the Zipkin B3 protocol. A Node.js application between the two can receive the Jaeger protocol and pass on the Zipkin protocol, which keeps full-link tag passthrough intact. The second is data format conversion on the server side, which converts different data formats into a unified format for storage, or handles compatibility on the query side. The former has relatively low maintenance costs, while the latter has higher compatibility costs but is more flexible.
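
The header conversion step for the Node.js application in the middle might look roughly like the sketch below. It assumes the incoming request carries Jaeger's `uber-trace-id` header and that the downstream system expects Zipkin B3 multi-header propagation (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`, `X-B3-Flags`). The function name `jaegerToB3` is hypothetical and is not part of ARMS, Jaeger, or Zipkin. A real service would normally also replace the span ID with that of its own span before forwarding.

```javascript
// Sketch: translate an incoming Jaeger header into Zipkin B3 multi-header form.
// jaegerToB3 is a hypothetical helper, not an ARMS, Jaeger, or Zipkin API.
function jaegerToB3(incomingHeaders) {
  const raw = incomingHeaders["uber-trace-id"];
  if (typeof raw !== "string") return {};

  // Jaeger value: {trace-id}:{span-id}:{parent-span-id}:{flags}, all hex.
  const parts = raw.toLowerCase().split(":");
  if (parts.length !== 4 || !parts.every((p) => /^[0-9a-f]+$/.test(p))) {
    return {};
  }
  const [traceId, spanId, parentSpanId, flagsHex] = parts;
  const flags = parseInt(flagsHex, 16);

  // B3 expects 16 or 32 hex characters for the trace ID and 16 for span IDs;
  // pad on the left in case the sender omitted leading zeros.
  const b3 = {
    "X-B3-TraceId": traceId.padStart(traceId.length > 16 ? 32 : 16, "0"),
    "X-B3-SpanId": spanId.padStart(16, "0"),
  };

  // Jaeger uses 0 for "no parent"; B3 simply omits the header.
  if (!/^0+$/.test(parentSpanId)) {
    b3["X-B3-ParentSpanId"] = parentSpanId.padStart(16, "0");
  }

  // Jaeger flag bit 0x02 = debug, 0x01 = sampled. In B3, debug implies sampled,
  // so only one of the two headers is sent.
  if (flags & 2) {
    b3["X-B3-Flags"] = "1";
  } else {
    b3["X-B3-Sampled"] = flags & 1 ? "1" : "0";
  }
  return b3;
}

const outgoing = jaegerToB3({
  "uber-trace-id": "4bf92f3577b34da6a3ce929d0e0e4736:f067aa0ba902b7:0:1",
});
console.log(outgoing);
```

Keeping the conversion in a single pure function makes it straightforward to switch protocols per downstream target through configuration, which is the "dynamic configuration" part of the first capability.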

3. Cross-cloud data fusion

For the sake of stability or data security, many large enterprises choose multi-cloud deployment. For example, domestic systems are deployed on Alibaba Cloud, overseas systems are deployed on AWS, and systems involving sensitive internal data are deployed in self-built data centers. Multi-cloud deployment has become a typical cloud deployment architecture. However, network isolation between environments and differences in infrastructure also pose huge challenges for O&M personnel.

Cloud environments can communicate with each other only over the public network. To achieve link integrity in a multi-cloud deployment architecture, link data can be either reported across clouds or queried across clouds. Either way, the goal is unified visibility of multi-cloud data, so that problems can be located or analyzed quickly from complete link data.

Cross-cloud reporting

Cross-cloud reporting of link data is relatively easy to implement, maintain, and manage, and it is the mainstream approach adopted by cloud vendors at present. For example, Alibaba Cloud ARMS achieves multi-cloud data fusion through cross-cloud data reporting.

Cross-cloud reporting has the advantages of low deployment cost and only one server side to maintain. The disadvantage is that cross-cloud transmission consumes public network bandwidth, so public network traffic cost and stability are important constraints. Cross-cloud reporting is better suited to an architecture with one primary environment and several secondary ones, where most nodes are deployed in the same cloud environment and other clouds or self-built data centers carry only a small share of service traffic. For example, an enterprise's consumer-facing (toC) services are deployed in cloud X, while its internal applications are deployed in a self-built data center.

Cross-cloud query

Cross-cloud query means that the raw link data stays in the cloud network where it was produced. Each user query is dispatched to every cloud separately, and the results are then aggregated and processed in one place, which reduces transmission costs on the public network.

The advantage of cross-cloud query is that little data crosses the network. In particular, the volume of link data actually queried is usually less than 1/10,000 of the raw data volume, which greatly saves public network bandwidth. The disadvantage is that multiple data processing terminals need to be deployed, and complex calculations such as quantiles and global TopN are not supported. Cross-cloud query suits multi-primary architectures and can support simple link stitching and max/min/avg statistics.
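
To illustrate why max/min/avg work well with cross-cloud query while quantiles do not, the following minimal sketch has each cloud reduce its own span durations to a small summary and then merges the summaries at the query entry point. The function names and the summary shape are assumptions made for illustration, not an ARMS interface.

```javascript
// Runs inside each cloud: reduce local span durations to a small summary.
// Only this summary, not the raw spans, crosses the public network.
function summarize(durationsMs) {
  return {
    count: durationsMs.length,
    sum: durationsMs.reduce((acc, d) => acc + d, 0),
    min: Math.min(...durationsMs), // Infinity for an empty list
    max: Math.max(...durationsMs), // -Infinity for an empty list
  };
}

// Runs at the query entry point: combine the per-cloud summaries.
function mergeSummaries(summaries) {
  const merged = { count: 0, sum: 0, min: Infinity, max: -Infinity };
  for (const s of summaries) {
    merged.count += s.count;
    merged.sum += s.sum;
    merged.min = Math.min(merged.min, s.min);
    merged.max = Math.max(merged.max, s.max);
  }
  if (merged.count === 0) {
    return { count: 0, min: null, max: null, avg: null };
  }
  return {
    count: merged.count,
    min: merged.min,
    max: merged.max,
    avg: merged.sum / merged.count, // exact, because sum and count are additive
  };
}

const cloudA = summarize([10, 20, 30]);
const cloudB = summarize([100]);
console.log(mergeSummaries([cloudA, cloudB]));
```

Count, sum, min, and max can be merged exactly from per-cloud partial results. An exact global median or P99, by contrast, cannot be derived from each cloud's own median or P99 alone, which is why such calculations need either the raw data or an approximate mergeable structure.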

There are two ways to implement cross-cloud query. One is to build a centralized set of data processing terminals inside the cloud network and connect it to user networks through dedicated private lines, so that it can process the data of multiple users at the same time. The other is to create a separate data processing terminal in a VPC for each user. The former has lower maintenance cost and greater capacity elasticity. The latter provides better data isolation.

Other modes

In addition to the two approaches above, a mixed mode or a transparent-transmission-only mode can be adopted in practice.

In mixed mode, statistical data is reported over the public network and processed centrally (small data volume, high precision required), while link data is retrieved through cross-cloud query (large data volume, low query frequency).

In transparent-transmission-only mode, only the link context is passed intact between cloud environments, while link data is stored and queried independently in each one. The advantage of this mode is its very low implementation cost: each cloud only needs to follow the same transparent transmission protocol, and the concrete implementations can be completely independent. Traces are then stitched together manually by a shared TraceId or application name. This mode suits rapid integration of existing systems with minimal retrofit cost.
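
Manual stitching by TraceId can be as simple as the sketch below. It assumes that spans have already been exported from each cloud's own trace store as plain objects. The field names `traceId`, `startMs`, `cloud`, and `name` are hypothetical and would depend on each store's export format.

```javascript
// Sketch: stitch one trace together from span lists exported by separate,
// independently operated trace stores. Field names are illustrative only.
function stitchTrace(traceId, ...spanStores) {
  return spanStores
    .flat() // merge the per-cloud lists into one
    .filter((span) => span.traceId === traceId) // keep only the requested trace
    .sort((a, b) => a.startMs - b.startMs); // order by reported start time
}

const cloudXSpans = [
  { traceId: "t1", cloud: "X", name: "gateway", startMs: 1000 },
  { traceId: "t2", cloud: "X", name: "gateway", startMs: 1500 },
];
const dataCenterSpans = [
  { traceId: "t1", cloud: "self-built", name: "inventory", startMs: 1040 },
];

console.log(stitchTrace("t1", cloudXSpans, dataCenterSpans).map((s) => s.name));
```

Ordering by timestamp is only approximate here, because clocks in different environments may drift. Parent span IDs, where each store preserves them, are a more reliable basis for rebuilding the call tree.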

Full-link tracing access practice

The preceding sections described the challenges full-link tracing faces in various scenarios and their solutions. Next, taking Alibaba Cloud ARMS as an example, this section shows how to build, from scratch, a complete observability system that spans the front end, gateway, server, containers, and cloud components.

  • Header transparent transmission format: Adopt the Jaeger format. The Key is uber-trace-id, and the Value is {trace-id}:{span-id}:{parent-span-id}:{flags}.
  • Front-end access: Use script injection (CDN) or NPM for low-code access. Web/H5, Weex, and various mini program scenarios are supported.
  • Back-end access:
    • Java applications: The ARMS Agent is recommended as the first choice. Its non-invasive instrumentation requires no code changes and supports advanced features such as edge diagnosis, lossless statistics, and precise sampling. User-defined methods can be instrumented explicitly through the OpenTelemetry SDK.
    • Non-Java applications: Access through Jaeger is recommended, reporting data to the ARMS endpoint. ARMS is compatible with link passthrough and display across multi-language applications.
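
To make the header format above concrete, here is a minimal sketch of how a service could parse an incoming `uber-trace-id` value and build the value it passes downstream. The helper names are hypothetical. In practice a Jaeger or OpenTelemetry SDK does this work, so the code only illustrates the wire format, including the convention that the lowest bit of `{flags}` marks a sampled trace.

```javascript
const TRACE_HEADER = "uber-trace-id";

// Parse "{trace-id}:{span-id}:{parent-span-id}:{flags}" into its fields.
// Returns null when the value does not match the format.
function parseUberTraceId(value) {
  let decoded;
  try {
    // Some clients URL-encode the value, so ":" may arrive as "%3A".
    decoded = decodeURIComponent(String(value));
  } catch (err) {
    return null; // malformed percent-encoding
  }
  const parts = decoded.split(":");
  if (parts.length !== 4 || !parts.every((p) => /^[0-9a-f]+$/i.test(p))) {
    return null;
  }
  const [traceId, spanId, parentSpanId, flagsHex] = parts;
  const flags = parseInt(flagsHex, 16);
  return { traceId, spanId, parentSpanId, flags, sampled: (flags & 1) === 1 };
}

function formatUberTraceId({ traceId, spanId, parentSpanId, flags }) {
  return [traceId, spanId, parentSpanId, flags.toString(16)].join(":");
}

// Build the header for an outgoing call: same trace ID and flags, a new span ID,
// and the incoming span recorded as the parent.
function childHeaders(incomingValue, newSpanId) {
  const parent = parseUberTraceId(incomingValue);
  if (parent === null) return {};
  const child = { ...parent, spanId: newSpanId, parentSpanId: parent.spanId };
  return { [TRACE_HEADER]: formatUberTraceId(child) };
}

console.log(
  childHeaders(
    "4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:0:1",
    "53995c3f42cd8ad8"
  )
);
```

The trace ID and flags stay constant along the whole link, which is what allows front-end, Java, and non-Java spans to be joined into a single call chain.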

The current full-link tracing solution of Alibaba Cloud ARMS is based on the Jaeger protocol, and SkyWalking protocol support is under development to enable lossless migration for users of self-built SkyWalking. The call chain produced by full-link tracing across the front end, Java applications, and non-Java applications is shown in the figure below:

1. Front-end access practice

ARMS front-end monitoring supports Web/H5, Weex, and Alipay and WeChat mini programs. This article uses CDN access for a web application as an example to briefly explain the access process. For detailed access guidelines, refer to the official ARMS front-end monitoring documentation.

  1. Log in to the ARMS console, click the Access center in the left navigation bar, and select front-end Web/H5 access.
  2. Enter the app name and click Create. Select the required options in the SDK extension configuration item area to quickly generate the BI probe code to be inserted into the page.
  3. Select asynchronous loading, copy the following code, paste it as the first line inside the <body> element of the HTML page, and then restart the application.

<script>
!(function(c,b,d,a){
  c[a]||(c[a]={});
  c[a].config={
    pid:"xxx",
    imgUrl:"https://arms-retcode.aliyuncs.com/r.png?",
    enableLinkTrace: true,
    linkType: 'tracing'
  };
  with(b)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("crossorigin","",src=d)
})(window,document,"https://retcode.alicdn.com/retcode/bl.js","__bl");
</script>

The probe code must contain the following two parameters to connect the front-end and back-end links:

  1. enableLinkTrace: true // Enables the front-end link tracing function
  2. linkType: 'tracing' // Generates link data in Jaeger format and passes the uber-trace-id header through

In addition, if the API is not same-origin with the current application, the enableApiCors: true parameter must be added to the configuration, and the back-end server must support cross-origin requests and custom request headers. For details, see the documentation on associating front-end and back-end links. To verify that the link tracing configuration has taken effect, open the browser console and check whether the uber-trace-id header appears in the Request Headers of the corresponding API request.
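
On the back-end side, supporting cross-origin requests with a custom header mainly means answering the browser's preflight request and listing `uber-trace-id` in `Access-Control-Allow-Headers`. The sketch below shows one way to do this in a plain Node.js request handler. The allowed origin, the permitted methods, and the function names are assumptions for illustration, and real services usually configure this in their gateway or web framework instead.

```javascript
const TRACE_HEADER = "uber-trace-id";
// Hypothetical front-end origin allowed to call this API.
const ALLOWED_ORIGINS = ["https://www.example.com"];

// Build CORS response headers that let the browser send the trace header.
function corsHeadersFor(origin) {
  if (!ALLOWED_ORIGINS.includes(origin)) return {};
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    "Access-Control-Allow-Headers": `Content-Type, ${TRACE_HEADER}`,
    Vary: "Origin",
  };
}

// Request handler with the (req, res) shape used by Node's http module.
function handleRequest(req, res) {
  const cors = corsHeadersFor(req.headers.origin);
  if (req.method === "OPTIONS") {
    // Preflight: reply with the CORS headers only.
    res.writeHead(204, cors);
    res.end();
    return;
  }
  res.writeHead(200, { ...cors, "Content-Type": "application/json" });
  // Echo the received trace header so the passthrough can be checked by hand.
  res.end(JSON.stringify({ traceHeader: req.headers[TRACE_HEADER] || null }));
}
```

To try it locally, the handler can be passed to `require("http").createServer(handleRequest).listen(8080)`. Node.js lower-cases incoming header names, which is why the handler reads `uber-trace-id` in lower case.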

2. Java application access practice

Java applications are advised to connect through the ARMS JavaAgent. The non-invasive probe works out of the box and requires no changes to business code. For detailed access guidelines, refer to the official ARMS application monitoring documentation.

  1. Log in to the ARMS console, click Access Center in the left navigation bar, and select back-end Java Access.
  2. Choose manual installation, script installation, or container service installation as required.
  3. Ensure that the probe has been downloaded and decompressed to the local PC, set the appName, LicenseKey, and JavaAgent startup parameters correctly, and restart the application.

3. Non-Java application access practice

Non-Java applications can use open source SDKs (such as Jaeger) to report data to the ARMS access point. For details, refer to the official ARMS application monitoring documentation.

  1. Log in to the ARMS console, click access center in the left navigation bar, and click to select the back-end Go/C++/.NET/Node.js and other access modes.
  2. Replace the access point as described in the operation guide, and then restart the application.

Full-link tracing is the beginning, not the end

Link tracing has been evolving for more than a decade since Google published the Dapper paper in 2010. Yet books and in-depth articles on the subject remain scarce, and most blog posts only introduce open source concepts or a quick start. For questions such as how a large enterprise can build a link tracing system that is genuinely useful and easy to use, which solutions are available, and which pitfalls to avoid, it is hard to find a systematic and comprehensive answer.

Full-link tracing access is only the starting point of tracing. Choosing a solution that fits your own business architecture helps avoid detours. But link tracing is more than viewing call chains and monitoring services. How can it empower the business upward, extending into business observability to support business decisions? How can it connect downward with infrastructure observability to detect resource risks in advance? There is still a lot of work to do, and we look forward to more people joining us.

Related links:

  1. ARMS front-end monitoring official documentation: help.aliyun.com/document_de…
  2. Documentation on associating front-end and back-end links: help.aliyun.com/document_de…
  3. ARMS application monitoring official documentation: help.aliyun.com/document_de…
  4. ARMS application monitoring official documentation: help.aliyun.com/document_de…
  5. ARMS console: arms.console.aliyun.com/?spm=ata.21…
  6. How to choose between open source self-built/hosted and commercial self-developed Trace?: mp.weixin.qq.com/s?spm=a2c6h…
