“Micro-service War” is a series about microservice design thinking. It focuses on the contradictions and conflict points that appear after moving to microservices, and does not go deep into any single topic. If you have questions or suggestions, please feel free to get in touch.

Background

After the previous battle of microservices (cascading failures and avalanches that led to a P0 incident), you are left exhausted. You start a post-mortem and recall how the investigation went: because there was no supporting infrastructure, you only started digging through error logs after customers reported the problem.

However, cascading failures produce far too many error logs. Logs from different services and different call chains are all mixed together, so most of the repair time is spent in the logs, paging through screen after screen to find the few entries that are actually useful.

If this happens again, the MTTR will be far too long and the four-nines availability budget will be used up quickly. At this point you think of a weapon often mentioned in the industry: the "distributed tracing system" (also called distributed link tracing). Roughly speaking, it lets you see the call dependencies between applications:

The most famous of these is Dapper, introduced in the Google Dapper paper. Google built this distributed tracing system to deal with the software complexity that comes from different teams, languages, and modules being deployed across different servers and data centers, which makes problems hard to analyze and locate:

Since then, the paper has inspired the industry's work on distributed tracing. Many well-known distributed tracing systems are built on the ideas in the Google Dapper paper, and their basic principles and architectures are largely the same, with only minor differences. If you are interested, the Dapper paper is well worth reading.

(The concepts of the trace tree and the Span both come from Google Dapper)
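To make the trace tree and Span concepts more concrete, here is a minimal sketch of how spans that share one trace ID can be linked into a tree through their parent IDs. The `Span` fields, the helper names, and the sample service names are assumptions made for illustration. They are not taken from the Dapper paper or from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified model)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    name: str
    children: List["Span"] = field(default_factory=list)


def build_trace_tree(spans: List[Span]) -> Span:
    """Link the spans of one trace into a tree and return the root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one line per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative spans for a single request; all names are made up.
spans = [
    Span("t1", "a", None, "frontend: GET /order"),
    Span("t1", "b", "a", "order-service: CreateOrder"),
    Span("t1", "c", "b", "db: INSERT orders"),
    Span("t1", "d", "a", "user-service: GetUser"),
]
tree = build_trace_tree(spans)
print("\n".join(render(tree)))
```

Real systems also record timestamps, tags, and logs on each span, but the parent/child link is what turns a flat set of spans into the call tree shown in a tracing UI.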

Which one to choose?

If you want to do distributed tracing, you will most likely pick an open source product as your tracing system, since building a new one from scratch is unlikely. A search on the Internet turns up a large number of products:

  • Twitter: Zipkin.
  • Uber: Jaeger.
  • Elastic Stack: Elastic APM.
  • Apache: SkyWalking (started by open source enthusiast Wu Sheng).
  • Naver: Pinpoint (developed in Korea).
  • Alibaba: EagleEye.
  • Dianping: CAT.
  • JD: Hydra.

Even a quick search shows that there are a great many of these products, and it is said that nearly every major company has its own internal tracing system. That raises an obvious question: if they all evolved from Google Dapper, how do they really differ, and why have so many new products appeared?

Jaeger

First, take a look at Jaeger. It was developed by Uber and is now hosted by the Cloud Native Computing Foundation (CNCF), where it was the seventh project to graduate (in October 2019):

  • Jaeger Client: A language-specific implementation of the OpenTracing API for Jaeger. It can be used to instrument applications for distributed tracing either manually or through existing open source frameworks that already integrate with OpenTracing (e.g. Flask, Dropwizard, gRPC).

  • Jaeger Agent: A daemon that listens on a UDP port for spans sent by the client and forwards them to the Collector in batches.

  • Jaeger Collector: As the name implies, it receives trace data from Agents and is responsible for collecting and managing it.

  • Jaeger Query: Serves data queries and the front-end UI.

  • Jaeger Ingester: Reads data from Kafka and writes it to another storage backend (Cassandra, Elasticsearch).

With the role of each Jaeger component in mind, we can focus on the data flow through the overall architecture:

Jaeger has a classic architecture. The client actively sends trace data to the Agent, the Agent reports it to the Collector, the data passes through a queue, and it finally lands in storage. A separate visual management UI is then used to view and analyze it.

What stands out is a standardized pipeline of reporting, collecting, storing, and analyzing. You will also find that Jaeger and Zipkin are architecturally similar:
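The report, collect, queue, store, and query flow described above can be sketched as a small in-memory pipeline. This is only an illustration of the data flow under simplifying assumptions: the class and function names are hypothetical, a `deque` stands in for Kafka, a dict stands in for Cassandra or Elasticsearch, and none of this reflects Jaeger's actual APIs or wire protocol.

```python
from collections import deque
from typing import Deque, Dict, List


class Collector:
    """Receives span batches and pushes them onto a queue (stand-in for Kafka)."""

    def __init__(self) -> None:
        self.queue: Deque[dict] = deque()

    def receive(self, batch: List[dict]) -> None:
        self.queue.extend(batch)


class Agent:
    """Buffers spans reported by the client and forwards them in batches."""

    def __init__(self, collector: Collector, batch_size: int = 2) -> None:
        self.collector = collector
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def report(self, span: dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.collector.receive(self.buffer)
            self.buffer = []


def ingest(queue: Deque[dict], storage: Dict[str, List[dict]]) -> None:
    """Drain the queue into storage, grouping spans by trace ID."""
    while queue:
        span = queue.popleft()
        storage.setdefault(span["trace_id"], []).append(span)


def query(storage: Dict[str, List[dict]], trace_id: str) -> List[dict]:
    """Return every stored span for one trace (empty list if unknown)."""
    return storage.get(trace_id, [])


collector = Collector()
agent = Agent(collector, batch_size=2)
storage: Dict[str, List[dict]] = {}

# The client reports three spans; the first two go out as one batch.
for name in ("gateway", "order-service", "db"):
    agent.report({"trace_id": "t1", "name": name})
agent.flush()  # push the span that is still buffered

ingest(collector.queue, storage)
print([s["name"] for s in query(storage, "t1")])
```

Batching in the agent keeps the client's reporting path cheap, and the queue between the collector and storage lets the backend absorb bursts, which matters most during exactly the kind of cascading failure that motivates tracing in the first place.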

  • Zipkin Collector: Collects and manages trace data.

  • Storage: Zipkin supports third-party storage such as Cassandra, Elasticsearch, and MySQL.

  • Zipkin Query Service: Once data is stored and indexed, this service is used to find and retrieve traces.

  • Web UI: Data queries and the front-end UI.

Jaeger came about four years after Zipkin, so it may look like reinventing the wheel. According to Uber's own write-up, the main reasons for building Jaeger were as follows:

The only way to send spans to Zipkin at the time was through Scribe, and the only high-performance data store Zipkin supported was Cassandra. At the time, Uber had no experience with either technology, so it chose to build its own back end, which combined custom components with the Zipkin UI to form a complete tracing system.

For more detail, see "Evolving Distributed Tracing at Uber Engineering".

Alibaba EagleEye

Other examples of tracing, built on logging and stream processing, are Alibaba's EagleEye and Didi's Trace, as shown below:

For more detail, see the conference talks "Alibaba Hawk-Eye Technology Decryption" and "Heterogeneous System Link Tracking: Didi Trace Practice", which are not repeated here. They are recommended reading for anyone who is interested.

Conclusion

In terms of language, Jaeger is written in Go, while Zipkin and SkyWalking belong to the Java family. All three are compatible with OpenTracing, but their architectures differ somewhat. Because they all derive from Google Dapper, what matters most in practice is the set of basic features each one supports and how polished its query UI is.

If you already have N existing systems, hooking them up directly to a new tracing system is quite troublesome. The whole point of adopting tracing is to troubleshoot and locate problems in the existing systems, not only in new ones. From an adoption standpoint, therefore, most teams will not use an off-the-shelf open source tracing system as is (unless their historical debt is small), and the data volume involved may also be huge.

As a result, it is quite common to build tracing by transforming and cleaning data that existing mechanisms already produce. Logs are often a good starting point: clean the relevant data, feed it into a new analysis system, and end up building another internal wheel.

In addition, service mesh-based "non-invasive" tracing has been gaining popularity in recent years and looks like a promising direction. A typical example is Istio, which uses CNCF's Jaeger. Since Jaeger is also compatible with Zipkin, Jaeger comes out ahead here.
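As a rough illustration of the log-based approach, the sketch below cleans interleaved log lines into per-trace records. The log format (`trace=`, `span=`, `parent=`, and `svc=` fields after an ISO timestamp) is an assumption made for this example. Real systems would need every service to write a shared trace ID into its logs first, and would normally do this cleaning in a streaming job rather than in memory.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical log line layout, assumed for illustration only.
LINE = re.compile(
    r"(?P<ts>\S+) trace=(?P<trace>\S+) span=(?P<span>\S+) "
    r"parent=(?P<parent>\S+) svc=(?P<svc>\S+) (?P<msg>.*)"
)


def group_by_trace(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Parse log lines and group them by trace ID, ordered by timestamp."""
    traces: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip lines that carry no trace context
        traces[m["trace"]].append(m.groupdict())
    for records in traces.values():
        # ISO-8601 timestamps in the same zone sort correctly as strings.
        records.sort(key=lambda r: r["ts"])
    return dict(traces)


# Interleaved lines from different services and requests (made-up data).
logs = [
    "2021-03-01T10:00:00.300 trace=t1 span=c parent=b svc=db insert failed",
    "2021-03-01T10:00:00.100 trace=t1 span=a parent=- svc=gateway request received",
    "2021-03-01T10:00:00.150 trace=t2 span=x parent=- svc=gateway request received",
    "service started without trace context",
    "2021-03-01T10:00:00.200 trace=t1 span=b parent=a svc=order create order",
]

traces = group_by_trace(logs)
for record in traces["t1"]:
    print(record["ts"], record["svc"], record["msg"])
```

Once the logs are grouped this way, a single failing request can be read in call order instead of being picked out of several pages of mixed output.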
