Service Mesh Virtual Meetup is an online series co-hosted by the ServiceMesher community and CNCF. This session is Service Mesh Virtual Meetup #1. We invited four guests from different companies to share their Service Mesh practice from different perspectives. The talks cover observability and production practice in Service Mesh, how observability in a mesh differs from observability in traditional microservices, how to use SkyWalking to observe a Service Mesh, and the production practices of Momo and Baidu.

This article is based on the talk on the observability application of Apache SkyWalking in Service Mesh, given on the evening of May 7 by Gao Hongtao, founding engineer at Tetrate, a US-based Service Mesh provider. Links to the video replay and the slides are included at the end of the article.

Preface

This talk shares how Apache SkyWalking is applied to Service Mesh observability. It is divided into three parts:

  • The first part is the background of Apache SkyWalking.
  • The second part is the challenges SkyWalking faces in the Service Mesh scenario.
  • The third part is the application scheme for the Service Mesh scenario and its evolution.

Historical evolution and characteristics of SkyWalking

The SkyWalking project was created to solve the problem of quickly locating system stability issues in a microservice environment. The founding team started the project in 2016 and spent a year refining the initial version. In 2017, the team began donating the project to the Apache Software Foundation. After several iterations of upgrades in the Apache Incubator, during which the number of contributors and the attention the project received nearly doubled, SkyWalking graduated in 2019. After years of upgrades and maintenance, it has evolved from a single platform focused on distributed tracing into a full APM system with multiple categories of rich functionality.

SkyWalking’s overall system architecture consists of three parts:

  • The first part is data collection. Language agents (probes) collect the system's metrics, and a complete set of data collection protocols is also provided, so third-party systems can use these protocols to report monitoring data to the analysis platform.
  • The second part is the analysis platform, which receives the monitoring data, processes it as a stream, and finally writes it to the storage engine. The storage engine can be Elasticsearch, MySQL, or one of many other options.
  • The third part is the UI. The UI component provides rich data display functions, including metric dashboards, call topology diagrams, trace queries, metric comparison, and alerting.

On top of that, SkyWalking's own components are highly customizable, which makes it easy for users to extend them for their own scenarios.

SkyWalking defines three dimensions to which monitoring metrics are bound.

  • Service: Represents a set of workloads that provide the same behavior for incoming requests. When using an agent or SDK, you can define the service name yourself. Otherwise, SkyWalking uses the name defined on the platform, such as Istio.
  • Instance: Each individual workload in the set above is called an instance. As with a Pod in Kubernetes, a service instance is not necessarily a single operating system process. When you use an agent, however, a service instance is a real process on the operating system.
  • Endpoint: The request path received by a particular service, such as an HTTP URL path or a gRPC service class name plus method signature.

Predefined dimensions make data pre-aggregation easy and are an important part of the SkyWalking analysis engine. They have the relative disadvantage of being less flexible, but in the case of APM, metrics are usually carefully designed in advance and performance is the key factor. SkyWalking therefore uses this predefined dimension pattern for data aggregation.
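The idea of aggregating along predefined dimensions can be sketched roughly as follows. This is only an illustration, not SkyWalking's actual analysis engine: the `Call` record, the `aggregate` function, and the sample service and Pod names are all hypothetical, and average latency stands in for whatever metric has been designed in advance.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One observed request (hypothetical record, not a SkyWalking type)."""
    service: str
    instance: str
    endpoint: str
    latency_ms: int


def aggregate(calls):
    """Average latency per predefined dimension: service, instance, endpoint."""
    sums = defaultdict(lambda: [0, 0])  # key -> [total latency, call count]
    for c in calls:
        keys = (
            ("service", c.service),
            ("instance", c.service, c.instance),
            ("endpoint", c.service, c.endpoint),
        )
        for key in keys:
            sums[key][0] += c.latency_ms
            sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


calls = [
    Call("reviews", "reviews-pod-1", "/reviews/0", 10),
    Call("reviews", "reviews-pod-2", "/reviews/0", 30),
]
averages = aggregate(calls)
```

Because the keys are fixed in advance, each incoming record updates a known, small set of counters, which is where the performance benefit comes from. The cost is that a grouping not foreseen in `keys` cannot be queried later.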

Challenges faced by SkyWalking in the Service Mesh scenario

Before describing the challenges of a Service Mesh scenario, we need to explain what observability means. Observability generally consists of three parts:

  • First, logging. Logs make it possible to reconstruct the runtime state of the system, which makes them a very convenient means of observation.
  • Second, distributed tracing. This data is especially valuable in microservice scenarios, where it gives users a view of how the distributed system behaves.
  • Third, metrics. Compared with logs and distributed tracing, metrics are cheap to collect and easy to process, and they are usually an important data source for system alerting.

The architecture diagram for Istio 1.5 is shown above. Let us focus on its support for observability. As the figure shows, all monitoring metrics converge on the central Mixer component and are then passed to the adapters on either side of Mixer, which forward them to external monitoring platforms such as the SkyWalking back-end analysis platform. Istio's metadata is appended to the metrics as the monitoring data flows through Mixer. The other, newer observation system is based on Telemetry V2, in which the Envoy proxy delivers metrics directly to the analysis platform. This model is still evolving rapidly, but it represents the future trend.

As the architecture diagram shows, the first challenge is that in the Service Mesh scenario the technical architecture that supports observability keeps changing.

Istio itself has two observability systems that have not converged: the Mixer-based one and the Mixerless one.

Mixer generates metrics from access logs. That is, the access logs between services are sent to the external analysis system after Mixer adds the relevant metadata. This mode is very mature and stable, but its performance is poor, for two reasons. First, the data transmission path is very long and has too many intermediate nodes: the data has to travel from the proxy to the Mixer node and then to the external adapter node. Second, it sends the original access logs, whose volume is very large and consumes a great deal of bandwidth, which makes data collection and analysis as a whole very challenging.

The other model is Mixerless, which is based entirely on metrics. As the earlier discussion of observability techniques showed, metrics are a low-overhead technique that is friendly to both network bandwidth and the analysis backend. However, this model has problems of its own. It has a relatively high technical threshold, since it is implemented with WASM plug-ins, and its performance cost on the proxy side is relatively high. In addition, because it is a new technology, its stability is poor and the related interfaces and specifications are incomplete.
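The difference in data volume between the two models can be illustrated with a small sketch. It is a simplification under stated assumptions: the label set (source, destination, response code) and both function names are hypothetical and do not correspond to an Istio or Envoy API.

```python
from collections import Counter


def access_log_records(calls):
    """Access-log mode: every call is shipped as its own record."""
    return list(calls)


def metric_samples(calls):
    """Metrics mode: calls are counted locally per label set before sending."""
    return Counter(calls)


# 1000 calls between the same two workloads, all answered with HTTP 200.
calls = [("productpage", "reviews", 200)] * 1000

records_sent = len(access_log_records(calls))
samples_sent = len(metric_samples(calls))
```

In this sketch the access-log model ships one record per call, while the metrics model ships one counter per distinct label set, so its volume grows with the number of label combinations rather than with traffic. The counting itself happens in the proxy, which is consistent with the proxy-side cost mentioned above.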

The second challenge is the absence of tracing data. SkyWalking was originally designed as a system for collecting and processing tracing data. However, as the figure on the right shows, the data reported in a Service Mesh is per call, which means there is no complete trace across the whole request path. This is a major challenge for the back-end analysis model. Supporting both kinds of data at the same time becomes a thorny problem for the back-end analysis system.

The third challenge is dimension matching. As described in the previous section, SkyWalking has three dimensions. Instances and endpoints are both well supported in the Service Mesh scenario, and this holds not only for the mesh scenario but for most scenarios. Matching services, however, is quite difficult, because SkyWalking has a single concept of a service while Istio has several concepts that can be called a "service". How to match the dimensions, especially at the service level, becomes another key point in integrating Service Mesh with SkyWalking.
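The difference from tracing can be seen in a small sketch of what call-based data allows. The `build_topology` function and the record shape are hypothetical and are not part of SkyWalking; the workload names are borrowed from the BookInfo sample.

```python
from collections import defaultdict


def build_topology(call_records):
    """Fold independent (source, destination) call records into topology edges."""
    edges = defaultdict(int)
    for source, destination in call_records:
        edges[(source, destination)] += 1
    return dict(edges)


# Each record describes one hop; nothing links the hops of a single request.
records = [
    ("ingress", "productpage"),
    ("productpage", "reviews"),
    ("productpage", "details"),
    ("productpage", "reviews"),
]
topology = build_topology(records)
```

The records are enough to derive edges and per-edge counts, but since no trace identifier ties the hops together, the path of an individual request cannot be reconstructed from them.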

Application scheme and its evolution

Integration with Istio

Istio's architecture diagram shows that, in addition to network traffic control, Istio also provides Telemetry data integration capabilities. The Telemetry component is primarily integrated through Mixer, which is where SkyWalking first integrated with Istio. In the early days, Istio supported in-process integration, where the integration code was added directly to Istio's source code for maximum performance. Istio later turned this capability into out-of-process adapters to reduce the integration complexity of the system. SkyWalking currently integrates through such an out-of-process adapter.

There are two installation modes:

  1. If you install SkyWalking from the Helm chart, set the parameters shown in the figure to true in the values.yaml file. Helm then installs the SkyWalking analysis backend and integrates it with Istio as an out-of-process adapter.
  2. If SkyWalking and Istio are already installed, you can configure Istio to send its observability data to SkyWalking using the custom resource (CR) file shown on the right.

Once installed, test it using the BookInfo sample program. You can see that the dimension match is:

  • Service: <ReplicaSet>.<Namespace>
  • Instance: kubernetes://<Pod>
  • Endpoint: HTTP URL

Note that the Service name contains the Namespace. Workloads with the same name in different namespaces are therefore treated as two different services.
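The mapping above can be written down as a small function. This is a sketch of the naming rule only; the function name, the Pod name, and the "staging" namespace are made up for illustration.

```python
def to_skywalking_dimensions(replica_set, namespace, pod, http_url):
    """Map Kubernetes/Istio names to SkyWalking's three dimensions."""
    return {
        "service": f"{replica_set}.{namespace}",
        "instance": f"kubernetes://{pod}",
        "endpoint": http_url,
    }


# The same ReplicaSet name in two namespaces yields two distinct services.
in_default = to_skywalking_dimensions(
    "reviews-v1", "default", "reviews-v1-abc12", "/reviews/0"
)
in_staging = to_skywalking_dimensions(
    "reviews-v1", "staging", "reviews-v1-abc12", "/reviews/0"
)
```

Here `in_default["service"]` is `reviews-v1.default` and `in_staging["service"]` is `reviews-v1.staging`, so the two are kept apart in the topology and in the metrics.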

In addition to the services and the Ingress in the example, the topology contains the istio-telemetry component. This reflects the actual data traffic, but some users may find it redundant. The mode described next gives a slightly different picture.

Besides the Mixer integration, SkyWalking can also integrate with the Envoy Access Log Service (ALS) to achieve an effect similar to Mixer. The advantage of integrating with Envoy is efficiency: access logs are sent to SkyWalking's receivers with minimal latency. The disadvantage is that the current Access Log Service sends a large amount of data, which can affect SkyWalking's processing performance and network bandwidth. In addition, all analysis relies on the low-level access logs, so some Istio-related features cannot be identified. For example, only Envoy metadata is available in this mode, and concepts such as Istio virtual services cannot be recognized.

This mode has to be configured when SkyWalking and Istio are installed. First set “envoy.als.enabled” to true in the SkyWalking Helm chart. Then, before installing Istio, set the “values.global.proxy.envoyAccessLogService” values as shown in the figure.

In the topology, the most obvious difference from the Mixer model is the absence of the istio-telemetry component. This is because that component has no Envoy sidecar routing its traffic and therefore generates no access logs. In other words, this model fully reflects the actual workloads.

In addition to the two models above, the community is currently developing an observation model based on Istio's latest Telemetry V2 protocol. This mode is based on metrics rather than access logs. It exposes two kinds of metrics:

  • Service level: These metrics describe the relationships between services and are used to generate topology diagrams and service-level metrics.
  • Proxy level: These metrics describe the proxy process itself and are used to generate instance-level metrics.

This mode is the standard Mixerless approach. It is friendly to the analysis platform and consumes little network bandwidth. The disadvantage is that it consumes Envoy resources, especially memory. However, I believe these problems can be solved after several rounds of optimization.

This mode also has the disadvantage of not generating metrics for the Endpoint dimension. Users who want those metrics still need to use the mode based on ALS access logs.

Mixed support for Tracing and Metrics

Before SkyWalking 8.0, the traditional tracing mode could not be used once Service Mesh mode was enabled. The reason is that the two share one analysis pipeline: if both are enabled at the same time, the metrics are computed twice.

SkyWalking 8.0 introduced MeterSystem to avoid this problem. There are also plans to adjust tracing so that it can be configured to generate, or not generate, metrics. The final result will be that the metrics panels and topology data come from Envoy metrics while the tracing data comes from tracing analysis, so that all the functions of Istio Telemetry on the control plane are supported.

In addition, Envoy and Istio themselves do not support SkyWalking's remote tracing protocol. The community has already worked on support for proxies commonly used in mesh environments, such as Nginx and MOSN, and in the future will try to add the SkyWalking protocol to Envoy using a WASM plug-in.
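The double-counting problem, and the effect of the planned switch, can be illustrated with a toy model. This is not the MeterSystem API: `count_calls` and the `metrics_from_traces` flag are hypothetical names used only to show the idea.

```python
def count_calls(mesh_calls, trace_segments, metrics_from_traces=True):
    """Count calls per service from every source that feeds the metric."""
    counts = {}
    sources = [mesh_calls]
    if metrics_from_traces:
        sources.append(trace_segments)
    for source in sources:
        for service in source:
            counts[service] = counts.get(service, 0) + 1
    return counts


# The same three real requests, observed once by the mesh and once by agents.
mesh_calls = ["reviews", "reviews", "details"]
trace_segments = ["reviews", "reviews", "details"]

shared = count_calls(mesh_calls, trace_segments)
separated = count_calls(mesh_calls, trace_segments, metrics_from_traces=False)
```

With both sources feeding the same metric, `shared` reports twice the real traffic. Once tracing no longer produces metrics, `separated` matches the three real requests, and the trace data stays available for trace queries.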

Dimension matching

The installation process shows that in both the Mixer and ALS modes the Service rule is ReplicaSet + Namespace, which hardly reflects Istio's actual dimensions. A real Istio service mapping will be available later with Telemetry V2. The community will also try adding the following naming convention to distinguish different clusters: “Version | | App Namespace | Cluster”.

Conclusion

This talk briefly introduced how Apache SkyWalking is applied in the Service Mesh scenario. It described in detail the three main challenges posed by Istio and the solutions to them, which should help you understand and use SkyWalking's mesh capabilities. I hope you will be interested in trying SkyWalking to observe Istio.

That's all for this talk. Thank you for your attention and support!

About the speaker

Gao Hongtao is a founding engineer at Tetrate, a US-based Service Mesh provider. He was formerly a cloud technology expert at Huawei Software Development Cloud and has rich experience in the design, development, and implementation of cloud native products. He has an in-depth understanding of distributed databases, container scheduling, microservices, Service Mesh, and related technologies. He is currently a core contributor to Apache ShardingSphere and Apache SkyWalking and took part in commercializing these open source projects in the Software Development Cloud. Before that, he was a system architect and open source expert at Dangdang, where he worked on well-known open source projects such as Elastic-Job.

Video replay and slides

  • Video replay: www.bilibili.com/video/BV1qp…
  • Slides download: github.com/servicemesh…

Financial-Grade Distributed Architecture (Antfin_SOFA)