Introduction

In today's microservice-dominated environment, distributed architectures and microservice frameworks make system performance analysis and fault location much harder. Achieving full-link performance monitoring of an application by aggregating real-time data from every processing stage of the service system has become a major service governance problem. Drawing on the business background of Tencent Youshu, a smart retail product, this article shares how we use Tencent Cloud's SkyWalking-based microservice observability in practice, in the hope that it helps readers with similar requirements.

About the author

Chen Junhong

Operations development engineer in Tencent's Smart Retail department, specializing in data asset management and microservice governance.

Background

Numerous microservices lack unified management specifications (monitoring, call chain tracing)

  • Multiple accounts and environments: public cloud, in-house cloud, intranet K8S

  • Multiple deployment platforms: TAF, TKE, TKEX

Online service exceptions cannot be located quickly

  • NFS log mounting mode

  • Lack of a unified Web management interface

Missing service performance diagnostics

  • Some unreasonable calls, such as frequent database operations, circular dependencies, and so on, cannot be detected in time.

Goals

Based on the above background, we wanted to build a component platform whose core functions cover service call monitoring, call chain tracing, and service performance diagnosis.

SkyWalking, a distributed application behavior monitoring tool for microservice and container-based architectures (Docker, Kubernetes, Mesos), met our needs.

Core principles

[Figure: SkyWalking architecture]

SkyWalking’s architecture consists of three main parts: the Client, the Collector, and the web UI.

The core principles are as follows:

  1. JVM and behavioral data are collected using Java Agent probe technology

  2. Internal communication uses the HTTP and gRPC protocols

  3. The UI is served through GraphQL and HTTP

  4. Supported storage includes H2 (only for debugging with small amounts of data; not recommended) and Elasticsearch

Service reporting practice

At present, Tencent’s backend services mainly use the SpringBoot technology stack. To avoid extra development cost for backend developers, we try to avoid code intrusion in our overall approach to service governance. The usual service startup command is:

$ java -Dspring.profiles.active=dev -jar target/youshu-app.jar

Once the SkyWalking Agent is available, you only need to add the -javaagent parameter, with the absolute path of the Agent, to the startup command in order to report data. For example:

$ java -javaagent:/e/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dspring.profiles.active=dev -jar target/youshu-app.jar
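A minimal sketch of how the startup command could be wrapped so that the Agent is attached only when its jar is actually present, which keeps the service startable if the Agent file is missing. The function name build_java_cmd is hypothetical and not part of SkyWalking; the profile and jar names are the ones used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SkyWalking): print the java command line
# for the service, adding -javaagent only when the agent jar exists.
#   $1 = absolute path to skywalking-agent.jar (may be empty or missing)
#   $2 = Spring profile
#   $3 = application jar
build_java_cmd() {
    agent_jar=$1
    profile=$2
    app_jar=$3
    agent_opt=""
    # Attach the agent only if the jar is really on disk.
    if [ -n "$agent_jar" ] && [ -f "$agent_jar" ]; then
        agent_opt="-javaagent:$agent_jar "
    fi
    printf 'java %s-Dspring.profiles.active=%s -jar %s\n' \
        "$agent_opt" "$profile" "$app_jar"
}

# Example (prints the command instead of running it):
#   build_java_cmd /e/apache-skywalking-apm-bin/agent/skywalking-agent.jar dev target/youshu-app.jar
```

The helper only prints the command, so the result can be reviewed in a deployment log before it is executed. Paths containing spaces would need extra quoting, which this sketch does not handle.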

The visual UI then provides a call relationship topology and call chain traces.

Tencent Cloud microservice observability platform: TSW

Once the component was chosen, how could developers stay focused on business features rather than on the day-to-day operation and maintenance of SkyWalking’s underlying services? To follow the group’s general move toward cloud services and to reduce the team’s operating costs, we decided to rely on Tencent Cloud and let a dedicated team run the SkyWalking services day to day.

Compared with open source SkyWalking, the Tencent Cloud TSW architecture diagram shows the following differences:

  • Data collection (Client): The reporting mode is more flexible. You can collect data with either the TSW probe or an open source collector. When migrating from open source to the cloud, you can keep most of the Client configuration and change only the reporting address.

  • Data processing (Server): When data reaches the Server, Pulsar Function (message queue) first smooths out traffic peaks and troughs. For data reported by different open source clients, an Adapter converts it into a unified OpenTracing-compatible format for later use. Once the format is unified, trace data is stored directly and dispatched to real-time or offline computing operators depending on the usage scenario. Real-time operators provide real-time monitoring, display statistics, and connect to the alarm platform for quick response. Offline operators handle statistical aggregation over large volumes of data and long time ranges, delivering business value through big data analysis.

  • Storage: The storage layer is designed for the different data types and scenarios. It serves write requests from the Server layer and query and read requests from the Data Usage layer. HBase and HDFS storage modes are also added to this layer.

  • Data Usage: provides a unified console, data display, and alarm support on Tencent Cloud.

Reporting data to TSW with the open source Agent

Because the backend service is deployed on Tencent Cloud TKE, you need to mount an NFS cloud disk to configure and manage the Agent.

Step 1: modify the startup command of the Docker service

$ java -javaagent:/nfs_data/XXX/XXX/agent/skywalking-agent.jar -Dspring.profiles.active=dev -jar target/youshu-app.jar

Step 2: download the open source Agent and upload it to the NFS mount point directory, for example, /nfs_data/XXX/agent/.

Step 3: modify the corresponding settings in config/agent.config

# the service name to report; can be customized
agent.service_name=xxx-api
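When several services share one Agent directory on NFS, a single config file may give them all the same service name. One possible alternative, sketched here on the assumption that the Agent version in use supports overriding settings through system properties prefixed with skywalking., is to pass the values on the command line instead. The helper name agent_override_opts is hypothetical, and the reporting address is left as a placeholder because it depends on the TSW access point.

```shell
#!/bin/sh
# Hypothetical helper: print -D flags that override values from the agent
# config file. Assumes the agent maps "skywalking.<key>" system properties
# onto config keys such as agent.service_name and collector.backend_service.
#   $1 = service name to report
#   $2 = reporting address as host:port (optional)
agent_override_opts() {
    opts="-Dskywalking.agent.service_name=$1"
    # Override the reporting address only when one is given.
    if [ -n "${2:-}" ]; then
        opts="$opts -Dskywalking.collector.backend_service=$2"
    fi
    printf '%s\n' "$opts"
}

# Example (placeholders left unfilled on purpose):
#   java -javaagent:/nfs_data/XXX/XXX/agent/skywalking-agent.jar \
#        $(agent_override_opts xxx-api <reporting-address>) \
#        -Dspring.profiles.active=dev -jar target/youshu-app.jar
```

With this approach the shared config file stays untouched and each deployment sets its own name, at the cost of a longer startup command.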

Step 4: restart the service and, once traffic comes in, verify that the corresponding topology appears in the console (you can also check logs/XXX for reported exceptions)
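The log check in the last step can be scripted. The sketch below assumes that Agent errors appear as lines containing the word ERROR, which should be confirmed against the actual log format. The function name agent_log_clean is hypothetical, and the log file path is passed in as a parameter because it depends on the deployment.

```shell
#!/bin/sh
# Hypothetical helper: report whether an agent log file contains ERROR lines.
# Returns 0 when no ERROR line is found, 1 when at least one is found,
# and 2 when the log file does not exist.
#   $1 = path to the agent log file
agent_log_clean() {
    log_file=$1
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 2
    fi
    if grep -q 'ERROR' "$log_file"; then
        # Show the most recent matches (with line numbers) for quick triage.
        grep -n 'ERROR' "$log_file" | tail -n 5 >&2
        return 1
    fi
    return 0
}
```

A non-zero return code can fail a deployment check early, before anyone has to look at the topology view.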

Comparison of mainstream APM components

With so many application performance management components in the industry, why choose SkyWalking? See the following comparison table for details:

Performance analysis of Agent probe

Like many developers, we were concerned about the Agent’s impact on service performance. According to the official test statistics, the Agent adds only about 10% CPU load for a web application.

Conclusion

Application performance management is only one part of service governance. To address the three major problems of service call monitoring, call chain tracing, and service performance diagnosis, this article introduced the system architecture and related practices of open source SkyWalking and Tencent Cloud TSW.

References

1. Java Agent probe: zhuanlan.zhihu.com/p/135872794

2. Tencent Micro Service Observation Platform Product Overview: Cloud.tencent.com/document/pr…

3. Agent probe performance disclosure: Github.com/SkyAPMTest/…

4. SkyWalking: skywalking.apache.org/
