Our company has recently been building a service-oriented platform and needs to roll out an APM system. This article briefly introduces SkyWalking.
APM
APM (Application Performance Management) uses probes to collect data and key metrics and then presents them, providing a systematic solution for application performance management and fault management.
Monitoring systems such as Zabbix, Prometheus and Open-Falcon focus mainly on server hardware metrics and the running status of system services. An APM system pays more attention to the internal execution of a program and to the call links between services, which makes it better suited to finding the root cause of a "slow" request deep in the code. The two kinds of system complement each other.
Open source APM systems currently available include CAT, Zipkin, Pinpoint and SkyWalking. Most of them are modeled on Google's Dapper.
CAT: Developed in Java and open-sourced by Meituan-Dianping in China. It currently provides clients for Java, C/C++, Node.js, Python, Go and other languages. Both CAT and Zipkin have to be embedded in the application, which is highly intrusive to the code. We prefer products that do not intrude on the code, so we eliminated CAT.
Zipkin: Developed and open-sourced by Twitter and implemented in Java. Zipkin is less intrusive than CAT, since it mainly requires changes to configuration files such as web.xml, but it still intrudes on the code, so we did not choose it.
Pinpoint: An open source product from a South Korean team. It uses bytecode enhancement, so you only need to add launch parameters at startup and the code itself is not modified. It currently supports Java and PHP and stores its data in HBase. Its probes collect very fine-grained data, but the performance overhead is high. Because it has been around for a long time, it is quite mature and many companies use it.
SkyWalking: An open source product created by Chinese developers, with the main developers coming from Huawei. On April 17, 2019, the Apache Board of Directors approved SkyWalking as a top-level project. It provides probes for Java, .NET, Node.js and other platforms, and supports storage backends such as MySQL and Elasticsearch. Like Pinpoint, it uses bytecode injection to avoid intruding on the code. Its probes collect coarser-grained data, but performance is excellent and it supports cloud native environments. It is growing quickly, has an active community, and offers Chinese documentation, so there is no language barrier for us.
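To make "non-intrusive" more concrete, here is a small Python analogy. It is not how the Pinpoint or SkyWalking Java agents are implemented (they rewrite JVM bytecode when classes are loaded); it only illustrates the idea that timing logic can be attached at startup while the business code stays untouched. The names `timed`, `instrument` and `OrderService` are made up for this sketch.

```python
import functools
import time
import types


def timed(func, records):
    """Return a wrapper that records (name, seconds) for every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            records.append((func.__name__, time.perf_counter() - start))
    return wrapper


def instrument(cls, records):
    """Replace each plain method on cls with a timed wrapper at runtime."""
    for name, attr in list(vars(cls).items()):
        if isinstance(attr, types.FunctionType):
            setattr(cls, name, timed(attr, records))


class OrderService:
    """Business code: contains no monitoring calls at all."""

    def create_order(self, item):
        return f"order:{item}"


records = []
instrument(OrderService, records)  # done once at startup, outside the business code
result = OrderService().create_order("book")
print(result, records)
```

The point of the analogy is that `OrderService` never imports or calls anything related to monitoring. With an intrusive client, calls of that kind would have to be written into the class itself.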
All things considered, we chose SkyWalking
SkyWalking
The official project describes SkyWalking in two sentences:
SkyWalking is an application performance monitoring tool for distributed systems, designed for microservices, cloud native architectures, and container-based (Docker, K8S, Mesos) architectures
SkyWalking is an observability analysis platform and application performance management system. It provides an integrated solution for distributed tracing, service mesh telemetry analysis, metric aggregation and visualization
SkyWalking architecture
SkyWalking has a component-based design and is easy to extend. The main components are as follows:
SkyWalking Agent: Collects trace and metric data and sends it to the SkyWalking Collector via HTTP or gRPC
SkyWalking Collector: The link-data collector. It integrates and analyzes the trace and metric data sent by the agents, processes it through the Analysis Core module, and writes it to the configured storage. It also serves secondary statistics and monitoring alarms through the Query Core module
Storage: SkyWalking's storage layer. Elasticsearch, MySQL, TiDB or H2 can be used as the storage medium
UI: A web visualization platform for displaying the stored data. RocketBot has been officially adopted as SkyWalking’s main UI
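The data flow between these components can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SkyWalking code: the classes `Agent`, `Collector`, `Storage` and `Segment` are hypothetical, the "network" is a direct method call rather than HTTP or gRPC, and the storage is an in-memory list standing in for Elasticsearch, MySQL, TiDB or H2.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One reported unit of trace data (fields are illustrative)."""
    service: str
    endpoint: str
    latency_ms: int


class Storage:
    """In-memory stand-in for the real storage backend."""

    def __init__(self):
        self.segments = []

    def save(self, segment):
        self.segments.append(segment)


class Collector:
    """Receives data from agents, stores it, and answers queries."""

    def __init__(self, storage):
        self.storage = storage

    def receive(self, segment):
        self.storage.save(segment)

    def avg_latency_ms(self, service, endpoint):
        values = [s.latency_ms for s in self.storage.segments
                  if s.service == service and s.endpoint == endpoint]
        return sum(values) / len(values) if values else None


class Agent:
    """Runs next to the application and reports what it observes."""

    def __init__(self, service, collector):
        self.service = service
        self.collector = collector

    def report(self, endpoint, latency_ms):
        self.collector.receive(Segment(self.service, endpoint, latency_ms))


storage = Storage()
collector = Collector(storage)
agent = Agent("order-service", collector)
agent.report("/orders", 120)
agent.report("/orders", 80)
avg = collector.avg_latency_ms("order-service", "/orders")
print(avg)  # 100.0
```

In this picture the UI would be a fourth party that only talks to the collector's query side and never to the agents directly.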
SkyWalking interface
- The dashboard
There are two kinds of dashboard: Service dashboards and Database dashboards
The Service dashboard contains Global, Service, Endpoint and Instance panels, which show detailed information at the global level and for individual services, endpoints and instances
The Database dashboard shows detailed information about a database, including response time, response time distribution, throughput, SLA and slow SQL
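For readers unfamiliar with these metrics, the sketch below shows one way they could be derived from raw call records. It rests on assumptions that are ours, not taken from SkyWalking's source: SLA is read as the percentage of successful calls, throughput as calls per minute, and the 200 ms slow-SQL threshold and 100 ms bucket width are arbitrary. `DbCall` and `summarize` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DbCall:
    statement: str
    latency_ms: int
    success: bool


def summarize(calls, window_minutes, slow_threshold_ms=200, bucket_ms=100):
    """Aggregate raw database calls into dashboard-style metrics."""
    latencies = [c.latency_ms for c in calls]
    # Group latencies into fixed-width buckets: 0-99 ms -> 0, 100-199 ms -> 100, ...
    buckets = Counter((ms // bucket_ms) * bucket_ms for ms in latencies)
    slowest_first = sorted(calls, key=lambda c: c.latency_ms, reverse=True)
    return {
        "avg_response_ms": sum(latencies) / len(latencies),
        "distribution": dict(sorted(buckets.items())),
        "throughput_cpm": len(calls) / window_minutes,
        "sla_percent": 100 * sum(c.success for c in calls) / len(calls),
        "slow_sql": [c.statement for c in slowest_first
                     if c.latency_ms >= slow_threshold_ms],
    }


calls = [
    DbCall("SELECT * FROM orders WHERE id = ?", 20, True),
    DbCall("SELECT * FROM orders WHERE user_id = ?", 350, True),
    DbCall("UPDATE stock SET qty = qty - 1 WHERE id = ?", 90, False),
    DbCall("SELECT COUNT(*) FROM orders", 140, True),
]
summary = summarize(calls, window_minutes=2)
print(summary)
```

With these four sample calls over a two-minute window, the average response time is 150 ms, throughput is 2 calls per minute, SLA is 75%, and the only statement flagged as slow is the query by `user_id`.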
- The topology
SkyWalking automatically maps the call relationships between services from the data it collects. It also recognizes common components and shows them with their own icons, such as the Kafka and H2 services on the diagram
The color of each line reflects the call latency between two services, so the state of the calls between services can be seen at a glance. Clicking the point in the middle of a line shows the average response time, throughput and SLA of the link between the two services
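Conceptually, the topology is a directed graph whose edges carry aggregated call statistics. The following sketch builds such a graph from a flat list of calls. The record layout `(caller, callee, latency_ms, ok)` and the function `build_topology` are assumptions made for illustration, and SLA is again taken to mean the success percentage.

```python
from collections import defaultdict


def build_topology(calls):
    """Group calls by (caller, callee) edge and aggregate per edge."""
    samples = defaultdict(list)
    for caller, callee, latency_ms, ok in calls:
        samples[(caller, callee)].append((latency_ms, ok))
    return {
        edge: {
            "calls": len(items),
            "avg_ms": sum(ms for ms, _ in items) / len(items),
            "sla_percent": 100 * sum(ok for _, ok in items) / len(items),
        }
        for edge, items in samples.items()
    }


calls = [
    ("gateway", "order-service", 30, True),
    ("gateway", "order-service", 50, True),
    ("order-service", "mysql", 10, True),
    ("order-service", "mysql", 30, False),
]
topology = build_topology(calls)
for (caller, callee), stats in topology.items():
    print(f"{caller} -> {callee}: {stats}")
```

A UI could then color each edge by `avg_ms` and show the remaining numbers when the edge is clicked.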
- The trace panel
The trace panel shows what happens inside the code while a request is being handled: which services the request passed through, which methods were executed, how long each method took, whether it succeeded, and other details. This makes it possible to locate code problems quickly
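The view described here is essentially a tree of timed spans, in the style popularized by Dapper. The sketch below renders such a tree and picks the span with the largest "self time" (its own duration minus that of its direct children). The `Span` fields, the helper names and the sample data are illustrative and do not reflect SkyWalking's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def children_by_parent(spans):
    """Index spans by parent id, ordered by start time."""
    index = {}
    for span in sorted(spans, key=lambda s: s.start_ms):
        index.setdefault(span.parent_id, []).append(span)
    return index


def render(spans):
    """Return one indented line per span, depth-first from the root."""
    index = children_by_parent(spans)
    lines = []

    def walk(parent_id, depth):
        for span in index.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.service} {span.operation} {span.duration_ms}ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def slowest_by_self_time(spans):
    """Span that spends the most time in its own code, excluding children."""
    index = children_by_parent(spans)

    def self_time(span):
        return span.duration_ms - sum(c.duration_ms for c in index.get(span.span_id, []))

    return max(spans, key=self_time)


trace = [
    Span(1, None, "gateway", "GET /orders/42", 0, 480),
    Span(2, 1, "order-service", "OrderController.get", 10, 470),
    Span(3, 2, "order-service", "OrderDao.findById", 20, 60),
    Span(4, 2, "inventory-service", "StockClient.query", 70, 460),
]
lines = render(trace)
print("\n".join(lines))
culprit = slowest_by_self_time(trace)
print("slowest:", culprit.service, culprit.operation)
```

In the sample trace the whole request takes 480 ms, and 390 ms of that is spent in `StockClient.query`, which is where one would start investigating.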
- The alarm panel
Final thoughts
SkyWalking is still developing rapidly. When we deployed it in our production environment we ran into a series of problems, such as gaps in the charts under large data volumes, slow chart rendering, and incompatibility with some Elasticsearch versions. Little reference material was available to help, so be cautious about taking it into production. However, GitHub shows that the product is still being updated and improved quickly, and I believe SkyWalking will keep getting better, thanks to open source