This section describes the principles of APM and Dapper.
It covers: what APM is, the three characteristics of APM, the development history of APM, the core idea of DevOps, why APM is needed, the conditions a good APM must meet, and an introduction to Dapper and its basic principles.
Application Performance Management (APM)
APM monitors and optimizes the performance and user experience of an enterprise's key IT applications in order to improve their reliability and quality. It is designed to ensure a high-quality experience for end users while reducing the IT Total Cost of Ownership (TCO).
Total Cost of Ownership (TCO) covers everything from the purchase of a product through its later use and maintenance, and is an evaluation standard that companies commonly use.
Introduction to APM
Most APM systems on the market today are modeled on Google's Dapper, a tracing system for large-scale distributed systems. They trace how a service request is processed, track the performance cost of front-end and back-end processing and of server-side calls, and provide a visual interface for analyzing the trace data. By aggregating real-time data from every processing step of the business system, they analyze the call path and processing time of each transaction, realizing full-link performance monitoring of the application.
APM differs from traditional performance monitoring tools in that it does not merely provide scattered resource monitoring points and metrics; it focuses on analyzing performance bottlenecks within and between systems, which helps locate the specific cause of a problem. APM is dedicated to detecting and diagnosing application performance problems so that the application can deliver the expected level of service.
Three characteristics of APM
- Multi-level application performance monitoring: covering communication protocol layers 1-7, with end-to-end application monitoring achieved through transaction process monitoring, simulation, and other means.
- Rapid fault locating: every component of the application system is monitored so that faults can be located and fixed quickly.
- Comprehensive performance optimization: the system resource usage of each component is analyzed precisely, and expert advice is given based on the performance requirements of the application system.
The history of APM
The development of APM has so far gone through three main stages, with a fourth now emerging. Stage one: network monitoring infrastructure, mainly monitoring host CPU usage, I/O, memory, network speed, and so on, represented by the various network management systems (NMS) and system monitoring tools.
Stage two: monitoring the basic components. With the rapid development of the Internet, all kinds of basic components (databases, middleware, and so on) emerged in large numbers to reduce the difficulty of application development, so application performance management in this period mainly meant monitoring and managing the performance of those components.
Stage three: monitoring application performance. The complexity of IT operation and maintenance exploded, and the focus of application performance management shifted to the performance and management of the applications themselves.
The fourth stage is still developing: cloud computing is on the rise, and the emergence of DevOps and microservices has had a major impact on traditional APM, so traditional vendors are innovating and experimenting with microservices and cloud computing. The rise of machine learning and AI will also help with fault and problem locating, as will approaches based on big-data analysis; the market is currently at an early, exploratory stage.
Gartner's 2016 definition of APM divides it into three dimensions:
- DEM (Digital Experience Monitoring): monitoring the digital experience of users in browsers and on mobile devices, plus business availability and performance monitoring using active (synthetic) probing.
- ADTD (Application Discovery, Tracing and Diagnostics): automatic application discovery, tracing, and fault diagnosis; automatic discovery of the logical relationships between applications, automatic modeling, in-depth monitoring of application components, and performance correlation analysis.
- AA (Application Analytics): application analytics, using machine learning to perform root-cause analysis for Java, .NET, and other applications.
DevOps
DevOps (a portmanteau of Development and Operations) is a culture, movement, or practice that values communication and cooperation between software developers and IT operations staff. DevOps can be succinctly defined as "a more collaborative and efficient relationship between development teams and operations teams"; the term arose from the collision of these two related trends. For more information, see: zh.wikipedia.org/wiki/DevOps
When application service nodes call each other, an application-level tag is recorded and passed along, and this tag is used to correlate the relationships between the service nodes. For example, if HTTP is used as the transport protocol between two nodes, the tags are added to the HTTP headers. How these tags are passed therefore depends on the communication protocol used between the nodes: common protocols make it relatively easy to add them, while custom protocols can be harder, and this directly determines how difficult it is to implement a distributed tracing system.
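As a minimal sketch (not the implementation of any particular APM product), the snippet below shows how a client might attach trace tags to an outgoing HTTP request and how a server might read them back. The header names X-Trace-Id and X-Span-Id are assumptions chosen for illustration; real systems define their own (for example Zipkin's B3 headers or SkyWalking's sw8 header).

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.UUID;

public class TraceHeaderPropagation {
    // Illustrative header names; not taken from any specific product.
    static final String TRACE_ID_HEADER = "X-Trace-Id";
    static final String SPAN_ID_HEADER  = "X-Span-Id";

    // Client side: attach the current trace context to the outgoing request.
    static HttpURLConnection openTracedConnection(String url, String traceId, String spanId) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty(TRACE_ID_HEADER, traceId);
        conn.setRequestProperty(SPAN_ID_HEADER, spanId);
        return conn;
    }

    // Server side: read the tags back (headers represented here as a plain map).
    static String extractTraceId(Map<String, String> requestHeaders) {
        // If no trace id arrived, this request starts a new trace.
        return requestHeaders.getOrDefault(TRACE_ID_HEADER, UUID.randomUUID().toString());
    }
}
```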
As a company's business grows, every system becomes more complex, and the calls between services, the dependencies among them, and the analysis of service performance become harder and harder to reason about, so introducing a service tracing system becomes especially important. Existing APM systems are essentially built by referring to Google's Dapper: by tracing how requests are processed, they track the performance cost of application processing at the front and back ends and of server-side calls (the complete call chain of each request is collected, together with the performance data of every service on that chain), which helps engineers locate problems quickly.
Conditions a good APM should meet
In summary, a good APM system should meet the following five conditions:
- Low overhead and high efficiency: the cost the tracing system imposes on system resources should be as small as possible. Mainstream APM systems consume roughly 2.5% to 5% of system resources, but the smaller the better: in a large-scale distributed system the resources of a single node cannot be fully controlled; a machine may be over-committed, or it may run only a few small services yet already be tight on capacity, and running the tracing agent on top of that could push the node over the edge, which is not worth it.
- Low intrusiveness and sufficient transparency: as a tracing system, some intrusion is unavoidable; the key is at what level it happens. The deeper the intrusion is pushed into the underlying layers, the less developers have to perceive it or do extra work to support the tracing system. If business code has to be modified, the application becomes more complex and redundant for its own business, which is also bad for the fast pace of development.
- Flexible scalability: the distributed tracing system should not be crippled as microservices and clusters expand; it should be designed with the scale of future distributed services in mind and be able to accommodate growth for at least the next few years.
- Trace data visualization and rapid feedback: there must be a visual monitoring interface, and the path from trace data collection and processing to the presentation of results should be as fast as possible, so that abnormal situations in the system can be responded to quickly.
- Continuous monitoring: the distributed tracing system must run 24x7; otherwise it will be difficult to locate occasional jitter in the system.
Dapper is the distributed tracing system used in Google's production environment. English paper: static.googleusercontent.com/media/resea… Chinese translation: bigbully.github.io/Dapper-tran… (well worth reading)
Distributed call tracing is a concept that really caught fire with microservices. Google moved to a service-oriented architecture many years ago (running on its own containers as early as the early 2000s), so its distributed tracing theory is currently the most mature. The reason distributed tracing systems appeared is simple: a single request in a distributed system can involve many RPCs, so tools are urgently needed that help engineers understand the behavior of the system, analyze performance problems, and determine which services are dragging things down.
Here is an example based on the paper:
A to E represent five services. A user sends a request to A, and A then sends RPC requests to B and C; B processes its request and returns, while C must call D and E before it can return. A simple, practical way to implement distributed tracing for such a request is to collect a trace message identifier and a timestamped event for every send and receive action on each server, and then associate those records with the specific request.
How do you associate each service's log records with a particular request?
There are currently two approaches in academia and industry to associating records with a particular request.
1. Black Box
The black-box scheme assumes there is no additional information to be tracked beyond the records described above, so statistical regression techniques are used to infer the association between them.
Logging stays exactly the same, but machine-learning methods are used to associate records with specific requests. Taking a particular request RequestX as a variable, a black box (that is, a machine-learning model such as regression analysis) is used to find the corresponding record in A's log, and likewise the relevant records for B, C, D, E, and so on.
The advantage of the black-box scheme is that the existing logging does not need to change; its disadvantage is equally obvious: because it relies on statistical inference, its accuracy is often not high (it can be improved with a lot of training data), and it does not work well in practice.
(Project5, WAP5, Sherlock)
2. Annotation-based scheme
Annotation-based solutions rely on the application or the middleware explicitly attaching a global ID to every request so that the whole chain of calls can be associated. In the RequestX example, the request is assigned an identifier such as 1000, and every downstream service logs that identifier along with its records. The advantage of this method is accuracy; Google, Twitter, Taobao, and others all use it today.
The way it works is to obtain a Trace instance from the TraceID carried in the request (each programming language has its own way of doing this). With the Trace instance in hand, a Recorder is called to record each Span; the recorded values are stored locally as logs, and a collector daemon started by the tracing system then gathers the logs, collates them, and writes them to a database. It is advisable to store the parsed results in a sparse-table store such as BigTable (or Cassandra, HDFS): because each trace may carry a different set of spans, the final layout has one row per trace, with each column of that row representing a span.
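To make the "record locally, collect later" step concrete, here is a hedged sketch of a span recorder that appends one JSON-style line per finished span to a local log file, which a separate collector daemon could later ship to storage. The class and the log format are invented for illustration and are not Dapper's actual format.

```java
import java.io.FileWriter;
import java.io.IOException;

// Illustrative recorder: appends one line per finished span to a local log file
// so that a separate collector daemon can tail the file and ship the data later.
public class LocalSpanRecorder {
    private final String logPath;

    public LocalSpanRecorder(String logPath) {
        this.logPath = logPath;
    }

    public synchronized void record(long traceId, long spanId, long parentSpanId,
                                    String name, long startMicros, long endMicros) {
        String line = String.format(
                "{\"traceId\":%d,\"spanId\":%d,\"parentSpanId\":%d," +
                "\"name\":\"%s\",\"start\":%d,\"end\":%d}%n",
                traceId, spanId, parentSpanId, name, startMicros, endMicros);
        try (FileWriter w = new FileWriter(logPath, true)) { // append mode
            w.write(line);
        } catch (IOException e) {
            // Tracing must never break the application, so failures are ignored here.
        }
    }
}
```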
The main drawback of annotation-based solutions is that they require code instrumentation (so how to reduce code intrusion becomes the biggest problem).
To keep the intrusion low, it is recommended to keep the core tracing code lightweight and build it into common components, such as the threading library, control-flow utilities, and RPC libraries.
Trace trees and spans
All a distributed tracing system has to do is record the identifiers and timestamps of every send and receive action and connect all the services involved in a request, so that the complete call chain of the request can be understood.
In Dapper, a Trace records the complete call chain of a request, and the request/response exchange between two services (such as service A and service B above) is called a Span. Each Trace is a tree structure, and the spans represent the concrete dependencies between services.
Each Trace tree defines a globally unique TraceID, which is recommended to be a 64-bit integer; every span in the trace carries this TraceID. Each Span has a ParentSpanID and its own SpanID. In the example above, service A's ParentSpanID is empty and its SpanID is 1; service B's ParentSpanID is 1 and its SpanID is 2; service C's ParentSpanID is also 1 and its SpanID is 3; and so on.
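A minimal span model matching the IDs just described might look like the sketch below; the field names are assumptions for illustration, not Dapper's actual schema.

```java
// Minimal span model: every span carries the trace it belongs to,
// its own id, and the id of the span that caused it (0 for the root).
public class Span {
    final long traceId;       // shared by every span in the same trace tree
    final long spanId;        // unique within the trace
    final long parentSpanId;  // 0 means this is the root span
    final String name;        // e.g. "ServiceB.handleRequest"
    long startMicros;
    long endMicros;

    Span(long traceId, long spanId, long parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
    }

    boolean isRoot() { return parentSpanId == 0; }
}
```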
In Dapper's trace tree structure, the tree nodes are the basic unit of the whole architecture, and each node is a reference to a span. The edges between nodes represent the direct relationship between a span and its parent span. Although a span in the log files simply records when it starts and ends, spans are relatively independent within the overall tree structure.
Figure 2: The causal and temporal relationships between five spans in a Dapper trace tree
The figure above illustrates what spans look like within a larger trace. Dapper records a name for every span, together with the span's ID and its parent ID, so that the relationships between the spans of a single trace can be reconstructed. A span without a parent ID is called a root span. All the spans of a given trace also share its traceID (not shown in the figure). All of these IDs are globally unique 64-bit integers. In a typical Dapper trace, each RPC corresponds to a single span, and each additional layer of components corresponds to another level in the trace tree.
Figure 3: A detailed view of a single span shown in Figure 2
Besides recording its ParentSpanID and its own SpanID, a span also records the request and response times of its calls to other services. Because the client and server timestamps come from different hosts, clock skew must be taken into account. To resolve this, one premise is agreed on: the RPC client must send the request before the server can receive it, and likewise the server must send the response before the client can receive it. The server-side timestamps of an RPC therefore have an upper and a lower bound.
As can be seen from the figure above, this span is a "Hello.Call" RPC with SpanID 5, ParentSpanID 3, and TraceID 100. Let's focus on the four timestamps: Client Send (CS), Server Recv (SR), Server Send (SS), and Client Recv (CR).
- CS: the time the client sends the request
- SR: the time the server receives the request
- SS: the time the server sends the response
- CR: the time the client receives the response
By collecting these four timestamps you can, after a request completes, calculate the execution time and network time of the whole trace as well as of every span within it (the sketch after the formulas below does the arithmetic).
- Service invocation time = CR - CS
- Service processing time = SS - SR
- Network time = service invocation time - service processing time
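The arithmetic translates directly into code. The sketch below assumes the four timestamps have already been collected and that, as described above, clock skew has been handled so that the server timestamps fall between the client's send and receive times.

```java
// Given the four timestamps of one RPC span, derive the times discussed above.
// All values are epoch microseconds taken on the respective hosts.
public class SpanTiming {
    static long invocationTime(long cs, long cr) { return cr - cs; } // total time seen by the client
    static long processingTime(long sr, long ss) { return ss - sr; } // time spent inside the server
    static long networkTime(long cs, long sr, long ss, long cr) {
        return invocationTime(cs, cr) - processingTime(sr, ss);       // both directions combined
    }

    public static void main(String[] args) {
        long cs = 0, sr = 12, ss = 40, cr = 55;  // toy numbers in microseconds
        System.out.println("invocation=" + invocationTime(cs, cr)     // 55
                + " processing=" + processingTime(sr, ss)             // 28
                + " network=" + networkTime(cs, sr, ss, cr));         // 27
    }
}
```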
The start and end times of a span, as well as any RPC timing information, are recorded through Dapper's instrumentation of the RPC component library. If application developers choose to add their own annotations (business data, such as "foo" in the figure) to the trace, that information is logged just like any other span information.
How to achieve application-level transparency?
In Google's environment, all applications use the same threading model, control flow, and RPC system. Since engineers cannot be asked to write logging code themselves, the threading model, control flow, and RPC system are used to log automatically on their behalf.
For example, almost all inter-process communication at Google is based on an RPC framework with C++ and Java implementations, and Dapper integrates tracing into that framework: the span ID and trace ID are sent from the client to the server, so engineers do not need to care about any of this at the application level.
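A hedged sketch of the idea follows (this is not Google's RPC framework; the interface and names are invented for illustration): if every outgoing call passes through one client-side hook, the trace context can be created and injected there once, and application code never has to see it.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative client-side hook: because every RPC passes through the framework,
// the framework can create the child span and propagate IDs without any change
// to application code. The server-side counterpart would populate CONTEXT when
// a request arrives, before handing control to the application.
public class TracingClientInterceptor {
    // Per-thread trace context, mirroring how a shared threading model makes
    // implicit propagation possible. Layout: {traceId, currentSpanId}.
    static final ThreadLocal<long[]> CONTEXT = new ThreadLocal<>();

    void beforeCall(Map<String, String> rpcMetadata) {
        long[] ctx = CONTEXT.get();
        long traceId = ctx != null ? ctx[0] : ThreadLocalRandom.current().nextLong();
        long parentSpanId = ctx != null ? ctx[1] : 0L;
        long spanId = ThreadLocalRandom.current().nextLong();

        // The only thing the wire sees: IDs carried in the RPC metadata.
        rpcMetadata.put("trace-id", Long.toString(traceId));
        rpcMetadata.put("parent-span-id", Long.toString(parentSpanId));
        rpcMetadata.put("span-id", Long.toString(spanId));
    }
}
```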
Dapper's trace collection process
The process is divided into three stages:
- Each service writes its span data to a local log;
- A Dapper daemon pulls the data from the local logs and sends it to the Dapper collectors;
- The Dapper collectors write the results to BigTable, recording each trace as one row (a sketch of this grouping step follows the list).
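For the last stage, here is a hedged sketch of the "one trace per row" idea, reusing the Span class sketched earlier (assumed to be in the same package): the collector groups incoming spans by traceId before writing, so one row holds the whole call chain. The in-memory map simply stands in for the BigTable write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative collector step: group spans by traceId so each trace becomes one
// "row" whose "columns" are the individual spans, mirroring the BigTable layout
// described in the text.
public class TraceRowBuilder {
    public static Map<Long, List<Span>> groupByTrace(List<Span> collectedSpans) {
        Map<Long, List<Span>> rows = new HashMap<>();
        for (Span s : collectedSpans) {
            rows.computeIfAbsent(s.traceId, id -> new ArrayList<>()).add(s);
        }
        return rows; // key = trace row, value = its spans (the columns)
    }
}
```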
Tracing overhead
The cost of the tracing system consists of two parts:
- The monitored system pays the cost of generating traces and collecting trace data;
- A portion of resources is needed to store and analyze the trace data. One can argue that a valuable tracing infrastructure is worth a small performance cost, but we believe the initial rollout of a tracing system is helped enormously if its basic cost is negligible.
Three aspects are covered below: the cost of Dapper's component operations, the cost of trace collection, and Dapper's impact on production workloads. We also describe how Dapper's adjustable sampling-rate mechanism helps us trade off low overhead against representative traces.
Trace generation overhead
The overhead of generating traces is the most critical part of Dapper's performance impact, because collection and analysis can more easily be switched off in an emergency. The most important generation cost in the Dapper runtime is creating and destroying spans and annotations and writing them to local disk for later collection. Creating and destroying a root span takes 204 nanoseconds on average, while the same operations take 176 nanoseconds on other spans; the difference is mainly the cost of assigning a globally unique trace ID when the root span is created.
If a span is not sampled, the cost of creating an annotation on it is almost negligible: it consists of a ThreadLocal lookup in the Dapper runtime, about 9 nanoseconds on average. If the span is sampled, annotating it with a string (as illustrated in Figure 4) takes about 40 nanoseconds on average. These numbers were measured on an x86 server running at 2.2 GHz.
Writing to local disk is the most expensive operation in the Dapper runtime, but its visible cost is greatly reduced because log writes are coalesced and executed asynchronously with respect to the traced application. Nevertheless, log writing can become noticeable under heavy traffic, especially if every request is traced.
Trace collection overhead
Google's statistics: in the worst case, the CPU usage of Dapper's log-collection daemon, tested at a load baseline higher than real traffic, never exceeded 0.3% of a single core. The daemon is also restricted to the lowest priority of the kernel scheduler to avoid CPU contention on heavily loaded machines. Dapper is a light consumer of bandwidth as well, with each span averaging 426 bytes when transmitted to our repository; as a tiny share of network behavior, Dapper's data collection occupies only 0.01% of the network resources in Google's production environment.
Figure 3: CPU resource usage of the Dapper daemon during a load test
Impact on production workload
High-throughput online services, where each request touches a large number of servers, are one of the main scenarios in which efficient tracing is required; they generate the largest volume of trace data and are also the most sensitive to performance impact. In Table 2 we take the web search cluster as an example and, by adjusting the sampling rate, measure Dapper's impact on latency and throughput.
Figure 4: Effect of different sampling rates on the latency and throughput of a web search cluster. The experimental errors for latency and throughput are 2.5% and 0.15% respectively.
We can see that although the impact on throughput is not significant, sampling is necessary to avoid a noticeable increase in latency. However, once the sampling rate is below 1/16, the latency and throughput losses are both within the experimental error. In practice we found that even at a sampling rate of 1/1024 there is still enough trace data to follow a large number of services. Keeping Dapper's baseline performance cost very low matters because it gives applications a relaxed environment in which to use the full annotation API without fear of performance loss. Using a low sampling rate has the added benefit that trace data persisted to disk can be retained longer before being removed by the garbage-collection mechanism, which gives Dapper's collection components more flexibility.
Sampling
A distributed tracing system must keep its performance overhead low; above all, in production it must not affect the performance of core services. Google cannot trace every request, so sampling is required, and each application or service can set its own sampling rate. The sampling rate should live in each application's own configuration so that it can be adjusted dynamically; in particular, it can be turned up appropriately just after an application goes live.
Generally, when a system carries heavy peak traffic, only a small fraction of the requests needs to be sampled; for example, with a sampling rate of 1/1000 the tracing system samples just one request out of every thousand.
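A minimal sketch of that head-based decision, assuming a configurable rate such as 0.001 (1/1000): the choice is made once when the trace starts, and every downstream span simply inherits it via the propagated context.

```java
import java.util.concurrent.ThreadLocalRandom;

// Head-based sampling: decide once at the entry point of a request whether the
// whole trace is recorded; the decision then travels with the trace context.
public class HeadSampler {
    private volatile double sampleRate; // e.g. 0.001 means 1 request in 1000

    public HeadSampler(double sampleRate) { this.sampleRate = sampleRate; }

    // Allows the rate to be changed at runtime, e.g. raised right after a release.
    public void setSampleRate(double rate) { this.sampleRate = rate; }

    public boolean shouldSample() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }
}
```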
Variable sampling
The overhead Dapper adds to any given process is proportional to the number of traces that process samples per unit time. The first production version of Dapper used a uniform sampling rate of 1/1024 for all processes inside Google. This simple scheme works very well for high-throughput online services, because the events of interest (in the high-throughput case) still tend to occur often enough to be caught.
However, important events can be missed at such low sampling rates by workloads with lower traffic, and those workloads can tolerate a higher sampling rate at an acceptable performance cost. The solution for such systems is to override the default rate, which requires manual intervention, something we tried to avoid in Dapper.
When deploying variable sampling, we therefore parameterize the sampling rate: instead of a uniform rate, we use a target rate of sampled traces per unit time. This way, low-traffic, low-load workloads automatically increase their sampling rate while high-traffic, high-load workloads lower it, keeping the overhead under control. The sampling rate actually used is recorded along with the trace itself, which helps with accurate analysis of the Dapper trace data.
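One way to realize such an expectation rate, sketched under assumptions (this is not Dapper's published algorithm): derive the sampling probability from a target number of sampled traces per second and the recently observed request rate, and record the probability actually used alongside each trace, as described above.

```java
// Adaptive sampler sketch: aim for a target number of sampled traces per second.
// When traffic is low the probability rises toward 1.0; when traffic is high it
// falls, keeping the tracing overhead roughly constant.
public class AdaptiveSampler {
    private final double targetTracesPerSecond;

    public AdaptiveSampler(double targetTracesPerSecond) {
        this.targetTracesPerSecond = targetTracesPerSecond;
    }

    // observedRequestsPerSecond would come from a sliding-window counter in practice.
    public double currentProbability(double observedRequestsPerSecond) {
        if (observedRequestsPerSecond <= 0) return 1.0;
        return Math.min(1.0, targetTracesPerSecond / observedRequestsPerSecond);
    }
}
```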
Coping with aggressive sampling
New Dapper users often worry that a low sampling rate, often as low as 0.01% for high-throughput services, will hurt their analysis. Our experience at Google convinces us that aggressive sampling does not prevent the most important analyses for high-throughput services: if a notable event happens once in such a system, it happens thousands of times. Low-throughput services, perhaps dozens of requests per second rather than hundreds of thousands, can afford to trace every request; this is what led us to adopt adaptive sampling rates.
Additional sampling during collection
The sampling mechanisms described above are designed to minimize noticeable performance loss in the applications that link against the Dapper runtime. The Dapper team also needs to control the total size of the data written to the central repository, so we combine this with a second round of sampling.
Our production clusters currently generate more than 1 TB of sampled trace data per day, and Dapper users expect traces to remain available for at least two weeks after being recorded. The growing density of trace data has to be balanced against the servers and disk storage consumed by Dapper's central repository; a high sampling rate on requests also pushes the Dapper collectors close to their write-throughput limit.
To keep flexibility between the demand for these resources and the growing throughput of Bigtable, we added support for an additional sampling rate in the collection system itself. We exploit the fact that all spans of a given trace share the same trace ID, even though the spans may be spread across thousands of hosts. For each span arriving at the collection system we hash its trace ID into a scalar z with 0 <= z <= 1; if z is below the coefficient configured in the collection system we keep the span and write it to Bigtable, otherwise we discard it. By basing the sampling decision on the trace ID we either keep or discard an entire trace, rather than treating its spans independently. We found that this additional configuration parameter makes the collection pipeline much easier to manage, because the global write rate can be adjusted simply by changing one value in a configuration file.
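The hash step translates almost directly into code; here is a sketch (the choice of hash function is an assumption, since the paper does not specify one):

```java
// Secondary, collection-time sampling: map the trace id to z in [0, 1) and keep a
// span only if z is below the globally configured coefficient. Because every span
// of a trace shares the same trace id, a whole trace is either kept or dropped.
public class CollectionSampler {
    private final double keepCoefficient; // global write-rate knob, e.g. 0.1

    public CollectionSampler(double keepCoefficient) {
        this.keepCoefficient = keepCoefficient;
    }

    static double traceIdToUnitInterval(long traceId) {
        // Any stable hash works; this one just spreads the bits and scales to [0, 1).
        long h = Long.rotateLeft(traceId * 0x9E3779B97F4A7C15L, 31);
        return (h >>> 11) / (double) (1L << 53);
    }

    public boolean keep(long traceId) {
        return traceIdToUnitInterval(traceId) < keepCoefficient;
    }
}
```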
It would be simpler to use a single sampling-rate parameter for the whole tracing and collection pipeline, but that would not allow the runtime sampling configuration on all deployed nodes to be changed quickly. Instead, we chose a runtime sampling rate that produces slightly more data than we can write to the repository and then trim the excess gracefully with the secondary sampling coefficient in the collection system. Dapper's pipeline maintenance becomes easier because global coverage and write rate can be increased or decreased directly by changing the secondary sampling-rate configuration.
The most important limitations of Dapper
- Combining effects: our model implicitly assumes that the different subsystems are each processing one traced request at a time. In some cases it is more efficient to buffer a few requests and then operate on them as a group (for example, coalescing writes to disk). In such cases a traced request can appear as one large unit of work; moreover, when several traced requests are batched together, only one of them supplies the trace ID that the other spans will use, so the rest cannot be traced. The solution we are considering is to identify these cases and untangle them with as few extra records as possible.
- Tracing batch workloads: Dapper was designed mainly for online serving systems, the original goal being to understand the system behavior caused by a single user request. However, offline data-intensive workloads, such as those that fit the MapReduce model, can also benefit from performance insight. In such cases the trace ID needs to be associated with some other meaningful unit of work, such as a key (or range of keys) in the input data, or a MapReduce shard.
- Finding root causes: Dapper can effectively determine which part of the system is slowing everything down, but it cannot always find the root cause. For example, a request may be slow not because of its own behavior but because other requests were queued ahead of it. Applications can use application-level annotations to write queue sizes or overload indicators into the tracing system; also, if this situation is common, the paired sampling technique proposed in ProfileMe can help: it uses two time-overlapping sampling streams and examines their relative latencies across the system.
- Recording kernel-level information: details of kernel-visible events are sometimes useful for determining the root cause of a problem. We have tools that can trace or otherwise profile kernel execution, but it is hard to bring that information into the context of user-level traces in a general and unobtrusive way. We are working on a compromise in which snapshots of a few kernel-level activity parameters are taken at user level and bound to an active span.
The MapReduce model
MapReduce is a programming model for parallel computation over large data sets (typically larger than 1 TB). Its core concepts, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their code on a distributed system without knowing how to do distributed parallel programming. A typical implementation specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a Reduce function, run concurrently, that combines all mapped values sharing the same key.
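As a toy sketch of the model in plain Java (no Hadoop or other framework involved), here is the classic word count: a Map step that emits key-value pairs and a Reduce step that folds together all values sharing the same key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy word count illustrating the Map/Reduce model: map each line to (word, 1)
// pairs, group the pairs by key, and reduce each group by summing its values.
public class WordCount {
    public static Map<String, Integer> mapReduce(List<String> lines) {
        return lines.stream()
                // Map: emit one (word, 1) pair per word
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                // Shuffle + Reduce: group by key and sum the values
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(mapReduce(Arrays.asList("to be or not to be")));
        // e.g. {not=1, be=2, or=1, to=2} (map order is not guaranteed)
    }
}
```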
The paired sampling technique mentioned in ProfileMe
English paper: www.cs.tufts.edu/comp/150PAT…
Pinpoint
Pinpoint is an APM (Application Performance Management) tool for large distributed systems written in Java/PHP. Inspired by Dapper, Pinpoint helps analyze the overall structure of a system and how its components are connected by tracing transactions across distributed applications. Anyone interested in performance analysis in the Java world should look at this open-source project, which was implemented and open-sourced by a Korean team. It uses the JavaAgent mechanism to instrument bytecode (with ASM bytecode technology) in order to inject trace IDs and capture performance data.
Github address: github.com/naver/pinpo…
Tools such as New Relic and OneAPM perform performance analysis on the Java platform in a similar way. New Relic is a commercial APM from abroad; OneAPM is a commercial domestic (Chinese) APM.
Pinpoint architecture schematic diagram
Pinpoint collects data by attaching a PinpointAgent when the host application starts. The agent sends the collected trace and performance data to the Pinpoint Collector in real time, and the Collector stores it in an HBase database. MapReduce jobs over HBase then derive the access topology of the distributed system, the thread state of each node, request/response data, call-stack information, and application performance data, all of which the Pinpoint Web UI can display in real time.
Intrusiveness
Pinpoint uses a Java agent to inject before/after logic into specified methods of the application on each node and to send messages to the server, so basically no code needs to be modified, only a simple configuration change.
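As a rough sketch of the mechanism (not Pinpoint's actual agent), a Java agent declares a premain entry point and receives an Instrumentation handle through which bytecode transformers are registered; the real work of weaving trace calls into method bodies is then done with a library such as ASM. The jar's manifest must declare the Premain-Class, and the JVM is started with -javaagent.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Skeleton of a Java agent: started with -javaagent:trace-agent.jar, the JVM calls
// premain before the application's main method, letting the agent register a
// bytecode transformer. A real APM agent would rewrite selected method bodies here
// (e.g. with ASM) to start and stop spans around interesting calls.
public class TraceAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Returning null means "leave this class unchanged";
                // instrumentation logic for selected classes would go here.
                return null;
            }
        });
    }
}
```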
Basic principles of APM probe
SkyWalking
SkyWalking is an APM system designed for microservice and cloud-native architectures. It was open-sourced by a Chinese engineer, Wu Sheng, and has joined the Apache incubator. It automatically collects the desired metrics through probes and performs distributed tracing; from these call chains and metrics, SkyWalking can understand the relationships between applications and services and produce the corresponding statistics. The frameworks and containers whose links SkyWalking can trace and monitor cover the mainstream choices, such as the domestic RPC frameworks Dubbo and Motan and the international Spring Boot and Spring Cloud.
Github address: github.com/apache/incu…
SkyWalking's overall architecture is divided into three parts:
- SkywalkingAgent: a probe attached to the monitored application that collects tracing and metric data and sends it to the Collector;
- SkywalkingCollector: the link-data collector; the data can be persisted to ElasticSearch or H2;
- SkywalkingUI: a web visualization platform for displaying the persisted data.
Zipkin
Zipkin was open-sourced by Twitter and is also modeled on Dapper. Zipkin's Java applications use a component called Brave to collect performance data inside the application. Brave: github.com/openzipkin/…
Github address: github.com/openzipkin/…
Zipkin architecture diagram
Zipkin is divided into four parts:
- ZipkinCollector: once data is collected and sent to the Collector, it is validated, stored, and indexed;
- Storage: Zipkin data can be stored in Cassandra, ElasticSearch, and MySQL;
- Query API: provides data Query and retrieval services;
- Web UI: Visual presentation platform for displaying trace data.
CAT
CAT is an open-source project from Meituan-Dianping: a Java-based real-time application monitoring platform that covers both application monitoring and business monitoring and can produce more than a dozen kinds of reports. CAT positions itself as a real-time monitoring platform, but it is really more of a data warehouse than a monitoring platform, with rich report analysis built on top of that warehouse. The way CAT does tracing, however, is by hard-coding "burying points" into the code, which is intrusive. This has pros and cons: the advantage is that you can add burying points exactly where you need them, making them more targeted; the disadvantage is that existing systems must be modified, which many development teams are reluctant to do.
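To show what a "burying point" looks like in practice, here is a sketch based on CAT's documented Transaction API (method names follow the project's README; treat the exact signatures as approximate):

```java
import com.dianping.cat.Cat;
import com.dianping.cat.message.Transaction;

// Manual instrumentation in the CAT style: the developer wraps the code to be
// measured in a Transaction and reports its status explicitly.
public class OrderService {
    public void placeOrder(String orderId) {
        Transaction t = Cat.newTransaction("Service", "placeOrder");
        try {
            // ... business logic to be measured ...
            t.setStatus(Transaction.SUCCESS);
        } catch (Exception e) {
            t.setStatus(e);       // marks the transaction as failed
            Cat.logError(e);      // reports the exception to the CAT server
        } finally {
            t.complete();         // flushes the timing data for this burying point
        }
    }
}
```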
Github address: github.com/dianping/ca…
The CAT architecture diagram
CAT is split into a client and a server. The client reports logs in a unified format to the server through the CAT API. The client side is wherever logs are produced (generally the monitored application, the "application" nodes in the figure above), and the server side is where logs are received and consumed (the server nodes in the figure above); after the logs are consumed, the log reports are generated.
Comparison
| Solution | Dependencies | Implementation | Storage | JVM monitoring | Trace query | Intrusiveness | Deployment cost |
|---|---|---|---|---|---|---|---|
| Pinpoint | Java 6/7/8, Maven 3+, HBase 0.94+ | Java probe, bytecode enhancement | HBase | Supported | Needs secondary development | Lowest | Higher |
| SkyWalking | Java 6/7/8, Maven 3.0+, ZooKeeper, Elasticsearch | Java probe, bytecode enhancement | Elasticsearch, H2, MySQL, TiDB, ShardingSphere | Supported | Supported | Low | Low |
| Zipkin | Java 6/7/8, Maven 3.2+, RabbitMQ | Intercepts requests and sends (HTTP, MQ) data to the Zipkin service | In-memory, MySQL, Cassandra, Elasticsearch | Not supported | Supported | High, requires development | Medium |
| CAT | Java 6/7/8, Maven 3+, MySQL 5.6/5.7, Linux 2.6+, Hadoop optional | Code burying points (interceptors, annotations, filters, etc.) | MySQL, HDFS | Not supported | Supported | High, requires burying points | Medium |
Considering low intrusion into program source code and configuration files, the recommended order of preference is Pinpoint > SkyWalking > Zipkin > CAT.
Pinpoint: basically no need to modify source code or configuration files; just specify the javaagent parameter on the command line, which is the most convenient option for operations staff. SkyWalking: no need to modify source code, but the configuration file must be changed. Zipkin: configuration files such as Spring's and web.xml need to be modified, which is somewhat more cumbersome. CAT: because burying points must be added in the source code, it is unlikely to be done by operations staff alone; developers must be deeply involved.
Compared with traditional monitoring software such as Zabbix, APM focuses on analyzing performance bottlenecks inside the system and in the calls between systems, which makes it much better at locating the exact cause of a problem, instead of merely providing a few scattered monitoring points and metrics that, even when they raise an alarm, do not tell you where the problem actually is.
Conclusion
Mainstream APM tools modify application code in the least intrusive way possible, which makes them easier to adopt. To cope with the rapid growth of cloud computing, microservices, and containerization, and the resulting explosion in the volume of data APM has to monitor, the trace data is generally persisted in storage systems built for massive volumes.
In the future, big data and machine learning will play an important role in the data-analysis and performance-analysis side of APM, and APM functionality will evolve from plain resource and application monitoring toward automated, intelligent capabilities such as anomaly detection, performance diagnosis, and prediction.
When there is time later, I will put together an article on the JavaAgent mechanism and ASM bytecode technology.