Author: Ya Hai

The “third way to play” with link tracing

When it comes to link tracing, the natural uses are examining the call chain to troubleshoot a single abnormal request, or using pre-aggregated link statistics for service monitoring and alerting. In fact, there is a third way to play with link tracing: it can delimit the scope of a problem faster than combing through call chains, and it offers more flexibility for custom diagnostics than pre-aggregated monitoring charts. This is post-aggregation analysis based on detailed link data, or link analysis for short.

Link analysis works on the stored, detailed link data: you can freely combine filter conditions and aggregation dimensions for real-time analysis, meeting customized diagnosis needs in different scenarios. For example, you can check the time distribution of slow calls that take longer than 3 seconds, the distribution of error requests across different machines, or the traffic trend of VIP customers. Next, this article shows how to quickly locate five classic kinds of online problems through link analysis, to give a more intuitive sense of its usage and value.

Link analysis K.O.s the “five classic problems”

Link analysis based on post-aggregation is very flexible to use. This article lists only the five most typical scenarios; you are welcome to explore and share more.

[Traffic imbalance] A misconfigured load balancer sends a large number of requests to a small number of machines, and the resulting “hot spots” affect service availability. What should I do?

The “hot spot breakdown” problem caused by uneven traffic can easily lead to service unavailability, and there are many such cases in production: the load balancing configuration is wrong, restarted nodes fail to come back online because of a registry exception, or the DHT hashing factor is abnormal.

The biggest risk of uneven traffic is that the “hot spot” itself is hard to observe directly: what surfaces is slow responses or errors. Because traditional monitoring cannot reveal the hot spot, most engineers do not consider this factor at first, wasting precious emergency-response time and allowing the impact of the failure to spread.

With link analysis, you can group the link data by IP address to quickly see which machines the requests are distributed across, and in particular how the traffic distribution changes before and after the problem occurs. If a large number of requests suddenly concentrate on one or a few machines, the problem is very likely a hot spot caused by uneven traffic. Combined with the change events around the time of the failure, the fault can be located quickly and rolled back in time.
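
As a rough sketch of what that grouping step looks like, the snippet below assumes the detailed span records have already been pulled from trace storage into memory; the `SpanRecord` type and its field names are hypothetical stand-ins for whatever schema your tracing backend actually exposes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TrafficDistribution {

    // Hypothetical projection of one stored span: which host served it and when.
    record SpanRecord(String hostIp, long epochMillis) {}

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("10.0.0.1", 1_700_000_000_000L),
                new SpanRecord("10.0.0.1", 1_700_000_001_000L),
                new SpanRecord("10.0.0.2", 1_700_000_002_000L));

        // Post-aggregation: group the detailed span data by host IP and count requests.
        Map<String, Long> requestsPerHost = spans.stream()
                .collect(Collectors.groupingBy(SpanRecord::hostIp, TreeMap::new, Collectors.counting()));

        // A host holding a disproportionate share of requests is a hot-spot candidate.
        requestsPerHost.forEach((ip, count) -> System.out.println(ip + " -> " + count));
    }
}
```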

[Single-node fault] Some requests fail or time out because of single-node faults such as a damaged network adapter, CPU oversubscription, or a full disk. How do I deal with them?

Single-machine failures happen all the time, especially in core clusters, where the large number of nodes makes them an almost “inevitable” event in terms of statistical probability. A single-node failure does not make the service unavailable across the board, but it can cause a small number of user requests to fail or time out, continuously hurting user experience and adding support costs, so such problems need to be handled promptly.

Single-machine faults can be divided into host faults and container faults (Node and Pod in a K8s environment). CPU oversubscription, hardware failures, and the like occur at the host level and affect all containers on the host, whereas faults such as a full disk or memory exhaustion affect only a single container. Therefore, when troubleshooting single-node faults, analyze both the host IP address and the container IP address.

Facing such problems, you can use link analysis to filter out abnormal or timed-out requests and aggregate them by host IP or container IP to quickly determine whether a single-machine fault exists. If the abnormal requests are concentrated on one machine, replace it for quick recovery, or check its system parameters, for example whether the disk is full or the CPU steal time is too high. If the abnormal requests are scattered across multiple machines, you can rule out a single-machine failure and turn to analyzing downstream dependencies or the program logic.
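
A minimal sketch of that two-level grouping, again assuming hypothetical span records with host and Pod IP fields rather than any specific backend's schema:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SingleNodeFaultCheck {

    // Hypothetical projection of one stored span; field names depend on your backend's schema.
    record SpanRecord(String hostIp, String podIp, boolean error, long durationMillis) {}

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("10.0.0.1", "172.16.0.11", true, 3200),
                new SpanRecord("10.0.0.1", "172.16.0.12", true, 5100),
                new SpanRecord("10.0.0.2", "172.16.0.21", false, 80));

        long timeoutThresholdMillis = 3000;

        // Keep only abnormal requests: errors or calls slower than the timeout threshold.
        List<SpanRecord> abnormal = spans.stream()
                .filter(s -> s.error() || s.durationMillis() > timeoutThresholdMillis)
                .toList();

        // Group by host IP: host-level issues (CPU oversubscription, hardware) show up here.
        Map<String, Long> byHost = abnormal.stream()
                .collect(Collectors.groupingBy(SpanRecord::hostIp, Collectors.counting()));

        // Group by container (Pod) IP: container-level issues (full disk, OOM) show up here.
        Map<String, Long> byPod = abnormal.stream()
                .collect(Collectors.groupingBy(SpanRecord::podIp, Collectors.counting()));

        System.out.println("abnormal by host: " + byHost);
        System.out.println("abnormal by pod:  " + byPod);
    }
}
```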

[Slow interface governance] Performance needs to be optimized before a new application goes live or ahead of a big promotion. How can I quickly sort out the list of slow interfaces and resolve the performance bottlenecks?

Systematic performance tuning is usually required before a new application goes live or ahead of a big promotion. The first step is to analyze the current system's performance bottlenecks and sort out the list of slow interfaces and how often they occur.

In this case, you can use link analysis to filter out the calls whose duration exceeds a certain threshold and then group the statistics by interface name. This quickly yields the list of slow interfaces and how frequently each one occurs, so the highest-frequency slow interfaces can be governed one by one.
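
The filter-then-group step could be sketched as follows, again over hypothetical span records; the 3-second threshold is just an example value.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SlowInterfaceReport {

    // Hypothetical projection of one stored span: the interface (span name) and how long it took.
    record SpanRecord(String interfaceName, long durationMillis) {}

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("/api/order/create", 4200),
                new SpanRecord("/api/order/create", 3800),
                new SpanRecord("/api/cart/list", 120),
                new SpanRecord("/api/item/detail", 3500));

        long slowThresholdMillis = 3000;

        // Filter calls above the slow threshold, then count occurrences per interface.
        Map<String, Long> slowCountByInterface = spans.stream()
                .filter(s -> s.durationMillis() > slowThresholdMillis)
                .collect(Collectors.groupingBy(SpanRecord::interfaceName, Collectors.counting()));

        // Sort descending by frequency to get the slow-interface worklist.
        slowCountByInterface.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .forEach(e -> System.out.println(e.getKey() + " slow calls: " + e.getValue()));
    }
}
```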

After finding the slow interface, you can locate the root cause of the slow call based on relevant data such as call chain, method stack, and thread pool. Common causes include the following:

  • The database or microservice connection pool is too small and many requests are stuck waiting to obtain a connection; increase the maximum connection pool size.
  • N+1 problems, for example one external request fanning out into hundreds of internal database calls; merge the fragmented requests to cut network transmission time (see the sketch after this list).
  • A single request carries too much data, leading to long network transmission and deserialization times and even full GC; change full queries to paginated queries to avoid fetching too much data at once.
  • The logging framework hits a “hot lock”; switch log output from synchronous to asynchronous.
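
For the N+1 item above, the fix usually amounts to replacing a per-item lookup loop with one batched call. The repository interface, its method names, and the in-memory implementation below are hypothetical, shown only to illustrate the shape of the change:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class NPlusOneFix {

    record Item(long id, String name) {}

    // Hypothetical data-access interface; method names are illustrative only.
    interface ItemRepository {
        Item findById(long id);                       // one round trip per call
        Map<Long, Item> findAllByIds(List<Long> ids); // single batched round trip
    }

    // Trivial in-memory implementation so the sketch runs as-is.
    static class InMemoryRepo implements ItemRepository {
        private final Map<Long, Item> data = Map.of(
                1L, new Item(1L, "keyboard"),
                2L, new Item(2L, "mouse"));

        public Item findById(long id) { return data.get(id); }

        public Map<Long, Item> findAllByIds(List<Long> ids) {
            return ids.stream().collect(Collectors.toMap(Function.identity(), data::get));
        }
    }

    public static void main(String[] args) {
        ItemRepository repo = new InMemoryRepo();
        List<Long> ids = List.of(1L, 2L);

        // Before: N+1 pattern - one external request fans out into N lookups.
        for (long id : ids) {
            System.out.println(repo.findById(id)); // N round trips in a real system
        }

        // After: merge the fragmented requests into one batched call.
        System.out.println(repo.findAllByIds(ids).values()); // 1 round trip
    }
}
```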

[Business traffic statistics] How do I analyze the traffic changes and service quality of key customers or channels?

In a real production environment, services are usually standardized, but businesses need to be classified and tiered. For the same order service, we need to break the statistics down by category, channel, user group, and other dimensions to achieve fine-grained operations. For example, in offline retail channels, the stability of every order and every POS machine may trigger a public-opinion incident, so the SLA requirements of offline channels are much higher than those of online channels. How, then, can we accurately monitor the traffic and service quality of the offline retail link within a general e-commerce service system?

Here, you can use custom attribute filtering and statistics in link analysis to implement low-cost business-link analysis. For example, label offline orders with {“attributes.channel”: “offline”} on the entry service, and add further labels per store, customer group, and product category. Then filter on attributes.channel = offline and group by the different business labels to count call volume, latency, and error rate, so that the traffic trend and service quality of each business scenario can be analyzed quickly.
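
If your instrumentation follows the OpenTelemetry API, tagging the entry span could look like the sketch below; the attribute keys (channel, store), the service name, and the span name are illustrative assumptions, and a commercial APM agent typically offers an equivalent tagging API.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderEntryService {

    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("order-entry-service");

    // Hypothetical entry point of the portal/entry service handling an order request.
    public void handleOrder(String channel, String storeId) {
        Span span = TRACER.spanBuilder("createOrder").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Custom business tags: later they serve as filter and group-by
            // dimensions in link analysis, e.g. channel = "offline" grouped by store.
            span.setAttribute("channel", channel);
            span.setAttribute("store", storeId);

            // ... actual order-processing logic would go here ...
        } finally {
            span.end();
        }
    }
}
```

The same pattern applies to the store, customer-group, and category labels mentioned above.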

[Grayscale release monitoring] An application is released to 500 machines in 10 batches. How can we quickly judge whether the first gray batch is abnormal?

The “three axes” of change management, “gray-releasable, monitorable, and rollback-able,” are an important criterion for ensuring online stability. Releasing changes in gray batches is the key way to reduce risk and control the blast radius. Once a gray batch shows abnormal service status, it should be rolled back promptly rather than continuing the release. However, many failures in production environments are caused by the lack of effective gray-release monitoring.

For example, when a microservice registry was abnormal, machines that were restarted during a release could not register and come back online. Because gray-release monitoring was missing, even though every machine restarted in the earlier batches failed to register and all traffic was routed to the remaining machines, the application's overall traffic and latency showed no significant change. Only when the last batch of machines also failed to register after restarting did the entire application become completely unavailable, eventually causing a serious online failure.

In the case above, if the traffic on different machines is tagged with its version, for example {“attributes.version”: “v1.0.x”}, and link analysis groups by attributes.version, then the traffic and service quality before and after the release, or across versions, can be clearly distinguished, and anomalies in a gray batch are not masked by the global monitoring.
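
As an illustration, comparing error rates per version on the analysis side might look like this sketch over hypothetical span records that carry the version tag:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GrayReleaseCheck {

    // Hypothetical projection of one stored span, carrying the version tag set at release time.
    record SpanRecord(String version, boolean error) {}

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
                new SpanRecord("v1.0.1", false),
                new SpanRecord("v1.0.1", false),
                new SpanRecord("v1.0.2", true),
                new SpanRecord("v1.0.2", false));

        // Group by release version and compute the error rate of each version separately,
        // so a bad gray batch is not drowned out by the healthy majority.
        Map<String, Double> errorRateByVersion = spans.stream()
                .collect(Collectors.groupingBy(
                        SpanRecord::version,
                        Collectors.averagingDouble(s -> s.error() ? 1.0 : 0.0)));

        errorRateByVersion.forEach((version, rate) ->
                System.out.printf("%s error rate: %.1f%%%n", version, rate * 100));
    }
}
```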

Constraints on link analysis

Although link analysis is very flexible and can meet customized diagnosis needs in different scenarios, it also has several usage constraints:

  1. Analysis based on detailed link data is costly. The prerequisite of link analysis is to report and store link detail data as completely as possible; if the sampling rate is low, the effectiveness of link analysis is compromised. To reduce the cost of full storage, edge data nodes can be deployed in the user's cluster for temporary data caching and processing, reducing cross-network reporting costs. Alternatively, hot and cold data can be stored separately on the server side: full link analysis runs on hot storage, while fault and slow-call diagnosis runs on cold storage.

  2. Post-aggregation queries are expensive and support only low concurrency, so they are not suitable for alerting. Link analysis scans and aggregates the full data in real time, and its query cost is much higher than that of pre-aggregated statistics, so it cannot serve high-concurrency alert queries. To support alerts and custom dashboards, the post-aggregation statement needs to be pushed down to the client and turned into custom counter statistics (see the sketch after this list).

  3. To maximize the value of link analysis, custom tag instrumentation is required. Unlike the pre-aggregated metrics of standard application monitoring, many custom scenarios need manually added tags to effectively distinguish different business scenarios and achieve precise analysis.
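
As an example of constraint 2, “pushing the statistic down to the client” could be implemented with an OpenTelemetry SpanProcessor that turns “slow calls per interface” into a pre-aggregated counter that alert rules can query cheaply; the 3-second threshold, the metric name, and the scope name are assumptions made for this sketch.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.ReadWriteSpan;
import io.opentelemetry.sdk.trace.ReadableSpan;
import io.opentelemetry.sdk.trace.SpanProcessor;

import java.util.concurrent.TimeUnit;

// Client-side pre-aggregation: instead of scanning stored spans at query time, count slow
// calls as spans finish, and alert on the resulting metric. Threshold and names are examples.
public class SlowCallCounterProcessor implements SpanProcessor {

    private static final long SLOW_THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(3);

    private final LongCounter slowCalls = GlobalOpenTelemetry
            .getMeter("link-analysis-sketch")
            .counterBuilder("slow_calls_total")
            .build();

    @Override
    public void onStart(Context parentContext, ReadWriteSpan span) {
        // nothing to do at span start
    }

    @Override
    public boolean isStartRequired() {
        return false;
    }

    @Override
    public void onEnd(ReadableSpan span) {
        if (span.getLatencyNanos() > SLOW_THRESHOLD_NANOS) {
            // One counter increment per slow span, dimensioned by interface (span) name.
            slowCalls.add(1, Attributes.of(
                    AttributeKey.stringKey("interface"), span.getName()));
        }
    }

    @Override
    public boolean isEndRequired() {
        return true;
    }
}
```

The processor would be registered when building the tracer provider, for example via SdkTracerProvider.builder().addSpanProcessor(new SlowCallCounterProcessor()).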

Link Analysis adds “wings of freedom” to APM

Link data contains rich value. The traditional call chain and pre-aggregated service views are only two classic, fixed-pattern applications of it. Post-aggregation link analysis fully unlocks the flexibility of diagnosis and can meet customized diagnosis needs in any scenario and along any dimension. Combined with custom metric-generation rules, it can greatly improve the precision of monitoring and alerting, giving your APM “wings of freedom.” We invite you to try it out, explore, and share your experience with us!
