All in One: How to Build an End-to-End Observability System

Sigue & Bai Yu

The past and present of observability

Observability and fault analysis are key measures of how well a system is operated and maintained. As system architectures, resource units, resource acquisition modes, and communication modes evolve, operations and maintenance (O&M) faces major challenges, and those challenges in turn drive the development of O&M technology. Before getting into today's topic, let's look at the past and present of observability. Monitoring and observability have been developing for nearly 30 years.

In the late 1990s, computing moved from mainframes to desktops and client-server (CS) applications became popular. To monitor these CS applications, the first generation of APM software was born. Application architectures were still very simple, so operations teams focused on network performance and host resources. At this stage, these tools were still called monitoring tools.

By the year 2000, the Internet had exploded and the browser had become the new user interface. Applications evolved into a browser-based three-tier architecture (browser, application, database). At the same time, Java became the dominant programming language for enterprise software. The idea of "write once, run anywhere" greatly improved productivity, but the Java virtual machine also hid the details of how code ran, which made tuning and troubleshooting harder. Code-level trace diagnosis and database tuning became new concerns, giving rise to a new generation of monitoring tools: APM (application performance monitoring).

After 2005, distributed applications became the first choice of many enterprises, and SOA-based architectures and ESB applications became popular. At the same time, virtualization gradually took hold, and the physical server faded from view, replaced by virtual resources that cannot be seen or touched. Third-party components such as message queues and caches also began to be used in production. In this environment a new generation of APM software was born: enterprises needed end-to-end tracing together with monitoring of virtual resources and third-party components, and these became the core capabilities of the new generation of APM.

After 2010, as cloud native architecture was put into practice, applications gradually shifted from monoliths to microservices, and business logic became calls and requests between microservices. Virtualization went further, container management platforms were adopted by more and more enterprises, third-party components evolved into cloud services, and the whole application architecture became cloud native. Service call paths grew longer, traffic flow became harder to control, and troubleshooting became more difficult. A new capability, observability, was therefore needed: one that continuously analyzes the whole application lifecycle (development, testing, and O&M) using observability data that covers the full stack (metrics, logs, traces, and events).

As you can see, observability has become part of cloud native infrastructure. It has evolved from a pure O&M concern to cover testing and development as well. Its purpose has expanded from keeping the business running normally to accelerating business innovation and enabling fast iteration.

Monitoring, APM, and observability: similarities and differences

The history above traces the evolution from monitoring to APM to observability. Now let's look at how the three relate. To explain, we borrow a classic cognitive model that divides the world along two dimensions: awareness (do we know about it?) and understanding (do we understand it?).

First, the things we know and understand, which we call facts. In our context, this corresponds to monitoring. For example, in O&M work we decide from the start to monitor server CPU utilization; whether it reads 80% or 90%, that is an objective fact. This is what monitoring does: knowing what to monitor, we collect the corresponding metrics and set up monitoring for them.

Next, the things we know but don't understand. For example, monitoring shows that CPU utilization has reached 90%, but why it is so high is something that has to be investigated. With APM, we can collect and analyze application performance on the host and find that a logging framework adds high latency along the application call chain, which drives up the host's CPU utilization. In this case, APM's application-layer analysis reveals the cause behind the high CPU utilization.

Then there are the things we understand but don't know. Take the high CPU utilization scenario again: if we can learn from historical data and associated events to predict that CPU utilization will spike at some point in the future, we can raise an early warning.
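The early-warning idea can be illustrated with a very small sketch. It assumes equally spaced CPU utilization samples and uses a plain least-squares trend line as a stand-in for "learning from historical data"; the function names and the 90% threshold are hypothetical, and a real system would use a much richer model and would also take associated events into account.

```python
from typing import Sequence


def forecast_cpu(samples: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate CPU utilization with a least-squares line.

    `samples` are assumed to be equally spaced in time, oldest first.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # The last observed sample sits at x = n - 1.
    return intercept + slope * (n - 1 + steps_ahead)


def should_warn(samples: Sequence[float], steps_ahead: int,
                threshold: float = 90.0) -> bool:
    """Return True if the trend is expected to reach the threshold."""
    return forecast_cpu(samples, steps_ahead) >= threshold
```

With samples rising steadily by five points per interval, the sketch warns once the extrapolated value reaches the threshold, before the actual utilization gets there.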

Finally, the things we neither know nor understand. Continuing the example, monitoring shows the CPU spike and APM points to the application's logging framework. If we go further and analyze user access data for the same period, we find that requests from Apple devices in Shanghai have response times 10 times longer than other requests, and that, because of how the logging framework is configured, this type of request generates a massive volume of INFO logs, which causes the CPU surge on some machines. This is the observability process: solving problems you did not know about (poor performance for Apple-device access from Shanghai) and did not understand (a misconfigured logging framework producing massive INFO logs).
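The drill-down in this example, slicing request data by dimensions nobody thought to monitor in advance, can be sketched as follows. The record fields (`city`, `device`, `latency_ms`) and the `factor` cutoff are assumptions made for illustration; they are not taken from any particular product.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def slow_segments(requests: List[Dict], factor: float = 5.0) -> List[Tuple[str, str]]:
    """Return (city, device) segments whose mean latency is at least
    `factor` times the mean latency of all other requests."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r["city"], r["device"])].append(r["latency_ms"])

    flagged = []
    for key, latencies in groups.items():
        # Compare this segment against every request outside of it.
        others = [v for k, vs in groups.items() if k != key for v in vs]
        if others and mean(latencies) >= factor * mean(others):
            flagged.append(key)
    return flagged
```

The point of the sketch is that the dimensions are chosen at query time: the same raw request data can be regrouped by any combination of attributes until the outlier segment shows up.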

To summarize: monitoring focuses on metrics, often at the infrastructure layer, such as machine or network performance metrics. Dashboards and alert rules are then built on these metrics to watch for things within a known range. Once monitoring detects a problem, APM uses application-layer diagnostic tools, such as traces, memory analysis, and thread analysis, to locate the root cause of the abnormal metrics.

Observability is application-centric. By correlating and analyzing logs, traces, metrics, events, and other observability data sources, it finds the root cause more quickly and directly. It also provides an exploration interface that lets users analyze the data flexibly and freely. In addition, observability is integrated with cloud services, so that capabilities such as elastic scaling and high availability can be applied right away: when a problem is found, it can be resolved and the application service restored more quickly.

Key points of building an observability system

While observability brings great business value, building such a system is a considerable challenge. It is not only a matter of tool or technology selection but also an O&M philosophy. It covers three parts: collecting observability data, analyzing it, and delivering value from it.

Observability data collection

The observability data most widely promoted in the industry today rests on three pillars: logging, tracing, and metrics. Some requirements are common to all three.

1) Full stack coverage

Observability data must be collected from every layer: the infrastructure layer, the container layer, the cloud-hosted applications above them, and the user terminals, with the corresponding metrics, traces, and events for each.

2) Unified standards

The industry is converging on unified standards. For metrics, Prometheus has established a consensus metric data standard for the cloud native era. For trace data, standards have gradually become mainstream with the adoption of OpenTracing and OpenTelemetry. For logs, the data is too loosely structured to have produced a data standard, but open source newcomers such as Fluentd and Loki have emerged for collection, storage, and analysis. Grafana, meanwhile, is becoming the standard for visualizing all kinds of observability data.

3) Data quality

Data quality is important and easily neglected. On the one hand, data sources from different monitoring systems need defined data standards to keep analysis accurate. On the other hand, a single event may produce a large number of duplicate metrics, alerts, and logs; filtering, noise reduction, and aggregation are needed so that analysis is performed on data that actually has analytical value. This is often where the gap between open source and commercial tools is widest. Take collecting an application call chain as a simple example. How deep should the call chain be collected? What is the sampling policy? Can all erroneous and slow calls be captured? Can the sampling policy be adjusted dynamically based on rules? The answers determine the quality of the collected observability data.
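The sampling questions above can be made concrete with a minimal sketch of one possible policy: always keep traces that contain an error or exceed a latency threshold, sample the rest at a base rate, and allow that rate to be changed at run time. The `Trace` fields, the class name, and the default values are hypothetical; this is not the policy of any specific tracing tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


class TraceSampler:
    """Keep every erroneous or slow trace; sample the rest at base_rate."""

    def __init__(self, base_rate: float = 0.1, slow_threshold_ms: float = 1000.0):
        self.slow_threshold_ms = slow_threshold_ms
        self.set_base_rate(base_rate)

    def set_base_rate(self, rate: float) -> None:
        # Dynamic adjustment hook, e.g. lower the rate under heavy load.
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.base_rate = rate

    def should_keep(self, trace: Trace) -> bool:
        if trace.has_error or trace.duration_ms >= self.slow_threshold_ms:
            return True  # errors and slow calls are never dropped
        return random.random() < self.base_rate
```

Because the decision depends on the finished trace (its duration and error status), a policy like this has to run after the call completes rather than at the first span.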

Observability data analysis

1) Horizontal and vertical correlation

In today's observability systems, the application is a very good angle for analysis. First, applications are related to each other and can be connected through call chains. This covers how microservices call each other and how applications call cloud services and third-party components; all of these can be associated through traces. Second, applications can be mapped vertically to the container layer and the resource layer. With the application at the center, horizontal and vertical association together form a global correlation of observability data. When a problem occurs and needs to be located, analysis can proceed in a unified way from the application's perspective.

2) Domain knowledge

Faced with massive data, how can we find problems faster and locate them more accurately? Besides application-centric data correlation, locating and analyzing problems requires domain knowledge. The most important thing for an observability tool or product is to keep accumulating the best troubleshooting paths, common problem locations, and root cause decision paths, and to codify that experience. This is like giving the O&M team an experienced O&M engineer who can quickly identify problems and locate root causes. It is also what sets this approach apart from traditional AIOps capabilities.

Observability value output

1) Unified presentation

As mentioned above, observability must cover every layer, and each layer has its own observability data. However, observability tools are currently very fragmented, and presenting the data they generate in one place is a big challenge. Unifying the data itself is quite difficult, since formats, encoding rules, dictionary values, and so on all differ. Unifying the presentation of results is achievable, though, and the current mainstream solution is to build a unified monitoring dashboard with Grafana.

2) Collaborative processing

After unified display and alerting, ChatOps, which uses collaboration platforms such as DingTalk and WeCom (Enterprise WeChat) to find and handle problems more efficiently, has gradually become a critical requirement.

3) Cloud service linkage

Observability has become part of cloud native infrastructure. When the observability platform discovers and locates a problem, it needs to link quickly with cloud services to scale out or rebalance load, so that the problem is resolved sooner.

Prometheus + Grafana in practice

Thanks to the flourishing cloud native open source ecosystem, building a monitoring system is easy: Prometheus + Grafana for basic monitoring, SkyWalking or Jaeger for tracing, ELK or Loki for logging. For the O&M team, however, different types of observability data are stored in different back ends, troubleshooting still means jumping between multiple systems, and efficiency cannot be guaranteed. For this reason, Alibaba Cloud provides enterprises with a one-stop observability platform, ARMS (Application Real-Time Monitoring Service). ARMS is a product family that covers different observability scenarios, for example:

For the infrastructure layer, Prometheus monitors various cloud services, including ECS, VPC, containers, and third-party middleware.
For the application layer, application monitoring based on the Java agent developed by Alibaba Cloud fully meets application monitoring needs. Compared with open source tools, data quality and other aspects are greatly improved. Data collected with open source SDKs or agents can also be reported to the application monitoring platform through tracing.
For the user experience layer, mobile monitoring, front-end monitoring, cloud dial testing, and other modules fully cover user experience and performance on different terminals.
Unified alerting: performs unified alerting and root cause analysis on the data and alerts collected at each layer, and presents the findings through Insight.
Unified interface: whether the data comes from ARMS, Prometheus, Log Service, Elasticsearch, MongoDB, or another data source, the fully managed Grafana service can present it in one place and build a unified monitoring dashboard. It also provides CloudOps capability through linkage with Alibaba Cloud services.

As described above, ARMS is a one-stop product with many capabilities. Some enterprises have already built capabilities similar to ARMS or adopted individual ARMS products such as application monitoring and front-end monitoring. A complete observability system is still very important to them, however, and they want to build one that meets their business needs on top of open source. The following examples focus on how to build an observability system with Prometheus + Grafana.

Fast data access

In ARMS, we can quickly create a dedicated Grafana instance, and data sources such as ARMS Prometheus, SLS (Log Service), and CMS (CloudMonitor) can be synchronized to it very conveniently. Open Configuration to view the data sources. Besides making data source access fast, this keeps the workload of day-to-day data source management as low as possible.

Preset dashboards

After the data sources are connected, Grafana automatically creates the corresponding dashboards. For application monitoring and container monitoring, for example, basic data such as the three golden metrics and interface changes are provided by default.

As you can see, Grafana helps you set up the individual dashboards, but they are still fragmented. In day-to-day O&M, a unified dashboard needs to be created per business domain or application, so that data from the infrastructure layer, container layer, application layer, and user terminal layer is displayed together for overall monitoring.

Full-stack unified dashboard

When building the full-stack unified dashboard, we organize it by the following dimensions: user experience, application performance, container layer, cloud services, and underlying resources.

1) User experience monitoring

Key data such as PV and UV, JS error rate, first render time, API request success rate, and TopN page performance is presented first.

2) Application performance monitoring

The three golden metrics (request volume, error rate, and response time) are shown here, broken down by application and by service.

3) Container layer monitoring

The performance and usage of each Pod, along with the Deployments that these applications run under. Pod performance information for these Deployments is presented in this section.

4) Cloud service monitoring

Next is cloud service monitoring. Taking the message queue Kafka as an example, this section shows common message service metrics such as consumer backlog and consumption volume.

5) Host node monitoring

For each host node, data such as CPU usage and running Pods.
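As an illustration of the application performance panel described above, the three golden metrics can be computed per service from raw request records. This is a minimal sketch; the `Request` fields and the output keys are assumptions for the example, and a real dashboard would normally derive these values from metrics already stored in Prometheus rather than from raw records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    service: str
    duration_ms: float
    ok: bool


def golden_metrics(requests: List[Request]) -> Dict[str, Dict[str, float]]:
    """Compute request volume, error rate, and average response time,
    broken down by service."""
    by_service: Dict[str, List[Request]] = {}
    for r in requests:
        by_service.setdefault(r.service, []).append(r)

    result = {}
    for service, rs in by_service.items():
        count = len(rs)
        errors = sum(1 for r in rs if not r.ok)
        result[service] = {
            "requests": count,
            "error_rate": errors / count,
            "avg_response_ms": sum(r.duration_ms for r in rs) / count,
        }
    return result
```

Grouping by service first is what allows the panel to show one row per application or service rather than a single global figure.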

In this way, the dashboard covers overall performance monitoring from the user experience layer through the application layer down to the infrastructure and container layers. More importantly, it contains all the data related to each microservice: when you switch to a service, the performance data associated with that service is displayed on its own, and you can filter at different levels such as containers, applications, and cloud services. A brief note on how this is done: when Prometheus monitors cloud services, it also collects the tags attached to them. By tagging cloud services, they can be distinguished by business dimension or application. Building a unified dashboard also raises many data source management problems. For this we provide the globalView capability, which aggregates all Prometheus instances under the account for unified queries, whether for application-layer or cloud service information.
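The tag-based filtering and the cross-instance query can be pictured with a small in-memory sketch. It only imitates the idea of equality matching on labels and of querying several instances at once; the sample layout, the function names `select` and `query_all`, and the label names are hypothetical and do not represent the Prometheus or globalView API.

```python
from typing import Dict, Iterable, List

# A sample is modeled as {"name": str, "labels": {str: str}, "value": float}.
Sample = Dict[str, object]


def select(samples: Iterable[Sample], **matchers: str) -> List[Sample]:
    """Keep samples whose labels equal every given matcher."""
    return [
        s for s in samples
        if all(s["labels"].get(k) == v for k, v in matchers.items())
    ]


def query_all(instances: Iterable[Iterable[Sample]], **matchers: str) -> List[Sample]:
    """Run the same label filter over several instances and merge the
    results, in the spirit of a unified query across data sources."""
    merged: List[Sample] = []
    for samples in instances:
        merged.extend(select(samples, **matchers))
    return merged
```

With a shared label such as a service name on both application metrics and cloud service metrics, one filter value is enough to bring up every layer's data for that service.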

Based on the scenario above, we propose a design direction for observability platforms: integrate and analyze different data at the back end from the perspective of observing systems and services, rather than deliberately emphasizing that the system supports separate queries for each of the three types of observability data. In product functions and interaction logic, shield users from the distinction between metrics, tracing, and logging as much as possible. Establish a complete observability closed loop, from anomaly detection before an incident, through troubleshooting during the incident, to proactive early warning and monitoring afterwards, providing an integrated platform for continuous business monitoring and service performance optimization.
