Hello everyone, I am Yan Xun from the Alibaba Cloud Native Application Platform team. I am very glad to talk with you in the Kubernetes Monitoring open class series. This series aims to bring you new solutions for quickly discovering and locating problems in Kubernetes container environments.
Why do we need Kubernetes monitoring?
Many of you are familiar with application performance monitoring (APM). This kind of monitoring focuses mainly on business logic, application frameworks, and the language runtime, watching for things such as exhausted thread pools, failed database (for example MySQL) connections, memory overflows, and exception stacks in call chains. With the evolution toward cloud native brought by Kubernetes and container technology, developing and operating upper-layer applications has become simpler, but total complexity is conserved: a decrease in upper-layer complexity is inevitably accompanied by an increase in lower-layer complexity. As shown in the figure below, complexity is shifting down to the container virtualization layer and to the system call and kernel layers that support the various virtualization technologies. Problems can occur at each of these layers, and they affect the applications above them. For example, if a Kubernetes component at the container virtualization layer is abnormal, say the scheduler, Pods cannot be scheduled and the application is affected. A file-system-related system call exception can leave an application unable to read files. A kernel exception can prevent an application process from ever being scheduled to run.
For an application to run in a healthy and stable way, the whole software stack needs to be healthy and stable end to end. Although many operations teams have built monitoring systems, none of them observes the behavior of every software layer together, top to bottom and end to end, so when a hard problem occurs it is difficult to know where to start. For example, a network request times out at the application layer and neither the client nor the server seems to have a problem, but in fact the RTT at the network layer is too high, the retransmission rate is too high, DNS resolution is slow, or the CNI plug-in is slow. Achieving this kind of end-to-end observability in the Kubernetes container environment is exactly the point of Kubernetes monitoring.
Kubernetes monitoring sits below application monitoring, on top of the Kubernetes container interface and the underlying operating system. At the container virtualization layer we obtain observation data from the following data sources: exporters collect data from the Kubernetes management components; cAdvisor provides container resource data; and kube-state-metrics provides Kubernetes resource state data, together with events and the status and conditions of Kubernetes resources. At the system call layer, we obtain observation data through Linux tracing technologies such as kprobes and tracepoints. At the kernel layer, we obtain observation data through kernel observability modules. Kubernetes monitoring then links processes, containers, Kubernetes resources, and business applications upward to application performance monitoring, creating end-to-end observability. Kubernetes monitoring is therefore an integrated solution for end-to-end observability of the Kubernetes cluster software stack: in it, you can see the observation data of all associated layers at the same time. With a set of best practices for Kubernetes monitoring, we hope to help you solve hard observability problems in the Kubernetes environment.
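To make two of these data sources more concrete, here is a minimal sketch that pulls Prometheus-format samples from cAdvisor (through the API server's node proxy) and from kube-state-metrics. It is only an illustration of where such data lives, not part of the product described here; the node name, `kubectl proxy`/port-forward setup, and ports are assumptions about one particular cluster.

```python
import requests

# Assumptions: `kubectl proxy` is running locally on port 8001, and
# kube-state-metrics has been port-forwarded beforehand, e.g.
#   kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080
NODE = "node-1"  # hypothetical node name
CADVISOR_URL = f"http://127.0.0.1:8001/api/v1/nodes/{NODE}/proxy/metrics/cadvisor"
KSM_URL = "http://127.0.0.1:8080/metrics"

def scrape(url, prefix):
    """Fetch a Prometheus-format metrics page and keep samples with a given prefix."""
    text = requests.get(url, timeout=10).text
    return [line for line in text.splitlines() if line.startswith(prefix)]

# Container resource data from cAdvisor (container virtualization layer).
for line in scrape(CADVISOR_URL, "container_cpu_usage_seconds_total")[:5]:
    print(line)

# Kubernetes resource status data from kube-state-metrics.
for line in scrape(KSM_URL, "kube_pod_status_phase")[:5]:
    print(line)
```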
These best practices fall into two categories. The first is problem discovery, which covers five kinds of problems: application architecture problems, performance problems, resource problems, scheduling problems, and network problems. The second is fault locating, which is about finding the root causes of those five kinds of problems and providing troubleshooting suggestions.
Explore the application architecture and discover unexpected traffic
The topic of the first lecture in the Kubernetes Monitoring series is "How to Use Kubernetes Monitoring to Explore the Application Architecture and Discover Unexpected Traffic". It covers the following three topics:
- Background: Challenges of application architecture exploration;
- Typical scenarios: In which scenarios we need to explore the application architecture;
- Best practices: a model for application architecture exploration that helps identify and locate problems efficiently.
I. Challenges of application architecture exploration
(1) Chaotic microservice architecture
In a Kubernetes containerized environment, the microservices architecture is the most common architectural pattern. Under this architecture, as the business grows there are more and more microservices, and the relationships between them become more and more complex. As complexity increases, some common architectural questions become hard to answer: how is the application currently running, are the application's downstream dependency services working properly, is the application's upstream client traffic normal, is the application's DNS resolution working, and is the connectivity between two applications impaired? As a result, exploring the application architecture is often very difficult.
(2) Multiple languages
In a microservices architecture, individual microservices can be written in different programming languages, as long as they expose standard services. So how is each language monitored? Do they share the same instrumentation patterns, and are there effective and efficient instrumentation tools for each language? How intrusive is the instrumentation code, and does it affect business performance? These are the observability challenges of the multi-language scenario.
(3) Multiple communication protocols
In a microservices architecture, microservices can talk to each other over different communication protocols, such as HTTP, gRPC, Kafka, and Dubbo. We often need to identify these protocols in order to quickly discover problems in the corresponding dependent services. However, identifying protocols means understanding each protocol and instrumenting it in the right places. How to unify the instrumentation for different communication protocols, and whether that instrumentation affects service performance, are the observability problems of the multi-protocol scenario.
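To give a feel for what protocol identification involves, here is a toy sketch that classifies the first bytes of a captured payload. It only illustrates the idea: the byte signatures are simplified assumptions, and the product described here parses protocols at the kernel layer rather than in user-space Python.

```python
# Toy protocol classifier over the first bytes of a TCP payload.
# Simplified signatures; real implementations need full protocol parsers.

HTTP_METHODS = (b"GET ", b"POST ", b"PUT ", b"DELETE ", b"HEAD ", b"OPTIONS ", b"PATCH ")
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"  # gRPC runs over HTTP/2
DUBBO_MAGIC = b"\xda\xbb"  # two-byte magic of the Dubbo wire protocol

def detect_protocol(payload: bytes) -> str:
    if payload.startswith(HTTP2_PREFACE):
        return "HTTP/2 (possibly gRPC)"
    if payload.startswith(HTTP_METHODS) or payload.startswith(b"HTTP/"):
        return "HTTP/1.x"
    if payload.startswith(DUBBO_MAGIC):
        return "Dubbo"
    # Protocols such as Kafka have no printable magic and need deeper inspection.
    return "unknown"

print(detect_protocol(b"GET /health HTTP/1.1\r\nHost: demo\r\n\r\n"))  # HTTP/1.x
print(detect_protocol(b"\xda\xbb\xc2\x00"))                            # Dubbo
```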
II. Typical scenarios
(1) Architecture perception
Architecture perception draws a topology from real network calls, with microservices as nodes and the calls between microservices as edges. By comparing it with the statically designed, expected architecture, we can find problems such as redundant or missing microservices and incorrect relationships between microservices. It is usually used in scenarios where the big picture of the architecture matters, such as launching a new application, opening a service in a new region, or reviewing the overall call links.
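As a rough illustration of how such a topology can be derived from observed calls, the sketch below aggregates hypothetical (caller, callee, protocol) records into nodes and edges and flags edges that are not in the expected design. The sample data and data model are assumptions made for illustration, not the product's actual implementation.

```python
from collections import defaultdict

# Hypothetical call records observed from real network traffic:
# (caller workload, callee workload, protocol)
calls = [
    ("frontend", "cart-service", "HTTP"),
    ("frontend", "product-service", "HTTP"),
    ("cart-service", "redis", "Redis"),
    ("order-service", "mysql", "MySQL"),  # not in the designed architecture
]

nodes = set()
edges = defaultdict(int)  # (caller, callee, protocol) -> call count
for caller, callee, proto in calls:
    nodes.update([caller, callee])
    edges[(caller, callee, proto)] += 1

# Compare against the statically designed architecture to spot surprises.
expected_edges = {("frontend", "cart-service"), ("frontend", "product-service"),
                  ("cart-service", "redis")}
for (caller, callee, proto), count in edges.items():
    tag = "" if (caller, callee) in expected_edges else "  <-- unexpected traffic"
    print(f"{caller} -> {callee} [{proto}] x{count}{tag}")
```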
(2) Architecture anomaly discovery
Architecture anomaly discovery colors abnormal nodes and edges in the architecture topology according to anomaly rules, so they can be spotted quickly. It is usually used in scenarios where the status of nodes and edges matters, such as call-link review and health inspection.
(3) Association analysis
After a node or edge anomaly has been found through anomaly discovery, we usually need to pivot along its associations, quickly checking the upstream and downstream of the node or edge and the corresponding service instances, narrowing the scope of the problem step by step.
III. Best practices
The three typical scenarios above form a complete practice flow: use architecture perception to check whether the application's actual running architecture matches expectations. If there are structural problems, investigate the services whose structure is abnormal; if not, move on to anomaly discovery. If there are no abnormal nodes or edges, so much the better; otherwise, proceed to the next step. Once specific nodes and edges have been located, start association analysis: check whether their own instances have problems, and then whether their upstream and downstream have problems.
How does Kubernetes monitoring support these best practices? The first capability is architecture perception of the cluster topology. Kubernetes monitoring maps the application architecture topology from real network requests and currently provides two views: Service calls between Services, and Workload calls between Deployments, DaemonSets, and StatefulSets.
In the topology view, nodes are grouped by namespace and by service type by default. After expanding a group you can see the corresponding nodes. Clicking a node shows the aggregate and time-series values of its performance metrics over the selected time range, broken down by network protocol; clicking an edge shows the same for that edge. Combined with node filters, for example to look at the architecture relationship between two specific namespaces, and node search to jump to a specific node quickly, this is a very effective way to explore the architecture.
Kubernetes monitoring also provides anomaly discovery. It draws abnormal nodes and edges in yellow or red based on three dimensions of anomaly rules. The first dimension is performance metric anomalies, such as an error rate greater than 10% or an average response time greater than 500 milliseconds. The second is resource metric anomalies, such as CPU usage greater than 70% or memory usage greater than 70%. The third is Kubernetes control state anomalies, for example a Pod that never reaches the Ready state. When a group is collapsed, the proportion of abnormal nodes in the group is displayed. With this capability, we can quickly find anomalies in specific microservices or in the relationships between microservices.
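The sketch below shows how these three dimensions of rules could be evaluated for a single node. The thresholds follow the text above; the node data structure and its values are made-up assumptions for illustration.

```python
# Evaluate the three dimensions of anomaly rules for a single topology node.
# Thresholds follow the text above; the node data itself is hypothetical.
node = {
    "name": "checkout-service",
    "error_rate": 0.12,   # 12% of requests failed
    "avg_rt_ms": 650,     # average response time in milliseconds
    "cpu_usage": 0.45,    # 45% CPU utilization
    "mem_usage": 0.82,    # 82% memory utilization
    "pod_ready": False,   # Pod never reached the Ready condition
}

def node_anomalies(n):
    findings = []
    # 1. Performance metric anomalies
    if n["error_rate"] > 0.10:
        findings.append("error rate > 10%")
    if n["avg_rt_ms"] > 500:
        findings.append("avg response time > 500 ms")
    # 2. Resource metric anomalies
    if n["cpu_usage"] > 0.70:
        findings.append("CPU usage > 70%")
    if n["mem_usage"] > 0.70:
        findings.append("memory usage > 70%")
    # 3. Kubernetes control state anomalies
    if not n["pod_ready"]:
        findings.append("Pod not Ready")
    return findings

findings = node_anomalies(node)
color = "red" if findings else "green"
print(node["name"], color, findings)
```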
Kubernetes monitoring also has association analysis capabilities. It supports viewing the upstream and downstream of a specific node and provides a 3D view that shows, at the same time, the upstream and downstream relationships associated with the node and the state of the node's own instances, so all the associated data can be explored in one graph, which greatly improves the efficiency of problem location.
IV. The product value of Kubernetes monitoring
Alibaba Cloud Kubernetes monitoring is a one-stop observability product developed for Kubernetes clusters. It correlates metrics, traces, logs, and events around Kubernetes entities. It has six main features:
- Non-intrusive: Alibaba Cloud Kubernetes monitoring uses bypass technology to obtain rich network performance data without any code instrumentation.
- Language-independent: Alibaba Cloud Kubernetes monitoring parses network protocols at the kernel layer and supports any language and any framework.
- High performance: based on eBPF technology, Alibaba Cloud Kubernetes monitoring obtains rich network performance data at very low overhead.
- Resource association: Alibaba Cloud Kubernetes monitoring associates related resources through the network topology and the resource topology.
- Data diversity: Alibaba Cloud Kubernetes monitoring supports all types of observability data (metrics, traces, logs, and events), covering the end-to-end software stack.
- Integration: through scenario-based designs in the console, Alibaba Cloud Kubernetes monitoring links the architecture-aware topology with application monitoring, Prometheus monitoring, cloud dial testing, health inspection, the event center, log service, and other cloud services.
So what are the similarities and differences between Kubernetes monitoring, application performance monitoring, and Prometheus monitoring? The figure below shows the relationships and differences among the three. Application performance monitoring focuses on application logic, frameworks, and programming languages; Kubernetes monitoring focuses on the system, the network, and the container interface, and correlates upward with application performance monitoring. Prometheus monitoring is the infrastructure in which the metric data of both Kubernetes monitoring and application performance monitoring is stored.
So, to quickly discover and locate problems with Kubernetes monitoring, start trying it out now! Kubernetes monitoring is currently available for a free trial; click the link (www.aliyun.com/activity/mi)… to get started. You are welcome to join the Q&A group, and I will see you next time.