Introduction: As Kubernetes adoption deepens, we frequently run into problems such as load balancing, cluster scheduling, and horizontal scaling. In the final analysis, these problems expose an uneven distribution of traffic. So how can we discover resource usage and solve the problem of uneven traffic distribution? Today, we'll talk about this problem and how to solve it through three specific scenarios.
Author | Yan Xun
Hello everyone, I am Yan Xun from the Alibaba Cloud cloud-native application platform, and I am very happy to continue the Kubernetes monitoring series of open classes with you. In the first two sessions, Vol. 1, "Exploring Application Architectures and Finding Unexpected Traffic through Kubernetes Monitoring", and Vol. 2, "How to Find Service and Workload Anomalies in Kubernetes", we covered how to use the Kubernetes monitoring topology to explore application architecture and how to configure alarms on the monitoring data the product collects in order to discover service performance problems. Today we deliver the third talk, "Using Kubernetes Monitoring to Find Resource Usage and Uneven Traffic Distribution Problems". You can search for group number 31588365 on DingTalk to join the Kubernetes monitoring DingTalk group for discussion.
As Kubernetes adoption continues, we encounter more and more problems, such as load balancing, cluster scheduling, and horizontal scaling. In the final analysis, these problems expose an uneven distribution of traffic. So how can we discover resource usage and solve the problem of uneven traffic distribution? Let's work through three specific scenarios.
System architecture challenge 1: Load balancing
Generally, a business system's architecture has many layers, and each layer contains many components, such as service gateways, middleware, and storage. We want the load on every component to be balanced, because that is when performance and stability are highest. But in scenarios with multiple languages and multiple communication protocols, it is hard to quickly answer questions such as:
- Does the application server process requests evenly?
- Is the traffic from application servers to middleware service instances even?
- Is the read and write traffic of each sharded database and table instance even?
- …
The typical scenario we encounter in actual work is server-side load imbalance: the online traffic forwarding strategy, or the traffic forwarding component itself, has a problem, so the instances of an application service receive unbalanced numbers of requests and some instances handle significantly more traffic than the other nodes. The performance of those instances degrades noticeably relative to the others, and requests routed to them cannot be answered in time, which reduces the overall performance and stability of the system.
In the client-side imbalance scenario, most users access cloud service instances. Here, the traffic handled by each instance of the application service is even, but the traffic that each node sends to the cloud service instance is uneven, which degrades the overall performance and stability of the cloud service instance. We typically enter this scenario while tracing an application's runtime links or analyzing the upstream and downstream of a specific problem node.
So, how do we find and solve these problems quickly? We can approach the problem from two angles, server-side load and client-side request load, and judge whether the load on each component instance and the external request load it generates are balanced.
(1) Server load
To detect server-side load-balancing problems, we need service details so that we can run a targeted investigation for any specific Service, Deployment, DaemonSet, or StatefulSet. In the service details view of Kubernetes monitoring, the Pod list section lists all back-end Pods; for each Pod the table shows the aggregate request count over the selected time period together with a request-count time series. By sorting on the request-count column, we can clearly see whether the traffic at the back end is even.
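As a starting point outside the console, we can also enumerate the Pods behind a Service programmatically and then cross-check per-Pod request counts from a metrics backend. Below is a minimal sketch using the official `kubernetes` Python client; the namespace and Service name are hypothetical placeholders.

```python
# A minimal sketch (not the Kubernetes monitoring product itself): list the
# Pods behind a Service so per-Pod request counts from a metrics backend can
# be cross-checked for balance.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

def pods_behind_service(namespace: str, service_name: str) -> list[str]:
    """Return the names of Pods matched by the Service's label selector."""
    svc = v1.read_namespaced_service(service_name, namespace)
    selector = ",".join(f"{k}={v}" for k, v in (svc.spec.selector or {}).items())
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return [p.metadata.name for p in pods.items]

print(pods_behind_service("default", "my-service"))  # hypothetical names
```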
(2) Client load
For client-side load-balancing detection, Kubernetes monitoring provides a cluster topology view. For any specific Service, Deployment, DaemonSet, or StatefulSet, we can view its associated topology. After selecting an association relationship, click the table to list all network topologies associated with the entity in question. Each row of the table is one topological relation of an external request made by the application service node, showing the aggregate request count and time series for that pair over the selected period. You can clearly see whether a particular node, acting as a client, is sending even traffic to a particular server.
System architecture challenge 2: Cluster scheduling
In a Kubernetes cluster, the process of placing Pods onto nodes is called scheduling. For each Pod, scheduling consists of two steps: filtering candidate nodes, then picking the best node. Besides filtering nodes by the taint and toleration relationship between Pod and Node, a very important filter is the amount of reservable resources: for example, if a node has only 1 CPU core left to reserve, it is filtered out for a Pod that requests 2 cores. Besides picking the best node by the affinity between Pod and Node, the scheduler generally chooses the most idle node among those that pass filtering.
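To make the filtering step concrete, here is a toy sketch of the resource-request check described above. It is not the real kube-scheduler logic; quantities are assumed to be in millicores.

```python
# Toy version of the resource-request filter: a node is filtered out when the
# Pod's CPU request exceeds the node's remaining reservable CPU (millicores).
def fits_node(pod_request_m: int, allocatable_m: int, already_requested_m: int) -> bool:
    """True if the node can still reserve the Pod's CPU request."""
    return allocatable_m - already_requested_m >= pod_request_m

# The example from the text: a node with only 1 core (1000m) left cannot
# host a Pod requesting 2 cores (2000m).
print(fits_node(2000, allocatable_m=4000, already_requested_m=3000))  # False
```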
Based on the above theory, we often encounter some problems in practice:
- Why can’t Pod be scheduled when cluster resource usage is low?
- Why is the resource usage of some nodes significantly higher than that of others?
- Why are Pods unschedulable on only some nodes?
- …
The typical scenario we encounter in actual work is the resource hotspot problem: Pod scheduling failures occur repeatedly on specific nodes, and Pods cannot be scheduled even though the resource utilization of the whole cluster is very low. As shown in the figure, Node1 and Node2 are fully packed with Pods while Node3 has no Pods scheduled on it. This problem affects both cross-region disaster recovery (high availability) and overall performance. We usually enter this scenario when Pod scheduling fails.
So how do we deal with it?
There are three points we should generally focus on when troubleshooting Pods that cannot be scheduled:
- The node's upper limit on the number of schedulable Pods
- The node's upper limit on schedulable CPU requests
- The node's upper limit on schedulable memory requests
The cluster node list provided by Kubernetes monitoring shows all three of these. You can sort the node list to check whether resources are evenly used and spot hotspots. For example, if a node's CPU request rate is close to 100%, no Pod that requests CPU can be scheduled onto that node; if only a few nodes have CPU request rates close to 100% while all other nodes are idle, we need to check those nodes' resource capacity and Pod distribution to troubleshoot further.
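The same per-node CPU request rate can be approximated outside the console. The sketch below sums Pod CPU requests per node and divides by each node's allocatable CPU; it assumes the official `kubernetes` Python client, whose `kubernetes.utils.parse_quantity` converts strings like "500m" into numbers.

```python
# A minimal sketch: approximate each node's CPU request rate, the metric the
# node list is sorted by when hunting scheduling hotspots.
from collections import defaultdict
from kubernetes import client, config
from kubernetes.utils import parse_quantity  # parses "500m", "2", "1Gi", ...

config.load_kube_config()
v1 = client.CoreV1Api()

requested = defaultdict(int)  # node name -> sum of Pod CPU requests (cores)
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name is None:  # skip Pods that are not scheduled yet
        continue
    for c in pod.spec.containers:
        cpu = (c.resources.requests or {}).get("cpu", "0")
        requested[pod.spec.node_name] += parse_quantity(cpu)

for node in v1.list_node().items:
    allocatable = parse_quantity(node.status.allocatable["cpu"])
    rate = float(requested[node.metadata.name] / allocatable) * 100
    print(f"{node.metadata.name}: CPU request rate {rate:.0f}%")
```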
Besides node-level resource hotspots, containers can also have resource hotspots. As shown in the figure, for a service with multiple replicas, the distribution of resource usage across containers may itself be a problem, mainly in CPU and memory usage. In a container environment, CPU is a compressible resource: after it reaches its limit it is only throttled, which does not affect the container's life cycle. Memory is an incompressible resource: when its limit is reached, an OOM kill occurs. Even though each replica processes the same number of requests, CPU and memory consumption may differ because the parameters of the requests differ; this creates resource hotspots on some containers, which affects their life cycle and automatic scaling.
Based on this analysis, we need to pay attention to the following points (illustrated by the sketch after this list):
- CPU is a compressible resource
- Memory is an incompressible resource
- Requests are used for scheduling
- Limits are used for runtime resource limiting and isolation
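The following sketch shows where requests and limits live on a container spec, using the Python client's typed objects; the container name and image are illustrative only.

```python
# A minimal sketch of requests vs. limits on a container (illustrative values).
from kubernetes import client

container = client.V1Container(
    name="app",
    image="nginx:1.25",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "256Mi"},  # what the scheduler reserves
        limits={"cpu": "1", "memory": "512Mi"},  # CPU throttled, memory OOM-killed
    ),
)
```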
Kubernetes monitoring shows all four of these points in the Pod list of the service details view, and supports sorting, so resource hotspots can be found by checking whether Pods are even. For example, if a Pod's CPU usage/request ratio is close to 100%, automatic scaling may be triggered; if only a few Pods have CPU usage/request ratios close to 100% while all others are idle, you need to examine the processing logic to troubleshoot further.
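That CPU usage/request ratio can also be computed directly, provided metrics-server is installed so the metrics.k8s.io API is available. A minimal sketch, under those assumptions:

```python
# A minimal sketch: per-Pod CPU usage/request ratio, the signal sorted on when
# hunting container hotspots. Requires metrics-server (metrics.k8s.io API).
from kubernetes import client, config
from kubernetes.utils import parse_quantity

config.load_kube_config()
v1 = client.CoreV1Api()
metrics = client.CustomObjectsApi()

# CPU requests per (namespace, pod)
requests = {}
for pod in v1.list_pod_for_all_namespaces().items:
    total = sum(parse_quantity((c.resources.requests or {}).get("cpu", "0"))
                for c in pod.spec.containers)
    requests[(pod.metadata.namespace, pod.metadata.name)] = total

# Live usage from the metrics.k8s.io API
pod_metrics = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
for m in pod_metrics["items"]:
    key = (m["metadata"]["namespace"], m["metadata"]["name"])
    usage = sum(parse_quantity(c["usage"]["cpu"]) for c in m["containers"])
    if requests.get(key):
        print(f"{key[1]}: CPU usage/request {float(usage / requests[key]) * 100:.0f}%")
```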
System architecture challenge 3: Single point of failure
The single-point problem is, in essence, a high availability problem. There is only one way to achieve high availability, and that is redundancy: multiple nodes, multiple regions, multiple zones, multiple data centers. The more spread out and the more redundant, the better. In addition, as traffic grows and the pressure on components increases, whether the system's components can scale horizontally also becomes an important question.
With a single point, an application service runs on at most one node. When that node becomes unavailable because of a network interruption or some other problem that a restart cannot fix, the system crashes. At the same time, because there is only one node, once traffic grows beyond what a single node can handle, overall system performance degrades badly. Single points therefore hurt both performance and high availability. For this problem, Kubernetes monitoring checks the replica counts of Services, DaemonSets, StatefulSets, and Deployments so that single points can be located quickly.
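A quick way to surface candidates is to flag workloads whose desired replica count is below two. A minimal sketch, again assuming the `kubernetes` Python client:

```python
# A minimal sketch: flag Deployments and StatefulSets with fewer than two
# desired replicas, i.e. potential single points of failure.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

workloads = (apps.list_deployment_for_all_namespaces().items
             + apps.list_stateful_set_for_all_namespaces().items)
for w in workloads:
    if (w.spec.replicas or 0) < 2:
        print(f"single point: {w.metadata.namespace}/{w.metadata.name} "
              f"(replicas={w.spec.replicas})")
```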
As the discussion above shows, Kubernetes monitoring supports troubleshooting load-balancing problems in multi-language, multi-protocol scenarios from both the server and the client perspective; it supports troubleshooting resource hotspots on containers, nodes, and services; and it supports troubleshooting single points through replica-count checks and traffic analysis. In subsequent iterations, we will turn these checkpoints into scenario switches that automatically check and raise alarms once enabled.
This article is original content from Alibaba Cloud and may not be reproduced without permission.