Introduction
In the last article we shared some best practices on how to use resources wisely; this time we will explore how to improve service availability.
How can we improve the availability of the services we deploy? Kubernetes is designed with various failure modes in mind and provides self-healing mechanisms to improve the system's fault tolerance, but some situations can still cause long periods of unavailability and drive down service availability metrics. This article provides some best practices for maximizing service availability, drawn from production experience.
How to avoid single points of failure?
Kubernetes is designed on the assumption that nodes are unreliable: the more nodes there are, the greater the chance that a hardware or software failure makes some node unavailable. We therefore usually deploy multiple replicas of a service and adjust the replicas value according to the actual situation. If the value is 1, there is necessarily a single point of failure; if it is greater than 1 but all replicas end up scheduled on the same node, there is still a single point of failure. Sometimes disaster scenarios also need to be considered, such as an entire machine room going down.
Therefore, we not only need a reasonable number of replicas, but also need to spread those replicas across different topology domains (nodes, availability zones) to avoid single points of failure. This can be achieved with Pod anti-affinity, which comes in two main flavors: strong (required) anti-affinity and weak (preferred) anti-affinity. For more information, please refer to the official documentation on Affinity and anti-affinity.
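For illustration, here is a minimal sketch of a Deployment running more than one replica (the name and image are illustrative, not from the original article); the anti-affinity block shown in the next example would be added under spec.template.spec.affinity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3              # more than 1 replica to avoid a single point of failure
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # the podAntiAffinity block from the next example would go here under "affinity"
      containers:
      - name: nginx
        image: nginx:1.21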
Let's look at a strong anti-affinity example that forces DNS service replicas to be spread across different nodes:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname
- labelSelector.matchExpressions: fill in the label key and values of the Pods belonging to the service, because Pod anti-affinity works by checking the labels of the other replicas' Pods.
- topologyKey: specifies the topology domain for anti-affinity, i.e. the key of a node label. Here kubernetes.io/hostname is used to avoid scheduling Pods onto the same node. If you have higher requirements, such as avoiding scheduling into the same availability zone to achieve multi-zone high availability, you can use failure-domain.beta.kubernetes.io/zone instead (see the sketch after this list). Scheduling into the same region is usually not avoided, because the nodes of a cluster are generally in the same region, and crossing regions incurs large latency even over dedicated lines, so topologyKey is normally not failure-domain.beta.kubernetes.io/region.
- requiredDuringSchedulingIgnoredDuringExecution: the anti-affinity condition must be met during scheduling. If no node satisfies it, the Pod is not scheduled and stays Pending.
- If the condition is not that hard, you can use preferredDuringSchedulingIgnoredDuringExecution instead, which tells the scheduler to try its best to satisfy the condition, i.e. weak anti-affinity. If the condition cannot be met, the Pod can still be scheduled onto any node with enough resources, so at least it will not stay Pending.
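As a hedged sketch of the zone-level spreading mentioned above (reusing the kube-dns labels from the example; failure-domain.beta.kubernetes.io/zone is the legacy label key referenced in this article, while newer clusters use topology.kubernetes.io/zone):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      # spread replicas across availability zones instead of nodes
      topologyKey: failure-domain.beta.kubernetes.io/zone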
Let's look at an example of weak anti-affinity:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname
Notice that it is slightly different from strong anti-affinity: there is a weight expressing the priority of the matching condition, and the matching condition itself is moved under podAffinityTerm.
How can I avoid service unavailability during node maintenance or upgrade?
Before maintaining or upgrading a node, we usually drain it first with kubectl drain, which evicts the Pods on the node so that they are recreated on other nodes. When the drain is complete, all Pods have left the node and we can operate on it with peace of mind.
One problem is that draining a node is a disruptive operation. Eviction works roughly as follows:
- Cordon the node (mark it unschedulable so that no new Pods are scheduled onto it).
- Delete the Pods on the node.
- When the ReplicaSet controller detects that the number of Pods has decreased, it creates a new Pod, which is scheduled onto another node.
This process deletes Pods first and recreates them afterwards; it is not a rolling update. So if all replicas of a service are on the node being drained, the service may become unavailable during the operation.
Eviction can make a service unavailable in a few more situations:
- The service has a single point of failure: all of its replicas are on the same node. If that node is drained, the service may become unavailable.
- The service has no single point of failure, but it happens that all of its Pods are deployed on the nodes being drained, so all of them are deleted at the same time and the service becomes unavailable.
- The service has no single point of failure and is not entirely deployed on the drained nodes, but the eviction deletes some of its Pods, temporarily reducing its processing capacity. The service becomes overloaded, some requests cannot be handled, and availability drops.
For the first situation, we can use anti-affinity to avoid the single point of failure.
For the second and third situations, we can configure a PodDisruptionBudget (PDB) to prevent all replicas from being deleted at the same time. During eviction, Kubernetes watches the number of currently available and expected replicas of the workload and throttles Pod deletions against the defined PDB: when the threshold is reached, it waits for Pods to start up and become ready on other nodes before deleting more, avoiding deleting too many Pods at once and making the service unavailable or less available. Two examples are given below.
Example 1 (ensure that at least 90% of the zookeeper replicas are available during eviction):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 90%
  selector:
    matchLabels:
      app: zookeeper
Example 2 (ensure that at most one zookeeper replica is unavailable during eviction, i.e. delete replicas one by one, waiting each time for the replacement to be rebuilt on another node):
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper
How do I get the service to update smoothly?
After addressing single points of failure and the availability degradation caused by node eviction, we also need to consider another scenario that can reduce availability: rolling updates. Why would a normal rolling update of a service affect its availability? Don't worry, let me explain.
Suppose there are inter-service calls within the cluster, with client Pods calling a server-side Service. When a rolling update happens on the server side, two awkward situations can occur:
- The old replica is destroyed quickly, but kube-proxy on the client's node has not yet updated its forwarding rules and still schedules new connections to the old replica. The client then sees connection errors such as "Connection refused" (the process has stopped and no longer accepts new requests) or "No route to host" (the container has been completely destroyed, so its network interface and IP no longer exist).
- The new replica starts, and kube-proxy on the client's node soon watches it and updates the forwarding rules, scheduling new connections to it. However, the process inside the container starts slowly (a Tomcat-style Java process, for example) and is not yet listening on the port, so it cannot handle the connections and the client gets "Connection refused".
For the first situation, you can add a preStop hook to the container so that the Pod sleeps for a while before the container is destroyed. This gives kube-proxy on the client nodes time to update their forwarding rules before the container actually goes away. The Pod keeps running normally for a while after entering Terminating, so if new requests are still forwarded to it because of the delayed rule update, it can handle them and connection errors are avoided. It may sound a little inelegant, but in practice it works well. There is no silver bullet in the distributed world; we can only look for and practice the best available solution within the current design.
For the second situation, you can add a ReadinessProbe to the container so that the Service's Endpoints are only updated after the process inside the container has actually started, at which point kube-proxy on the client nodes updates the forwarding rules and lets traffic in. This ensures that traffic is only forwarded once the Pod is fully ready, avoiding connection errors.
A YAML example combining both best practices:
readinessProbe:
  httpGet:
    path: /healthz
    port: 80
    httpHeaders:
    - name: X-Custom-Header
      value: Awesome
  initialDelaySeconds: 10
  timeoutSeconds: 1
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "sleep 10"]
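For context, here is a hedged sketch of where these fields sit within a workload's container spec (the name and image are illustrative). Note that the Pod's terminationGracePeriodSeconds, 30 seconds by default, must be longer than the preStop sleep, otherwise the container is killed before the sleep finishes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 30   # default; must exceed the preStop sleep
      containers:
      - name: web
        image: nginx:1.21
        ports:
        - containerPort: 80
        readinessProbe:                    # hold traffic until the process is listening
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 10
          timeoutSeconds: 1
        lifecycle:
          preStop:                         # give client-side kube-proxy time to update rules
            exec:
              command: ["/bin/bash", "-c", "sleep 10"]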
Please refer to Specifying a Disruption Budget for Your Application for more information.
What's the best way to configure health checks?
As we all know, configuring Pod health checks is another way to improve service availability. A ReadinessProbe prevents traffic from being forwarded to Pods that have not fully started or are in an abnormal state, and a LivenessProbe restarts applications that have deadlocked or hung because of bugs. However, badly configured probes can cause problems of their own. Here are some suggestions based on hard-won experience:
- Do not use a LivenessProbe unless you understand the consequences and why you need one. See LivenessProbes are Dangerous.
- If you do use a LivenessProbe, do not configure it identically to the ReadinessProbe; give it more headroom (a higher failureThreshold), as in the sketch after this list.
- Do not have external dependencies (DB, other pods, etc.) in the probe logic to avoid cascading failures caused by jitter
- Business applications should expose an HTTP endpoint for health checks where possible and avoid TCP probes, because a TCP probe can still succeed while the application is hung.
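A hedged sketch of these suggestions, assuming the application exposes a dependency-free HTTP /healthz endpoint (the path, port, and thresholds are illustrative, not values from this article):

readinessProbe:
  httpGet:
    path: /healthz          # application-local check, no DB or downstream calls
    port: 80
  periodSeconds: 5
  failureThreshold: 3       # stop routing traffic after a few failures
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  periodSeconds: 10
  failureThreshold: 10      # restart only after sustained failure; more headroom than readiness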