In Kubernetes, the task of scheduling Pods to specific nodes in the cluster is handled by kube-scheduler. Its default behavior is to filter nodes based on the resource requests and limits of each container in the Pod you create. Feasible nodes are then scored to find the best placement for the Pod.
Scheduling Pods based on resource constraints is a desirable behavior in many scenarios. However, in some use cases, especially advanced scheduling scenarios, Kubernetes administrators may want to schedule Pods to specific nodes based on other constraints.
In this article, I’ll review some scenarios for advanced Pod scheduling in Kubernetes, along with best practices for implementing them in real-world situations. This is particularly helpful for application engineers and K8s administrators who want to implement advanced application deployment patterns involving data locality, Pod co-location, high availability, and efficient resource utilization in K8s clusters.
Manually scheduling Pods to nodes
In a production Kubernetes setup, it is important to customize how Pods are scheduled to nodes. Here are some of the most common scheduling scenarios:
Run Pods on nodes with dedicated hardware: Some Kubernetes applications have specific hardware requirements. For example, a Pod running ML jobs may require a high-performance GPU rather than a CPU, and Elasticsearch Pods are more efficient on SSDs than HDDs. Therefore, a best practice for resource-aware K8s cluster management is to assign Pods to nodes with the right hardware.
Pod co-location and interdependence: In microservice setups or tightly coupled application stacks, some Pods should be co-located on the same machine to improve performance and avoid network latency issues and connection failures. For example, it is good practice to run a web server on the same machine as an in-memory caching service or a database.
Data locality: The data locality requirements of data-intensive applications are similar to the previous use case. To ensure faster reads and better write throughput, these applications may need their databases deployed on the same machine as the main application.
High availability and fault tolerance: To make application deployments highly available and fault tolerant, it is good practice to deploy Pods on nodes located in different availability zones.
Advanced Pod scheduling methods
Kubernetes provides a number of API resources and strategies that help implement these scenarios. Next, I’ll introduce the concepts of nodeSelector, node affinity, and inter-Pod affinity, show some examples, and explain how to implement them in your K8s cluster.
Manually scheduling Pods using nodeSelector
In early versions of K8s, users could implement manual Pod scheduling using the nodeSelector field of the PodSpec. In essence, nodeSelector is a label-based Pod-to-node scheduling method in which users assign certain labels to nodes and make sure the nodeSelector field matches those labels.
For example, suppose one of the node labels is storage=ssd, indicating the type of storage on the node.
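For instance, such a label can be applied with kubectl; the node name node-1 below is a placeholder:

```bash
# Label a node to indicate its storage type; "node-1" is a hypothetical node name.
kubectl label nodes node-1 storage=ssd
```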
To use this label to schedule a Pod onto that node, I specify the nodeSelector field with this label in the Pod YAML file.
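A minimal sketch of such a Pod manifest could look like this (the Pod name and container image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-app          # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx         # placeholder image
  nodeSelector:
    storage: ssd         # must match the node label exactly
```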
Node selectors are the simplest method of advanced Pod scheduling. However, they are not very useful when other rules and conditions need to be considered during Pod scheduling.
Node affinity
The node affinity feature is a qualitative improvement over the manual Pod placement approach discussed above. It offers an expressive affinity language with logical operators and constraints that provides fine-grained control over Pod placement. It also supports “soft” and “hard” scheduling rules that let you control the strictness of node affinity constraints depending on your requirements.
In the example below, we use node affinity to place a Pod in a specific availability zone. Let’s take a look at an example manifest:
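The Pod name and container image below are placeholders; the labels and values match the rules described next:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: az-pinned-app    # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx         # placeholder image
  affinity:
    nodeAffinity:
      # "Hard" rule: only nodes in availability zones cp-1a or cp-1b are eligible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/cp-az-name
            operator: In
            values:
            - cp-1a
            - cp-1b
      # "Soft" rule: among eligible nodes, prefer those labeled custom-key=custom-value.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: custom-key
            operator: In
            values:
            - custom-value
```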
The “hard” affinity rule is specified under the requiredDuringSchedulingIgnoredDuringExecution field in the nodeAffinity section of the Pod manifest. In this example, I tell the scheduler to place the Pod only on nodes labeled kubernetes.io/cp-az-name with a value of cp-1a or cp-1b.
To do this, I use the In logical operator to filter an array of existing label values. Other operators I can use include NotIn, Exists, DoesNotExist, Gt, and Lt.
The “soft” rule is specified under the preferredDuringSchedulingIgnoredDuringExecution field of the spec. In this example, it indicates that among the nodes that meet the “hard” criteria, I prefer a node with a label whose key is custom-key and whose value is custom-value. However, if no such node exists, I have no objection to scheduling the Pod onto other candidates that meet the “hard” criteria.
It is good practice to construct node affinity rules by combining “hard” and “soft” rules. Following this “best effort” approach, where an option is used whenever possible but scheduling is not rejected when that option is unavailable, makes deployment scheduling more flexible and predictable.
Pod affinity
Inter-Pod affinity in Kubernetes is a feature that lets you schedule Pods based on their relationship to other Pods. This capability enables a variety of interesting scenarios, such as co-locating Pods that are part of interdependent services, or implementing data locality, where a data Pod runs on the same machine as the main service Pod.
Inter-Pod affinity is defined similarly to node affinity. In this case, however, I use the podAffinity field of the Pod specification.
Like node affinity, Pod affinity supports match expressions and logical operators. In this case, however, they are applied to the labels of Pods already running on a particular node. If the specified expression matches the labels of a target Pod, the new Pod is co-located on the same machine as the target Pod.
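As a minimal sketch, assuming target Pods carry a hypothetical app=cache label, a Pod requesting co-location could look like this (the Pod name and image are also placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app          # hypothetical Pod name
spec:
  containers:
  - name: web
    image: nginx         # placeholder image
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache      # co-locate with Pods labeled app=cache
        topologyKey: kubernetes.io/hostname   # "same machine" granularity
```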
Anti-affinity
In some cases, it is better to take a “blacklist” approach to Pod scheduling, where Pods are prevented from being scheduled onto particular nodes when certain conditions are not met. Kubernetes implements this through Pod-to-node anti-affinity (taints and tolerations) and inter-Pod anti-affinity.
The primary use case for Pod-to-node anti-affinity is dedicated nodes. To control resource utilization in a cluster, K8s administrators can reserve certain nodes for specific Pod types or applications.
Other use cases for inter-Pod anti-affinity include:
Avoiding single points of failure: This can be achieved by spreading Pods of the same service across different machines, which prevents Pods from co-existing with other Pods of the same type.
Preventing resource competition between services: To improve the performance of certain services, avoid placing them alongside other services that consume a lot of resources.
Pod-to-node anti-affinity can be achieved with taints and tolerations in Kubernetes. Let’s take a closer look at this feature.
Taints and tolerations
Taints (conditions) and tolerations help you control the scheduling of Pods to specific nodes without modifying existing Pods.
By default, any Pod that does not tolerate a node’s taint is rejected by or evicted from that node. This behavior enables flexible cluster and application deployment patterns without changing Pod definitions when you don’t want Pods to run on particular nodes.
Implementing taints and tolerations is simple. First, add a taint to a node that needs non-standard scheduling behavior.
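For example, a taint can be applied with kubectl; the node name node-1 is a placeholder:

```bash
# Taint the node so that only Pods with a matching toleration can be scheduled on it;
# "node-1" is a hypothetical node name.
kubectl taint nodes node-1 storage=ssd:NoSchedule
```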
The taint format is key=value:effect. The NoSchedule effect used here prevents any Pod without a matching toleration from being scheduled onto this node.
Other supported taint effects are NoExecute and PreferNoSchedule (a “soft” version of NoSchedule). If a PreferNoSchedule taint is applied, kube-scheduler will try not to place Pods that lack the required toleration onto the tainted node, but this is not guaranteed.
Finally, the NoExecute effect immediately evicts from the node any Pods that do not tolerate the taint. You can use this effect if Pods are already running on the node and you no longer want them there.
Creating a taint is only the first part of the configuration. To allow Pods to be scheduled onto a tainted node, we need to add a matching toleration to the Pod specification.
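A minimal sketch of such a Pod could look like this (the Pod name and container image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-tolerant-app   # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx           # placeholder image
  tolerations:
  - key: "storage"
    operator: "Equal"
    value: "ssd"
    effect: "NoSchedule"
```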
In this example, I add a toleration for the above taint using the Equal operator. Alternatively, I can use the Exists operator, which tolerates any taint with a matching key; in that case, the value does not have to be specified.
With this toleration, the Pod defined above can be scheduled onto the node carrying the storage=ssd:NoSchedule taint.
Pod anti-affinity
Pods can be made mutually exclusive through the Pod anti-affinity feature. As mentioned above, one of the best practices in Kubernetes is to avoid single points of failure by spreading Pods across different availability zones. I can configure this behavior in the podAntiAffinity section of the Pod specification. To demonstrate Pod anti-affinity, we need two Pods:
First Pod:
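A minimal sketch, with a placeholder name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: first-pod        # hypothetical Pod name
  labels:
    security: s1
spec:
  containers:
  - name: app
    image: nginx         # placeholder image
```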
Notice that the first Pod is labeled security: s1.
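The second Pod, again sketched with a placeholder name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: second-pod       # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx         # placeholder image
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s1
        topologyKey: kubernetes.io/hostname   # avoid nodes already running security=s1 Pods
```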
The second Pod references the security: s1 label selector under spec.affinity.podAntiAffinity. As a result, it will not be scheduled onto a node that already hosts any Pod with the security: s1 label.
Conclusion
Advanced Pod scheduling in Kubernetes enables many interesting scenarios and best practices for deploying complex applications and microservices on Kubernetes. With Pod affinity, you can implement Pod co-location and data locality for tightly coupled application stacks and microservices.
Below, you can find a cheat sheet summarizing key scheduling scenarios for each resource type.
With node anti-affinity and taints, you can run nodes with hardware dedicated to specific applications and services for efficient resource utilization in the cluster. With Pod anti-affinity and node anti-affinity, you can also ensure high availability of your applications and avoid single points of failure by running different components on different nodes.