In a highly available Kubernetes cluster, when a node goes down and its kubelet can no longer report in, the pods on that node are eventually rescheduled onto other nodes. How long this takes needs careful consideration, because it determines the stability and reliability of production: a faster migration reduces the impact on the business, but it can also put extra pressure on the cluster and, in the worst case, cause the cluster to collapse.

Kubelet status update process:

  • The kubelet periodically reports its status to the apiserver. The --node-status-update-frequency parameter sets the reporting frequency; by default the kubelet reports its status every 10s.
  • The kube-controller-manager checks the kubelet status every --node-monitor-period (default: 5s).
  • After a node has been unreachable for a certain period, Kubernetes marks it as NotReady. This period is configured with the --node-monitor-grace-period parameter (default: 40s).
  • After a node has been unreachable for a certain period, Kubernetes considers it unhealthy. This period is configured with the --node-startup-grace-period parameter.
  • After a node has been unreachable for a certain period, Kubernetes starts removing the pods from the original node. This period is configured with the --pod-eviction-timeout parameter (default: 5m0s).

The kube-controller-manager and the kubelet work asynchronously, which means the overall delay can also include network latency, apiserver latency, etcd latency, latency caused by load on the node, and so on. So if --node-status-update-frequency is set to 5s, it can actually take 6-7s, or even longer, for the change to show up in etcd.
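To make the split concrete, here is a minimal sketch of where these parameters are usually set, using the community defaults listed below. The flag names are those of the upstream kubelet and kube-controller-manager; depending on your Kubernetes version and how the control plane is deployed (for example static pod manifests with kubeadm), the exact place to set them may differ, and newer releases may deprecate some of them (such as --pod-eviction-timeout in favour of taint-based eviction), so verify against your version.

```bash
# kubelet on each node: how often the node status is reported to the apiserver
# (can also be set as nodeStatusUpdateFrequency in the kubelet config file).
kubelet --node-status-update-frequency=10s
# (plus whatever other flags your nodes already use)

# kube-controller-manager: how often the status is checked, when the node is
# marked NotReady, and when pod eviction starts.
kube-controller-manager \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s
# (plus whatever other flags your control plane already uses)
```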

Configuration

You need to configure these parameters based on the cluster scale.

Default community configuration

parameter                          value
--node-status-update-frequency     10s
--node-monitor-period              5s
--node-monitor-grace-period        40s
--pod-eviction-timeout             5m

Quick updates and quick responses

parameter                          value
--node-status-update-frequency     4s
--node-monitor-period              2s
--node-monitor-grace-period        20s
--pod-eviction-timeout             30s

In this case, pods will be evicted after about 50s: the node is considered down after 20s (--node-monitor-grace-period), and the 30s --pod-eviction-timeout adds the rest. However, this puts considerable load on etcd, because every node tries to update its status every 4s.

If the environment has 1000 nodes, that is 60s / 4s × 1000 = 15,000 node status update operations per minute, which may require larger etcd instances or even dedicated nodes for etcd.

If you simply divide the grace period by the update frequency (20s / 4s), you get 5 attempts, but in reality each attempt is retried nodeStatusUpdateRetry times, between 3 and 5. Due to the latency of all components, the total number of attempts will therefore vary between 15 and 25.
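As a sketch, the "fast" values from the table above could be applied roughly like this (same caveats as before about flag names and where they are set in your particular cluster):

```bash
# kubelet on every node: report status every 4s instead of the default 10s.
kubelet --node-status-update-frequency=4s

# kube-controller-manager: check more often, mark the node down after 20s,
# and evict pods 30s after that (about 50s in total, as described above).
kube-controller-manager \
  --node-monitor-period=2s \
  --node-monitor-grace-period=20s \
  --pod-eviction-timeout=30s
```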

Medium updates and average responses

parameter                          value
--node-status-update-frequency     20s
--node-monitor-period              5s
--node-monitor-grace-period        2m
--pod-eviction-timeout             1m

Before the controller manager considers the node abnormal, the kubelet will have tried to update the node status 2m / 20s × 5 = 30 times, and 1m after the node is marked down, the eviction operation is triggered.

If there are 1000 nodes, that is 60s / 20s × 1000 = 3,000 node status updates per minute.

Low updates and slow response

parameter                          value
--node-status-update-frequency     1m
--node-monitor-period              5s
--node-monitor-grace-period        5m
--pod-eviction-timeout             1m

The kubelet updates the node status once every 1m, and there will be 5m / 1m × 5 = 25 attempts before the node is deemed unhealthy. After the node becomes unhealthy, pods start to be evicted 1m later.

Other combinations are also possible, such as fast updates with slow reactions, to suit a particular situation; one such mix is sketched below.
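For instance, a "fast update, slow reaction" mix might detect a down node quickly but keep the long default eviction timeout. The values below are purely illustrative and not taken from the article:

```bash
# Illustrative mix: quick detection, conservative eviction.
kubelet --node-status-update-frequency=4s

kube-controller-manager \
  --node-monitor-period=2s \
  --node-monitor-grace-period=20s \
  --pod-eviction-timeout=5m0s   # keep the default eviction delay
```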

Special cases can also arise, for example keeping a stateful application highly available after its host goes down. For details, see www.infoq.cn/article/aMs…

Reference Documents:

  • github.com/kubernetes-…
  • github.com/kubernetes/…
  • www.qikqiak.com/post/kubele…
  • www.infoq.cn/article/aMs…