1. Background

Kubernetes has a vibrant community and a large user base, and it still ships a new release roughly every three months. Frequent releases deliver new features and bug fixes promptly, but production workloads run for long periods and any faulty change can cause large economic losses. Upgrading is therefore relatively difficult for enterprises, and keeping pace with the community is almost impossible. Container teams have to weigh this tension between frequent releases and a stable production environment and make trade-offs.

The vivo Internet team has built large-scale Kubernetes clusters, some of which had been running v1.10 for a long time. As the proportion of containerized services keeps rising, the demands on large-cluster stability and on the diversity of application release methods keep growing, and a cluster upgrade became imminent. The upgrade resolves the following problems:

  • Higher-version clusters are optimized for large-scale scenarios, so upgrading can remove a number of performance bottlenecks.

  • Only higher-version clusters support CNCF projects such as OpenKruise, so upgrading resolves version dependency problems.

  • New features in higher versions can improve cluster resource utilization, reduce server costs, and improve cluster efficiency.

  • The company maintains multiple clusters of different versions. Upgrading reduces version fragmentation and further lowers O&M costs.

This article describes, from start to finish, how the vivo Internet team upgraded clusters that support online services from v1.10 to v1.17 without affecting the running workloads. We upgraded to v1.17 rather than v1.18 because the code changes introduced in v1.18 [1] make legacy resource types such as extensions/v1beta1 unusable (the code for them was removed in v1.18).

2. Non-destructive upgrade difficulties

Container clusters are usually built by deploying the core components either as binaries managed by systemd or as static Pods. In both cases, multiple replicas of the cluster API service sit behind an external load balancer. The two deployment modes differ little when it comes to upgrading, and binary deployment is better suited to earlier clusters, so this article covers upgrades in binary deployment mode.

For a cluster deployed from binaries, upgrading the components consists of replacing binaries, updating configuration files, and restarting services. Under the SLO requirements of the production environment, service containers must not be restarted because of logic changes in cluster components during the upgrade. The difficulties are therefore as follows:

First, the clusters run a relatively old version but host a large number of containers, some of which still run as a single replica. To avoid affecting services, container restarts must be avoided as far as possible, which is undoubtedly the biggest difficulty of the upgrade. Between v1.10 and v1.17, kubelet changed the way it computes the container hash, so an upgrade would inevitably cause kubelet to restart containers.

Second, the community recommends upgrading according to the version skew policy [2], so that upgrading a highly available cluster does not cause compatibility errors between components such as kube-apiserver and kubelet because of differing API resource versions. The policy requires that components differ by no more than two minor versions during an upgrade; for example, upgrading directly from v1.11 to v1.13 is not recommended.

Third, new features introduced along the way and API compatibility changes may cause configuration from the older cluster not to take effect, which creates hidden stability risks for the whole cluster. Before upgrading, study the changelog carefully and identify new features that may cause problems.

3. Non-destructive upgrade scheme

This section addresses each of the difficulties above in turn. It also describes the bugs we hit in the higher version after the upgrade and how we resolved them. We hope the compatibility screening and troubleshooting described here give readers some ideas.

3.1 Upgrade Mode

In the software field there are two mainstream ways to upgrade applications: in-place upgrade and replacement upgrade. Both are used by major Internet companies, and the choice depends heavily on the services running on the cluster.

Replacement upgrade

1) A Kubernetes replacement upgrade first prepares a higher-version cluster, then moves the nodes of the lower-version cluster over one by one: each node is drained, deleted, and finally added to the new cluster.

2) The advantage of a replacement upgrade is stronger atomicity: nodes are upgraded one at a time and there is no intermediate state, which is safer for services. The disadvantage is that the upgrade workload is large, and draining is unfriendly to applications that are sensitive to Pod restarts, stateful applications, and single-replica applications.

In-place upgrade

1) A Kubernetes in-place upgrade updates the services on the nodes, such as kube-controller-manager and kubelet, in batches and in a defined order, managing component versions in bulk by node role.

2) The advantages of an in-place upgrade are that it is easy to automate and that, with appropriate modifications, the continuity of the container life cycle can be preserved. The disadvantages are that the order in which components are upgraded matters a great deal, the cluster passes through intermediate states, and a component that fails to restart may block the upgrade of the components after it, so atomicity is poor.

Some services running on vivo's container clusters tolerate restarts poorly, so avoiding container restarts is the top priority. Once the container restarts caused by the version upgrade are solved, the upgrade mode can be chosen based on the degree of containerization and the types of services. For clusters deployed from binaries, we recommend in-place upgrade: it is quick and simple to carry out, and single-replica services are not affected.

3.2 Cross-version Upgrade

Kubernetes is an API-based microservice architecture, and its internal components also coordinate resource state through API calls and list-watch on resource objects, so community developers follow forward and backward compatibility principles when designing APIs. These compatibility rules follow the community's policy [2]: when an API group is deprecated or enabled, Alpha versions change immediately, whereas Beta versions remain supported for three more releases. Going beyond that window makes API resource versions incompatible. For example, Kubernetes deprecated the extensions/v1beta1 version of resources such as Deployment in v1.16 and removed it from the code in v1.18. Upgrading across more than three versions means these resources are no longer recognized, and create, delete, update, and get operations on them fail.

Following the recommended policy, upgrading from v1.10 to v1.17 takes at least seven upgrades. In a production environment with complex service scenarios, the O&M complexity and service risk of doing so are high.
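The count of seven follows from stepping through one minor version at a time. A minimal sketch of that arithmetic is shown below; the helper name upgradePath is hypothetical and only the minor version number is modeled.

```go
package main

import "fmt"

// upgradePath lists every minor version that a step-by-step upgrade from
// v1.<from> to v1.<to> has to land on (the starting version is excluded).
func upgradePath(from, to int) []string {
	var path []string
	for minor := from + 1; minor <= to; minor++ {
		path = append(path, fmt.Sprintf("v1.%d", minor))
	}
	return path
}

func main() {
	path := upgradePath(10, 17)
	// prints: 7 [v1.11 v1.12 v1.13 v1.14 v1.15 v1.16 v1.17]
	fmt.Println(len(path), path)
}
```

Each entry in the list is a full upgrade round (binaries, configuration, restarts, verification), which is why the step-by-step route is costly for a production cluster.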

Such breaking API changes do not occur in every version, however. The skew policy recommended by the community is the safest upgrade strategy, but after combing through the changelogs and running extensive cross-version tests, we confirmed that no API compatibility problems affecting running services or cluster management operations exist between these versions. For API types that have been deprecated, kube-apiserver can be configured to re-enable them (for example through its --runtime-config flag) so that they remain usable and services in the environment keep running normally.

3.3 Preventing Container Restarts

During initial verification of the upgrade scheme we found that a large number of containers were rebuilt. After the upgrade, the kubelet log gave the restart reason as "Container <name> definition changed". Tracing this message to the computePodActions method in kuberuntime_manager.go under pkg/kubelet showed that the method checks whether the hash of each container spec in the pod has changed. If it has, the method flags the container for restart, and kubelet's syncPod method then rebuilds the container or the pod.

Kubelet container hash calculation:

func (m *kubeGenericRuntimeManager) computePodActions(pod *v1.Pod, podStatus *kubecontainer.PodStatus) podActions {
    ...
    restart := shouldRestartOnFailure(pod)
    if _, _, changed := containerChanged(&container, containerStatus); changed {
        message = fmt.Sprintf("Container %s definition changed", container.Name)
        // If the container spec changed, force a restart (restart flag set to true).
        restart = true
    }
    ...
    if restart {
        message = fmt.Sprintf("%s, will be restarted", message)
        // Add the container to the list of containers to be restarted.
        changes.ContainersToStart = append(changes.ContainersToStart, idx)
    }
    ...
}

func containerChanged(container *v1.Container, containerStatus *kubecontainer.ContainerStatus) (uint64, uint64, bool) {
    // Compute the hash of the container spec.
    expectedHash := kubecontainer.HashContainer(container)
    return expectedHash, containerStatus.Hash, containerStatus.Hash != expectedHash
}

Compared with v1.10, which hashes the container struct directly, v1.17 computes the container hash from the JSON-serialized container. The higher-version kubelet also adds new fields to the container struct, so the go-spew based calculation produces a different result as well. The mismatch is returned up the call chain and makes syncPod rebuild the container.
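The effect can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: containerOld and containerNew are hypothetical, heavily reduced stand-ins for the real container type, fmt's %+v formatting stands in for the go-spew dump, and a 64-bit FNV hash is used for simplicity rather than kubelet's actual hashing code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// containerOld and containerNew are hypothetical stand-ins for the container
// struct as compiled into an old and a new kubelet; the new one gained a field.
type containerOld struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type containerNew struct {
	Name         string  `json:"name"`
	Image        string  `json:"image"`
	StartupProbe *string `json:"startupProbe,omitempty"` // field added in a later version
}

// hashDump hashes a textual dump of the whole struct (fmt's %+v is a stand-in
// for go-spew), so every field, even an unset one, influences the hash.
func hashDump(v interface{}) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%+v", v)
	return h.Sum64()
}

// hashJSON hashes the JSON serialization; unset omitempty fields are left out.
func hashJSON(v interface{}) uint64 {
	data, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

func main() {
	oldC := containerOld{Name: "nginx", Image: "nginx:1.19"}
	newC := containerNew{Name: "nginx", Image: "nginx:1.19"}

	// An added field changes the struct dump even though the spec is the same.
	fmt.Println("dump(old) == dump(new):", hashDump(oldC) == hashDump(newC)) // false
	// Switching the hash input from struct dump to JSON changes the hash too.
	fmt.Println("dump(old) == json(new):", hashDump(oldC) == hashJSON(newC)) // false
	// JSON with omitempty is stable when only unset fields are added.
	fmt.Println("json(old) == json(new):", hashJSON(oldC) == hashJSON(newC)) // true
}
```

Either difference alone is enough for the hash stored by the old kubelet to disagree with the one the new kubelet computes, even though the user-visible container spec is unchanged.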

Could we patch go-spew to drop the new fields from the container struct? Yes, but not elegantly: it intrudes heavily on core code, every later upgrade needs the custom patch again, and as more fields are added the maintenance burden keeps growing. Alternatively, if kubelet skips this check during the upgrade transition for pods created by the older kubelet, container restarts are avoided.

After talking with community members, we found that a similar idea had already been implemented in the community. A local configuration file records the version and start time of the old cluster, and the kubelet code keeps a cache of that file. In each syncPod cycle, if kubelet finds that its own version is higher than the cached oldVersion and that the container was started before the current kubelet process, it skips the container hash comparison. After the upgrade, a scheduled task in the cluster checks whether each Pod's container spec matches the hash computed by the higher version. Once they all match, the local configuration file can be deleted and syncPod returns to the community logic.

The benefit of this approach is that it intrudes little on the native kubelet code, leaves the core logic unchanged, and can be reused for future upgrades to higher versions. Once every pod in the cluster has been created by the current kubelet version, the community's own logic is restored.

3.4 Unexpected Pod Eviction

Kubernetes has gone through more than a dozen releases, yet the community remains very active and each release still carries about 30 new features for scalability and stability. A major reason for upgrading is to bring in these community features to enrich cluster capabilities and improve stability. New features are also developed under the skew policy, so upgrading across many releases may enable a feature without the configuration it expects, which puts cluster stability at risk. Features that affect the Pod life cycle therefore need to be reviewed, especially controller-related ones.

One to note is TaintBasedEvictions, introduced in v1.13 to manage Pod eviction conditions at a finer granularity. Before v1.13, eviction was condition-based: the node controller evicted all Pods on a node uniformly, and only after the node had been NotReady for longer than the default 5 minutes. With TaintBasedEvictions enabled by default in v1.16, eviction from a NotReady node is handled per Pod according to the tolerationSeconds each Pod configures.

Pods created in older clusters have no tolerationSeconds set by default. Once TaintBasedEvictions is enabled, they are evicted five seconds after the node becomes NotReady, so a brief network fluctuation or a kubelet restart can affect the stability of services in the cluster.

The TaintBasedEvictions controller decides when to evict a Pod from the tolerationSeconds in the Pod definition, so setting tolerationSeconds correctly on Pods avoids unexpected evictions.

The DefaultTolerationSeconds admission controller, enabled by default in community v1.16, uses the kube-apiserver flags default-not-ready-toleration-seconds and default-unreachable-toleration-seconds to give Pods default tolerations for the notready:NoExecute and unreachable:NoExecute taints.

When a new Pod is created, the DefaultTolerationSeconds admission controller adds the default tolerations to it after the request is submitted. But how does this take effect for Pods that already exist in the cluster? Reading the admission controller shows that it handles UPDATE operations as well as CREATE: updating a Pod definition also triggers the DefaultTolerationSeconds plugin to set the tolerations. We can therefore label the Pods already running in the cluster to trigger it.

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
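The defaulting step that produces the tolerations above can be sketched roughly as follows. This is a simplified illustration, not the admission plugin's source: the toleration type is a reduced stand-in for the Kubernetes type, and the function names are hypothetical.

```go
package main

import "fmt"

// toleration is a minimal stand-in for the Kubernetes toleration type.
type toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

const (
	notReadyKey    = "node.kubernetes.io/not-ready"
	unreachableKey = "node.kubernetes.io/unreachable"
)

// tolerates reports whether the list already covers the NoExecute taint with the given key.
func tolerates(tols []toleration, key string) bool {
	for _, t := range tols {
		if t.Key == key && (t.Effect == "NoExecute" || t.Effect == "") {
			return true
		}
	}
	return false
}

// addDefaultTolerations appends the not-ready and unreachable tolerations
// only when the Pod does not already define them.
func addDefaultTolerations(tols []toleration, notReadySeconds, unreachableSeconds int64) []toleration {
	if !tolerates(tols, notReadyKey) {
		s := notReadySeconds
		tols = append(tols, toleration{Key: notReadyKey, Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &s})
	}
	if !tolerates(tols, unreachableKey) {
		s := unreachableSeconds
		tols = append(tols, toleration{Key: unreachableKey, Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &s})
	}
	return tols
}

func main() {
	// A Pod created on the old cluster carries no tolerations at all.
	tols := addDefaultTolerations(nil, 300, 300)
	for _, t := range tols {
		fmt.Printf("%s %s %s %ds\n", t.Key, t.Operator, t.Effect, *t.TolerationSeconds)
	}
}
```

Because the function only adds what is missing, running it again on an already defaulted Pod changes nothing, which is why triggering it through a harmless UPDATE such as a label change is safe.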

3.5 Pod MatchNodeSelector

To detect unexpected Pod evictions and batch container restarts during the upgrade, we ran a script that continuously reported, per node, the Pods not in the Running state and the containers that had restarted.

During the upgrade, dozens of Pods were suddenly marked MatchNodeSelector, and we confirmed that the service containers on the affected node had indeed stopped. The kubelet log showed:

predicate.go:132] Predicate failed on Pod: nginx-7dd9db975d-j578s_default(e3b79017-0b15-11ec-9cd4-000c29c4fa15), for reason: Predicate MatchNodeSelector failed
kubelet_pods.go:1125] Killing unwanted pod "nginx-7dd9db975d-j578s"

Analysis showed that a Pod enters the MatchNodeSelector state because kubelet, when it restarts and runs admission checks on the Pods on its node, cannot find the node labels the Pod requires. It then sets the Pod phase to Failed and the reason to MatchNodeSelector. When kubectl fetches the Pod, its printer displays the reason directly, so the Pod status appears as MatchNodeSelector. Adding the missing label to the node lets the Pod be scheduled again, after which the Pods in the MatchNodeSelector state can be deleted.

Before the upgrade, we recommend running a script to check that every node carries the labels required by the nodeSelector of the Pods running on it.
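The core of such a check is a comparison between a Pod's nodeSelector and the labels of the node it runs on. A minimal sketch, with a hypothetical function name and made-up label keys, and with the fetching of Pods and nodes from the API left out:

```go
package main

import (
	"fmt"
	"sort"
)

// missingSelectorLabels returns the nodeSelector keys whose key/value pair
// is not present in the node's labels, sorted for stable output.
func missingSelectorLabels(nodeLabels, nodeSelector map[string]string) []string {
	var missing []string
	for key, want := range nodeSelector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical node labels and Pod nodeSelector.
	nodeLabels := map[string]string{"kubernetes.io/hostname": "node-1", "disk": "ssd"}
	nodeSelector := map[string]string{"disk": "ssd", "zone": "a"}

	// prints: [zone]
	fmt.Println(missingSelectorLabels(nodeLabels, nodeSelector))
}
```

Any non-empty result identifies a Pod that kubelet would reject with MatchNodeSelector on its next restart, so the label can be restored on the node before the upgrade begins.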

3.6 kube-apiserver Cannot Be Accessed

Some time after the cluster was upgraded to v1.17, an alarm reported a node in the NotReady state, and the node recovered after kubelet was restarted. Digging further into the cause, we found a large number of "use of closed network connection" errors in the kubelet log. Searching the community's issues turned up a similar problem, in which a developer describes the cause and the solution; the fix was merged in v1.18.

The cause is that kubelet talks to kube-apiserver over a persistent HTTP/2 connection by default, and the Go net/http2 package used to build the client-to-server connection has a bug: a broken connection can remain in the HTTP connection pool and still be handed out. As a result, kubelet cannot communicate with kube-apiserver properly.

The Go community worked around the problem by adding HTTP/2 connection health checks, but that fix still had bugs, and the problem was fixed completely in Go 1.15.11. Internally, we backported the change to our v1.17 branch and compiled the binaries with Go 1.15.15.

3.7 TCP Connection Count Problem

While testing in the pre-release environment, we happened to notice that kubelet on every node held nearly 10 persistent connections to kube-apiserver. This contradicted our understanding that kubelet reuses a single connection to kube-apiserver, and in the v1.10 environment there was indeed only one. The extra TCP connections put pressure on the LB. As nodes are added, if the LB is overwhelmed and kubelet cannot report heartbeats, nodes become NotReady and large numbers of Pods are evicted, with disastrous consequences. So besides tuning LB parameters, we needed to find out why the number of connections between kubelet and kube-apiserver had grown.

In a v1.17.1 cluster built with kubeadm, kubelet holds only one persistent connection to kube-apiserver, which indicates that the problem was introduced between v1.17.1 and our upgrade target version. Investigation showed that newly added logic stops kubelet from taking the persistent connection from the cache when it obtains a client. The transport cache exists to hold persistent connections so that they can be reused across many HTTP requests, which avoids the cost of establishing a TCP (TLS) connection for each request. In the change in question, the transport is built through the RoundTripper interface, and once the config object has its Dial or Proxy field set, a new connection is created instead of reusing the cached one.

// client-go decides here whether a transport can be reused from the cache
func tlsConfigKey(c *Config) (tlsCacheKey, bool, error) {
    ...
    if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {
        // cannot determine equality for functions
        return tlsCacheKey{}, false, nil
    }
    ...
}
 
 
func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
    key, canCache, err := tlsConfigKey(config)
    ...
 
    if canCache {
        // Ensure we only create a single transport for the given TLS options
        c.mu.Lock()
        defer c.mu.Unlock()
 
        // See if we already have a custom transport for this config
        if t, ok := c.transports[key]; ok {
            return t, nil
        }
    }
...
}
 
// kubelet builds its client configuration
func buildKubeletClientConfig(ctx context.Context, s *options.KubeletServer, nodeName types.NodeName) (*restclient.Config, func(), error) {
    ...
    kubeClientConfigOverrides(s, clientConfig)
    closeAllConns, err := updateDialer(clientConfig)
    ...
    return clientConfig, closeAllConns, nil
}
 
// Setting Dial on clientConfig means kubelet creates a new transport whenever it builds a client
func updateDialer(clientConfig *restclient.Config) (func(), error) {
    if clientConfig.Transport != nil || clientConfig.Dial != nil {
        return nil, fmt.Errorf("there is already a transport or dialer configured")
    }
    d := connrotation.NewDialer((&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext)
    clientConfig.Dial = d.DialContext
    return d.CloseAll, nil
}

The closeAllConns function is built here to close connections that are dead but not yet closed. Since that problem had already been solved by upgrading the Go version, we reverted part of this change in our branch to resolve the growth in TCP connections.

Following the community recently, we found that a fix has been merged: client-go's interface was refactored so that TCP connections are reused for custom RESTClients.

4. Non-destructive upgrade operation

The biggest risk of a cross-version upgrade is that object definitions differ before and after the upgrade, so upgraded components may fail to parse objects stored in etcd. The upgrade may also pass through an intermediate state in which the control plane has been upgraded but kubelet has not, so status is reported abnormally; in the worst case the Pods on a node are evicted. All of this has to be considered and tested before upgrading.

Repeated testing showed that these problems do not occur between v1.10 and v1.17, apart from some deprecated API resources, which we handled by adding kube-apiserver configuration. To make sure that uncovered corner cases can be handled in time during the upgrade, we strongly recommend backing up etcd beforehand and stopping the controllers and scheduler during the upgrade to avoid unexpected control logic. Stopping controllers selectively would have required modifying the code and compiling a temporary controller manager, which adds steps and management complexity to the upgrade, so we stopped the controller manager as a whole.

Beyond the code changes and process considerations above, before replacing binaries it is also necessary to compare the configuration options of the old and new versions of each service to make sure the services start and run successfully. The comparison showed that kubelet no longer supports the --allow-privileged flag at startup, so it has to be removed. Note that removing it does not mean that higher versions no longer support privileged containers on nodes. After v1.15, security settings for Pod admission are defined through the Pod Security Policy resource object, which gives finer-grained security control.

Once the binaries have been compiled with the code changes for the non-destructive upgrade described above and the configuration options in each component's configuration file have been updated, the online upgrade can begin. The steps are as follows:

  • Back up the cluster (binaries, configuration files, etcd database, and so on);

  • Upgrade some nodes first as a canary to verify the binaries and configuration files;

  • Distribute the upgrade binaries in advance;

  • Stop the controllers, scheduler, and alarms;

  • Update the control plane configuration files and upgrade the components;

  • Update the compute node configuration files and upgrade the components;

  • Label the Pods on the nodes to trigger the addition of tolerations;

  • Start the controllers and scheduler, and re-enable the alarms;

  • Check the cluster services and confirm that the cluster is healthy.

During the upgrade, do not upgrade too many nodes concurrently. When many kubelets restart and report at the same time, they put load on the LB in front of kube-apiserver; in extreme cases node heartbeats fail to be reported and node status flaps between NotReady and Ready.
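One simple way to cap concurrency is to split the node list into fixed-size batches and upgrade one batch at a time. A minimal sketch follows; the function name and the batch size are illustrative choices, not values from our upgrade tooling.

```go
package main

import "fmt"

// batches splits nodes into consecutive groups of at most size nodes each.
func batches(nodes []string, size int) [][]string {
	if size <= 0 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(nodes); start += size {
		end := start + size
		if end > len(nodes) {
			end = len(nodes)
		}
		out = append(out, nodes[start:end])
	}
	return out
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	for i, b := range batches(nodes, 2) {
		// prints:
		// batch 1: [node-1 node-2]
		// batch 2: [node-3 node-4]
		// batch 3: [node-5]
		fmt.Printf("batch %d: %v\n", i+1, b)
	}
}
```

Waiting for every node in a batch to report Ready before starting the next one keeps the number of simultaneous kubelet restarts, and therefore the burst of traffic on the LB, bounded by the batch size.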

5. Summary

Cluster upgrades had troubled the container team for a long time. After a series of investigations and repeated tests, the key problems described above were solved and the cluster was successfully upgraded from v1.10 to v1.17. Upgrading a 1000-node cluster in batches took about 10 minutes. We will upgrade to a higher version once the platform interface transformation is complete.

The upgrade improves cluster stability and scalability and enriches cluster capabilities. The upgraded cluster is also more compatible with CNCF projects.

As mentioned at the beginning, frequently upgrading large clusters according to the skew policy may not be practical, so cross-version upgrade, though risky, is widely used in the industry. At KubeCon China 2021, Alibaba also gave a talk on upgrading Kubernetes clusters across versions with zero downtime. It mainly covered application migration, traffic switching, and other key points, and its upgrade preparation and procedure are relatively complex. Compared with Alibaba's cross-version replacement upgrade, the in-place upgrade requires a small amount of source code modification, but the process is simpler and lends itself better to O&M automation.

Because the choice of cluster versions varies widely, the upgrade described in this article may not apply universally. The author mainly hopes to offer readers ideas and risk points for upgrading production clusters across versions. The upgrade itself is short, but the preparation and research beforehand are time-consuming and laborious: they require digging into the features and source code of different Kubernetes versions and fully understanding the Kubernetes API compatibility and release policies. With that groundwork, one can test adequately before the upgrade and face unexpected situations during it more calmly.

6. Reference links

[1]github.com

[2] kubernetes.io/version-ske…

[3] Specific scheme reference: https://github.comstart

[4] a similar problem: https://github.com/kubernetes

[5] github.com/golang/3497…

[6] github.com/kubernetes/…

[7] github.com/kubernetes/…

[8] github.com/kubernetes/…

Author: Shu Yingya, Vivo Internet Server Team