Author | Shengdong, Alibaba Cloud after-sales technical expert

Introduction: The engineers on Alibaba Cloud's after-sales technical team deal with all kinds of strange online problems every day. Common faults include network connection failures, server downtime, substandard performance, and slow responses to requests. But if we had to pick the kind of problem that looks trivial yet makes people rack their brains, it would surely be the “cannot be deleted” problems: a file that cannot be removed, a process that cannot be killed, a driver that cannot be unloaded, and so on. Problems like these are icebergs; the logic hidden beneath them is often far more complex than we expect.

Background

The problem we are discussing today is related to the Namespace of a K8s cluster. A Namespace works like a “storage box” for cluster resources: related resources can be filed into the same Namespace, so that unrelated resources do not affect one another unnecessarily.

A Namespace is itself a resource. Through the cluster's API Server entry point we can create Namespaces, and Namespaces that are no longer used need to be cleaned up through the same entry point. The Namespace Controller watches, via the API Server, for changes to Namespaces in the cluster and performs predefined actions based on those changes.

Sometimes we run into the problem shown in the figure below, where the state of a Namespace is marked “Terminating”, but there is no way to delete it completely.
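For illustration, the symptom looks roughly like this from kubectl (the namespace name tobedeletedb is the example used throughout this article, and the output shown is representative rather than a capture from the original case):

kubectl delete namespace tobedeletedb
# the delete request is accepted, but the namespace never actually goes away
kubectl get namespace tobedeletedb
# NAME           STATUS        AGE
# tobedeletedb   Terminating   15m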

Start at the cluster entrance

Because the deletion is performed through the cluster API Server, we need to analyze the behavior of the API Server. Like most cluster components, the API Server provides logging output at different levels of detail. To understand its behavior, we set the log level to the highest level, then reproduced the problem by creating a Namespace named tobedeletedb.
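How the log level is raised depends on how the API Server is deployed. For a kube-apiserver running as a static pod, for instance, it usually means raising the klog verbosity flag in the pod manifest, roughly as sketched below (the manifest path and verbosity value are illustrative assumptions, not taken from the original case):

# fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (illustrative)
spec:
  containers:
  - command:
    - kube-apiserver
    - --v=8    # very verbose request-level logging; lower it again after debugging
# then reproduce the problem:
#   kubectl create namespace tobedeletedb
#   kubectl delete namespace tobedeletedb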

Unfortunately, API Server doesn’t do much logging about this problem.

Related logs can be divided into two parts:

  • One is the Namespace deletion record, which shows that the client tool is kubectl and that the source IP address of the operation is 192.168.0.41, both as expected.
  • The other part is that Kube Controller Manager is repeatedly getting information about this Namespace.

Kube Controller Manager implements most of the controllers in the cluster, and the fact that it keeps fetching information about tobedeletedb essentially tells us that it is the Namespace Controller that is reading this Namespace.

What is the Controller doing?

Similar to the previous section, we explore the behavior of this component by enabling the highest logging level of Kube Controller Manager. In its log, we can see that the Namespace Controller keeps trying to clean up the tobedeletedb Namespace.

How to delete the resources in the “storage box”?

Here we need to understand that a Namespace, as the “storage box” for resources, is really a logical concept. It is not like a physical box into which you actually put small items: the “storing” done by a Namespace is in fact a mapping from each resource to its Namespace.

This matters because it directly determines how the resources inside a Namespace are deleted. If the storage were physical, we could simply delete the “storage box” and everything inside would be deleted with it. Because the relationship is logical, we instead have to list all resources and delete the ones that point to the Namespace being removed.
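This is essentially something we can also do by hand with kubectl. A rough sketch (using the example namespace from this article) that enumerates every namespaced resource type the cluster knows about and then prints whatever still exists in the namespace:

# list all namespaced resource types, then show what is left in the namespace
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n tobedeletedb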

API, Group, Version

How do we list all the resources in a cluster? The answer starts with how the cluster's APIs are organized. The K8s cluster API is not monolithic; it is organized into groups and versions. The obvious benefit of this is that APIs in different groups can evolve independently of each other. A common group is apps, which has the versions v1, v1beta1 and v1beta2. The complete list of groups/versions can be seen with the kubectl api-versions command.
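On a typical cluster the output looks roughly like the following (truncated; the exact list depends on the cluster version and the extensions installed):

$ kubectl api-versions
apps/v1
apps/v1beta1
apps/v1beta2
metrics.k8s.io/v1beta1
networking.k8s.io/v1beta1
v1
# ... (remaining groups/versions omitted)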

Every resource we create must belong to some API group/version. Taking the Ingress below as an example, we specify that the group/version of this Ingress resource is networking.k8s.io/v1beta1.

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: test-ingress
spec:
  rules:
  - http:
      paths:
      - path: /testpath
        backend:
          serviceName: test
          servicePort: 80

The diagram below gives a simple summary of API groups and versions.

In fact, a cluster has many API groups/versions, and each group/version supports specific resource types. When we describe a resource in YAML, we specify the resource type as kind and the API group/version as apiVersion. To list all resources, we first need to obtain the list of API groups/versions.

Why the Controller cannot delete the resources in the Namespace

Once we understand the concept of API groups/versions, the Kube Controller Manager log becomes clear: the Namespace Controller was trying to fetch the list of API groups/versions, and the query failed when it reached metrics.k8s.io/v1beta1 with the error “the server is currently unable to handle the request”.

Go back to the cluster entrance again

In the previous section, we found that Kube Controller Manager failed to retrieve the metrics.k8s.io/v1beta1 API group/version. This query is clearly addressed to the API Server, so we go back to the API Server log and look for records related to metrics.k8s.io/v1beta1. At the same point in time, we see the API Server reporting the same error, “the server is currently unable to handle the request”.

metrics.k8s.io/v1beta1 is an API group/version, yet the server behind it is unavailable. To understand why, we need to understand the “plug-in” mechanism of the API Server.

The cluster API Server has an extension mechanism that developers can use to implement “plug-ins” for the API Server. The main job of such a “plug-in” is to implement a new API group/version; the API Server, acting as a proxy, forwards the corresponding API calls to the “plug-in”.

Take Metrics Server as an example: it implements the metrics.k8s.io/v1beta1 API group/version, and all calls to this group/version are forwarded to Metrics Server. As shown below, the Metrics Server implementation consists of a service and a pod.

The last item in the figure above, an apiservice, is the mechanism that connects the “plug-in” to the API Server. Its definition, shown in detail below, includes the API group/version and the name of the service that implements Metrics Server. With this information, the API Server can forward calls for metrics.k8s.io/v1beta1 to Metrics Server.
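The screenshots are not reproduced here, but a typical apiservice definition for Metrics Server looks roughly like the following (a representative manifest, not the one captured in the original case):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:
    name: metrics-server      # the Service that fronts the Metrics Server pod
    namespace: kube-system
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100

When the backing service cannot be reached, the API Server marks the apiservice as unavailable, which can be checked with a command such as kubectl get apiservice v1beta1.metrics.k8s.io; that unavailability is exactly what both the Namespace Controller and the API Server were reporting.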

Communication between nodes and pods

After some simple testing, we found that the root of the issue was a communication failure between the API Server and the Metrics Server pod. In the Alibaba Cloud K8s cluster environment, the API Server uses the host network, i.e. the ECS network, while the Metrics Server uses the pod network. Communication between the two depends on forwarding by the VPC routing table.

For example: the API Server runs on Node A with IP address 192.168.0.193, and the Metrics Server pod's IP address is 172.16.1.12; the network connection from the API Server to the Metrics Server must be forwarded by routing rule 2 in the VPC routing table.
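A simple end-to-end way to exercise this path is to ask the API Server to call the aggregated API itself; when the route is missing, a request like the one below fails with the same “unable to handle the request” error (an illustrative check, not taken from the original case):

# this request goes to the API Server, which proxies it to the Metrics Server pod
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes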

In the VPC routing table of the cluster, the routing entry pointing to the node where the Metrics Server resides is missing. Therefore, the communication between the API Server and the Metrics Server fails.

Why isn’t Route Controller working?

To keep the cluster's VPC route entries correct, Alibaba Cloud implements a Route Controller inside Cloud Controller Manager. This controller constantly watches the status of the cluster nodes and of the VPC routing table, and when it finds a route entry missing, it automatically fills the missing entry back in.

The situation here is obviously not what we would expect: the Route Controller is not doing its job properly. This can be confirmed by looking at the Cloud Controller Manager log, where we find that when the Route Controller uses the cluster's VPC ID to look up the VPC instance, it fails to obtain the VPC's information.

But the cluster is still there and the ECS instances are still there, so the VPC cannot simply have disappeared, and indeed we can confirm this on the VPC console using the VPC ID. The next question, then, is why Cloud Controller Manager cannot obtain the information of this VPC.

Cluster nodes access cloud resources

Cloud Controller Manager obtains VPC information through the Alibaba Cloud OpenAPI. This is essentially equivalent to an ECS instance in the cloud querying the information of a VPC instance, which requires the ECS instance to have sufficient permissions. The current practice is to bind a RAM role to the ECS instance and to attach the corresponding authorization policies to that RAM role.

If a cluster component running on a node cannot obtain the information of a cloud resource, there are basically two possibilities: either the ECS instance is not bound to the correct RAM role, or the RAM role does not have authorization policies attached with the correct rules. Examining the RAM role of the node and the authorization policies attached to that role, we found that the authorization policy for the VPC had been changed.

When the Effect in that policy was changed to Allow, all of the namespaces stuck in Terminating soon disappeared.
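The article does not reproduce the exact policy, but the relevant part is a standard RAM authorization statement. An illustrative sketch of the read access the Route Controller needs on the VPC is shown below; in this incident the Effect field had evidently been changed away from Allow (the action names and scope here are assumptions, not the customer's actual policy):

{
  "Version": "1",
  "Statement": [
    {
      "Action": ["vpc:Describe*"],
      "Resource": ["*"],
      "Effect": "Allow"
    }
  ]
}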

A larger problem

Overall, this problem involves six components of the K8s cluster and the cloud: the API Server and its extension Metrics Server, the Namespace Controller and the Route Controller, the VPC routing table, and RAM role authorization.

By analyzing the behavior of the first three components, we determined that a cluster networking problem was preventing the API Server from connecting to the Metrics Server; digging deeper, the root cause was that a VPC route entry had been deleted and the RAM role's authorization policy had been changed, which left the Route Controller unable to repair the route.

Afterword

A K8s cluster Namespace that cannot be deleted may look like a harmless problem, but it is not only complicated to troubleshoot; it also means the cluster has lost an important piece of functionality. This article gives a thorough analysis of one such problem, covering both the troubleshooting approach and the underlying principles, and we hope it will help you investigate similar issues.
