Author | Alibaba Cloud after-sales technical expert

You may not have noticed it, but most of the time we no longer use a system the way we used to, either from a command line or through a graphical window.

preface

Now, when we log in to Weibo or shop online, what we operate is not the device in front of us but one cluster after another. Such a cluster typically has hundreds or thousands of nodes, each of which is a physical machine or a virtual machine. Clusters usually sit in data centers far away from their users. For these nodes to cooperate and provide consistent, efficient service to the outside world, the cluster needs an operating system. Kubernetes is such an operating system.

Comparing Kubernetes with a stand-alone operating system, Kubernetes plays the role of the kernel: it manages the cluster's hardware and software resources and provides a unified entrance to the outside, through which users can use the cluster and communicate with it.

A program running on a cluster is very different from an ordinary program. Such programs are "caged" programs: they are unusual in the way they are built, deployed, and used. We cannot understand their nature until we dig down to their roots.

The caged program

code

We use the Go language to write a simple web server program, app.go, which listens on port 2580. Accessing the root path of the service over HTTP returns the string "This is a small app for kubernetes…".

package main

import (
    "log"
    "net/http"

    "github.com/gorilla/mux"
)

func about(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("This is a small app for kubernetes...\n"))
}

func main() {
    r := mux.NewRouter()
    r.HandleFunc("/", about)
    log.Fatal(http.ListenAndServe("0.0.0.0:2580", r))
}

Compile the program using the go build command, producing the app executable. This is a normal executable file that runs on the operating system and relies on library files in the system.

# ldd app
linux-vdso.so.1 => (0x00007ffd1f7a3000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f554fd4a000)
libc.so.6 => /lib64/libc.so.6 (0x00007f554f97d000)
/lib64/ld-linux-x86-64.so.2 (0x00007f554ff66000)

“Cage”

To free the program from depending on the operating system's own library files, we need to create a container image for it, that is, an isolated runtime environment. A Dockerfile is the "recipe" for creating a container image. Here we start from a basic CentOS image and put the app executable under /usr/local/bin.

FROM centos
ADD app /usr/local/bin

address

The image is built locally, so we need to push it to an image registry, which acts as the cluster's "app store". We use Alibaba Cloud's container registry; after pushing, the image address is:

registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest

The image address can be divided into four parts: registry address/namespace/image name:image tag. For the image above, the registry address is registry.cn-hangzhou.aliyuncs.com, the namespace is kube-easy, the image name is app, and the tag is latest. At this point, we have a small "caged" program that can run on a Kubernetes cluster.
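To make the four parts concrete, here is a small illustrative Go helper that splits such an address. This is a sketch written for this article, not part of any real registry library: real image references also allow defaults (such as an implicit latest tag) and deeper repository paths, which this version ignores.

```go
package main

import (
    "fmt"
    "strings"
)

// ImageRef holds the four parts of an image address.
// This is an illustrative type, not part of any Kubernetes API.
type ImageRef struct {
    Registry, Namespace, Name, Tag string
}

// parseImageRef splits "registry/namespace/name:tag" into its parts.
// A sketch only: it assumes exactly one namespace level.
func parseImageRef(s string) (ImageRef, error) {
    name, tag := s, "latest"
    // the tag separator is the last ":" after the last "/"
    if i := strings.LastIndex(s, ":"); i > strings.LastIndex(s, "/") {
        name, tag = s[:i], s[i+1:]
    }
    parts := strings.SplitN(name, "/", 3)
    if len(parts) != 3 {
        return ImageRef{}, fmt.Errorf("expected registry/namespace/name, got %q", name)
    }
    return ImageRef{parts[0], parts[1], parts[2], tag}, nil
}

func main() {
    ref, err := parseImageRef("registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", ref)
}
```

Running this against our image address yields the registry, namespace, name, and tag as separate fields.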

Getting in the door

The entrance

As an operating system, Kubernetes has an API, just like an ordinary operating system. The API gives the cluster an entrance; through the API, we can get in and use the cluster. The Kubernetes API is implemented by the API Server component running on cluster nodes. This component is a typical web server application that provides services by exposing an HTTP(S) interface.

Here we create an Alibaba Cloud Kubernetes cluster. After logging in to the cluster management page, we can see the public endpoint of the API Server.

API Server network connection endpoint: https://xx.xxx.xxx.xxx:6443

Two-way digital certificate authentication

The API Server component of an Alibaba Cloud Kubernetes cluster uses CA-signed two-way digital certificate authentication to secure communication between clients and the API Server. That sentence is a mouthful and not easy for beginners to parse, so let's unpack it in depth.

Conceptually, a digital certificate is a file used to authenticate network communication participants. This is similar to the diplomas that schools give to students. Between the school and the student, the school is the trusted third party CA and the student is the communication participant. If society generally trusts the reputation of a school, then the diplomas issued by the school will also be recognized by society. Participant certificates and CA certificates can be analogous to diplomas and school licenses.

Here we have two types of participants, CAs and ordinary participants; correspondingly, we have two types of certificates, CA certificates and participant certificates; and we have two kinds of relationships, the certificate issuing relationship and the trust relationship. These two relationships are crucial.

Let’s look at the issuing relationship first. As shown below, we have two CA certificates and three participant certificates.

The top CA certificate issues two certificates: the middle CA certificate and the participant certificate on the right. The middle CA certificate in turn issues the two participant certificates below it. These five certificates, connected by their issuing relationships, form a tree-shaped certificate issuing diagram.

However, certificates and the issuing relationship by themselves do not guarantee that trusted communication can take place between participants. For example, suppose the right-most participant is a website and the left-most participant is a browser. The browser trusts the website's data not because the website holds a certificate, nor because the website's certificate was issued by a CA, but because the browser trusts the top-most CA. That is the trust relationship.

With CA certificates, participant certificates, the issuing relationship, and the trust relationship understood, let's come back to "two-way digital certificate authentication based on CA signatures." The client and the API Server, as ordinary participants in the communication, each hold a certificate. Each of these certificates is issued by a CA, which we will simply call the cluster CA and the client CA. The client trusts the cluster CA, so it trusts the API Server, which holds a certificate issued by the cluster CA. In turn, the API Server needs to trust the client CA before it is willing to communicate with the client.
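The issue-then-trust flow can be sketched with Go's standard crypto/x509 package. Everything below is illustrative: the CA name cluster-ca is a placeholder, the client CN reuses the sub-account ID from this article as sample data, and a real cluster of course generates and stores its certificates quite differently.

```go
package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "crypto/x509/pkix"
    "fmt"
    "math/big"
    "time"
)

// newCA creates a self-signed CA certificate, the root of trust.
func newCA() (*x509.Certificate, *ecdsa.PrivateKey) {
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        panic(err)
    }
    tmpl := &x509.Certificate{
        SerialNumber:          big.NewInt(1),
        Subject:               pkix.Name{CommonName: "cluster-ca"},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().Add(time.Hour),
        IsCA:                  true,
        KeyUsage:              x509.KeyUsageCertSign,
        BasicConstraintsValid: true,
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        panic(err)
    }
    cert, _ := x509.ParseCertificate(der)
    return cert, key
}

// issueCert has the CA sign a participant (client) certificate:
// this is the issuing relationship.
func issueCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, cn string) *x509.Certificate {
    key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    tmpl := &x509.Certificate{
        SerialNumber: big.NewInt(2),
        Subject:      pkix.Name{CommonName: cn},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(time.Hour),
        KeyUsage:     x509.KeyUsageDigitalSignature,
        ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
    if err != nil {
        panic(err)
    }
    cert, _ := x509.ParseCertificate(der)
    return cert
}

// trusted reports whether a verifier that trusts only the given CA
// accepts the certificate: this is the trust relationship in action.
func trusted(cert, ca *x509.Certificate) bool {
    roots := x509.NewCertPool()
    roots.AddCert(ca)
    _, err := cert.Verify(x509.VerifyOptions{
        Roots:     roots,
        KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
    })
    return err == nil
}

func main() {
    ca, caKey := newCA()
    client := issueCert(ca, caKey, "252771643302762862")
    fmt.Println("issued by:", client.Issuer.CommonName)
    fmt.Println("trusted:", trusted(client, ca))
}
```

Verification succeeds only because the verifier placed the issuing CA in its pool of trusted roots; a certificate issued by an unrelated CA would be rejected.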

In an Alibaba Cloud Kubernetes cluster, the cluster CA certificate and the client CA certificate are in fact the same certificate, which gives us this relationship diagram.

KubeConfig file

Logging in to the cluster management console, we can download the KubeConfig file. This file contains the client certificate, the cluster CA certificate, and more. The certificates are base64-encoded, so we can decode them with a base64 tool and then inspect the certificate text with OpenSSL.
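As a sketch of what that decoding step does, the following Go program simulates it end to end. The certificate here is a throwaway self-signed one generated in-process, since a real KubeConfig carries the cluster-issued certificates, and decodeCertData is a hypothetical helper written for this article, not a real kubectl API.

```go
package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "crypto/x509/pkix"
    "encoding/base64"
    "encoding/pem"
    "fmt"
    "math/big"
    "time"
)

// decodeCertData reverses how KubeConfig stores certificate fields:
// base64-decode the value, strip the PEM armor, parse the DER bytes.
func decodeCertData(b64 string) (*x509.Certificate, error) {
    pemBytes, err := base64.StdEncoding.DecodeString(b64)
    if err != nil {
        return nil, err
    }
    block, _ := pem.Decode(pemBytes)
    if block == nil {
        return nil, fmt.Errorf("no PEM block found")
    }
    return x509.ParseCertificate(block.Bytes)
}

// selfSignedDemoCert builds a throwaway certificate so the example is
// self-contained, then encodes it the way KubeConfig would.
func selfSignedDemoCert(cn string) string {
    key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    tmpl := &x509.Certificate{
        SerialNumber: big.NewInt(1),
        Subject:      pkix.Name{CommonName: cn},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(time.Hour),
    }
    der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
    return base64.StdEncoding.EncodeToString(pemBytes)
}

func main() {
    data := selfSignedDemoCert("252771643302762862")
    cert, err := decodeCertData(data)
    if err != nil {
        panic(err)
    }
    fmt.Println("Subject CN:", cert.Subject.CommonName)
}
```

The printed Subject CN corresponds to what `openssl x509 -text` would show in the Subject line of the decoded certificate.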

  • First, the client certificate: the CN of its issuer is the cluster ID, c0256a3b8e4b948bb9c21e66b0e1d9a72, while the certificate's own CN is the sub-account ID, 252771643302762862;
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 787224 (0xc0318)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72
        Validity
            Not Before: Nov 29 06:03:00 2018 GMT
            Not After : Nov 28 06:08:39 2021 GMT
        Subject: O=system:users, OU=, CN=252771643302762862

  • Second, the API Server will validate a client certificate only if it trusts the client CA that issued it. The kube-apiserver process specifies the client CA certificates it trusts with the --client-ca-file parameter, which here points to /etc/kubernetes/pki/apiserver-ca.crt. This file actually contains two client CA certificates; one is related to cluster management and will not be discussed here, while the other is shown below. Its CN is the same as the CN of the issuer of the client certificate above;
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 786974 (0xc021e)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=ZheJiang, L=HangZhou, O=Alibaba, OU=ACS, CN=root
        Validity
            Not Before: Nov 29 03:59:00 2018 GMT
            Not After : Nov 24 04:04:00 2038 GMT
        Subject: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72

  • Next, the certificate used by the API Server itself is specified by the kube-apiserver parameter --tls-cert-file, which points to /etc/kubernetes/pki/apiserver.crt. This certificate's CN is kube-apiserver, and it was issued by c0256a3b8e4b948bb9c21e66b0e1d9a72, namely the cluster CA certificate;
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2184578451551960857 (0x1e512e86fcba3f19)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72
        Validity
            Not Before: Nov 29 03:59:00 2018 GMT
            Not After : Nov 29 04:14:23 2019 GMT
        Subject: CN=kube-apiserver

  • Finally, the client needs to verify the API Server certificate above, so the KubeConfig file includes its issuer, the cluster CA certificate. Comparing the cluster CA certificate with the client CA certificate, we find that the two are exactly the same, which is what we expected.
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 786974 (0xc021e)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=ZheJiang, L=HangZhou, O=Alibaba, OU=ACS, CN=root
        Validity
            Not Before: Nov 29 03:59:00 2018 GMT
            Not After : Nov 24 04:04:00 2038 GMT
        Subject: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72

access

With the certificates in hand, we can use curl to run a simple access test against the API Server.

# curl --cert ./client.crt --cacert ./ca.crt --key ./client.key https://xx.xx.xx.xxx:6443/api/
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "192.168.0.222:6443"
    }
  ]
}

Choosing the best

Two types of nodes, one type of task

As mentioned at the beginning, Kubernetes is an operating system that manages the many nodes of a cluster. These nodes do not all play the same role in the cluster: a Kubernetes cluster has two types of nodes, master nodes and worker nodes.

This division of roles is really a division of labor: master nodes are responsible for managing the whole cluster, and the cluster management components, including the API Server that implements the cluster's entrance, mainly run on them; worker nodes mainly carry ordinary tasks.

In a Kubernetes cluster, tasks are defined through the concept of a pod. The pod is the cluster's atomic unit for carrying tasks. Pod is sometimes rendered as "container group", a loose translation, because one pod can actually encapsulate multiple containerized applications. In principle, the containers packaged inside a pod should have a considerable degree of coupling.

The scheduling algorithm

The problem the scheduling algorithm needs to solve is choosing a comfortable "residence" for each pod: a node where the tasks the pod defines can be completed successfully.

In order to achieve the goal of “choosing the best”, Kubernetes cluster scheduling algorithm adopts a two-step strategy:

  • The first step excludes, from all the nodes, those that do not meet the pod's requirements; this is pre-selection;
  • The second step scores the remaining nodes, and the highest scorer wins; this is preference.
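The two-step strategy can be sketched in Go. The Node and Pod types and the scoring rule below are heavily simplified illustrations written for this article, not the real scheduler's data structures or algorithms:

```go
package main

import "fmt"

// Node and Pod are simplified stand-ins for the real Kubernetes
// objects; only the fields this sketch needs are modeled.
type Node struct {
    Name             string
    CPUFree, MemFree int64        // remaining allocatable resources
    UsedPorts        map[int]bool // host ports already taken by pods
}

type Pod struct {
    CPUReq, MemReq int64
    HostPort       int // 0 means no host port requested
}

// fits is the pre-selection step: a node is excluded if it lacks
// resources or the requested host port is already in use.
func fits(p Pod, n Node) bool {
    if n.CPUFree < p.CPUReq || n.MemFree < p.MemReq {
        return false
    }
    if p.HostPort != 0 && n.UsedPorts[p.HostPort] {
        return false
    }
    return true
}

// score is a toy preference step favoring the node with the most
// resources left after placing the pod; the real scheduler combines
// several weighted priorities.
func score(p Pod, n Node) int64 {
    return (n.CPUFree - p.CPUReq) + (n.MemFree - p.MemReq)
}

// schedule filters, then picks the highest-scoring survivor.
func schedule(p Pod, nodes []Node) (string, bool) {
    bestName, bestScore, found := "", int64(-1), false
    for _, n := range nodes {
        if !fits(p, n) {
            continue
        }
        if s := score(p, n); s > bestScore {
            bestName, bestScore, found = n.Name, s, true
        }
    }
    return bestName, found
}

func main() {
    pod := Pod{CPUReq: 1, MemReq: 1, HostPort: 2580}
    nodes := []Node{
        {Name: "node-a", CPUFree: 4, MemFree: 4, UsedPorts: map[int]bool{2580: true}},
        {Name: "node-b", CPUFree: 2, MemFree: 2},
        {Name: "node-c", CPUFree: 8, MemFree: 8},
    }
    name, ok := schedule(pod, nodes)
    fmt.Println(name, ok) // node-a is filtered out; node-c scores highest
}
```

node-a survives the resource check but is excluded by the port check, illustrating how a node can fail any single pre-selection rule.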

Let's use the image we built at the beginning of this article to create a pod, and trace the scheduler's log to see how the pod gets scheduled onto a cluster node.

Pod configuration

First, we create a pod configuration file in JSON format. Three items in this file are key: the image address, the command, and the container port.

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "app"
  },
  "spec": {
    "containers": [
      {
        "name": "app",
        "image": "registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest",
        "command": [ "app" ],
        "ports": [
          { "containerPort": 2580 }
        ]
      }
    ]
  }
}

Log level

The cluster scheduling algorithm is implemented as a system component running on the master node, similar to the API Server. Its process name is kube-scheduler. kube-scheduler supports log output at multiple levels, but the community does not provide detailed documentation of the log levels. To see how the scheduling algorithm filters and scores nodes, we need to raise the log level to 10, that is, add the parameter --v=10.

kube-scheduler --address=127.0.0.1 --kubeconfig=/etc/kubernetes/scheduler.conf --leader-elect=true --v=10

Create a Pod

Acting as a client again, we use curl to POST the pod configuration file to the API Server, creating the pod.

# curl -X POST -H 'Content-Type: application/json; charset=utf-8' --cert ./client.crt --cacert ./ca.crt --key ./client.key https://47.110.197.238:6443/api/v1/namespaces/default/pods [email protected]

Pre-selection

Pre-selection is the first step of Kubernetes scheduling. In this step, nodes that do not meet certain conditions are filtered out according to predefined rules. Different versions of Kubernetes implement noticeably different sets of pre-selection rules, but the basic trend is that the rule set keeps getting richer.

Two of the more common pre-selection rules are PodFitsResourcesPred and PodFitsHostPortsPred. The former determines whether the resources remaining on a node can satisfy the pod's requests; the latter checks whether a host port requested by the pod is already occupied by another pod on the node.

The following figure shows the pre-selection log output by the scheduling algorithm while processing our test pod. The log records the execution of the CheckVolumeBindingPred rule. Some types of storage volumes (PVs) can be mounted on only one node; this rule filters out nodes that do not satisfy the pod's PV requirements.

As the app's manifest shows, the pod requires no storage volumes, so this rule does not filter out any nodes.

Preference

The second stage of the scheduling algorithm is the preference stage, in which kube-scheduler scores the remaining nodes according to their available resources and other rules.

Currently, CPU and memory are the two main resources the scheduling algorithm considers, but the way they are considered is not as simple as "the more CPU and memory left, the higher the score."

The log records two of the scoring methods: LeastResourceAllocation and BalancedResourceAllocation.

  • The former calculates, as if the pod had been scheduled onto the node, the ratio of the node's remaining CPU and memory to its total CPU and memory; the higher the ratio, the higher the score.
  • The latter calculates the absolute difference between the node's CPU usage fraction and its memory usage fraction; the larger the difference, the lower the score.

One of these methods tends to pick nodes with lower resource usage, while the other prefers nodes whose two resource usages are closest to each other. The two are somewhat contradictory, and in the end a weight is used to balance the two factors.
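Under those descriptions, the two scores might be sketched like this. This is a simplification on the 0-10 scale the scheduler uses; the real implementations differ in details such as how requests are aggregated:

```go
package main

import (
    "fmt"
    "math"
)

// cpuReq/memReq are the totals that would be requested on the node
// after placing the pod; cpuCap/memCap are the node's capacity.

// leastRequestedScore: the more resources left free, the higher the
// score (average of the two free fractions, scaled to 0-10).
func leastRequestedScore(cpuReq, cpuCap, memReq, memCap float64) float64 {
    cpuFree := (cpuCap - cpuReq) / cpuCap
    memFree := (memCap - memReq) / memCap
    return (cpuFree + memFree) / 2 * 10
}

// balancedResourceScore: the closer the CPU and memory usage
// fractions are to each other, the higher the score.
func balancedResourceScore(cpuReq, cpuCap, memReq, memCap float64) float64 {
    cpuFrac := cpuReq / cpuCap
    memFrac := memReq / memCap
    return (1 - math.Abs(cpuFrac-memFrac)) * 10
}

func main() {
    // A node with 4 CPUs and 8 GB of memory, requesting 1 of each.
    fmt.Println(leastRequestedScore(1, 4, 1, 8))   // 8.125
    fmt.Println(balancedResourceScore(1, 4, 1, 8)) // 8.75
}
```

The final node score would then be a weighted sum of items like these, which is exactly the calculation examined in the next section.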

Besides resources, the preference stage considers other factors, such as the affinity between a pod and a node, or, when a service consists of multiple identical pods, how well those pods are spread across different nodes, which is a strategy for ensuring high availability.

score

Finally, the scheduling algorithm multiplies each score by its weight and sums the results to obtain each node's final score. Because the test cluster uses the default scheduling algorithm, which sets the weight of the scoring items appearing in the log to 1, summing the logged scoring items should give the three nodes final scores of 29, 28, and 29.

However, the scores in the log output differ from the ones we calculated ourselves. The reason is that the log does not print every scoring item; our guess is that the missing one is NodePreferAvoidPodsPriority, whose weight is 10000 and which gives each node 10 points. Adding that in reproduces the final scores in the log output.

conclusion

In this article, using a simple containerized web application as the example, we focused on how a client is authenticated by the Kubernetes cluster's API Server, and on how the containerized application gets scheduled to a suitable node.

During the analysis, we set aside convenient tools such as kubectl and the console, and instead used lower-level experiments, such as dissecting the KubeConfig file and studying the scheduler's logs, to analyze how the authentication and scheduling algorithms work. Hopefully this helps you understand Kubernetes clusters a little better.
