Overview

There are several ways to expose a Kubernetes service outside the cluster. One of them is a ClusterIP Service with externalIPs, with kube-proxy running in iptables mode. The externalIPs can be set to the IPs of a few dedicated worker nodes (drained of workloads so that they act purely as traffic-forwarding nodes, not compute nodes), and those nodes then serve as real servers (RS) behind an LVS VIP for load balancing. Clients access services via VIP:port, with different ports distinguishing different services. Compared with a NodePort Service, where node_ip:port is reachable on every worker node, this is more efficient and easier to manage. But the question remains: how does a packet sent to node_ip:port from outside the cluster, or to cluster_ip:port from inside the cluster, find its way to a pod IP?
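As a rough illustration of that exposure pattern, the sketch below configures an LVS VIP with two traffic-forwarding worker nodes as real servers using ipvsadm. All IPs, the port, and the direct-routing mode choice are hypothetical placeholders, not the actual production setup described in this article.

# Run on the LVS director. All addresses below are hypothetical placeholders.
VIP=10.10.0.100        # the VIP that clients access
RS1=10.10.0.11         # forwarding worker node 1 (also listed in the Service's externalIPs)
RS2=10.10.0.12         # forwarding worker node 2
PORT=8088              # one port per exposed service

# Create the virtual service on VIP:PORT with round-robin scheduling
sudo ipvsadm -A -t $VIP:$PORT -s rr

# Register the two forwarding nodes as real servers (direct-routing mode)
sudo ipvsadm -a -t $VIP:$PORT -r $RS1:$PORT -g
sudo ipvsadm -a -t $VIP:$PORT -r $RS2:$PORT -g

# Inspect the resulting virtual server table
sudo ipvsadm -Ln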

Our Kubernetes clusters use Calico as the CNI plugin, peered with ToR (Top of Rack) switches: each worker node establishes a BGP peering with its ToR switch, and the ToR switch in turn peers with the upper-layer core switch, so that pod IPs are directly routable on the intranet. But once a packet knows the pod IP, how does it actually reach the pod?
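For reference, a Calico-to-ToR peering of this kind is typically declared by disabling the full node-to-node mesh and adding a BGPPeer resource per node. The sketch below is a minimal example; the AS numbers, node name, and peer IP are hypothetical and not the actual configuration of the cluster described here.

# Minimal sketch: disable the full BGP mesh (hypothetical AS number)
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 64512
EOF

# Peer one worker node with its ToR switch (hypothetical node name, peer IP, AS number)
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: worker-a-tor
spec:
  node: worker-a
  peerIP: 10.0.0.1
  asNumber: 64513
EOF

# Check that the BGP session with the ToR switch is Established
sudo calicoctl node status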

These questions boil down to one: how does a packet make its way, hop by hop, to the pod?

I’ve been thinking about these questions for a long time.

Principle analysis

The answer is actually simple. A client accesses the service via VIP:port or node_ip:port. When the packet arrives at the node that owns that node_ip, say worker A, iptables rules progressively translate the destination into a pod IP. Once the pod IP is known, the core switch and ToR switches already have routes to the pod IP's CIDR thanks to the Calico BGP deployment, so the packet is forwarded to the worker node hosting that pod, say worker B. On worker B, Calico has written routing rules and virtual interfaces, so the packet crosses from the host network namespace into the pod network namespace over a veth pair and lands in the target pod.

Install minikube with Calico

minikube start --network-plugin=cni --cni=calico

#or
minikube start --network-plugin=cni
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

This example deploys nginx with two replicas, plus a ClusterIP Service with externalIPs and a NodePort Service.


---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-1
  labels:
    app: nginx-demo-1
spec:
  replicas: 2
  template:
    metadata:
      name: nginx-demo-1
      labels:
        app: nginx-demo-1
    spec:
      containers:
        - name: nginx-demo-1
          image: nginx:1.17.8
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              port: 80
              path: /index.html
            failureThreshold: 10
            initialDelaySeconds: 10
            periodSeconds: 10
      restartPolicy: Always
  selector:
    matchLabels:
      app: nginx-demo-1
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-demo-1
spec:
  selector:
    app: nginx-demo-1
  ports:
    - port: 8088
      targetPort: 80
      protocol: TCP
  type: ClusterIP
  externalIPs:
    - 192.168.64.57 # the worker node IP; on minikube you can get it with "minikube ip"
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-demo-2
spec:
  selector:
    app: nginx-demo-1
  ports:
    - port: 8089
      targetPort: 80
  type: NodePort
---

After the deployment completes, the application can be reached either through the ClusterIP Service with externalIPs or through the NodePort Service.
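To verify, the commands below list the Services and curl each access path. The ClusterIP, external IP, and NodePort values shown here (10.196.52.1, 192.168.64.57, 31755) are the ones that appear later in this article; the ClusterIP and NodePort are allocated dynamically, so substitute whatever kubectl reports in your cluster.

# List the two Services and note their ClusterIP / NodePort
kubectl get svc nginx-demo-1 nginx-demo-2 -o wide

# externalIP ClusterIP Service: reachable from outside the cluster on the node IP
curl http://192.168.64.57:8088/index.html

# ClusterIP: reachable from inside the cluster (run on a node, e.g. after "minikube ssh", or in a pod)
curl http://10.196.52.1:8088/index.html

# NodePort Service: reachable from outside the cluster on the allocated node port
curl http://192.168.64.57:31755/index.html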

Kube-proxy writes custom iptables rules

When a packet reaches a service via node_ip:port or cluster_ip:port, the worker node it lands on rewrites the destination address to a pod IP via DNAT (Destination Network Address Translation); the reply packet is rewritten back via SNAT (Source Network Address Translation). The two access flows, as described in About Kubernetes Services, are shown below:

ClusterIP Service access flow:

NodePort Service access flow:
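One way to watch the translation happen is to look at the kernel's connection-tracking table while curling the service: each entry records both the original destination (the ClusterIP) and the pod IP that DNAT selected. A small sketch, assuming the conntrack tool is installed and reusing the ClusterIP from this example:

# In one terminal, generate traffic to the service
curl http://10.196.52.1:8088/index.html

# In another terminal, list conntrack entries whose original destination is the ClusterIP.
# The reply side of each entry shows the pod IP chosen by DNAT.
sudo conntrack -L -p tcp --orig-dst 10.196.52.1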

Since kube-proxy runs in iptables mode, these DNAT/SNAT rules are installed by the kube-proxy process through the iptables command. iptables organizes its large number of rules into chains and tables: five built-in chains (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING), four tables (raw, mangle, nat, filter), plus user-defined chains. A packet traverses the kernel through these five chains and four tables in the standard netfilter order.
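Since the original flow chart is not reproduced here, the commands below dump each of the four tables; the comments note which of the five built-in chains each table hooks into.

# raw table: PREROUTING, OUTPUT (connection-tracking exemptions)
sudo iptables -t raw -L -n -v

# mangle table: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING (packet marking/alteration)
sudo iptables -t mangle -L -n -v

# nat table: PREROUTING, INPUT, OUTPUT, POSTROUTING (DNAT/SNAT -- this is where kube-proxy works)
sudo iptables -t nat -L -n -v

# filter table: INPUT, FORWARD, OUTPUT (accept/drop decisions)
sudo iptables -t filter -L -n -v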

The kube-proxy process defines a KUBE-SERVICES chain in the nat table and hooks it into the PREROUTING chain. You can inspect the relevant chains with the following commands:

sudo iptables -v -n -t nat -L PREROUTING | grep KUBE-SERVICES

sudo iptables -v -n -t nat -L KUBE-SERVICES

sudo iptables -v -n -t nat -L KUBE-NODEPORTS

Accessing the service via cluster_ip:port (10.196.52.1:8088) from inside the cluster, or via external_ip:port (192.168.64.57:8088) from outside, matches the KUBE-SVC-JKOCBQALQGD3X3RT chain, which corresponds to the nginx-demo-1 Service. Accessing cluster_ip:port 10.196.89.31:8089 from inside the cluster, or nodePort_ip:port 192.168.64.57:31755 from outside, matches the KUBE-SVC-6JCCLZMUQSW27LLD chain, which corresponds to the nginx-demo-2 Service:

Each KUBE-SVC-xxx chain then jumps to its KUBE-SEP-xxx chains. Because the Deployment has two replicas, there are two KUBE-SEP-xxx chains, and the packet jumps to either of them with 50% probability, a round-robin-style load balancing that kube-proxy implements with the iptables statistic module. The selected KUBE-SEP-xxx chain finally DNATs the packet to a pod IP such as 10.217.120.72:80. In short, kube-proxy calls the iptables command to set up these chains from the Service and Endpoints objects, so that the packet's destination is rewritten, step by step, to a pod IP, and the packet's next hop becomes that pod IP:

sudo iptables -v -n -t nat -L KUBE-SVC-JKOCBQALQGD3X3RT
sudo iptables -v -n -t nat -L KUBE-SEP-CRT5ID3374EWFAWN

sudo iptables -v -n -t nat -L KUBE-SVC-6JCCLZMUQSW27LLD
sudo iptables -v -n -t nat -L KUBE-SEP-SRE6BJUIAABTZ4UR
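Stripped of details, the rules kube-proxy generates for nginx-demo-1 look roughly like the sketch below. Only the chain names and pod IP already shown above come from this cluster; the second KUBE-SEP chain name and the second pod IP are hypothetical placeholders.

# Illustrative only -- do not run against a live cluster; kube-proxy owns these chains.
# Service chain: pick one endpoint chain with 50% probability (statistic module),
# otherwise fall through to the last endpoint chain.
iptables -t nat -A KUBE-SVC-JKOCBQALQGD3X3RT \
  -m statistic --mode random --probability 0.5 \
  -j KUBE-SEP-CRT5ID3374EWFAWN
iptables -t nat -A KUBE-SVC-JKOCBQALQGD3X3RT \
  -j KUBE-SEP-XXXXXXXXXXXXXXXX          # hypothetical second endpoint chain

# Endpoint chains: DNAT to the selected pod IP:port
iptables -t nat -A KUBE-SEP-CRT5ID3374EWFAWN \
  -p tcp -m tcp -j DNAT --to-destination 10.217.120.72:80
iptables -t nat -A KUBE-SEP-XXXXXXXXXXXXXXXX \
  -p tcp -m tcp -j DNAT --to-destination 10.217.120.73:80   # hypothetical second pod IP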

In a word, whether the service is accessed via cluster_ip:port, external_ip:port, or node_ip:port, the packet finds its next hop, a pod IP, through the chains maintained by the kube-proxy process.

But how does the packet know which node that pod IP lives on?

Calico writes custom routes and virtual interfaces

As mentioned above, we deploy Calico in BGP mode so that pod IPs are routable outside the cluster: the switches carry node-level routes for each node's pod CIDR. Running dis bgp routing-table on the switch shows routes like the one below, meaning that traffic for the 10.20.30.40/26 pod CIDR should be sent to worker B's IP as the next hop. These routes are advertised by the BIRD process (the BGP client) running on every worker node, and the switches learn them over BGP:

Network            NextHop
...
10.20.30.40/26     10.203.30.40
...
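The node-side view of the same information can be checked on any worker: routes that BIRD has learned over BGP are installed with protocol "bird", and each remote node's pod CIDR points at that node's IP as the next hop. A small sketch (the exact routes and interfaces will differ per cluster):

# Routes installed by Calico's BIRD daemon: one entry per remote node's pod CIDR,
# with the remote node's IP as the next hop
ip route show proto bird

# Ask the kernel which route (and therefore which node or interface) a given pod IP resolves to
ip route get 10.217.120.72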

So once the packet's destination has been DNATed to pod IP 10.217.120.72:80 (the pod nginx-demo-1-7f67f8bdd8-fxptt in this example), it is easily routed to worker B, which in this article's setup is simply the minikube node. Looking at that node's routing table and interfaces, we find the interface cali1087c975dd9 on the host network namespace side, with interface index 13. That index is important: it tells you which pod network namespace holds the other end of the veth pair, which here is eth0 inside pod nginx-demo-1-7f67f8bdd8-fxptt.

# The nginx image has neither ifconfig nor ip, so run a nicolaka/netshoot container
# that joins the nginx container's namespaces.
docker ps -a | grep nginx
export CONTAINER_ID=f2ece695e8b9   # the nginx container id found above

# Image: github.com/nicolaka/netshoot
docker run -it --network=container:$CONTAINER_ID --pid=container:$CONTAINER_ID --ipc=container:$CONTAINER_ID nicolaka/netshoot:latest ip -c addr

# For comparison, list the interfaces in the host network namespace
ip -c addr
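If you need to match host-side cali* interfaces to pods in the general case, a common trick (not specific to Calico) is to compare interface indexes: inside the pod, /sys/class/net/eth0/iflink holds the index of the peer interface in the host network namespace. A sketch reusing the netshoot container from above:

# Inside the pod's network namespace: print the peer (host-side) interface index of eth0
docker run --rm --network=container:$CONTAINER_ID nicolaka/netshoot:latest cat /sys/class/net/eth0/iflink

# On the host: find the interface with that index (13 in this article's example) --
# it is the cali* end of the veth pair
ip -o link | grep '^13:'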

The routing rules and virtual interfaces above are created by Calico CNI's calico network plugin, while pod IPs and each node's pod CIDR block are allocated and managed by the calico-ipam plugin, with the data written to the Calico datastore. How exactly the calico network plugin and calico-ipam plugin work is something I will dig into and write up later.
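If you want a quick look at that IPAM data without diving into the plugins, calicoctl can print the IP pools and the per-node blocks carved out of them. A small sketch (pool CIDRs and block sizes depend on your installation):

# Overall IP pool usage
calicoctl ipam show

# The address blocks allocated from the pool, and which node owns each block
calicoctl ipam show --show-blocks

# Details for a single pod IP: which block it came from and what is using it
calicoctl ipam show --ip=10.217.120.72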

Conclusion

Whether the service is accessed via cluster_ip:port, external_ip:port, or node_ip:port, the iptables rules installed by the kube-proxy process rewrite the packet's destination to the corresponding pod IP. Thanks to the Calico BGP deployment, the packet is then routed to the worker node hosting the target pod, where the node's routing table and Calico's virtual interfaces carry it from the host network namespace into the pod network namespace. The questions about Service and Calico that had bothered me for so long are finally answered.

References

About Kubernetes Services

Design and implementation of kube-proxy

Design and implementation of kube-proxy iptables mode

Principle and implementation of kube-proxy IPVS mode