How do Kubernetes network add-ons work?

There are many kinds of container network solutions, and it is clearly impractical to adapt to each network implementation one by one, so CNI was invented to be compatible with many of them. CNI is short for Container Network Interface; it is a standard, universal interface used to connect container management systems and network plugins.

Simply put, the container runtime provides a network namespace for the container, and the network plugin is responsible for inserting a network interface into that namespace and doing the necessary configuration on the host. Finally, it configures the IP address and routes of the interface inside the namespace.
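As a concrete reference point, the hand-off between the runtime and the plugin is wired up through a couple of well-known paths on each node; a minimal sketch, assuming the conventional default locations:

# CNI configuration files read by the kubelet / container runtime
ls /etc/cni/net.d/
# CNI plugin binaries (bridge, host-local, flannel, calico, ...)
ls /opt/cni/bin/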

Therefore, the main job of a network plugin is to provide the network environment for containers, including assigning an IP address to each Pod and configuring routing to ensure connectivity within the cluster. Popular network plugins include Flannel and Calico.

Flannel

Flannel mainly provides an overlay network within the cluster. It supports three back-end implementations: UDP mode, VXLAN mode, and host-gw mode.
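Which backend a cluster actually uses is defined in Flannel's net-conf.json. Assuming the standard kube-flannel manifest (the ConfigMap name and namespace may differ per installation), it can be inspected like this; the output shown is only illustrative:

kubectl -n kube-system get configmap kube-flannel-cfg \
  -o jsonpath='{.data.net-conf\.json}'
# {
#   "Network": "10.244.0.0/16",
#   "Backend": { "Type": "vxlan" }
# }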

UDP mode

UDP mode was one of the earliest methods supported by the Flannel project, and also one of the least efficient. This mode provides a layer-3 overlay network: it encapsulates IP packets in UDP at the sending end, decapsulates them at the receiving end to obtain the original IP packets, and forwards them to the target containers. The working principle is shown in the figure below.

When Pod1 on Node1 requests Pod2 on Node2, the traffic flows as follows:

  1. The process in Pod1 initiates a request and sends out an IP packet.
  2. The IP packet enters the cni0 bridge through Pod1's veth device pair.
  3. Because the destination IP of the packet is not on Node1, the packet enters the flannel0 device according to the routing rules created on the node (a sketch of these rules follows the list).
  4. The flanneld process receives the packet, determines which node it should go to, and encapsulates it in a UDP packet.
  5. Finally, the UDP packet is sent to Node2 through the gateway on Node1.
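A minimal sketch of the routing rules step 3 relies on, assuming the usual 10.244.0.0/16 cluster CIDR (these exact values are not taken from the article's environment):

# on Node1, UDP mode
ip route
# 10.244.0.0/24 dev cni0       <- local Pod subnet goes to the bridge
# 10.244.0.0/16 dev flannel0   <- other nodes' Pod subnets go to the TUN device
# flanneld's UDP traffic uses port 8285 by default, so it can be observed with:
tcpdump -i ens33 -nn udp port 8285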

flannel0 is a TUN device (tunnel device). In Linux, a TUN device is a virtual network device that works at the network layer; it passes IP packets between the operating system kernel and a user-space application.

As you can see, the reason for the poor performance of this mode is that the whole encapsulation process is done by the flanneld program, i.e. in user space, which means each packet crosses from kernel space to user space and back again. Context switching and user-space processing are costly, and on top of that UDP mode pays the extra cost of encapsulating and decapsulating every packet.

VXLAN mode

VXLAN, or Virtual Extensible LAN, is a network virtualization technology supported by the Linux kernel. By taking advantage of this kernel feature, encapsulation and decapsulation can be done entirely in kernel space, which is enough to build an overlay network. Its working principle is shown in the figure below:

In VXLAN mode, flannel creates a VTEP (VXLAN Tunnel End Point) device named flannel.1. Just as in UDP mode, it encapsulates layer-2 data frames into UDP packets and forwards them; unlike UDP mode, the entire encapsulation process is completed in kernel space.
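The VTEP can be inspected directly on a node. A sketch of typical output, assuming Flannel's defaults (VNI 1, UDP port 8472) and the Node1 address used later in this article:

ip -d link show flannel.1
# flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
#     vxlan id 1 local 192.168.50.10 dev ens33 srcport 0 0 dstport 8472 nolearning ...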

When Pod1 on Node1 requests Pod2 on Node2, the traffic flows as follows:

  1. The process in Pod1 initiates a request and sends out an IP packet.
  2. The IP packet enters the cni0 bridge through Pod1's veth device pair.
  3. Because the destination IP of the packet is not on Node1, the packet enters the flannel.1 device according to the routing rules created on the node.
  4. flannel.1 adds a destination MAC address to the original IP packet and encapsulates it into a layer-2 data frame; the kernel then wraps the frame into a UDP packet (the ARP and FDB entries this relies on are sketched after the list).
  5. Finally, the UDP packet is sent to Node2 through the gateway on Node1.
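Step 4 works because flanneld pre-programs the ARP and FDB entries for every remote VTEP instead of relying on dynamic learning; a hedged sketch with illustrative values:

# remote flannel.1 IP -> remote VTEP MAC (used to build the inner frame)
ip neigh show dev flannel.1
# 10.244.1.0 lladdr ee:be:64:58:2a:9f PERMANENT
# remote VTEP MAC -> remote node IP (used as the outer UDP destination)
bridge fdb show dev flannel.1
# ee:be:64:58:2a:9f dst 192.168.50.11 self permanent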

Packet capture validation

Deploy an Nginx Pod1 on Node1 and an Nginx Pod2 on Node2, then curl port 80 of Pod2 from inside Pod1's container.
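One hedged way to reproduce this test (the Pod names are assumptions, and scheduling is pinned with nodeName for simplicity):

kubectl run pod1 --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeName":"node1"}}'
kubectl run pod2 --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeName":"node2"}}'
kubectl get pod -o wide                    # note the Pod IPs
kubectl exec pod1 -- curl -s <pod2-ip>:80  # replace <pod2-ip> with Pod2's IP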

The cluster network environment is as follows:

Node1 ens33: 192.168.50.10
Pod1 veth: 10.244.0.7
Node1 cni0: 10.244.0.1/24
Node1 flannel.1: 10.244.0.0/32
Node2 ens33: 192.168.50.11
Pod2 veth: 10.244.1.9
Node2 cni0: 10.244.1.1/24
Node2 flannel.1: 10.244.1.0/32

The routing information on node1 is as follows:

➜  ~ ip route
default via 192.168.50.1 dev ens33
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.50.0/24 dev ens33 proto kernel scope link src 192.168.50.10 metric 100

Packet capture on ens33 of Node1:

You can only see UDP packets whose source IP is Node1's address and whose destination IP is Node2's address; flannel.1 sends the VXLAN-encapsulated data over UDP (port 8472). To inspect them in Wireshark, decode the UDP packets as VXLAN via Analyze -> Decode As.
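A hedged command-line equivalent of this capture and of the Wireshark "Decode As" step (interface name and port taken from the text above):

# outer UDP/VXLAN packets between the two nodes
tcpdump -i ens33 -nn udp port 8472
# tshark can apply the VXLAN decoding directly
tshark -i ens33 -f "udp port 8472" -d udp.port==8472,vxlan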

Then look at the decoded packet:

You can see that the source IP is Pod1's IP, the destination IP is Pod2's IP, and the inner IP packet is encapsulated inside a UDP packet.

host-gw mode

The last mode, host-gw, is a pure layer-3 networking scheme. It works by setting the next hop of each Flannel subnet to the IP address of the host that owns that subnet, so that the host acts as the gateway on the communication path. In this way, IP packets reach the destination host directly over the layer-2 network. For this reason, host-gw mode requires layer-2 connectivity between the cluster hosts, as shown in the following figure.

The routing information on the hosts is set by flanneld. Since the Flannel subnets and host information are stored in etcd, flanneld only needs to watch for changes to this data and update the routing tables in real time. In this mode, container communication avoids the extra performance cost of encapsulation and decapsulation.
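A sketch of what the host routing table looks like in host-gw mode, assuming the same addressing as the VXLAN example above (not captured from the article's cluster); note that the next hop is the peer node's own IP and no flannel.1 device is involved:

ip route
# 10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
# 10.244.1.0/24 via 192.168.50.11 dev ens33   <- Node2's Pod subnet, next hop = Node2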

When Pod1 on Node1 requests Pod2 on Node2, the traffic flows as follows:

  1. The process in Pod1 initiates a request and sends out an IP packet, which is encapsulated into a frame as it goes from the network layer down to the link layer.
  2. According to the routing rules on the host, the data frame travels from Node1 to Node2 over the hosts' layer-2 network.

Calico

Calico does not use the CNI bridge mode; instead, it treats the nodes as border routers that form a fully connected network and exchange routes via BGP. Calico's CNI plugin therefore also needs to configure a routing rule on the host for each container's veth pair device, so that incoming IP packets are delivered to the container.

Calico consists of three components:

  1. CNI plugin: the part that integrates Calico with Kubernetes.
  2. Felix: responsible for inserting routing rules on the host (i.e. writing to the FIB, the Linux kernel's forwarding information base) and maintaining the network devices Calico needs.
  3. BIRD: a BGP client (which can also act as a BGP route reflector) that distributes routing information within the cluster.

All three components are installed via a DaemonSet: the CNI plugin is installed by an initContainer, while Felix and BIRD run as part of the same Pod.
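Assuming the standard Calico manifest (names and labels may differ per installation), this layout can be checked like this:

kubectl -n kube-system get daemonset calico-node
kubectl -n kube-system get pod -l k8s-app=calico-node \
  -o jsonpath='{.items[0].spec.initContainers[*].name}{"\n"}{.items[0].spec.containers[*].name}{"\n"}'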

How it works

The BGP used by Calico, short for Border Gateway Protocol, is a protocol for sharing routing information between nodes in large-scale networks. It is a decentralized routing protocol used to maintain routing information between different "autonomous systems" in large-scale data centers.

Since there is no CNI bridge, Calico's CNI plugin sets up a veth pair device for each container and places one end on the host. It also configures a routing rule on the host for each container's veth device so that incoming IP packets are delivered to the container. As shown below:

You can use calicoctl to check node1's connections to the other nodes:

~ calicoctl get no
NAME
node1
node2
node3

~ calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+---------------+-------------------+-------+------------+-------------+
| 192.168.50.11 | node-to-node mesh | up    | 2020-09-28 | Established |
| 192.168.50.12 | node-to-node mesh | up    | 2020-09-28 | Established |
+---------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.

You can see that there are three nodes in the Calico cluster, and node1 is connected to the other two nodes in "node-to-node mesh" mode. The routing information on node1 is as follows:

~ ip route
default via 192.168.50.1 dev ens33 proto static metric 100
10.244.104.0/26 via 192.168.50.11 dev ens33 proto bird
10.244.135.0/26 via 192.168.50.12 dev ens33 proto bird
10.244.166.128 dev cali717821d73f3 scope link
blackhole 10.244.166.128/26 proto bird
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.50.0/24 dev ens33 proto kernel scope link src 192.168.50.10 metric 100

The second routing rule says that packets destined for the 10.244.104.0/26 subnet are sent out through device ens33 to the gateway 192.168.50.11; the rule was installed by BIRD ("proto bird"). This is what directs requests for Pods on Node2. The third routing rule is similar.

Packet capture validation

As above, send an HTTP request from Pod1 on Node1 to Pod2 on Node2.

The cluster network environment is as follows:

Node1 ens33: 192.168.50.10
Pod1 IP: 10.244.166.128
Node2 ens33: 192.168.50.11
Pod2 IP: 10.244.104.1

Packet capture on ens33 of Node1:
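Since no tunnel device is involved in this mode, the packets on ens33 carry the Pod IPs directly; a hedged filter to reproduce the capture, using the addresses from the environment above:

tcpdump -i ens33 -nn host 10.244.104.1 and tcp port 80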

IPIP mode

IPIP mode solves the problem of two nodes not being on the same subnet. Simply set the environment variable CALICO_IPV4POOL_IPIP of the DaemonSet named calico-node to "Always", as follows:

            - name: CALICO_IPV4POOL_IPIP
              value: "Always"
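Alternatively, the same setting can be flipped imperatively; a hedged one-liner assuming the stock calico-node DaemonSet:

kubectl -n kube-system set env daemonset/calico-node CALICO_IPV4POOL_IPIP=Always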

In IPIP mode, Calico uses the tunl0 device, which is an IP tunnel (IP-in-IP) device. After an IP packet enters tunl0, the kernel encapsulates the original IP packet inside a host IP packet whose destination address is the next hop, i.e. Node2's IP address. Since the hosts are connected at layer 3 through a router, the encapsulated packet can leave Node1, pass through the router and reach Node2. As shown in the figure below.

Because IPIP mode requires extra encapsulation and decapsulation, it affects the network performance of the cluster. Therefore, when the cluster's layer-2 network is reachable, it is recommended not to use IPIP mode.
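A couple of hedged commands to observe this on a node (the device is created by Calico; IP-in-IP traffic uses IP protocol number 4):

# tunl0 is an ipip tunnel device
ip -d link show tunl0
# the outer IP-in-IP packets leaving the node can be captured with
tcpdump -i ens33 -nn ip proto 4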

Look at the routing information on node1:

~ ip route
default via 192.168.50.1 dev ens33 proto static metric 100
10.244.104.0/26 via 192.168.50.11 dev tunl0 proto bird onlink
10.244.135.0/26 via 192.168.50.12 dev tunl0 proto bird onlink
blackhole 10.244.166.128/26 proto bird
10.244.166.129 dev calif3c799362a5 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.50.0/24 dev ens33 proto kernel scope link src 192.168.50.10 metric 100

As you can see, packets destined for the Pods on Node2 are sent through tunl0 to the gateway 192.168.50.11.

Packet capture validation

Send an HTTP request from Pod1 on Node1 to Pod2 on Node2.

The cluster network environment is as follows:

Node1 ens33: 192.168.50.10
Node1 tunl0: 10.244.166.128/32
Pod1 IP: 10.244.166.129
Node2 ens33: 192.168.50.11
Node2 tunl0: 10.244.104.0/32
Pod2 IP: 10.244.104.2

Packet capture on tunl0:

Packet capture on ens33 of Node1:

As you can see, the IP packet that enters the tunl0 device is encapsulated in another IP packet whose destination IP is Node2's address and whose source IP is Node1's address.

Conclusion

There are many implementations of Kubernetes cluster network plugins. This article analyzed how Flannel and Calico, the two most common in the community, work, and traced the path of network packets in the scenario where Pods on different nodes communicate with each other.

Flannel mainly provides overlay network solutions. Its UDP mode has gradually been abandoned by the community because of the poor performance caused by the multiple context switches involved in encapsulation and decapsulation. VXLAN mode performs encapsulation and decapsulation in kernel space, performs better than UDP mode, and is the most commonly used mode. host-gw mode involves no encapsulation at all, so its performance is relatively high, but it requires layer-2 connectivity between the nodes.

Calico does not use a CNI bridge; instead, it exchanges routes via BGP. When the layer-2 network is not reachable, IPIP mode can be used; however, because it involves encapsulation and decapsulation, its performance is relatively poor, comparable to Flannel's VXLAN mode.