In Kubernetes, networking is essential for containers to communicate with each other. Kubernetes does not implement the container network itself; instead, any implementation can be plugged in freely through a plug-in mechanism. A container network implementation should satisfy the following basic principles:

  • Pods can communicate with each other directly, no matter which nodes they run on, without NAT address translation.
  • Nodes and Pods can communicate with each other, and a Pod can access any network without restriction.
  • A Pod has its own independent network stack; the address a Pod sees for itself is the same address the outside world sees, and all containers within a Pod share the same network stack.

The network stack of a Linux container is isolated in its own Network Namespace. A Network Namespace contains the network interfaces, loopback device, routing table, and iptables rules, which together constitute the basic environment from which a server process initiates requests.

To implement a container network, you need the following Linux networking features:

  • Network namespace: isolates independent network protocol stacks into different namespaces, which cannot communicate with each other by default.
  • Veth pair: veth device pairs are introduced to communicate between different network namespaces. They always come in pairs of two virtual network interfaces (veth peers), and data sent into one end is always received at the other end.
  • Iptables/Netfilter: Netfilter implements the various hook rules (filtering, modification, dropping, and so on) in the kernel, while iptables is a user-space tool that helps maintain the Netfilter rule tables in the kernel. Together they provide the flexible packet-processing mechanism used throughout the Linux network protocol stack.
  • Bridge: a bridge is a Layer 2 virtual device, similar to a switch, that forwards data frames to different ports on the bridge based on learned MAC addresses.
  • Routing: Linux includes a complete routing function. When the IP layer sends or forwards data, it uses the routing table to decide where the data should be sent.

Based on the above, how do containers on the same host communicate with each other? We can simply think of two containers as two hosts connected by a network cable; if multiple hosts need to communicate, they do so through a switch. In Linux, we can forward data through a bridge. For containers, this is implemented by the docker0 bridge: any container connected to docker0 can communicate with the others through it. To connect a container to the docker0 bridge we also need the virtual device Veth pair, which acts like the network cable connecting the container to the bridge.
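To make this concrete, here is a minimal sketch (run as root) that reproduces the docker0 pattern by hand using the features listed above: a bridge, a network namespace standing in for a container, and a veth pair as the "cable". The names br-demo and ns1 and the 172.18.0.0/24 subnet are made up for illustration.

# 1. Create a bridge to play the role of docker0
ip link add br-demo type bridge
ip addr add 172.18.0.1/24 dev br-demo
ip link set br-demo up

# 2. Create a network namespace to play the role of a container
ip netns add ns1

# 3. Create a veth pair: one end stays on the host and plugs into the bridge,
#    the other end is moved into the namespace
ip link add veth-host type veth peer name veth-ns
ip link set veth-host master br-demo
ip link set veth-host up
ip link set veth-ns netns ns1

# 4. Configure the "container" end and give it a default route via the bridge
ip netns exec ns1 ip addr add 172.18.0.2/24 dev veth-ns
ip netns exec ns1 ip link set veth-ns up
ip netns exec ns1 ip link set lo up
ip netns exec ns1 ip route add default via 172.18.0.1

# 5. The namespace can now reach the bridge address, just as a container reaches docker0
ip netns exec ns1 ping -c 1 172.18.0.1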

We start a container:

$ docker run -d --name c1 hub.pri.ibanyu.com/devops/alpine:v3.8 /bin/sh

Then view the NIC devices inside the container:

$ docker exec -it c1 /bin/sh
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:14 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1172 (1.1 KiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.0.1      0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0

You can see that there is an eth0 network card, which is the in-container end of the Veth pair. Running route -n shows the container's routing table: eth0 is the default route's exit device, and all requests to 172.17.0.0/16 also go out through eth0. Now let's look at the other end of the Veth pair by checking the host's network devices:

$ ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:6aff:fe46:93d2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:6a:46:93:d2  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.100.0.2  netmask 255.255.255.0  broadcast 10.100.0.255
        inet6 fe80::5400:2ff:fea3:4b44  prefixlen 64  scopeid 0x20<link>
        ether 56:00:02:a3:4b:44  txqueuelen 1000  (Ethernet)
        RX packets 7788093  bytes 9899954680 (9.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5512037  bytes 9512685850 (8.8 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 32  bytes 2592 (2.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 32  bytes 2592 (2.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

veth20b3dac: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::30e2:9cff:fe45:329  prefixlen 64  scopeid 0x20<link>
        ether 32:e2:9c:45:03:29  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 656 (656.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

We can see that the other end of the container's Veth pair is a virtual NIC on the host named veth20b3dac, and brctl shows that this NIC is attached to the docker0 bridge:

# brctl show
docker0        8000.02426a4693d2    no        veth20b3dac
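Besides brctl, another way to confirm which host-side veth belongs to a given container is to compare interface indexes: inside the container, /sys/class/net/eth0/iflink holds the index of eth0's peer device on the host. A quick sketch; the index value 9 below is only illustrative:

# Inside the container: the interface index of eth0's peer on the host
$ docker exec -it c1 cat /sys/class/net/eth0/iflink
9

# On the host: look up the interface with that index
$ ip link | grep "^9:"
9: veth20b3dac@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...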

Then we start another container and see if the second can be pinged from the first container.

$ docker run -d --name c2 -it hub.pri.ibanyu.com/devops/alpine:v3.8 /bin/sh
$ docker exec -it c1 /bin/sh
/ # ping 172.17.0.3
PING 172.17.0.3 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.291 ms
64 bytes from 172.17.0.3: seq=1 ttl=64 time=0.129 ms
64 bytes from 172.17.0.3: seq=2 ttl=64 time=0.142 ms
64 bytes from 172.17.0.3: seq=3 ttl=64 time=0.169 ms
64 bytes from 172.17.0.3: seq=4 ttl=64 time=0.139 ms
^C
--- 172.17.0.3 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.129/0.185/0.291 ms

As can be seen, when we ping the destination IP 172.17.0.3, the second rule of the routing table (172.17.0.0/16) is matched. Its gateway is 0.0.0.0, which means this is a direct route: the destination is reached through Layer 2 forwarding.

To reach 172.17.0.3 over the Layer 2 network, we need to know its MAC address, so the first container sends an ARP broadcast to look up the MAC address by IP.

The other end of that Veth pair is attached to the docker0 bridge, which broadcasts the request to all the virtual NICs of the Veth pairs connected to it. The correct virtual NIC responds to the ARP message, and the bridge returns the reply to the first container.
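This resolution can be observed directly. A sketch, assuming the two containers from the example; the ARP-cache and bridge-table output below is abridged and illustrative:

# Inside c1: the ARP cache now holds the MAC address learned for 172.17.0.3
$ docker exec -it c1 arp -n
? (172.17.0.3) at 02:42:ac:11:00:03 [ether]  on eth0
? (172.17.0.1) at 02:42:6a:46:93:d2 [ether]  on eth0

# On the host: the bridge's forwarding table maps that MAC to a bridge port
$ brctl showmacs docker0
port no  mac addr           is local?  ageing timer
  1      02:42:ac:11:00:02  no         12.34
  2      02:42:ac:11:00:03  no          0.51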

The above is how different containers on the same host communicate through docker0, as shown below:

By default, a container process confined to a Network Namespace exchanges data with other Network Namespaces by means of a Veth pair device and the host bridge. Likewise, when you access a container's IP address from its host, the request packet first arrives at the docker0 bridge according to the routing rules, is then forwarded to the corresponding Veth pair device, and finally appears inside the container.

Cross-host network communication

Under Docker's default configuration, containers on different hosts cannot reach each other by IP address. To solve this problem, a number of networking solutions have emerged in the community. At the same time, to gain better control over network access, Kubernetes introduced CNI, the Container Network Interface. It is the standard interface through which Kubernetes calls network implementations: kubelet uses this API to invoke different network plug-ins and apply different network configurations. An implementation of this interface is a CNI plug-in, which implements the series of CNI API calls. Available plug-ins include Flannel, Calico, Weave, Contiv, and more.

In fact, the container network communication flow with CNI is the same as the basic network described above, except that CNI maintains a separate bridge instead of docker0. This bridge is called the CNI bridge, and its default device name on the host is cni0.

The design idea of CNI is that, after starting the Infra container, Kubernetes can directly call the CNI network plug-in to configure the expected network stack for the Infra container's Network Namespace.
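On each node, the plug-in chain that kubelet invokes is described by a configuration file under /etc/cni/net.d. The following Flannel-style example is only illustrative; the file name and values depend on how the cluster was set up:

$ cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}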

CNI plug-ins implement the network in one of three modes:

  • Overlay mode is implemented with tunneling technology: the container network is independent of the host network. When containers communicate across hosts, the container packet is encapsulated inside packets of the underlying network, then decapsulated on the target machine and delivered to the target container. This mode does not depend on the implementation of the underlying network. Implementations include Flannel (UDP, VXLAN), Calico (IPIP), and so on.
  • In Layer 3 routing mode, containers and hosts are also on different network segments. Communication between containers relies on routing tables, without establishing tunnels between hosts. The limitation is that the hosts must be in the same Layer 2 LAN. Implementations include Flannel (host-gw), Calico (BGP), and so on.
  • In Underlay mode, the underlying network is responsible for connectivity. Container network and host network still belong to different segments, but they sit in the same network layer and at the same level: the whole network is routable at Layer 3, with no Layer 2 restriction, but it requires strong support from the underlying network. Implementations include Calico (BGP) and so on.

Flannel host-gw

As shown in the figure, when container-1 on Node1 sends data to container-2 on Node2, the routing table matches the following rule: 10.244.0.0/24 via 10.168.0.3 dev eth0. It means that IP packets destined for the 10.244.0.0/24 segment take 10.168.0.3 (Node2) as their next hop. Once the packet arrives at 10.168.0.3, Node2's routing table forwards it through the CNI bridge to container-2.

This is how host-gw works: the next hop for each remote Pod network segment is simply the IP address of the node where that Pod segment lives. The mapping between Pod segments and node IPs is stored in etcd or in Kubernetes; Flannel only needs to watch these data for changes and update the routing table dynamically. The biggest benefit of this mode is that it avoids the performance loss of extra packet encapsulation and decapsulation. Its main drawback: when the container IP packet is sent to the next hop, it must be wrapped in a Layer 2 frame addressed to that node. If the nodes are not on the same Layer 2 LAN, the frame goes to the Layer 3 gateway instead, and that gateway knows nothing about the container networks (unless Pod segment routes are statically configured on every gateway). Therefore, flannel host-gw requires that the cluster hosts can reach each other at Layer 2.
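Which backend Flannel uses, host-gw here or vxlan/udp for the overlay mode mentioned earlier, is selected in its net-conf.json, typically stored in a ConfigMap. An illustrative snippet; the ConfigMap name, namespace and Network CIDR are assumptions that depend on how Flannel was deployed:

$ kubectl -n kube-system get configmap kube-flannel-cfg -o jsonpath='{.data.net-conf\.json}'
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "host-gw"
  }
}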

To overcome this Layer 2 limitation, Calico offers a better solution. Calico's Layer 3 network mode is similar to Flannel's host-gw: it adds routing rules of the following form on each host:

<destination container IP segment> via <gateway IP address> dev eth0

The gateway IP address has a different meaning depending on the topology: if the destination host is reachable at Layer 2, it is the IP address of the node where the destination container runs; if the hosts are on different Layer 3 LANs, it is the gateway (switch or router) of the local host's subnet. Unlike Flannel, which maintains local routing information from data in Kubernetes or etcd, Calico distributes routing information for the whole cluster with the BGP dynamic routing protocol. BGP, the Border Gateway Protocol, is supported natively by Linux and is used to exchange routing information between autonomous systems in large-scale data centers. For our purposes it is enough to remember that BGP is simply a protocol for synchronizing and sharing routing information between nodes in a large-scale network, and that it replaces Flannel's mechanism for maintaining the host routing tables.

Calico is mainly composed of the following components:

  • Calico CNI plug-in: responsible for docking with Kubernetes; it is what kubelet calls.
  • Felix: maintains routing rules and the FIB forwarding information base on the host.
  • BIRD: distributes routing rules, acting like a router.
  • Confd: configuration management component.
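A quick way to see the BGP sessions BIRD maintains from a node is calicoctl; the peer addresses and timestamps below are illustrative:

$ calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.100.0.3   | node-to-node mesh | up    | 08:32:54 | Established |
| 10.100.1.2   | node-to-node mesh | up    | 08:32:55 | Established |
+--------------+-------------------+-------+----------+-------------+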

Calico also differs from Flannel host-gw in that it does not create a bridge device; instead, it maintains communication with each Pod through routing table entries, as shown below:

As you can see, Calico's CNI plug-in sets up a Veth pair device for each container and connects the other end to the host network namespace. Since there is no bridge, the CNI plug-in also has to configure a routing rule on the host for each container's Veth pair device so that incoming IP packets can be delivered. The routing rules look like this:

10.92.77.163 dev cali93a8a799fe1 scope link

This means that IP packets destined for 10.92.77.163 should be sent to the device cali93a8a799fe1, and from there into the corresponding container.

With such a Veth pair device, an IP packet sent by the container reaches the host through the Veth pair; the host then sends the packet to the correct gateway (10.100.1.3) according to the next-hop address in the routing table, from there to the target host, and finally to the target container. The corresponding route looks like:

10.92.160.0/23 via 10.106.65.2 dev bond0 proto bird

These routing rules are maintained and configured by Felix, while the routing information itself is distributed by Calico's BIRD component over BGP. Calico effectively treats every node in the cluster as a border router: together they form a fully interconnected network and exchange routes with each other via BGP. These nodes are called BGP peers.

It should be noted that Calico's default mode for maintaining the network is node-to-node mesh. In this mode, the BGP client on each host talks to the BGP clients of all other nodes in the cluster to exchange routes. As the number of nodes N grows, the number of connections grows on the order of N², putting enormous strain on the cluster network itself.

Therefore, the recommended cluster size in this mode is about 50 nodes. Beyond that, the Route Reflector (RR) mode is recommended instead. In this mode, Calico designates a few nodes as RRs: they talk to the BGP clients of all nodes and learn all the routes in the cluster, while the other nodes only need to exchange routes with the RR nodes. This greatly reduces the number of connections; for the stability of the cluster network, RR >= 2 is recommended.
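A sketch of how that switch from full mesh to RR might look with calicoctl, following the pattern in Calico's documentation; the AS number and the RR peer IP 10.100.0.3 are assumptions, and the RR node itself additionally needs spec.bgp.routeReflectorClusterID set on its Node resource:

# Disable the default node-to-node full mesh
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 63400
EOF

# Have every node peer with the route reflector instead
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-rr-1
spec:
  peerIP: 10.100.0.3
  asNumber: 63400
EOF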

The working principle described so far still relies on Layer 2 reachability. Suppose we have two hosts: one is 10.100.0.2/24, with container network 10.92.204.0/24 on the node; the other is 10.100.1.2/24, with container network 10.92.203.0/24. Because the two machines are not on the same Layer 2 network, they have to communicate through Layer 3 routing, and Calico generates the following route on the node:

10.92.203.0/23 via 10.100.1.2 dev eth0 proto bird

Here the problem appears: 10.100.1.2 and our 10.100.0.2 are not in the same subnet, so they cannot communicate at Layer 2. This is where Calico's IPIP mode comes in: when the hosts are not on the same Layer 2 network, traffic is sent through an overlay network instead, as shown below:

For such non-Layer-2 communication in IPIP mode, Calico adds a routing rule like the following on the node:

10.92.203.0/23 via 10.100.1.2 dev tunl0

As you can see, although the next hop is still the node IP address, the exit device is tunl0, an IP tunnel device implemented mainly by the Linux kernel's IPIP driver. The container's IP packet is encapsulated directly inside an IP packet of the host network; after it reaches Node2, the IPIP driver decapsulates it to recover the original container IP packet, which is then delivered through the routing rules and the Veth pair device to the target container.

Although this solves communication across non-Layer-2 networks, the encapsulation and decapsulation still degrade performance.
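Whether Calico encapsulates with IPIP always, never, or only when crossing subnets is controlled per IP pool. The following IPPool is an illustrative sketch using CrossSubnet, so that plain routing is kept whenever two nodes share a subnet; the pool name and CIDR are assumptions:

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.92.0.0/16
  ipipMode: CrossSubnet
  natOutgoing: true
EOF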

If Calico could also make the router devices between the hosts learn the container routing rules, Layer 3 communication could be achieved directly, with no encapsulation at all. For example, add the following route to the router:

10.92.203.0/24 via 10.100.1.2 dev interface1

And add the following route on Node1:

10.92.203.0/24 via 10.100.1.1 dev eth0

Then an IP packet sent by a container on Node1 is sent, based on the local routing table, to the gateway router 10.100.1.1. After receiving it, the router checks the destination IP address, finds the next hop in its own routing table, and forwards the packet to Node2, where it finally reaches the destination container. This scheme can be implemented on an Underlay network: as long as the underlying network supports BGP, it can establish eBGP sessions with our RR nodes and exchange the cluster's routing information.

These are the network solutions commonly used with Kubernetes. In public cloud scenarios, the cloud vendor's own solution or flannel host-gw is usually the easier choice; in a private physical data center, Calico is a better fit. Choose the network solution that matches your actual scenario.
