Introduction: This article assumes the reader is familiar with the basic principles of Ethernet, common Linux network commands, and the TCP/IP protocol family, and understands the traditional layered network model and how protocol packets flow through it. Where the specific Linux kernel implementation is concerned, kernel v4.19.215 is used as the reference.
Author: Wilson | Alibaba Tech official account
One: Kernel network packet receiving process
1 From the NIC to the kernel protocol stack
As shown in Figure [1], when a network packet arrives at the NC (Network Computer, i.e. the host), it is handled by the NIC (Network Interface Controller, commonly known as the network card), and the NIC notifies the kernel by raising an interrupt. Linux kernel interrupt handling is split into a Top Half and a Bottom Half: the top half must finish the hardware-related work as quickly as possible and return, while the bottom half is scheduled by the top half to do the subsequent, more time-consuming work.
The NIC's processing flow is as follows. When the NIC receives data, it copies the data via DMA into the memory regions pointed to by the descriptors in the Ring Buffer. Once the copy is done, it raises an interrupt to notify the CPU. You can use ethtool -g {device name, e.g. eth0} to check the size of the RX/TX (receive/transmit) queues. After the CPU recognizes the interrupt, it jumps to the NIC's interrupt handler and starts executing. In the older non-NAPI (New API) [2] mode, the interrupt top half updates the relevant register information, looks at the receive queue, allocates an sk_buff structure pointing to the received data, and finally calls netif_rx() to pass the sk_buff to the kernel. Within netif_rx(), the allocated sk_buff is placed on the input_pkt_queue, a virtual backlog device is added to the poll_list polling queue, and the NET_RX_SOFTIRQ soft interrupt is raised to activate the bottom half. At this point the top half is complete; for details, see the netif_rx() -> netif_rx_internal() -> enqueue_to_backlog() path in net/core/dev.c. The handler for the NET_RX_SOFTIRQ soft interrupt is net_rx_action(), which calls the poll() function registered for the device. In the non-NAPI case, the virtual device's poll() is fixed to process_backlog(). This function moves the sk_buff from input_pkt_queue to process_queue and calls __netif_receive_skb() to hand it to the protocol stack, whose code then dispatches to the appropriate interface based on the protocol type. Note that enqueue_to_backlog() and process_backlog() are also used in the logic when RPS is enabled.
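A few of the knobs mentioned above can be inspected directly from the shell; a quick sketch (the device name eth0 is a placeholder for the actual NIC):

```
# Show the configured and maximum RX/TX ring buffer sizes of the NIC
# (eth0 is a placeholder for the actual device name)
ethtool -g eth0

# Show per-CPU interrupt counts; NIC queues appear as individual IRQ lines
grep eth0 /proc/interrupts

# Per-CPU software-side packet counters (processed, dropped, time_squeeze, ...)
cat /proc/net/softnet_stat
```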
Non-NAPI (New API) mode triggers an interrupt for every arriving packet, which hurts overall throughput and is now obsolete; most NICs today support NAPI. In NAPI mode, after the first packet triggers the NIC interrupt, the device is added to a polling queue, which improves efficiency because no new interrupts are generated while polling is in progress. To support NAPI, each CPU maintains a structure called softnet_data, whose poll_list field holds all devices currently being polled. In this case the interrupt top half is very simple: it only updates the NIC-related register state, adds the device to the poll_list polling queue, and raises the NET_RX_SOFTIRQ soft interrupt. The bottom half is again handled by net_rx_action(), which calls the poll() function provided by the device driver. poll() now points to the polling handler supplied by the driver (as opposed to the kernel's process_backlog() function in the non-NAPI case). The driver's poll() function also ultimately calls __netif_receive_skb() to submit the sk_buff to the protocol stack.
The comparison between non-NAPI mode and NAPI mode is shown below (the gray background marks the parts implemented by the device driver; the rest is implemented by the kernel itself):
For a NAPI-mode network device driver implementation and a detailed walkthrough of the NAPI processing flow, there is an excellent article and its translation [3] (highly recommended). It describes in detail the packet receiving and processing path of the Intel Ethernet Controller I350 NIC (its sister article covers the sending path, also with a translation [4]). Related topics include NIC bonding (you can check the mode in /proc/net/bonding/bond0) and multi-queue NICs (run sudo lspci -vvv and look at the Ethernet controller entry; MSI-X: Enable+ Count=10 indicates that the NIC supports multiple queues, and you can check how queue interrupts are bound to CPUs in /proc/interrupts). These are not covered in this article; see the relevant materials [5] if you are interested.
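For reference, the checks mentioned in this paragraph look roughly like this; bond0 and eth are placeholders for the actual device names on your machine:

```
# Check the bonding status (if bonding is configured)
cat /proc/net/bonding/bond0

# Check whether the NIC supports multiple queues: look for "MSI-X: Enable+ Count=N"
sudo lspci -vvv | grep -A 50 "Ethernet controller" | grep "MSI-X"

# Check how the NIC queue interrupts are spread across CPUs
grep eth /proc/interrupts
```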
2 Packet processing flow in the kernel protocol stack
__netif_receive_skb() submits the sk_buff structure to the kernel protocol stack. This function first performs RPS [5] related processing, and the packet may be re-queued onto another CPU's backlog (with an RSS-capable NIC, RPS is usually unnecessary). If a packet needs to be handed to another CPU, enqueue_to_backlog() queues it on that CPU, and an IPI (Inter-Processor Interrupt, delivered over the APIC bus without an IRQ line) is sent to notify the other CPU (via the net_rps_send_ipi() function).
Finally, the packet is handed to __netif_receive_skb_core() for the next stage of processing. The main tasks of this function are:
- Process all packet_type->func() handlers on ptype_all; this is where tools such as tcpdump capture packets (packet_type.type is ETH_P_ALL, and libpcap uses the AF_PACKET address family)
- Call skb->dev->rx_handler() if one is registered; for a device attached to a Linux bridge, this points to br_handle_frame()
- Process the packet_type->func() handlers on ptype_base, handing the packet to the upper protocol layer, for example calling back into the IP layer's ip_rcv() function
So far, the packet is still flowing through the data link layer. As a refresher, here are the OSI seven-layer model and the TCP/IP five-layer model:
In the layered network model, each higher layer's packet is carried as the data portion, called the payload, of the layer beneath it. The format of a complete TCP/IP application-layer packet is as follows [6]:
The parts of __netif_receive_skb_core()'s processing logic that concern us are the bridge and the subsequent IP and TCP/UDP layers. Looking first at the IP layer: __netif_receive_skb_core() calls deliver_skb(), which invokes the protocol's .func() interface; for the IP protocol this is ip_rcv(). After some statistics and sanity checks, ip_rcv() hands the packet to the Netfilter [7] framework with ip_rcv_finish() designated as the continuation (run only if the packet is not dropped by Netfilter). After the routing subsystem's checks, if the packet is destined for the local host, ip_local_deliver() is called to pass it up to the upper-layer protocol. This function follows the same pattern: it hands the packet to the Netfilter framework with ip_local_deliver_finish() as the continuation, which finally inspects the protocol field and selects the corresponding upper-layer protocol handler.
The flow of common upper-layer protocols such as TCP and UDP is beyond the scope of this article; TCP alone would take more space than this whole article. The entry functions tcp_v4_rcv() for TCP (IPv4) and udp_rcv() for UDP are given here as starting points for self-study; other materials provide further detail [9].
3 Netfilter/iptables and NAT
The Netfilter framework deserves a little extra emphasis, because the network policies and many of the service implementations discussed later rely on mechanisms Netfilter provides.
Netfilter is the kernel's implementation of a packet filtering framework. Hook points are built into each layer of the protocol stack, and callback functions can be registered at those points.
The image is from Wikimedia and can be viewed in larger size by clicking on reference [8].
iptables, the most common firewall on Linux, is built on Netfilter (nftables is its newer-generation successor). iptables organizes rules around the concepts of tables and chains. Don't be misled by the word "firewall": iptables does much more than the filter table; it also provides the nat and mangle tables, among others. In network virtualization, the most-used capability is NAT address translation, the kind of functionality typically found in gateway or load-balancing devices. A NC that performs network virtualization internally is itself also a gateway and a load balancer.
Before configuring NAT rules with iptables, you need to enable kernel packet forwarding: echo "1" > /proc/sys/net/ipv4/ip_forward. It is also advisable to turn on the echo "1" > /proc/sys/net/bridge/bridge-nf-call-iptables switch (you may need to modprobe br_netfilter first). As the source-code walk above shows, bridge forwarding happens before the packet reaches the IP layer, so by default layer-2 bridge forwarding is not subject to layer-3 iptables rules. Many virtualized network implementations, however, need Netfilter rules to take effect, so the kernel also supports invoking Netfilter rules from the bridge forwarding path; this is off by default, hence the switch. For specific iptables commands, see this article and its translation [10]; they are not covered here. A minimal example follows.
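A minimal sketch, assuming an internal subnet of 10.0.0.0/24 and eth0 as the outbound interface (both placeholders):

```
# Enable IP forwarding in the kernel
echo "1" > /proc/sys/net/ipv4/ip_forward

# Let bridged (layer-2) traffic traverse iptables rules as well
modprobe br_netfilter
echo "1" > /proc/sys/net/bridge/bridge-nf-call-iptables

# A typical source-NAT rule for an internal subnet going out through eth0
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE

# Inspect the resulting NAT table
iptables -t nat -L -n -v
```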
It is important to note that Netfilter's logic runs in the kernel softirq context. If a large number of Netfilter rules are added, a certain amount of CPU overhead is inevitable. When we talk about the performance degradation of virtualized networks, a significant portion of the overhead comes from here.
Two: Virtual network devices
In the traditional view, a network is a communication fabric formed by a group of NCs, each with one or more NICs, connected over hardware media via switches and routers (pictures from [11], likewise below):
Network virtualization, as an implementation of SDN (Software Defined Network), is essentially about virtualizing devices such as the vNIC (virtual NIC), vSwitch (virtual switch), and vRouter (virtual router) and configuring the corresponding packet-flow rules. Their external interfaces must still conform to the specifications of physical network protocols such as Ethernet and the TCP/IP protocol family.
As Linux network virtualization evolved, several kinds of virtual network devices emerged that are now widely used in VM and container networks, typically Tap/Tun, Veth, and Bridge:
- Tap/Tun are two kinds of virtual network devices implemented by the Linux kernel, working at layer 2 and layer 3 respectively. The kernel exchanges data between a Tap/Tun device and the user-space process bound to it. A Tap driver can be used to implement a VM's vNIC, while Tun devices are mostly used for other forwarding functions such as tunneling.
- Veth devices are always created in pairs (a Veth Pair). Data received from the kernel on one device is sent out of the other, so a Veth pair can be thought of as a pair of vNICs connected by a network cable.
- A Bridge is a virtual network bridge working at layer 2. Although it is a virtual device called a bridge, its design is closer to a vSwitch. When a Bridge is used together with Veth devices, one end of a Veth pair can be attached to the Bridge, which is equivalent to plugging a NIC into a switch in a physical environment.
For example, KVM uses Tap devices to connect a VM's vNIC to the host's bridge, and a container's bridge network mode connects Veth pairs in different namespaces to a bridge to enable communication (other approaches are discussed below).
With the Linux Bridge, it is easy to get hosts and VMs/containers communicating, within a host or across hosts, in bridge or NAT mode, and the Bridge itself supports VLAN configuration, providing some layer-3-switch-like capabilities. Nevertheless, many vendors have developed virtual switches with richer feature sets, such as the Cisco Nexus 1000V, the VMware Virtual Switch, and the widely used open-source Open vSwitch [12]. With a vSwitch, more advanced virtual networks supporting more encapsulation protocols can be built:
1 Linux Bridge + Veth Pair Forwarding
Virtual Routing and Forwarding (VRF) is a common term in networking. Since the 1990s, many layer-2 switches have been able to create 4K VLAN broadcast domains; 4K comes from the 802.1Q standard, which defines the VLAN ID as 12 bits (0 and 4095 are reserved; QinQ, i.e. 802.1Q in 802.1Q, can reach 4094 x 4094). VRF brought the same idea to layer 3, allowing multiple virtual routing/forwarding instances to coexist on a single physical device. Linux's VRF virtualizes the layer-3 network protocol stack, while a Network Namespace (netns) virtualizes the entire network stack. A netns has its own network interfaces, loopback device, routing tables, and iptables rules. This article uses netns for the demonstrations (we are talking about containers, after all), together with the ip [14] command to create and manage netns and Veth Pair devices.
Create, view, and delete a Network Namespace
```
# Create two netns named qianyi-test-1 and qianyi-test-2
ip netns add qianyi-test-1
ip netns add qianyi-test-2

# List all network namespaces
ip netns list

# Delete a network namespace
ip netns del qianyi-test-1
ip netns del qianyi-test-2
```
The execution result is shown as follows:
If you are interested, trace the creation with the strace command to see what the ip command actually does under the hood (strace ip netns add qianyi-test-1).
Run a command in netns
```
# Run ip addr inside the qianyi-test-1 netns (you can even run bash to get a shell)
ip netns exec qianyi-test-1 ip addr

# nsenter can achieve the same effect; see man nsenter
```
The result is as follows:
The newly created netns contains nothing but a lone lo (loopback) device, and it is DOWN. Bring it up:
```
# Bring up the lo device in both netns; this matters for local traffic
ip netns exec qianyi-test-1 ip link set dev lo up
ip netns exec qianyi-test-2 ip link set dev lo up
```
The state then shows as UNKNOWN, which is normal: the operational state is supposed to be reported by the driver, and lo's driver simply does not report it.
Create Veth Pairs
```
# Create two Veth pairs named veth-1-a/veth-1-b and veth-2-a/veth-2-b
ip link add veth-1-a type veth peer name veth-1-b
ip link add veth-2-a type veth peer name veth-2-b
```
You can view them with the ip addr command:
Interfaces 8-9 and 10-11 are the two Veth pairs just created. They have no IP addresses assigned yet and are in the DOWN state.
Add the Veth Pairs to the netns
```
# Move veth-1-a into qianyi-test-1 and veth-1-b into qianyi-test-2
ip link set veth-1-a netns qianyi-test-1
ip link set veth-1-b netns qianyi-test-2

# Assign IP addresses and bring the devices up
ip netns exec qianyi-test-1 ip addr add 10.0.0.101/24 dev veth-1-a
ip netns exec qianyi-test-1 ip link set dev veth-1-a up
ip netns exec qianyi-test-2 ip addr add 10.0.0.102/24 dev veth-1-b
ip netns exec qianyi-test-2 ip link set dev veth-1-b up
```
The corresponding routing table entries (see the route or ip route command) are created automatically:
If you have many commands to run with ip netns exec {…}, you can simply exec bash instead, which gives you a shell working inside the netns.
The qianyi-test-1 and qianyi-test-2 netns are now connected through the veth-1-a/veth-1-b pair, and the two netns can reach each other via their IP addresses. A quick test is shown below.
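A minimal connectivity check between the two netns, using the addresses configured above:

```
# Ping qianyi-test-2 (10.0.0.102) from inside qianyi-test-1
ip netns exec qianyi-test-1 ping -c 3 10.0.0.102

# And the reverse direction
ip netns exec qianyi-test-2 ping -c 3 10.0.0.101
```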
Capturing packets on 10.0.0.101 while pinging shows the following:
veth-1-a (10.0.0.101) first queries the MAC address of 10.0.0.102 via ARP. After the reply arrives, the two sides exchange ICMP (Internet Control Message Protocol) requests and replies, which is the protocol ping uses.
You can also view the cached ARP resolution entries with the arp command, for example:
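A small sketch (arp comes from the net-tools package; ip neigh is the modern iproute2 equivalent):

```
# View the ARP cache inside qianyi-test-1
ip netns exec qianyi-test-1 arp -n

# Or, using the ip command
ip netns exec qianyi-test-1 ip neigh show
```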
The network connection mode is as follows:
This connection mode is like wiring the NICs of two machines directly together with a network cable; once IP addresses in the same subnet are configured, they can talk to each other. What if more devices need to be connected? In the physical world you would need a network device such as a switch. Remember the Bridge that ships with Linux? Next we use the Bridge mechanism to build the network.
Before the following experiments, we need to move the veth-1-a/veth-1-b pair out of qianyi-test-1 and qianyi-test-2 back to the host netns, restoring the initial environment.
```
# The host netns can be referred to via PID 1
ip netns exec qianyi-test-1 ip link set veth-1-a netns 1
ip netns exec qianyi-test-2 ip link set veth-1-b netns 1
```
Create a Linux Bridge and configure the network
```
# Create the bridge br0 and bring it up
ip link add br0 type bridge
ip link set br0 up

# Move the A ends of the two Veth pairs into qianyi-test-1 and qianyi-test-2
ip link set veth-1-a netns qianyi-test-1
ip link set veth-2-a netns qianyi-test-2

# Assign IP addresses to the A ends and bring them up
ip netns exec qianyi-test-1 ip addr add 10.0.0.101/24 dev veth-1-a
ip netns exec qianyi-test-1 ip link set dev veth-1-a up
ip netns exec qianyi-test-2 ip addr add 10.0.0.102/24 dev veth-2-a
ip netns exec qianyi-test-2 ip link set dev veth-2-a up

# Attach the B ends of the Veth pairs to the br0 bridge and bring them up
ip link set veth-1-b master br0
ip link set dev veth-1-b up
ip link set veth-2-b master br0
ip link set dev veth-2-b up
```
Running the brctl show command now shows the created bridge, and ip addr shows the configured addresses. You can clearly see that veth-1-b and veth-2-b are attached to the bridge's ports. Once one end of a Veth pair is plugged into a bridge, it degrades from a "network card" into a bare "RJ45 plug". For example:
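A sketch of the inspection commands (brctl comes from the bridge-utils package; bridge is the iproute2 equivalent):

```
# List bridges and the interfaces attached to them
brctl show

# The same information via the newer bridge tool
bridge link show

# Addresses configured inside the netns
ip netns exec qianyi-test-1 ip addr
```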
Capturing packets in this mode gives the same result as before, but the network topology is different:
In this mode, if there are more Network Namespaces and Veth pairs, they can be added in the same way to scale horizontally.
However, trying to ping the host from qianyi-test-1 naturally fails, because no network rule yet allows access to the host network:
The docker0 bridge in the screenshot above is created automatically for Docker when Docker is installed on the machine. You may have noticed that the docker0 bridge actually has an IP address. A real physical bridge of course cannot have an IP, but the Linux Bridge is a virtual device, so it can. When a bridge is given an IP address, it can serve as the gateway of the internal network, and combined with routing rules this achieves the simplest cross-machine virtual network communication (similar to a real layer-3 switch).
Set the default gateway address on veth-1-a and veth-2-a:
```
# Enable packet forwarding in the kernel
echo "1" > /proc/sys/net/ipv4/ip_forward

# Give br0 an IP address so it can act as a gateway
ip addr add local 10.0.0.1/24 dev br0

# Set the default gateway inside the two netns
ip netns exec qianyi-test-1 ip route add default via 10.0.0.1
ip netns exec qianyi-test-2 ip route add default via 10.0.0.1
```
The corresponding host routing table entry was already created automatically in the ip link set br0 up step:
The network model further looks like this:
What if a bridge with another subnet and several netns exist on another host; how do they communicate across hosts? Configure a route to the peer's subnet on each host. Suppose the other host's address is 10.97.212.160 and it hosts a different bridge subnet: then this host needs a route for that subnet via 10.97.212.160, and the other host needs a route for this host's 10.0.0.0/24 via 10.97.212.159 (or use iptables SNAT/DNAT instead); a sketch is shown below. What if there are N hosts? That becomes N x N routes, which quickly gets complicated. This is a simple Underlay-mode container communication scheme. Its drawbacks are obvious: it requires permission to modify the underlying host network and is hard to decouple from it. If instead a virtual bridge spanning all the hosts could be built on top of the physical network, connecting all the devices in the related netns, the two layers would be decoupled. This is the Overlay Network approach, described next. The installation and configuration of other virtual network devices for this section (such as Open vSwitch) are similar and not repeated here; consult their documentation if interested.
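A sketch of such static routes, assuming (hypothetically) that the peer host 10.97.212.160 hosts the container subnet 10.0.1.0/24 while this host (10.97.212.159) hosts 10.0.0.0/24:

```
# On this host (10.97.212.159): reach the peer's container subnet via the peer host
ip route add 10.0.1.0/24 via 10.97.212.160

# On the peer host (10.97.212.160): reach this host's container subnet
ip route add 10.0.0.0/24 via 10.97.212.159
```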
2 Overlay network solution: VXLAN
VXLAN (Virtual eXtensible Local Area Network, RFC 7348) [16] is one of the NVO3 (Network Virtualization over Layer 3) standard technologies defined by the IETF (NVGRE and STT being other representative ones). Despite the name, VXLAN and VLAN solve different problems. In essence, VXLAN is a tunnel encapsulation technique: it encapsulates data-link-layer (L2) Ethernet frames into transport-layer (L4) UDP datagrams and transmits them over the network layer (L3). The effect is that L2 Ethernet frames appear to travel within a single broadcast domain, i.e. across a layer-3 network without being aware of it. Because the encapsulation is UDP-based, a large virtual layer-2 network can be built wherever IP routing is available. And because it re-encapsulates at a higher layer, performance is roughly 20% to 30% lower than a traditional network (this figure evolves with the technology and only reflects the current state).
The following describes two important concepts of VXLAN:
- VXLAN Tunnel Endpoint (VTEP): encapsulates and decapsulates VXLAN packets, hiding the link-layer frame forwarding details from the upper layers
- VXLAN Network Identifier (VNI): identifies different tenants; virtual networks with different VNIs cannot communicate directly at layer 2
The packet format of VXLAN is shown in Figure [17].
The Linux kernel has supported VXLAN since v3.7.0, but for stability and feature completeness, try to use kernel v3.10.0 or later. Below, two hosts, 10.97.212.159 and 11.238.151.74, are used to build a test VXLAN network.
```
# On host 10.97.212.159
# Create two netns named qianyi-test-1 and qianyi-test-2 and bring up lo in each
ip netns add qianyi-test-1
ip netns add qianyi-test-2
ip netns exec qianyi-test-1 ip link set dev lo up
ip netns exec qianyi-test-2 ip link set dev lo up

# Create the bridge br0 and bring it up
ip link add br0 type bridge
ip link set br0 up

# Create two Veth pairs named veth-1-a/veth-1-b and veth-2-a/veth-2-b
ip link add veth-1-a type veth peer name veth-1-b
ip link add veth-2-a type veth peer name veth-2-b

# Move the A ends into qianyi-test-1 and qianyi-test-2
ip link set veth-1-a netns qianyi-test-1
ip link set veth-2-a netns qianyi-test-2

# Assign IPs to veth-1-a and veth-2-a and bring them up
ip netns exec qianyi-test-1 ip addr add 10.0.0.101/24 dev veth-1-a
ip netns exec qianyi-test-1 ip link set dev veth-1-a up
ip netns exec qianyi-test-2 ip addr add 10.0.0.102/24 dev veth-2-a
ip netns exec qianyi-test-2 ip link set dev veth-2-a up

# Attach the B ends to br0 and bring them up
ip link set veth-1-b master br0
ip link set dev veth-1-b up
ip link set veth-2-b master br0
ip link set dev veth-2-b up

# On host 11.238.151.74
# Create two netns named qianyi-test-3 and qianyi-test-4 and bring up lo in each
ip netns add qianyi-test-3
ip netns add qianyi-test-4
ip netns exec qianyi-test-3 ip link set dev lo up
ip netns exec qianyi-test-4 ip link set dev lo up

# Create the bridge br0 and bring it up
ip link add br0 type bridge
ip link set br0 up

# Create two Veth pairs named veth-3-a/veth-3-b and veth-4-a/veth-4-b
ip link add veth-3-a type veth peer name veth-3-b
ip link add veth-4-a type veth peer name veth-4-b

# Move the A ends into qianyi-test-3 and qianyi-test-4
ip link set veth-3-a netns qianyi-test-3
ip link set veth-4-a netns qianyi-test-4

# Assign IPs to veth-3-a and veth-4-a and bring them up
ip netns exec qianyi-test-3 ip addr add 10.0.0.103/24 dev veth-3-a
ip netns exec qianyi-test-3 ip link set dev veth-3-a up
ip netns exec qianyi-test-4 ip addr add 10.0.0.104/24 dev veth-4-a
ip netns exec qianyi-test-4 ip link set dev veth-4-a up

# Attach the B ends to br0 and bring them up
ip link set veth-3-b master br0
ip link set dev veth-3-b up
ip link set veth-4-b master br0
ip link set dev veth-4-b up
```
This long list of commands repeats exactly the earlier steps and creates a network environment like the one below:
In this environment, 10.0.0.101 and 10.0.0.102 can reach each other, as can 10.0.0.103 and 10.0.0.104, but obviously 10.0.0.101/10.0.0.102 cannot reach 10.0.0.103/10.0.0.104.
Now configure the VXLAN environment so that all four netns can reach each other:
```
# On 10.97.212.159: create the VTEP device vxlan1
# (VNI 1, remote VTEP 11.238.151.74, UDP port 9527 explicitly specified, underlay device bond0)
ip link add vxlan1 type vxlan id 1 remote 11.238.151.74 dstport 9527 dev bond0
ip link set vxlan1 master br0
ip link set vxlan1 up

# On 11.238.151.74: create the VTEP device vxlan2
ip link add vxlan2 type vxlan id 1 remote 10.97.212.159 dstport 9527 dev bond0
ip link set vxlan2 master br0
ip link set vxlan2 up
```
Running brctl show br0 on each host shows that the VXLAN devices are attached:
To ping 10.0.0.103 from 10.0.0.101:
Packets captured at 10.0.0.101 look like layer 2 communication:
The ARP cache also looks just like ordinary layer-2 communication:
Run arp -d 10.0.0.103 to delete that cache entry, capture packets again on the host, save them to a file, and open it with Wireshark. You also need to configure Wireshark to decode the captured UDP traffic as the VXLAN protocol; a command-line sketch follows.
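A command-line sketch of the same capture and decode steps, assuming bond0 as the underlay device (as in the setup above) and using tshark, the command-line counterpart of Wireshark:

```
# Capture the underlay UDP traffic carrying the VXLAN tunnel
tcpdump -i bond0 -w vxlan.pcap udp port 9527

# Decode the non-standard UDP port as VXLAN when reading the capture
tshark -r vxlan.pcap -d udp.port==9527,vxlan
```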
We got the result we expected. The network architecture at this time is shown in the figure:
So the question is: can reliable communication be achieved over UDP? Of course, reliability is not this layer's concern; it is the concern of the protocols encapsulated inside. The complete communication flow is not complicated. Each machine has a VTEP (VXLAN Tunnel Endpoint) device listening for UDP packets on port 9527; on receiving one, it decapsulates the inner frame and passes it to the target device via the bridge. But how does the VTEP virtual device know which machine a packet addressed to, say, 10.0.0.103 should be sent to? After all, this is a virtual layer-2 network, and the underlay network does not recognize that address. In fact, a layer-2 Forwarding Database (FDB) is maintained for the VXLAN device, storing the mapping between the MAC address of the remote VM/container, the IP address of the remote VTEP, and the VNI. The bridge fdb command can be used to inspect and manipulate FDB entries:
```
# Add an FDB entry
bridge fdb add <remote_host_mac_addr> dev <vxlan_interface> dst <remote_host_ip_addr>

# Delete an FDB entry
bridge fdb del <remote_host_mac_addr> dev <vxlan_interface>

# Replace an FDB entry
bridge fdb replace <remote_host_mac_addr> dev <vxlan_interface> dst <remote_host_ip_addr>

# Show FDB entries
bridge fdb show
```
In the simple experiment above, the two machines directly specify each other's VTEP address on the command line, and packets whose destination cannot be found in the FDB table are sent to that address; this is the simplest interconnection mode. In a large-scale VXLAN network, you need a way to discover the other VTEP addresses on the network. There are two common approaches. One is to use multicast (IGMP) to group the nodes into a virtual whole, so that a packet whose destination is unknown is broadcast to the whole group (in the experiment above, create the VXLAN device with a multicast group address such as 224.1.1.1, replacing the remote keyword with group; see other materials for details, and a sketch follows below). The other is to have an external distributed control plane collect the FDB information and distribute it to all nodes in the same VXLAN network. Multicast is limited by underlay network support and by performance at large scale; many cloud networks may not allow it at all. That is why, when we discuss the K8s networking schemes below, we will see many network plugins implemented in a way similar to the latter approach.
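A sketch of the multicast variant mentioned above, reusing the VNI and port from the earlier experiment (option support may vary with the kernel version):

```
# Create a VTEP that discovers peers via a multicast group instead of a fixed remote address
ip link add vxlan1 type vxlan id 1 group 224.1.1.1 dstport 9527 dev bond0
ip link set vxlan1 master br0
ip link set vxlan1 up
```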
That is all for this section. Of course, VXLAN is not the only Overlay network scheme, though it is used by many mainstream solutions. Other Overlay modes may look bewildering, but they are essentially L2 over L4, L2 over L3, L3 over L3, and so on; once the basic principles are understood, they are not hard to grasp. There are many more network virtualization devices and mechanisms [18]; describing them all would take far more than a day, but once the basic network principles are mastered, the rest is just different protocols' packets flowing around.
Three: K8s network virtualization implementation
1 The K8s network model
Each Pod has its own IP address, which means you don’t need to explicitly create links between each Pod, and you almost never have to deal with container port to host port mapping. This will create a clean, backward compatible model in which a Pod can be viewed as a virtual machine or physical host for port allocation, naming, service discovery, load balancing, application configuration, and migration.
The implementation of all network facilities in Kubernetes must meet the following basic requirements (unless specific network segmentation policies are set up):
- A Pod on a node can communicate with a Pod on any other node without NAT
- Agents on the node (e.g., system daemons, Kubelet) can communicate with all pods on the node
Note: only for platforms that support Pods running on the host network (e.g. Linux):
- Pods running on the host network of nodes can communicate with pods on all nodes without NAT
This model is not only uncomplicated, but also compatible with Kubernetes' original goal of enabling cheap migration from virtual machines to containers. If your workload previously ran in a VM that had its own IP and could talk to other VMs in the project, it is basically the same model.
Kubernetes IP addresses exist at the Pod scope: the containers in a Pod share their network namespace, including their IP and MAC addresses. This means that containers inside a Pod can reach each other's ports via localhost. It also means that containers in a Pod must coordinate port usage, which is no different from processes in a VM. This is the "one Pod, one IP" (IP-per-Pod) model.
These paragraphs are quoted from the official K8s documentation [19]. In short: every Pod has its own IP address, and all Pods can communicate with each other without NAT. This model makes a Pod's network environment approximate that of a VM.
2 How mainstream K8s network plugins work
Networking in K8s is implemented via plugins, of which there are two types:
- CNI plug-in: Complies with the CNI (Container Network Interface) specification and is designed with emphasis on interoperability
- Kubenet plugin: implements a basic cbr0 using the bridge and host-local CNI plugins
The image is from [20]. This article focuses only on CNI plugins. The mainstream K8s network plugins are listed in [21]; here we pick a few projects with GitHub star counts in the thousands for analysis:
- Flannel:github.com/flannel-io/…
- Calico:github.com/projectcali…
- Cilium:github.com/cilium/cili…
Flannel
CNI is a specification proposed by CoreOS, so let's first look at how CoreOS's own Flannel project is designed. Flannel deploys an agent process named flanneld on each host, and the subnet allocation data is stored via the Kubernetes API or etcd. The Flannel project itself is only a framework; it is Flannel's backends that actually provide the container network capability.
Flannel currently has several backend implementations: VXLAN, host-gw, UDP, backends for large cloud vendors such as Alibaba Cloud (still experimental), and some experimental tunnel backends such as IPIP and IPSec. According to the official documentation, VXLAN mode is preferred; host-gw is recommended for experienced users seeking better performance (it is usually unavailable in cloud environments, for reasons explained below); UDP is the first, low-performance backend Flannel supported and has been largely abandoned.
The following is an analysis of these three modes.
1) VXLAN
Flannel uses the Linux kernel's VXLAN support to encapsulate packets, creating a VTEP device named flannel.1; thanks to the flanneld process, the task of registering and updating the forwarding relationships for new VTEP devices falls on it. There is not much more to it: every new K8s Node runs the flanneld daemon (deployed as a DaemonSet), so registering and updating new VTEP device relationships happens naturally, and the global data lives in etcd. The data path is the same as the one already drawn above; only the device names differ (the VTEP is called flannel.1 and the bridge is called cni0).
2) host-gw
As the name implies, host-gw uses the host as the gateway to forward protocol packets. This approach has in fact already been demonstrated earlier; tracking node changes and adding/removing routing table entries again falls to flanneld. The pros and cons are obvious: the biggest advantage is direct-forwarding performance (overall roughly 10% below host-level communication, versus 20% or even 30% for VXLAN). The drawbacks are equally clear: hosts must be layer-2 reachable to each other, you need control over the infrastructure (to edit routing tables), which is usually impossible in a cloud environment, and the routing table grows with the cluster size. The principle is simple enough that no diagram is needed; a sketch of the kind of route involved is shown below.
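In host-gw mode, flanneld essentially maintains routes of the following shape on every node; the Pod subnet 10.244.1.0/24, the node IP 192.168.0.102, and eth0 are hypothetical placeholders:

```
# "Packets for the Pod subnet hosted by node 192.168.0.102 go directly to that node"
ip route add 10.244.1.0/24 via 192.168.0.102 dev eth0
```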
3) UDP
The flanneld process on each host creates a Tun device named flannel0 by default. A Tun device's job is very simple: it passes IP packets between the kernel and a user-space application. When the kernel sends an IP packet to the Tun device, the packet is handed to the process that created the device; when that process writes an IP packet to the Tun device, the packet shows up in the host network stack and is routed to its next hop according to the routing table. In Flannel's container network, all containers on a host belong to the "subnet" assigned to that host, and the mapping between each subnet range and its host's IP address is stored in etcd. The flanneld process listens on UDP port 8285 on the host, and the hosts communicate by wrapping IP packets in UDP and sending them to the destination host. Why, as mentioned earlier, is this mode slow? It is an Overlay network simulated at the application layer: compared with the kernel's native VXLAN support, every packet makes an extra round trip through user space (flanneld's encapsulation and decapsulation), so the performance loss is larger.
Calico
Calico is an interesting project. Its basic idea: treat the host entirely as a router, use no tunnels and no NAT, turn all layer-2/layer-3 traffic into layer-3 traffic, and forward packets purely through routing configuration on the host.
What distinguishes Calico from Flannel's host-gw mode? First, Calico does not use a bridge at all; routing rules alone forward data between the different vNICs. Second, the routing information is not stored and synchronized through etcd but exchanged with BGP (Border Gateway Protocol), just as in a physical network. BGP is a complex protocol and explaining it in detail would take real effort (the author does not know it deeply either); for this article it is enough to know that it lets routing devices advertise routes to and learn routes from each other. Interested readers can dig into other materials. Back to Calico: each host runs a daemon named Felix and a BGP client named BIRD.
As noted above, Flannel's host-gw mode requires hosts to be layer-2 reachable (on the same subnet), and Calico's default mode has the same requirement. However, Calico offers an IPIP mode for hosts that are not in the same subnet. With it enabled, a Tun device is created on the host and traffic travels through an IP-in-IP tunnel. Of course, once this mode is used, the result is again an L3-over-L3 Overlay network, with performance similar to VXLAN mode.
The full-routing-table mode needs no extra components: once IP forwarding and the routing rules are configured, packet flow relies entirely on the kernel's routing module. The IPIP architecture diagram is much the same and is not drawn here.
Cilium
eBPF-based Networking, Security, and Observability
From this one-line introduction alone, you can tell that Cilium has a distinctive flavor. The project currently has on the order of ten thousand GitHub stars, outstripping the previous two. Once Cilium is deployed, a daemon named cilium-agent runs on each host; it maintains and deploys eBPF programs to implement all traffic forwarding, filtering, and diagnostics (entirely independent of the Netfilter mechanism; requires kernel > v4.19). The schematic (from the GitHub home page) is straightforward:
Beyond basic network connectivity, isolation, and service exposure, Cilium relies on eBPF to offer far better observability and troubleshooting of the host network. This is a big topic in its own right, so it is left at that; a couple of well-written articles and their translations are listed in the references for further reading.
3 K8s container access isolation
The sections above introduced the mechanisms and implementation principles of the network plugins, which ultimately build a layer-2/layer-3 connected virtual network. By default, any network access between Pods is unrestricted, yet internal networks often need some access rules (a firewall).
For this requirement, K8s abstracts a mechanism called NetworkPolicy. Network policies are implemented by the network plugin, so to use them you must choose a network solution that supports NetworkPolicy. Why stress this? Because not every plugin supports it; Flannel, for example, does not. As for how NetworkPolicy is enforced: iptables is used to configure Netfilter rules that filter packets. How NetworkPolicy is configured and how iptables/Netfilter work in detail are outside the scope of this article; see other resources [24].
4 K8s service exposure
Within a K8s cluster, containers and processes can already talk to one another thanks to the network plugin. But for a service provider that is not enough, because callers are often outside the K8s cluster, so a mechanism is needed to expose the services inside the cluster. K8s uses the Service object for this abstraction. Service is an important K8s object; even in-cluster access usually goes through a Service (Pod addresses are not fixed, and load balancing is always needed).
The default Service implementation is kube-proxy + iptables rules. When a Service is created, K8s assigns it a Cluster IP. That address is a VIP: no real network object exists behind it, and the IP lives only in iptables rules. Access to this VIP:VPort is redirected (DNAT) to one or more real Pod addresses by iptables rules in random mode. That is the basic principle of a Service. So what does kube-proxy do? kube-proxy watches for Pod changes and generates these NAT rules on each host. In this mode, kube-proxy does not forward any traffic itself; it only lays the plumbing.
The official K8s documentation describes the various modes kube-proxy supports and their basic principles well [26]. The early userspace mode has been largely deprecated, and the iptables random-rule approach described above is also not recommended at large scale (think about why). The currently most recommended mode is IPVS, which performs much better than the previous two at scale. If IPVS sounds unfamiliar, LVS probably does not. In this mode, kube-proxy creates a virtual NIC named kube-ipvs0 on the host and assigns the Service VIPs to it as addresses. Then kube-proxy uses the kernel's IPVS module to configure the backend Pod addresses for those VIPs (see the ipvsadm command). The kernel's IPVS implementation also uses Netfilter's NAT mechanism; the difference is that IPVS does not install a NAT rule per address but handles these mappings in kernel state, keeping the number of iptables rules essentially constant. An example of how to inspect what kube-proxy sets up is sketched below.
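As a sketch, here is one way to peek at what kube-proxy installs on a node (chain names and output depend on the kube-proxy version and mode):

```
# iptables mode: service-related NAT rules live under the KUBE-SERVICES chain
iptables -t nat -L KUBE-SERVICES -n | head

# IPVS mode: list virtual services (Cluster IPs) and their backend Pod addresses
ipvsadm -Ln

# The dummy device holding the Service VIPs in IPVS mode
ip addr show kube-ipvs0
```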
That only solves load balancing, not exposure to the outside. K8s Services can be exposed externally via NodePort, the LoadBalancer type (which calls the CloudProvider to create a load balancer for you on a public cloud), or ExternalName (a CNAME added in kube-dns). When there are many services but each carries little traffic, an Ingress in front of a Service can also be used to converge them [27]. Ingress currently supports only layer-7 HTTP(S) forwarding, while Service supports only layer-4 forwarding. Can you guess how Ingress is implemented? Here is a picture [28] (there are of course many other controllers [29]):
This part is not elaborated further; it is nothing more than NAT upon NAT, plus attempts to keep the number of NAT entries under control. As usual, here is a particularly good article on kube-proxy's principles for further reading [30].
Four: Summary
Network virtualization is far too big a topic to cover completely in a single article. Although this article tries to string the important knowledge points together clearly, the author's energy and understanding are limited, so there may be omissions or even mistakes; questions and corrections in the comments are welcome. Many of the references below are excellent and worth further study (some addresses may require special means of access due to network restrictions).
References
1. TCP Implementation in Linux: A Brief Tutorial, Helali Bhuiyan, Mark McGinley, Tao Li, Malathi Veeraraghavan, University of Virginia: www.semanticscholar.org/paper/TCP-I…
2. NAPI, Linux Foundation wiki: wiki.linuxfoundation.org/networking/…
3. Monitoring and Tuning the Linux Networking Stack: Receiving Data, Joe Damato; translation (2016): arthurchiao.art/blog/tuning…
4. Monitoring and Tuning the Linux Networking Stack: Sending Data, Joe Damato; translation (2017): arthurchiao.art/blog/tuning…
5. Scaling in the Linux Networking Stack: github.com/torvalds/li…
6. Understanding TCP internals step by step for Software Engineers and System Designers, Kousik Nath
7. Netfilter: www.netfilter.org/
8. Netfilter-packet-flow: upload.wikimedia.org/wikipedia/c…
9. Analysis of TCP in Linux: github.com/fzyz999/Ana…
10. NAT - Network Address Translation, translation (2016): arthurchiao.art/blog/nat-zh…
11. Virtual networking in Linux, M. Jones, IBM Developer: developer.ibm.com/tutorials/l…
12. Open vSwitch: www.openvswitch.org/
13. Linux Namespace: man7.org/linux/man-p…
14. ip: man7.org/linux/man-p…
15. veth: man7.org/linux/man-p…
16. VXLAN: en.wikipedia.org/wiki/Virtua…
17. QinQ vs VLAN vs VXLAN, John: community.fs.com/blog/qinq-v…
18. Introduction to the Linux interfaces for virtual networking, Hangbin Liu: developers.redhat.com/blog/2018/1…
19. Cluster Networking, Kubernetes documentation: kubernetes.io/docs/con…
20. THE CONTAINER NETWORKING LANDSCAPE: CNI FROM COREOS AND CNM FROM DOCKER, Lee Calcote: thenewstack.io/container-n…
21. CNI - the Container Network Interface: github.com/containerne…
22. Making the Kubernetes Service Abstraction Scale using eBPF (LPC, 2019), with translation: linuxplumbersconf.org/event/4/con…
23. Kubernetes Service load balancing based on BPF/XDP (LPC, 2020): linuxplumbersconf.org/event/7/con…
24. A Deep Dive into Iptables and Netfilter Architecture, Justin Ellingwood: www.digitalocean.com/community/t…
25. Iptables Tutorial 1.2.2, Oskar Andreasson: www.frozentux.net/iptables-tu…
26. Virtual IPs and service proxies, Kubernetes documentation: kubernetes.io/docs/concep…
27. Ingress, Kubernetes documentation: kubernetes.io/docs/concep…
28. NGINX Ingress Controller: www.nginx.com/products/ng…
29. Ingress Controllers, Kubernetes documentation: kubernetes.io/docs/concep…
30. Cracking Kubernetes Node Proxy (aka kube-proxy): cloudnative.to/blog/k8s-no…
This article is original content from Alibaba Cloud and may not be reproduced without permission.