Original article: fuckcloudnative.io/posts/ipvs-…
A Service in Kubernetes is an abstraction over a set of Pods selected by label. It provides load balancing and reverse proxying for those Pods and represents the concept of a microservice in the cluster. The kube-proxy component is the concrete implementation of Service. Only by understanding how kube-proxy works can we understand the communication between services, and only then will we not be confused when we run into network failures.
kube-proxy has three modes: userspace, iptables, and IPVS; the userspace mode is rarely used nowadays. The main problem with the iptables mode is that too many rules are generated when there are many Services; because updates are not incremental, they introduce noticeable latency and cause significant performance problems at large scale. To address the performance issues of the iptables mode, the IPVS mode was added (beta support starting in v1.8, GA in v1.11); it updates rules incrementally and can keep connections alive during Service updates.
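If you want to check which mode kube-proxy is actually running in on an existing cluster, one quick way (assuming kube-proxy's default metrics bind address 127.0.0.1:10249 on the node) is to query its metrics endpoint, which should print the mode name, e.g. ipvs:
$ curl http://localhost:10249/proxyMode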
At present, most documents about how kube-proxy works use the iptables mode as an example and rarely mention IPVS. This article instead explains how the kube-proxy IPVS mode works. For a more thorough understanding, we will not use Docker or Kubernetes, but lower-level tools to demonstrate.
As we all know, Kubernetes creates a separate network namespace for each Pod. In this article we will simulate Pods by manually creating network namespaces and starting HTTP services inside them.
The goal of this article is to explore how kube-proxy uses IPVS and ipset by simulating the following Service:
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  clusterIP: 10.100.100.100
  selector:
    component: app
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
Following my steps, you will be able to use curl 10.100.100.100:8080 to access the HTTP service in a network namespace. To better understand the content of this article, it is recommended to read the following articles first:
- How do Kubernetes and Docker create IP Addresses?!
- iptables: How Docker Publishes Ports
- iptables: How Kubernetes Services Direct Traffic to Pods
Note: All the steps in this article were tested in Ubuntu 20.04, please test them yourself for other Linux distributions.
Prepare the experimental environment
First, you need to enable the routing and forwarding function of Linux:
$ sysctl --write net.ipv4.ip_forward=1
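Note that this sysctl change is not persistent across reboots. A minimal sketch for verifying the value and persisting it (the file name is just an example):
$ sysctl net.ipv4.ip_forward
$ echo "net.ipv4.ip_forward = 1" | tee /etc/sysctl.d/99-ip-forward.conf
$ sysctl --system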
The following commands do a few things:
- Create a virtual bridge bridge_home
- Create two network namespaces, netns_dustin and netns_leah
- Configure DNS for each network namespace
- Create two veth pairs and connect one end of each to bridge_home
- Assign the IP address 10.0.0.11 to the veth device inside netns_dustin
- Assign the IP address 10.0.0.21 to the veth device inside netns_leah
- Set the default route for each network namespace
- Add iptables rules to allow traffic in and out of the bridge_home interface
- Add an iptables rule to masquerade traffic from the 10.0.0.0/24 segment
$ ip link add dev bridge_home type bridge
$ ip address add 10.0.0.1/24 dev bridge_home
$ ip netns add netns_dustin
$ mkdir -p /etc/netns/netns_dustin
$ echo "nameserver 114.114.114.114" | tee -a /etc/netns/netns_dustin/resolv.conf
$ ip netns exec netns_dustin ip link set dev lo up
$ ip link add dev veth_dustin type veth peer name veth_ns_dustin
$ ip link set dev veth_dustin master bridge_home
$ ip link set dev veth_dustin up
$ ip link set dev veth_ns_dustin netns netns_dustin
$ ip netns exec netns_dustin ip link set dev veth_ns_dustin up
$ ip netns exec netns_dustin ip address add 10.0.0.11/24 dev veth_ns_dustin
$ ip netns add netns_leah
$ mkdir -p /etc/netns/netns_leah
$ echo "nameserver 114.114.114.114" | tee -a /etc/netns/netns_leah/resolv.conf
$ ip netns exec netns_leah ip link set dev lo up
$ ip link add dev veth_leah type veth peer name veth_ns_leah
$ ip link set dev veth_leah master bridge_home
$ ip link set dev veth_leah up
$ ip link set dev veth_ns_leah netns netns_leah
$ ip netns exec netns_leah ip link set dev veth_ns_leah up
$ ip netns exec netns_leah ip address add 10.0.0.21/24 dev veth_ns_leah
$ ip link set dev bridge_home up
$ ip netns exec netns_dustin ip route add default via 10.0.0.1
$ ip netns exec netns_leah ip route add default via 10.0.0.1
$ iptables --table filter --append FORWARD --in-interface bridge_home --jump ACCEPT
$ iptables --table filter --append FORWARD --out-interface bridge_home --jump ACCEPT
$ iptables --table nat --append POSTROUTING --source 10.0.0.0/24 --jump MASQUERADE
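Before moving on, you can sanity-check the topology that was just built. These read-only commands (purely a verification aid) list the network namespaces, the bridge address, and the veth ports attached to bridge_home:
$ ip netns list
$ ip address show bridge_home
$ bridge link show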
Start HTTP service in network namespace netns_dustin:
$ ip netns exec netns_dustin python3 -m http.server 8080
Open another terminal window and start the HTTP service in the network namespace netns_leah:
$ ip netns exec netns_leah python3 -m http.server 8080
Test whether network namespaces can communicate with each other:
$ curl 10.0.0.11:8080
$ curl 10.0.0.21:8080
$ ip netns exec netns_dustin curl 10.0.0.21:8080
$ ip netns exec netns_leah curl 10.0.0.11:8080
The network topology of the whole experimental environment is shown as follows:
Install the necessary tools
To facilitate IPVS and ipset debugging, you need to install two CLI tools:
$ apt install ipset ipvsadm --yes
The ipset and ipvsadm versions used in this article are 7.5-1~exp1 and 1:1.31-1, respectively.
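If you want to confirm the exact versions installed on your machine (they may differ from the ones above), dpkg can report them:
$ dpkg-query -W ipset ipvsadm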
Simulate a Service with IPVS
Use IPVS to create a virtual service that emulates the Kubernetes Service:
$ ipvsadm \
--add-service \
--tcp-service 10.100.100.100:8080 \
--scheduler rr
- Here we use the --tcp-service parameter to specify TCP, because the Service we need to emulate uses TCP.
- One of the advantages of IPVS over iptables is that the scheduling algorithm can be chosen easily; in this case we use the round-robin (rr) algorithm.
Currently, Kube-proxy only allows you to specify the same scheduling algorithm for all services. In the future, different scheduling algorithms will be supported for each Service. For details, see ipvs-based in-cluster Load Balancing Deep Dive.
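You can list the IPVS table at any time to confirm that the virtual service exists and uses the rr scheduler:
$ ipvsadm --list --numeric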
Once the virtual service is created, it must also be assigned a Real Server on the back end, namely the HTTP service in the network namespace netns_dustin:
$ ipvsadm \
  --add-server \
  --tcp-service 10.100.100.100:8080 \
  --real-server 10.0.0.11:8080 \
  --masquerading
This command forwards TCP requests sent to 10.100.100.100:8080 on to 10.0.0.11:8080. The --masquerading flag here plays a role similar to MASQUERADE in iptables: if it is not specified, IPVS tries to forward the traffic using the routing table alone, which will not work here.
IPVS does not implement the POST_ROUTING Hook, so it requires iptables for IP masquerading.
Test whether it works properly:
$ curl 10.100.100.100:8080
The experiment was successful and the request was successfully forwarded to the HTTP service on the back end!
Access virtual services in the network namespace
Next, let's try to access the virtual service from the network namespace netns_leah:
$ ip netns exec netns_leah curl 10.100.100.100:8080
Oh dear, access failed!
To make this test pass, the IP address 10.100.100.100 needs to be assigned to a virtual network interface. Exactly why this is required is not entirely clear, but I suspect it is because the bridge bridge_home does not invoke IPVS, and assigning the virtual service's IP address to a network interface gets around this problem.
Netfilter is a Linux kernel framework that implements various network operations based on user-defined hooks. Netfilter supports various network operations, such as packet filtering, network address translation, and port translation, to forward packets or prohibit packets from being forwarded to sensitive networks.
For Linux kernel 2.6 and later, the Netfilter framework provides five hook points that intercept and process packets, and it allows kernel modules to register callback functions with the kernel network stack. The specific rules behind these callbacks are usually defined by Netfilter plug-ins; common plug-ins include iptables and IPVS. Different plug-ins may implement different hook points, and each plug-in registers with the kernel at a different priority. For example, by default, if both iptables and IPVS rules exist at the same hook point, the iptables rules are processed first.
Netfilter provides five Hook points. When the system kernel protocol stack processes data packets, each Hook point will call the processing function defined in the kernel module. Which handler is called depends on the direction of packet forwarding. The Hook points triggered by inbound and outbound traffic are different.
The kernel protocol stack predefines five hook points:
- NF_IP_PRE_ROUTING: triggered as soon as a received packet enters the protocol stack, before any routing decision (where to send the packet) is made.
- NF_IP_LOCAL_IN: triggered after the routing decision, if the destination IP address of the received packet belongs to the local machine.
- NF_IP_FORWARD: triggered after the routing decision, if the destination IP address of the received packet belongs to another machine.
- NF_IP_LOCAL_OUT: triggered as soon as a locally generated packet that is about to be sent enters the protocol stack.
- NF_IP_POST_ROUTING: triggered after the routing decision, when a locally generated or forwarded packet is about to leave the machine.
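These hook points are where the familiar iptables chains attach. For example, listing the nat table shows its PREROUTING, INPUT, OUTPUT, and POSTROUTING chains, which hang off the corresponding hooks:
$ iptables --table nat --list --numeric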
iptables implements all five hook points, while IPVS only implements LOCAL_IN, LOCAL_OUT, and FORWARD. Because IPVS does not implement PRE_ROUTING, no address translation is performed before the routing decision; the packet then reaches the LOCAL_IN hook point, and if the IPVS callback function finds that the destination IP address does not belong to this node, the packet is dropped.
If the destination IP address is assigned to a virtual network interface, the kernel considers the address local when processing the packet and continues processing it instead of dropping it.
Dummy interface
Of course, we don't want to assign the IP address to any network interface that is already in use; our goal is to emulate the behavior of Kubernetes. In this situation Kubernetes creates a dummy interface, which is similar to the loopback interface except that you can create as many of them as you want. A dummy interface provides the ability to route packets without actually forwarding them. Dummy interfaces have two main uses:
- Used for program communication within the host
- Since a dummy interface is always up (unless its administrative state is explicitly set to down), on a host with multiple physical interfaces the service address can be bound to a loopback or dummy interface so that it is not affected by the state of any physical interface.
First, create a dummy interface:
$ ip link add dev dustin-ipvs0 type dummy
Then assign the virtual IP address to the dummy interface dustin-ipvs0:
$ ip addr add 10.100.100.100/32 dev dustin-ipvs0
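To see why this helps, look at the kernel's local routing table: after the assignment there should be a local route entry for the virtual IP on dustin-ipvs0, which is what allows packets destined for 10.100.100.100 to pass the LOCAL_IN check:
$ ip route show table local | grep 10.100.100.100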
At this point you still can't access the HTTP service from netns_leah; you need another hack: bridge-nf-call-iptables. Before explaining bridge-nf-call-iptables, let's review the basics of container network communication.
Container network based on bridge
There are many implementations of Kubernetes cluster networks, many of which use Linux Bridges:
- Each Pod’s network card is a VeTH device, and the other end of the VEth pair is connected to a bridge on the host.
- Since the bridge is a virtual layer 2 device, communication between Pods on the same node goes directly through layer 2 forwarding, while cross-node communication goes out through the host's eth0.
Service communication on the same node
Kubernetes uses DNAT to access Services in both the iptables and IPVS forwarding modes: packets originally sent to ClusterIP:Port are DNATed to one of the Service's Endpoints (PodIP:Port), and the kernel inserts the connection information into the conntrack table to record the connection. When the reply packet comes back, the kernel matches the connection in the conntrack table and reverses the NAT, so the reply returns along the original path and a complete connection is formed:
However, a Linux bridge is a virtual layer 2 forwarding device, while iptables conntrack works at layer 3, so if the destination address is on the same bridge, the packet is forwarded directly at layer 2 without going through conntrack:
- When a Pod accesses a Service, the destination IP address is the Cluster IP, not an address on the bridge, so the packet goes through layer 3 forwarding and is DNATed into PodIP:Port.
- If, after DNAT, the packet is forwarded to a Pod on the same node, the destination Pod finds when replying that the destination IP address is on the same bridge, so the reply is forwarded directly at layer 2 without going through conntrack, and the reply does not return along the original path (see the figure below).
Because the reply does not return along the original path, the client and the server are not on the same "channel": they are not considered part of the same connection and cannot communicate normally.
Turn on bridge-nf-call-iptables
If the bridge-nf-call-iptables kernel parameter is enabled (set to 1), the bridge device also calls the layer 3 rules configured by iptables (including conntrack) during layer 2 forwarding. Turning on this parameter therefore solves the same-node Service communication problem described above.
Enable bridge-nf-call-iptables:
$ modprobe br_netfilter
$ sysctl --write net.bridge.bridge-nf-call-iptables=1
Now let’s test connectivity again:
$ ip netns exec netns_leah curl 10.100.100.100:8080
Success at last!
Turn on Hairpin mode
Although we can now successfully access the HTTP service in netns_dustin from the network namespace netns_leah through the virtual service, we have not yet tested whether the HTTP service can access itself through the virtual service from its own network namespace, netns_dustin:
$ ip netns exec netns_dustin curl 10.100.100.100:8080
Huh? Why did it fail? Don't panic, just turn on hairpin mode. So what is hairpin mode? It is a concept often mentioned in network virtualization, namely the VEPA mode of a switch port. This technique uses the physical switch to forward traffic between virtual machines; in this situation the source and destination sit behind the same port, so the traffic has to go back out of the same port it came in on.
How do you configure it? It’s as simple as a single command:
$ brctl hairpin bridge_home veth_dustin on
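To confirm that the flag took effect, the bridge tool can show per-port details, including the hairpin setting (output format may vary between iproute2 versions):
$ bridge -d link show dev veth_dustin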
Test again:
$ ip netns exec netns_dustin curl 10.100.100.100:8080
Still failed…
It then took me an afternoon to figure out why it still did not work: hairpin mode only does its job for IPVS if the following kernel parameter is also enabled:
$ sysctl --write net.ipv4.vs.conntrack=1
One last test:
$ ip netns exec netns_dustin curl 10.100.100.100:8080
Finally succeeded this time, but I still don’t quite understand why conntrack can solve this problem. If you know, please leave a message and tell me!
Note: IPVS and its load-balancing algorithm only apply to the first packet of a connection; subsequent packets must first be matched against the conntrack table, and without conntrack IPVS has no effect on those packets. You can view the conntrack entries by running conntrack -L.
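For example, after sending a request through the virtual service, you can filter the conntrack table for the virtual IP (assuming the conntrack CLI is installed, e.g. via apt install conntrack):
$ conntrack -L -d 10.100.100.100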
Turn on promiscuous mode
If you want all network namespaces to be accessible through virtual services, you need to turn on hairpin mode on all veth interfaces connected to the bridge, which is too much trouble. One way to avoid configuring every VETH interface is to enable promiscuous mode for the bridge.
What is promiscuous mode? In normal mode, a network adapter only receives packets destined for it (including broadcast packets) and passes them to the upper layers; other packets are discarded. In promiscuous mode, the adapter receives all packets passing through it, including packets not destined for the local host. In other words, MAC addresses are not checked.
If promiscuous mode is enabled on a bridge, it is equivalent to enabling hairpin mode on all ports connected to the bridge (in this case, the veth interfaces). Promiscuous mode can be enabled on bridge_home with the following command:
$ ip link set bridge_home promisc on
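You can verify that the flag is set by looking at the bridge's link flags, which should now include PROMISC:
$ ip link show bridge_home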
Now even if you turn the hairpin mode of the Veth interface off:
$ brctl hairpin bridge_home veth_dustin off
You can still pass the connectivity test:
$ ip netns exec netns_dustin curl 10.100.100.100:8080
Optimize the MASQUERADE
In the section at the beginning of the article where we prepared the experimental environment, we ran this command:
$ iptables \
--table nat \
--append POSTROUTING \
--source 10.0.0.0/24 \
--jump MASQUERADE
This iptables rule masquerades all traffic coming from 10.0.0.0/24. Kubernetes, however, does not do this; it only masquerades traffic from specific IP addresses in order to improve performance.
To simulate Kubernetes more faithfully, let's keep modifying the rules. First delete the previous rule:
$ iptables \
--table nat \
--delete POSTROUTING \
--source 10.0.0.0/24 \
--jump MASQUERADE
Then add a rule for the specific IP:
$ iptables \
--table nat \
--append POSTROUTING \
--source 10.0.0.11/32 \
--jump MASQUERADE
Sure enough, all the tests above still pass. But right now there are only two network namespaces; what if there are many, and each one needs its own iptables rule? Then what was the point of using IPVS in the first place? We adopted it precisely to avoid too many iptables rules dragging down performance, and we would be right back where we started.
Don’t panic, keep learning from Kubernetes and use ipset to solve this problem. Delete the previous iptables rule:
$ iptables \
--table nat \
--delete POSTROUTING \
--source 10.0.0.11/32 \
--jump MASQUERADE
Then create a set using ipset:
$ ipset create DUSTIN-LOOP-BACK hash:ip,port,ip
This command creates a set called DUSTIN-LOOP-BACK. It is a hash that stores the destination IP, destination port, and source IP.
Then add an entry to the collection:
$ ipset add DUSTIN-LOOP-BACK 10.0.0.11,tcp:8080,10.0.0.11
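You can dump the set at any point to check its contents:
$ ipset list DUSTIN-LOOP-BACK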
Now no matter how many network namespaces there are, you only need to add one iptables rule:
$ iptables \
--table nat \
--append POSTROUTING \
--match set \
--match-set DUSTIN-LOOP-BACK dst,dst,src \
--jump MASQUERADE
There was no problem with the network connectivity test:
$ curl 10.100.100.100:8080
$ ip netns exec netns_leah curl 10.100.100.100:8080
$ ip netns exec netns_dustin curl 10.100.100.100:8080
Add another backend to the virtual service
Finally, we add the HTTP service in the network namespace netns_leah to the back end of the virtual service as well:
$ ipvsadm \
  --add-server \
  --tcp-service 10.100.100.100:8080 \
  --real-server 10.0.0.21:8080 \
  --masquerading
Add another entry to the ipset set DUSTIN-LOOP-BACK:
$ ipset add DUSTIN-LOOP-BACK 10.0.0.21,tcp:8080,10.0.0.21
Here comes the final test. Try running the following test several times:
$ curl 10.100.100.100:8080
You'll see that the round-robin scheduling algorithm works:
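If you prefer numbers to eyeballing the responses, the IPVS statistics view shows the connection counters increasing for both real servers as you repeat the curl (the exact numbers will of course differ):
$ ipvsadm --list --numeric --stats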
Conclusion
I believe that with the experiments and explanations in this article, you now understand how the kube-proxy IPVS mode works. In the experiments we also used ipset, which helps solve kube-proxy's performance problems in large clusters. If you have any questions about this article, feel free to leave me a comment.
Reference articles
- Why does Kubernetes require bridge-nf-call-iptables to be enabled?