0. Foreword

The K8s container network involves many kernel subsystems, such as IPVS, iptables, layer-3 routing, layer-2 forwarding, and the TCP/IP protocol stack. These complex subsystems can run into problems in specific scenarios that their designers never anticipated.

This article records the investigation of a packet-loss problem caused by abnormal iptables firewall state. The troubleshooting process was quite tortuous and ultimately relied on SystemTap, a tool the author now considers rather dated. With the author's current experience, the root cause could now be found with a single command: skbtracker, the tool mentioned in the author's earlier article on using eBPF to analyze a duplicate-packet problem in the container network. Seven months on, that tool has become powerful enough to solve 90 percent of everyday network problems.

This article was originally published inside Tencent in July 2019. Rereading it a year later, I still gained a lot from it, so I am sharing it with other peers. It also serves as an opener for sharing the author's recent experience of using eBPF tools to troubleshoot various kernel problems.

1. Problem description

An internal Tencent service ran into a strange network problem in a container scenario. When Git or SVN is used inside the container to pull code from the internal code repository, the pull occasionally fails; with the same version of Git on the Node where the container resides, there is no problem. The reason for calling it "strange" is that the problem had been analyzed for a long time and had passed through many colleagues' hands without the root cause ever being found. Challenging problems are quite attractive to me, so I started this interesting debugging session when there was no urgent task at hand.

From the customer's description, the problem reproduces with high probability: pull the Git repository from the Pod ten times and it is almost certain to hang once. A problem that reproduces reliably is generally not a hard problem. When the pull hung, the Git Server had stopped sending data to the Client, and there was no retransmission.

1.1 Network Topology

The service uses the TKE container network with one NIC and multiple IP addresses. The Node's primary NIC, eth0, is bound to a single IP address, while the secondary elastic NIC, eth1, is bound to multiple IP addresses that are routed to the containers, as shown in Figure 1-1.

Figure 1-1 TKE container network with one NIC and multiple IP addresses

1.2 Reproduction Method
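
Based on the description in section 1, a minimal reproduction sketch might look like the following, run from inside the Pod (the repository URL and the timeout value are placeholders, not the actual ones used):

```bash
#!/usr/bin/env bash
# Reproduction sketch (run inside the Pod): pull the repository repeatedly
# and flag attempts that hang. REPO and the 300 s timeout are placeholders.
REPO="git://git.example.internal/some-repo.git"

for i in $(seq 1 10); do
  rm -rf /tmp/repro && mkdir -p /tmp/repro
  if ! timeout 300 git clone "$REPO" /tmp/repro/repo; then
    echo "attempt $i hung or failed"
  fi
done
```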

1.3 Analyzing Packet Capture Files

Packets were captured simultaneously on eth1, veth_A1, and veth_B1. The Server continuously sends large packets to the Client; when the hang occurs, the Server stops sending packets to the Client, and no packets are retransmitted.
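
A sketch of a capture setup along those lines is shown below; the server address and Pod PID are placeholders, and veth_B1 is assumed here to be the interface visible inside the Pod's network namespace:

```bash
SERVER=203.0.113.10   # placeholder for the Git server address
POD_PID=12345         # placeholder: PID of any process inside the Pod

# Capture the same flow at each hop on the Node ...
tcpdump -i eth1    -s 96 -w eth1.pcap    host "$SERVER" &
tcpdump -i veth_A1 -s 96 -w veth_A1.pcap host "$SERVER" &

# ... and on veth_B1, assumed to sit inside the Pod's network namespace.
nsenter -t "$POD_PID" -n tcpdump -i veth_B1 -s 96 -w veth_B1.pcap host "$SERVER" &
```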

2. Troubleshooting process

Analyze the environment differences between Node and Pod:

  • The Node has much more memory than the Pod, yet the Node and the Pod have the same TCP receive buffer settings; this mismatch might cause memory allocation failures inside the Pod.
  • Packets entering and leaving the Pod traverse two extra network devices, veth_A1 and veth_B1; the problem might be caused by a conflict between some veth device feature and the TCP protocol, a bug in the veth virtual device, or a rate-limiting rule configured on the veth devices.

Analyze the captured packet files, which show the following characteristics:

  • Packets in both directions are buffered on devices eth1 and veth_A1: they are sent to the next hop only some time after arriving at the device.
  • In the stuck case, there is no retransmission on either the Server or the Client, and the number of packets captured on eth1 is always noticeably higher than on veth_A1, while the counts on veth_A1 and veth_B1 are always the same.

Analysis: TCP is a reliable transport protocol. If packets were lost for unknown reasons (such as rate limiting), they would be retransmitted. Therefore the sending end must have received some control signal and stopped sending on its own.

Guess 1: The wscale negotiation is inconsistent, and the wscale received by the Server is too small

During the TCP handshake, the window scale (wscale) value received by the Server is smaller than the Client's actual value. During transmission, the Client does not acknowledge promptly for a while because its receive buffer is still sufficient; the Server, however, believes the Client's window is full (inferring from the too-small wscale that the Client's receive buffer has been exhausted) and stops sending packets.

If the Server side uses IPVS as its access layer, the wscale negotiation can become inconsistent when SynProxy is enabled.

With this conjecture, the following verification was carried out (a command sketch follows the list):

  • Change the TCP receive buffer size (net.ipv4.tcp_rmem) to control the Client's wscale value
  • Modify the Pod memory configuration so that there is no difference between the Node and the Pod memory settings
  • Capture packets on the Server-side IPVS node and confirm the wscale negotiation result
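
A command-level sketch of these checks (item 2 is a platform-side configuration change and is not shown; the buffer values and capture file name are illustrative):

```bash
# 1) Shrink the TCP receive buffer inside the Pod so the Client negotiates a
#    smaller wscale (values are illustrative).
sysctl -w net.ipv4.tcp_rmem="4096 87380 1048576"

# 3) On the Server-side IPVS node, capture the handshakes and compare the
#    window scale advertised in each direction.
tcpdump -i any -nn -c 50 'tcp[tcpflags] & tcp-syn != 0' -w syn.pcap
tshark -r syn.pcap -Y 'tcp.flags.syn == 1' \
       -T fields -e ip.src -e ip.dst -e tcp.options.wscale.shift
```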

Each of the above checks was carried out, and the conjecture was rejected. In addition, the business side confirmed that although the IPVS module was in use, the SynProxy function was not enabled, so the conjecture of inconsistent wscale negotiation did not hold.

Guess 2: The devices buffer the packets

The TSO and GSO features are enabled on the devices, which greatly improves packet processing efficiency. In the container scenario, packets traverse two layers of devices, and the features are enabled on both layers; as a result, packets may be reordered or not sent out in time, confusing the TCP flow control algorithm and causing the sender to stop sending.

With this conjecture, the following verification was carried out:

1) Disable the advanced offload features on all devices (TSO, GSO, GRO, tx-nocache-copy, SG); a command sketch follows this list

2) Disable delayed ACK inside the container (net.ipv4.tcp_no_delay_ack) so that the Client actively acknowledges every packet sent by the Server
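
A sketch of the corresponding commands, using eth1 as the example interface; note that net.ipv4.tcp_no_delay_ack is not an upstream sysctl, so the last line assumes the vendor kernel used here provides it:

```bash
# 1) Turn off the offload features on the device (repeat for the veth
#    devices in their respective namespaces).
ethtool -K eth1 tso off gso off gro off sg off tx-nocache-copy off
ethtool -k eth1 | egrep 'segmentation|offload|scatter'

# 2) Disable delayed ACK inside the container. net.ipv4.tcp_no_delay_ack is
#    not an upstream sysctl; this assumes the kernel here provides it and
#    that 1 means "disable delayed ACK".
sysctl -w net.ipv4.tcp_no_delay_ack=1
```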

This conjecture also failed to be verified.

The ultimate solution: use a SystemTap script to catch the culprit

The facts showed that the conventional approaches were not working. The only certainty was that the problem had to be inside this CVM. Note that eth1 always captured several more packets than veth_A1. Earlier we assumed those packets were buffered, but buffered packets would eventually be sent out; since the capture kept running and the extra packets never appeared on veth_A1, these packets must have been dropped. That makes things straightforward: monitor where the packets are dropped and the problem becomes clear. Using SystemTap to monitor the skb release point and print a backtrace, you can quickly find the kernel function that causes the packet loss. Figures 2-1 and 2-2 show the SystemTap scripts.

Figure 2-1 Dropwatch script (without backtrace printing)

Figure 2-2 Dropwatch script (with backtrace printing)
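
The scripts in the figures are not reproduced here; the sketch below is based on the well-known dropwatch.stp example shipped with SystemTap, plus a backtrace-printing variant in the spirit of the text (filtering by symbol name rather than by the raw address used in the original script):

```bash
# Figure 2-1 analog: count skb free locations (the classic dropwatch.stp
# example that ships with SystemTap).
cat > dropwatch.stp <<'EOF'
global locations

probe begin { printf("Monitoring for dropped packets\n") }
probe end   { printf("Stopping dropped packet monitor\n") }

# Increment a counter for every location at which an skb is freed.
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }

# Report the drop locations every 5 seconds.
probe timer.sec(5)
{
  printf("\n")
  foreach (l in locations-)
    printf("%d packets dropped at %s\n", @count(locations[l]), symname(l))
  delete locations
}
EOF

# Figure 2-2 analog: print a backtrace only for drops inside nf_hook_slow
# (the original script matched on the raw address from /proc/kallsyms).
cat > dropwatch_bt.stp <<'EOF'
probe kernel.trace("kfree_skb")
{
  if (symname($location) == "nf_hook_slow") {
    printf("drop at %s\n", symdata($location))
    print_backtrace()
  }
}
EOF

stap --all-modules dropwatch.stp -g   # run while reproducing the problem
stap dropwatch_bt.stp -g              # then narrow down with the backtrace
```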

Run stap --all-modules dropwatch.stp -g and stap dropwatch.stp -g alternately. Combined with the function addresses in /proc/kallsyms, the drop address is used as the filter condition, so that the backtrace of the drop point can be printed precisely (Figure 2-2).

Run the stap --all-modules dropwatch.stp -g script and reproduce the problem; the output is shown in Figure 2-3:

Figure 2-3 Packet loss function

When there is no hang, nf_hook_slow does not appear; when the hang occurs, nf_hook_slow shows up on the screen, which basically confirms that the drop happens in this function. But there are many paths into this function, so a backtrace is needed to determine the call chain. Run the second script, stap dropwatch.stp -g, to obtain the list of drop addresses, compare them against the /proc/kallsyms symbol table entry ffffffff8155c8b0 T nf_hook_slow, and find that 0xffffffff8155c8b0 is the symbol start closest to the drop address 0xffffffff8155c9a3. Add a backtrace at this drop point, reproduce the problem once more, and the output in Figure 2-4 appears on the screen.
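
For reference, the kallsyms lookup described above amounts to something like this:

```bash
# Confirm which function the reported drop address falls in (addresses are
# the ones quoted above and are specific to this kernel build).
grep -w nf_hook_slow /proc/kallsyms
# ffffffff8155c8b0 T nf_hook_slow
# The drop address 0xffffffff8155c9a3 lies just past this start address,
# so the drop happens inside nf_hook_slow.
```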

Figure 2-4 Backtrace on the packet loss point

Figure 2-5 Connection table status

You can see that ip_forward calling nf_hook_slow results in the packet drop. Clearly the packet is dropped by a rule on the iptables FORWARD chain. Checking the rules on the FORWARD chain, the drop logic is -j REJECT --reject-with icmp-port-unreachable: when a packet is discarded, an ICMP port-unreachable control packet is sent back. This is basically the root cause. The ICMP feedback packet is generated on the Node, so it cannot be filtered out by the Client and Server IP addresses (its source IP is the Node's eth0 address and its destination is the Server). Running the SystemTap script and tcpdump at the same time, the icmp-port-unreachable packets were indeed captured.
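
A sketch of the corresponding checks on the Node (the REJECT rule shown in the comment is the form described above; other rules and counters are omitted):

```bash
# Inspect the FORWARD chain for the REJECT rule described above.
iptables -n -v -L FORWARD
#   ...
#   REJECT  all  --  *  *  0.0.0.0/0  0.0.0.0/0  reject-with icmp-port-unreachable

# Capture the ICMP feedback on the Node; its source is the Node's eth0
# address, so a Client/Server host filter would not catch it.
tcpdump -i any -nn 'icmp[icmptype] == icmp-unreach'
```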

Next, analyze why a connection that has already transmitted data normally suddenly has its subsequent packets dropped by this rule. Check the iptables rule: -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT. Only packets whose connection is in the RELATED or ESTABLISHED state are allowed through, so if the connection state changes during data transmission, subsequent inbound packets are rejected with port-unreachable. The conntrack tool was used to monitor the connection table; when the problem occurs, the connection state changes to FIN_WAIT and then to CLOSE_WAIT (Figure 2-5). Packet captures confirmed that Git opens two TCP connections when downloading data. After a certain period, the Server sends a FIN on one of them, while the Client does not immediately send its own FIN because it still has data to transmit. The connection state then changes quickly as follows:

ESTABLISHED (Server FIN) -> FIN_WAIT (Client ACK) -> CLOSE_WAIT
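
The conntrack observation described above can be reproduced with the event mode of the conntrack tool, for example:

```bash
# Watch conntrack events for connections to the Git server while reproducing
# the hang; the FIN_WAIT / CLOSE_WAIT transitions show up as [UPDATE] events.
# The server address is a placeholder.
conntrack -E -p tcp --orig-dst 203.0.113.10
```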

The Git protocol has a control channel and a data channel, and the data channel depends on the control channel. The control channel's state transition conflicts with the firewall rule, so the control channel becomes abnormal and the data channel fails along with it. (The details of Git's transfer protocol are left for study when there is time.) This shows that the stateful iptables rule does not handle half-closed connections well: as soon as one side (the Server in this scenario) actively closes the connection, the subsequent state of the connection no longer passes the rule (-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT).

Once the principle is understood, the solution is easy to come up with. Because the customer runs a multi-tenant container scenario, only the addresses of services that Pods actively access are opened up, and those services cannot actively connect to Pods. Since both the Git and SVN services are internal and their security is controllable, inbound allow rules were added for them, so traffic still passes the firewall even after the connection state switches.
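
As an illustration only, the relaxation amounts to inserting allow rules for the trusted internal service addresses ahead of the state check (the addresses are placeholders; the real rules are managed by the platform):

```bash
# Accept forwarded traffic from the trusted internal Git/SVN servers ahead of
# the conntrack state rule, so half-closed connections keep working.
iptables -I FORWARD 1 -s 203.0.113.10 -j ACCEPT
iptables -I FORWARD 1 -s 203.0.113.11 -j ACCEPT
```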

3. Reflections

During troubleshooting, we found that the container network environment still has room for optimization and improvement, for example:

  • The TCP receive and send buffer sizes are normally reasonable values computed from the actual physical memory at kernel startup, but in a container scenario, simply inheriting the Node's default values is clearly unreasonable. Other system resource quotas have similar problems.
  • The TSO and GSO features of NICs were originally designed to optimize protocol stack performance at the endpoint. In a container network scenario, however, is the Node a gateway or an endpoint? Whether this optimization has side effects for a gateway (packets being buffered by multiple devices and then sent in bursts) is worth examining. From the Pod's perspective, the Node in some sense plays the role of a gateway; how to play the dual roles of endpoint and gateway is an interesting problem.
  • iptables issues:
      ◦ The firewall state is mishandled in this scenario
      ◦ Rule synchronization becomes slow when there are too many iptables rules
      ◦ Service load-balancing problems (slow rule loading, a single scheduling algorithm, no health checks, no session affinity, low CPS)
      ◦ SNAT source-port conflicts and source-port exhaustion
  • IPVS issues:
      ◦ When too many services are configured, the statistics timer causes CPU soft-interrupt packet-receiving delays
      ◦ net.ipv4.vs.conn_reuse_mode causes a number of problems (see: github.com/kubernetes/…)

To date, most of the problems above have been solved on the TKE platform; some of the fixes have been submitted as patches to the kernel community, and the corresponding solutions have also been shared with the K8s community.
