About the author: Feng Zhiming has been responsible for search-algorithm work since 2019. He is experienced with complex business systems and has a strong interest in underlying technology.

Main text

In the last article we discussed the structure of Linux network I/O.

Why does it have so many layers, and why is network I/O so complex?

Some terms, such as MSS and IFG, were also left unexplained. This article clarifies them.

Linux network I/O is built entirely on the seven-layer network model (OSI), and the kernel space covers the layers from the physical layer up to the transport layer. So to understand the structure of Linux network I/O, you need to understand the seven-layer model.

This article focuses on the constraints that the seven-layer model places on network I/O and on the abstraction each layer provides to the layer above it, rather than on the protocols themselves.

The OSI architecture has seven layers, each of which performs a relatively independent part of the information exchange task and has specific functions.

1 Physical Layer

Physically, data is represented as “signals” in different “media”.

Common media include wire, optical fiber, air, and vacuum. The corresponding signals are electrical signals, optical signals, and electromagnetic signals. The physical layer abstracts the signals in these different media into zeros and ones, which serve as the basis for information at the link layer.

2 Link Layer

2.1 Introduction

Signals are easily interfered with, and the transmitting equipment may not be 100% accurate. So how can the complete transmission of information be guaranteed?

The answer: the 0/1 signal is abstracted into data frames that are separated by interframe gaps, and a frame check ensures that the data of each frame is complete.

Using Ethernet as an example:

Preamble (7 bytes), used for clock synchronization.

Start frame delimiter (SFD, 1 byte), indicating the start of a frame.

Data frame, the protocol data unit (PDU) of the data link layer.

Interframe gap (IFG), 12 bytes: the idle time between two adjacent frames, used for clock and data recovery.

A data frame consists of three parts: the frame header, the data section, and the frame tail.

  • Frame header (18 bytes): destination MAC address (6 bytes), source MAC address (6 bytes), 802.1Q tag (4 bytes), and EtherType (2 bytes)
  • Data section (46 to 1500 bytes)
  • Frame tail (4 bytes): the frame check sequence, a checksum used to validate the frame
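To get a feel for what these fields cost, the sizes listed above can be added up per frame. The following is a minimal sketch in Python; the function names `wire_bytes` and `efficiency` are illustrative and not part of any library, and the 18-byte header assumes the 802.1Q tag is present, as in the list above.

```python
# Field sizes in bytes, taken from the Ethernet description above.
PREAMBLE = 7
SFD = 1
HEADER = 18   # assumes the 4-byte 802.1Q tag is present
FCS = 4
IFG = 12      # idle time, counted here in byte times


def wire_bytes(payload_len: int) -> int:
    """Byte times consumed on the wire by one frame carrying payload_len bytes."""
    if not 46 <= payload_len <= 1500:
        raise ValueError("payload must be between 46 and 1500 bytes")
    return PREAMBLE + SFD + HEADER + payload_len + FCS + IFG


def efficiency(payload_len: int) -> float:
    """Fraction of wire time that carries payload rather than overhead."""
    return payload_len / wire_bytes(payload_len)


print(wire_bytes(1500))            # 1542
print(round(efficiency(1500), 3))  # 0.973
print(round(efficiency(46), 3))    # 0.523
```

Under these assumptions every frame pays a fixed 42 bytes of overhead, which is why small frames use the link much less efficiently than full-size ones.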

2.2 Abstraction

Through the link layer’s abstraction, data is represented as data frames. What the link layer provides to the network layer is the data section of the frame, called a datagram.

2.3 MTU / PMTU

MTU (Maximum Transmission Unit): the maximum size of a packet that the link layer can carry.

The network layer must respect the MTU. A packet larger than the MTU cannot be sent as is: it is either fragmented (see section 3.3) or discarded.

The MTU table of common media is as follows. The default MTU of Ethernet is 1500 bytes.

PMTU (Path MTU): the smallest MTU along the communication path between two hosts.

The PMTU between two hosts is not necessarily constant. It depends on the path chosen at the time, and route selection is not necessarily symmetric (the route from A to B may differ from the route from B to A). Therefore, the PMTU is not necessarily the same in both directions.

2.4 Test your PMTU

When the payload of an ICMP packet is larger than 1472 bytes (and fragmentation is not allowed), the transmission fails and the error message reports MTU=1500. The ICMP header is 8 bytes and the IP header is 20 bytes, so MTU = 1472 + 8 + 20 = 1500.

3 Network Layer (IPv4)

3.1 Introduction

The Internet Protocol (IP) is a protocol used in packet-switched data networks to transmit data based solely on the addresses of the source and destination hosts. It provides an “unreliable” packet transport mechanism (also known as “best-effort delivery”), and IP packets can take a different route each time.

3.2 IPv4 Headers

An IPv4 datagram has a variable-size header. A typical IPv4 header (one without options) is 20 bytes.

3.3 Fragmentation

IP can run over different link layers and must adapt to different MTUs. Therefore, IP packets are designed so that they can be fragmented.

When a device receives an IP packet, it examines the destination address and decides which link to send the packet on. The MTU of that link determines the maximum length of the packet. If the IP packet is larger than the MTU, it must be fragmented. The data in each fragment must be no longer than the MTU minus the length of the IP header.

A packet can be fragmented more than once.

For example, 4,300 bytes of data transmitted on a link with an MTU of 2,500 bytes will be fragmented as follows.

If the MTU of the next hop is then 1,500 bytes, each fragment is split into two fragments again.
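The arithmetic of this example can be checked with a short sketch. It assumes that the 4,300 bytes are the IP payload, that every IP header is 20 bytes (no options), and that the data of every fragment except the last is a multiple of 8 bytes, as IPv4 requires. The function `fragment` and its tuple layout are illustrative only; offsets are shown in bytes, whereas the real header field stores them in units of 8 bytes.

```python
IP_HEADER = 20  # bytes, assuming an IPv4 header without options


def fragment(payload_len, mtu, base_offset=0, more_after=False):
    """Split a payload into (offset, length, mf) tuples that fit the MTU.

    base_offset and more_after describe the piece being split, so that an
    existing fragment can be fragmented again at a later hop.
    """
    max_data = (mtu - IP_HEADER) // 8 * 8  # fragment data must align to 8 bytes
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    pos = 0
    while pos < payload_len:
        length = min(max_data, payload_len - pos)
        is_last = pos + length == payload_len
        # MF stays set if this is not the last piece, or if the piece being
        # split was itself followed by more fragments.
        fragments.append((base_offset + pos, length, more_after or not is_last))
        pos += length
    return fragments


first_hop = fragment(4300, 2500)
second_hop = [
    piece
    for offset, length, mf in first_hop
    for piece in fragment(length, 1500, offset, mf)
]
print(first_hop)   # [(0, 2480, True), (2480, 1820, False)]
print(second_hop)  # [(0, 1480, True), (1480, 1000, True), (2480, 1480, True), (3960, 340, False)]
```

Under these assumptions the first hop produces two fragments and the second hop turns each of them into two, for four fragments in total.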

3.4 Reassembly

When a receiver finds that the MF (More Fragments) flag of an IP packet is 1, or that its fragment offset is not 0, it knows that the packet is a fragment. The receiver has to store the fragments. Once it has collected all the fragments of a packet, it assembles them in the correct order and hands the result to the upper-layer protocol.
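The reassembly logic can be sketched as follows. This is a simplified illustration, not how a kernel implements it: fragments are modeled as hypothetical `(offset, data, mf)` tuples belonging to one packet, offsets are in bytes, and overlapping fragments and timeouts are ignored.

```python
def is_fragment(mf, offset):
    """A packet is a fragment if MF is set or its offset is non-zero."""
    return mf or offset != 0


def reassemble(fragments):
    """Return the full payload, or None while fragments are still missing.

    fragments: iterable of (offset, data, mf) tuples in any arrival order.
    """
    by_offset = {offset: (data, mf) for offset, data, mf in fragments}
    payload = bytearray()
    pos = 0
    while pos in by_offset:
        data, mf = by_offset[pos]
        if not data and mf:
            return None  # an empty non-final fragment can never make progress
        payload += data
        pos += len(data)
        if not mf:
            return bytes(payload)  # last fragment reached with no gaps
    return None  # a gap remains, so the receiver must keep waiting


arrived = [(8, b"B" * 8, True), (0, b"A" * 8, True), (16, b"CC", False)]
print(reassemble(arrived))      # b'AAAAAAAABBBBBBBBCC'
print(reassemble(arrived[:2]))  # None
```

The second call shows why a missing last fragment is a problem: the receiver has to hold on to what it already has until the rest arrives or it gives up.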

3.5 Problems Caused by IP Fragmentation

  1. Fragmentation and reassembly consume CPU resources on both sides, and reassembly consumes a lot of memory on the receiving side.
  2. If one fragment is lost in transit, the sender must retransmit all the fragments, which consumes additional network resources.
  3. An attacker can send fragments of a packet but never send the last one, so that the receiver’s memory is exhausted by packets that can never be completed.
  4. Fragments other than the first do not contain the layer 4 protocol header, which makes it difficult for a firewall to apply its control policies.

3.6 Abstraction

At the IP layer, data is abstracted as datagrams, and the pieces a datagram is split into are called fragments. By using fragmentation to get around the MTU limit, the IP layer delivers “logically complete packets” to the receiver. In other words, the abstraction that the IP layer provides to the transport layer is the ability to transport packets (or TCP segments).

4 Transport Layer (Using TCP as an Example)

4.1 Introduction

TCP (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream-based transport layer protocol. TCP is a large subject, so this article mentions only a few key parts.

4.2 Transmission Process

  1. The application layer sends a data stream to the TCP layer.
  2. The TCP layer divides the data stream into segments of appropriate length and hands them to the IP layer, which delivers them to the TCP layer at the receiving end.
  3. To ensure reliable transmission, each segment is assigned a sequence number, and the receiving end sends an ACK after receiving the segment.
  4. If the sender does not receive an acknowledgment within a certain amount of time (a timeout derived from the round-trip time, RTT), it retransmits the segment.
  5. Using a checksum, the TCP layer checks each segment for errors. A corrupted segment is discarded and gets retransmitted.
  6. Because the IP layer cannot guarantee the order in which packets arrive, the TCP receiver reorders the segments by sequence number.
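The receiving side of steps 3 and 6 can be sketched as a small class. This is a toy model under simplifying assumptions: the class `Receiver` and its method names are hypothetical, sequence numbers start at 0 and never wrap around, and partially overlapping segments are not handled.

```python
class Receiver:
    """Toy TCP receiver: buffers out-of-order segments, delivers bytes in order."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq  # next byte expected in order
        self.pending = {}            # seq -> data for segments that arrived early
        self.stream = bytearray()    # bytes already delivered to the application

    def on_segment(self, seq, data):
        """Accept one segment and return the cumulative ACK number."""
        if seq >= self.next_seq and data:
            self.pending.setdefault(seq, data)  # old duplicates are dropped
        # Deliver every segment that is now contiguous with the stream.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.stream += chunk
            self.next_seq += len(chunk)
        return self.next_seq


receiver = Receiver()
print(receiver.on_segment(5, b"world"))  # 0, because bytes 0..4 are still missing
print(receiver.on_segment(0, b"hello"))  # 10
print(bytes(receiver.stream))            # b'helloworld'
```

The ACK number only advances when the gap is filled, which is what lets the sender decide what still needs to be retransmitted.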

4.3 MSS

Because it can fragment, the IP layer does not limit packet size to the MTU. But for performance and transmission efficiency, the TCP layer negotiates a limit based on the MTU. That limit is the MSS.

MSS (Maximum Segment Size), in bytes, is a TCP concept at the transport layer. It is the maximum length of application data that one TCP segment can carry. It counts only the TCP payload, not the TCP header or TCP options.

During the three-way handshake, each side uses the MSS option in its SYN packet to announce the MSS it accepts. The MSS used for sending is the smaller of the values advertised by the two parties. Because the PMTU is not necessarily the same in both directions, the MSS may also differ between directions.

A large MSS reduces the share of bytes spent on TCP headers and improves transmission efficiency.

But an MSS that is too large causes IP fragmentation, which reduces transmission efficiency at the IP layer. In the network layer example above, if we keep the data in each packet to no more than 1,480 bytes, three IP packets are enough to transmit the 4,300 bytes of data.

Therefore, MSS = MTU - IP header - TCP header is recommended. The common result on Ethernet is MSS = 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes.
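The formula is easy to check in code. The sketch below assumes 20-byte IP and TCP headers (no options); the function names and the peer MTU of 1400 are made up for illustration.

```python
import math


def mss_for(mtu, ip_header=20, tcp_header=20):
    """Recommended MSS: MTU minus the IP and TCP headers (no options assumed)."""
    return mtu - ip_header - tcp_header


def segments_needed(data_len, mss):
    """Number of TCP segments needed to carry data_len bytes."""
    return math.ceil(data_len / mss)


ethernet_mss = mss_for(1500)
print(ethernet_mss)                         # 1460
print(segments_needed(4300, ethernet_mss))  # 3

# Hypothetical peer behind a link with a smaller MTU: the smaller value wins.
print(min(mss_for(1500), mss_for(1400)))    # 1360
```

With an MSS of 1460, the 4,300 bytes from the earlier example fit in three segments, each of which fits in one unfragmented IP packet on a 1500-byte MTU link.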

4.4 Flow Control

Because the receiver has to reassemble segments and TCP is stream-oriented, TCP reassembly occupies more memory than network-layer reassembly.

Flow control keeps the sender from sending data faster than the receiver can accept it. To make this possible, the receiver tells the sender how much it may send.

TCP uses the sliding window protocol for flow control. The receiver states the number of bytes it can still accept (the space remaining in its receive buffer) in the receive window field. The sender may send at most that many bytes without waiting for a new acknowledgment. The receiver can change the value of the receive window.

It is called a sliding window because the range of data that may be sent is determined by the size of the receive window, the highest sequence number acknowledged so far, and the MSS. That range slides forward and changes size as acknowledgments arrive, and there is no strict order within it.
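How the three quantities interact can be shown with a small sketch. It is a simplification that ignores congestion control and sequence number wrap-around; the function `sendable_segments` and its parameter names are illustrative.

```python
def sendable_segments(last_ack, next_seq, rwnd, mss, backlog):
    """Sizes of the segments the sender may transmit right now.

    last_ack: highest sequence number acknowledged so far
    next_seq: sequence number of the next new byte to send
    rwnd:     receive window advertised by the receiver, in bytes
    backlog:  bytes the application has queued but not yet sent
    """
    in_flight = next_seq - last_ack      # sent but not yet acknowledged
    usable = max(0, rwnd - in_flight)    # room left inside the window
    to_send = min(usable, backlog)
    sizes = []
    while to_send > 0:
        size = min(mss, to_send)
        sizes.append(size)
        to_send -= size
    return sizes


# Nothing in flight: the whole 4000-byte window can be filled.
print(sendable_segments(0, 0, 4000, 1460, 10000))     # [1460, 1460, 1080]
# The first segment is acknowledged: the window slides and frees 1460 bytes.
print(sendable_segments(1460, 4000, 4000, 1460, 6000))  # [1460]
# The receiver advertises a zero window: the sender must wait.
print(sendable_segments(1460, 4000, 0, 1460, 6000))     # []
```

Each acknowledgment moves the left edge of the window, and each new window advertisement moves the right edge, which is the sliding behavior described above.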

4.5 Congestion control

Early TCP implementations had no congestion window. As long as the transmission path has no problems, the sliding window is enough to control the transfer. But if the path has a problem, for example bandwidth too limited to carry that much data, TCP falls back on its acknowledgment and retransmission mechanism, and the retransmissions further reduce the network’s capacity.

That is why TCP introduced congestion control (the congestion window). One application of it is slow start. In simple terms, the sender sends only a small amount of data at first and, after receiving ACKs, sends more each time until it reaches the maximum processing capacity of the receiver or the maximum load capacity of the transmission path. The amount of data the sender may send each time is the size of the congestion window.
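The growth pattern of slow start can be sketched in a few lines. This is a deliberately simplified model: the window is counted in segments, it doubles once per round trip (which is what gaining one segment per ACK amounts to), and a single `limit` parameter stands in for the receiver's or path's capacity. Real TCP stacks also track a slow start threshold and react to loss, which is left out here.

```python
def slow_start(initial_cwnd, limit, rounds):
    """Congestion window (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        cwnd = min(cwnd * 2, limit)  # every ACKed segment adds one segment
    return history


print(slow_start(initial_cwnd=1, limit=20, rounds=7))  # [1, 2, 4, 8, 16, 20, 20]
```

Despite its name, the window grows exponentially, so the sender reaches the capacity limit within a few round trips.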

4.6 Abstraction

Data is abstracted as packets (or TCP segments) at the TCP layer.

The abstraction that the TCP layer provides to the application layer is called a stream. The operating system abstracts a TCP connection as a socket, which serves as the programming interface for programs.
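The stream abstraction is visible directly in the socket API. The sketch below uses Python's standard `socket` module over the loopback interface; the helper name `send_and_collect` is made up, and it assumes the data is small enough to fit in the kernel's socket buffers so that a single thread can play both sides.

```python
import socket


def send_and_collect(chunks):
    """Send each chunk over a loopback TCP connection and return what arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _ = listener.accept()
            with conn:
                for chunk in chunks:
                    client.sendall(chunk)
                client.shutdown(socket.SHUT_WR)  # signal end of stream
                received = bytearray()
                while True:
                    data = conn.recv(4096)
                    if not data:  # empty read means the peer finished sending
                        break
                    received += data
    return bytes(received)


print(send_and_collect([b"hello, ", b"world"]))  # b'hello, world'
```

The receiver sees one continuous run of bytes. How many `recv` calls it takes, and where the boundaries between them fall, is not tied to how the sender split its writes, which is exactly what “stream” means here.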