Introduction

With the growing popularity of audio and video applications such as live broadcasting, short video and online conferencing in recent years, expectations for user experience keep rising. To deliver sound and pictures to users, the data has to be distributed over the network, so improving transmission efficiency in all kinds of network environments is key to improving user experience. If a TCP-based protocol is used to distribute audio and video data, requirements such as fast startup, fewer stalls and low latency in live-broadcast scenarios are hard to meet; the experience can be optimized by reducing connection setup time, reducing retransmissions, and so on. As a result, there is growing demand for transport protocols built on UDP, such as QUIC and RTP. At the same time, with the rise of cloud services, Golang has come into wide use. In the live-broadcast CDN scenario the streaming media server has to carry high traffic, and in our internal practice the combination of Golang + UDP meets the performance requirements. This article therefore shares how to improve the performance of a UDP-based transport protocol implemented in Golang.

Analysis method

In the early stage, with our reliable transport protocol based on UDP, the bandwidth our streaming media server could sustain was several times lower than with TCP, which blocked large-scale deployment of the UDP protocol. So we needed to improve the related performance, mainly by analyzing the program's flame graph and operating-system metrics (such as locks, soft interrupts, load distribution, etc.) to determine the cause of the low performance. Here are some common tools. pprof is Golang's performance analysis tool. It is simple and easy to use, so it appears frequently in performance analysis, bug hunting and similar scenarios. An easy way to use pprof is to import the net/http/pprof package, which automatically registers itself in the default HTTP server. If the service does not already expose an HTTP server, you can start one in a goroutine.

   1 import _ "net/http/pprof"
   2 go func() {
   3   _ = http.ListenAndServe(":5601", nil)
   4 }()
Copy the code

The profiles can then be fetched from http://127.0.0.1:5601/debug/pprof. Golang also provides a visualization tool: go tool pprof -http=:5555 http://127.0.0.1:5601/debug/pprof/profile opens a web page with a flame graph, from which we can see the cost of each part and work out our optimizations. Some operating-system overhead is not captured by pprof and has to be analyzed with other tools.

The most common ones are top, perf, sar, ss, ethtool, etc.

perf can be used to measure CPU overhead, cache misses, etc. top shows per-process and total CPU usage, soft-interrupt usage, and so on. sar collects PPS statistics for network adapters. ss and ethtool can be used to diagnose socket and NIC information. Through pprof, it is easy to see that the CPU overhead mainly lies in memory allocation, system calls, the runtime, and memory copies.

Here’s what works and what doesn’t.

System calls

mmsg

The system calls that send and receive packets account for a large share of the overhead. recvmmsg and sendmmsg provide the ability to receive and send multiple packets in a single system call, which reduces the number of system calls by orders of magnitude. By using these two APIs well, performance can be greatly improved.

int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
             int flags, struct timespec *timeout);

int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
             int flags);

The parameters specify the packets to read or send, the number of packets per call (vlen), some flags, and, for receiving, a timeout, which is a bit of a trap, as we'll see later. These two system calls are wrappers around recvmsg/sendmsg, roughly equivalent to calling sendmsg/recvmsg in a for loop.

sendmmsg

The main point of sendmmsg is how to aggregate more packets into a single system call. You can choose to aggregate packets from multiple sessions, which greatly increases the number of packets per call. However, this can affect quality, because it may take a little time to accumulate enough packets.
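
For illustration, here is a minimal sketch of aggregating queued packets from several sessions into one batched send using the batch API in golang.org/x/net/ipv4, which uses a single sendmmsg call on Linux. The session type, queue field and batch size are assumptions for the example, not our actual code.

package transport

import (
    "net"

    "golang.org/x/net/ipv4"
)

// session is a minimal stand-in for per-connection state (hypothetical).
type session struct {
    remoteAddr net.Addr
    pending    [][]byte // packets queued for sending
}

const batchSize = 64 // assumed; tune for the workload

// flushSessions aggregates queued packets from several sessions and sends
// them via WriteBatch, which issues one sendmmsg call on Linux.
func flushSessions(pc *ipv4.PacketConn, sessions []*session) error {
    msgs := make([]ipv4.Message, 0, batchSize)
outer:
    for _, s := range sessions {
        for _, pkt := range s.pending {
            if len(msgs) == batchSize {
                break outer
            }
            msgs = append(msgs, ipv4.Message{
                Buffers: [][]byte{pkt},
                Addr:    s.remoteAddr,
            })
        }
    }
    if len(msgs) == 0 {
        return nil
    }
    // WriteBatch may send fewer messages than requested; a production
    // implementation should retry the remainder.
    _, err := pc.WriteBatch(msgs, 0)
    return err
}

The *ipv4.PacketConn here is obtained with ipv4.NewPacketConn over the underlying net.UDPConn.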

recvmmsg

With a non-blocking socket, recvmmsg tends to return only a few packets: as soon as there is nothing left to read it returns EAGAIN, the system call ends, and we have to wait and retry. With a blocking socket, the call does not return until the specified number of packets has been read, even if the timeout is set (it is only checked after each packet). In this case we can use the MSG_WAITFORONE flag, which makes the call block only until at least one packet has been received. In the CDN scenario, the packets received by recvmmsg are mainly the ACK packets of the reliable transport protocol. If possible, we can consider lowering the ACK ratio to reduce the system-call and interrupt overhead of recvmmsg.
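
For reference, a minimal sketch of batched receiving via ReadBatch in golang.org/x/net/ipv4 (recvmmsg on Linux), passing MSG_WAITFORONE from golang.org/x/sys/unix so a blocking socket returns once at least one packet has arrived. As noted later, this API leaves buffer management to the caller, so the message buffers here are allocated once and reused; the batch size and buffer size are assumptions.

package transport

import (
    "log"
    "net"

    "golang.org/x/net/ipv4"
    "golang.org/x/sys/unix"
)

// readLoop receives packets in batches of up to 64 (assumed size) and hands
// each one to handle. The buffers are reused, so handle must finish with
// (or copy) the payload before the next ReadBatch call.
func readLoop(conn *net.UDPConn, handle func(payload []byte, from net.Addr)) {
    pc := ipv4.NewPacketConn(conn)

    const batch = 64
    msgs := make([]ipv4.Message, batch)
    for i := range msgs {
        msgs[i].Buffers = [][]byte{make([]byte, 1500)} // one MTU-sized buffer per message
    }

    for {
        // MSG_WAITFORONE: block until at least one packet arrives, then
        // return whatever else is already queued, up to the batch size.
        n, err := pc.ReadBatch(msgs, unix.MSG_WAITFORONE)
        if err != nil {
            log.Println("ReadBatch:", err)
            return
        }
        for i := 0; i < n; i++ {
            handle(msgs[i].Buffers[0][:msgs[i].N], msgs[i].Addr)
        }
    }
}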

The effect

According to the published test data, receiving 64 packets per call improves performance by about 20% [1]. That test has its limitations; in real application scenarios the improvement is much higher than this value.

With a better server configuration and network adapter, the batch count can be raised to 128, 256 and so on, which in our protocol-stack application can roughly double the effect.

GSO

GSO stands for Generic Segmentation Offload; its receive-side counterpart is GRO. Because of the MTU limit, if UDP data written in one call exceeds the MTU size and IP fragmentation is not disabled, the datagram is fragmented at the IP layer. However, IP fragmentation is bad for reliable transport, because the cost of loss is too high: if any fragment is lost, the whole datagram is lost. As mentioned earlier, reducing system calls is the goal, and GSO helps here too: we can write a much larger buffer per call and let the kernel split it into MTU-sized UDP packets. This only pays off when each write has enough data, and the kernel version requirements are high. In addition, on network adapters that support GSO offload, hardware acceleration can be enabled with ethtool -K eth0 tx-udp-segmentation on.

GSO can be used in two ways: one is to enable it with a socket option, the other is to set it per message through an OOB control message. We generally use the OOB way because it is more flexible; the drawback is that it copies a little more memory.

// Socket option level
setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gsoSize, sizeof(gsoSize))

// MSG level (OOB control message)
type ctlMsgHdr struct {
    Len   uint64
    Level int32
    Type  int32
}

hdr := (*ctlMsgHdr)(unsafe.Pointer(&b[0]))
hdr.Len = uint64(ctlMsgHdrSize + 2) // CMSG_LEN(2): header size plus 2 bytes of data
hdr.Level = SocketLevelUDP
hdr.Type = SocketTypeUDPSegment
binary.LittleEndian.PutUint16(b[ctlMsgHdrSize:ctlMsgHdrSize+2], uint16(gsoSize))

A buffer of up to 64K can then be written in a single call without IP fragmentation; the kernel splits it into multiple UDP packets according to gsoSize.
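
As a concrete sketch of the socket-option path in Go (an assumption on our part, not the exact code above), the SOL_UDP and UDP_SEGMENT constants are available in golang.org/x/sys/unix; the helper name is illustrative and UDP GSO requires Linux 4.18 or newer.

package transport

import (
    "net"

    "golang.org/x/sys/unix"
)

// enableGSO sets UDP_SEGMENT on the socket so that every buffer written
// afterwards is split by the kernel into gsoSize-byte UDP packets.
func enableGSO(conn *net.UDPConn, gsoSize int) error {
    raw, err := conn.SyscallConn()
    if err != nil {
        return err
    }
    var sockErr error
    err = raw.Control(func(fd uintptr) {
        sockErr = unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, gsoSize)
    })
    if err != nil {
        return err
    }
    return sockErr
}

With the per-message (OOB) approach, the same 2-byte segment size is carried in a control message as shown above, which lets each send choose its own segment size.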

According to Google's test data, the performance improvement can reach about 1.7 times [2]. However, it is hard to exploit when only a small amount of data is written per call; for example, when the live-broadcast bit rate is not high, the amount of data a single connection can write at a time is limited.

So for low-bit-rate live streams, where little data is written at a time, the effect of GSO is not particularly obvious.

At high bit rates this still gives a big boost and is worth trying.

If possible, data from several writes can also be aggregated to increase the amount handed to GSO in one call. However, since GSO splits the buffer into equal-sized segments, it becomes harder to control the boundaries and contents of each individual UDP packet.

Memory allocation

Packet-related memory allocation

int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
             int flags, struct timespec *timeout);

struct iovec {                    /* Scatter/gather array items */
    void  *iov_base;              /* Starting address */
    size_t iov_len;               /* Number of bytes to transfer */
};

struct msghdr {
    void         *msg_name;       /* Optional address */
    socklen_t     msg_namelen;    /* Size of address */
    struct iovec *msg_iov;        /* Scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* Ancillary data, see below */
    size_t        msg_controllen; /* Ancillary data buffer len */
    int           msg_flags;      /* Flags on received message */
};

For example, for the recvmmsg system call above, we need to prepare packet buffers, iovec structures, msghdr structures and so on. For a stack handling several million PPS, this memory cannot be allocated on every call; it has to be reused.

For the iovec, the same block of memory can be reused for every system call, with the corresponding values filled in each time.

For the destination address, OOB data and so on, we can also allocate them once and store them in the session, and simply pass their addresses. The OOB contents may differ from call to call; we can prepare several variants in advance, for example with and without GSO, and just pick the one we need.

The recvmmsg API provided by golang's github.com/golang/net/… does not deal with memory allocation itself, so it is not suitable for direct use.

The packet data buffer is also obviously reusable; in a simple scenario we can use sync.Pool. The same buffer can also hold the sockaddr, OOB and other areas.

type Buffer struct {
    ref       int64
    rawBuffer []byte
    Data      []byte
    SockAddr  []byte
    Oob       []byte
}

rawBuffer is the actual allocated memory region.

Data points to the packet's location within rawBuffer, and SockAddr and Oob likewise point to their own regions within rawBuffer.

This reduces what would otherwise be several sync.Pool operations and memory allocations to a single one.
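
A minimal sketch of how such a pool might be built, reusing the Buffer struct above; the region sizes are assumptions and the reference counting is simplified.

package transport

import (
    "sync"
    "sync/atomic"
)

const (
    dataSize     = 1500 // assumed MTU-sized packet area
    sockAddrSize = 128  // assumed, enough for a raw sockaddr
    oobSize      = 64   // assumed control-message area
)

var bufferPool = sync.Pool{
    New: func() interface{} {
        raw := make([]byte, dataSize+sockAddrSize+oobSize)
        return &Buffer{
            rawBuffer: raw,
            Data:      raw[:dataSize],
            SockAddr:  raw[dataSize : dataSize+sockAddrSize],
            Oob:       raw[dataSize+sockAddrSize:],
        }
    },
}

// getBuffer fetches a Buffer whose Data, SockAddr and Oob slices all share
// the single rawBuffer allocation.
func getBuffer() *Buffer {
    b := bufferPool.Get().(*Buffer)
    b.ref = 1
    return b
}

// putBuffer returns the Buffer to the pool once its reference count drops to zero.
func putBuffer(b *Buffer) {
    if atomic.AddInt64(&b.ref, -1) == 0 {
        bufferPool.Put(b)
    }
}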

interface

In Golang, an empty interface (interface{}) is represented by the eface struct below.

type eface struct {
    _type *_type
    data  unsafe.Pointer
}

When a variable of a concrete type is converted to an interface, if the value has to be copied, space must be allocated for it, so this is a potential source of memory allocation. The code below shows how a 64-bit integer is converted to an interface; recent versions add an optimization that uses pre-allocated values for numbers below 256.

func convT64(val uint64) (x unsafe.Pointer) {
    if val < uint64(len(staticuint64s)) {
        x = unsafe.Pointer(&staticuint64s[val])
    } else {
        x = mallocgc(8, uint64Type, false)
        *(*uint64)(x) = val
    }
    return
}

Here’s an example:

log.Debug(format string, v ...interface{})

Take the debug log function as an example: although the debug level is not enabled online and the body never runs, the parameters are interface values, so every call still involves a type conversion, and with a large number of calls the memory-allocation overhead is significant. Here is a simple benchmark to illustrate the problem.

var showDebug bool

func debug(format string, args ...interface{}) {
    if showDebug {
        fmt.Println(format, args)
    }
}

func BenchmarkWithInterface(b *testing.B) {
    for i := 0; i < b.N; i++ {
        debug("", 1)
    }
}

func BenchmarkNoneInterface(b *testing.B) {
    for i := 0; i < b.N; i++ {
        if showDebug {
            debug("", 1)
        }
    }
}

The test results

BenchmarkWithInterface
BenchmarkWithInterface-12    50442586      23.3 ns/op
PASS

BenchmarkNoneInterface
BenchmarkNoneInterface-12    1000000000    0.343 ns/op
PASS

Locking overhead

After we introduced a batch of 96-core servers, performance did not improve; it actually decreased.

perf top showed relatively high CPU usage in __raw_caller_save___pv_queued_spin_unlock and trigger_load_balance.

According to our tests, the service capacity of an 88-core machine and a 44-core machine was about the same. So you can expect roughly double the performance by splitting an 88-core machine into two 44-core instances.

It is therefore recommended to virtualize or containerize large machines into multiple instances with fewer cores, or to run multiple processes, to make better use of the hardware.

Even on machines with two or more physical CPUs, binding one process per CPU gives a considerable improvement, since it reduces the cost of cross-CPU traffic. The binding can be controlled with numactl.

zerocopy

if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        error(1, errno, "setsockopt zerocopy");

ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (ret == -1)
        error(1, errno, "recvmsg");

UDP ZEROCOPY is supported since kernel 5.0. The ZEROCOPY API has two parts: declaring MSG_ZEROCOPY when sending, and reading completion notifications from the error queue, which tell us when the buffer memory can be reclaimed.

However, for small packets ZEROCOPY is a net loss, because one more system call is needed. In addition, with UDP a packet larger than the MTU cannot be sent in one message; even with GSO enabled the limit is 64K, and the data provided by Google shows about a 13% improvement in that case [3].

According to our tests, the negative impact is quite large when the GSO batch is small. This feature is also complicated to use, so we do not recommend it; user-space I/O (UIO-related) technologies are worth considering instead.

RPS

After we added a number of VMs with a relatively large number of cores, we found that the load would not go up. Looking at per-CPU load, only a few cores were busy. After some research, we learned that this can happen when the number of receive queues does not match the number of CPU cores, so the receive soft interrupts all land on the first few cores.

ethtool can show the number of NIC queues. If it does not match the number of CPU cores, there are a couple of ways to handle it. The lesser option is to configure RPS, which lets a queue be mapped to chosen CPU cores. This actually adds some extra CPU overhead, because the core that handles the receive soft interrupt then redistributes packets to the other cores, but the advantage is that otherwise idle cores get used.

RPS is configured through the file /sys/class/net/eth0/queues/rx-[n]/rps_cpus. Its content is a hexadecimal bitmask describing the mapping from that NIC queue to CPU cores, where a binary 1 means the corresponding core is included.

IO model

The key measures

  • reuseport
  • Sockets created directly via syscall
  • Lock-free queue
  • One processing loop per CPU

Detailed implementation and effect

While doing this performance work, I also explored some code-structure models. They are offered as a reference; every project is different, so they are not necessarily a good fit.

  1. Create sockets directly through syscall, because sockets created by golang's net API are bound to epoll. In our scenario the number of sockets is fixed and small, and we do not need I/O multiplexing to increase throughput, so this avoids the CPU overhead of epoll. The sockets are set to non-blocking so that sending and receiving can be handled inside a goroutine. The created sockets use SO_REUSEPORT so that the load is balanced across all CPU cores and the machine is better utilized. Today's network cards are generally multi-queue, so traffic can be distributed to multiple cores; the distribution policy is usually set with ethtool -N eth0 rx-flow-hash udp4 sdfn (see the man page for details). A minimal sketch of this setup follows the list below.

  2. Use one goroutine per group of links. Each of these goroutines handles packet receiving, packet sending, timeouts, etc. for its set of links, and can conveniently aggregate data from multiple links into a single system call. The protocol stack, the application layer and the packet send/receive layer exchange data through lock-free queues. The main purpose of this structure is to reduce Golang scheduling and lock overhead: the number of goroutines does not grow with the number of links, which also reduces scheduling pressure.

  3. The goroutines in the previous two steps can call runtime.LockOSThread to pin themselves to an OS thread, which in practice raises their scheduling priority and avoids scheduling delays.
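
As a rough reference for step 1 and step 3, here is a minimal sketch using golang.org/x/sys/unix: a non-blocking UDP socket created directly via syscalls with SO_REUSEPORT, and a per-core loop pinned with runtime.LockOSThread. The function names and the simple busy-poll structure are illustrative assumptions, not our exact implementation.

package transport

import (
    "runtime"

    "golang.org/x/sys/unix"
)

// newUDPSocket creates a UDP socket outside the Go net poller, with
// SO_REUSEPORT so several sockets can share one port and the kernel spreads
// incoming traffic across them. Error handling and address setup are simplified.
func newUDPSocket(port int) (int, error) {
    fd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM|unix.SOCK_NONBLOCK, 0)
    if err != nil {
        return -1, err
    }
    if err := unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_REUSEPORT, 1); err != nil {
        unix.Close(fd)
        return -1, err
    }
    addr := &unix.SockaddrInet4{Port: port} // listens on 0.0.0.0
    if err := unix.Bind(fd, addr); err != nil {
        unix.Close(fd)
        return -1, err
    }
    return fd, nil
}

// ioLoop is one per-core processing loop: it locks itself to an OS thread
// and polls its socket; in a real implementation it would also drain the
// lock-free send queues and use batched receives as described earlier.
func ioLoop(fd int, buf []byte) {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    for {
        n, _, err := unix.Recvfrom(fd, buf, 0)
        if err == unix.EAGAIN {
            // Non-blocking socket: nothing to read right now.
            continue
        }
        if err != nil {
            return
        }
        _ = buf[:n] // hand the packet to the protocol stack here
    }
}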

Conclusion

In a golang protocol stack, performance problems can be divided into two levels: the language level and the system level. At the language level, memory allocation and runtime overhead are generally high; at the system level, system calls and locks are the expensive parts. Open-source implementations such as quic-go have the same problems, which need to be analyzed and solved step by step.

With the advent of QUIC and HTTP/3, the kernel is starting to gain more UDP-related functionality and performance work, so there are likely to be more ways to improve performance. On the other hand, Golang is not ideally suited to this kind of high-performance, scheduling-sensitive work: GC and memory allocation are hard to control, and the language abstracts away too much of the operating system.

Now that server configurations are getting higher and core counts larger, code easily runs into lock contention and the extra overhead rises significantly. So containerization, virtualization and other techniques that split a machine into multiple instances with fewer cores are also worthwhile.

References

[1] lwn.net/Articles/44…

[2] vger.kernel.org/lpc_net2018…

[3] patchwork.ozlabs.org/project/net…

[4] lwn.net/Articles/75…

[5] lwn.net/Articles/35…