This article first briefly introduces the mechanism and typical use cases of TCP Keepalive, then shows how to enable and configure it in Go. Because Go's high-level interface is not flexible enough, it goes on to show how to use system calls to set options on a TCP connection's file descriptor. The original author then ran into a pitfall with that approach. Finally, it shows how the newer interface available since Go 1.11 can be used to set file descriptor options on TCP connections. I made some additions and deletions while translating to make the article easier to read; it is not a word-for-word translation. Original article: Notes on TCP keepalive in Go | TheNotExpert.

I have a TCP server program that clients connect to. It's very simple. The problem is that all the clients are on mobile networks, which are always unstable. When a connection drops, the server is often not notified through a FIN or RST packet, so it keeps the dead connection around and believes the client is still online when it is not.

My first solution was a plain timeout: if a client does not send any data within a given time, the server closes the connection. (It is worth mentioning that the SetDeadline method is very useful here; once the deadline passes, Conn.Read returns an I/O timeout error.) But there is a trade-off: I can't make the timeout too short, because a client may simply be slow to produce data, and I can't make it too long, because then I would misjudge whether the client is online, and I need some precision there.

My next idea was to ping the client. But I don't want to send the client junk data that it doesn't need. Also, the client code is not under my control, so I can't be sure how the client will behave if I send it unexpected data.

TCP Keepalive – a lightweight ping

TCP Keepalive sends TCP packets with little or no payload to the peer, and the peer replies with an ACK. It is not part of the TCP standard (although it is described in RFC 1122), and it is always disabled by default. However, most modern TCP stacks support it.

Most implementations boil down to three main parameters:

  • Idle time: how long to wait after the last received packet before sending a ping.
  • Retry interval: if a ping is sent and no ACK arrives from the peer within the retry interval, the ping is sent again.
  • Number of retries (ping amount): how many pings may go unanswered (no ACK from the peer) before the connection is considered dead.

For example, the idle time is 30 seconds, the retry interval is 5 seconds, and the number of retries is 3. Here’s how it works:

The server receives a packet of application-layer data from the client, and then the client stops sending. The server waits 30 seconds and sends a ping to the client. If the server receives an ACK, it waits another 30 seconds and pings again. If the server receives data within the 30 seconds, the 30-second timer is reset.

If the server does not receive the ACK, it waits 5 seconds and sends the ping again. Still no response after another 5 seconds? It sends the last ping and waits a final 5 seconds (yes, the retry interval applies to the last ping as well). After that, the server considers the connection timed out and closes it.

Default values

Windows reportedly waits 2 hours by default before sending the first keepalive ping. On Linux the defaults are easy to read, as described here in Section 3.1.1.

    # Idle time
    cat /proc/sys/net/ipv4/tcp_keepalive_time
    # Retry interval
    cat /proc/sys/net/ipv4/tcp_keepalive_intvl
    # Number of pings
    cat /proc/sys/net/ipv4/tcp_keepalive_probes
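The same three files can also be read from a Go program, for example to log the system defaults at startup. This is a Linux-only sketch; parseSysctlInt and readSysctlInt are hypothetical helper names, and the program simply reports an error on systems without these files.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSysctlInt parses the content of a /proc/sys file that holds one integer.
func parseSysctlInt(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readSysctlInt reads one integer-valued file under /proc/sys (Linux only).
func readSysctlInt(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseSysctlInt(string(data))
}

func main() {
	names := []string{"tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"}
	for _, name := range names {
		v, err := readSysctlInt("/proc/sys/net/ipv4/" + name)
		if err != nil {
			fmt.Println(name, "unavailable:", err)
			continue
		}
		fmt.Println(name, "=", v)
	}
}
```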

How do you set this up in Go?

Since I’ve been using Go a lot lately, I need to use TCP Keepalive in Go.

Before the discussion begins, it is important to note that the following applies to Linux. I’m not 100% sure it will work for OSX, but I’m almost sure it won’t work for Windows.

A special type of connection

First, I noticed that my server program only used the net.Conn type. That is not enough, because it lacks the specific methods we need. We need the TCPConn type.

This means that we need to use ListenTCP and AcceptTCP instead of Listen and Accept (the calls differ in that ListenTCP takes a structure rather than a string to represent the address). The call looks something like this: net.ListenTCP("tcp", &net.TCPAddr{Port: myClientPort}). If you don't specify the IP, it defaults to 0.0.0.0. AcceptTCP then returns the type we need, TCPConn.

Methods provided by the Go language

If you look through the documentation you may notice two related methods: SetKeepAlive and SetKeepAlivePeriod. func (c *TCPConn) SetKeepAlive(keepalive bool) error is simple: pass true to enable the TCP Keepalive mechanism.

But func (c *TCPConn) SetKeepAlivePeriod(d time.Duration) error is a bit confusing. What exactly does it set? The answer can be found in this article (a good one, recommended reading): it sets both the idle time and the retry interval, while the number of retries is left at the system default. So if I pass 5 * time.Second, it might wait 5 seconds, send a ping, wait another 5 seconds for the ACK, and retry up to eight times (depending on system settings). I needed more flexibility and more precision.

Enter the system level

This can be done by setting socket options directly. I did not dig deep into the details, so what follows is purely my own understanding. Here is how to set the idle time to 30 seconds (SetKeepAlivePeriod is fine for that, since the other parameters are set separately afterwards), the retry interval to 5 seconds, and the number of retries to 3. I borrowed some code from the article referenced above, thanks.

    conn.SetKeepAlive(true)
    conn.SetKeepAlivePeriod(time.Second * 30)

    // Getting the file handle of the socket
    sockFile, sockErr := conn.File()
    if sockErr == nil {
        // got socket file handle. Getting descriptor.
        fd := int(sockFile.Fd())

        // Ping amount
        err := syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPCNT, 3)
        if err != nil {
            Warning("on setting keepalive probe count", err.Error())
        }

        // Retry interval
        err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPINTVL, 5)
        if err != nil {
            Warning("on setting keepalive retry interval", err.Error())
        }

        // don't forget to close the file. No worries, it will *not* cause the connection to close.
        sockFile.Close()
    } else {
        Warning("on setting socket keepalive", sockErr.Error())
    }

Somewhere after this I call dataLength, err := conn.Read(readBuf), which blocks until data is received or an error occurs. If the error was caused by keepalive, err.Error() will contain a connection timed out message.

Pitfalls with file descriptors

The code above only works if you don't call it frequently. After writing this article, I learned about a small problem with it the hard way…

The problem lies in the Fd function call. Let’s look at the implementation.

    func (f *File) Fd() uintptr {
        if f == nil {
            return ^(uintptr(0))
        }

        // If we put the file descriptor into nonblocking mode,
        // then set it to blocking mode before we return it,
        // because historically we have always returned a descriptor
        // opened in blocking mode. The File will continue to work,
        // but any blocking operation will tie up a thread.
        if f.nonblock {
            f.pfd.SetBlocking()
        }

        return uintptr(f.pfd.Sysfd)
    }

If the file descriptor is in non-blocking mode, Fd switches it to blocking mode. According to a Stack Overflow answer, when a goroutine makes a blocking system call, the runtime scheduler takes the OS thread running that goroutine out of the scheduling pool. If the pool then has fewer than GOMAXPROCS threads, a new OS thread is created. Since each of my connections runs in its own goroutine, you can imagine how quickly the thread count explodes: the program soon hits the 10,000-thread limit and crashes.

Moving the call into a separate goroutine doesn't help.

As yoko describes it: if each connection has its own goroutine (which exits only when the connection closes), and that goroutine both sets the file descriptor options through system calls and then sends and receives data, the number of OS threads grows linearly with the number of connections. If a separate goroutine sets the file descriptor options before the connection is handed to the goroutine that sends and receives data, the blocking system call is only temporary and the thread can be reclaimed afterwards. But that is not a good model.

But there is a way to do it properly. Note that it requires Go 1.11 or later. Look at the following code.

    // Uses new interfaces introduced in Go1.11, which let us get the connection's
    // file descriptor without blocking, and therefore without uncontrolled
    // spawning of threads (not goroutines, actual threads).
    func setKeepaliveParameters(conn devconn) {
        rawConn, err := conn.SyscallConn()
        if err != nil {
            Warning("on getting raw connection object for keepalive parameter setting", err.Error())
            return // rawConn is not usable when SyscallConn fails
        }

        rawConn.Control(
            func(fdPtr uintptr) {
                // got socket file descriptor. Setting parameters.
                fd := int(fdPtr)

                // Number of probes.
                err := syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPCNT, 3)
                if err != nil {
                    Warning("on setting keepalive probe count", err.Error())
                }

                // Wait time after an unsuccessful probe.
                err = syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_KEEPINTVL, 3)
                if err != nil {
                    Warning("on setting keepalive retry interval", err.Error())
                }
            })
    }

    func deviceProcessor(conn devconn) {
        //............
        conn.SetKeepAlive(true)
        conn.SetKeepAlivePeriod(time.Second * 30)
        setKeepaliveParameters(conn)
        //............
        dataLen, err := conn.Read(readBuf)
        //............
    }

Recent versions of Go provide some new interfaces. net.TCPConn implements SyscallConn, which lets you obtain a RawConn object and set options through it. All you need to do is define a function (like the anonymous function in the example above) that takes the file descriptor as its argument. This lets you work with a connection's file descriptor without switching it to blocking mode, which avoids the runaway creation of threads.

Conclusion

Network programming is complex and often system dependent. This workaround only works on Linux, but it's a good start. Other operating systems have similar parameters; they just go by different names.

Thanks for reading. See you later.