Recently we fixed a problem where the number of OS threads in a Go program exploded. The thread count hovered around 23,000, sometimes more, which clearly contradicts Go's concurrency model. We first noticed it because the program suddenly crashed: a maximum thread count had been configured for the process, and once that limit was exceeded the runtime aborted.

The program in question is Docker Swarm. After some investigation we found that the number of long-lived connections it held was almost exactly the same as the number of threads, a correlation that made no sense to me at first.
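For reference, on Linux the per-process thread count can be read straight from procfs (the PID below is a placeholder):

grep Threads /proc/<pid>/status    # e.g. "Threads:   23000"
ls /proc/<pid>/task | wc -l        # one entry per kernel thread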

At first I suspected that DNS lookups were going through the CGO resolver and that this was what kept creating threads. But after repeatedly combing through the code, I was sure connections were being dialed by IP address, not by domain name. To rule out CGO-based DNS resolution entirely, I nevertheless forced the pure-Go resolver, but the thread count did not come down. Forcing the pure-Go resolver only takes a single environment variable.
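The usual way to do this (Go 1.5 and later) is the netdns flag of GODEBUG; the netgo build tag achieves the same thing at compile time:

export GODEBUG=netdns=go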

The investigation then continued down the CGO path: I started checking whether any code called C code through CGO, but found no possible CGO call, and even building with CGO disabled outright (CGO_ENABLED=0) made no difference.

At this point I also analyzed the pprof data and found nothing suspicious. pprof has a threadcreate profile that sounds like exactly the right tool, but for some reason I only got thread counts out of it, with no recorded stacks. If you know the right way to use pprof's threadcreate profile, please let me know. Thanks.
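For anyone who wants to reproduce this, a minimal sketch for dumping that profile from inside a program, using only the standard runtime/pprof package, looks like this:

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// "threadcreate" records the stacks that triggered the creation of new OS
	// threads. debug=1 requests a human-readable dump; in my case it showed
	// little more than the thread count, with no useful stacks.
	if err := pprof.Lookup("threadcreate").WriteTo(os.Stdout, 1); err != nil {
		panic(err)
	}
}

The same profile is also served at /debug/pprof/threadcreate when net/http/pprof is imported.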

I also tried calling runtime.Stack() at the point in the Go scheduler where threads are created, in order to print the call stack, but that failed because the array used to hold the stack escapes to the heap, which is not allowed inside that part of the runtime. In the end I simply turned on the scheduler's status output, and it showed that a large number of threads really were being created and that they were not idle; in other words, they were all doing work. I could not understand why so many threads were needed.

To enable the scheduler's status output:

export GODEBUG=scheddetail=1,schedtrace=1000

With this setting, the scheduler dumps its state (overall thread counts plus per-P, per-M, and per-G detail) to standard error every 1000 milliseconds.

Next, I picked one of these busy threads and looked at its user-space stack with pstack.
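For reference, both tools attach by ID (the PID and thread ID below are placeholders):

pstack <pid>       # dump the user-space stacks of every thread in the process
strace -p <tid>    # trace the system calls made by one particular thread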

Attaching strace to the same thread finally showed something interesting.

The thread had been blocked in a read call on fd 16 for a long time. I could hardly believe my eyes, so I first verified that fd 16 really was a network connection, and it was. In the Go world, network I/O must never block an OS thread, otherwise nothing works. By now I seriously suspected strace of tracing incorrectly, so I added logging to Go's net library to confirm whether the read system call ever returned, recompiled Go and Swarm, and rolled them out. Sure enough, when no data arrived the read system call simply never returned.

With start/finish log lines added before and after the read system call, it was plain to see that read was blocking. This finally explained why the thread count inflated to roughly the connection count. Yet Go's net library does set every connection to non-blocking mode, so somehow the connections were ending up blocking. Either this was a kernel bug, or the connection's properties were being corrupted somewhere along the way. I naturally preferred to believe the latter, and then remembered that the custom Dial() method in the code sets a TCP option on every connection.

// TCP_USER_TIMEOUT is a relatively new feature to detect dead peer from sender side.
// Linux supports it since kernel 2.6.37. It's among Golang experimental under
// golang.org/x/sys/unix but it doesn't support all Linux platforms yet.
// We explicitly define it here until it becomes official in golang.
// TODO: replace it with proper package when TCP_USER_TIMEOUT is supported in golang.
const tcpUserTimeout = 0x12

syscall.SetsockoptInt(int(f.Fd()), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)

After removing that TCP option, recompiling, and deploying again, everything was back to normal and the thread count dropped to the usual few dozen.

The TCP_USER_TIMEOUT option was only introduced in kernel 2.6.37, so my first guess was that the kernel we were running did not support it and that forcing the option caused the problem. Why setting TCP_USER_TIMEOUT would flip a connection from non-blocking to blocking seemed worth digging into further.
Note: this conclusion turned out not to be entirely correct; see the update below.

The real reason

The snippet above is the code that sets the TCP_USER_TIMEOUT option. As was pointed out in the comments, what actually causes the blocking is the conn.File() call used to get hold of the fd. Its underlying implementation looks like this:

// File sets the underlying os.File to blocking mode and returns a copy.
// It is the caller's responsibility to close f when finished.
// Closing c does not affect f, and closing f does not affect c.
//
// The returned os.File's file descriptor is different from the connection's.
// Attempting to change properties of the original using this duplicate
// may or may not have the desired effect.
func (c *conn) File() (f *os.File, err error) {
	f, err = c.fd.dup()
	if err != nil {
		err = &OpError{Op: "file", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return
}

func (fd *netFD) dup() (f *os.File, err error) {
	ns, err := dupCloseOnExec(fd.sysfd)
	if err != nil {
		return nil, err
	}

	// We want blocking mode for the new fd, hence the double negative.
	// This also puts the old fd into blocking mode, meaning that
	// I/O will block the thread instead of letting us use the epoll server.
	// Everything will still work, just with more threads.
	if err = syscall.SetNonblock(ns, false); err != nil {
		return nil, os.NewSyscallError("setnonblock", err)
	}

	return os.NewFile(uintptr(ns), fd.name()), nil
}

You don't even need to read the code; the comments say it all: dup() deliberately puts the new fd into blocking mode, and as a side effect the original connection's fd becomes blocking too, so its I/O ties up a thread instead of going through the netpoller. In other words, this way of setting TCP options on a connection is simply wrong.
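For completeness, on Go 1.9 and later the same option can be applied without going through File() at all, by using the connection's SyscallConn(). The sketch below is only an illustration under that assumption; the setUserTimeout helper and the dialed address are made up for the example:

package main

import (
	"log"
	"net"
	"syscall"
	"time"
)

// tcpUserTimeout is the Linux TCP_USER_TIMEOUT option number (0x12),
// defined locally exactly as the original code did.
const tcpUserTimeout = 0x12

// setUserTimeout applies TCP_USER_TIMEOUT through SyscallConn. The Control
// callback runs against the connection's own fd, so nothing is duplicated
// and the fd stays in non-blocking mode, i.e. the netpoller keeps working.
func setUserTimeout(conn *net.TCPConn, d time.Duration) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP,
			tcpUserTimeout, int(d/time.Millisecond))
	}); err != nil {
		return err
	}
	return sockErr
}

func main() {
	// The address is just a placeholder.
	conn, err := net.Dial("tcp", "127.0.0.1:2375")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := setUserTimeout(conn.(*net.TCPConn), 30*time.Second); err != nil {
		log.Fatal(err)
	}
}

On modern Go, golang.org/x/sys/unix also exports unix.TCP_USER_TIMEOUT for Linux, so the local constant is no longer necessary.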


I found this comment in Swarm’s code a long time ago:

// Swarm runnable threads could be large when the number of nodes is large
// or under request bursts. Most threads are occupied by network connections.
// Increase max thread count from 10k default to 50k to accommodate it.

const maxThreadCount int = 50 * 1000
debug.SetMaxThreads(maxThreadCount)

Whoever wrote this had clearly noticed the huge thread count from the start, but assumed that a large number of nodes and bursty requests simply needed that many threads. That view suggests an incomplete understanding of Go's concurrency model and of event-driven I/O on Linux: goroutines waiting on network I/O are parked on the netpoller (epoll) and multiplexed over a handful of OS threads, so the thread count should never track the connection count.
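If you want to convince yourself of this, here is a small standalone experiment (the connection count and the sleep are arbitrary choices): it parks a couple of hundred goroutines in Read on real TCP connections and then prints how many OS threads have been created. With the netpoller doing its job, the number stays tiny.

package main

import (
	"fmt"
	"net"
	"runtime/pprof"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}

	// Accept connections and simply hold on to them; nothing is ever written,
	// so every client-side read will park forever.
	var serverConns []net.Conn
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			serverConns = append(serverConns, c)
		}
	}()

	// Open 200 client connections, each with a goroutine blocked in Read.
	for i := 0; i < 200; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		go func(c net.Conn) {
			buf := make([]byte, 1)
			c.Read(buf) // parks the goroutine on the netpoller, not an OS thread
		}(c)
	}

	time.Sleep(time.Second) // give the goroutines time to park

	// Despite 200 reads "in flight", only a handful of threads were ever created.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}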

This investigation cost me a lot of time, mainly because the symptoms completely contradicted the "world" as I understood it. It was hard for me to believe this was a bug in Go, and indeed it was not, but that left me with no obvious place to start and kept sending me back into the runtime source to study the conditions under which threads are created, among other details.

Some previous blog posts: www.skoo.me