1. Background

Several tenants reported problems after deploying their services on our K8s container platform: one service restarted frequently, another accumulated a large number of CLOSE_WAIT connections after running for a while, and yet others simply hung while running.

After investigation it turned out that all three symptoms share the same root cause, which is the topic I want to share today.

2. The analysis process

Restarts are notoriously hard to diagnose because there are many possible causes. For example, a container may restart because it was not allocated enough memory, or because a memory leak in the running program eventually exhausts memory; in K8s, a container is also restarted whenever its configured health check (liveness probe) fails; and so on. None of this is easy to pin down up front.
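As a first step, a few commands like the following usually reveal why a container was restarted, for example whether it was OOM-killed or failed a probe (pod, namespace and container names here are placeholders):

# restart count, last state and exit reason (e.g. OOMKilled) of the pod
kubectl describe pod <pod-name> -n <namespace>
# recent events, including probe failures
kubectl get events -n <namespace> --sort-by=.lastTimestamp
# on the node, confirm directly from the container runtime
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>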

If the three problems are related, could the hang be the trigger for the other two? When a hung service fails to respond to the client closing the connection, the socket is left in CLOSE_WAIT. And if a liveness probe is configured on the K8s platform, a hung container will automatically be restarted. Under that hypothesis, CLOSE_WAIT is the first symptom to chase, and also the easiest place to start.
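For reference, a quick way to confirm whether a liveness probe is actually configured on the suspect container (pod name is a placeholder; the command prints nothing if no probe is set):

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'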

Of course, the causality could also run the other way: too many CLOSE_WAIT sockets might exhaust the available ports or file handles, or consume too much CPU, so that the service responds too slowly as a server. This is easy to check: look at the container's current and maximum file-handle usage, check the container log for "port in use" and related errors, and compare past and current CPU and memory usage of the container. Doing so ruled this scenario out.
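These are roughly the checks we ran; a sketch with placeholder PID and container ID:

# file handles currently open by the business process vs. its limit
ls /proc/<pid>/fd | wc -l
cat /proc/<pid>/limits | grep 'open files'
# listening sockets, and whether the port is already taken
ss -lntp
# current CPU / memory usage of the container
docker stats --no-stream <container-id>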

So let's start with CLOSE_WAIT. This problem had actually been around for about a year. Whenever we had dug into it before, it turned out to be the server mishandling the TCP four-way close, and modifying the service code resolved it. This time, however, the tenant reported that the very same container image never showed the problem on our company's Docker Swarm platform, yet it happened frequently on the K8s platform. That was a strong hint that the code itself might not be at fault, so we decided to investigate further.

CLOSE_WAIT is usually caused by the server mishandling the TCP close sequence. For example, the server may forget to close() the corresponding socket, so its FIN is never sent; an infinite loop may keep close() from ever being reached; the server may respond slowly, or run time-consuming logic that delays the close; the accept backlog may be larger than the server can drain, so connections are closed by the client while still sitting in the queue; or kernel parameters such as tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes may not be tuned. We went through all of these, and none of them applied here.
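Two of these are easy to inspect on the node, sketched below: the accept queue (for listening sockets, ss reports the current queue length in Recv-Q and the configured backlog in Send-Q) and the keepalive kernel parameters.

# accept queue usage per listening socket
ss -lnt
# current keepalive settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes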

3. Reproducing the problem

While reproducing the problem together with the tenant, we noticed that the business log output was no longer appearing in the console of the corresponding container on the K8s platform, and this turned out to be the critical clue. Checking with ss -ant showed the number of CLOSE_WAIT connections steadily increasing, whereas only a handful had appeared while log output was still normal. We therefore concluded that the CLOSE_WAIT connections were caused by the service hanging.
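A simple way to watch the trend during the reproduction (note that ss spells the state CLOSE-WAIT):

# count CLOSE_WAIT sockets every 5 seconds
watch -n 5 "ss -ant | grep -c CLOSE-WAIT"
# or list them to see which peer addresses are affected
ss -ant state close-wait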

4. Building a test program

Since the affected service belonged to a tenant and was running in production, we wrote a Java test program to simulate the situation: it continuously writes logs while exposing an API that a client can query, so that if the program were still healthy when the problem recurred, the API would still respond. When the problem did recur, the API indeed stopped responding, confirming that the hang of the service was what produced the CLOSE_WAIT connections.

We also noticed another key detail: individual log lines from the business were extremely long, tens of thousands of characters per line by rough estimate. (Printing logs like that in production is hardly good practice, but it is exactly this bad practice that gave us the chance to track down the problem!) So we asked the tenant to turn off log output and re-test for two days, and no CLOSE_WAIT appeared. At that point we were fairly sure the log output was involved.

Next, we constructed two kinds of test logs. The first continuously prints log lines whose length is capped at 10,000 characters. The second keeps appending one character to the previous line and prints the running character count, so we can see at exactly what single-line length the container hangs. The script for the second case is as follows (a sketch of the first case is given after it):

docker run -t busybox /bin/sh -c 's="a"; while true; do echo $s | wc -c; echo $s; s="a$s"; done'
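For the first case, a sketch along the same lines: build a fixed 10,000-character line once, then print it in a loop.

docker run -t busybox /bin/sh -c 's=""; i=0; while [ $i -lt 10000 ]; do s="a$s"; i=$((i+1)); done; while true; do echo $s; done'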

5. Analyzing the test results

The test showed that once a single line reaches somewhere between 20,000 and 30,000 characters, log output stops entirely, i.e. the service hangs. That narrowed the problem down to the length of a single log line. Searching the Docker community, we found the reported issue "Logging long lines breaks container", which describes exactly the behaviour we were seeing in production.

At this point we had identified the problem. The root cause is a tty-related race condition that was only fixed in docker-ce 18.06; see the fix "console_linux: Fix race: lock Cond before Signal". When tty is enabled, the container's output is fed to the log driver through the tty console; the signalRead() and signalWrite() functions tell the console whether it can currently read or write, and the reader and writer wait on a condition variable until Signal() is called. That condition variable is a shared resource, and experienced developers will spot the problem immediately: Signal() was being called without the lock held, so under contention the notification could be lost and the waiting side would block forever.

So the long log lines only trigger the bug because tty was enabled when the container was deployed on our K8s platform; by default tty is off. We then checked the company's Docker Swarm platform and confirmed that tty was disabled there, which explains why the CLOSE_WAIT problem showed up only on the K8s platform.

To fix it, we can either upgrade docker-ce to 18.06 specifically (the bug still exists in other versions, such as docker-ce 19.x), or simply disable the tty: setting tty=false (or just leaving it unset) when creating the container will do the trick. After the business is deployed, run docker inspect <container-id> | grep -i tty to check whether tty is enabled for the container.
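More precise checks than grepping the inspect output, with placeholder container ID and pod name (the kubectl query prints nothing when tty is unset, i.e. the default false):

docker inspect --format '{{.Config.Tty}}' <container-id>
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].tty}'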

6. Afterword

There is no need to enable tty when deploying a container, i.e. no need to allocate a TTY to it, because a service container does not need interactive input. Only when we actually need to interact, for example to type commands into the console and have the container print output back, should we add -t, e.g. docker exec -it XXXX sh or kubectl exec -it XXX -- sh, which allocates a TTY temporarily for the interaction.

Welcome to follow my WeChat official account.