Troubleshooting experience and an analysis of how container I/O flows

This is really an investigation story that digs down to the essence of a problem. The reason I want to write it up is that the symptom of the problem and the cause we finally identified look very far apart.

And because the troubleshooting process can be abstracted into a general troubleshooting logic, you can judge for yourself whether the abstraction holds up.

Opening (the problem)

The beginning of the story, like that of most investigation stories, is nothing special. It was an ordinary morning; I was at work in a good mood when I was suddenly pulled into a meeting, where the boss hurriedly described the problem and its seriousness, so I hurriedly started investigating.

The problem itself was also very common: a large external customer found that an application inside a container could not respond to requests, and anxiously came to us.

What I gathered from the meeting was that a server process in a container could not respond to HTTP requests. Like the other colleagues, I naturally assumed there might be a problem with the container network. So I logged in to the host and checked container network connectivity and routing as a matter of routine, and found the network was normal; no problem there.

Development (the investigation begins; unexpectedly, not a network problem)

When the routine checks turned up nothing, I asked the colleagues on the application layer about the service, hoping to get some information from the type of service the user had deployed. They checked the pod information in the K8s cluster and found it was an ordinary Java application with no strange parameters.

With no clue there, I asked whether anything had changed recently in production. It turned out something had: the container engine had been updated just the night before, which means the container engine had been restarted. Because I knew containers well enough to assume that a container engine restart would not affect running containers, I was not too keen on this clue for the moment.

At this point the online investigation was basically over, and the live problem could only be worked around by recreating the pod first. We were not familiar with K8s at the time, so we asked the application-layer colleagues to try to reproduce the problem first, and then investigated offline. Thanks to a colleague who reproduced the problem offline, the investigation made further progress.

While watching the pod log with kubectl during the replay, I stumbled on the fact that the container's standard output broke at exactly the moment application requests got stuck. Combined with a bug report from an online user, a simple analysis followed: the application process needs to print something while responding to user requests, and when that step gets stuck it cannot continue responding. At this point we were much closer to the root cause of the bug.

Turn (locating the root cause)

The conditions that trigger the problem: the process prints logs to the container's standard output, and the container engine is restarted.

The trigger is the container engine restart. After the restart, the container's standard output breaks and the process in the container can no longer respond to requests; debugging also showed that the container had received SIGPIPE signals. Combined with the principle of how the container forwards stdio to the container engine, the cause was basically located: the FIFO (Linux named pipe) the container uses to forward stdio was broken. A look at the FIFO fds the shim had open on the live host confirmed this.

The cause was located, but the code bug had not yet been found. Although I knew containers very well at the time, I did not know my way around FIFOs at all. Had I understood how FIFOs work, I might have pinned the problem down in an afternoon instead of a whole day. (This problem actually occurred once more later, also a FIFO problem but a different bug, and that time I located it within an afternoon after reproducing it offline.)

Here is how the FIFO part of the problem works:

If the read end of a FIFO is not open, whether it was never opened or was closed and not reopened, and only the write end is open, then writes to the FIFO fail with an EPIPE (Broken pipe) error and the kernel sends the writer a SIGPIPE signal. If the write end is instead opened in read/write mode, the problem does not occur, because the opener itself then holds a read end and written data is simply buffered.

Checking the problematic code: the FIFO was opened in exactly this way, with O_WRONLY. There was no problem before only because the container engine had never been updated; the first update, the night before, triggered the problem directly. The bug that plagued us for a whole day needed a one-word fix: O_WRONLY -> O_RDWR.

You can also verify this behavior yourself with a simple program.
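Here is a minimal Go sketch of such a verification (my own reconstruction; the original snippet is not reproduced here). It simulates the engine restart by closing the only reader, and shows that an O_WRONLY writer breaks while an O_RDWR writer keeps working:

```go
package main

// Reconstruction sketch: a writer that opened the FIFO with O_WRONLY
// gets a broken pipe once the only reader goes away, while an O_RDWR
// writer keeps working because it holds a read end of its own. Go
// surfaces the kernel's EPIPE/SIGPIPE as a "broken pipe" write error.
import (
	"fmt"
	"os"
	"syscall"
)

func tryWrite(flag int, label string) {
	const path = "/tmp/demo.fifo"
	os.Remove(path)
	if err := syscall.Mkfifo(path, 0o644); err != nil {
		panic(err)
	}

	// A temporary reader, so the blocking O_WRONLY open below succeeds.
	r, err := os.OpenFile(path, os.O_RDONLY|syscall.O_NONBLOCK, 0)
	if err != nil {
		panic(err)
	}

	w, err := os.OpenFile(path, flag, 0)
	if err != nil {
		panic(err)
	}
	defer w.Close()

	r.Close() // the reader disappears, like the engine being restarted

	_, err = w.Write([]byte("hello\n"))
	fmt.Printf("%-8s write after reader closed: err=%v\n", label, err)
}

func main() {
	tryWrite(os.O_WRONLY, "O_WRONLY") // err=... broken pipe
	tryWrite(os.O_RDWR, "O_RDWR")     // err=<nil>
}
```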

Okay, that's the story. Next I will write about how the container engine takes over stdio; if you are not interested in the container internals, you can jump straight to the "Conclusion (postscript)" section.

Principle analysis (how the container is created and how its stdio is taken over)

By the way, it was not a runc container that had the problem but the Kata secure container; runc containers are, after all, much less buggy.

The stack involved: Pouch (github.com/alibaba/pou…) + containerd (github.com/containerd/…) + runc (github.com/opencontain…).

The I/O flow of the container's no. 1 process, from the bottom up:

The stdio of the container process points to one end of a pipe -> the shim holds the other end of that pipe -> the shim copies it into the write end of a FIFO it opened -> pouch opens the read end of the same FIFO -> pouch copies it on to the configured IO output address, which by default is a JSON file.

What IO pouch creates depends on whether a terminal and stdin are needed. For simplicity, the flow below is illustrated with a container run in the background, so only the container's stdout and stderr are created. Expressed in terms of the pouch command line, the question is: after executing the command below, how does pouch logs get to see the log output of the process?

```
pouch run -d nginx
```

Step 1: pouch creates the container and initializes IO

Pouch, like Docker, manages containers through containerd. That is, pouch pulls up a containerd process when it starts, and communicates with containerd over gRPC when initiating the various container-related requests. When containerd receives a request, it calls the corresponding runtime interface to handle the container. There can be many kinds of runtime, such as runc or kata, any implementation of the OCI runtime standard.

Pouch invokes containerd's NewTask interface to issue the create-container command. The second argument of this function is the function that initializes the IO. Take a look at the code: github.com/alibaba/pou…

This IO-initializing function, which creates the FIFOs and opens their read ends, is called on the first line of NewTask's execution.
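For the shape of that call, here is a minimal sketch against containerd's public Go client (an illustration of the same API pouch wraps; the socket path, namespace, and container ID are my assumptions):

```go
package main

// Sketch: NewTask's second argument is the IO creator, which creates
// the FIFOs and opens their read ends before the task process exists.
import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "demo")
	container, err := client.LoadContainer(ctx, "my-container") // assumed to exist
	if err != nil {
		panic(err)
	}

	// cio.NewCreator builds the IO-initialization callback; WithStdio
	// wires the task's FIFOs to this process's stdin/stdout/stderr.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		panic(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		panic(err)
	}
}
```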

To open the FIFOs, pouch also goes through containerd's cio package: pouch specifies the FIFO paths, and ultimately containerd's fifo package is called to create and open the FIFOs via the fifo.OpenFifo function.

Pouch opens the two FIFOs for stdout and stderr and starts copying the two container IO streams. The FIFO read ends are the input; the write side can be customized: a JSON file, syslog, or something else. In other words, the write side here is the log driver configured on the container engine.
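A minimal sketch of that engine-side read end using containerd's fifo package (the path, flags, and copy target are illustrative assumptions, not pouch's actual values):

```go
package main

// Sketch: fifo.OpenFifo creates the FIFO if needed and opens its read
// end without blocking on the writer; the shim will open the write end
// later.
import (
	"context"
	"io"
	"os"
	"syscall"

	"github.com/containerd/fifo"
)

func main() {
	r, err := fifo.OpenFifo(context.Background(), "/tmp/ctr-stdout",
		syscall.O_CREAT|syscall.O_RDONLY|syscall.O_NONBLOCK, 0o700)
	if err != nil {
		panic(err)
	}
	defer r.Close()

	// Copy the container's output to the log driver; os.Stdout stands
	// in here for the default JSON-file driver.
	io.Copy(os.Stdout, r)
}
```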

Note the way the FIFO is opened here: containerd's fifo package wraps the whole procedure, and it is not the same as a direct open. The most visible difference is the number of open files. In the case above, the code opens each of the stdout and stderr FIFO files only once, yet the fd table shows each file opened twice. That is because the fifo package first opens the FIFO file itself by path, and then opens the fd file, i.e. /proc/self/fd/22 in this example, with the flag specified by the caller.

The point of opening it twice is that the fds held in memory can still be closed properly even after the FIFO file has been physically deleted from disk.
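A simplified sketch of that double open, modeled on (not copied from) containerd's fifo package: the first open takes an O_PATH handle on the path, the second reopens it through /proc/self/fd/N with the caller's real flags:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	const path = "/tmp/demo.fifo"
	syscall.Mkfifo(path, 0o644) // ignore "file exists" on reruns

	// Open 1: an O_PATH handle that pins the inode but cannot be read
	// or written.
	h, err := syscall.Open(path, syscall.O_PATH|syscall.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}

	// Open 2: the usable fd, reopened through procfs with the real
	// flags. While both stay open, the fd table lists the file twice,
	// matching the observation above; the path can now be deleted from
	// disk and both fds remain valid and closable.
	f, err := os.OpenFile(fmt.Sprintf("/proc/self/fd/%d", h),
		syscall.O_RDONLY|syscall.O_NONBLOCK, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println("handle fd:", h, "data fd:", f.Fd())
}
```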

Step 2: containerd creates the container and initializes IO

It is actually the shim process that manages the container process; in other words, the shim is the parent process of the container's no. 1 process. Containerd and the shim interact via ttrpc (ttrpc is a low-memory version of gRPC implemented by the containerd community). When containerd receives a create-container request, it creates a shim process and sends the subsequent requests over ttrpc.

The shim initializes the container IO while creating the container. See github.com/containerd/…

The shim first creates os.Pipe. Because the container only needs stdout and stderr, only two pipes, one for stdout and one for stderr, are created. One end of each pipe serves as the output of the container's no. 1 process, and the other end is copied out to the FIFO created by pouch; that is how pouch reads the standard output of the container process.

The cmd here encapsulates the shim's invocation of runc create, and cmd's stdio is the stdio of the container process. Why that is so is explained in step 3, where runc creates the container.

After the call to runc create returns, the shim starts copying the container IO into the FIFO created by pouch. The code is here: github.com/containerd/…

The copy logic for the stdout IO stream is as follows (stderr is copied the same way); rio.Stdout() is the other end of the pipe the shim created above.

Now look inside the rio.Stdout() function.

i.out.w is the stdout of cmd; that is, the container process writes into i.out.w, and the shim reads the data from the pipe's other end and then copies it into the FIFO. wc is the write side of the FIFO created by pouch, which the shim only writes into; pouch opens the read end of the same FIFO and reads the data written here, obtaining the container process's output.
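Putting the shim-side pieces together, a minimal sketch of the wiring (names and the FIFO path are my own; this is not the shim's actual code):

```go
package main

// Sketch: the child process writes into a pipe, and the shim copies
// the pipe's read end into the write end of the FIFO that the engine
// (pouch) reads from.
import (
	"context"
	"io"
	"os"
	"os/exec"
	"syscall"

	"github.com/containerd/fifo"
)

func main() {
	// One end of this pipe becomes the container process's stdout.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	cmd := exec.Command("echo", "hello from the container")
	cmd.Stdout = w // "the stdio of cmd is the stdio of the container process"

	// The FIFO write end, opened O_RDWR rather than O_WRONLY: exactly
	// the one-word fix from the bug analysis above.
	wc, err := fifo.OpenFifo(context.Background(), "/tmp/ctr-stdout",
		syscall.O_RDWR|syscall.O_CREAT|syscall.O_NONBLOCK, 0o700)
	if err != nil {
		panic(err)
	}
	defer wc.Close()

	done := make(chan struct{})
	go func() { // the shim's copy loop: pipe -> FIFO
		io.Copy(wc, r)
		close(done)
	}()

	if err := cmd.Run(); err != nil {
		panic(err)
	}
	w.Close() // drop our write end so the copy loop sees EOF
	<-done
}
```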

Both the stdout and stderr FIFOs are opened twice here as well (the fifo package's double open again), and in read/write mode. The read/write mode is because of the FIFO behavior described earlier: with only a write end open and no read end, writes to the FIFO fail with an EPIPE (Broken pipe) error; an O_RDWR open avoids that.

Step 3: runc creates the container and initializes IO

This is the last step of creating a container: containerd calls the actual container runtime to create it. Let's start with runc, the most common one.

In this incident the runtime was actually the Kata secure container, which is also an OCI-standard runtime. Simply speaking, a secure container has its own kernel and does not share the kernel with the host, which makes it safe and reliable. Kata works in two layers: the first layer is on the host, interacting with QEMU and the processes inside QEMU; the second layer is inside QEMU, receiving requests from the first layer, and its actual code wraps runc's libcontainer. So Kata's stdio is forwarded one more time than runc's.

The shim calls the runc binary twice to create a container. The first call is runc create. After this command completes, runc has started an init process but has not yet pulled up the container's user process; the preparation work, switching namespaces, setting up cgroup isolation, mounting the image rootfs and volumes, and so on, is done. At the end, runc init writes a '0' into a FIFO (unrelated to pouch's FIFOs; this is a FIFO file runc itself uses) and hangs there until the '0' is read.

The second call from the shim is runc start. What runc start does is simple: it reads that data out of the FIFO. At that point the runc init process that was hanging is released and calls execve to load the user process, and the container's user process starts running.
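A toy model of this create/start handshake, collapsed into one process for brevity (my own sketch, not runc's code; the path is made up):

```go
package main

// Toy model: "runc init" blocks on the FIFO until "runc start" reads
// from it, and only then would it execve() the user process.
import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	const path = "/tmp/exec.fifo"
	os.Remove(path)
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		panic(err)
	}

	// The "runc init" side: the O_WRONLY open blocks until a reader
	// appears, so the container sits in the "created" state.
	go func() {
		f, _ := os.OpenFile(path, os.O_WRONLY, 0)
		f.Write([]byte{'0'})
		f.Close()
		fmt.Println("init: released, would execve() the user process now")
	}()

	time.Sleep(time.Second) // nothing happens until "start" arrives

	// The "runc start" side: reading the byte releases runc init.
	f, _ := os.OpenFile(path, os.O_RDONLY, 0)
	buf := make([]byte, 1)
	f.Read(buf)
	f.Close()
	fmt.Printf("start: read %q, the container is now running\n", buf[0])

	time.Sleep(100 * time.Millisecond) // let the goroutine print
}
```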

Since there is no standard input, fd 0 points to /dev/null, while fd 1 and fd 2 point to pipes. These pipes are the ones the shim created in step 2.

You can open the proc files of the shim process to confirm that its fds 13 and 15 refer to the same pipes (the same pipe inode numbers) as the ones opened by the container process.
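If you want to script that check instead of eyeballing ls -l /proc/<pid>/fd, here is a small sketch; point it at the shim's and the container process's PIDs and match the pipe:[inode] targets:

```go
package main

// Sketch: print where each fd of a process points, equivalent to
// `ls -l /proc/<pid>/fd`. Matching "pipe:[inode]" targets across two
// processes shows they share a pipe.
import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pid := "self" // replace with the shim's or the container's PID
	dir := filepath.Join("/proc", pid, "fd")

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		target, _ := os.Readlink(filepath.Join(dir, e.Name()))
		fmt.Printf("fd %s -> %s\n", e.Name(), target)
	}
}
```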

The first process runc starts is runc init. Starting this process also goes through a wrapped cmd command, and cmd's stdio is the stdio of the process.

In runc, this process represents the init or exec process. When the container does not need a tty, runc sets the process's stdio to inherit runc's own stdio; os.Stdin/os.Stdout/os.Stderr here refer to fds 0, 1, and 2 of the runc process.

So when the real container process starts, it naturally inherits runc init's stdio.
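A minimal sketch of that inheritance (illustrative, not runc's code): a child whose stdio fields are set to the parent's stdio simply shares fds 0, 1, and 2 with it:

```go
package main

// Sketch: the child started here shares this process's fds 0/1/2,
// just as the container process ends up sharing runc init's stdio,
// which in turn is the pipe set up by the shim.
import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("echo", "inherited stdout")
	cmd.Stdin = os.Stdin   // fd 0
	cmd.Stdout = os.Stdout // fd 1
	cmd.Stderr = os.Stderr // fd 2
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```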

Conclusion (postscript)

It looked like a network problem and turned out in the end to be a FIFO problem, yet after the step-by-step analysis everything is reasonable.

Just as there are routines for troubleshooting network problems, problem investigation in general has routines (abstract methods) to follow. I have read other people's summaries on this, but I still want to write down my own understanding: when a routine is compressed to its extreme, it becomes a grand-sounding way of logical thinking.

Analyze the phenomenon of the problem in detail, and work out the workflow and trigger conditions around it. Do not judge from experience which components "cannot be the problem"; analyze the logs and the code of the components in detail. In particular, do not be afraid of any code (not fear, but respect for all code), and do not assume that the kernel and system libraries are basically stable. Best of all, learn the principles of every link involved in the problem.