Introduction:

About the author: Li Zhiyu is a backend development engineer at Tencent Cloud who works on cluster nodes and container runtimes day to day and is familiar with containerd, Docker, runc, and other runtime components. Recently, while providing technical support for a customer, I ran into a case of files going missing from containerd images. After a round of analysis, inference, reproduction, and troubleshooting, I found the root cause and provided a solution. This article walks through the whole process in detail, in the hope that it offers a useful approach to problems of this kind and helps you understand the underlying principles.

Problem description: containerd image file loss

Recently, some customers reported a strange phenomenon: files were missing from some container images. After reproducing it, we summarized the symptoms as follows:

Certain images lose files consistently;

The loss reproduces reliably on some distributions, but not on Ubuntu;

containerd v1.2 loses files, but v1.3 does not.

After reading the source code and documentation, I finally solved the problem, and wrote this article to share how I tracked it down and how containerd generates image layers. For readers in a hurry, the answer comes first.

Root causes and solutions

Because of a bug in the kernel's overlay module, when containerd downloads the compressed layer packages of an image from the registry and generates the image's layers from them, overlayfs mistakenly propagates the xattr trusted.overlay.opaque=y from the lower layer to the upper layer. When this xattr is set on a directory, overlayfs treats the directory as opaque, so during the union mount it hides the same-named directories below it, and files in the image appear to be lost.

There are two solutions to this problem. One is simple and blunt: upgrade the overlay module in the kernel.

The other is to upgrade containerd from v1.2 to v1.3, because containerd v1.3 sets the opaque attribute itself and does not trigger the overlayfs bug. Of course, this works around the bug rather than fixing it.

How the snapshotter generates image layers

Although the root cause looks simple, the analysis that led to it was tortuous. Before sharing the troubleshooting process, this section covers the background on containerd and overlayfs that the rest of the article relies on. If you already know it or are not interested, feel free to skip ahead.

Unlike Docker's original daemon design, containerd is composed of modules wired together through plug-ins to reduce coupling. As the following figure shows, the image-related modules are:

  • Metadata is a key-value storage module that containerd implements with bbolt to store meta information about images, containers, layers, and so on. For example, when the ctr command line lists all snapshots or kubelet retrieves all pods, the data comes from the metadata module.

  • Content is the module responsible for storing blobs. The image content it stores generally falls into three types:

    1. The image manifest (plain JSON that specifies the image's config and its array of layers)
    2. The image config (also JSON, which specifies the image's meta information, such as the startup command and environment variables)
    3. The image layers (tar packages, which become the image after decompression and processing)
  • Snapshots is the snapshot module, whose implementations include overlayfs, aufs, and native. When an image is unpacked, the snapshots module processes the image layers and saves them to the file system. When a container runs, the snapshots module is called to provide the container's rootfs.

Container image specifications mainly include Docker, OCI v1, and OCI v2. Since the three are almost the same in principle, we can use the following example as a mental model: the manifest is meta information of which each image has exactly one copy, and it points to the image's config and to each layer. The config is the image configuration, which is needed when the image is run as a container. A layer is one layer of the image.

type manifest struct {
  c config
  layers []layer
}
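To make the manifest, config, and layers relationship concrete, here is a minimal Go sketch that decodes a manifest-shaped JSON document. The field names follow the OCI image manifest layout. The digests and sizes in sampleManifest are made-up placeholders, and parseManifest is a hypothetical helper, not containerd's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// descriptor points at a blob in the content module by digest.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// manifest mirrors the shape of an image manifest: one config, many layers.
type manifest struct {
	SchemaVersion int          `json:"schemaVersion"`
	Config        descriptor   `json:"config"`
	Layers        []descriptor `json:"layers"`
}

// parseManifest decodes manifest JSON into the struct above (hypothetical helper).
func parseManifest(raw []byte) (manifest, error) {
	var m manifest
	err := json.Unmarshal(raw, &m)
	return m, err
}

// sampleManifest is illustrative data; the digests are placeholders, not real blobs.
const sampleManifest = `{
  "schemaVersion": 2,
  "config": {"mediaType": "application/vnd.oci.image.config.v1+json", "digest": "sha256:aaaa", "size": 7023},
  "layers": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:bbbb", "size": 32654},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:cccc", "size": 16724}
  ]
}`

func main() {
	m, err := parseManifest([]byte(sampleManifest))
	if err != nil {
		panic(err)
	}
	fmt.Println("config:", m.Config.Digest)
	// Layers are listed from the bottom layer (index 0) to the top layer.
	for i, l := range m.Layers {
		fmt.Printf("layer %d: %s (%d bytes)\n", i, l.Digest, l.Size)
	}
}
```

The manifest itself carries no file data. It only tells the downloader which blobs to fetch next.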

Image download proceeds in the order shown in Figure 1. The function of each step is summarized as follows:

① First, an image record is added to the metadata module, so that the image shows up when we list images.

② Next, the image itself is downloaded. Because an image consists of a manifest, a config, and layers, the manifest is downloaded first and saved in the content module. The manifest is then parsed to obtain the addresses of the config and of the layers, and the config and each layer are downloaded and saved in the content module. Note that an image layer is logically a directory that is mounted into the container's root when the container is created, but to make network transfer and storage easier it is stored as a compressed tar. It is saved to content as is, without decompression.
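As a rough illustration of the content module's role in this step, the sketch below stores blobs unmodified and keys them by their sha256 digest, which is how manifests refer to the config and layers. blobStore, put, and get are hypothetical names for a toy in-memory model; containerd's real content store is file based and has a different API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// blobStore is a toy stand-in for the content module: blobs keyed by digest.
type blobStore struct {
	blobs map[string][]byte
}

func newBlobStore() *blobStore {
	return &blobStore{blobs: make(map[string][]byte)}
}

// digestOf returns the "sha256:<hex>" digest of data.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// put stores data as is (no decompression) after verifying the expected digest.
func (s *blobStore) put(expected string, data []byte) error {
	if got := digestOf(data); got != expected {
		return fmt.Errorf("digest mismatch: want %s, got %s", expected, got)
	}
	s.blobs[expected] = data
	return nil
}

// get returns the blob stored under digest.
func (s *blobStore) get(digest string) ([]byte, error) {
	data, ok := s.blobs[digest]
	if !ok {
		return nil, errors.New("blob not found: " + digest)
	}
	return data, nil
}

func main() {
	store := newBlobStore()
	layer := []byte("pretend this is a gzip-compressed layer tar")
	d := digestOf(layer)
	if err := store.put(d, layer); err != nil {
		panic(err)
	}
	got, err := store.get(d)
	if err != nil {
		panic(err)
	}
	fmt.Println(d, len(got))
}
```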

Steps ③, ④, and ⑤ are closely related, so they are explained together. The snapshot module reads the manifest from the content module to find all layers of the image, then reads these layers from the content module from "bottom" to "top", decompresses and processes them one by one, and finally places them in the snapshot module's directory. For example, 1001/fs and 1002/fs in Figure 1 are image layers. When a container is created, these layers are combined and mounted as the container's rootfs, which can be read as 1001/fs + 1002/fs + … => 1008/work.

The function call graph of the whole process is shown in Figure 2 below. Readers who like to read source code can follow along with it.

For clarity, in what follows "layer" refers to a layer in the snapshot, while a freshly downloaded, unprocessed layer is called the image layer's tar package, or simply the tar.

Downloading the image and saving it into content is relatively simple, so we skip it. Generating a snapshot layer from an image tar package is more subtle, and it is where the bug appears, so it is described in detail next.

We first get the image's manifest from content, so that we know which layers make up the image. The bottom layer is the simplest case: its tar can be unpacked directly into the directory provided by snapshot, such as 10/fs.

Suppose we then want to generate the second layer at 11/fs (which is still empty at this point). Running mount -t overlay overlay -o lowerdir=10/fs,upperdir=11/fs,workdir=11/work tmp mounts 10/fs and 11/fs onto a tmp directory whose writable layer is 11/fs, the layer we want to generate. We then fetch the tar package for layer 11 from content, iterate over it, and write or delete files at the mount point tmp according to the entries in the tar (because this is a union mount, operations on the mount point become operations on the writable layer). If the tar contains a whiteout file, or if an entry in the current layer (e.g. 11/fs) conflicts with the previous layer (e.g. 10/fs), the lower-layer entry is removed first.

After an entry from the tar is written to the directory, xattrs are set on it based on the PAXRecords stored in the tar. PAXRecords can be seen as a key-value array attached to each file in the tar, which maps to file attributes in the file system.

// Simplified pseudocode; tmp is the overlay mount point
applyNaive(tar, tmp) {
  for tar.hasNext() {
    tar_file := tar.Next()                       // entry in the tar package
    real_file := path.Join(tmp, tar_file.Name)   // corresponding path under the mount point
    // Delete files according to the rules
    if isWhiteout(tar_file) {
      whiteRM(real_file)
    }
    // Two directories are merged; in every other case the existing path is removed
    if !(tar_file.IsDir() && IsDir(real_file)) {
      rm(real_file)
    }
    // Write the tar entry into the layer
    createFileOrDir(tar_file, real_file)
    // Copy the entry's PAX records onto the file as xattrs
    for k, v := range tar_file.PAXRecords {
      setxattr(real_file, k, v)
    }
  }
}

The cases that involve deletion are summarized as follows:

If both entries with the same name are directories, the two directories are merged.

If the entries with the same name are not both directories, the lower one is deleted (upper file over lower directory, upper directory over lower file, or upper file over lower file).

If a .wh. file exists, the lower-layer entry it covers must be removed. For example, if a directory contains .wh..wh..opq, the corresponding directory in lowerdir is deleted.
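The whiteout naming rules above can be sketched as a small classifier. The .wh. prefix and the .wh..wh..opq opaque marker come from the image spec; whiteoutTarget is a hypothetical helper written for this article, not a containerd function.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const (
	whiteoutPrefix = ".wh."         // prefix of a whiteout entry in a layer tar
	opaqueMarker   = ".wh..wh..opq" // marks the containing directory as opaque
)

// whiteoutTarget inspects a tar entry name. For a whiteout entry it returns
// the lower-layer path that should be hidden, and whether the entry is an
// opaque marker (hide the whole directory) rather than a single-name whiteout.
func whiteoutTarget(name string) (target string, opaque bool, ok bool) {
	dir, base := path.Split(path.Clean(name))
	switch {
	case base == opaqueMarker:
		// The marker applies to the directory that contains it.
		return path.Clean(dir), true, true
	case strings.HasPrefix(base, whiteoutPrefix):
		// ".wh.foo" hides the sibling entry "foo".
		return path.Join(dir, strings.TrimPrefix(base, whiteoutPrefix)), false, true
	default:
		return "", false, false
	}
}

func main() {
	// The second and third names are made up for illustration.
	for _, name := range []string{
		"data/asr_offline/modules/.wh..wh..opq",
		"data/.wh.old.conf",
		"data/regular.txt",
	} {
		target, opaque, ok := whiteoutTarget(name)
		fmt.Printf("%-40s whiteout=%v opaque=%v target=%q\n", name, ok, opaque, target)
	}
}
```

The opaque marker is checked first because it also starts with the .wh. prefix.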

Of course, deletion is not that simple. Remember that we are deleting lower-layer files through the mount point. In overlayfs, deleting lower-layer content through the mount point does not actually remove the file from the lower directory. Instead, a whiteout is added to the upper layer. One way to add a whiteout is to set the xattr trusted.overlay.opaque=y on the upper-layer directory.

After umount tmp, 11/fs is the layer we wanted. To generate 12/fs, just mount with 10/fs and 11/fs as lowerdir and 12/fs as upperdir. In other words, generating each layer of the image requires mounting all the layers before it. The following figure illustrates the whole process.

Why go to all this trouble? There are two key points.

First, deletion of lower-layer files in an image follows the image-spec definition of whiteout files. A whiteout is only a marker inside the tar package and has no effect by itself. When applyNaive encounters a whiteout file, it relies on the union file system to delete the lower-layer directory, which overlayfs records by marking the upper-layer directory opaque.

Second, because files and directories can overwrite each other, every entry in a tar package would have to be compared with the contents of all previous tar packages. Without the "superpower" of the union file system, we would have to take each entry in the tar and walk through all the previous layers ourselves.
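The layer-by-layer mounting described above boils down to building the right overlay options for each new layer. The sketch below is a hypothetical helper (overlayOptions is not a containerd function) that takes the already generated layers from bottom to top and produces the -o string. Note that overlayfs lists lowerdir from top to bottom, so the order is reversed.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOptions builds the -o string used to mount the layer being generated.
// lowerBottomUp lists the existing layers from bottom to top; overlayfs expects
// lowerdir from top to bottom, separated by colons.
func overlayOptions(lowerBottomUp []string, upper, work string) string {
	topDown := make([]string, 0, len(lowerBottomUp))
	for i := len(lowerBottomUp) - 1; i >= 0; i-- {
		topDown = append(topDown, lowerBottomUp[i])
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(topDown, ":"), upper, work)
}

func main() {
	// Generating 11/fs on top of 10/fs.
	fmt.Println(overlayOptions([]string{"10/fs"}, "11/fs", "11/work"))
	// Generating 12/fs on top of 10/fs and 11/fs.
	fmt.Println(overlayOptions([]string{"10/fs", "11/fs"}, "12/fs", "12/work"))
}
```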

Troubleshooting Process

With the image background covered, let's look at how this problem was tracked down. First we inspected the customer's container. After simplification and anonymization, the directory structure is as follows; the modules directory is where the files go missing.

/data
└── asr_offline
    ├── bin
    └── modules
        ├── file
        └── lib/

Next, look at the layers of the customer's image. We label the image's layers with increasing IDs from bottom to top. The layers that contain this directory are 5099, 5101, 5102, 5103, and 5104. Inside the running container, the modules directory shows exactly the content provided by 5104. In other words, 5104 hides that directory in all the layers below it (the files in 5104 and 5103 do differ, of course).

Why does 5104 hide the directory in the layers below it?

Was something wrong with the parameters used to create the container's rootfs, so that fewer layers were mounted than expected? We tried mounting the top two layers by hand with mount -t overlay overlay -o lowerdir=5104:5103 point. Our first guess was that the layers contained .wh. whiteout files that made the upper layer hide the lower one, so we searched both layers for .wh. files and found nothing. So we turned to the overlayfs documentation:

A directory is made opaque by setting the xattr “trusted.overlay.opaque” to “y”. Where the upper filesystem contains an opaque directory, any directory in the lower filesystem with the same name is ignored.

In other words, a directory with trusted.overlay.opaque=y is "opaque": when a directory in the upper file system is opaque, the directory with the same name in the lower file system is ignored. This is the attribute overlayfs sets when an upper-layer directory must replace the one below it instead of merging with it.
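The rule quoted from the documentation can be modeled in a few lines. This is a toy simulation of how one directory looks through a union mount, not kernel code: dirLayer and mergedView are hypothetical names, and the file names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// dirLayer models one layer's contribution to a single directory.
type dirLayer struct {
	opaque bool     // trusted.overlay.opaque=y is set on this directory
	files  []string // entries this layer provides in the directory
}

// mergedView lists what the union mount would show for the directory.
// Layers are ordered top to bottom; once an opaque directory is reached,
// the same directory in all layers below it is ignored.
func mergedView(layersTopDown []dirLayer) []string {
	seen := map[string]bool{}
	for _, l := range layersTopDown {
		for _, f := range l.files {
			seen[f] = true
		}
		if l.opaque {
			break
		}
	}
	out := make([]string, 0, len(seen))
	for f := range seen {
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

func main() {
	top := dirLayer{files: []string{"new"}}
	bottom := dirLayer{files: []string{"old"}}

	// Normal case: the two directories are merged.
	fmt.Println(mergedView([]dirLayer{top, bottom})) // [new old]

	// Opaque upper directory: the lower directory is ignored.
	top.opaque = true
	fmt.Println(mergedView([]dirLayer{top, bottom})) // [new]
}
```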

Running getfattr -n "trusted.overlay.opaque" dir shows that /data/asr_offline/modules in 5104 has this attribute, which is why the lower directories are hidden.

[root@]$ getfattr -n "trusted.overlay.opaque" 5104/fs/data/asr_offline/modules
# file: 5104/fs/data/asr_offline/modules
trusted.overlay.opaque="y"

So the question is, why does this happen only on certain distributions? When the image is pulled on Ubuntu, opaque is not set on the same directory. Since a layer is generated by decompressing and unpacking the image's source file (the tar), we first confirmed that the MD5 of the image source files was identical across operating systems, then unpacked them with tar -zxf and mounted them manually on each operating system, and found that 5104 did not hide 5103.

This suggested that containerd on some distributions mishandles reading the tar from content and unpacking it into a snapshot layer, wrongly setting this attribute on the snapshot directory.

To test this hypothesis, we combed through the source code and found something suspicious: while traversing the tar to generate a layer, containerd reads each file's PAXRecords and sets them as xattrs on the file (PAXRecords in a tar are labels attached to each file, comparable to Pod labels).

func applyNaive() {
	// ...
	for k, v := range tar_file.PAXRecords {
		setxattr(real_file, k, v)
	}
}

// unix is golang.org/x/sys/unix; Lsetxattr sets the attribute on the path itself.
func setxattr(path, key, value string) error {
	return unix.Lsetxattr(path, key, []byte(value), 0)
}

Because containerd v1.3 did not have this problem, we checked its code and found that the logic for applying PAXRecords from the tar is different. The v1.3 code is as follows:

func setxattr(path, key, value string) error {
	// Do not set trusted attributes
	if strings.HasPrefix(key, "trusted.") {
		return errors.Wrap(unix.ENOTSUP, "admin attributes from archive not supported")
	}
	return unix.Lsetxattr(path, key, []byte(value), 0)
}

This means that v1.3.0 does not set xattrs whose names start with trusted. If a directory in the tar package carries a PAX record trusted.overlay.opaque=y, older versions of containerd may set it on the snapshot directory, while newer versions will not. So if opaque is present in a tar package, the layer directory will end up with this attribute. This may be why the directory in 5104 became opaque.

To verify this idea, I wrote a simple program to scan the content blobs corresponding to the layers for this attribute, and found that the tars of 5102, 5103, and 5104 did not have it. At this point I began to doubt the theory. After all, if the cause were merely a special marker in the tar, the behavior should not differ between operating systems.

As a last hope I scanned 5099 and 5101 as well, and sure enough, they did not have the attribute either. However, the tar of 5101 did contain the file /data/asr_offline/modules/.wh..wh..opq. Recall from the applyNaive walkthrough that on encountering .wh..wh..opq, containerd deletes /data/asr_offline/modules at the mount point, and overlayfs then adds trusted.overlay.opaque=y to the directory in the upper layer. In other words, when layer 5101 is generated (with 5100 and 5099 mounted beforehand) and the whiteout file is encountered while walking the tar, modules is deleted at the mount point, that is, opaque=y is added to the directory in 5101.

Once again, to check this reading of the source code, I went to 5101/fs in the snapshot and looked at whether the modules directory was opaque. As expected, overlayfs had set trusted.overlay.opaque=y on /data/asr_offline/modules in layer 5101. Out of curiosity, I went on to check the same directory in 5102, 5103, and 5104, and found that they all had this attribute.

So each of these layers hides everything beneath it? That did not seem to make sense. I then checked on a healthy Ubuntu machine and found that only 5101 had this attribute. The tars of 5102, 5103, and 5104 contain no whiteout file for modules, which means the image intends 5101 to hide the layers below it, while the modules directories of 5101, 5102, 5103, and 5104 are meant to be merged. In the whole process of image generation, the only step that involves the operating system is "borrowing" overlayfs to create the snapshot layers.

The clouds part: a bold guess

Let's make a bold guess: when layer 5102 is generated, does modules get the opaque attribute because of a bug in the kernel or in overlayfs?

To test this behavior in isolation, I wrote a simple script. After running it, sure enough, on this distribution, if a directory in the overlay's lower layer has this attribute and the same directory is created in the upper layer, opaque "propagates" to the upper-layer directory. Since containerd generates the image layers one on top of another, every layer above the whiteout layer ends up with this attribute, and the container sees only the top layer's content in the affected directories.

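#!/bin/bash

# Layer 1: a lower directory containing func/min
mkdir 1 2 work p
mkdir 1/func
touch 1/func/min

# Layer 2: delete and recreate func through the mount point,
# so overlayfs marks 2/func as opaque
mount -t overlay overlay p -o lowerdir=1,upperdir=2,workdir=work
rm -rf p/func
mkdir -p p/func
touch p/func/max
umount p
getfattr -n "trusted.overlay.opaque" 2/func

# Layer 3: only add a file under func; on an affected kernel,
# 3/func is reported as opaque as well
mkdir 3
mount -t overlay overlay p -o lowerdir=2:1,upperdir=3,workdir=work
touch p/func/sqrt
umount p
getfattr -n "trusted.overlay.opaque" 3/func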

Final summary

With the help of several kernel experts, a bug in the kernel's overlayfs module was confirmed: when copy_up is performed from the lower layer, the xattr is not checked, so the opaque xattr propagates to the upper layer. Once an upper-layer directory has this attribute, the lower-layer content is hidden and the image loses files. Looking back on the whole investigation, it would have been hard to pin the problem on a specific kernel module at the start. Fortunately, testing and reading the source code let us approach the "truth" step by step and find a solution.