Hello everyone, I’m Zhang Jintao.
Container and virtualization technologies, at whatever level of abstraction, all provide resource-level isolation and limits.
To implement resource-level limits and isolation, container technology relies on the cgroup and namespace features provided by the Linux kernel.
Here’s a quick overview of what these two technologies can do:
- cgroup: manages resource allocation and limits;
- namespace: encapsulates abstraction, restriction, and isolation, so that processes inside a namespace appear to have their own global resources.
In this article, we will focus on cgroup.
Why care about CGroup & Namespace
The explosion of cloud native/container technology
From Unix Version 7 in 1979, which introduced the chroot system call and chroot jails, through the open-sourcing of Docker in 2013 and Kubernetes in 2014, to today's thriving cloud native ecosystem, container technology has gradually become one of the mainstream foundational technologies.
As more and more companies and individuals choose cloud service/container technology, resource allocation and isolation, as well as security, have become hot topics of concern and discussion.
Container technology is not difficult to use, but to really use it well in a large-scale production environment, we need to master its core.
The following shows the general development history of container technology and the cloud native ecosystem:
Figure 1. Development history of container technology
The diagram shows the development trajectory of container technology and the cloud native ecosystem. Container technology has been around for a long time, so why did it only take off after Docker? What were the problems with the earlier chroot and Linux-VServer?
Security issues caused by Chroot
Figure 2, chroot example
chroot can isolate a process and its children from the rest of the operating system. A process running as root, however, can leave the chroot at will.
```go
package main

import (
	"log"
	"os"
	"syscall"
)

// getWd logs and returns the current working directory.
func getWd() (path string) {
	path, err := os.Getwd()
	if err != nil {
		log.Println(err)
	}
	log.Println(path)
	return
}

func main() {
	// Keep an open handle on the real root before entering the chroot.
	realRoot, err := os.Open("/")
	if err != nil {
		log.Fatalf("[ Error ] - /: %v\n", err)
	}
	defer realRoot.Close()

	// Enter a chroot rooted at the current directory.
	path := getWd()
	err = syscall.Chroot(path)
	if err != nil {
		log.Fatalf("[ Error ] - chroot: %v\n", err)
	}
	getWd()

	// Change directory through the saved handle, which leaves the jail.
	err = realRoot.Chdir()
	if err != nil {
		log.Fatalf("[ Error ] - chdir(): %v", err)
	}
	getWd()

	// A second chroot(".") restores the original root.
	err = syscall.Chroot(".")
	if err != nil {
		log.Fatalf("[ Error ] - chroot back: %v", err)
	}
	getWd()
}
```
Run as normal user and sudo respectively:
```
➜  chroot go run main.go
2021/11/18 00:46:21 /tmp/chroot
2021/11/18 00:46:21 [ Error ] - chroot: operation not permitted
exit status 1
➜  chroot sudo go run main.go
2021/11/18 00:46:25 /tmp/chroot
2021/11/18 00:46:25 /
2021/11/18 00:46:25 (unreachable)/
2021/11/18 00:46:25 /
```
You can see that when run with sudo, the program switches between the current directory and the system's original root directory, while an ordinary user has no permission to call chroot at all.
Linux VServer security vulnerabilities
Linux-VServer is a software partitioning technology based on Security Contexts. It creates isolated virtual servers that share the same hardware resources. Its main problem is that it does not properly guard against “chroot-again” attacks, which let an attacker escape the restricted environment and access any file outside the restricted directory. (The national information security vulnerability database has published related vulnerabilities since 2004.)
Advantages brought by modern container technology
- Lightweight, low-cost container creation based on the cgroup and namespace capabilities provided by the Linux kernel;
- A certain degree of isolation;
- Standardization: packaging and distributing applications as container images eliminates many of the problems caused by inconsistent environments;
- DevOps support (applications migrate easily between environments such as development, test, and production while retaining full functionality);
- Added protection for infrastructure, improving reliability, scalability, and trust;
- DevOps/GitOps support (fast and efficient continuous release, release management, and configuration);
- Team members can effectively simplify, speed up, and orchestrate application development and deployment.
Now that you know why you should focus on cgroups and namespace technologies, let’s move on to the focus of this article and learn about Cgroups.
What is a cgroup
cgroup is a Linux kernel feature that limits, controls, and isolates the resources (such as CPU, memory, and disk I/O) of a group of processes. It was started by two Google engineers and has been available since Linux kernel v2.6.24, which was officially released in January 2008.
So far there are two major versions of cgroup, v1 and v2. The following content mainly covers cgroup v2; the differences between the two versions are described in a later section.
The main resources restricted by Cgroup are:
- CPU
- Memory
- Network
- Disk I/O
When we allocate available system resources to Cgroups by a specific percentage, the remaining resources are available to other Cgroups or other processes on the system.
Figure 4, sample cgroup resource allocation and remaining available resources
The composition of the cgroup
“cgroup” stands for “control group” and is never capitalized. A cgroup is a mechanism for organizing processes hierarchically and distributing system resources along the hierarchy in a controlled manner. The singular form is used to designate the whole feature and also as a qualifier, as in “cgroup controller”.
cgroup has two main components:
- Core – responsible for hierarchically organizing processes;
- Controller – usually responsible for distributing a specific type of system resource along the hierarchy. Each cgroup has a cgroup.controllers file that lists all controllers available for the cgroup to enable. When multiple operations are specified in a single write to cgroup.subtree_control, either all of them succeed or all of them fail. If multiple operations are specified for the same controller, only the last one takes effect. A cgroup's controllers are destroyed asynchronously, so references to them may linger for a while.
All cgroup core interface files are prefixed with “cgroup.”. Each controller's interface files are prefixed with the controller name and a dot. A controller name consists of lowercase letters and “_”, but never begins with “_”.
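The naming convention can be checked mechanically. The following is a minimal sketch, not part of any cgroup library: the helper names `validControllerName` and `ownerOf` are made up for illustration, and the rule they implement is only the prefix convention described above.

```go
package main

import (
	"fmt"
	"strings"
)

// validControllerName applies the naming rule from the text:
// lowercase letters and '_' only, and never a leading '_'.
func validControllerName(s string) bool {
	if s == "" || s[0] == '_' {
		return false
	}
	for _, r := range s {
		if (r < 'a' || r > 'z') && r != '_' {
			return false
		}
	}
	return true
}

// ownerOf returns "core" for core interface files ("cgroup." prefix),
// otherwise the name of the controller that owns the file.
func ownerOf(file string) (string, error) {
	i := strings.IndexByte(file, '.')
	if i <= 0 {
		return "", fmt.Errorf("%q has no prefix before a dot", file)
	}
	prefix := file[:i]
	if prefix == "cgroup" {
		return "core", nil
	}
	if !validControllerName(prefix) {
		return "", fmt.Errorf("%q is not a valid controller name", prefix)
	}
	return prefix, nil
}

func main() {
	for _, f := range []string{"cgroup.procs", "memory.max", "cpu.weight"} {
		owner, err := ownerOf(f)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s belongs to %s\n", f, owner)
	}
}
```

The sketch only looks at file names; it does not check that the controller actually exists on the running kernel.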
Cgroup core file
- cgroup.type – (single value) a read-write file that exists on non-root cgroups. Writing “threaded” to it turns the cgroup into a threaded cgroup. When read, it reports one of the following:
  - domain – a normal, valid domain cgroup
  - domain threaded – a threaded domain cgroup that serves as the root of a threaded subtree
  - domain invalid – a cgroup in an invalid state
  - threaded – a threaded cgroup that is a member of a threaded subtree
- cgroup.procs – (newline-delimited) a read-write file that exists on all cgroups. Each line lists the PID of a process that belongs to the cgroup. The PIDs are not ordered, and the same PID may appear more than once if the process was moved to another cgroup and then back, or if the PID was recycled while reading;
- cgroup.controllers – (space-delimited) a read-only file that exists on all cgroups. It lists all controllers available to the cgroup;
- cgroup.subtree_control – (space-delimited) a read-write file that exists on all cgroups and is initially empty. If a controller appears more than once in a write, the last occurrence takes effect. When multiple enable and disable operations are specified, either all succeed or all fail.
  - A controller name prefixed with “+” enables the controller
  - A controller name prefixed with “-” disables the controller
- cgroup.events – a read-only file that exists on non-root cgroups.
  - populated – 1 if the cgroup or its descendants contain live processes; otherwise 0.
  - frozen – 1 if the cgroup is frozen; otherwise 0.
- cgroup.threads – (newline-delimited) a read-write file that exists on all cgroups. Each line lists the TID of a thread that belongs to the cgroup. The TIDs are not ordered, and the same TID may appear more than once if the thread was moved to another cgroup and then back, or if the TID was recycled while reading.
- cgroup.max.descendants – (single value) a read-write file. The maximum number of descendant cgroups allowed.
- cgroup.max.depth – (single value) a read-write file. The maximum tree depth allowed below the current cgroup.
- cgroup.stat – a read-only file.
  - nr_descendants – the number of visible descendant cgroups.
  - nr_dying_descendants – the number of descendant cgroups that have been deleted by the user and are waiting to be destroyed by the system.
- cgroup.freeze – (single value) a read-write file that exists on non-root cgroups. The default value is 0. Writing 1 freezes the cgroup and all of its descendant cgroups: their processes are stopped and do not run again until the cgroup is unfrozen. Freezing takes some time. When it completes, the frozen value in the cgroup.events file is updated to 1 and a corresponding notification is sent. The frozen state of a cgroup does not affect any cgroup tree operations (delete, create, and so on).
- cgroup.kill – (single value) a write-only file that exists on non-root cgroups. The only allowed value is 1, which kills the cgroup and all of its descendants (the processes are killed with SIGKILL). It is generally used to kill a whole cgroup tree, and the operation is protected against concurrent migrations of processes in the tree.
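Files such as cgroup.events and cgroup.stat hold one `key value` pair per line, so reading them takes only a few lines of code. Here is a minimal sketch of such a parser; the function name `parseFlatKeyed` is an illustrative choice, and the sample content in `main` is hard-coded rather than read from a real cgroup.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseFlatKeyed parses "key value" lines, the layout used by
// cgroup.events and cgroup.stat, into a map.
func parseFlatKeyed(content string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", sc.Text())
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", sc.Text(), err)
		}
		out[fields[0]] = v
	}
	return out, sc.Err()
}

func main() {
	// Sample cgroup.events content; a real program would read the file
	// from the cgroup's directory instead.
	events, err := parseFlatKeyed("populated 1\nfrozen 0\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("has live processes:", events["populated"] == 1)
	fmt.Println("is frozen:", events["frozen"] == 1)
}
```

A caller that writes 1 to cgroup.freeze could poll this parser on cgroup.events until `frozen` becomes 1, since freezing is not instantaneous.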
Cgroup ownership and migration
Each process in the system belongs to exactly one cgroup, and all threads of a process belong to the same cgroup. A process can migrate from one cgroup to another. Migrating a process does not affect the cgroups to which its existing descendant processes belong.
Figure 5, cgroup assignment of a process and its children; Examples of cross-CGroup migration
Migrating a process across cgroups is an expensive operation, and stateful resources (for example, memory) are not moved along with the process. As a result, frequently migrating processes across cgroups as a means of applying different resource limits is discouraged.
How is cross-CGroup migration implemented
Each cgroup has a read-write interface file, “cgroup.procs”, which records every process managed by the cgroup, one PID per line. A process is migrated by writing its PID to the “cgroup.procs” file of the target cgroup.
Only one process can be migrated per write(2) call. (If the process has multiple threads, all of them are migrated together; placing the threads of one process in different cgroups requires a threaded subtree.)
When a process forks a child process, that process is born in the cgroup to which its parent process belongs.
A Cgroup without any children or active processes can be destroyed by deleting the directory (even if there are associated zombie processes, it is considered removable).
What are cgroups
Use the plural “cgroups” when explicitly referring to multiple separate control groups.
cgroups form a tree structure. Each non-root cgroup has a cgroup.events file containing a populated field, which indicates whether the cgroup's sub-hierarchy contains live processes. The cgroup.subtree_control file of a non-root cgroup can contain only controllers that are enabled in its parent.
Figure 6, cgroups example
As shown in the figure, cgroup1 enables CPU and memory limits for its children, so it controls the CPU cycles and memory allocation of the nodes below it (that is, the CPU and memory resources of cgroup2, cgroup3, and cgroup4 are restricted). cgroup2 enables memory limits but not CPU limits. As a result, the memory of cgroup3 and cgroup4 is limited by the memory settings in cgroup2, while cgroup3 and cgroup4 compete freely for CPU within the CPU limit set by cgroup1.
This shows clearly that cgroup resources are distributed top-down. A downstream cgroup can further distribute a resource only after that resource has been distributed to it from the upstream cgroup. The cgroup.subtree_control file of a non-root cgroup can contain only controllers that are enabled in the parent's cgroup.subtree_control file.
Then, will the processes of a parent cgroup compete with those of its child cgroups?
Of course not. In cgroup v2, a non-root cgroup can distribute domain resources to its children only when it has no processes of its own. In short, only a cgroup that does not contain any processes can have domain controllers enabled in its cgroup.subtree_control file, which ensures that processes always sit on leaf nodes.
Mount and delegate
cgroup mount options
- memory_recursiveprot – recursively applies memory.min and memory.low protection to the entire subtree, without requiring explicit downward propagation into leaf cgroups. Leaf nodes within the subtree can still compete freely with one another;
- memory_localevents – populates memory.events with data for the current cgroup only. Without this option, the default is to include counts for the whole subtree. It is a system-wide option that can only be set at mount time or modified by remounting from the init namespace;
- nsdelegate – treats cgroup namespaces as delegation boundaries, which is one of the two ways to delegate a cgroup. It is also a system-wide option that can only be set at mount time or modified by remounting from the init namespace.
cgroup delegation methods
- Set the mount option nsdelegate;
- Grant the user write access to the directory and to its cgroup.procs, cgroup.threads, and cgroup.subtree_control files.
You get the same result either way. Once delegated, the user can create a sub-hierarchy under the directory, and all resource distribution within it is constrained by the parent. Currently, cgroup places no limit on the number of cgroups in a delegated sub-hierarchy or on its nesting depth (this may be explicitly limited later).
We mentioned cross-cgroup migration earlier, and delegation makes it clear that migration is restricted for ordinary users: the writer needs write access to the cgroup.procs file of the destination cgroup and to the cgroup.procs file of the common ancestor of the source and destination cgroups.
Delegate and Migrate
Figure 7 shows an example of delegating permissions
As shown, the ordinary user User0 has been delegated cgroup[1-5].
Why did User0 fail to migrate the process from cgroup3 to cgroup5?
Because User0 only has permissions on cgroup1 and cgroup2 and has none on cgroup0. The delegation rules state explicitly that the writer needs write access to the “cgroup.procs” file of the common ancestor, which in the figure means cgroup0.
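The permission rule can be expressed as a small check. This is a sketch only: the hierarchy paths used below are an assumed layout consistent with the explanation above (cgroup3 under cgroup1, cgroup5 under cgroup2, both under cgroup0), and `commonAncestor` and `canMigrate` are illustrative names, not kernel or library functions.

```go
package main

import (
	"fmt"
	"strings"
)

// commonAncestor returns the deepest cgroup path that is an ancestor of
// (or equal to) both a and b. Paths are relative to the cgroup root.
func commonAncestor(a, b string) string {
	as := strings.Split(strings.Trim(a, "/"), "/")
	bs := strings.Split(strings.Trim(b, "/"), "/")
	var shared []string
	for i := 0; i < len(as) && i < len(bs) && as[i] == bs[i]; i++ {
		shared = append(shared, as[i])
	}
	return "/" + strings.Join(shared, "/")
}

// canMigrate applies the delegation rule: the writer needs write access
// to cgroup.procs in the destination and in the common ancestor.
// writable holds the cgroups whose cgroup.procs the user may write.
func canMigrate(writable map[string]bool, src, dst string) bool {
	return writable[dst] && writable[commonAncestor(src, dst)]
}

func main() {
	// Assumed layout; User0 can write everywhere except cgroup0.
	writable := map[string]bool{
		"/cgroup0/cgroup1":         true,
		"/cgroup0/cgroup2":         true,
		"/cgroup0/cgroup1/cgroup3": true,
		"/cgroup0/cgroup1/cgroup4": true,
		"/cgroup0/cgroup2/cgroup5": true,
	}
	src := "/cgroup0/cgroup1/cgroup3"
	for _, dst := range []string{"/cgroup0/cgroup1/cgroup4", "/cgroup0/cgroup2/cgroup5"} {
		fmt.Printf("%s -> %s (common ancestor %s): allowed=%v\n",
			src, dst, commonAncestor(src, dst), canMigrate(writable, src, dst))
	}
}
```

In this layout the move to cgroup4 stays inside cgroup1 and is allowed, while the move to cgroup5 has to cross cgroup0 and is refused.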
Resource allocation model and function
cgroups use the following resource distribution models:
- Weight – (for example, cpu.weight) all weights are in the range [1, 10000], with a default of 100. Resources are distributed in proportion to the weights.
- Limit – (for example, io.max) in the range [0, max], with a default of “max”, which is a no-op. Limits can be over-committed (the sum of the children's limits can exceed the amount of resource available to the parent).
- Protection – (for example, io.low) in the range [0, max], with a default of 0, which is a no-op. A protection can be a hard guarantee or a best-effort soft boundary, and protections can also be over-committed.
- Allocation – in the range [0, max], with a default of 0, which means no resource. Allocations cannot be over-committed (the sum of the children's allocations cannot exceed the amount available to the parent).
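The weight model is the easiest one to illustrate with arithmetic: each active child receives its weight divided by the sum of all active siblings' weights. The sketch below computes those fractions. It is a simplification that assumes every listed child is actively competing for the resource, and `shares` is an illustrative name.

```go
package main

import "fmt"

// shares returns the fraction of the parent's resource that each child
// receives under the weight model, assuming all children are active.
func shares(weights map[string]int) (map[string]float64, error) {
	total := 0
	for name, w := range weights {
		if w < 1 || w > 10000 {
			return nil, fmt.Errorf("weight %d of %s is outside [1, 10000]", w, name)
		}
		total += w
	}
	out := make(map[string]float64, len(weights))
	for name, w := range weights {
		out[name] = float64(w) / float64(total)
	}
	return out, nil
}

func main() {
	// Three hypothetical sibling cgroups; "b" has twice the default weight.
	s, err := shares(map[string]int{"a": 100, "b": 200, "c": 100})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range []string{"a", "b", "c"} {
		fmt.Printf("%s: %.0f%%\n", name, s[name]*100)
	}
}
```

With weights 100, 200, and 100, the children receive one quarter, one half, and one quarter of the resource respectively. When a child goes idle, its share is redistributed among the remaining active siblings.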
cgroups provide the following features:
- Resource limits – as illustrated in the cgroup section above, cgroups can be nested to restrict resources in a tree structure.
- Priority – when resource contention occurs, decides which processes get their resources first.
- Audit – monitors and reports on resource limits and usage.
- Control – controls the state of processes (start, stop, suspend).
Cgroup V1 and Cgroup V2
Deprecated core functionality
cgroup v2 is quite different from cgroup v1. Let's take a look at which cgroup v1 features are deprecated in cgroup v2:
- Multiple hierarchies, including named hierarchies, are not supported;
- Not all v1 mount options are supported;
- The tasks file is removed, and the cgroup.procs file is not sorted.
  - In cgroup v1, cgroup.procs is the list of thread group IDs in the cgroup. The list is not guaranteed to be sorted or free of duplicate TGIDs; if that property is required, user space should sort and deduplicate the list. Writing a thread group ID to this file moves all threads in that group to the cgroup;
- cgroup.clone_children is removed. clone_children affects only the cpuset controller: if it is enabled (set to 1) in a cgroup, a new cpuset cgroup copies its configuration from the parent cgroup during initialization;
- /proc/cgroups is meaningless for v2. Use the “cgroup.controllers” file in the root directory instead.
Cgroup V1
The most significant difference between cgroup v2 and v1 is that cgroup v1 allows any number of hierarchies, which causes some problems. Let's look at them in more detail.
When mounting a cgroup hierarchy, you can specify a comma-separated list of subsystems to mount as filesystem mount options. By default, mounting a cgroup filesystem attempts to mount a hierarchy containing all registered subsystems.
If an active hierarchy with exactly the same set of subsystems already exists, it is reused for the new mount.
If no existing hierarchy matches and any of the requested subsystems is in use in an existing hierarchy, the mount fails with -EBUSY. Otherwise, a new hierarchy associated with the requested subsystems is activated.
There is currently no way to bind a new subsystem to an active cgroup hierarchy or to unbind a subsystem from one. When a cgroup filesystem is unmounted, the hierarchy remains active if any child cgroups have been created under the top-level cgroup; if there are no child cgroups, the hierarchy is deactivated.
These are the problems of cgroup v1, and cgroup v2 solves them well.
Cgroup and container associations
Here we take Docker as an example. Create a container and limit the CPU and memory it can use:
```
➜  ~ docker run --rm -d --cpus=2 --memory=2g --name=2c2g redis:alpine
e420a97835d9692df5b90b47e7951bc3fad48269eb2c8b1fa782527e0ae91c8e
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max
200000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
2147483648
➜  ~
➜  ~ docker run --rm -d --cpus=0.5 --memory=0.5g --name=0.5c0.5g redis:alpine
8b82790fe0da9d00ab07aac7d6e4ef2f5871d5f3d7d06a5cdb56daaf9f5bc48e
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max
50000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
536870912
```
As you can see from the example above, when we create a new container with Docker and specify CPU and memory limits for it, the cpu.max and memory.max files of the corresponding cgroup are set to the matching values.
If you want to check the resource quota of a container that is already running, you can read the same files.
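To turn those raw values back into the numbers passed to Docker, note that cpu.max holds a quota and a period in microseconds, so `200000 100000` means 200000 µs of CPU time per 100000 µs period, or 2 CPUs, and memory.max holds a byte count. The sketch below decodes both files. The function names are illustrative, and the inputs in `main` are the values shown above rather than live reads.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax decodes cpu.max content of the form "$MAX $PERIOD".
// It returns the CPU count (quota / period), or unlimited=true
// when the quota is "max".
func parseCPUMax(s string) (cpus float64, unlimited bool, err error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content %q", s)
	}
	period, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil || period == 0 {
		return 0, false, fmt.Errorf("bad period in %q", s)
	}
	if fields[0] == "max" {
		return 0, true, nil
	}
	quota, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("bad quota in %q", s)
	}
	return float64(quota) / float64(period), false, nil
}

// parseMemoryMax decodes memory.max content: a byte count or "max".
func parseMemoryMax(s string) (limit uint64, unlimited bool, err error) {
	s = strings.TrimSpace(s)
	if s == "max" {
		return 0, true, nil
	}
	limit, err = strconv.ParseUint(s, 10, 64)
	return limit, false, err
}

func main() {
	for _, raw := range []string{"200000 100000", "50000 100000"} {
		cpus, unlimited, err := parseCPUMax(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("cpu.max %q -> cpus=%.1f unlimited=%v\n", raw, cpus, unlimited)
	}
	for _, raw := range []string{"2147483648", "536870912"} {
		limit, unlimited, err := parseMemoryMax(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("memory.max %q -> %d MiB unlimited=%v\n", raw, limit/(1<<20), unlimited)
	}
}
```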
Conclusion
That’s it for cgroups, one of the cornerstones of container technology. I’ll also be writing about namespace and other container technologies, so stay tuned!
Please feel free to subscribe to my official account [MoeLove]