Cgroups is short for Linux Control Groups, a feature of the Linux kernel. Its main function is to limit, account for, and isolate the physical resources (CPU, memory, I/O, etc.) used by groups of processes.

In 2006, a group of Google engineers (primarily Paul Menage and Rohit Seth) started the project, originally called Process Containers. Because the name "container" was ambiguous in the kernel, the project was renamed Control Groups in 2007 and merged into the 2.6.24 kernel released in 2008.

The original version of cgroups, now known as v1, was not designed to be user-friendly and was difficult to understand. Subsequent development was taken over by Tejun Heo, who redesigned and rewrote cgroups; the new version, called v2, first appeared in kernel 4.5.

Cgroups were designed with a clear mission: resource control for processes. Their main functions include:

  • Resource limits: Limits the maximum amount of resources a process can use, such as maximum memory and file system cache limits
  • Priority control: Different groups can have different priorities, such as CPU usage and disk I/O throughput
  • Audit: Calculates the resource usage of the group and can be used for accounting
  • Control: Suspends or resumes a group of processes

At present, Cgroups has become the basis of many technologies, such as LXC, Docker, Systemd and so on.

NOTE: Resource limits are the focus of this article, as they are also the basis of container technologies such as Docker.

Cgroups core concepts

As mentioned earlier, Cgroups manage resources for processes, so they need to abstract two concepts, processes and resources, and decide how to organize them. There are several very important concepts in Cgroups:

  • Task: a task that corresponds to an entity running in a system, usually a process
  • Subsystem: a specific resource controller (Resource class or Resource controller) that controls the use of a particular resource. For example, the CPU subsystem controls CPU time, and the Memory subsystem controls memory usage
  • Cgroup: a control group, an association between a group of tasks and subsystems that represents the resource management policy for those tasks
  • Hierarchy: a tree structure made up of Cgroups. Each node is a Cgroup, and a Cgroup can have multiple children, which inherit their parent's attributes by default. Multiple hierarchies can exist in the system

Although Cgroups support hierarchies, which allow different subresources to be attached to different directories, there are various restrictions between multiple trees, which adds complexity to understanding and maintenance. In practice, all subresources are mounted under a single path (e.g. /sys/fs/cgroup/ on Ubuntu 20.04), so this article doesn’t cover multiple trees in detail.

Subsystems (resource classes or sub-resources)

There are currently the following resource subsystems:

  • Block I/O (blkio): limits the I/O rates of block devices (disks, SSDs, USB devices, etc.)
  • CPU Set (cpuset): limits which CPU cores a task can run on
  • CPU Accounting (cpuacct): generates reports on CPU usage by tasks in the cgroup
  • CPU (cpu): limits the CPU time allocated by the scheduler
  • Devices (devices): allows or denies access to devices by tasks in the cgroup
  • Freezer (freezer): suspends or resumes tasks in the cgroup
  • Memory (memory): limits the amount of memory used by tasks in the cgroup and generates reports on their current memory usage
  • Network Classifier (net_cls): tags the packets generated in the cgroup with a specific classid, so that tools such as tc can shape traffic according to the tag
  • Network Priority (net_prio): sets the priority of packets per network interface
  • Perf Event (perf_event): identifies the cgroup membership of tasks and can be used for performance analysis

Cgroups file system

The Linux kernel uses a variety of internal data structures to store cgroup configuration and to associate processes with cgroup nodes, so how does it let user processes use cgroup functionality? The Linux kernel has a very powerful module called VFS (Virtual File System). The VFS hides the details of specific file systems and provides a unified file system API for user-mode processes. Cgroups also expose their functionality to user space through the VFS; the interface between Cgroups and the VFS is called the Cgroups file system. Let’s take a look at the basics of VFS and then at the Cgroups file system implementation.

VFS

VFS is a kernel abstraction layer that hides the implementation details of a specific file system and provides a unified API for user-mode processes. VFS uses a general-purpose file system design, in which specific file systems that implement the VFS design interface can be registered with VFS so that the kernel can read and write to the file system. This is much like the relationship between an abstract class and a subclass in object-oriented design, where the abstract class is responsible for the design of the external interface and the subclass is responsible for the concrete implementation. VFS is itself a set of object-oriented interfaces implemented in C.

Common file model

The VFS common file model contains the following four metadata structures:

  1. A superblock object is used to store information about registered file systems. For example, basic disk file systems such as ext2 and ext3, socket file systems for reading and writing sockets, and the current Cgroups file system for reading and writing cgroups configuration information, etc.
  2. Inode objects are used to store information about specific files. For a common disk file system, inode nodes generally store information such as file blocks in hard disks. For socket file systems, inodes store socket-related attributes, and for special file systems such as Cgroups, inodes store attributes related to cgroup nodes. An important part of this is a structure called inode_operations, which defines the implementation of creating files, deleting files, and so on in a specific file system.
  3. File object: A file object represents an open file in a process and is stored in the process’s file descriptor table. Also important in this object is a structure called file_operations, which describes the read and write implementation of a specific file system. When a process calls read or write on one of its file descriptors, it actually calls the methods defined in file_operations. For a normal disk file system, file_operations defines ordinary block device read and write operations. For socket file systems, file_operations defines the send/recv operations of sockets. For special file systems like Cgroups, file_operations defines the specific operations on cgroup structures.
  4. Dentry object: When searching for a file along a path, the kernel generates a dentry object for each component of the path. The inode object can be found through the dentry object. Dentry objects are generally cached to speed up the kernel's lookups.

The relationship between Subsystems, Hierarchies, Control Groups, and Tasks

Subsystems, Hierarchies, Control Groups, and Tasks are governed by several rules, described below:

Rule 1

  • The same hierarchy can have one or more subsystems attached to it.

In the figure below, the cpu and memory subsystems (or any number of subsystems) are attached to the same hierarchy.

Rule 2

  • A subsystem can only be attached to one hierarchy.

The CPU Subsystem has been attached to Hierarchy A and the memory subsystem has been attached to Hierarchy B. Therefore the CPU subsystem cannot be attached to hierarchy B again.


Rule 3

Every time the system creates a hierarchy, all tasks on the system initially belong to the default Cgroup of that new hierarchy, also called the root Cgroup. Within each hierarchy, a task can belong to only one Cgroup; that is, a task cannot be in two different Cgroups of the same hierarchy, but it can be in Cgroups of different hierarchies. If a task is added to another Cgroup in the same hierarchy, it is removed from the first Cgroup.

As shown below, the cpu and memory subsystems are attached to the cpu_mem_cg hierarchy, while the net_cls subsystem is attached to the net_cls hierarchy. The httpd process is added both to the cg1 Cgroup of the cpu_mem_cg hierarchy and to the cg3 Cgroup of the net_cls hierarchy, so the two hierarchies constrain the httpd process's CPU, memory, and network bandwidth respectively.


Rule 4

When a task (a Linux process) forks a child task, the child automatically starts in the same Cgroups as its parent, but it can then be moved to different Cgroups as required; after the fork, parent and child tasks are independent of each other.

In the following figure, the httpd process is in the /cg1 Cgroup of the cpu_and_mem hierarchy, and its PID 4537 is written to that Cgroup's tasks file. The httpd process (PID=4537) then forks a child httpd process (PID=4840), which starts in the same Cgroup of the same hierarchy as its parent. Because parent and child tasks are independent after the fork, the child can later be moved to other Cgroups.


The use of cgroups

What’s interesting about the cgroup kernel functionality is that it doesn’t provide any system call interface; instead it is exposed as a Linux VFS implementation, so you can operate on it in a file-system-like fashion.

There are several ways to use cgroups:

  • Use the virtual file system provided by Cgroups to control cgroups directly by creating, reading, and deleting directories and files
  • Use command-line tools such as the cgcreate, cgexec, and cgclassify commands provided by the libcgroup package
  • Use the cgroup rules engine daemon (cgrulesengd) and its configuration file
  • Use software such as Systemd, LXC, and Docker that encapsulates Cgroups and exposes cgroup control through its own interfaces

Directly operate the Cgroup file system

Query cgroups mount information

On an Ubuntu 18.04 machine, cgroups has been mounted to the file system and can be viewed using the mount command:

$ mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

If not, you can also mount the desired subsystem to the system using the following command:

$ mount -t cgroup -o cpu,cpuset,memory cpu_and_mem /cgroup/cpu_and_mem

The preceding command mounts the cpu, cpuset, and memory subsystems to /cgroup/cpu_and_mem.

Under each cgroup directory there are files that describe the cgroup. In addition to the resource control files that are unique to each cgroup, there are some common files:

  • tasks: the list of PIDs of tasks in the current Cgroup. Writing a process's PID to this file moves the process into the Cgroup
  • cgroup.procs: the list of thread group IDs (processes) currently in the Cgroup; it is used in the same way as tasks
  • notify_on_release: 0 or 1, whether to notify when the Cgroup is released. If it is 1, the kernel runs the command specified in release_agent when the last task leaves the Cgroup (exits or migrates to another Cgroup) and the last child Cgroup is removed
  • release_agent: the command to execute when the above notification fires (a sketch of how these files fit together follows)
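
A minimal sketch of how these common files fit together, assuming a memory hierarchy mounted at /sys/fs/cgroup/memory, an existing mycgroup directory, and a hypothetical cleanup script /usr/local/bin/cgroup-cleanup.sh:

# release_agent lives only in the root of the hierarchy
echo /usr/local/bin/cgroup-cleanup.sh > /sys/fs/cgroup/memory/release_agent
# ask to be notified when this particular cgroup becomes empty
echo 1 > /sys/fs/cgroup/memory/mycgroup/notify_on_release
# inspect which processes are currently in the cgroup
cat /sys/fs/cgroup/memory/mycgroup/cgroup.procs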

Create cgroup

To create a cgroup, use mkdir to create a directory in the corresponding subresource:

$ mkdir /sys/fs/cgroup/cpu/mycgroup
$ ll /sys/fs/cgroup/cpu/mycgroup
total 0
-rw-r--r-- 1 root root 0 Dec 13 08:02 cgroup.clone_children
-rw-r--r-- 1 root root 0 Dec 13 08:02 cgroup.procs
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.stat
-rw-r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_all
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_sys
-r--r--r-- 1 root root 0 Dec 13 08:02 cpuacct.usage_user
-rw-r--r-- 1 root root 0 Dec 13 08:02 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Dec 13 08:02 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Dec 13 08:02 cpu.shares
-r--r--r-- 1 root root 0 Dec 13 08:02 cpu.stat
-rw-r--r-- 1 root root 0 Dec 13 08:02 notify_on_release
-rw-r--r-- 1 root root 0 Dec 13 08:02 tasks

The preceding command creates mycgroup in the CPU subresource. After the cgroup is created, the required files will be automatically created in the directory. We’ll explain what these files mean later, but for now they control the corresponding child resources.

Delete the cgroup

To delete a subresource, delete the corresponding directory:

rmdir /sys/fs/cgroup/cpu/mycgroup/

Note that rmdir succeeds only when the cgroup is empty; if the tasks file still contains processes, the removal fails and you must first move those processes elsewhere (for example, to the parent cgroup).

Set cgroup parameters

To set a group parameter is to write something in a particular format to a particular file, such as limiting the number of CPU cores a cgroup can use:

echo 0-1 > /sys/fs/cgroup/cpuset/mycgroup/cpuset.cpus

Add a process to a cgroup

To add a running process to a cgroup, write the PID of the process directly into the cgroup Tasks file:

echo 2358 > /sys/fs/cgroup/memory/mycgroup/tasks

Run processes in a cgroup

What if you want to run a process directly in a Cgroup, but do not know the Pid of the process before running it?

We can do this using cgroup inheritance, since the child inherits the parent’s Cgroup, so we can add the current shell to the desired Cgroup:

echo $$ > /sys/fs/cgroup/cpu/mycgroup/tasks

The above scheme has a flaw that the original shell is still in the cgroup after it is run. If you want the process to run without affecting the current shell, you can create a temporary shell:

sh -c "echo \$$ > /sys/fs/cgroup/memory/mycgroup/tasks & & stress -m 1"

Move a process to another cgroup

If you want to move a process to another cgroup, just echo its PID into the target cgroup's tasks file; the tasks file of the original cgroup removes the process automatically.
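
For example, a sketch that moves the hypothetical PID 2358 used above from the memory cgroup mycgroup into another cgroup named othercgroup (both directories are assumed to exist):

echo 2358 > /sys/fs/cgroup/memory/othercgroup/tasks
# PID 2358 no longer appears here
cat /sys/fs/cgroup/memory/mycgroup/tasks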

cgroup-tools

The cgroup-tools package provides a series of commands to operate and manage cgroups. On Ubuntu it can be installed with:

sudo apt-get install -y cgroup-tools

List cgroup mount information

At its simplest, lssubsys lists the subsystems that exist in the system and where they are mounted:

lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids
rdma /sys/fs/cgroup/rdma

Create cgroup

cgcreate can be used to create the specified cgroups:

sudo cgcreate -a cizixs -t cizixs -g cpu,memory:test1 
ls cpu/test1 
cgroup.clone_children  cpuacct.stat   cpuacct.usage_all     cpuacct.usage_percpu_sys   cpuacct.usage_sys   cpu.cfs_period_us  cpu.shares  notify_on_release
cgroup.procs           cpuacct.usage  cpuacct.usage_percpu  cpuacct.usage_percpu_user  cpuacct.usage_user  cpu.cfs_quota_us   cpu.stat    tasks

The command above creates a test1 directory under /sys/fs/cgroup/cpu and /sys/fs/cgroup/memory respectively.

  • -t specifies the user and group that own the tasks file, i.e. who can add tasks to the Cgroup; by default they are inherited from the parent Cgroup
  • -a specifies the user and group that own all files other than tasks (the resource control files), i.e. who can manage the resource parameters
  • -g specifies the cgroup to create: a comma-separated list of subresource types before the colon, and the cgroup path (relative to the directory where each resource is mounted) after it. In other words, it adds a cgroup for the specified subresources under the specified directory

Delete the cgroup

If you know how to create cgroups, you should also know how to delete them; otherwise unused Cgroups pile up in the system, wasting resources and making management troublesome.

cgdelete deletes the corresponding cgroups. Like cgcreate, it uses -g to specify the cgroup to delete:

sudo cgdelete -g cpu,memory:test1

cgdelete also provides the -r flag to recursively delete a cgroup and all of its child cgroups.

If there are tasks in the deleted CGroup, the tasks are automatically moved to the parent CGroup.

Set cgroup parameters

The cgset command can set parameters for a subresource, such as limiting the number of CPU cores a cgroup task can use:

cgset -r cpuset.cpus=0-1 /mycgroup

-r is followed by the key-value pairs of the parameters. Each child resource has its own rules for the key-value pairs that can be configured, which we will explain in detail later.

cgset can also copy parameters from one cgroup to another:

cgset --copy-from group1/ group2/

NOTE: cgset does not return an error when a setting fails, so be careful.

Run a process in a Cgroup

cgexec executes a program and adds it to the corresponding cgroups:

cgexec -g memory,cpu:cizixs bash

Cgroups can be hierarchical, so you can directly create a hierarchical Cgroup and run within that cgroup:

cgcreate -g memory,cpu:groupname/foo
cgexec -g memory,cpu:groupname/foo bash

Move an already running process to a Cgroup

To move an existing program (with its PID known) to a Cgroup, use the cgclassify command:

For example, to move the current bash shell into a specific cgroup:

cgclassify -g memory,cpu:/mycgroup $$

$$ indicates the PID of the current process. If /mycgroup limits CPU and memory, the command above makes it easy to test processes that consume memory or CPU.

This command can also move multiple processes at once, with their PIDs separated by spaces:

cgclassify -g cpu,memory:group1 1701 1138

Cgroup subresource parameters

Each subsystem manages part of the system's resources and provides multiple parameters to control them. Each parameter corresponds to a file; you control the resources by writing content in a specific format to the file.

Blkio: limits device I/O access

There are two ways to limit disk I/O: by weight and by absolute limit. With weights, each application (or Cgroup) is assigned a proportion and shares I/O resources accordingly; with limits, an application is given a maximum read/write rate.

To set the weights of cgroup devices:

Setting a weight is not a hard guarantee. When only one application is reading or writing to the disk, it can use the whole disk regardless of its weight; only when multiple applications read and write the disk at the same time is each assigned a share of the I/O bandwidth based on its weight.

  • blkio.weight: sets the default read/write weight of the Cgroup's devices. The value ranges from 100 to 1000
  • blkio.weight_device: sets the weight for a specific device used by the Cgroup; when that device is accessed, this value overrides blkio.weight. The format is major:minor weight, where the major and minor numbers uniquely identify a device and the weight is an integer between 100 and 1000. Device numbers can be looked up at www.kernel.org/doc/html/v4… (see the sketch after this list)
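
A minimal sketch of setting weights, assuming a cgroup named mygroup under the blkio hierarchy and that /dev/sda has device number 8:0:

# default I/O weight for the whole cgroup
echo 500 > /sys/fs/cgroup/blkio/mygroup/blkio.weight
# override the weight for /dev/sda (8:0) only
echo "8:0 300" > /sys/fs/cgroup/blkio/mygroup/blkio.weight_device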

Set the limit for cgroup to access devices:

In addition to setting weights, you can also set an upper limit on disk usage to ensure that the read/write rate of the Cgroup's processes does not exceed a certain value.

  • blkio.throttle.read_bps_device: Maximum number of bytes read from the device per second
  • blkio.throttle.read_iops_device: Maximum number of read operations performed from the device per second
  • blkio.throttle.write_bps_device: Maximum number of bytes written to the device per second
  • blkio.throttle.write_iops_device: Maximum number of write operations that can be performed to the device per second

The two byte-rate files share the same format: major:minor bytes_per_second. The first two numbers identify a device, and the integer that follows is the number of bytes that can be read or written per second. To limit the read rate of /dev/sda to 10 MB/s, run the following command:

echo "8:0 10485760" >
/sys/fs/cgroup/blkio/mygroup/blkio.throttle.read_bps_device

The iops files limit the number of read/write operations per second. The format is major:minor operations_per_second. For example, to limit writes to 10 operations per second, run:

echo "8:0 10" >
/sys/fs/cgroup/blkio/mygroup/blkio.throttle.write_iops_device

In addition to limiting disk usage, blkio also provides statistics on disk usage, both for the throttle rules and for the CFQ scheduler.

  • blkio.throttle.io_serviced: the number of I/O operations the Cgroup performed on each device, in the format major:minor operation number, where operation is one of read, write, sync, async, or total
  • blkio.throttle.io_service_bytes: similar to the above, but records the number of bytes transferred by the operations
  • blkio.reset_stats: resets the statistics when an integer value is written to the file
  • blkio.time: the time the Cgroup had I/O access to each device, in the format major:minor milliseconds
  • blkio.io_serviced: the number of I/O operations the Cgroup performed on each device under the CFQ scheduler; unlike blkio.throttle.io_serviced, it counts requests that are not subject to throttling
  • blkio.io_service_bytes: the number of bytes the Cgroup transferred to each device under the CFQ scheduler
  • blkio.sectors: the number of sectors the Cgroup transferred to each device, in the format major:minor sector_count
  • blkio.io_queued: the number of the Cgroup's I/O requests currently queued, in the format number operation
  • blkio.dequeue: the number of times the Cgroup's I/O requests were dequeued by each device, in the format major:minor number
  • blkio.avg_queue_size: the average size of the Cgroup's I/O request queue over the Cgroup's lifetime
  • blkio.io_merged: the number of times the Cgroup's BIOs were merged into existing I/O requests, in the format number operation
  • blkio.io_wait_time: the total time the Cgroup's I/O requests spent waiting in the scheduler queues
  • blkio.io_service_time: the time between request dispatch and request completion for the Cgroup's requests under the CFQ scheduler

CPU: Limits the CPU usage of the process group

CPU sub-resources manage how tasks in a Cgroup use the CPU. There are two scheduling modes: the Completely Fair Scheduler (CFS) and the real-time scheduler (RT). The former allocates CPU time slices to tasks in proportion to their weights (and can also cap their usage), while the latter limits the CPU time available to real-time tasks.

CFS tuning parameters:

Under CFS scheduling, each Cgroup is assigned a weight, but the weight does not guarantee that the tasks get a specific amount of CPU. If only one process is running (in practice a machine rarely runs just one process), it can use all CPU resources regardless of its Cgroup's weight. Only when CPU resources are contended does the kernel allocate CPU time slices to each Cgroup in proportion to its weight.

In CFS scheduling mode, you can also set an upper limit on the amount of CPU that a Cgroup's tasks can use.

The relevant values are expressed in microseconds (us):

  • cpu.cfs_quota_us: the amount of CPU time available to all tasks in the Cgroup during each period. The default value is -1, which means CPU usage is not limited. It works together with cpu.cfs_period_us (see the example after this list)
  • cpu.cfs_period_us: the length of each scheduling period. It is usually set to 100000 (100 ms, the value Docker uses). To limit the Cgroup's tasks to 0.5 CPU seconds per second, keep cpu.cfs_period_us at 100000 and set cpu.cfs_quota_us to 50000. If cfs_quota_us is larger than cfs_period_us, the tasks can use more than one core; for example, 200000 means the tasks may use 2.0 cores
  • cpu.stat: CPU usage statistics. nr_periods is the number of periods that have elapsed; nr_throttled is the number of times the Cgroup's tasks were throttled (because they exceeded the quota); throttled_time is the total time they were throttled
  • cpu.shares: the relative weight of the Cgroup's CPU usage. If two Cgroups both have shares of 100, they should get the same amount of CPU when their tasks run at the same time; if one of them is changed to 200, it can use twice as much CPU time as the other
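
A minimal sketch of capping a cgroup at half a core under CFS, assuming the mycgroup directory created earlier under the cpu hierarchy:

# 100 ms period with a 50 ms quota per period => at most 0.5 CPU
echo 100000 > /sys/fs/cgroup/cpu/mycgroup/cpu.cfs_period_us
echo 50000 > /sys/fs/cgroup/cpu/mycgroup/cpu.cfs_quota_us
# give the group half the default relative weight (default is 1024)
echo 512 > /sys/fs/cgroup/cpu/mycgroup/cpu.shares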

Parameters in RT scheduling mode:

In RT scheduling mode, the upper limit is similar to that in CFS, except that it only limits the CPU for real-time tasks.

  • cpu.rt_period_us: the length of a period, i.e. how often the Cgroup's real-time CPU allocation is refreshed
  • cpu.rt_runtime_us: the longest continuous CPU time that the Cgroup's real-time tasks can get within one period. The limit is per CPU core; for multiple cores, multiply by the number of cores

Cpuacct: collects statistics on CPU usage of tasks

Cpuacct does not limit any resources. It automatically collects CPU resource usage statistics for tasks in cGroups, including tasks in sub-CGroups.

  • cpuacct.usage: the total CPU time used by all tasks in the Cgroup (including tasks in child Cgroups, and similarly below), in nanoseconds (ns). Writing 0 to the file resets the statistics (see the sketch after this list)
  • cpuacct.stat: the user and system CPU time consumed by all tasks in the Cgroup, i.e. user-mode and kernel-mode CPU time
  • cpuacct.usage_percpu: the time, in nanoseconds (ns), that all tasks in the Cgroup used each CPU core
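
A small sketch of reading and resetting these counters, assuming the mycgroup directory created earlier under the cpu,cpuacct hierarchy:

cat /sys/fs/cgroup/cpu/mycgroup/cpuacct.usage        # total CPU time in ns
cat /sys/fs/cgroup/cpu/mycgroup/cpuacct.stat         # user and system CPU time
echo 0 > /sys/fs/cgroup/cpu/mycgroup/cpuacct.usage   # reset the counters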

Cpuset: CPU binding

In addition to limiting CPU usage, cgroups can bind tasks to specific cpus so that they run only on those cpus, which is what cpuset sub-resources do. In addition to cpus, you can also bind memory nodes.

NOTE: Before adding a task to a cpuset tasks file, you must set the cpuset.cpus and cpuset.mems parameters.

  • cpuset.cpus: the CPUs that tasks in the Cgroup can use. The format is a comma-separated list, where a minus sign (-) denotes a range. For example, 0-2,7 means CPU cores 0, 1, 2, and 7 (see the sketch after this list)
  • cpuset.mems: the memory nodes that tasks in the Cgroup can use, in the same format as cpuset.cpus
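
A minimal sketch, assuming a cgroup named mycgroup under the cpuset hierarchy on a machine with a single memory node:

# both files must be set before any task can be added
echo 0-1 > /sys/fs/cgroup/cpuset/mycgroup/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/mycgroup/cpuset.mems
# now the current shell can be bound to CPUs 0-1
echo $$ > /sys/fs/cgroup/cpuset/mycgroup/tasks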

These two parameters are the most commonly used. Cpuset has many other parameters that require a deep understanding of the CPU scheduling mechanism; they are rarely used and I don't know them well, so I won't cover them here.

Memory: Limits the memory usage

The memory sub-resource system can limit the memory usage of tasks in Cgroups and also generate reports on their memory usage.

Control memory usage:

  • memory.limit_in_bytes: the upper limit of memory the Cgroup can use. The value is in bytes by default, but you can append k/K, m/M, or g/G unit suffixes. Writing -1 to the file removes the limit (see the sketch after this list)
  • memory.memsw.limit_in_bytes: the upper limit of memory plus swap the Cgroup can use; usage is the same as above, and writing -1 removes the cap
  • memory.failcnt: the number of times the tasks' memory usage hit the limit_in_bytes limit
  • memory.memsw.failcnt: the number of times the tasks' memory plus swap usage hit the memsw.limit_in_bytes limit
  • memory.soft_limit_in_bytes: the soft memory limit. When there is enough memory, tasks in the Cgroup can use memory up to the hard limit set by memory.limit_in_bytes; when memory is tight, the kernel pushes the tasks' usage back toward the soft_limit_in_bytes value. The file format is the same as limit_in_bytes
  • memory.swappiness: how strongly the kernel prefers to swap out the tasks' process memory rather than reclaim pages from the page cache. The default is 60; values below 60 decrease the tendency to swap and values above 60 increase it, while values above 100 allow the kernel to swap out pages that are part of the process address space. A value of 0 means a very low tendency to swap, not a prohibition
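
A minimal sketch, assuming a cgroup named mycgroup under the memory hierarchy (the memsw file only exists when swap accounting is enabled):

# cap memory at 100 MB and memory plus swap at 200 MB
echo 100M > /sys/fs/cgroup/memory/mycgroup/memory.limit_in_bytes
echo 200M > /sys/fs/cgroup/memory/mycgroup/memory.memsw.limit_in_bytes
# check how often the limit has been hit
cat /sys/fs/cgroup/memory/mycgroup/memory.failcnt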

OOM operation:

OOM is an abbreviation for out of memory. Cgroup controls what happens to processes when they run out of memory. By default, processes that run out of memory are killed.

memory.oom_control: whether the OOM killer is enabled for the Cgroup. If enabled (value 0, the default), processes that exceed the memory limit are killed; if disabled (value 1), such processes are not killed but paused until memory is freed and they can continue.
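
A minimal sketch of disabling the OOM killer for the mycgroup example above; reading the file back reports the current setting and whether the cgroup is under OOM:

echo 1 > /sys/fs/cgroup/memory/mycgroup/memory.oom_control
cat /sys/fs/cgroup/memory/mycgroup/memory.oom_control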

Statistics on memory usage:

  • memory.stat: reports memory usage, including:

    • cache: number of page cache bytes, including tmpfs (shmem)
    • rss: number of anonymous and swap cache bytes, not including tmpfs
    • mapped_file: size of memory-mapped files, including tmpfs, in bytes
    • pgpgin: number of pages paged into memory
    • pgpgout: number of pages paged out of memory
    • swap: number of swap bytes used
    • active_anon: number of anonymous and swap cache bytes on the active LRU list, including tmpfs
    • inactive_anon: number of anonymous and swap cache bytes on the inactive LRU list, including tmpfs
    • active_file: number of file-backed memory bytes on the active LRU list
    • inactive_file: number of file-backed memory bytes on the inactive LRU list
    • unevictable: number of bytes of memory that cannot be reclaimed

  • memory.usage_in_bytes: the total number of memory bytes currently used by the Cgroup's processes

  • memory.memsw.usage_in_bytes: the total number of memory plus swap bytes used by the Cgroup's processes

  • memory.max_usage_in_bytes: the maximum number of memory bytes the Cgroup's processes have used

  • memory.memsw.max_usage_in_bytes: the maximum number of memory plus swap bytes the Cgroup's processes have used

Net_cls: classifies network packets

The net_cls sub-resource can tag network packets with a classid, so that the tc (traffic control) module of the kernel can control traffic according to that classid.

net_cls.classid: contains a single integer value. It reads back in decimal but must be written in hexadecimal. For example, after 0x100001 is written to the file, reading it returns 1048577, and the value corresponds to the handle 10:1 used by tools such as tc.

The value is a 32-bit number in the format 0xAAAABBBB; leading zeros are ignored, so 0x10001 and 0x00010001 both represent the handle 1:1.
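
A minimal sketch of tagging packets from a cgroup and matching them with tc's cgroup filter; the device name eth0 and the htb qdisc/class setup are assumptions for illustration:

# packets from tasks in mycgroup get classid 10:1
echo 0x100001 > /sys/fs/cgroup/net_cls/mycgroup/net_cls.classid
# shape class 10:1 to 1 Mbit/s and classify packets by cgroup
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup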

Net_prio: sets the priority of network packets

The Network Priority (net_prio) sub-resource dynamically sets the per-network-interface priority of traffic generated by applications in the Cgroup. Network priority is an attribute of a packet; tc can act on it, and sockets can also set it with the SO_PRIORITY option (but few applications do this).

  • net_prio.prioidx: a read-only file containing an integer value that the kernel uses to identify the Cgroup
  • net_prio.ifpriomap: the priority map for network interfaces. It can contain many lines, each setting the priority of packets sent out of one interface, in the format network_interface priority, for example echo "eth0 5" > /sys/fs/cgroup/net_prio/mycgroup/net_prio.ifpriomap

Devices: Device blacklist and whitelist

The devices subsystem allows or prevents tasks in the Cgroup from accessing devices, acting as a device whitelist and blacklist.

  • devices.allow: the list of devices that tasks in the Cgroup are allowed to access. Each entry has the format type major:minor access:

    • type is the device type: a (all), c (char), or b (block)
    • major:minor is the device number; either field can be replaced with * to mean all, for example *:* means all devices
    • access is the access mode: any combination of r (read), w (write), and m (mknod)

  • devices.deny: devices that tasks in the Cgroup may not access; the format is the same as above

  • devices.list: lists the devices on the Cgroup's whitelist and blacklist
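
A minimal sketch, assuming a cgroup named mycgroup under the devices hierarchy; 1:3 is the character device /dev/null:

# remove everything from the whitelist, then allow read/write/mknod on /dev/null only
echo "a *:* rwm" > /sys/fs/cgroup/devices/mycgroup/devices.deny
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/mycgroup/devices.allow
cat /sys/fs/cgroup/devices/mycgroup/devices.list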

Freezer

The freezer subsystem is special: it is not tied to any system resource, but it can suspend and resume tasks in the Cgroup.

  • freezer.state: this file exists only in non-root Cgroups (since all tasks are in the root Cgroup by default, freezing them all would obviously be a mistake) and represents the state of the processes in the Cgroup:

    • FROZEN: tasks in the Cgroup are suspended
    • FREEZING: tasks in the Cgroup are being suspended
    • THAWED: tasks in the Cgroup have been resumed

To suspend a process, move it into a Cgroup in the freezer hierarchy and then freeze that Cgroup.

NOTE: You cannot add tasks to a Cgroup that is frozen or freezing. Users can write FROZEN and THAWED to the file to suspend and resume processes; FREEZING cannot be written by users.
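
A minimal sketch, assuming a cgroup named mycgroup under the freezer hierarchy and a process with PID 2358 to pause:

echo 2358 > /sys/fs/cgroup/freezer/mycgroup/tasks
echo FROZEN > /sys/fs/cgroup/freezer/mycgroup/freezer.state   # suspend
cat /sys/fs/cgroup/freezer/mycgroup/freezer.state             # FROZEN, or FREEZING while in progress
echo THAWED > /sys/fs/cgroup/freezer/mycgroup/freezer.state   # resume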

Conclusion

Cgroups provide powerful capabilities that let us control the resource usage of our applications and collect statistics on that usage, and they are the foundation of container technology. However, the whole cgroup system is also very complicated, even a little chaotic. The cgroup subsystem is being rewritten; the new version is called cgroup v2, and the previous version is accordingly called v1.

Cgroup itself does not provide control over the use of network resources, but can only add simple markers and priorities. The specific control needs to be realized by using the Linux TC module.
