This is the fourth article in our series on cgroups:

  • Linux Cgroup tutorial: Basic concepts
  • Linux Cgroup tutorial: CPU
  • Linux Cgroup tutorial: Memory

In the previous articles you learned about the basic concepts of cgroups and the virtual filesystem mounted at /sys/fs/cgroup, and how to set CPU shares and CPU quotas to control CPU usage within and between slices. This article continues to explore limits on CPU usage.

For some CPU-intensive programs, it is not enough to grab more CPU time; we also want to reduce the context switching that happens as the workload is shuffled between cores. On today’s multi-core systems each core has its own cache, and frequently scheduling a process onto different cores inevitably brings overhead such as cache invalidation. Is there a way to isolate CPU cores, that is, to bind a running process to specified cores? While all programs are created equal in the eyes of the operating system, some programs are more equal than others.

For those more equal programs, we need to allocate more CPU resources to them, because people are biased. Without further ado, let’s take a look at how cgroups can be used to restrict a process to a given CPU core.

1. View the CPU configuration

The CPU cores on my machine are numbered from 0 to 3. We can find out more about the CPUs by looking at the contents of /proc/cpuinfo:

$ cat /proc/cpuinfo
...
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping        : 4
microcode       : 0x1F
cpu MHz         : 2666.761
cache size      : 12288 KB
physical id     : 6
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm ssbd ibrs ibpb stibp tsc_adjust arat spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips        : 5333.52
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
  • processor: the core number; note that this is not a physical CPU core, it is more properly called the logical core number.
  • physical id: the physical CPU (socket) on which the logical core resides, numbered from 0. The value 6 here means this logical core sits on the seventh physical CPU.
  • core id: the core number within the physical CPU. If hyperthreading is enabled, each physical core presents two logical cores, which appear as separate processor entries sharing the same physical id and core id. To check whether the server has hyperthreading enabled, use the following command:
$ cat /proc/cpuinfo | grep -e "core id" -e "physical id"

physical id    : 0
core id        : 0
physical id    : 2
core id        : 0
physical id    : 4
core id        : 0
physical id    : 6
core id        : 0

If two processor entries share the same physical id and core id, hyperthreading is enabled. Apparently my server does not have it enabled.
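lscpu offers a quicker cross-check. For a layout like the one above (four single-core sockets), the relevant lines would look roughly like this; a "Thread(s) per core" value of 2 would indicate hyperthreading:

$ lscpu | grep -E '^(Thread|Core|Socket)'
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4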

2. The NUMA architecture

There is a concept called Non-Uniform Memory Access (NUMA), i.e. a non-uniform memory access architecture. If multiple CPUs are installed on the motherboard, the machine uses the NUMA architecture: each CPU occupies its own area and usually has its own independent fan.

A NUMA node contains the hardware devices directly attached to that area, such as CPUs and memory; the interconnect is usually PCI-E. This also introduces the concept of CPU affinity: a CPU can access memory on its own NUMA node faster than memory on another node.
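As an aside, this kind of affinity can also be expressed directly with numactl when launching a program, for example pinning both the CPUs and the memory of a process to node 0 (illustrative only; the rest of this article uses cgroups instead):

$ numactl --cpunodebind=0 --membind=0 sha1sum /dev/zero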

To view the NUMA topology of the local machine, run the following command:

$ numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 2047 MB
node 0 free: 1335 MB
node distances:
node   0
  0:  10

You can see that this server does not really use the NUMA architecture: there is only one NUMA node, i.e. only one CPU, and all four logical cores are on it.

3. isolcpus

One of Linux’s most important responsibilities is scheduling processes. A process is just an abstraction of a running program: a series of instructions that the computer follows to do its actual work. From a hardware point of view, it is the central processing unit, or CPU, that actually executes these instructions. By default, the process scheduler may place a process on any CPU core, because it balances computing resources according to the load.
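You can see this default behavior with taskset, which prints a process’s CPU affinity; on a 4-core machine the affinity list of an ordinary process normally covers every core (output illustrative, using PID 1 as an example):

$ taskset -pc 1
pid 1's current affinity list: 0-3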

To make the effect of the experiment more obvious, I can isolate certain logical cores so that the system never uses them by default unless I explicitly assign processes to them. This is done with the kernel parameter isolcpus. For example, to keep logical cores 1, 2 and 3 out of the default scheduler, add the following to the kernel parameter list:

isolcpus=1,2,3   # or isolcpus=1-3

For CentOS 7, you can modify /etc/default/grub directly:

$ cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet isolcpus=1,2,3"
GRUB_DISABLE_RECOVERY="true"

Then rebuild grub.conf:

$ grub2-mkconfig -o /boot/grub2/grub.cfg
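Once the system has been rebooted with the new configuration, a quick check confirms that the parameter made it onto the kernel command line:

$ grep -o 'isolcpus=[^ ]*' /proc/cmdline
isolcpus=1,2,3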

After the system restarts, it no longer schedules ordinary processes onto logical cores 1, 2 and 3; by default only core 0 is used. Find a program that drives the CPU to 100% (the one used in the previous article) and use top to check CPU usage:

After running top, press the 1 key on the list page to see every CPU.

As you can see, the system only uses core 0. Now let’s look at how to tie a program to a specific CPU core.

4. Create a cgroup

Tying a program to specified cores is as simple as configuring the cpuset controller. systemd can manage cgroup controllers for the resources it controls, but only a limited set of them (cpu, memory and blkio), not the cpuset controller. systemd does not support cpuset yet (it eventually will), but there is a slightly clunky way to achieve the same goal, which I’ll cover later.

All cgroup operations are based on the cgroup virtual filesystem in the kernel. Using cgroups is simple: just mount this filesystem. It is mounted at /sys/fs/cgroup by default.
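On any recent systemd-based distribution this mount already exists, so there is normally nothing to do; purely for reference, mounting the cpuset hierarchy by hand (for example in a minimal environment) would look roughly like this:

$ mount -t tmpfs cgroup_root /sys/fs/cgroup
$ mkdir -p /sys/fs/cgroup/cpuset
$ mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset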

$ ll /sys/fs/cgroup
total 0
drwxr-xr-x 2 root root  0 Mar 28 2020 blkio
lrwxrwxrwx 1 root root 11 Mar 28 2020 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Mar 28 2020 cpuacct -> cpu,cpuacct
drwxr-xr-x 2 root root  0 Mar 28 2020 cpu,cpuacct
drwxr-xr-x 2 root root  0 Mar 28 2020 cpuset
drwxr-xr-x 4 root root  0 Mar 28 2020 devices
drwxr-xr-x 2 root root  0 Mar 28 2020 freezer
drwxr-xr-x 2 root root  0 Mar 28 2020 hugetlb
drwxr-xr-x 2 root root  0 Mar 28 2020 memory
lrwxrwxrwx 1 root root 16 Mar 28 2020 net_cls -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Mar 28 2020 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Mar 28 2020 net_prio -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Mar 28 2020 perf_event
drwxr-xr-x 2 root root  0 Mar 28 2020 pids
drwxr-xr-x 4 root root  0 Mar 28 2020 systemd

You can see that the cpuset controller has already been created and mounted by default. Take a look at what’s in the cpuset directory:

$ ll /sys/fs/cgroup/cpuset
total 0
-rw-r--r-- 1 root root 0 Mar 28 2020 cgroup.clone_children
--w--w---- 1 root root 0 Mar 28 2020 cgroup.event_control
-rw-r--r-- 1 root root 0 Mar 28 2020 cgroup.procs
-r--r--r-- 1 root root 0 Mar 28 2020 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.cpus
-r--r--r-- 1 root root 0 Mar 28 2020 cpuset.effective_cpus
-r--r--r-- 1 root root 0 Mar 28 2020 cpuset.effective_mems
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Mar 28 2020 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.memory_pressure_enabled
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.mems
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Mar 28 2020 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Mar 28 2020 notify_on_release
-rw-r--r-- 1 root root 0 Mar 28 2020 release_agent
-rw-r--r-- 1 root root 0 Mar 28 2020 tasks

This directory contains only the default configuration and no child cgroups yet. Next we create a child cgroup under cpuset and set the core-binding parameters:

$ mkdir -p /sys/fs/cgroup/cpuset/test
$ echo "3" > /sys/fs/cgroup/cpuset/test/cpuset.cpus
$ echo "0" > /sys/fs/cgroup/cpuset/test/cpuset.memsCopy the code

We first created a child cgroup named test under the cpuset subsystem, and then bound the fourth logical core, cpu3, to it. As for the cpuset.mems parameter, each memory node corresponds to a NUMA node; if the process needs a lot of memory, you can configure all of the NUMA nodes. This is where the NUMA concept comes in: for performance reasons, the configured logical cores and memory nodes should usually belong to the same NUMA node, and you can get their mapping from the numactl --hardware command. Since my host does not have a multi-node NUMA layout, I simply set it to node 0.
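If your machine did have multiple NUMA nodes, the two values should follow the mapping reported by numactl --hardware. For example, on a hypothetical two-node box where node 1 owns cpus 8-15, pinning the cgroup to that node would look like this (illustrative only):

$ echo "8-15" > /sys/fs/cgroup/cpuset/test/cpuset.cpus
$ echo "1" > /sys/fs/cgroup/cpuset/test/cpuset.mems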

Check the test directory:

$ cd /sys/fs/cgroup/cpuset/test
$ ll
total 0
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cgroup.clone_children
--w--w---- 1 root root 0 Mar 28 17:07 cgroup.event_control
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cgroup.procs
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.cpu_exclusive
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.cpus
-r--r--r-- 1 root root 0 Mar 28 17:07 cpuset.effective_cpus
-r--r--r-- 1 root root 0 Mar 28 17:07 cpuset.effective_mems
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.mem_exclusive
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.mem_hardwall
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Mar 28 17:07 cpuset.memory_pressure
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.memory_spread_page
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.memory_spread_slab
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.mems
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.sched_load_balance
-rw-rw-r-- 1 root root 0 Mar 28 17:07 cpuset.sched_relax_domain_level
-rw-rw-r-- 1 root root 0 Mar 28 17:07 notify_on_release
-rw-rw-r-- 1 root root 0 Mar 28 17:07 tasks

$ cat cpuset.cpus
3
$ cat cpuset.mems
0

Currently the tasks file is empty, that is, no process is running in this cgroup yet. There are two ways to make a given process run in it:

  1. Write the PID of an already running process into the tasks file;
  2. Use systemd to create a daemon and write the cgroup settings into the service file (essentially the same as method 1).

Let’s look at method 1. First run a program:

$ nohup sha1sum /dev/zero &
[1] 3767

Then write PID to tasks in the test directory:

$ echo "3767" > /sys/fs/cgroup/cpuset/test/tasksCopy the code

View CPU usage:

You can see that the binding has taken effect: the process with PID 3767 has been scheduled onto CPU 3.
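Besides top, the PSR column of ps reports the processor a task last ran on, so the binding can also be checked from the command line; given the setup above, it should report processor 3:

$ ps -o pid,psr,comm -p 3767
  PID PSR COMMAND
 3767   3 sha1sum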

As mentioned earlier, systemd does not currently support cpuset, so you cannot specify the CPUs of a service directly. However, there is a workaround. The service file is as follows:

$ cat /etc/systemd/system/foo.service
[Unit]
Description=foo
After=syslog.target network.target auditd.service

[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/testset
ExecStartPre=/bin/bash -c '/usr/bin/echo "2" > /sys/fs/cgroup/cpuset/testset/cpuset.cpus'
ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/testset/cpuset.mems'
ExecStart=/bin/bash -c "/usr/bin/sha1sum /dev/zero"
ExecStartPost=/bin/bash -c '/usr/bin/echo $MAINPID > /sys/fs/cgroup/cpuset/testset/tasks'
ExecStopPost=/usr/bin/rmdir /sys/fs/cgroup/cpuset/testset
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start the service and then view CPU usage:
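Concretely, that is just the standard systemd workflow, after which the testset cgroup should contain the service’s main PID (a sketch; the actual PID will differ on your machine):

$ systemctl daemon-reload
$ systemctl start foo.service
$ cat /sys/fs/cgroup/cpuset/testset/cpuset.cpus
2
$ cat /sys/fs/cgroup/cpuset/testset/tasks
<the service's MAINPID>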

The processes in this service are indeed scheduled onto CPU 2.

5. Back to Docker

Finally, we come back to Docker. Docker essentially packages low-level kernel technologies such as cgroups and namespaces into a tool that ships software as images; everyone is familiar with it, so I won’t expand on that here. Is there a way to make Docker always run a container on one or more specific CPUs? It’s actually quite simple: just use the --cpuset-cpus parameter!

To demonstrate this, run a container bound to CPU core 1:

🐳 → docker run -d --name stress --cpuset-cpus="1" progrium/stress -c 4

Check the CPU load of the host:

Only CPU 1 reached 100%; the other CPUs were not used by the container.
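The setting is also recorded in the container’s configuration and can be read back with docker inspect (the Go template below simply extracts the HostConfig.CpusetCpus field):

🐳 → docker inspect -f '{{.HostConfig.CpusetCpus}}' stress
1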

If you read the first article in this series, you should know that on newer systems that use systemd as the init system (such as CentOS 7), three top-level slices are created by default: system, user and machine, where machine.slice is the default location for all virtual machines and Linux containers. The docker cgroup is effectively a variant of machine.slice; you can think of it as machine.slice.

If the system is running Kubernetes, machine.slice becomes kubepods:

To make cgroups easier to manage, systemd creates a child cgroup for each slice in every subsystem, such as the docker cgroup:

The cpuset subsystem is no exception; it also contains a docker directory.

View the docker directory:
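Listed directly (abridged), it contains the usual cpuset control files alongside directories named after container IDs:

$ ls /sys/fs/cgroup/cpuset/docker
7766580dd0d7d9728f3b603ed470b04d0cac1dd923f7a142fec614b12a4ba3be
cgroup.clone_children
cgroup.procs
cpuset.cpus
cpuset.mems
tasks
...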

Docker creates a subdirectory for each container. The subdirectory 7766.. corresponds to the container we created earlier:

🐳 → docker ps | grep stress
7766580dd0d7   progrium/stress   "/usr/bin/stress --v…"   36 minutes ago   Up 36 minutes   stress

Let’s verify the configuration in this directory:

$ cd /sys/fs/cgroup/cpuset/docker/7766580dd0d7d9728f3b603ed470b04d0cac1dd923f7a142fec614b12a4ba3be
$ cat cpuset.cpus
1
$ cat cpuset.mems
0
$ cat tasks
6536
6562
6563
6564
6565
$ ps -ef | grep stress
root      6536  6520  0 10:08 ?        00:00:00 /usr/bin/stress --verbose -c 4
root      6562  6536 24 10:08 ?        00:09:50 /usr/bin/stress --verbose -c 4
root      6563  6536 24 10:08 ?        00:09:50 /usr/bin/stress --verbose -c 4
root      6564  6536 24 10:08 ?        00:09:50 /usr/bin/stress --verbose -c 4
root      6565  6536 24 10:08 ?        00:09:50 /usr/bin/stress --verbose -c 4

Of course, you can also tie a container to multiple CPU cores (see the one-liner below), which I won’t go into in detail here. The next article will show you how to restrict block I/O through cgroups.
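For reference, --cpuset-cpus accepts comma-separated lists and ranges, so binding a container to several cores is just a one-liner (illustrative; stress2 is an arbitrary container name):

🐳 → docker run -d --name stress2 --cpuset-cpus="0,2-3" progrium/stress -c 4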
