Hello everyone, I’m Zhang Jintao.

All of the container and virtualization technologies mentioned so far, at whatever level of abstraction they virtualize, can isolate and limit resources.

Container technology implements resource-level limits and isolation using the cgroup and namespace features provided by the Linux kernel.

Here’s a quick overview of what these two technologies can do:

  • cgroup: manages resource allocation and limits;
  • namespace: wraps global resources in an abstraction that constrains and isolates them, so that processes inside a namespace appear to have their own instance of those global resources.

In the last article, we focused on cgroups. In this article, we’ll focus on namespaces.

What is a Namespace?

Here is the Wikipedia definition of namespaces:

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources.

Put simply, a namespace is a mechanism provided by the Linux kernel for isolating resources between processes. It wraps a global system resource in an abstraction so that the processes inside the namespace appear to have their own separate instance of that resource. Linux provides several types of namespaces, each isolating a different kind of resource.

In the past, namespaces were rarely used on their own, but they are a cornerstone of containerization technology.

Let’s take a look at how it developed.

The history of namespaces

Figure 1, the history of a namespace

Earliest – Plan 9

The earliest proposal and use of namespaces can be traced back to Plan 9 from Bell Labs, a distributed operating system developed by Bell Labs’ Computing Science Research Center between the 1980s and 2002. Its stable fourth edition was released in 2002, a decade after the first public release in 1992, and it is still developed and used by OS researchers and hobbyists. Three points stand out in the design and implementation of Plan 9:

  • File system: All system resources are represented in the file system and identified by nodes. All interfaces are also presented as part of the file system.

  • Namespace: Enables better application and presentation of file system hierarchies. It implements so-called “separation” and “independence”.

  • Standard communication protocol: 9P protocol (Styx/9P2000).

Figure 2, Plan 9 from Bell Labs icon

Entering the Linux kernel

Namespaces entered the Linux kernel in the 2.4.x series, starting with version 2.4.19. However, per-process namespaces have only been implemented since version 2.4.2.

Figure 3. Linux Kernel Note

Figure 4. Operating system versions corresponding to the Linux Kernel

Largely complete in Linux 3.8

Linux 3.8 finally integrated full user namespace support into the kernel. With that, the namespace capabilities that Docker and other container technologies rely on were essentially in place.

Figure 5. The gradual evolution of the Linux kernel from 2001 to 2013, by which point the namespace implementation was complete

Namespace types

Namespace   Flag              Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, protocol stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
Time        CLONE_NEWTIME     Clocks
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name
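
To see which of these namespaces a process currently belongs to, you can look at the symbolic links under /proc/[pid]/ns. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function name ns_ids is made up for illustration and is not a standard command.

```shell
# ns_ids is a hypothetical helper name, not a standard command.
# It prints one line per namespace of a process: the namespace type,
# followed by the link target (for example "mnt:[<inode number>]").
ns_ids() {
    pid="${1:-$$}"    # default to the current shell
    for link in /proc/"$pid"/ns/*; do
        # Skip the unexpanded glob when the PID (or /proc) does not exist.
        [ -L "$link" ] || continue
        printf '%s %s\n' "${link##*/}" "$(readlink "$link" 2>/dev/null)"
    done
}

ns_ids        # namespaces of the current shell
```

Two processes that show the same identifier for a given type are in the same namespace of that type. Reading the links of another user's process may require extra privileges.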

Cgroup namespaces

A cgroup namespace virtualizes the view of a process’s cgroups, as seen through /proc/[pid]/cgroup and /proc/[pid]/mountinfo.

Using cgroup namespaces requires a kernel built with the CONFIG_CGROUPS option. You can verify this as follows:

(MoeLove) ➜ grep CONFIG_CGROUPS /boot/config-$(uname -r)
CONFIG_CGROUPS=y
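
If you check kernel options often, the grep above can be wrapped in a small function. This is only a sketch: the name kconfig_enabled and its optional second argument (a config file path) are my own additions, and the fallback to /proc/config.gz assumes the kernel was built to expose its configuration there.

```shell
# kconfig_enabled is a hypothetical helper, not a standard command.
# Usage: kconfig_enabled OPTION [CONFIG_FILE]
# Returns 0 if OPTION=y is set, 1 if not, 2 if no config could be read.
kconfig_enabled() {
    option="$1"
    config="${2:-/boot/config-$(uname -r)}"
    if [ -r "$config" ]; then
        grep -q "^${option}=y" "$config"
    elif [ -r /proc/config.gz ]; then
        # Some kernels expose their configuration here instead.
        zcat /proc/config.gz | grep -q "^${option}=y"
    else
        return 2
    fi
}

# The options this article relies on.
for opt in CONFIG_CGROUPS CONFIG_IPC_NS CONFIG_NET_NS; do
    if kconfig_enabled "$opt"; then
        echo "$opt: enabled"
    else
        echo "$opt: not enabled, or no kernel config found"
    fi
done
```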

Cgroup namespaces serve several purposes:

  • Preventing information leaks (a container should not see any information about what lies outside it).

  • Simplifying container migration.

  • Better confinement of container processes: the container’s cgroup filesystem can be mounted in such a way that container processes cannot access ancestor cgroup directories.

Each cgroup namespace has its own set of cgroup root directories. These root directories are the base points for the relative paths shown in the corresponding records of the /proc/[pid]/cgroup file. When a process creates a new cgroup namespace by calling clone(2) or unshare(2) with the CLONE_NEWCGROUP flag, its current cgroup directories become the cgroup root directories of the new namespace.

(MoeLove) ➜ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-2.scope

When the cgroup membership of a target process is read from /proc/[pid]/cgroup, the pathname in the third field of each record is shown relative to the reading process’s root directory for the corresponding cgroup hierarchy. If the target process’s cgroup directory lies outside the root directory of the reading process’s cgroup namespace, the pathname contains a ../ entry for each ancestor level in the cgroup hierarchy.
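
To make the three fields of a record concrete, here is a small sketch that splits each record on the colon separator. The function name cgroup_records and the output labels are made up for illustration; the input lines are records of the kind shown in this article.

```shell
# cgroup_records is a hypothetical helper. It reads records in the
# /proc/[pid]/cgroup format from stdin and splits each one into its three
# colon-separated fields: hierarchy ID, controller list, and pathname.
cgroup_records() {
    while IFS=: read -r hier controllers path; do
        # The controller list is empty for the cgroup v2 hierarchy.
        printf 'hierarchy=%s controllers=%s path=%s\n' \
            "$hier" "${controllers:-none}" "$path"
    done
}

# Sample records: one cgroup v2 entry and one v1 freezer entry whose
# path starts with ".." because it lies outside the reader's cgroup root.
printf '%s\n' \
    '0::/user.slice/user-1000.slice/session-2.scope' \
    '7:freezer:/../moelove-sub' | cgroup_records
```

On a live system you could feed it real data with `cgroup_records < /proc/self/cgroup`.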

Let’s take a look at the following example (cgroup v1 is used here; if you would like to see a v2 version of the example, please let me know in the comments):

  1. In the initial cgroup namespace, as root (or a user with root privileges), create a child cgroup named moelove-sub under the freezer hierarchy and put a process into it.
(MoeLove) ➜ mkdir -p /sys/fs/cgroup/freezer/moelove-sub
(MoeLove) ➜ sleep 6666666 &
[1] 1489125
(MoeLove) ➜ echo 1489125 > /sys/fs/cgroup/freezer/moelove-sub/cgroup.procs
  2. Next, create another child cgroup named moelove-sub2 under the freezer hierarchy and add the current shell’s PID to it. You can see that the current process is now managed under the moelove-sub2 cgroup.
(MoeLove) ➜ mkdir -p /sys/fs/cgroup/freezer/moelove-sub2
(MoeLove) ➜ echo $$
1488899
(MoeLove) ➜ echo 1488899 > /sys/fs/cgroup/freezer/moelove-sub2/cgroup.procs
(MoeLove) ➜ cat /proc/self/cgroup | grep freezer
7:freezer:/moelove-sub2
  3. Use unshare(1) to start a new process. Here the -C option creates a new cgroup namespace and the -m option creates a new mount namespace.
(MoeLove) ➜ unshare -Cm bash
root@moelove:~#
  4. From the new shell started by unshare(1), inspect the /proc/[pid]/cgroup files of the new shell and of the processes from the steps above:
root@moelove:~# cat /proc/self/cgroup | grep freezer
7:freezer:/
root@moelove:~# cat /proc/1/cgroup | grep freezer
7:freezer:/..
# the process from the first step
root@moelove:~# cat /proc/1489125/cgroup | grep freezer
7:freezer:/../moelove-sub
  5. In the output above, the freezer cgroup membership of the new shell is shown relative to the freezer cgroup root directory that was established when the new cgroup namespace was created. Now look at /proc/self/mountinfo:
root@moelove:~# cat /proc/self/mountinfo | grep freezer
1238 1230 0:37 /.. /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
  6. The fourth field (/..) shows the directory in the cgroup filesystem that forms the root of this mount. By the definition of cgroup namespaces, the process’s current freezer cgroup directory became its root freezer cgroup directory, so this field should read /. It shows /.. because the mount entry still belongs to the cgroup filesystem of the initial cgroup namespace. Remounting the freezer cgroup filesystem from the new shell fixes this:
root@moelove:~# mount --make-rslave /
root@moelove:~# umount /sys/fs/cgroup/freezer
root@moelove:~# mount -t cgroup -o freezer freezer /sys/fs/cgroup/freezer
root@moelove:~# cat /proc/self/mountinfo | grep freezer
1238 1230 0:37 / /sys/fs/cgroup/freezer rw,relatime - cgroup freezer rw,freezer
root@moelove:~# mount |grep freezer
freezer on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)

IPC namespaces

IPC namespaces isolate IPC resources such as System V IPC objects and POSIX message queues. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue filesystem. Objects created in an IPC namespace are visible to all members of that namespace, but not to members of other namespaces.

Using IPC namespaces requires a kernel built with the CONFIG_IPC_NS option, which you can check as follows:

(MoeLove) ➜ grep CONFIG_IPC_NS /boot/config-$(uname -r)
CONFIG_IPC_NS=y

The following /proc interfaces are distinct in each IPC namespace:

  • /proc/sys/fs/mqueue – POSIX message queue interfaces

  • /proc/sys/kernel – System V IPC interfaces (msgmax, msgmnb, msgmni, sem, shmall, shmmax, shmmni, and shm_rmid_forced)

  • /proc/sysvipc – System V IPC interfaces

When an IPC namespace is destroyed (that is, when the last process that is a member of the namespace terminates), all IPC objects created in the namespace are destroyed as well.
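
As a quick way to look at the System V IPC settings listed above, the sketch below prints the values visible from the current IPC namespace. The function name ipc_limits and its optional directory argument are my own additions for illustration; by default it reads /proc/sys/kernel.

```shell
# ipc_limits is a hypothetical helper. It prints the System V IPC settings
# that the calling process sees in its IPC namespace.
# The optional argument (a directory to read from) is an assumption added
# so the function can be pointed somewhere other than /proc/sys/kernel.
ipc_limits() {
    dir="${1:-/proc/sys/kernel}"
    for name in msgmax msgmnb msgmni sem shmall shmmax shmmni shm_rmid_forced; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

ipc_limits
```

Running the same function inside a shell started in a new IPC namespace (for example with unshare --ipc, which normally needs root) should show that namespace’s own copies of these values.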

Network namespaces

Network namespaces isolate network-related system resources, including:

  • Network devices

  • IPv4 and IPv6 protocol stacks

  • IP routing tables

  • Firewall rules

  • The /proc/net directory (/proc/[pid]/net)

  • The /sys/class/net directory

  • Files under the /proc/sys/net directory

  • Port numbers (sockets)

  • The UNIX domain abstract socket namespace

Using network namespaces requires a kernel built with the CONFIG_NET_NS option, which you can check as follows:

(MoeLove) ➜ grep CONFIG_NET_NS /boot/config-$(uname -r)
CONFIG_NET_NS=y

A physical network device can exist in exactly one network namespace. When a network namespace is freed (that is, when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace, not to the parent network namespace.

A virtual network device pair (veth(4)) provides a pipe-like connection between network namespaces, with one end of the pair placed in each namespace. When a network namespace is destroyed, the veth(4) devices it contains are destroyed as well.
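
To illustrate the veth(4) pairing described above, here is a sketch of the ip(8) commands that would connect two network namespaces. All names (demo-ns1, demo-ns2, veth-a, veth-b) and the 10.0.0.0/24 addresses are made up for illustration. Because the real commands need root and change the system, the script only prints them by default; that dry-run wrapper is my own addition.

```shell
# By default the commands are only printed. Set DRY_RUN=0 and run as root
# to actually execute them. All names and addresses are illustrative.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

veth_demo() {
    run ip netns add demo-ns1
    run ip netns add demo-ns2
    # Create the veth pair, then move one end into each namespace.
    run ip link add veth-a type veth peer name veth-b
    run ip link set veth-a netns demo-ns1
    run ip link set veth-b netns demo-ns2
    run ip -n demo-ns1 addr add 10.0.0.1/24 dev veth-a
    run ip -n demo-ns2 addr add 10.0.0.2/24 dev veth-b
    run ip -n demo-ns1 link set veth-a up
    run ip -n demo-ns2 link set veth-b up
    # Traffic sent into one end of the pair comes out of the other end.
    run ip netns exec demo-ns1 ping -c 1 10.0.0.2
    # Deleting the namespaces destroys the veth devices they contain.
    run ip netns delete demo-ns1
    run ip netns delete demo-ns2
}

veth_demo
```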

Mount namespaces

Mount namespaces first appeared in Linux 2.4.19. They isolate the list of mounts seen by the processes in each namespace instance, so processes in different mount namespace instances see different directory hierarchies.

The mounts visible to a process in its mount namespace are exposed through the following files:

  • /proc/[pid]/mounts

  • /proc/[pid]/mountinfo

  • /proc/[pid]/mountstats

A new mount namespace is created by calling clone(2) or unshare(2) with the CLONE_NEWNS flag.

  • If the namespace is created with clone(2), the mount list of the child’s namespace is copied from the parent process’s mount namespace.
  • If the namespace is created with unshare(2), the mount list of the new namespace is copied from the caller’s previous mount namespace.

What ripple effects does modifying a mount namespace have? This is where shared subtrees come in.

Each mount can be marked with one of the following propagation types:

  • MS_SHARED – The mount shares events with the members of its peer group. Mount and unmount events under this mount automatically occur under the other mounts in the peer group. Conversely, mount and unmount events under the peer mounts propagate to this mount.

  • MS_PRIVATE – The mount is private. Mount and unmount events do not propagate into or out of it.

  • MS_SLAVE – Mount and unmount events propagate into this mount from its master peer group. However, mount and unmount events under this mount do not propagate to any other mount.

  • MS_UNBINDABLE – This is also a private mount, and in addition it cannot be bind mounted. Any attempt to bind mount it fails.
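
On the command line, these propagation types are usually set with the --make-* options of mount(8). The sketch below maps each MS_* name to the corresponding option. The function name set_propagation, the /mnt target, and the dry-run default are my own additions for illustration; changing propagation for real requires root.

```shell
# set_propagation is a hypothetical wrapper around mount(8).
# Usage: set_propagation MS_SHARED|MS_PRIVATE|MS_SLAVE|MS_UNBINDABLE TARGET
# By default it only prints the command; set DRY_RUN=0 to execute it.
set_propagation() {
    kind="$1"
    target="$2"
    case "$kind" in
        MS_SHARED)     opt=--make-shared ;;
        MS_PRIVATE)    opt=--make-private ;;
        MS_SLAVE)      opt=--make-slave ;;
        MS_UNBINDABLE) opt=--make-unbindable ;;
        *) echo "unknown propagation type: $kind" >&2; return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mount $opt $target"
    else
        mount "$opt" "$target"
    fi
}

set_propagation MS_PRIVATE /mnt    # /mnt is just an example target
```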

The propagation type appears in the optional fields of /proc/[pid]/mountinfo. Each peer group has a unique ID generated by the kernel, and all mounts in the same peer group show that ID (the X below).

(MoeLove) ➜ cat /proc/self/mountinfo  |grep root  
65 1 0:33 /root / rw,relatime shared:1 - btrfs /dev/nvme0n1p6 rw,seclabel,compress=zstd:1,ssd,space_cache,subvolid=256,subvol=/root
1210 65 0:33 /root/var/lib/docker/btrfs /var/lib/docker/btrfs rw,relatime shared:1 - btrfs /dev/nvme0n1p6 rw,seclabel,compress=zstd:1,ssd,space_cache,subvolid=256,subvol=/root
  • shared:X – The mount is shared in peer group X.

  • master:X – The mount is a slave to peer group X, that is, its master is the group whose ID is X.

  • propagate_from:X – The mount receives propagation from shared peer group X. This tag always appears together with a master:X tag.

  • unbindable – The mount is unbindable, that is, it cannot be bind mounted.
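
To pull these tags out of mountinfo without reading each line by eye, here is a small awk-based sketch. The function name mount_propagation is made up for illustration. It relies on the mountinfo layout, in which the first six fields are fixed and the optional fields follow until a lone "-" separator; a mount with none of the tags above is private.

```shell
# mount_propagation is a hypothetical helper. It reads mountinfo-style lines
# from stdin and prints each mount point followed by its propagation tags.
mount_propagation() {
    awk '{
        tags = ""
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags == "" ? "" : " ") $i
        # No optional tags means the mount is private.
        print $5, (tags == "" ? "private" : tags)
    }'
}

# A line taken from the mountinfo output shown earlier in this article.
printf '%s\n' \
    '65 1 0:33 /root / rw,relatime shared:1 - btrfs /dev/nvme0n1p6 rw,seclabel,compress=zstd:1,ssd,space_cache,subvolid=256,subvol=/root' \
    | mount_propagation
```

On a live system you could run `mount_propagation < /proc/self/mountinfo` to list every mount in the current mount namespace.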

The propagation type of a new mount depends on its parent mount. If the propagation type of the parent mount is MS_SHARED, the propagation type of the new mount is also MS_SHARED; otherwise it defaults to MS_PRIVATE.

A few more points about mount namespaces:

(1) Each mount namespace has an owner user namespace. If the new mount namespace and the mount namespace it was copied from are owned by different user namespaces, the new mount namespace is considered less privileged.

(2) When a less privileged mount namespace is created, shared mounts are reduced to slave mounts, so that mounts made in the less privileged namespace do not propagate to the more privileged one.

(3) Mounts that come as a single unit from a more privileged mount namespace are locked together and cannot be separated in the less privileged mount namespace.

(4) The mount(2) flags and the atime flags become locked when propagated to a less privileged mount namespace, that is, they cannot be changed there.

Summary

That wraps up this introduction to namespaces in the Linux kernel. To keep the length manageable, the remaining namespace types and the use of namespaces in containers will be covered in the next article. Stay tuned!


Please feel free to subscribe to my official account [MoeLove]