preface

Namespace, Cgroup, and UnionFS are keywords that come up whenever people talk about container technology, but the conversation often stops there. In this article, I'll take a straightforward look at what these keywords mean and how processes use these techniques to achieve containerization.

  • There is a little Go code in this article, but it is explained in detail and should not get in the way of understanding;

1. Linux Namespace

Namespaces provide resource isolation: process isolation, mount-point isolation, network isolation, user isolation, and isolation of inter-process communication. When creating a process you can pass flags to clone() to create new namespaces; you can join an existing namespace with setns(); and a running process can move itself into new namespaces with unshare().

Namespace type   System call flag   Kernel version
--------------   ----------------   --------------
Mount NS         CLONE_NEWNS        2.4.19
UTS NS           CLONE_NEWUTS       2.6.19
IPC NS           CLONE_NEWIPC       2.6.19
PID NS           CLONE_NEWPID       2.6.24
Network NS       CLONE_NEWNET       2.6.29
User NS          CLONE_NEWUSER      3.8
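Before creating any new namespaces, it helps to see the ones the current process already belongs to: the kernel exposes them as symlinks under /proc/self/ns. The following is my own small sketch (not from the original article), assuming a Linux host; the set of entries varies by kernel version:

```go
package main

import (
	"fmt"
	"os"
)

// namespaceIDs reads /proc/self/ns and resolves each symlink,
// e.g. "uts" -> "uts:[4026531838]". The bracketed inode number
// uniquely identifies the namespace the process is in; two
// processes in the same namespace see the same number.
func namespaceIDs() map[string]string {
	ids := make(map[string]string)
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		return ids // not on Linux, or /proc not mounted
	}
	for _, e := range entries {
		target, err := os.Readlink("/proc/self/ns/" + e.Name())
		if err == nil {
			ids[e.Name()] = target
		}
	}
	return ids
}

func main() {
	for name, id := range namespaceIDs() {
		fmt.Printf("%s -> %s\n", name, id)
	}
}
```

Comparing the output of this program from two shells is the programmatic equivalent of the `readlink /proc/<pid>/ns/uts` check used below.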
1.1. Environment Preparation

The following sections use code to explore the namespaces above. First, prepare the environment: 1. a CentOS 7 system with kernel 4.4 (the system-call interface changed in 4.4); 2. Go 1.12 or later.

2. Implementing Namespaces in Code

2.1 UTS Namespace
package main

/* UTS Namespace is used to isolate node name and domain name */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   // command to run inside the new namespace
   cmd := exec.Command("sh")
   // When the clone process is called, a new UTS namespace is set
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS,
   }
   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr

   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}
  • Use the following commands to verify:
# Execute the current script and enter the container
$ go run uts.go
# View the process tree
$ pstree -pl
# Get the PID of the shell inside the new namespace
# echo $$
19915
# Check the UTS namespace ID
$ readlink /proc/19915/ns/uts
uts:[40287666333]
# Change the hostname; another shell on the host still shows the old one, so the hostname is isolated
# hostname -b shadow
# hostname
shadow

2.2 IPC Namespace
package main

/* IPC Namespace is used to isolate system V IPC and POSIX message queues */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   cmd := exec.Command("sh")
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC,
   }
   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr

   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}


/*
   ipcs  - list Message Queues | Shared Memory Segments | Semaphore Arrays
   ipcrm - delete an IPC object
   ipcmk - create an IPC object

   [t1]: ipcs -q ; ipcmk -Q ; ipcs -q
   [t2]: ipcs -q
*/

Term1

# Execute the current script and enter the container
$ go run ipc.go
2020/09/16 01:11:47 start new namespace
sh-4.2# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

sh-4.2# ipcmk -Q
Message queue id: 0
sh-4.2# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x6a89a1ad 0          root       644        0            0

Term2

$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

2.3 PID Namespace
package main

/* PID Namespace is used to isolate the process ID */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   cmd := exec.Command("sh")
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID,
   }
   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr

   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}


/*
   [t1]: pstree -pl ; echo $$
*/

Term1

# Enter the container
$ go run pid.go
2020/09/16 01:18:49 start new namespace
# Inside the new PID namespace, the shell sees itself as PID 1
sh-4.2# echo $$
1

2.4 Mount Namespace
package main

/* Mount Namespace is used to isolate the Mount points seen by various processes */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   cmd := exec.Command("sh")
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID |syscall.CLONE_NEWNS | syscall.CLONE_NEWIPC,
   }
   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr

   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}


As you can see below, only after remounting /proc inside our namespace do we see just our own process information;

$ go run mount.go
2020/09/16 01:22:14 start new namespace
# /proc has not been remounted yet, so the host's /proc still shows through
sh-4.2# ls /proc
1     20   300   5186  64    6776  79   cpuinfo    kallsyms   mtrr          thread-self
10    21   301   5332  6481  6780  796  crypto     kcore      net           timer_list
...   ...  ...   ...   ...   ...   ...  diskstats  keys       pagetypeinfo  tty
# After mounting a fresh proc filesystem inside the namespace:
sh-4.2# mount -t proc proc /proc
sh-4.2# ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 sh
    5 pts/0    00:00:00 ps
sh-4.2# ls /proc
1          crypto         fs           kmsg         modules       self           timer_list
6          devices        interrupts   kpagecgroup  mounts        slabinfo       tty
acpi       diskstats      iomem        kpagecount   mtrr          softirqs       uptime
buddyinfo  dma            ioports      kpageflags   net           stat           version
bus        driver         irq          loadavg      pagetypeinfo  swaps          vmallocinfo
cgroups    dynamic_debug  kallsyms     locks        partitions    sys            vmstat
cmdline    execdomains    kcore        mdstat       sched_debug   sysrq-trigger  zoneinfo
consoles   fb             keys         meminfo      schedstat     sysvipc
cpuinfo    filesystems    key-users    misc         scsi          thread-self

2.5 User Namespace
package main

/* The USER Namespace isolates USER ids and USER group ids */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   cmd := exec.Command("sh")
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
         syscall.CLONE_NEWUSER,
      // Map UID/GID 5001 inside the namespace to the current user/group on the host
      UidMappings: []syscall.SysProcIDMap{
         {
            ContainerID: 5001,
            HostID:      syscall.Getuid(),
            Size:        1,
         },
      },
      GidMappings: []syscall.SysProcIDMap{
         {
            ContainerID: 5001,
            HostID:      syscall.Getgid(),
            Size:        1,
         },
      },
   }
   //cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uint32(1), Gid: uint32(1)}

   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr


   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}

You can see that the user inside the container differs from the user on the host.

Term1

# Enter the container
$ go run user.go
2020/09/16 01:33:33 start new namespace
sh-4.2$ id
uid=5001 gid=5001 groups=5001
sh-4.2$ exit
exit

Term2 (on the host)

$ id
uid=0(root) gid=0(root) groups=0(root)

2.6 Network Namespace
package main

/* Network Namespace is used to isolate the network stack */

import (
   "log"
   "os"
   "os/exec"
   "syscall"
)

func main() {
   log.Println("start new namespace")
   cmd := exec.Command("sh")
   cmd.SysProcAttr = &syscall.SysProcAttr{
      Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID |syscall.CLONE_NEWNS | syscall.CLONE_NEWIPC |
         syscall.CLONE_NEWNET,
   }
   cmd.Stdin = os.Stdin
   cmd.Stdout = os.Stdout
   cmd.Stderr = os.Stderr

   if err := cmd.Run(); err != nil {
      log.Fatal(err)
   }
}

  • After entering the container, there is no network configuration
$ go run net.go
2020/09/16 01:35:10 start new namespace
sh-4.2# ifconfig


3. Linux Cgroup

What are cgroups? Cgroups (control groups) provide the ability to limit, control, and account for the resources of a group of processes, including CPU, memory, storage, and network. With cgroups, you can easily limit the resources an application occupies and monitor its usage in real time. Cgroups have three components:

  • cgroup: the unit for managing processes in groups. A cgroup contains a set of processes and can be configured with the parameters of one or more Linux subsystems; in effect, it associates a group of processes with a set of resource parameters.
  • subsystem: a group of resource-control modules. It generally includes the following:
    • blkio: input/output access control for block devices;
    • cpu: sets the CPU scheduling policy for processes in the cgroup;
    • cpuacct: collects statistics on the CPU usage of processes in the cgroup;
    • cpuset: on multi-core machines, sets which CPUs the processes in the cgroup may use;
    • devices: controls the access of processes in the cgroup to devices;
    • freezer: suspends and resumes processes in the cgroup;
    • memory: controls the memory usage of processes in the cgroup;
    • net_cls: classifies network packets generated by processes in the cgroup, so the Linux traffic controller (tc) can distinguish packets from a given cgroup for rate limiting or monitoring;
    • net_prio: sets the priority of network traffic generated by processes in the cgroup;
    • ns: when a process in a cgroup forks a new process into a new namespace, a new cgroup is created containing the processes in the new namespace.
# Query the subsystems supported by the system
$ lssubsys -a
cpuset
cpu,cpuacct
blkio
memory
devices
freezer
net_cls,net_prio
perf_event
hugetlb
pids
  • hierarchy: strings a group of cgroups into a tree structure so that configuration can be inherited. By default each hierarchy has a root cgroup node, and all other cgroups are descendants of that root.

Relationships between the three components

  • After a new hierarchy is created, all processes in the system will be added to the cgroup root node of the hierarchy.
  • A subsystem can only be attached to one hierarchy;
  • A hierarchy can attach more than one subsystem;
  • A process can be a member of multiple Cgroups, but these Cgroups must reside in different Hierarchies.
  • When a process forks out a child process, the child process is in the same Cgroup as the parent process or can be moved to another Cgroup as required.
3.1 Trying Out Cgroups

1. Create and mount a hierarchy:

$ mkdir cgroup-test
$ mount -t cgroup -o none,name=cgroup-test cgroup-test ./cgroup-test/
$ ls ./cgroup-test/
cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks

These files are the configuration of the hierarchy's root cgroup node:

  • cgroup.clone_children: read by the cpuset subsystem. If its value is 1 (default 0), child cgroups inherit the parent cgroup's cpuset configuration;
  • cgroup.procs: the process-group IDs in the current cgroup node. Since this is the root node, the file contains the IDs of all process groups in the system;
  • notify_on_release and release_agent are used together: notify_on_release indicates whether to run release_agent when the last process in the cgroup exits, and release_agent is the path of a program, usually used to automatically clean up cgroups that are no longer in use;
  • tasks: the IDs of the processes in this cgroup. Writing a process ID into the tasks file moves that process into the cgroup.

2. Create two child cgroups

$ mkdir cgroup-1 cgroup-2
$ tree
.
├── cgroup-1
│   ├── cgroup.clone_children
│   ├── cgroup.procs
│   ├── notify_on_release
│   └── tasks
├── cgroup-2
│   ├── cgroup.clone_children
│   ├── cgroup.procs
│   ├── notify_on_release
│   └── tasks
├── cgroup.clone_children
├── cgroup.procs
├── cgroup.sane_behavior
├── notify_on_release
├── release_agent
└── tasks

3. Add and move a process between cgroups

$ pwd
/data/docker_lab/online/cgroup-test
# The current shell (PID 6935) starts out in the root cgroup's tasks file
$ cat tasks | grep `echo $$`
6935
# Move it into cgroup-1
$ echo $$ > cgroup-1/tasks
# It no longer appears in the root cgroup's tasks file
$ cat tasks | grep `echo $$`
$ cat /proc/6935/cgroup
12:name=cgroup-test:/cgroup-1
11:pids:/
10:memory:/user.slice
9:cpu,cpuacct:/user.slice
8:devices:/user.slice
7:freezer:/
6:blkio:/user.slice
5:perf_event:/
4:cpuset:/
3:hugetlb:/
2:net_cls,net_prio:/
1:name=systemd:/user.slice/user-0.slice/session-449.scope

Using a subsystem to restrict the resources of processes in a cgroup

  • Since the system already creates a default hierarchy for every subsystem, and a subsystem can only be attached to one hierarchy, we use the default hierarchies for the limiting experiment.
# The default hierarchy for the memory subsystem
$ mount |grep mem   
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory) 
# Run a memory hog: stress allocates and holds 500 MB
$ stress --vm-bytes 500m --vm-keep -m 1
# top shows stress using the full 500 MB
$ top
# Create a child cgroup and enter it
$ cd /sys/fs/cgroup/memory ; mkdir test-limit-mem ; cd test-limit-mem
# Set the memory limit to 200 MB
$ echo 200m > memory.limit_in_bytes
# Move the current shell into the cgroup
$ echo $$ > tasks
# Run the same memory hog again
$ stress --vm-bytes 500m --vm-keep -m 1
# top now shows stress capped at 200 MB
$ top
3.2 How Docker Uses Cgroups
$ docker run -itd -m 128m ubuntu
# A directory named after the container's unique ID appears here, with its memory.limit_in_bytes set accordingly
$ ls /sys/fs/cgroup/memory/docker/    

4. Union File System

4.1 What Is a Union File System?

UnionFS is a filesystem service that combines several filesystems into a single union mount. It uses branches to "transparently" overlay files and directories from different filesystems into one consistent filesystem. Each branch is either read-only or read-write, so when a write happens on the union mount, the system actually writes to a new file in a writable branch. It looks as if the whole filesystem can modify any file, but the original files are never changed, because UnionFS uses a technique called copy-on-write (COW). Copy-on-write, also known as implicit sharing, means that if a resource is duplicated but never modified, there is no need to create a new copy; the resource can be shared by the old and new instances. Only when the first write occurs is the resource copied and then modified.
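To make the branch/copy-on-write idea concrete, here is a toy in-memory model (my own illustration, not real UnionFS code): reads search the branches top-down, writes always land in the single writable top branch, and deletes are recorded as whiteout markers instead of touching the read-only layers.

```go
package main

import "fmt"

// UnionFS models a union mount: layers[0] is the writable top branch,
// the rest are read-only. whiteout records deletions made in the top branch.
type UnionFS struct {
	layers   []map[string]string // layers[0] is read-write
	whiteout map[string]bool
}

// NewUnionFS stacks an empty writable branch on top of the given
// read-only branches.
func NewUnionFS(readOnly ...map[string]string) *UnionFS {
	layers := append([]map[string]string{{}}, readOnly...)
	return &UnionFS{layers: layers, whiteout: map[string]bool{}}
}

// Read searches the branches top-down; a whiteout hides lower files.
func (u *UnionFS) Read(name string) (string, bool) {
	if u.whiteout[name] {
		return "", false
	}
	for _, layer := range u.layers {
		if content, ok := layer[name]; ok {
			return content, true
		}
	}
	return "", false
}

// Write stores the new content in the writable branch (copy-on-write);
// read-only branches are never changed.
func (u *UnionFS) Write(name, content string) {
	delete(u.whiteout, name)
	u.layers[0][name] = content
}

// Remove hides a file with a whiteout marker instead of deleting it
// from the read-only branch where it actually lives.
func (u *UnionFS) Remove(name string) {
	delete(u.layers[0], name)
	u.whiteout[name] = true
}

func main() {
	ro := map[string]string{"iml3.txt": "l3"}
	u := NewUnionFS(ro)

	old, _ := u.Read("iml3.txt")
	u.Write("iml3.txt", old+"\nwrite to mnt") // append via copy-up
	fmt.Println(u.layers[0]["iml3.txt"])      // top branch holds the modified copy
	fmt.Println(ro["iml3.txt"])               // read-only branch is untouched

	u.Remove("iml3.txt")
	_, visible := u.Read("iml3.txt")
	fmt.Println(visible, ro["iml3.txt"]) // hidden, yet still present in the ro layer
}
```

The hand-rolled AUFS experiment in section 4.4 shows exactly these three behaviors against a real kernel mount.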

4.2 AUFS

AUFS (Advanced Union File System) is an implementation of UnionFS. It was the first storage driver Docker used, and it is still in use today. Other drivers of the same family include overlay and overlay2. AUFS requires kernel support.

Check whether AUFS is supported

$ cat /proc/filesystems |grep aufs
nodev	aufs
# If not, you can switch to a kernel that supports AUFS
$ cd /etc/yum.repos.d/
$ wget https://yum.spaceduck.org/kernel-ml-aufs/kernel-ml-aufs.repo
$ yum install kernel-ml-aufs 
  • For switching kernels, a reference I used: www.cnblogs.com/xzkzzz/p/96…
4.3 How Docker Uses AUFS

Docker aufs storage directory

  • /var/lib/docker/aufs/diff
    • The contents of each image layer are stored here;
  • /var/lib/docker/aufs/layers/
    • Metadata describing how the image layers are stacked;
  • /var/lib/docker/aufs/mnt
    • The union mount points for containers; files modified at runtime are visible here.
4.4 Hand-Rolling an AUFS Mount

Let's see how it works. First, the directory layout:

$ pwd
/data/docker_lab/online/aufs
$ ls
changed-ubuntu cn-l iml1 iml2 iml3 mnt
$ cat cn-l/cn-l.txt
I am cn layer
$ cat iml1/iml1.txt
l1
$ cat iml2/iml2.txt
l2
$ cat iml3/iml3.txt
l3
# Mount aufs: the first branch (cn-l) is read-write, the rest are read-only
$ mount -t aufs -o dirs=./cn-l:./iml3:./iml2:./iml1 none ./mnt
$ tree mnt
mnt/
├── cn-l.txt
├── iml1.txt
├── iml2.txt
└── iml3.txt
# mnt is the union mount point; only the cn-l branch is writable
$ cat /sys/fs/aufs/si_d3fb24f591e1278f/*   
/data/docker_lab/online/aufs/cn-l=rw
/data/docker_lab/online/aufs/iml3=ro
/data/docker_lab/online/aufs/iml2=ro
/data/docker_lab/online/aufs/iml1=ro
64
65
66
67
/data/docker_lab/online/aufs/cn-l/.aufs.xino
# Append some content to see COW in action
$ echo "write to mnt's iml1" >> ./mnt/iml3.txt
# The append succeeded
$ cat ./mnt/iml3.txt
l3
write to mnt's iml1
# The modified file was copied up into the read-write layer cn-l; the iml3 layer is untouched
$ cat cn-l/iml3.txt
l3
write to mnt's iml1
# Delete iml1.txt through the union mount
$ rm ./mnt/iml1.txt  
# It is no longer visible in the mount point
$ ls ./mnt/  
cn-l.txt iml2.txt iml3.txt
# The iml1.txt file in the iml1 image layer still exists
$ ls iml1/iml1.txt   
iml1/iml1.txt
# The deletion is recorded as a whiteout file, .wh.iml1.txt, in the read-write layer;
# it hides the file without actually deleting it from the lower layer. .wh files are called whiteout files.
$ ll ./cn-l/ -a
drwxr-xr-x 4 root root 4096 Sep 15 12:46 .
drwxr-xr-x 8 root root 4096 Sep 14 01:21 ..
-rw-r--r-- 1 root root   14 Sep 15 12:37 cn-l.txt
-rw-r--r-- 1 root root   23 Sep 15 12:45 iml3.txt
-r--r--r-- 2 root root    0 Sep 15 12:40 .wh.iml1.txt
-r--r--r-- 2 root root    0 Sep 15 12:40 .wh..wh.aufs
drwx------ 2 root root 4096 Sep 15 12:40 .wh..wh.orph
drwx------ 2 root root 4096 Sep 15 12:40 .wh..wh.plnk

Conclusion

  • If this article helped you, please give it a like or add it to your collection;

References

  • Recommended book: 《Write Docker Yourself》