Concept

Copy-on-write (COW for short) is an optimization strategy in computer programming. The core idea: when multiple callers request the same resource (such as data in memory or on disk) at the same time, they all receive pointers to the same shared resource. Only when a caller tries to modify the resource does the system actually create a dedicated private copy for that caller; the resource seen by every other caller remains unchanged, and the whole process is transparent to them. The main advantage of this approach is that if a caller never modifies the resource, no private copy is ever created, so multiple callers can share a single resource as long as they only read it.

COW already has many applications. For example, file management systems under Linux also use copy-on-write strategies.

Applications

Linux fork()

When fork() is used to create a child process, the operating system would traditionally copy most of the parent process's virtual memory space into the child (mainly the data segment, heap, and stack; the code segment is shared). Not only is this operation time-consuming, it also wastes a lot of physical memory. The negative effect is even worse if the program calls exec to load a new program immediately after the fork, because the copy was then completely unnecessary.

Hence the introduction of copy-on-write. The kernel does not copy a process's entire address space, only its page tables; after fork, the address spaces of the parent and child processes point to the same physical memory pages.

However, the memory of different processes should be private to each. As long as all processes only read their memory pages, they can keep sharing the same copies in physical memory. But whenever a process tries to write to a page in a shared area, a new copy of that page is created for it.

If fork()+exec() loads a new executable as soon as the child is created, the data in the parent's memory is meaningless to the child; the parent's pages will not be written by the child at all. In this case copying can be avoided altogether, with the shared address space handed to the child process directly.

The copy-on-write technique delays the copying of memory pages until the first write and, more importantly, in many cases makes copying unnecessary altogether. This saves a great deal of time and makes good use of scarce physical memory.

Copy-on-write in virtual memory management

When a task attempts to write to such write-protected memory, the memory management unit (MMU) raises an exception. While handling the exception, the kernel allocates a new physical page for the task, copies the data into it, updates the mapping, and then lets the write proceed.

With that background, let's take a look at Linux memory management.

Linux Memory Management

In order to make full use of and manage system memory resources, Linux uses virtual memory management, which abstracts the physical memory of modern computer systems.

It provides each process with its own contiguous, private address space that does not interfere with any other process's; in 32-bit mode, each virtual address space is 4GB.

All of a process's initial allocations and operations are based on these "virtual addresses". Only when the process actually accesses memory is the mapping between virtual and physical addresses established and a physical memory page brought in.

The principle is much like today's cloud storage services. If your network drive offers 1TB of space, do you really think it reserves that much for you up front? You are given real space only as you store data, yet it looks as if you own a full terabyte.

The user space occupied by a process (an executing program) is divided into five memory areas, following the principle that addresses with the same access attributes are grouped together. Access attributes means "readable, writable, executable, and so on".

  • Code segment

    The code segment stores the executable file's instructions; it is the executable program's image in memory. The code segment must be protected from illegal modification at run time, so only read access is allowed; it is not writable.

  • Data segment

    The data segment is used to store initialized global variables in the executable, in other words, statically allocated variables and global variables in the program.

  • BSS

    The BSS segment contains global variables that are not initialized in the program. The BSS segment is set to zero in memory.

  • Heap

    The heap stores memory dynamically allocated while the process runs. Its size is not fixed and can grow or shrink dynamically. When the process calls a function such as malloc to allocate memory, the newly allocated memory is added to the heap (the heap expands); when memory is freed with a function such as free, the freed memory is removed from the heap (the heap shrinks).

  • Stack

    The stack stores the temporary local variables created by the program, that is, variables defined inside functions (but not static variables, which live in the data segment). In addition, when a function is called its parameters are pushed onto the stack of the calling process, and when the call ends the function's return value is passed back via the stack. Because of the last-in, first-out nature of the stack, it is particularly convenient for saving and restoring the call context. In this sense, we can think of the stack as an area of memory where temporary data is stored and exchanged.

You can use the size command on Linux to view the size of each memory area of the compiled program:

# size /usr/local/sbin/sshd
   text     data      bss      dec      hex  filename
1924532    12412   426896  2363840   2411c0  /usr/local/sbin/sshd

Kernel address space partition

On x86 32-bit systems, the Linux kernel address space refers to the high virtual address range from 0xC0000000 to 0xFFFFFFFF, 1GB in total. It holds the kernel image, physical page tables, drivers, and everything else running in kernel space.

Direct mapping region

Direct memory mapping region: this region is 896MB in size, starting at the beginning of kernel space.

The 896MB of "linear addresses" in the direct mapping region map directly onto the first 896MB of "physical addresses"; that is, the linear addresses and the physical addresses they map to are both consecutive. The linear address 0xC0000001 of the kernel address space corresponds to the physical address 0x00000001, offset by PAGE_OFFSET = 0xC0000000.

There is a linear relationship between linear and physical addresses in this region: linear address = PAGE_OFFSET + physical address. The same relation can be used to convert a linear address in the kernel's virtual space into a physical address.

High memory linear address space

The kernel-space linear addresses from 896MB to 1GB, a 128MB range, form the high memory linear address space. Why is it called that? Here is the story:

As mentioned earlier, the total size of the kernel space is 1GB, and 896MB linear addresses starting from the start of the kernel space can be directly mapped to 896MB physical address range.

In other words, even if all 1GB of linear addresses in kernel space were mapped to physical addresses, the kernel could still only address a 1GB range of physical memory.

Kernel space therefore reserves the final 128MB of its address range and divides it into the following three high-end memory mapping regions, so that the entire physical address range can be reached. On 64-bit systems this problem does not exist, because the available linear address space far exceeds the amount of memory that can be installed.

Dynamic memory mapped area

Vmalloc region: this region is allocated by the kernel function vmalloc. The linear addresses are contiguous, but the corresponding physical addresses may not be. The physical pages behind a vmalloc allocation may reside in either low or high memory.

Permanent memory mapped area

Persistent kernel mapping region: this region gives access to high memory. Pages are allocated from high memory with alloc_page(__GFP_HIGHMEM) and then mapped into this region with the kmap function.

Fixed mapping region

Fixed kernel mapping region: this region is separated from the top of the 4GB space by only a 4KB isolation band, and each address in it serves a specific purpose, such as ACPI_BASE.

User space memory data structure

Benefits of virtual addresses

  • It prevents user programs from accessing physical memory addresses directly, guarding against destructive operations and protecting the operating system
  • Each process is allocated 4GB of virtual memory, and user programs can use a larger address space than the actual physical memory

System Processing flow

The system splits virtual memory into fixed-size virtual pages, and physical memory into physical pages in the same way. When a process accesses memory, the CPU's memory management unit (MMU) translates the virtual address into a physical address via the page table and then fetches the data. This makes it look, from inside each process, as if it has the entire main memory to itself.

When the CPU gets a virtual address to access, it splits it into two parts: the virtual page number and the offset. It first checks the TLB with the virtual page number. On a TLB hit, it immediately gets the physical frame number, combines it with the offset to form the physical address, and accesses that address through the cache hierarchy, falling back to the next level of memory on a cache miss. On a TLB miss, it uses the CR3 register, which holds the base address of the current process's top-level page table, and walks the page tables level by level.

When initializing a process, Linux initializes the process’s virtual address space by associating it with an object on disk, a process called memory mapping.

The executable file stored on disk contains the data for the various segments of virtual memory, such as the code segment and data segment. The code segment, for example, must remain constant during execution and looks the same in memory as on disk.

So how does it get loaded into memory?

Linux maps different areas of memory to one of the following two disk files:

  1. A regular file in the Linux file system, for example an executable. The relevant portion of the file is divided into page-sized blocks, each containing the initial contents of one virtual page; when a block is smaller than a page, the remainder is zero-filled. However, the operating system does not actually bring everything into memory up front; a page is faulted into physical memory only when the CPU first accesses it. When a process requests memory, the Linux kernel really only hands out a virtual address range without allocating physical memory. Only on first access does the kernel allocate physical pages and create page table entries mapping virtual memory to physical memory. This makes full use of physical memory.
  2. An anonymous file. An area of virtual memory can also be mapped to an anonymous file created by the kernel; the stack and uninitialized global variables, for example, have no backing content in the executable and are mapped to anonymous files. When the CPU first accesses such an area, the kernel finds a physical page, zeroes it, and updates the process's page table; no disk-to-memory data transfer is involved. Note, however, that in C++ heap memory is not always zeroed, because the C++ runtime manages heap memory internally: an allocation may not come fresh from the operating system but be a previously freed block that the allocator retained and handed straight back to the user.

In both cases, once a virtual page is initialized, it is swapped between swap space and main memory. The size of swap space limits the maximum number of virtual pages the running processes can use; it can be configured in the operating system.

Memory mapping and Interprocess Shared Objects (CopyOnWrite)

Different processes can share objects. For example, since the code segment is read-only, processes running the same executable can share the code-segment pages of virtual memory, saving physical memory. There is also the shared memory mechanism for inter-process communication, implemented at the virtual memory mapping level: virtual pages of different processes are mapped to the same physical page frame, achieving memory sharing between processes. At the same time, to save physical memory while keeping each process's address space private, copy-on-write can be used: multiple processes start out sharing one physical page, and when one of them writes to the page, the copy-on-write mechanism copies the physical page, privatizing it for that process.

Buddy allocation algorithm

The Linux kernel introduced the buddy system algorithm. What does that mean? Free blocks of page frames of the same size are linked together in a list, and each block can be split from, or merged back with, a same-size "buddy" block, which is where the algorithm gets its name.

Specifically, all free page frames are grouped into 11 linked lists, whose blocks contain 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames respectively. At most 1024 contiguous page frames can be requested at once, corresponding to 4MB of contiguous memory.

Since any positive integer can be written as a sum of powers of two, a block of memory of a suitable size can always be found for an allocation, which reduces external fragmentation.

Slab allocator

Now, you might be thinking, well, with the buddy system you can manage physical memory, right? No, not enough, otherwise there would be no slab allocator.

So what is a slab allocator?

In general, a kernel object's life cycle is: allocate memory, initialize, release memory. The kernel has a large number of small objects, such as file structs and task descriptor structs. If these small objects were allocated and freed through the buddy system in units of pages, running this "allocate, initialize, free" cycle frequently would perform very poorly.

The buddy system allocates memory in units of page frames, whereas many scenarios in the kernel allocate small pieces of memory, far smaller than a page. The slab allocator serves as a cache for kernel objects, subdividing memory into different sizes according to the objects being used.

The buddy system and the slab allocator are not an either/or choice: the slab allocator is a complement to the buddy allocation algorithm.

mmap

mmap is a system call defined in the POSIX interface for memory mapping; it can be used in many different scenarios:

  • It can allocate large chunks of memory
  • It can request shared memory
  • It can map files or devices directly into memory

A process can access a mapped file just like normal memory, which is useful in many scenarios in actual development.

On Linux we can use mmap to allocate address space in a process's virtual address space and create a mapping to physical memory.

mmap can also map a file directly into the process's address space. The process can then read and write the file's on-disk contents as if operating on memory, without issuing system calls such as read/write.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char *filename = "/tmp/foo.data";
    struct stat st;
    int fd = open(filename, O_RDWR, 0);
    fstat(fd, &st);
    /* Map the whole file read/write; MAP_SHARED writes back to the file. */
    void *bufp = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(bufp, "Linuxdd", 7);   /* write to the file through memory */
    munmap(bufp, st.st_size);
    close(fd);
    return 0;
}

After mmap, the file contents are not loaded into physical pages; only address space is allocated in virtual memory. When the process later accesses this address, the page table walk finds that the corresponding page is not in physical memory, producing a "page fault". The kernel's page fault handler then loads the file contents into physical memory in page-sized (4096-byte) units. Note that only the faulting page has to be loaded, though depending on the operating system's scheduling policies, more than is strictly needed may be loaded.

Address space

The virtual address space of a process consists of several parts; the memory where mmap'd files reside is the memory-mapped segment.

Difference between mmap and read/write

The flow of the read system call looks something like this:

  • The user process initiates the read operation
  • The kernel performs basic page cache checks and reads data from disk into a kernel buffer
  • The kernel then copies the kernel buffer's data into the user-mode buffer
  • The user process is woken up and continues execution

The mmap flow, by contrast, looks like this:

The kernel exposes the memory directly to user mode, and user-mode modifications to that memory are directly visible in kernel mode. With the kernel-to-user memory copy eliminated, speed improves.

mmap has many advantages. Compared with traditional I/O methods such as read and write, mmap maps a region of virtual addresses straight onto the file without any data copy operation. When a page fault occurs, mmap loads the disk data into memory through the mapping; reading, caching, and page swapping are scheduled directly by the operating system, which indirectly reduces the complexity of user programs and improves running efficiency.

Advantages and disadvantages

The advantages are as follows:

  1. File reads avoid the extra copy through the page cache, reducing the number of data copies and replacing I/O reads with memory reads, which improves file read efficiency.
  2. It enables efficient interaction between user space and kernel space: modifications made in either space are directly reflected in the mapped area and therefore captured promptly by the other space.
  3. It provides a way for processes to share memory and communicate. Both related and unrelated processes can map their user space to the same file, or anonymously to the same region, so inter-process communication and sharing can be achieved through the mapped region. If processes A and B both map region C, then when A first reads C the file page is copied from disk into memory via a page fault; when B later reads the same page, a page fault also occurs, but the file data already resident in memory is used directly instead of being copied from disk again.
  4. It enables efficient large-scale data transfer. Insufficient memory constrains big-data workloads; the usual remedy is to use disk space to supplement memory, but that causes heavy file I/O, which badly hurts efficiency. mmap mapping addresses this problem: whenever disk space must stand in for memory, mmap can be useful.

Disadvantages are as follows:

  1. For files smaller than a page (4096 bytes), say 10 bytes: because the minimum granularity of memory is a page, and the mapping from a process's virtual address space to memory is also page-sized, the process's virtual address region must still span a whole page even though the mapped file is only 10 bytes. After mmap runs, 4096 bytes are actually mapped, with bytes 11 through 4096 zero-filled. So continuously mmapping small files wastes memory space.
  2. It is not suitable for variable-length files, which cannot be grown: once the mapping is established, the range you can operate on is fixed.
  3. Heavy file updates can trigger a large number of dirty-page writebacks and random I/O, so mmap is not necessarily more efficient than buffered writes when there are many random writes.

File management systems (such as those used under Linux) use a copy-on-write strategy

ZFS and Btrfs are copy-on-write file systems; copy-on-write is also used in their journaling technology.

ZFS

ZFS, the Zettabyte File System, also called the Dynamic File System, was the first 128-bit file system. It was originally developed by Sun for the Solaris 10 operating system. Released in November 2005 as part of the OpenSolaris open source initiative, ZFS was called the ultimate file system by Sun and has seen more than 10 years of active development. The latest development line is fully open and has been renamed OpenZFS.

The use of copy-on-write makes the implementation of ZFS’s snapshot and transaction functions easier and more natural, and the snapshot function is more flexible. The disadvantage is that COW aggravates the fragmentation problem. If a part of a large file generated by sequential writing is changed randomly, the physical address of the file on the hard disk is no longer continuous, and the sequential read performance in the future will be poor.

BTRFS

Btrfs (commonly pronounced "Butter FS") is a copy-on-write file system announced by Oracle in 2007 and still under development. Its goal is to replace the Linux ext3 file system by lifting ext3's limitations, particularly the single-file size limit and the total file-system size limit, and by adding file checksum features. It also provides features such as snapshots, RAID support, and subvolumes, and allows you to resize a file system online.

First, Btrfs is built for scalability. Its most important design goal is to meet the scalability requirements of large machines: features such as extents, B-trees, and dynamic inode creation ensure that Btrfs keeps performing well on large systems, with overall performance not degrading as system capacity grows. Next come data integrity features: Btrfs uses COW transaction technology to keep the file system consistent, and supports checksums to avoid silent corruption, which traditional file systems cannot do. Third are multi-device management features: Btrfs supports creating snapshots and clones, and can manage multiple physical devices easily, making traditional volume management software redundant. Finally there are features that are harder to categorize: advanced techniques that significantly improve the time/space performance of the file system, including delayed allocation, storage optimization for small files, and directory indexing.

Databases generally adopt a copy-on-write policy to provide users with a snapshot

MySQL MVCC

Multi-Version Concurrency Control (MVCC) implements read-write concurrency to a certain extent. It works only at two isolation levels, REPEATABLE READ and READ COMMITTED. The other two isolation levels are incompatible with MVCC: READ UNCOMMITTED always reads the latest version of a row rather than a version consistent with the current transaction, and SERIALIZABLE locks every row it reads.

Row locking, concurrency, transaction rollback, and many other features are related to MVCC, and the core idea behind the MVCC implementation is copy-on-write.

Copy-on-write applications in Java

The J.U.C package provides thread-safe collections with copy-on-write support: CopyOnWriteArrayList and CopyOnWriteArraySet.

Compared with fail-fast containers, the fail-safe COW containers are much safer, but because the entire array is copied on every write they are only suitable for read-heavy, write-light scenarios. For efficiency, try to insert or delete data in batches rather than one element at a time. Also, the original and the copy may be out of sync, so reading the latest data is not guaranteed; only eventual consistency is.

Redis

Redis does not terminate external services while generating RDB snapshot files

Redis can restore its data after a restart. RDB, for example, is a snapshot of the Redis database at a moment in time: when you run the BGSAVE command, Redis saves a dump.rdb file recording all of the database's data at that moment. The remarkable thing is that Redis keeps processing commands while saving. So here is an interesting question: how does Redis keep dump.rdb consistent? Redis is modifying the database and saving it to a file at the same time. Doesn't it worry about dirty reads and writes?

Redis has a main process writing data when a command arrives asking it to persist data to disk. As we know, Redis workers are single-threaded; if persistence ran on that same thread, users would be affected whenever a lot of data had to be persisted. So Redis forks a child process to do the persistence.

It works like this: after fork(), the kernel sets the permissions of all memory pages in the parent to read-only, and the child's address space points to the same pages as the parent's. As long as both parent and child only read memory, nothing happens. When one of them writes to memory, the CPU hardware detects that the page is read-only and triggers a page fault, trapping into one of the kernel's interrupt routines. In the interrupt routine, the kernel makes a copy of the page that triggered the exception, so the writing process gets its own separate copy and the other process's view is unaffected.

Does the parent hold the original, the child holds the copy, or vice versa?

Whoever modifies the memory keeps the copy

Does the kernel copy memory one page at a time?

Yes, the unit of copying is one page.
