The outline
[toc]
The thinking triggered by cp
What is cp? It is short for copy, one of the most commonly used commands on Linux.
The cp command belongs to Coreutils, a core package maintained by the GNU project that provides the fundamental commands on Linux.
Today a use of cp surprised a colleague and got me thinking about the details behind it.
What happened is that Kiv copied a 100 GiB file with cp today, and it finished in under a second. A SATA drive writes at up to 150 MiB/s (most drives don't even manage that), so copying a 100 GiB file should normally take at least 682 seconds (100 GiB / 150 MiB/s). That's about 11 minutes.
```
sh-4.4# time cp ./test.txt ./test.txt.cp

real    0m0.107s
user    0m0.008s
sys     0m0.085s
```
The above was our theoretical analysis: at least 11 minutes. The reality is that the job finished in under a second. We were shocked. Why? And there is another puzzle: the whole file system is only 40 GiB, so how can it hold a 100 GiB file?
Analyzing the file
First, let's list the file with ls. It shows the file really is 100 GiB:
```
sh-4.4# ls -lh
-rw-r--r-- 1 root root 100G Mar  6 12:22 test.txt
```
But du reports only 2 MiB (and the file system's total space is less than 100 GiB):
```
sh-4.4# du -sh ./test.txt
2.0M    ./test.txt
```
Now look at the stat command:
```
sh-4.4# stat ./test.txt
  File: ./test.txt
  Size: 107374182400    Blocks: 4096       IO Block: 4096   regular file
Device: 78h/120d    Inode: 3148347     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-13 12:22:00.888871000 +0000
Modify: 2021-03-13 12:22:46.562243000 +0000
Change: 2021-03-13 12:22:46.562243000 +0000
 Birth: -
```
What the stat output tells us:
- Size is 107374182400, i.e. 100 GiB;
- Blocks is 4096; these are 512-byte sectors, so the physical space used is 4096 * 512 bytes = 2 MiB;
To highlight:
- Size is the logical size of the file, the number most people look at;
- Blocks reflects the physical space actually occupied;
So notice the new concept here: file size and actual physical occupancy turn out to be different things. Why is that?
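If you want to read the two numbers programmatically instead of parsing stat(1) output, here is a minimal Go sketch (assuming Linux; in syscall.Stat_t, Size is the logical size and Blocks counts 512-byte sectors):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var st syscall.Stat_t
	if err := syscall.Stat("./test.txt", &st); err != nil {
		panic(err)
	}
	// Size is the logical file size; Blocks counts 512-byte sectors,
	// so Blocks*512 is the physical space actually occupied.
	fmt.Printf("size: %d bytes, physical: %d bytes\n", st.Size, st.Blocks*512)
}
```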
That calls for some file system basics: how does a file system store files? (Using the ext family of file systems on Linux as the example.)
The file system
A file system may sound fancy, but in plain terms it is just a container for storing data, essentially like your suitcase or warehouse; it just stores digital goods instead. I have a video file, I put it into the file system, and the next time I come for it I expect my complete video data back. That is the file system: a storage and retrieval service.
A real-world access scenario
It's like the luggage storage service at a train station. I can put a bag in and take it out later, and that's it. Now look at the questions hiding in there. How is it put in? How is it taken out? Think about how you actually check luggage.
When you deposit it, you have to register some personal information, right? At the very least your name goes on it. They may even hand you a tag to carry that uniquely identifies your piece of luggage.
When you pick it up, you state your name and hand over the tag, and the staff member goes to a specific location to fetch your bag (otherwise, with so many people at the station and bags that all look about the same, they would have no idea which one is yours).
**Key point: when storing, some key information must be recorded (a record ID tied to your identity), so the item can be located correctly when retrieved.**
Back to the file system
Returning to our file system, we can map it onto the luggage scenario above:
- Registering your name corresponds to recording the file name in the file system;
- The tag that gets issued is the metadata index;
- Your luggage is the file data;
- The storage room is the disk (the physical space that holds things);
- The administrator's whole operating procedure is the file system;
The mapping is not rigorous; it is only meant to help you see that a file system is actually a very simple thing, with ideas taken straight from everyday life.
Highlight: The storage medium of a file system is a disk, and the file system is a software system that manages how disk space is used.
Space management
Now, how does a file system manage space?
If you were handed a large stretch of contiguous disk space to manage, how would you use it?
An intuitive idea: store each piece of data in whole, as one contiguous chunk.
That method is very easy to implement: the simplest now, the most troublesome later. It creates lots of holes. There is clearly plenty of free space overall, but because each item must be stored whole, a piece of the wrong shape (size) fits nowhere, since you always need one contiguous region.
This unusable space is called fragmentation, or more precisely external fragmentation. It is most familiar from memory-pool allocation; the principle is the same.
How do we improve on that? The natural thought: if the whole thing doesn't fit, chop it up. A little here, a little there, and it all goes in.
Exactly right. The improvement is to slice the space at a fixed granularity. Each small physical chunk is called a block, and a block is typically 4 KiB. User data stored in the file system is sliced the same way and spread across blocks. The smaller the block granularity, the smaller the external fragmentation (note: at the cost of more metadata), so space can be used as fully as possible. As a result, a complete user file on disk is not contiguous; it is cut into equal-sized data blocks scattered all over the disk.
In the diagram, the numbers are the block numbers of the complete file, used to reassemble it.
Then comes the problem: you can't just slice and forget. When the file is read back, the complete user data must be returned; the user doesn't care what you did internally, they only want the original. So there must be a table recording the locations of all the blocks belonging to the file, one entry per block, and on reads that table is used to reassemble the blocks into the whole.
So the write path looks like this:
- Write the data first: the data is stored at block granularity to various locations on the disk;
- Then write the metadata: record where each block went. That record is the metadata, which in a file system is called the inode (drawn as a notebook in the diagram);
And the read path:
- First read the metadata to find where each block lives;
- Then read the data blocks, reconstruct the complete file, and hand it to the user;
The inode/block concept
Ok, now we have two concepts:
- Disk space is divided at block granularity; the blocks holding data make up the data area;
- File data is no longer stored contiguously on disk, so metadata must record its layout; that metadata is what we call the inode;
In a file system, one inode corresponds to exactly one file, and the number of inodes is fixed when the file system is formatted. In other words, a local file system inherently has a ceiling on the number of files it can hold.
Blocks have a fixed size, 4 KiB each (true of most file systems; we won't quibble here), and they exist to store the user's sliced-up data.
Both the inode area and the block area are, at bottom, linear stretches of disk space. The space layout of a file system looks like this:
Each file corresponds to one inode. The file's data is cut into blocks and stored on disk, with the inode recording where those blocks are. Through the inode you find the blocks, and through the blocks you recover the user data.
Now a small new problem. The inode area and block area are laid out at initialization time. To store a file, you grab a free inode, then slice the data into 4 KiB pieces and store them into free blocks, right?
Highlight: free inodes, free blocks. This matters a lot: you must never write to a spot that already holds data, or someone else's data gets overwritten.
So how do you tell free inodes and blocks apart from ones already in use?
The answer: inodes and blocks each need a separate table indicating which ones are in use. That table is called a bitmap. A bitmap is an array of bits, 0 meaning free and 1 meaning in use, like this:
When is the bitmap consulted? Naturally on writes, when allocating an inode or a block, because only allocation needs to find free space.
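A bitmap allocator is tiny. Here is a conceptual sketch in Go, an illustration of the idea rather than the ext2 code: scan for the first 0 bit, mark it 1, and return its index.

```go
// Bitmap marks which inodes or blocks are in use: bit=0 free, bit=1 used.
type Bitmap []byte

// Alloc finds the first free bit, marks it used, and returns its index,
// or -1 if everything is already allocated.
func (b Bitmap) Alloc() int {
	for i, byt := range b {
		if byt == 0xFF {
			continue // all 8 bits used, skip this byte quickly
		}
		for bit := 0; bit < 8; bit++ {
			if byt&(1<<bit) == 0 {
				b[i] |= 1 << bit
				return i*8 + bit
			}
		}
	}
	return -1
}
```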
To keep the essential idea in focus, structures such as the superblock and block group descriptors are omitted here; expand on them if you're interested. Only the trunk is highlighted.
To summarize:
- A bitmap is essentially an array of bits occupying very little space, with 0 meaning free and 1 meaning in use. It is consulted when creating files or writing data;
- An inode corresponds to one file and stores its metadata, chiefly the locations of its data blocks;
- Blocks store user data. User data is cut at block size (4 KiB) and scattered across the disk; the complete file can only be recovered from the positions recorded in the inode;
- The total numbers of inodes and blocks are fixed when the file system is formatted, so both the number of files and the size of a file have upper bounds.
What a file actually looks like
Now let's look at what a real inode -> block layout is like. Besides leading to the data, the inode stores some metadata, such as file type, permissions, file size, and create/modify/access times. Each file has exactly one inode.
Take a look at the inode data structure (using Linux ext2 as the example; it is defined in the fs/ext2/ext2.h header):
```c
struct ext2_inode {
    __le16  i_mode;                  /* File mode */
    __le16  i_uid;                   /* Low 16 bits of Owner Uid */
    __le32  i_size;                  /* Size in bytes */
    __le32  i_atime;                 /* Access time */
    __le32  i_ctime;                 /* Creation time */
    __le32  i_mtime;                 /* Modification time */
    __le32  i_dtime;                 /* Deletion Time */
    __le16  i_gid;                   /* Low 16 bits of Group Id */
    __le16  i_links_count;           /* Links count */
    __le32  i_blocks;                /* Blocks count */
    __le32  i_flags;                 /* File flags */
    __le32  i_block[EXT2_N_BLOCKS];  /* Pointers to blocks */
    __le32  i_file_acl;              /* File ACL */
    __le32  i_dir_acl;               /* Directory ACL */
    __le32  i_faddr;                 /* Fragment address */
};
```
Key points:
- The mode, uid, size, and time fields above are the commonly mentioned metadata: file type, permissions, size, and create/modify times;
- Pay special attention to the `i_block[EXT2_N_BLOCKS]` field: it is what leads you to the data, because it stores the numbers of the blocks where the data lives;
Again, let's be clear about what a block's "position" means.
The position is the number: recording a position means recording a number, and the number is an index.
We see there is an array, i_block[EXT2_N_BLOCKS], an array of block numbers. EXT2_N_BLOCKS is a macro whose value is 15, so i_block is an array of 15 elements, each 4 bytes (32 bits) in size.
For example, suppose we have a 6 KiB file that needs only 2 blocks, and suppose its data lands on blocks 3 and 101.
Then in i_block[15], the first element stores 3 and the second stores 101; the remaining slots are unused (inode memory is zeroed out, so a value of 0 marks a slot as not in use). Through these two blocks, [3, 101], we can reassemble the complete user data. The user's 6 KiB file is laid out as follows:
- The first 4 KiB of data occupies the byte range [3*4K, 4*4K];
- The remaining 2 KiB occupies [101*4K, 101*4K+2K];
Ok. Now we know that every fixed-size block has a unique number, and the i_block[15] array stores those numbers in order, locating the file's data so the whole file can be assembled.
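To make the assembly step concrete, here is a hypothetical Go sketch (block size and numbers are illustrative) that turns an i_block-style list of physical block numbers into the byte ranges to read from disk:

```go
const blockSize = 4096

// ranges maps an ordered list of physical block numbers to the
// [start, end) byte ranges on disk; fileSize trims the final block.
func ranges(blocks []int64, fileSize int64) [][2]int64 {
	out := make([][2]int64, 0, len(blocks))
	remain := fileSize
	for _, n := range blocks {
		length := int64(blockSize)
		if remain < length {
			length = remain
		}
		out = append(out, [2]int64{n * blockSize, n*blockSize + length})
		remain -= length
	}
	return out
}

// ranges([]int64{3, 101}, 6*1024) yields [3*4K, 4*4K] and
// [101*4K, 101*4K+2K], exactly the two ranges described above.
```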
Question to consider: distinguish the numbering of the file cut into 4 KiB pieces (logical blocks) from the numbering of the physical 4 KiB blocks on disk.
For example, a 12 KiB file is stored in 4 KiB slices across 3 physical blocks:
logical block 0 of the file is stored on physical block 101; logical block 1 on physical block 30; logical block 2 on physical block 11.
The file's logical space is numbered 0 through 2, and the corresponding physical block numbers are 101, 30, and 11.
Question to consider: How large a file can such an inode represent?
We see that inode->i_block[15] is a one-dimensional array holding 15 elements, that is, 15 block numbers. Storing the file's data block numbers directly, it can represent at most a 60 KiB (15 * 4K) file. In other words, with all 15 slots holding data block numbers, the file system would cap files at 60 KiB. Stunned? (Note: the ext2 file system can actually create files up to 4 TiB!)
So naturally we ask: how do we fix this? How can larger files be supported?
The most straightforward idea is to enlarge the inode->i_block array. For example, to support 100 GiB files:
we would need an i_block array of 26214400 entries (100 * 1024 * 1024 / 4), i.e. i_block[26214400].
With each number taking 4 bytes, the array occupies 100 MiB (computed as 26214400 * 4 / 1024 / 1024).
100 MiB! Remember, i_block is just a field inside the inode, a statically allocated array. In other words, to support files up to 100 GiB, every inode would carry 100 MiB of overhead; even a 1 KiB file's inode would pay the same. Worse, the scheme scales badly: the larger the supported file size, the heavier the i_block[N] overhead.
That is unacceptable. So what can be done? How do we represent larger files without wasting space?
Break the problem down and you'll find two core issues:
- First, the inode's in-memory structure must stay stable: however the file changes, the inode structure itself cannot change;
- Second, think carefully and you'll see that no matter how clever the scheme, indexing a 100 GiB file at 4 KiB granularity takes about 100 MiB of space for the index (the block numbers), yet 99.99% of files will never be that big.
Our earlier scheme of storing block numbers in one big array was simple, but too rigid. The core problem is that the block-number array is pre-allocated, and pre-allocation is the source of the waste: a full block-index array stands ready for something that hasn't happened and, 99% of the time, never will (the file growing to 100 GiB).
With these two problems identified, let's analyze and solve them one by one.
Storing the index on disk
Solution to problem one: store the index on disk.
That is, don't keep the block-number array entirely inside the inode; allocate space for it on the disk itself.
Why is that?
Because disk space is orders of magnitude larger than memory. 100 MiB is big for memory but small for disk. In other words, the block numbers of the user's data can themselves be stored on disk; that storage also takes physical space and also uses blocks, except these blocks hold block-number information rather than user data.
So how do we reach the user data through the inode?
The slots of inode->i_block[15] no longer have to store the data's block numbers directly; they can instead store the number of the block that stores the block numbers. Follow that to read out the user-data block numbers, then locate the blocks holding the user data and read them.
A block that stores the numbers of the blocks where user data lives is called an indirect index. By the number of hops we then speak of level-1, level-2, and level-3 indexes: a level-1 index reaches the user data after one hop, a level-2 index after two hops, and a level-3 index after three. A slot of inode->i_block[15] that points straight at a user-data block is a direct index.
Back to ext2 and the inode->i_block[15] array. By convention, its 15 slots are divided into 4 categories:
- The first 12 slots (0 through 11) are direct indexes;
- The 13th slot is what we call the level-1 index;
- The 14th slot is the level-2 index;
- The 15th slot is the level-3 index;
Now let's see what the direct, level-1, level-2, and level-3 indexes each achieve.
Direct indexes: 12 block numbers, 4 KiB per block, i.e. 48 KiB. A file within 48 KiB fits entirely in the first 12 slots of inode->i_block[15].
Level-1 index:
The inode->i_block[12] slot holds a level-1 index: the number stored there points to a block whose slots in turn hold the numbers of the blocks containing user data. A block is 4 KiB and each number takes 4 bytes, so one index block holds 1024 numbers.
Therefore a level-1 index can address 4 MiB (1024 * 4K) of space.
Level-2 index:
The level-2 index adds one more level on top of level 1, effectively providing 4 MiB of storage for user-data block numbers. So a level-2 index can address 4 GiB (4M / 4 * 4K) of space.
Level-3 index:
The level-3 index adds yet another level on top of level 2, effectively providing 4 GiB of storage for user-data block numbers. So a level-3 index can address 4 TiB (4G / 4 * 4K) of space.
Finally, take a look at the full representation:
So on our ext2 file system, the maximum file size supported by this indirect-block indexing is 48K + 4M + 4G + 4T, roughly 4 TiB. The file system as a whole supports at most 16 TiB, because block numbers are 32-bit: 2^32 = 4294967296 blocks, times 4 KiB, equals 16 TiB.
That is how ext2's maximum single-file size and maximum file system capacity are derived. (Note: ext4 is not only compatible with the indirect-block scheme but also manages space with extents; ext4 supports single files up to 16 TiB and file systems up to 1 EiB.)
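The arithmetic is easy to double-check. A small Go sketch that recomputes these limits from the block size and the 4-byte block-number width:

```go
package main

import "fmt"

const (
	blockSize  = 4096          // bytes per block
	ptrsPerBlk = blockSize / 4 // 4-byte block numbers per index block = 1024
)

func main() {
	direct := int64(12) * blockSize                                   // 48K
	single := int64(ptrsPerBlk) * blockSize                           // 4M
	double := int64(ptrsPerBlk) * ptrsPerBlk * blockSize              // 4G
	triple := int64(ptrsPerBlk) * ptrsPerBlk * ptrsPerBlk * blockSize // 4T
	fmt.Println("max file:", direct+single+double+triple)             // about 4T
	fmt.Println("max fs:  ", (int64(1)<<32)*blockSize)                // 16T, 32-bit block numbers
}
```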
Consider: How does this multilevel index addressing perform?
Addressing small files (no more than 12 data blocks) is fastest: any byte of the file can in theory be reached with two disk reads, one for the inode and one for the data block. Data in a large file may take up to five disk reads: the inode, the level-1 index block, the level-2 index block, the level-3 index block, and the data block.
Multilevel indexing and post-allocation
Solution to problem two: multilevel indexing plus post-allocation.
A flat index doesn't scale, performance suffers, reserving the full index is wasteful, and reserving nothing means files can't grow. How do we resolve all that?
Since the trouble is pre-allocation, we use post-allocation (thin provisioning) instead: the index I allocate matches the size the user's file actually reaches. Storing a 100 GiB file engages the level-3 index with up to 26214400 slots of index (because 26214400 block numbers are needed); storing a 6 KiB file needs only a 2-slot array.
Post-allocation of the index array
Post-allocation here means the block-number index is allocated only when needed. It is not the case that, because a user's 1 KiB file might someday grow to 100 GiB, we hand it a 100 MiB index array up front.
Post-allocation of data
Post-allocation has another dimension: the space occupied by the data itself is also allocated only when actually used. This is the core point behind today's cp secret.
A concrete example
Look at what a normal file write does (only the trunk is described; real implementations may optimize):
- Create the file, allocating an inode;
- Write 4 KiB of data at [0, 4K]. Only one block is needed; say block 102 is allocated, and its number is saved in `inode->i_block[0]`;
- Write 4 KiB of data at [1T, 1T+4K]. A block is allocated for that range as well;
- Finish writing and close the file;
Here is why the file offset [1T, 1T+4K] falls into the level-3 index:
- An offset of 1 TiB corresponds to logical block 268435456 (note: this is the file's logical block number, not a physical location);
- Work out the ranges first: direct indexes cover logical blocks [0, 11], the level-1 index [12, 1035], the level-2 index [1036, 1049611], the level-3 index [1049612, 1074791435] (direct: 12 blocks; level 1: 1024; level 2: 1M; level 3: 1G; then add them up);
- 268435456 falls in the level-3 range [1049612, 1074791435];
The actual storage is shown as follows:
Calculate index:
12 + 1024 + 1024*1024 + 254*(1024*1024) + 1022*1024 + 1012 = 268435456
The actual physical allocation is shown below:
Since this offset already engages the level-3 index, besides the two blocks of user data, three indirect index blocks are allocated along the way.
What if I want to read the data at [1T, 1T+4K]?
The process is as follows:
- Compute from the offset that the logical block number is 268435456;
- Read the level-3 index from `inode->i_block[14]` and find the corresponding physical block; this is the first-level index block;
- Read slot 254 (the 255th) of that block; it stores the number of the second-level block. Follow that number to the corresponding physical block;
- Read slot 1022 (the 1023rd) of that block; it stores the number of the third-level block. Follow it to that physical block, whose slots store the numbers of the blocks holding user data;
- Read the block number stored in slot 1012 (the 1013th) and find that physical block; it holds the user data;
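To make the slot arithmetic concrete, here is an illustrative Go sketch (not kernel code) that decomposes a logical block number into its i_block slot path; it reproduces the 254 / 1022 / 1012 walk above:

```go
const ptrs = 1024 // 4-byte block numbers per 4K index block

// path returns the i_block slot followed by the per-level slot indexes
// serving logical block n (direct: slots 0-11; single: 12; double: 13; triple: 14).
func path(n int64) []int64 {
	switch {
	case n < 12:
		return []int64{n} // direct index
	case n < 12+ptrs:
		return []int64{12, n - 12} // single indirect
	case n < 12+ptrs+ptrs*ptrs:
		r := n - 12 - ptrs
		return []int64{13, r / ptrs, r % ptrs} // double indirect
	default:
		r := n - 12 - ptrs - ptrs*ptrs
		return []int64{14, r / (ptrs * ptrs), r / ptrs % ptrs, r % ptrs}
	}
}

// path(268435456) returns [14, 254, 1022, 1012]: start at i_block[14],
// then slot 254 of the level-1 block, 1022 of level-2, 1012 of level-3.
```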
At this point our file looks like a huge file: its size is 1T+4K, but the actual data inside is only 8 KiB, located at [0, 4K] and [1T, 1T+4K].
Note: the file size is merely an attribute in the inode; the physical space actually used depends on how many blocks of user data have been placed.
Highlight: where no data is written, no physical block is allocated.
No physical blocks where nothing is written? What is that? That is the sparse file we are about to discuss.
Sparse semantics of files
What is a sparse file
At last, the sparse semantics of files. What does sparse mean here?
A sparse file is in essence an ordinary computer file, and users are generally unaware of the difference. File systems support sparse files purely to use disk space more efficiently. Sparse files are a form of post-allocated space, allocated on demand, to maximize disk utilization.
Take the example above: the file size is 1 TiB but the actual data is only 8 KiB. That is a sparse file; the logical size and the actual physical space can differ. The file size is just an attribute, the file is just a container for data, and positions holding no user data need not be allocated space.
Why support sparse semantics?
Again take the 1 TiB file above: if it only ever has 4 KiB written at the start and 4 KiB at the end, yet the file system must allocate a full 1 TiB of physical space, the waste is enormous. Why not wait until user data is actually stored and allocate exactly as much as the data needs? Why rush to pre-allocate?
Post-allocation follows the principle of using exactly as much as is needed, making space as effective as possible.
Post-allocation has a further advantage: it reduces first-write latency.
Because if creating a 1 TiB file meant allocating 1 TiB of space, the allocation would first have to zero it all, otherwise the space might contain leftover random data.
For the empty regions of a sparse file, no physical space is occupied; the file system only needs to guarantee that reading them returns all-zero data.
One more point: the holes of a sparse file are sometimes indistinguishable from user data that really is all zeros, because from the outside they look the same.
Sparse files also require file system support; not every file system has sparse semantics. For example, ext2 does not, while ext4 does.
How to create a sparse file?
Use the truncate command to create a file on an ext4 file system:
```
truncate -s 100G test.txt
```
- `ls -lh ./test.txt` shows a 100 GiB file;
- but `du -sh ./test.txt` shows a footprint of 0 bytes;
- and `stat ./test.txt` shows `Size: 107374182400  Blocks: 0`;
This is a typical sparse file. Size is the file's logical size; the physical space actually used is what Blocks reflects.
As for the earlier 1 TiB file, only 8 KiB of data was written, so only 2 blocks are allocated for user data.
Ok, now let's think a little deeper: why can a file system pull this off?
Understanding sparse semantics really does start with understanding how the file system is implemented:
- First and most critically, disk space is managed by cutting it into discrete, fixed-size blocks;
- Then, the inode can locate all of the discrete data (it holds all the indexes);
- Finally, post-allocation of both index blocks and data blocks is implemented;
These three points build on one another.
Sparse semantics interfaces
For completeness, a brief look at the interfaces around sparse semantics:
- Preallocate: an interface letting users pre-allocate a specified amount of physical space within a file;
- Punch hole: an interface letting users release the physical space behind a specified range of a file;
The two operations are opposites.
Preallocate addresses this: when you create a 1 TiB file without writing data, no physical space is allocated at that moment. A file system with sparse semantics offers a fallocate interface that lets you pre-allocate, claiming the 1 TiB of physical space right away.
Think: What’s the advantage of this?
- First, if you are certain to need the full 1 TiB, pre-allocation helps: it concentrates the allocation work at initialization and avoids the overhead of allocating on the fly;
- Second, if you don't claim the space early, there is a good chance none will be left by the time you want it. Claim the physical space first and you can use it with peace of mind;
Linux provides a fallocate command that can be used to preallocate space.
```
fallocate -o 0 -l 4096 ./test.txt
```
This pre-allocates physical space for the [0, 4K] range of test.txt.
What is a punch hole?
This call lets you free occupied physical space, for the purpose of releasing it quickly. The operation is often used in VM image scenarios to reclaim space fast; punch hole lets services use space effectively.
The Linux fallocate command can also punch holes:
```
fallocate -p -o 0 -l 4096 ./test.txt
```
This frees the physical space behind the [0, 4K] range of test.txt.
Using sparse files
A Go implementation
Sparse files have no particular tie to any programming language. Let me use Go as the example of how to pre-allocate and punch holes.
The pre-allocation implementation:
```go
import (
	"os"
	"syscall"
)

func PreAllocate(f *os.File, sizeInBytes int) error {
	// mode = 0x0 allocates the range and extends the file size if needed;
	// passing FALLOC_FL_KEEP_SIZE (0x1) instead would keep the size unchanged.
	return syscall.Fallocate(int(f.Fd()), 0x0, 0, int64(sizeInBytes))
}
```
Punch hole implementation:
```go
// FALLOC_FL_KEEP_SIZE  = 0x1 (keep the file size unchanged)
// FALLOC_FL_PUNCH_HOLE = 0x2 (free the physical space behind the range)
func PunchHole(file *os.File, offset int64, size int64) error {
	err := syscall.Fallocate(int(file.Fd()), 0x1|0x2, offset, size)
	if err == syscall.ENOSYS || err == syscall.EOPNOTSUPP {
		return syscall.EPERM
	}
	return err
}
```
As you can see, both are essentially the fallocate system call with different mode flags: specify a file offset and length to pre-allocate or free physical space.
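A hypothetical usage sketch tying the two helpers together (the file name and sizes are made up, and it assumes the PreAllocate and PunchHole functions above are in scope; note the 4 KiB-aligned offsets, for the reason explained next):

```go
func main() {
	f, err := os.OpenFile("./test.txt", os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Reserve 1 MiB of physical space up front...
	if err := PreAllocate(f, 1<<20); err != nil {
		panic(err)
	}
	// ...then give back the first 4 KiB (offset and length are 4K-aligned).
	if err := PunchHole(f, 0, 4096); err != nil {
		panic(err)
	}
}
```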
One thing to note: punch-hole calls need 4 KiB alignment to actually free space.
Here’s an example:
Punch the range [0, 6K) and you will find only the physical block behind [0, 4K) released; the 4 KiB physical block backing [4K, 6K) still occupies space.
This is easy to understand: the disk's physical space is divided into 4 KiB blocks, the smallest unit. You cannot subdivide it; you cannot carve up a smallest unit.
Note that fallocate does not report an error even if the call you send is not 4 KiB aligned.
The secret of cp
After all that groundwork, we finally get to demystifying the cp command. Back to the original question: cp-ing a 100 GiB file took less than a second. Why so fast?
By now the answer is clear. The 100 GiB file is a sparse file, so a confident guess: cp copies only the valid data and skips the holes outright. Just compare the stat and ls output.
Let's look at cp's implementation in detail.
The interesting part is the sparse parameter, which controls cp's behavior with sparse files. It takes three values:
- `--sparse=always`: the most space-efficient;
- `--sparse=auto`: the default, and the fastest;
- `--sparse=never`: a plain byte-for-byte copy, the most naive;
cp defaults to sparse=auto. So what exactly do auto, always, and never do?
The three sparse strategies
The auto policy
By default, cp checks whether the source file has sparse semantics; for positions that occupy no physical space, it writes nothing to the target file, leaving holes there too.
So in our example, what really happened is that only 2 MiB of I/O was performed: of the apparent 100 GiB file, only 2 MiB of data was copied. Naturally it was fast, and naturally everyone was surprised.
Auto is the default policy. In this mode, cp's implementation obtains the file's hole positions through a system call, and the corresponding positions of the target file are left as holes.
**Note that no judgment is made about the content at non-hole positions. If user data occupies a physical block but happens to be all zeros, auto mode does not detect this and writes the all-zero data into the target file.** This is the biggest difference from always.
Under the auto policy, the copied file matches the source in both file size and number of physical blocks.
always
This mode is the most aggressive, pursuing minimal space. On top of auto it goes one step further: it inspects the source file's content.
After reading source data, even if a position is not a hole in the source file, the program checks whether the data is all zeros; if so, it creates a hole at the corresponding position of the target file (allocating no physical space).
This mode yields a target file with the same size as the source (the file size is identical under all three strategies) but using fewer physical blocks.
never
This mode is the most conservative and the simplest to implement. Whether or not the source file is sparse, cp stays completely oblivious: whatever data it reads, it writes straight into the target file. So even if a 100 GiB file occupies only 4 KiB of physical space, cp produces a 100 GiB target file occupying a full 100 GiB of physical space.
Consequently, a cp with this parameter is very, very slow.
Digging into the `cp --sparse` source code
The above were the conclusions; now let's use the source code to understand cp's principles further and walk its implementation together.
The cp source lives in the GNU coreutils project, which provides the basic peripheral command-line tools for Linux. cp looks simple, but its implementation is quite interesting.
cp's entry point is in the cp.c file (the following is based on coreutils 8.30).
Taking a file-to-file copy as the example, let's take a trip through the source:
```
cp ./src.txt dest.txt
```
First, we initialize the parameters in main:
```c
switch (c)
  {
  case SPARSE_OPTION:
    x.sparse_mode = XARGMATCH ("--sparse", optarg,
                               sparse_type_string, sparse_type);
    break;
```
The argument passed by the user is translated into an enum value: one of SPARSE_NEVER, SPARSE_AUTO, or SPARSE_ALWAYS. If the user omits the parameter, the default is SPARSE_AUTO:
```c
static enum Sparse_type const sparse_type[] =
{
  SPARSE_NEVER, SPARSE_AUTO, SPARSE_ALWAYS
};
```
So x.sparse_mode is assigned in main; it is the parameter that steers all subsequent sparse-file behavior.
Next come the calls to do_copy, copy, and copy_internal. do_copy and copy handle wrapping and validation, including some directory logic, none of which matters for our sparse-file investigation; skip straight past them.
copy_internal is a very long function, mostly compatibility and adaptation logic, also irrelevant here. For a regular file it ends up calling copy_reg, the implementation of regular-file copying:
```c
else if (S_ISREG (src_mode)
         || (x->copy_as_regular && ! S_ISLNK (src_mode)))
  {
    copied_as_regular = true;
    /* Copy a regular file.  */
    if (! copy_reg (src_name, dst_name, x, dst_mode_bits & S_IRWXUGO,
                    omitted_permissions, &new_dst, &src_sb))
      goto un_backup;
  }
```
copy_reg is where copying an ordinary file really begins: it opens the source and target file handles and copies the data.
```c
static bool
copy_reg (...)
{
  /* Confirmed: data needs to be copied.  */
  if (data_copy_required)
    {
      /* Get block size, buffer size and related parameters.  */
      size_t buf_alignment = getpagesize ();
      size_t buf_size = io_blksize (sb);
      size_t hole_size = ST_BLKSIZE (sb);
      bool make_holes = false;

      /* is_probably_sparse tells whether the source file is sparse.  */
      bool sparse_src = is_probably_sparse (&src_open_sb);

      if (S_ISREG (sb.st_mode))
        {
          /* SPARSE_ALWAYS pursues extreme space efficiency, so it makes
             holes in the target whether or not the source is really
             sparse.  */
          if (x->sparse_mode == SPARSE_ALWAYS)
            make_holes = true;

          /* Under SPARSE_AUTO, if the source is sparse, the target may
             have holes too.  */
          if (x->sparse_mode == SPARSE_AUTO && sparse_src)
            make_holes = true;
        }

      /* If the target won't be sparse, copy more efficiently, e.g. use
         a larger buffer so each pass moves more data.  */
      if (! make_holes)
        {
          /* omitted */
        }

      /* If the source is sparse, try the more efficient extent_copy.  */
      if (sparse_src)
        {
          if (extent_copy (source_desc, dest_desc, buf, buf_size, hole_size,
                           src_open_sb.st_size,
                           make_holes ? x->sparse_mode : SPARSE_NEVER,
                           src_name, dst_name, &normal_copy_required))
            goto preserve_metadata;
        }

      /* Otherwise fall back to the standard sparse_copy.  */
      if (! sparse_copy (source_desc, dest_desc, buf, buf_size,
                         make_holes ? hole_size : 0,
                         x->sparse_mode == SPARSE_ALWAYS, src_name, dst_name,
                         UINTMAX_MAX, &n_read,
                         &wrote_hole_at_eof))
        {
          return_val = false;
          goto close_src_and_dst_desc;
        }
      /* omitted */
    }
}
```
The copy_reg code above is heavily simplified; only the key flow has been combed out.
Summary:
- The copy_reg function is where the real logic of cp-ing a regular file lives: the source file is opened, the target file is created, and the data is written;
- Before copying, the `is_probably_sparse` function decides whether the source file is sparse;
- Sparse always mode tries to generate a sparse target file whether or not the source is sparse (in this mode the data is checked for zeros, and holes are punched in the target wherever it is all zero);
- In sparse auto mode, a sparse source yields a sparse target;
- If the source file is sparse, cp tries the more efficient `extent_copy` function to copy the data;
- In never mode, the `sparse_copy` function is called to copy the data, with no attempt to punch holes; the copy is very slow and produces a real target file whose physical space exactly equals the file size;
The summary above mentions a few interesting points; let's dig into their secrets.
Question one: how does the `is_probably_sparse` function judge the source file?
Read the source and you'll find it very simple: stat the source file, get the file size and the number of physical blocks (512-byte blocks here), and compare:
```c
static bool
is_probably_sparse (struct stat const *sb)
{
  return (HAVE_STRUCT_STAT_ST_BLOCKS
          && S_ISREG (sb->st_mode)
          && ST_NBLOCKS (*sb) < sb->st_size / ST_NBLOCKSIZE);
}
```
For example, if the file size is 100 GiB and it occupies 8 physical blocks, then 100G / 512 bytes > 8, so the file is judged sparse.
If the file size is 4 KiB and it occupies 8 physical blocks, then 4K / 512 bytes == 8, so it is not sparse.
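The same heuristic is easy to reproduce outside coreutils. A minimal Go sketch (Linux, using the standard syscall package; Stat_t.Blocks counts 512-byte sectors):

```go
// probablySparse reports whether a file occupies fewer 512-byte sectors
// than its logical size would require, the same comparison cp makes.
func probablySparse(path string) (bool, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return false, err
	}
	return st.Blocks < st.Size/512, nil
}
```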
Question two: why is `extent_copy` more efficient?
The key lies in a helper called extent_scan_read, implemented in the extent-scan.c file. extent_copy starts by calling extent_scan_read to obtain the hole layout of the source file, and that is the root cause of extent_copy's efficiency: with the exact hole positions in hand, the copy can skip the holes and move only the valid data.
So, how is extent_scan_read implemented?
The answer is: the ioctl system call, with the FS_IOC_FIEMAP argument, which is the fiemap call.
```c
/* Call ioctl(2) with FS_IOC_FIEMAP (available in linux 2.6.27) to
   obtain a map of file extents excluding holes.  */
```
fiemap is a very important feature. Through ioctl with FS_IOC_FIEMAP you can obtain a file's physical space layout, letting user space learn which parts of a 100 GiB file actually have physical blocks storing data and which parts are holes.
The capability is provided by the file system, which means it only works if the file system implements the interface: ext4 does, ext2 does not.
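Driving fiemap from Go means assembling the FS_IOC_FIEMAP ioctl structures by hand. As an aside (my own addition, not what cp 8.30 uses), modern Linux (3.1+) also exposes hole layout through lseek with SEEK_DATA/SEEK_HOLE, which is much simpler to call. A sketch, assuming imports of fmt, os, and syscall:

```go
const (
	seekData = 3 // Linux SEEK_DATA
	seekHole = 4 // Linux SEEK_HOLE
)

// extents prints the [start, end) ranges of f that contain data,
// skipping holes, by alternating SEEK_DATA and SEEK_HOLE.
func extents(f *os.File) error {
	fd, off := int(f.Fd()), int64(0)
	for {
		start, err := syscall.Seek(fd, off, seekData)
		if err == syscall.ENXIO {
			return nil // no more data past off
		}
		if err != nil {
			return err
		}
		end, err := syscall.Seek(fd, start, seekHole)
		if err != nil {
			return err
		}
		fmt.Printf("data: [%d, %d)\n", start, end)
		off = end
	}
}
```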
Question three: why is `sparse_copy` slow? What happens inside it?
Unlike extent_copy, sparse_copy has no extent information to lean on; it simply reads the source sequentially and writes the data out.
However, sparse_copy treats a sufficiently large contiguous run of all-zero data as a hole: it skips writing it to the target file and punches a hole instead.
The function that decides whether a buffer is all zeros is is_nul, in the system.h header. Its implementation simply checks whether an entire block of memory is zero.
For example, when sparse_copy reads 4 KiB from the source and finds it all zeros, it does not write the corresponding position of the target; it punches a hole there instead, saving space.
Note, however, that this only happens under the aggressive sparse always strategy. Otherwise sparse_copy copies data faithfully: even all-zero data is written into the target file.
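The zero check itself is trivial. coreutils' is_nul is C; a Go rendering of the same idea:

```go
// isAllZero reports whether every byte in buf is zero, the check
// cp --sparse=always applies to each chunk it reads.
func isAllZero(buf []byte) bool {
	for _, b := range buf {
		if b != 0 {
			return false
		}
	}
	return true
}
```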
That is why, in always mode, the target file can occupy less physical space than the source; see the implementation of the `sparse_copy` function.
Why cp is fast
At this point cp's secret is fully revealed. Why is cp-ing a 100 GiB file so fast?
Because the source file is sparse. The file looks like 100 GiB but actually occupies only 2 MiB of physical space. File systems decouple the concepts of file size and physical space usage, allowing more flexible usage patterns and more efficient use of physical space.
By default, cp obtains the file's hole layout through the fiemap interface provided by the file system, then skips the holes and copies only the valid data. That drastically reduces the disk I/O volume, which is why it is so fast.
To summarize the three values of cp --sparse:
- auto mode: the default and the most faithful (and the fastest, provided the user has no all-zero data blocks). Copying follows the source file's actual space usage, and the target matches the source in both file size and physical blocks;
- always mode: minimizes space usage. Even if the source is not a sparse file, as long as it contains contiguous all-zero blocks, cp tries to punch holes in the target to save space. The target's physical blocks may therefore be fewer than the source's;
- never mode: the least efficient and slowest. Whatever the source is, all data is copied; holes and all-zero data alike are written out into the target;
Animation demo (the essence):

```
cp src.txt dest.txt
```

```
cp --sparse=always src.txt dest.txt
```

```
cp --sparse=never src.txt dest.txt
```
Sparse file application
Where are sparse files used?
- Database snapshots: taking a database snapshot produces a sparse file, which initially occupies no disk space. When a write hits the source database, the original data block is copied into the sparse file, once, before the modification;
- MySQL 5.7 has a data compression mode whose principle is the kernel's punch-hole feature: for a 16 KiB data page, everything except the page header is compressed before being written to the file, and the blank left by compression is holed out with punch hole. It occupies no space on disk, so physical space can be released quickly;
- Space reclamation for qemu disk image files;
Let’s do an experiment
Finally, let's run the experiment and check: did you really get it? Find a Linux machine and run the following commands.
Preparation of initial conditions
Step 1: create a file (expected to occupy 1 block):

```
echo =========== test ======= > test.txt
```
Step 2: truncate it into a 1 GiB sparse file:

```
truncate -s 1G ./test.txt
```
Step 3: pre-allocate the range from 1M to 1M+4K (with zeroing; expected to bring the file to 2 blocks, i.e. 8 KiB of data):

```
fallocate -o 1048576 -l 4096 -z ./test.txt
```
Step 4: Run the stat command to check the situation.
```
sh-4.4# stat ./test.txt
  File: test.txt
  Size: 1073741824    Blocks: 16         IO Block: 4096   regular file
Device: 6ah/106d    Inode: 3148347     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-12 15:37:54.427903000 +0000
Modify: 2021-03-12 15:46:00.456246000 +0000
Change: 2021-03-12 15:46:00.456246000 +0000
 Birth: -
```
Size: 1073741824, Blocks: 16. The size is 1 GiB, and stat's Blocks counts 512-byte sectors, so 16 sectors means 8 KiB of physical space.
In other words:
- The file size is 1 GiB;
- Actual data exists only at [0, 4K] and [1M, 1M+4K];
- The data in [0, 4K] is normal data; the data in [1M, 1M+4K] is all zeros;
With the initial conditions in place, let's try the three cp --sparse behaviors.
Experimental verification of cp
The default policy:

```
cp ./test.txt ./test.txt.auto
```

The always policy:

```
cp --sparse=always ./test.txt ./test.txt.always
```

The never policy (this command may return a bit slowly; make sure you have enough free space):

```
cp --sparse=never ./test.txt ./test.txt.never
```
This produces three target files: test.txt.auto, test.txt.always, and test.txt.never. Now stat each of them.
The results:
test.txt.auto
```
sh-4.4# stat ./test.txt.auto
  File: ./test.txt.auto
  Size: 1073741824    Blocks: 16         IO Block: 4096   regular file
Device: 6ah/106d    Inode: 3148348     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-13 15:58:57.395725000 +0000
Modify: 2021-03-13 15:58:57.395725000 +0000
Change: 2021-03-13 15:58:57.395725000 +0000
 Birth: -
```
- Size: 1073741824: the file size is 1 GiB
- Blocks: 16: it occupies 8 KiB of physical space, same as the source (the all-zero block was copied as data)
test.txt.always
```
sh-4.4# stat ./test.txt.always
  File: ./test.txt.always
  Size: 1073741824    Blocks: 8          IO Block: 4096   regular file
Device: 6ah/106d    Inode: 3148349     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-13 15:59:01.064725000 +0000
Modify: 2021-03-13 15:59:01.064725000 +0000
Change: 2021-03-13 15:59:01.064725000 +0000
 Birth: -
```
- Size: 1073741824: the file size is 1 GiB
- Blocks: 8: it occupies only 4 KiB of physical space (the all-zero block was punched out)
test.txt.never
```
sh-4.4# stat ./test.txt.never
  File: ./test.txt.never
  Size: 1073741824    Blocks: 2097160    IO Block: 4096   regular file
Device: 6ah/106d    Inode: 3148350     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-03-13 15:59:04.774725000 +0000
Modify: 2021-03-13 15:59:05.977725000 +0000
Change: 2021-03-13 15:59:05.977725000 +0000
 Birth: -
```
- Size: 1073741824: The file Size is 1 gb
- Blocks: 2097160: Physical space occupies 1 GB
So, did you get it?
Knowledge summary
- A file system presents file semantics to the outside world; in essence it is software that manages disk space;
- A typical file system is divided into superblock, inode area, and block area (block group descriptors and bitmaps are not covered here). Inside the file system, a file takes the shape of an inode recording metadata plus blocks storing user data;
- A file's size is its logical size, not the real physical space it occupies; the two are not the same thing;
- Sparse semantics is a feature provided by file systems whose fundamental purpose is to use disk space more efficiently;
- Post-allocation is the most effective way to use space. What do public clouds earn on cloud disks? Post-allocation: you buy a 2 TiB cloud disk, and before you write any data not a single byte is allocated, yet you pay the 2 TiB price;
- stat's Blocks field counts sectors of 512 bytes each;
- The holes of a sparse file are indistinguishable from real user data that is all zeros; from the outside they look identical (this matters);
- Through the `ioctl` (fiemap) system call, the cp command can obtain the distribution of a file's holes and skip them during the copy, greatly improving efficiency (for the 100 GiB source file, cp finished with only a dozen or so I/Os, so one second was plenty);
- Of cp --sparse's three values, auto is the fastest, always saves the most space, and never copies the most data. The target file produced by a casual cp can actually differ physically from the source without you ever noticing;
- Pre-allocation and punch hole are both the `fallocate` call with different arguments; when punching holes, note that 4 KiB alignment is required to free space;
- Punch hole on sparse files has many applications; it is usually used to release space quickly, for example with image files;
Afterword
This article started from the cp command, something everyone sees and uses daily without thinking twice, and dug into its principles through a phenomenon we usually overlook. This pass over cp earned us a small stash of secret knowledge.
I spent an hour telling my colleague about it. Watching his face, I felt he had truly learned it, and I was quite satisfied. Or am I reading too much into things? He didn't even invite me to lunch.