[toc]
Original content is not easy to produce. Welcome to follow the public account: queer cloud storage, for more practical material.
Is the written data safe?
Consider a question: when is written data actually safe?
If a user sends an IO request and you reply "write successful", the data must still be readable after a power failure, a restart, and so on.
So, setting aside silent (at-rest) data errors, what is the essential requirement for data safety?
Underline this: the data must be on a non-volatile storage medium before you may reply "write successful" to the user. Remember this sentence; storage developers spend 80% of their time thinking about it.
So what are the common volatile and non-volatile media?
- Volatile media: registers, memory, etc.
- Non-volatile media: disks, solid-state drives, etc.
Take a look at the simplified classic pyramid:
From top to bottom the speed decreases, the capacity increases, the price decreases.
Linux IO in brief
We covered earlier how a file is read and written, via the standard library and via system calls. Either way, the IO is file-based and passes through the layers of the IO stack: system call -> VFS -> file system -> block device layer -> hardware driver.
We open a file and write data into it. Now consider: when write returns successfully, has the data reached the disk?
The answer is: not really.
Because of the file system cache, the default mode is write-back: the data is written successfully into memory, and the kernel flushes it to disk asynchronously as it sees fit (for example, periodically, or when the amount of dirty data crosses a threshold).
The benefit is apparent write performance: writes look very fast because they run at memory speed, not disk speed. The drawback is the risk to the data: when the user receives "success", the data may still be only in memory, and if the machine loses power at that moment, the data is lost, because memory is a volatile medium. Losing data is the most unacceptable failure for a storage system; it is the lifeblood of storage.
Animation demonstration:
How to ensure the reliability of the data?
Key point: the same sentence again. Only after the data has reached the disk may you return success to the user.
So how do we ensure this? There are three approaches.
- Open the file with the `O_DIRECT` flag; subsequent `write`/`read` calls bypass the file system cache and go directly to disk;
- Open the file with the `O_SYNC` flag, so that every IO is synchronous; or actively call `fsync` after `write` to force the data to disk;
- Map the file into the process address space with `mmap`; reads and writes to that memory region are actually translated into reads and writes of the file. After a `write`, call `msync` to force a flush to disk.
Three safe IO approaches
O_DIRECT mode
Direct IO mode ensures that each IO accesses the disk directly, rather than returning success to the user once the data is in memory, and so guarantees data safety. Because memory is volatile and lost on power failure, data is safe only once it is on persistent media.
Animation demonstration:
Data is read from and written to the disk directly rather than being cached in memory, which also saves system memory.
The downside is equally obvious: since every IO goes to the device, performance looks poor (but understand that this is the real performance of the disk).
When O_DIRECT mode is used, the user must follow the alignment rules, otherwise the IO will fail with an error:
- The size of a disk IO must be aligned to the sector size (512 bytes);
- The offset of a disk IO must be aligned to the sector size;
- The memory buffer address must also be sector-aligned.
Example in C:
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

/* Round the pointer p up to the next multiple of a */
#define align_ptr(p, a) \
    (char *)(((uintptr_t)(p) + ((uintptr_t)a - 1)) & ~((uintptr_t)a - 1))

int main(int argc, char **argv)
{
    char timestamp[8192] = {0};
    char *timestamp_buf = NULL;
    int timestamp_len = 0;
    ssize_t n = 0;
    int fd = -1;

    fd = open("./test_directio.txt", O_CREAT | O_RDWR | O_DIRECT, 0644);
    assert(fd >= 0);

    /* Align the memory address to 512 bytes */
    timestamp_buf = align_ptr(timestamp, 512);
    timestamp_len = 512;

    n = pwrite(fd, timestamp_buf, timestamp_len, 0);
    printf("ret (%ld) errno (%s)\n", n, strerror(errno));

    return 0;
}
```
Compile command:

```sh
gcc -ggdb3 -O0 test.c -D_GNU_SOURCE
```

Run the resulting binary and you will see it succeed:

```sh
sh-4.4# ./a.out
ret (512) errno (Success)
```
If you want to see the failure case, make the IO offset or size, or the buffer address, misaligned with 512 (for example, change timestamp_buf to be aligned to 1 and run again), and you will get:

```sh
sh-4.4# ./a.out
ret (-1) errno (Invalid argument)
```
A question to consider: some readers may ask, I can align the IO size and offset to 512, but how do I make sure the address malloc returns is 512-byte aligned?
Indeed, malloc gives us no control over the returned address. There are two solutions:
Method 1: allocate a larger chunk of memory, then find an aligned address inside it, making sure the IO size does not run past the end of the chunk.
In the demo above, I allocated an 8192-byte block and found a 512-aligned address inside it; 512 bytes starting from that address cannot cross the end of the block, so alignment is achieved safely.
This approach is simple and universal, but it wastes memory.
Method 2: use the POSIX interface `posix_memalign` to allocate the memory; this interface guarantees the alignment of the allocation.
For example, to allocate a 1 KiB buffer whose address is aligned to 512 bytes:

```c
ret = posix_memalign(&buf, 512, 1024);
if (ret) {
    return -1;
}
```
A question to consider: what are the typical application scenarios for O_DIRECT IO?

- The most common is a database system: a database has its own cache and IO optimizations and does not need the kernel to spend memory doing the same work (which may even get in the way);
- Scenarios that manage a raw block device directly, without formatting a file system on it.
Standard IO + sync
The sync family of functions forcibly flushes the kernel buffer to disk.
In Linux's buffered IO mechanism, there is a layer of volatile medium between the user and the disk: the kernel-space buffer cache.
- Read data is cached in memory to improve subsequent read performance.
- When user data is written into the memory cache, success is returned to the user immediately, and the flush to disk happens asynchronously later, which improves write performance.
Reads work as follows:

- The operating system first checks whether the kernel buffer cache holds the data; if so, it returns directly from the cache;
- Otherwise, the data is read from disk and then cached in the operating system cache.
Writes work as follows:

- Once the data is copied from user space into the kernel's memory cache, success is returned to the user and the write is complete;
- When the in-memory data actually reaches the disk is decided by operating system policy (if the machine loses power before then, the user's data is lost);
- So, to guarantee persistence, you must explicitly call a sync function to flush the data to disk (once it is on disk, a power failure no longer loses it).
Key point: the sync mechanism ensures that all data produced before the current point in time is flushed to disk. There are two ways to use it:

- Open the file with the `O_SYNC` flag;
- Explicitly call `fsync` or a similar system call.
Method 1: Use the O_SYNC flag for open.
Example in C:
```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buffer[512] = {0};
    ssize_t n = 0;
    int fd = -1;

    fd = open("./test_sync.txt", O_CREAT | O_RDWR | O_SYNC, 0644);
    assert(fd >= 0);

    n = pwrite(fd, buffer, 512, 0);
    printf("ret (%ld) errno (%s)\n", n, strerror(errno));

    return 0;
}
```
This ensures that every IO is synchronous and must be flushed to disk before returning. It is rarely used, however, because it performs poorly and leaves no room for batch optimization.
Animation demonstration:
Method 2: call `fsync` separately.
Here fsync is called after write to flush the data to disk. This method is used more often because it leaves room for service-level optimization: the programmer chooses when to call fsync, trading off safety against performance.
For example, you could write 10 times and then call fsync once at the end, which still guarantees the data is on disk while batching the IO.
There are several similar functions in this family, with slight differences. Here are some of them:
```c
/* Flush the file's data and metadata */
int fsync(int fildes);

/* Flush the file's data (plus only the metadata needed to read it back) */
int fdatasync(int fildes);

/* Flush the entire kernel buffer cache */
void sync(void);
```
Animation demonstration:
mmap + msync
This is a very interesting IO mode: the mmap function maps a file on disk into the process address space, and accesses to that memory region are translated by the kernel into accesses to the corresponding locations in the file. It looks like memory access, but in effect it is file IO.
```c
void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);

int munmap(void *addr, size_t len);
```
mmap reduces data copying between user space and kernel space. For large amounts of data, memory-mapped file access can be more efficient (less memory copying, aggregated IO, batched flushing, fewer IO operations).
Of course, written data is still flushed asynchronously; there is no real-time persistence. To guarantee that the data is on disk, you must call msync.
Advantages of MMAP:
- Reduce the number of system calls. Only one mmap system call is required. All subsequent operations are memory copy operations, not write/read system calls.
- Reduce the number of data copies;
Example in C:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>

int main(void)
{
    int ret = -1;
    int fd = -1;

    fd = open("test_mmap.txt", O_CREAT | O_RDWR, 0644);
    assert(fd >= 0);

    /* Size the file first so the mapping is backed by real file space */
    ret = ftruncate(fd, 512);
    assert(ret >= 0);

    char *const address =
        (char *)mmap(NULL, 512, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(address != MAP_FAILED);

    /* This is where the magic happens: it looks like a memory copy,
     * but it is actually file IO */
    strcpy(address, "hello, world");

    ret = close(fd);
    assert(ret >= 0);

    /* Ensure the data reaches the disk */
    ret = msync(address, 512, MS_SYNC);
    assert(ret >= 0);

    ret = munmap(address, 512);
    assert(ret >= 0);

    return 0;
}
```
Let's compile and run it:

```sh
gcc -ggdb3 -O0 test_mmap.c -D_GNU_SOURCE
```

Afterwards there is a test_mmap.txt file with "hello, world" in it.
Animation demonstration:
Hardware cache
The above guarantees the flush down through the file system layer, but the disk hardware itself also has a cache, the hardware cache, and this layer too is volatile. So, to truly guarantee that data is persistent, the hard disk cache should also be turned off.

```sh
# Check the write-cache state
hdparm -W /dev/sda

# Disable the disk write cache for strong data consistency;
# avoids losing un-flushed data on power failure
hdparm -W 0 /dev/sda

# Enable the disk write cache (data may be lost on power failure)
hdparm -W 1 /dev/sda
```
With the IO approaches above, once a write IO completes, the data can be said to be on disk, safe from power loss.
Conclusion
- The data must be on a non-volatile storage medium before you may reply "write successful" to the user. Anything else is cutting corners and walking a tightrope;
- This article summarized the three most fundamental safe-IO approaches: O_DIRECT writes, standard IO + sync, and mmap writes + msync. Either every IO is written synchronously to disk, or each write is followed by an explicit sync call to flush it; only then is the data safe;
- O_DIRECT imposes strict requirements on the user: the IO offset, the IO length, and the memory buffer address must all be sector-aligned;
- Note that the hard disk also has a cache, which can be toggled with the `hdparm` command.
Afterword
Finally you can rest assured that the data has made its way to disk. But do you think the data is safe now? There is much more to it: what if the disk suffers a silent error? Can the data still be recovered? How do you ensure nothing goes wrong during network transmission? Or during memory copies? I'll cover these later.