I like Linux very much, and parts of its design are especially elegant. For example, a complex problem can be broken down into several small ones, which are then solved by combining off-the-shelf tools with pipes and redirection. This makes writing shell scripts very efficient.
I’ve written several articles about Linux before:
- Linux process/pipe/redirect/file descriptor
In this article, I will share some of the problems I have run into with redirection and pipes in practice, and explain the underlying principles so that you can write scripts more efficiently.
Pitfalls of the > and >> redirection operators
First, what happens when you run the following command?
$ cat file.txt > file.txt
Reading and writing the same file feels like nothing’s going to happen, right?
In fact, the result of running the above command is to empty the contents of file.txt.
PS: Some Linux distributions may report an error directly. You can run cat < file.txt > file.txt to bypass this check.
As mentioned in the earlier article on Linux processes and file descriptors, a program does not need to care where its standard input and output point. It is the shell that changes where they point, using pipes and redirection operators.
Therefore, when you run cat file.txt > file.txt, the shell first opens file.txt and, because the redirection operator is >, empties it. The shell then points the standard output of the cat command at file.txt, and only then does cat start running.
That is, the following process:
1. The shell opens file.txt and empties its contents.
2. The shell points the standard output of the cat command to file.txt.
3. The shell runs the cat command, which reads an empty file.
4. The cat command has nothing to write to its standard output (file.txt).
As a result, file.txt becomes an empty file.
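One way to check that the shell, not the command, empties the file is to redirect a command that never reads or writes anything. The following is a minimal sketch, assuming a POSIX shell. The scratch directory and the variable names are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf 'some content\n' > file.txt

# `true` neither reads nor writes anything, yet file.txt still ends up empty:
# the shell truncates the file while it sets up the redirection.
true > file.txt

if [ -s file.txt ]; then
    result="not empty"
else
    result="empty"
fi
echo "file.txt is $result"
```

If the truncation were done by the command itself, file.txt would keep its content here.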
We know that > empties the target file and >> appends to the end of the target file, so what happens if we change the redirection operator from > to >>?
$ cat file.txt >> file.txt
Suppose file.txt initially contains one line. After running cat file.txt >> file.txt, you would expect it to contain two lines.
Unfortunately, the result is not what you would expect. Instead, cat keeps writing to file.txt in an endless loop, and the file soon grows so large that you have to stop the command with Control+C. (As with the previous command, some versions of cat detect that the input file is also the output file and report an error instead.)
Now, that’s interesting. Why does it loop? A little analysis reveals the reason:
First, recall how the cat command behaves. If you run cat on its own, it reads keyboard input from the terminal, and each time you press Enter it echoes what you typed. In other words, cat does not wait for all of its input: it reads a piece of data (a line, in the terminal case), writes it out, and then reads again.
The command cat file.txt >> file.txt is therefore executed as follows:
1. The shell opens file.txt and prepares to append content to the end of the file.
2. The shell points the standard output of the cat command to file.txt.
3. The cat command reads a line from file.txt and writes it to its standard output (appending it to file.txt).
4. After that line has been written, cat finds that there is still something left to read in file.txt, and repeats step 3.
This process is like iterating over a list while appending elements to it. It never finishes, so the command loops endlessly.
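The read-then-append cycle can be reproduced safely with a capped loop. The sketch below uses a while read loop in place of cat, because some cat implementations refuse to run when the input file is also the output file. The cap of five iterations and all names are assumptions made for this illustration only.

```shell
#!/bin/sh
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf 'line\n' > file.txt

# Read file.txt line by line while appending every line back to it.
# Each append gives the reader one more line to read, so without the
# cap below the loop would never reach the end of the file.
count=0
while IFS= read -r line; do
    printf '%s\n' "$line" >> file.txt
    count=$((count + 1))
    if [ "$count" -ge 5 ]; then
        break   # safety cap, added for this sketch only
    fi
done < file.txt

lines=$(wc -l < file.txt)
echo "appended $count times; file.txt now has $((lines)) lines"
```

The file started with a single line, yet the loop only stops because of the cap: every line it writes becomes a new line for it to read.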
Combining the > redirection operator with the | pipe operator
We often need to keep the first few lines of a file and delete the rest.
In Linux, the head command outputs the first few lines of a file:
$ cat file.txt                  # file.txt contains five lines
1
2
3
4
5
$ head -n 2 file.txt            # head reads the first two lines
1
2
$ cat file.txt | head -n 2      # head can also read from standard input
1
2
If we want to keep the first two lines of the file and delete the rest, we might use the following command:
$ head -n 2 file.txt > file.txt
However, this runs into the problem described above: file.txt ends up empty, which is not what we want.
Can we avoid the pitfall by writing the command like this?
$ cat file.txt | head -n 2 > file.txt
The answer is no: the file will still be emptied.
What? Did the pipe leak and lose all the data?
In essence, a pipe connects the standard input and output of two commands: the standard output of one command becomes the standard input of the next.
However, if you think writing the command this way will give the expected result, it is probably because you assume that commands joined by a pipe are executed serially. This is a common mistake: in fact, commands joined by a pipe are executed in parallel.
You might expect the shell to run cat file.txt, read everything in file.txt normally, and then pipe it to head -n 2 > file.txt.
Although the contents of file.txt are emptied at this point, head does not read from a file, but from a pipe, so two lines should be written correctly to file.txt.
In practice, however, this is not what happens. The shell runs the commands joined by a pipe in parallel. Take the following command:
$ sleep 5 | sleep 5
The shell starts the two sleep processes at the same time, so the command takes 5 seconds rather than 10.
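The timing is easy to measure. The sketch below shortens the sleeps to 2 seconds and assumes that date +%s (seconds since the epoch) is available on your system. The variable names are illustrative.

```shell
#!/bin/sh
start=$(date +%s)

# Both sleep processes are started together, so the pipeline finishes
# after roughly 2 seconds, not 4.
sleep 2 | sleep 2

end=$(date +%s)
elapsed=$((end - start))
echo "pipeline took about ${elapsed}s"
```

If the two commands ran one after the other, the elapsed time would be at least 4 seconds.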
This is a little counterintuitive. Take this common command:
$ cat filename | grep 'pattern'
The intuition seems to be that the cat command reads all the contents of filename at once, and then passes it to grep for search.
In fact, the cat and grep commands run at the same time. We still get the expected result because grep 'pattern' blocks while waiting for standard input, and cat writes data to grep’s standard input through a Linux pipe.
You can see for yourself that cat and grep run at the same time by executing the following command, in which grep processes what you type on the keyboard in real time:
$ cat | grep 'pattern'
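A non-interactive version of the same experiment is sketched below. A slow producer stands in for the keyboard, and a while read loop stands in for grep so that each line can be timestamped as it arrives. The file name arrivals.txt and the 3-second delay are arbitrary choices for this illustration, and date +%s is assumed to be available.

```shell
#!/bin/sh
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
start=$(date +%s)

# Producer: one line immediately, a second line three seconds later.
{ echo "first"; sleep 3; echo "second"; } |
while IFS= read -r line; do
    # Consumer: note how long after the start each line arrived.
    echo "$line arrived after $(( $(date +%s) - start ))s"
done > arrivals.txt

cat arrivals.txt
```

The first line is reported almost immediately, long before the producer has finished, which shows that the consumer is already running and reading from the pipe while the producer is still working.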
With that said, let’s go back to our original question:
$ cat file.txt | head -n 2 > file.txt
The cat and head commands are executed in parallel, so the result depends on which one gets there first and is not deterministic.
If the output redirection for head empties file.txt before cat has read it, cat reads nothing. If, on the other hand, cat reads the contents of the file first, we get the expected result.
However, in my experiment (repeating this concurrent command 10,000 times), the outcome in which file.txt is emptied was far more likely than the expected result. I am not sure why, but it is presumably related to how the Linux kernel implements processes and pipes.
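An experiment of this kind can be sketched as follows. This is not the original test script. The run count of 200 and the variable names are assumptions, and the split between the two outcomes will vary from machine to machine and from run to run.

```shell
#!/bin/sh
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

runs=200      # arbitrary; the experiment described above used 10,000
emptied=0
kept=0
i=0
while [ "$i" -lt "$runs" ]; do
    printf '1\n2\n3\n4\n5\n' > file.txt     # restore the five-line file
    cat file.txt | head -n 2 > file.txt     # the racy command under test
    if [ -s file.txt ]; then
        kept=$((kept + 1))                  # cat read the file before it was emptied
    else
        emptied=$((emptied + 1))            # the file was emptied first
    fi
    i=$((i + 1))
done
echo "emptied: $emptied, kept some content: $kept (out of $runs runs)"
```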
The solution
Having said all that about pipes and redirection, how do you avoid the pitfall of emptying a file?
The best approach is not to read and write the same file at the same time, but to go through a temporary file.
For example, to keep only the first two lines in file.txt, you could write code like this:
# Write the data to a temporary file first, then overwrite the original file
$ cat file.txt | head -n 2 > temp.txt && mv temp.txt file.txt
It’s the simplest, most reliable, foolproof method.
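In a script, the same idea can be wrapped in a small function. The sketch below is one possible way to do it. The name keep_first_lines is hypothetical, and the temporary file is created with mktemp next to the target instead of using a fixed name such as temp.txt.

```shell
#!/bin/sh
# keep_first_lines N FILE: hypothetical helper that keeps only the
# first N lines of FILE, going through a temporary file.
keep_first_lines() {
    n=$1
    file=$2
    # Create the temporary file next to the target so that mv stays
    # on the same filesystem.
    tmp=$(mktemp "$file.XXXXXX") || return 1
    if head -n "$n" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage, in a scratch directory:
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '1\n2\n3\n4\n5\n' > file.txt
keep_first_lines 2 file.txt
cat file.txt
```

One caveat of this sketch: mv replaces the original file with the temporary one, so the result carries the permissions that mktemp gave the temporary file (readable and writable by the owner only).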
If you find that command too long, you can also install the moreutils package with apt/brew/yum, which provides the sponge command:
# Pass the data to sponge, which then writes it to the original file
$ cat file.txt | head -n 2 | sponge file.txt
The sponge command soaks up all of its input before writing it to file.txt. It plays the role of the temporary file and so avoids reading and writing the same file at the same time.
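The "absorb first, write afterwards" idea can be tried even on a machine without moreutils. The sketch below uses sponge when it is installed. Otherwise it falls back to holding the data in a shell variable, which is only an approximation chosen for this example and not how sponge itself is implemented.

```shell
#!/bin/sh
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '1\n2\n3\n4\n5\n' > file.txt

if command -v sponge >/dev/null 2>&1; then
    # sponge reads all of its input before it opens file.txt for writing.
    head -n 2 file.txt | sponge file.txt
else
    # Fallback for this sketch: the command substitution finishes reading
    # before the next line truncates file.txt. It only suits text without
    # NUL bytes, and trailing blank lines are not preserved.
    content=$(head -n 2 file.txt)
    printf '%s\n' "$content" > file.txt
fi
cat file.txt
```

In both branches the read is complete before file.txt is opened for writing, which is exactly what the racy pipeline could not guarantee.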
Those are some of the pitfalls of redirection and pipes. I hope this helps.