Original: Coding Diary (WeChat official account: Codelogs). Feel free to share; please keep the source when reprinting.

Introduction

This is the second article in this series of Linux command notes. It focuses on commands related to text processing, such as xargs, grep, sed, and awk.

Series index: Linux command notes - Getting started

Common text-related commands

cat, tac, and less

# Print the input to standard output
$ seq 3 | cat
# -n outputs with line numbers
$ seq 3 | cat -n
# -A reveals special characters
$ seq 3 | cat -A
# tac prints lines in reverse order
$ seq 3 | tac
3
2
1

less is used to view file contents.

$ less app.log

less is also an interactive program, with keybindings similar to vim:

Operation    Description
Ctrl+f       Page forward
Ctrl+b       Page backward
g            Jump to the first line
G            Jump to the last line (press Shift+g)
63G          Jump to line 63
j or ↓       Scroll down one line
k or ↑       Scroll up one line
q            Quit less
/abc         Search forward for abc; press n for the next match, N for the previous match
?abc         Search backward for abc; press n for the previous match, N for the next match
F            Keep displaying new content appended to the file, like tail -f (press Shift+f)
v            Open the current file in the editor
-N           Show line numbers (press -, then Shift+n, then Enter)
-I           Case-insensitive search (press -, then Shift+i, then Enter)
-S           Chop long lines instead of wrapping them (press -, then Shift+s, then Enter)
-R           Keep colors in the output (press -, then Shift+r, then Enter)
-F           Quit immediately if the content fits on one screen (press -, then Shift+f, then Enter)

In addition, less is often used to view the output of other commands. For example, ps -ef tends to produce a lot of output and pushes earlier results off the screen; piping it into less, as in ps -ef | less, avoids that problem.

head and tail

# Display the first 10 lines
$ seq 20 | head -n10
# Display the last 10 lines
$ seq 20 | tail -n10
# Display from line 10 to the end
$ seq 20 | tail -n+10
# Keep watching the file for newly appended content
$ tail -f temp.txt
# Generate 16 random bytes as hex
$ cat /dev/urandom | head -c 16 | xxd -ps

wc, sort, and uniq

# count the number of lines, words, and bytes
$ seq 5 | wc 
      5       5      10

# count rows only
$ seq 5 | wc -l
5

# sort: -n for numeric sort, -r for reverse order, -k1 to sort by the first column
$ seq 5 |sort -nrk1
5
4
3
2
1

# uniq: the data should be sorted before using uniq; -c counts occurrences
$ (seq 6; seq 3 8) | sort | uniq -c
      1 1
      1 2
      2 3
      2 4
      2 5
      2 6
      1 7
      1 8
# union
cat a b | sort | uniq > c
# intersection
cat a b | sort | uniq -d > c
# difference (a - b)
cat a b b | sort | uniq -u > c
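A quick sketch of these set operations with two small hypothetical files, assuming neither file contains duplicate lines of its own:

$ printf '1\n2\n3\n' > a; printf '2\n3\n4\n' > b
$ cat a b | sort | uniq        # union: 1 2 3 4
$ cat a b | sort | uniq -d     # intersection: 2 3
$ cat a b b | sort | uniq -u   # difference a - b: 1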

grep

grep searches content with regular expressions: it reads the input line by line, checks whether each line matches the pattern, and prints the lines that match.

# grep uses regular expressions by default (BRE), which does not support +, ?, or \d
$ seq 12|grep '11*'
1
10
11
12
# -F searches for a plain string instead of a regular expression (no line contains the literal text 11*, so nothing matches)
$ seq 12|grep -F '11*'

# -w matches whole words only, so only the line that is exactly the word 1 matches
$ seq 12|grep -w '1'
1
# -E uses ERE, which supports + and ? but still not \d
$ seq 12|grep -E '1+'
1
10
11
12
# -P uses PCRE, which supports +, ?, and \d
$ seq 12|grep -P '\d\d+'
10
11
12
# -v inverts the match, displaying lines that do not contain 1
$ seq 12|grep -v 1
2
3
4
5
6
7
8
9
# -o prints only the matched text, not the entire line
$ echo hello,java|grep -oP '\w+'
hello
java
# -c Displays the number of rows searched
$ seq 12|grep -P '\d\d+' -c
3
# -m limits the number of rows searched to 2
$ seq 12|grep -P '\d\d+' -m 2
10
11

# search for 10 and also display the next 2 lines (-A2)
$ seq 12|grep -A2 -w 10
10
11
12
# search for 10 and also display the previous 2 lines (-B2)
$ seq 12|grep -B2 -w 10
8
9
10
# search for 10 and also display 2 lines before and after (-C2)
$ seq 12|grep -C2 -w 10
8
9
10
11
12

# -r recursively searches files under the current directory for the word 8080; -n displays the line numbers of the matches
$ grep -rn -w 8080 .
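If you only want certain file types in a recursive search, GNU grep's --include option helps; a small sketch reusing the 8080 example:

# only search XML files under the current directory for the word 8080
$ grep -rn --include='*.xml' -w 8080 .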

find and ls

ls is used to list files in the current directory

# list the file names in the current directory
ls
# -l lists files along with attributes such as owner, modification time, and size
ls -l
# list .txt files in the current directory
ls *.txt
# list files sorted by modification time, newest first
ls -lt
# list files sorted by size, largest first
ls -lS

find is usually used to search for files recursively.

# -name matches the file name and -type f restricts results to regular files; this finds .txt files under the current directory
find -name '*.txt' -type f
# Find files larger than 800MB
find . -type f -size +800M 
# Files modified within 1 minute
find . -type f -mmin -1 
# Files modified within 7 days
find . -type f -mtime -7 
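find can also run a command on each result directly with -exec; here is a small sketch that counts the lines of the .txt files it finds:

# '{} +' appends as many file names as possible to a single wc invocation, similar to xargs
$ find . -type f -name '*.txt' -exec wc -l {} +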

xargs

xargs converts data from the standard input stream into command arguments and then runs the command. It exists because some commands cannot read from standard input and only accept arguments, such as kill. For example, when we want to kill all Java processes, we can do this:

# Use pgrep to find the Java processes
pgrep java
856
857

# Use kill to kill both Java processes
kill 856 857
# Written as a single command, using bash's command substitution syntax
kill `pgrep java`
kill $(pgrep java)

# If kill accepted only one argument at a time, you could use bash's for or while loop
for pid in `pgrep java`;do kill $pid; done
pgrep java | while read pid;do kill $pid;done

As you can see, the commands for this scenario keep getting more complex; xargs solves the problem neatly:

# xargs splits the input stream on whitespace into arguments and passes them all to kill, equivalent to kill `pgrep java` above
pgrep java | xargs kill
# with -n1, xargs passes one argument at a time to kill, similar to the for and while loops above
pgrep java | xargs -n1 kill

Here is a taste of the common xargs options:

Splitting arguments

# xargs separates arguments on whitespace by default; when no command is specified it runs echo, and by default it passes as many arguments as possible to each invocation
$ seq 20|xargs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# use -d to specify the delimiter
$ seq -s, 20|xargs -d,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Note: when -d is not specified, xargs splits on whitespace, which means spaces, tabs, and newlines; consecutive whitespace is treated as a single separator. Most Linux commands that split text into columns, such as sort above and awk below, follow the same convention.

Notice the difference with or without -d as follows:

$ seq 20|xargs printf '"%s"\n'|xargs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

$ seq 20|xargs printf '"%s"\n'|xargs -d'\n' 
"1" "2" "3" "4" "5" "6" "Seven" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"

When -d is not specified, xargs gives special treatment to single quotes, double quotes, and backslashes (they quote arguments and are stripped); when -d is specified they are passed through literally.
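A small sketch of this quote handling:

# without -d, the quotes group two words into a single argument and are then stripped
$ echo '"a b" c' | xargs -n1
a b
c
# with -d' ', the quote characters are kept as ordinary characters
$ printf '"a b" c' | xargs -d' ' -n1
"a
b"
c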

Batching arguments

# use -n or -L to specify how many arguments are passed per invocation
$ seq 20|xargs -n4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
$ seq 20|xargs -L4
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20

Notice the subtle difference between -n and -L:

$ seq 20|xargs -n2
1 2 
3 4 
5 6 
7 8 
9 10 
11 12 
13 14 
15 16 
17 18 
19 20

$ seq 20|xargs -n2|xargs -n4
1 2 3 4 
5 6 7 8 
9 10 11 12 
13 14 15 16 
17 18 19 20

$ seq 20|xargs -n2|xargs -L4
1 2 3 4 5 6 7 8 
9 10 11 12 13 14 15 16 
17 18 19 20

When -d is not specified, -n splits arguments on whitespace (spaces, tabs, and newlines), while -L splits on newlines and treats each input line as a unit.

Parameter markers

# Use {} as a placeholder for the argument
seq 20|xargs -i echo 'id={}'

Experience the details of -i as follows:

$ seq 20|xargs -n2|xargs -i echo 'id={}'
id=1 2 
id=3 4 
id=5 6 
id=7 8 
id=9 10 
id=11 12 
id=13 14 
id=15 16 
id=17 18 
id=19 20

It appears that when -d is not specified and -i is used, a newline character is used by default to split the argument.
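As a small practical sketch of the placeholder (-I{} is the non-deprecated spelling of -i; the .conf files are hypothetical, and file names are assumed to contain no whitespace):

# back up every .conf file in the current directory by copying it with a .bak suffix
$ ls *.conf | xargs -I{} cp {} {}.bak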

Debugging and concurrency

# Use -p to inspect the exact command xargs is about to run
$ seq 20|xargs -i -p echo 'id={}'
echo 'id=1' ?...y
id=1
echo 'id=2' ?...
# use -P to run commands concurrently: without -P4 the following takes 10s (1+2+3+4), with -P4 it takes about 4s
$ seq 4|xargs -n1 -P4 sleep

Sometimes you can even do simple stress tests with xargs's -P option.
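For example, a rough sketch of such a stress test; the URL is just a placeholder for whatever service you want to hit:

# send 100 requests, 10 at a time in parallel; -s -o /dev/null discards curl's output
$ seq 100 | xargs -I{} -P10 curl -s -o /dev/null 'http://localhost:8080/ping?id={}'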

Common scenario

xargs is often combined with ls, find, and grep to search for content in specific files: ls and find locate the files, and xargs turns the found file names into grep arguments, as follows:

# Search for the port 8080 configuration in all XML files in the current directory
ls *.xml |xargs grep -w 8080
# Search for the port 8080 configuration in XML files in the current directory and its subdirectories
find -name '*.xml'|xargs grep -w 8080
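One caveat: if file names may contain spaces or other special characters, a safer variant with GNU find and xargs is to separate the names with NUL bytes:

find -name '*.xml' -print0 | xargs -0 grep -w 8080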

sed

sed is commonly used to replace and modify text. Its name stands for stream editor, and in fact you can think of it as a very small scripting language.

Syntax

The basic syntax takes the form pattern action: sed reads each line into the pattern space, checks whether it matches the pattern, and executes the action if it does.

Note: the pattern space is explained in detail later; for now, just think of it as a variable that stores the current line.

For example, sed '3,5 s/a/c/g' replaces a with c in lines 3 to 5: 3,5 is the pattern part and s/a/c/g is the action part, and the action runs only on lines that satisfy the pattern. The pattern part can be omitted, in which case the action runs on every line.

# yes can be used to repeatedly generate strings for our test data
$ yes abcde|head -n5
abcde 
abcde 
abcde 
abcde 
abcde
# replace a with c in lines 3 through 5, where g means replace all occurrences
$ yes abcde|head -n5|sed '3,5 s/a/c/g'
abcde 
abcde 
cbcde 
cbcde 
cbcde

In addition, pattern action can also be in the following form:

  1. sed '3,5 s/a/c/g; 2,4 s/b/d/g' replaces a with c in lines 3 to 5, and b with d in lines 2 to 4.

  2. sed '3,5{3,4 s/a/c/g; 4,5 s/b/d/g}' replaces a with c in lines 3 to 4 and b with d in lines 4 to 5, but only within lines 3 to 5; in fact, you can think of form 1 as this form with the braces omitted, which is allowed when there is only one action.

  3. sed '3,5! s/a/c/g' replaces a with c on the lines that are not lines 3 through 5.

# replace a with c in lines 3 to 5, and b with d in lines 2 to 4
$ yes abcde|head -n5|sed '3,5 s/a/c/g; 2,4 s/b/d/g'
abcde 
adcde 
cdcde 
cdcde 
cbcde
# within lines 3 to 5, replace a with c in lines 3 to 4 and b with d in lines 4 to 5
$ yes abcde|head -n5|sed '3,5{3,4 s/a/c/g; 4,5 s/b/d/g}'
abcde 
abcde 
cbcde 
cdcde 
adcde
# for lines other than lines 3 through 5, replace a with c
$ yes abcde|head -n5|sed '3,5! s/a/c/g'
cbcde 
cbcde 
abcde 
abcde 
abcde

Common patterns

sed prints every line it processes by default; the -n option turns this default printing off, as follows:

# -n turns off the default printing; otherwise lines 1 through 3 would be printed twice
$ seq 5|sed -n '1,3 p'
1
2
3
# The pattern part can use regular expressions; note that sed does not support \d, and extended regex (ERE) requires the -E option
# The pattern and action parts can be combined freely, so a regex pattern can also be used with the s action
$ seq 5|sed -n '/[2-4]/ p'
2
3
4
# The pattern can also be two regexes separated by a comma, matching from the line where the first regex matches through the line where the second one matches
$ seq 5|sed -n '/[2]/,/[4]/ p'
2
3
4
# Print the first line and every 2nd line after it (GNU first~step syntax)
$ seq 5|sed -n '1~2 p'
1
3
5
# Print the matching line and the next 2 lines
$ seq 5|sed -n '/^1$/,+2 p'
1
2
3

Common actions

In addition to s (substitute) and p (print), there are d (delete), i (insert), a (append), c (change), q (quit), and l (print special characters visibly), as follows:

# delete rows 1 and 2
$ seq 3|sed '/[1-2]/ d'
3
# i inserts a line before, often used to add a CSV header
$ seq 3|sed '1 i\id'
id 
1 
2 
3
# a appends a line after the last line, where $ refers to the last line
$ seq 3|sed '$ a\id'
1 
2 
3 
id
# change the first line to id
$ seq 3|sed '1 c\id'
id 
2 
3
# Print the first 5 lines; sed quits at line 5 because of the q command
$ seq 9|sed '5q'
1
2
3
4
5
# Display special characters
$ echo -ne '\r\n'|sed -n 'l0'
\r$

In addition, the s (substitute) command has some quite useful details; try them out:

# Replace can use the capture group function of re
$ echo 'id=1,name=zs'|sed -E 's/id=(\w+),name=(\w+)/\1 \2/'
1 zs
# g means to replace all a's with c's
$ echo 'a,a,a,a'|sed 's/a/c/g'
c,c,c,c
# 3g: replace the third and subsequent a matches with c
$ echo 'a,a,a,a'|sed 's/a/c/3g'
a,a,c,c
# No g can only replace the first match
$ echo 'a,a,a,a'|sed 's/a/c/'
c,a,a,a
# 3 indicates that only the third match of a is replaced by c
$ echo 'a,a,a,a'|sed 's/a/c/3'
a,a,c,a
# & represents previously matched content
$ echo 'a,a,a,a'|sed 's/.,./[&]/g' 
[a,a],[a,a]
# Change case
$ echo 'hello'|sed -E 's/.+/\U&/g' 
HELLO
$ echo 'hello'|sed -E 's/.+/\u&/g' 
Hello
$ echo 'HELLO'|sed -E 's/.+/\L&/g' 
hello
$ echo 'HELLO'|sed -E 's/.+/\l&/g' 
hELLO

Pattern space and hold space

sed has two concepts: the pattern space and the hold space. Simply put, they can be regarded as two variables: the pattern space is like a local variable that stores the line sed is currently reading, while the hold space is like a global variable.

The sed runtime can be described as follows:

hold_space="";
while read pattern_space; do
    # sed script here 
done 

With this concept in mind, the following commands can be introduced:

Command   Description
n         Load the next input line into the pattern space, overwriting its current contents
N         Append the next input line to the pattern space
P         Print the first line of the pattern space
D         Delete the first line of the pattern space
:label    Mark a position to jump to
b label   Jump to the position marked by :label, which can be used for branching and looping

# Print even lines
$ seq 9|sed -n 'n; p'
2
4
6
8

$ seq -s, 9
1,2,3,4,5,6,7,8,9

# Split into 3 numbers per line: s replaces the 3rd comma with a newline, P prints the first line of the pattern space, D deletes it and restarts, and so on until D has consumed all the data in the pattern space
$ seq -s, 9|sed 's/,/\n/3; P; D'
1,2,3
4,5,6
7,8,9

$ seq 9
1
2
3
4
5
6
7
8
9

# Join every 3 lines into one: the :a label and the ba jump form a loop that keeps appending lines with N until 3 lines are in the pattern space, then the newlines are replaced with commas
$ seq 9|sed ':a; N; 0~3!{$!ba}; s/\n/,/g'
1,2,3
4,5,6
7,8,9

Segment processing

sed can also be used to extract the contents of a specific segment, where a segment is a group of consecutive lines separated from other segments by blank lines, as in ifconfig output:



For example, to obtain the IP address of eth0, run the following command:

$ ifconfig|sed -nE '/\S/{:a; N; /\n$/!{$!ba}}; /eth0/s/.*inet (\S*).*/\1/gp'
172.21.117.1

How it works:

  1. Use /\S/ to start from a non-blank line and mark it with the :a label.
  2. Keep using N to read the next line and append it to the pattern space.
  3. /\n$/!{$!ba} means: if the line just read is not blank and not the last line, both /\n$/! and $! hold, so ba jumps back to the label and the next line is loaded; this repeats until a blank line is read, at which point a full segment has been collected.
  4. Use /eth0/ to check whether the segment contains eth0; if so, extract the IP address and print it.

Hold space commands

Command   Description
h         Overwrite the hold space with the pattern space
H         Append the pattern space to the hold space
g         Overwrite the pattern space with the hold space
G         Append the hold space to the pattern space
x         Exchange the pattern space and the hold space
# output in reverse order
# when sed processes line 1, h saves 1 into the hold space
# when line 2 is processed, G appends the hold space (1) to the pattern space, giving 2 then 1, and h saves that back into the hold space
# the next line gives 3, 2, 1, and so on until the last line, which $p prints
$ seq 5|sed -n '1!G; $p; h'
5
4
3
2
1
# The next one is similar, except that it prints every 3 lines and then clears the pattern space (and, via h, the hold space)
$ seq 9|sed -n 'G; 0~3{p; s/.*//g}; h'
3
2
1

6
5
4

9
8
7

# print the matching line and the 4 lines before it, similar to seq 9 | grep -B4 7
$ seq 9|sed -n 'H; x; 4,$s/^[^\n]*\n//; x; /^7$/{g; p}'
4
5
6
7

When sed commands that manipulate the hold space are combined like this, following the execution becomes mind-bending, and your brain has to work overtime.

Example: implementing urldecode

$ echo hello%E7%BC%96%E7%A8%8B|sed 's/%/\\x/g'
hello\xE7\xBC\x96\xE7\xA8\x8B

$ echo hello%E7%BC%96%E7%A8%8B|sed 's/%/\\x/g'|xargs -d"\n" echo -e
hello编程

awk

awk is a powerful text-processing tool. It is essentially a scripting language that can filter and replace text, and it can also do simple statistics and SQL-like joins.

Syntax

The basic syntax of AWK is as follows:

awk 'BEGIN{ //your code } pattern1{ //your code } pattern2{ //your code } END{ //your code }'
  1. The BEGIN section of the code executes first.

  2. Each line read from standard input is then processed in turn: if it matches pattern1, the code for pattern1 runs; if it matches pattern2, the code for pattern2 runs.

  3. Finally, execute the END part of the code.

How it works

Sum odd and even rows, as shown below:

$ seq 1 5
1 
2 
3 
4 
5

$ seq 1 5|awk 'BEGIN{print "odd","even"} NR%2==1{odd+=$0} NR%2==0{even+=$0} END{print odd,even}'
odd even 
9 6
  1. seq 1 5 generates the numbers 1 through 5.

  2. awk first executes the BEGIN block and prints the header.

  3. It then reads line 1 and tries the pattern NR%2==1; NR is the current line number, so line 1 matches and odd+=$0 accumulates its value.

  4. Line 1 then tries the pattern NR%2==0, which it does not match.

  5. Lines 2, 3, and so on through the last line repeat the two steps above.

  6. Finally, the END block prints the accumulated variables, where 9=1+3+5 and 6=2+4.

The program could also be written like this:

seq 1 5|awk 'BEGIN{print "odd","even"} {if(NR%2==1){odd+=$0}else{even+=$0}} END{print odd,even}'

This version uses an if statement. awk's syntax is actually very similar to C, so awk also has else, while, for, break, continue, exit, and so on. The common constructs are:

if (condition) statement [ else statement ]
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement
for (var in array) statement
i++; i--;
i > 0 ? 1 : 0
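A tiny sketch using one of these constructs, a for loop that sums the fields of each line:

$ echo '1 2 3' | awk '{s=0; for(i=1;i<=NF;i++) s+=$i; print s}'
6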

Splitting into columns

As you can see, awk processes data line by line by default; but awk does not only work on whole lines, it also splits each line into columns:

$ cat temp.txt
1,6
2,7
3,8
4,9
5,10
$ cat temp.txt | awk -F, '{printf "%s\t%s\n",$1,$2}'
1       6
2       7
3       8
4       9
5       10

This example specifies the column separator with -F, so awk automatically splits each line it reads and stores the columns in $1, $2, and so on. You can also use $NF and $(NF-1) to reference the last two columns. If -F is not specified, awk splits on whitespace by default.

Note the printf "%s\t%s\n",$1,$2 part: printf is a formatted printing function. It could also be written as printf("%s\t%s\n",$1,$2); awk allows the parentheses of such a function call to be omitted.

In addition, awk needs no operator to concatenate strings: simply writing two strings next to each other joins them, unlike Java where + is used, as follows:

$ awk 'BEGIN{print "a""b"}'
ab
# Of course, adding a space in between makes no difference; awk ignores it
$ awk 'BEGIN{print "a" "b"}'
ab
# Note the difference when a comma is used: print then separates the arguments with OFS (a space by default)
$ awk 'BEGIN{print "a","b"}'
a b

Associative arrays

Awk supports one-dimensional arrays, which can be used to calculate odd and even sums as follows:

$ seq 1 5|awk 'BEGIN{print "odd","even"} {S[NR%2]+=$0} END{print S[1],S[0]}'
odd even
9 6

awk arrays are associative: the key can be any value, not necessarily a number, which is conceptually similar to a Map in Java, as follows:

# count the number of processes and display the top 4 processes
$ ps h -eo comm|awk '{S[$0]++}END{for(k in S){print S[k],k}}'|sort -nr|head -n4
9 sshd
6 httpd
3 systemd
3 bash

To delete an element from an array, use delete S[k].
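A minimal sketch of delete; note that the iteration order of for (k in S) is unspecified, so the two output lines may appear in either order:

$ seq 3 | awk '{S[$0]++} END{delete S[2]; for(k in S) print k, S[k]}'
1 1
3 1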

Built-in variables

The NR built-in variable was already mentioned above; awk has the following built-in variables:

Variable   Description
$0         The current record (the whole line)
$1~$n      The nth field of the current record
NF         The number of fields in the current record, i.e. the number of columns
NR         The number of records read so far, i.e. the line number, starting at 1; it keeps accumulating across multiple files
FNR        The record number within the current file; unlike NR, it restarts for each file
FS         The input field separator, like -F; it can be a regular expression and defaults to whitespace (note: if FS is an empty string, every character becomes its own field)
OFS        Corresponding to FS, the column separator used by print for output; defaults to a space
RS         The record separator; defaults to \n (note: if RS is an empty string, records are separated by blank lines, i.e. segments)
ORS        Corresponding to RS, the record separator used by print for output; defaults to \n
FILENAME   The name of the current input file

Here are two examples:

$ echo -n '1,2,3|4,5,6|7,8,9'|awk 'BEGIN{RS="|"; FS=","} {print $1,$2,$3}'
1 2 3
4 5 6
7 8 9
$ echo -n '1,2,3|4,5,6|7,8,9'|awk 'BEGIN{RS="|"; FS=","; ORS=","; OFS="|"} {print $1,$2,$3}'
1|2|3,4|5|6,7|8|9,
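As the table notes, FS can also be a regular expression; a small sketch where one or more digits act as the field separator:

$ echo 'a1b22c333d' | awk -F'[0-9]+' '{print $1,$2,$3,$4}'
a b c d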

Summary: awk always reads data record by record using RS as the record separator, and then splits each record into fields according to FS.

Consider this example:

$ seq 1 5|awk '/^[1-4]/ && !/^[3-4]/'
1
2
$ seq 1 5|awk '$0 ~ /^[1-4]/ && $0 !~ /^[3-4]/{print}'
1
2
$ seq 1 5|awk '$0 ~ /^[1-4]/ && $0 !~ /^[3-4]/{print $0}'
1
2

You can see:

  1. The awk pattern part can use a regular expression directly, and patterns can be combined with logical operators such as &&, ||, and !.

  2. If a regular expression does not specify which variable to match, it matches $0 by default, so awk '/regex/' is equivalent to grep -E 'regex'.

  3. In addition, if the code following the pattern is omitted, $0 is printed by default.

  4. The print function also prints $0 by default if no argument is specified.

  5. Note that regular expressions in awk do not support \d; use [0-9] to match digits. Linux has three regex flavors, BRE, ERE, and PCRE, and awk supports ERE.

Segment processing

Here is how to obtain the IP address of the eth0 network interface using awk:

ifconfig|awk -v RS= '/eth0/{print $6}'
172.21.117.1 

Commonly used functions

Function      Description                                                          Example
sub           Replace the first match                                              sub(/,/,"|",$0)
gsub          Replace all matches in place and return the number of replacements   gsub(/,/,"|",$0)
gensub        Replace and return the resulting string (gawk)                       $0=gensub(/,/,"|","g",$0)
match         Match a regex, capturing groups into array a                         match($0,/id=(\w+)/,a)
split         Split a string into array a                                          split($0,a,/,/)
index         Find a substring, returning its position (starting at 1)             i=index($0,"hi")
substr        Extract a substring                                                  substr($0,1,i) or substr($0,i)
tolower       Convert to lowercase                                                 tolower($0)
toupper       Convert to uppercase                                                 toupper($0)
srand, rand   Generate random numbers                                              BEGIN{srand(); printf "%d",rand()*10}
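A quick sketch of split, which returns the number of fields and fills the array:

$ echo 'a,b,c' | awk '{n=split($0,a,","); print n, a[2]}'
3 b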

Search and extract

Sample data is as follows, also generated using AWK:

$ seq 1 10|awk '{printf "id=%s,name=person%s,age=%d,sex=%d\n",$0,$0,$0/3+15,$0/6}'|tee person.txt
id=1,name=person1,age=15,sex=0
id=2,name=person2,age=15,sex=0
id=3,name=person3,age=16,sex=0
id=4,name=person4,age=16,sex=0
id=5,name=person5,age=16,sex=0
id=6,name=person6,age=17,sex=1
id=7,name=person7,age=17,sex=1
id=8,name=person8,age=17,sex=1
id=9,name=person9,age=18,sex=1
id=10,name=person10,age=18,sex=1

To implement a query like the SQL select id,name,age from person where age > 15 and age < 18 limit 4, the awk version is:

$ cat person.txt |awk 'match($0, /^id=(\w+),name=(\w+),age=(\w+)/, a) && a[3]>15 && a[3]<18 { print a[1],a[2],a[3]; if(++limit >= 4) exit 0}'
3 person3 16
4 person4 16
5 person5 16
6 person6 17
  1. The match function, together with regex capture groups, first extracts the values of id, name, and age into a[1], a[2], and a[3].
  2. Then a[3]>15 && a[3]<18 plays the role of the SQL condition age > 15 and age < 18.
  3. Then print a[1],a[2],a[3] works like select id,name,age in SQL.
  4. Finally, once 4 lines have been printed the program exits, which implements limit 4.

Simple statistical analysis

awk can also handle simple statistical analysis tasks. Again using SQL as the reference, the query is select age,sex,count(*) num, group_concat(id) ids from person where age > 15 and age < 18 group by age,sex, and the awk implementation is as follows:

$ cat person.txt |awk '
BEGIN{ printf "age\tsex\tnum\tids\n" }
match($0, /^id=(\w+),name=(\w+),age=(\w+),sex=(\w+)/, a) && a[3]>15 && a[3]<18 {
    s[a[3],a[4]]["num"]++;
    s[a[3],a[4]]["ids"] = (s[a[3],a[4]]["ids"] ? s[a[3],a[4]]["ids"] "," a[1] : a[1])
}
END{
    for(key in s){
        split(key, k, SUBSEP); age=k[1]; sex=k[2];
        printf "%s\t%s\t%s\t%s\n",age,sex,s[age,sex]["num"],s[age,sex]["ids"]
    }
}'
age     sex     num     ids
17      1       3       6,7,8
16      0       3       3,4,5

The AWK code is a bit long, but the logic is clear.

  1. BEGIN prints the header line.
  2. match extracts id, name, age, and sex, the conditions filter the data with age>15 and age<18, and the statistics are accumulated into the associative array s. You can think of s as a map-like associative array that here effectively has two levels of keys. (Note that in awk, strings are concatenated by writing them next to each other with a space, not with + as in Java.)
  3. Finally, the END block iterates over the associative array s. Note that for a subscript like s[a[3],a[4]], awk joins a[3] and a[4] into a single key using the SUBSEP variable, so split(key, k, SUBSEP) is needed to split the key back into k; SUBSEP defaults to \034, the file separator character.

Multi-file join processing

awk can also implement SQL-like join processing, for example to find the intersection or difference of two files, as follows:

$ cat user.txt
1 zhangsan
2 lisi
3 wangwu
4 pangliu

$ cat score.txt
1 86
2 57
3 92

# Similar to the SQL: select a.id,a.name,b.score from user a left join score b on a.id=b.id
# While reading score.txt, NR==FNR holds; while reading user.txt, NR!=FNR holds
$ awk 'NR==FNR{s[$1]=$2} NR!=FNR{print $1,$2,s[$1]}' score.txt user.txt
1 zhangsan 86
2 lisi 57
3 wangwu 92
4 pangliu

# Of course, you can also use the FILENAME built-in variable directly, as follows
$ awk 'FILENAME=="score.txt"{s[$1]=$2} FILENAME=="user.txt"{print $1,$2,s[$1]}' score.txt user.txt

# Print the user.txt lines whose id is not in score.txt
$ awk 'FILENAME=="score.txt"{s[$1]=$2} FILENAME=="user.txt" && !($1 in s){print $0}' score.txt user.txt
4 pangliu

Example: converting between IP addresses and numbers

# Convert an IP address to a number
$ echo 192.168.0.101 | awk -F. '{print strtonum("0x"sprintf("%02X",$1)sprintf("%02X",$2)sprintf("%02X",$3)sprintf("%02X",$4))}' 
3232235621

# number to IP address
$ echo 3232235621|awk -v ORS=. '{match(sprintf("%08X",$0),/(..)(..)(..)(..)/,a); for(i=1; i<=4; i++){print strtonum("0X"a[i])}}' 
192.168.0.101.

Example: implementing urlencode

$ echo -n hello编程 | od -An -t u1 | xargs -n1 | awk -v ORS= '{c=sprintf("%c",$1); print c~/[0-9a-zA-Z.-_]/ ? c : sprintf("%%%02X",$1)}'
hello%E7%BC%96%E7%A8%8B

Understanding the capabilities of grep, sed, and awk

grep, sed, and awk are the most commonly used commands for text processing, so the three of them are often called the three swordsmen of Linux, which shows their importance. Here is a comparison of their processing capabilities.

In terms of text-processing power, grep < sed < awk; in terms of learning difficulty, also grep < sed < awk. Using SQL as an analogy: grep implements line-level WHERE filtering with regexes; sed implements line-level WHERE filtering, line-number filtering, and update operations such as update, insert, and delete; awk implements column-level WHERE filtering, line-number filtering, update, insert, delete, and GROUP BY statistics.

Each entry below lists the basic command and the grep, sed, and awk implementations where applicable.

Function: filter the first 10 lines
  basic: seq 20 | head -n10
  grep:  seq 20 | grep -m10 '.*'
  sed:   seq 20 | sed -n '1,10p'
  awk:   seq 20 | awk 'NR<=10'

Function: filter lines that contain 1
  grep:  seq 20 | grep 1
  sed:   seq 20 | sed -n '/1/p'
  awk:   seq 20 | awk '/1/'

Function: filter lines that do not contain 1
  grep:  seq 20 | grep -v 1
  sed:   seq 20 | sed -n '/1/!p'
  awk:   seq 20 | awk '!/1/'

Function: filter lines greater than or equal to 8
  grep:  seq 20 | grep -E '^([89]|[1-2][0-9])$'
  sed:   seq 20 | sed -nE '/^([89]|[1-2][0-9])$/p'
  awk:   seq 20 | awk '$1 >= 8'

Function: filter out at most 10 lines containing 1
  grep:  seq 20 | grep -m10 1
  awk:   seq 20 | awk '/1/ && ++n <= 10'

Function: multi-condition filtering, keep lines that contain both 1 and 2, or neither
  grep:  seq 50 | grep -P '^(?=.*1)(?=.*2).+|^(?!.*1)(?!.*2).+'
  sed:   seq 50 | sed -n '/1/{/2/p; d}; /1/!{/2/!p; d}; /2/!{/1/!p; d}'
  awk:   seq 50 | awk '$0 ~ /1/ && $0 ~ /2/ || $0 !~ /1/ && $0 !~ /2/'

Function: filter out 11 and the next 2 lines
  grep:  seq 20 | grep -A2 11
  sed:   seq 20 | sed -n '/11/,+2p'
  awk:   seq 20 | awk '/11/{n=1} n && n++<=3'

Function: step filtering, keep every 3rd line, i.e. 3,6,9...
  sed:   seq 10 | sed -n '0~3p'
  awk:   seq 10 | awk 'NR%3==0'

Function: range filtering, keep the lines from the one containing 2 through the one containing 6
  sed:   seq 20 | sed -n '/2/,/6/p'
  awk:   seq 20 | awk '/2/,/6/'

Function: extract part of the text
  grep:  echo 'hello,java' | grep -oP 'hello,\K(\w+)'
  sed:   echo 'hello,java' | sed -nE 's/hello,(\w+)/\1/p'
  awk:   echo 'hello,java' | awk 'match($0,/hello,(\w+)/,a){print a[1]}'

Function: update, replace java with bash
  sed:   echo 'hello,java' | sed 's/java/bash/g'
  awk:   echo 'hello,java' | awk '{gsub(/java/,"bash",$0); print $0}'

Function: insert, add a title as the first line
  sed:   echo 'hello,java' | sed '1i\title'
  awk:   echo 'hello,java' | awk '{if(NR==1){print "title"} print $0}'

Function: delete, remove the line containing java
  sed:   echo 'hello,java' | sed '/java/d'
  awk:   echo 'hello,java' | awk '{if(/java/){next} print $0}'

Function: convert camelCase to snake_case and back
  sed:   echo "userId" | sed -E 's/([A-Z]+)/_\l\1/g'
         echo "user_id" | sed -E 's/_(.)/\u\1/g'
  awk:   echo "userId" | awk '{print tolower(gensub(/([A-Z]+)/,"_\\1","g",$0))}'
         echo "user_id" | awk -F_ -v ORS= '{for(i=1; i<=NF; i++){print i==1 ? $i : toupper(substr($i,1,1)) substr($i,2)}}'

Function: reverse the output
  basic: seq 9 | tac
  sed:   seq 9 | sed -n '1!G; $p; h'
  awk:   seq 9 | awk '{s=$0 "\n" s}END{print s}'

Function: statistics, count the total number of lines
  basic: seq 20 | wc -l
  grep:  seq 20 | grep . -c
  sed:   seq 20 | sed -n '$='
  awk:   seq 20 | awk 'END{print NR}'

Function: statistics, group counting
  basic: seq 20 | grep -o . | sort | uniq -c
  awk:   seq 20 | grep -o . | awk '{S[$0]++} END{for(k in S){print S[k],k}}'

Practice

Practice 1: Find the last 10 exception logs

tac app.log |sed '/^\S/a\\'|awk -v RS= '/ERROR/ && ++n<=10{print; if(n>=10){exit}}'|tac

Practice 2: Counting high-frequency words in Java code

# uniq implementation
$ time find -name '*.java'|xargs sed -E 's/\b[A-Z]/\l&/g; s/[A-Z]/_\l&/g'|grep -w -oE '\w+'|pv -l|sort|uniq -c|sort -nrk1|head -n5
2.16M 0:00:03 [584k/s] [<=>]
  56442 public
  46228 import
  45940 string
  42473 order
  41077 return

real    0m4.434s
user    0m4.719s
sys     0m2.911s

# awk implementation, faster than the uniq version, though in theory it uses more memory than uniq
$ time find -name '*.java'|xargs sed -E 's/\b[A-Z]/\l&/g; s/[A-Z]/_\l&/g'|grep -w -oE '\w+'|pv -l|awk '{S[$0]++}END{for(k in S){print S[k],k}}'|sort -nrk1|head -n5
2.16M 0:00:02 [1.03M/s] [<=>]
56442 public
46228 import
45940 string
42473 order
41077 return

real    0m2.366s
user    0m2.324s
sys     0m3.050s

Conclusion

It is important to be familiar with text-processing commands, because both the input that commands read and the output they produce are basically plain text. To solve everyday tasks comfortably with Linux commands, you must know these common text-processing commands well.

Related articles:
Linux text command tips (part 1)
Linux text command tips (part 2)
Use a Linux command to quickly view a specific line

Knowledge Extension Guide

  1. How does the following compute the Fibonacci sequence?
(echo 0;echo 1) > num.txt
tail -n+0 -f num.txt|awk 'NR>1{print pre+$0; fflush()}{pre=$0}' >> num.txt
  2. Try using -p to debug the following two xargs pipelines and figure out why their outputs differ.
$ seq 6|xargs -n2|xargs -L1 printf "<%s> " 
<1> <2> <3> <4> <5> <6>

$ seq 6|xargs -n2|xargs -d'\n' -L1 printf "<%s> " 
<1 2> <3 4> <5 6>
  3. Why does grep not support \d by default? What are BRE, ERE, and PCRE?


  4. Why does the first sed pipeline below produce no output unless -u is added? Why does adding stdbuf -oL also make it produce output?
while sleep 1;do echo $((i++)); done|sed 's/.\+/&+1/g'|bc
while sleep 1;do echo $((i++)); done|sed -u 's/.\+/&+1/g'|bc
while sleep 1;do echo $((i++)); done|stdbuf -oL sed 's/.\+/&+1/g'|bc

(Hint: think about how shell pipes are buffered and can appear blocked.)

  5. What are the following text commands used for?

tr cut paste comm join

Linux Text command Techniques (part 1)

Previous articles

Awk is really a magic tool
Linux text command tips (part 1)
Linux text command tips (part 2)
Character encoding solutions