Awk text processing
Awk is a pattern scanning and processing language, available on Linux through the awk command, that makes text processing quick and efficient. Awk scans the input line by line and runs the specified action on each line.
Awk was born in 1977. It borrows from programming languages such as C and takes its name from the surnames of its three designers: Alfred Aho, Peter Weinberger and Brian Kernighan. There are many versions of awk. This article uses GNU awk (gawk) on Ubuntu. On macOS you can install gawk with Homebrew.
Usage
Awk can be run directly from the command line, or you can write a script file with the .awk suffix and execute it. Awk processes text line by line, performing the specified action on each line it receives.
Command line execution
$ awk [ -F fs ] [ -v var=value ] 'pattern {action}' [ file ... ]
Here -F specifies the field separator and -v assigns a value to a variable, which can be one of awk's built-in variables.
For example, suppose the /etc/passwd file contains:
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
To output the contents of each line, use:
$ awk '{print $0}' /etc/passwd
Here $0 represents the whole line that was just scanned.
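As a minimal sketch of the -F and -v options together, the snippet below feeds awk an inline copy of the sample lines, so the result does not depend on the local /etc/passwd. The variable name prefix and its value are illustrative assumptions, not part of awk.

```shell
# Inline copy of the sample /etc/passwd lines shown above.
sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# -F ':' sets the field separator; -v defines a user variable
# (prefix is a hypothetical name chosen for this example).
result=$(printf '%s\n' "$sample" | awk -F ':' -v prefix='user=' '{ print prefix $1 }')
printf '%s\n' "$result"
```

Each output line is the value of prefix followed by the first field of the record.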
Script files
A .awk file can be written in three parts, as follows:
# passwd.awk
BEGIN{
FS="\n";
print "Before action";
}
{
print $0;
}
END{
print "After action";
}
The BEGIN block defines what happens before any line is processed. It can be used to set awk's built-in variables, which then take effect for every line that follows.
The END block defines what happens after all the text has been processed and can be used to output summary information.
The block between BEGIN and END is the action run on each line. BEGIN and END blocks can also be used when running awk directly from the command line.
After writing the file, execute it on the command line with the -f option:
$ awk -f passwd.awk /etc/passwd
Before action
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
After action
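The same three-part structure can be written inline on the command line. The following is a rough sketch under that assumption. The counter name lines is made up for the example, and the input is an inline copy of the sample data.

```shell
# Inline copy of the sample /etc/passwd lines used in this article.
sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# BEGIN runs once before any input, the middle block runs per line,
# END runs once after the last line. "lines" is a hypothetical counter.
result=$(printf '%s\n' "$sample" | awk '
  BEGIN { print "Before action" }
        { lines++ }
  END   { print "After action: " lines " lines" }')
printf '%s\n' "$result"
```

Here the per-line block only counts, and the END block prints the summary.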
Variables
$ + number
$0 represents the scanned line, $1 represents the first item separated from the line, $2 represents the second item separated from the line, and so on.
To print the user name of /etc/passwd (item 1), you can execute the following statement:
$ awk -F ':' '{print $1}' /etc/passwd
root
daemon
bin
sys
Take the first line, root:x:0:0:root:/root:/usr/bin/zsh, as an example. Awk splits it on the separator : set with -F into the items root, x, 0, 0, root, /root and /usr/bin/zsh, then prints the first item, root.
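To make the numbering concrete, this small sketch prints two of the seven items from that same line. The wording of the output string is only illustrative.

```shell
# The first sample line from /etc/passwd, passed in directly.
line='root:x:0:0:root:/root:/usr/bin/zsh'

# $1 is the first item (user name), $7 the seventh (login shell).
result=$(printf '%s\n' "$line" | awk -F ':' '{ print $1 " logs in with " $7 }')
printf '%s\n' "$result"
```

The output is a single line: root logs in with /usr/bin/zsh.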
Special variables
FS (field separator)
FS is the input field separator, set to : in the examples above. Its default value is a single space, which makes awk split on runs of spaces and tabs. It can be set on the command line with -F, or in the BEGIN block with FS=, to a string or a regular expression. For example:
$ awk -F ":" '{print $1,$2,$3}' /etc/passwd
root x 0
daemon x 1
bin x 2
sys x 3
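As a sketch of the regular-expression case, the example below sets FS in the BEGIN block so that either a comma or a semicolon, followed by optional spaces, separates fields. The input line is made-up data, not taken from the files above.

```shell
# Hypothetical input that mixes two separators.
# FS is a regular expression: a comma or semicolon, then any number of spaces.
result=$(printf 'alice, 30;paris\n' | awk 'BEGIN { FS = "[,;] *" } { print $1, $2, $3 }')
printf '%s\n' "$result"
```

The three fields come out as alice, 30 and paris, joined by the default output separator.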
OFS (output field separator)
OFS is the output field separator. The example above joins the output fields with a space, the default, which can be changed by setting the OFS variable:
$ awk -F ":" -v OFS="-" '{print $1,$2,$3}' /etc/passwd
root-x-0
daemon-x-1
bin-x-2
sys-x-3
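One detail worth a quick sketch: OFS is applied between the comma-separated arguments of print, but $0 keeps its original separators until a field is assigned. Assigning $1 to itself is a common way to force awk to rebuild the record. The three-field input here is a shortened, illustrative line.

```shell
# First print shows $0 untouched; after $1 = $1 the record is
# rebuilt using OFS, so the second print shows the new separator.
result=$(printf 'root:x:0\n' | awk -F ':' -v OFS='-' '{ print $0; $1 = $1; print $0 }')
printf '%s\n' "$result"
```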
RS (record separator)
In all of the previous examples awk processes the text line by line, treating each line as one record, because the default record separator RS is "\n". Not all text is stored one record per line the way a CSV file is. For example:
# people.txt
P1
male
15

p2
female
20

p3
male
19
The file above uses "\n\n" to separate records and "\n" to separate the fields within each record. It can be handled like this (a multi-character RS works in gawk, but POSIX leaves the behavior unspecified):
$ awk -F "\n" -v RS="\n\n" '{print $1,$2,$3}' people.txt
P1 male 15
p2 female 20
p3 male 19
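A portable alternative, sketched here on the assumption that records are separated by blank lines, is paragraph mode: setting RS to the empty string makes awk treat each run of blank lines as a record separator. The data is an inline copy of people.txt.

```shell
# Inline copy of people.txt: records separated by blank lines.
people='P1
male
15

p2
female
20

p3
male
19'

# RS = "" switches awk to paragraph mode; FS = "\n" makes each line a field.
result=$(printf '%s\n' "$people" | awk 'BEGIN { RS = ""; FS = "\n" } { print $1, $2, $3 }')
printf '%s\n' "$result"
```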
ORS (output record separator)
Similar to RS, ORS sets the record separator used for output:
$ awk -F "\n" -v RS="\n\n" -v ORS="\n***\n" '{print $1,$2,$3}' people.txt
P1 male 15
***
p2 female 20
***
p3 male 19
***
NR (number of records)
NR is the number of the record currently being processed. In the END block it holds the total number of records processed.
$ awk -F ":" '{print "line" NR ":" $1,$2,$3}' /etc/passwd
line1:root x 0
line2:daemon x 1
line3:bin x 2
line4:sys x 3
If more than one file is processed in the same run, the count carries on from one file to the next:
$ awk -F ":" '{print "record" NR ":" $1,$2,$3}' people.txt /etc/passwd
record1:P1
record2:male
record3:15
record4:
record5:p2
record6:female
record7:20
record8:
record9:p3
record10:male
record11:19
record12:root x 0
record13:daemon x 1
record14:bin x 2
record15:sys x 3
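The two roles of NR, a running number per record and a total in the END block, can be seen side by side in a short sketch. The three-line input is made up for the example.

```shell
# NR numbers each record as it is read; in END it still holds
# the number of the last record, i.e. the total count.
result=$(printf 'a\nb\nc\n' | awk '{ print NR ": " $0 } END { print "total: " NR }')
printf '%s\n' "$result"
```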
NF (number of fields)
NF is the number of fields in the current record, so its value depends on how FS is set:
# delimiter with ":"
$ awk -F ":" '{print "record" NR " with " NF " fields:" $1,$2,$3}' /etc/passwd
record1 with 7 fields:root x 0
record2 with 7 fields:daemon x 1
record3 with 7 fields:bin x 2
record4 with 7 fields:sys x 3
# delimiter with "o"
$ awk -F "o" '{print "record" NR " with " NF " fields:" $1,$2,$3}' /etc/passwd
record1 with 7 fields:r  t:x:0:0:r
record2 with 5 fields:daem n:x:1:1:daem n:/usr/sbin:/usr/sbin/n
record3 with 3 fields:bin:x:2:2:bin:/bin:/usr/sbin/n l gin
record4 with 3 fields:sys:x:3:3:sys:/dev:/usr/sbin/n l gin
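Because NF is a number, it can also be used after $ to reach fields counted from the end of the record. A minimal sketch using the first sample line:

```shell
line='root:x:0:0:root:/root:/usr/bin/zsh'

# NF is the field count, $NF the last field, $(NF - 1) the one before it.
result=$(printf '%s\n' "$line" | awk -F ':' '{ print NF, $NF, $(NF - 1) }')
printf '%s\n' "$result"
```

For this line the output is 7, then the login shell, then the home directory.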
FILENAME
FILENAME is the name of the file currently being processed:
$ awk -F ":" '{print FILENAME}' /etc/passwd people.txt
/etc/passwd
/etc/passwd
/etc/passwd
/etc/passwd
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
people.txt
FILENAME has no meaningful value until the first record has been read, so printing it in the BEGIN block yields an empty value.
FNR
NR, described earlier, keeps counting across multiple files, while FNR is the record number within the current file:
$ awk -F ":" '{print "record" FNR ":" $1,$2,$3}' people.txt /etc/passwd
record1:P1
record2:male
record3:15
record4:
record5:p2
record6:female
record7:20
record8:
record9:p3
record10:male
record11:19
record1:root x 0
record2:daemon x 1
record3:bin x 2
record4:sys x 3
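To compare the two counters directly, this sketch prints NR and FNR next to each other for two small temporary files. The file names first.txt and second.txt and their contents are assumptions made for the example.

```shell
# Create two throwaway input files (hypothetical names and contents).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
result=$(awk '{ print NR, FNR, $0 }' "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

On the third line of output NR is 3 while FNR has gone back to 1, which marks the start of the second file.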
Built-in functions
Awk provides built-in functions for convenient text and arithmetic processing, including length() for the length of a string, rand() for random numbers, and sin() and cos() for sines and cosines.
These functions are documented in the official manual.
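A brief sketch of a few of these functions, called from a BEGIN block so that no input file is needed. toupper() is not mentioned above but is another standard awk string function. rand() is left out because its output is not fixed.

```shell
# length() and toupper() work on strings; sin() and cos() take radians.
result=$(awk 'BEGIN { print length("daemon"), toupper("daemon"), sin(0), cos(0) }')
printf '%s\n' "$result"
```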
Filtering records
All of the examples above operate on every record. Records can also be filtered with conditions.
Regular expression matching
Records can be pattern-matched using regular expressions:
$ awk -F ':' '/root/ {print $1,$2,$3}' /etc/passwd
root x 0
Only the records that contain root are selected.
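A bare /root/ pattern matches anywhere in the record. To test a single field instead, awk offers the ~ operator. The sketch below contrasts the two. The operator line is a hypothetical account added only to show the difference.

```shell
# Second line is hypothetical: "root" appears in it only as the home directory.
sample='root:x:0:0:root:/root:/usr/bin/zsh
operator:x:11:0:operator:/root:/usr/sbin/nologin'

# Matches any record containing "root" anywhere.
anywhere=$(printf '%s\n' "$sample" | awk -F ':' '/root/ { print $1 }')
# Matches only records whose first field is exactly "root".
first_field=$(printf '%s\n' "$sample" | awk -F ':' '$1 ~ /^root$/ { print $1 }')

printf 'anywhere:\n%s\nfirst field only:\n%s\n' "$anywhere" "$first_field"
```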
Conditional expressions
Filtering can also combine awk's built-in variables and functions:
Output all records after the first whose first field is longer than 3 characters:
$ awk -F ':' 'length($1)>3 && NR>1 {print $0}' /etc/passwd
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
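Conditions can compare field values as well. As a small sketch on an inline copy of the sample data, the pattern below keeps the records whose third field (the user ID) is at least 2.

```shell
sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# $3 looks like a number, so the comparison is numeric.
result=$(printf '%s\n' "$sample" | awk -F ':' '$3 >= 2 { print $1, $3 }')
printf '%s\n' "$result"
```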
If statement
Awk also provides if statements:
Print the record where the third field is 0
$ awk -F ':' '{if ($3==0)print $0}' /etc/passwd
root:x:0:0:root:/root:/usr/bin/zsh
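An if statement can take an else branch too. A minimal sketch on two of the sample lines, with message texts chosen only for illustration:

```shell
sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# Every record is printed, but the message depends on the third field.
result=$(printf '%s\n' "$sample" | awk -F ':' '{
  if ($3 == 0)
    print $1, "has uid 0"
  else
    print $1, "has a non-zero uid"
}')
printf '%s\n' "$result"
```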
Awk also has a for statement, which is similar to the form in C:
$ awk -v ORS="," 'BEGIN{ for(i=1; i<5; i++) print i}'
1,2,3,4,
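A for loop is often combined with NF to walk over the fields of a record. The sketch below prints the fields of a made-up line in reverse order, using printf so that the separator can be chosen per field.

```shell
# Loop from the last field down to the first; print "-" between
# fields and a newline after the final one.
result=$(printf 'a:b:c\n' | awk -F ':' '{
  for (i = NF; i >= 1; i--)
    printf "%s%s", $i, (i > 1 ? "-" : "\n")
}')
printf '%s\n' "$result"
```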
Strings and numbers
Awk supports mathematical and logical operators. In awk, strings and numbers convert to each other directly: adding 0 converts a value to a number, and concatenating it with the empty string converts it to a string:
$ awk 'BEGIN{print "origin\tnumber\tstring"}{print $0,"\t",$0+0,"\t",$0 ""}' people.txt
origin number string
P1 0 P1
male 0 male
15 15 15
0
p2 0 p2
female 0 female
20 20 20
0
p3 0 p3
male 0 male
19 19 19
Here $0 "" concatenates the original record with an empty string. Concatenation can give surprising results in some cases. Consider the following statement:
$ awk 'BEGIN { print -12 " " -24 }'
-12-24
Here we want a space between -12 and -24, but we do not get the desired result. The reason is that the subtraction operator has higher precedence than concatenation, so the expression is parsed and evaluated as follows:
-12 (" " - 24)
=> -12 (0 - 24)
=> -12 (-24)
=> -12-24
To get the correct result, use parentheses:
$ awk 'BEGIN { print -12 " " (-24) }'
-12 -24
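The conversion rules can be checked with a few one-line expressions. This is only a sketch of the common cases: a numeric string used in arithmetic, two numbers placed side by side, and strings that are not, or only partly, numeric.

```shell
result=$(awk 'BEGIN {
  print "3" + 4      # string "3" is converted to the number 3
  print 10 20        # juxtaposition concatenates: numbers become strings
  print "abc" + 0    # a non-numeric string converts to 0
  print "12abc" + 0  # only the leading numeric part is used
}')
printf '%s\n' "$result"
```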