Awk text processing

Awk is a pattern scanning and processing language for fast, efficient text processing, available on Linux through the awk command. Awk scans the input line by line and executes the specified action on each line.

Awk was born in 1977, borrowing from programming languages such as C and taking its name from the surnames of its three designers, Alfred Aho, Peter Weinberger and Brian Kernighan. There are many versions of awk. This article uses GNU awk (gawk) on Ubuntu. You can install gawk on macOS using Homebrew.

Usage

Awk can be executed directly from the command line, or you can write files with the .awk suffix and execute them. Awk processes text line by line, performing the specified action for each line it receives.

Command line execution

$ awk [ -F fs ] [ -v var=value ] 'pattern {action}' [ file ...  ]

Here -F specifies the field separator, and -v assigns a value to a variable (including awk's built-in variables) before the program starts.

For example, suppose the /etc/passwd file contains the following:

root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin

If you want to output the contents of each line, use:

$ awk '{print $0}' /etc/passwd

Here $0 represents the entire line currently being scanned.

Script files

A .awk file can be written in three parts, as follows:

# passwd.awk

BEGIN{
  FS="\n";
  print "Before action";
}
{
  print $0;
}
END{
  print "After action";
}

The BEGIN block runs once, before any input is read. It can be used to set awk's built-in variables, which then apply to every subsequent line.

The END block runs once, after all of the text has been processed, and can be used to output summary information.

The block between BEGIN and END is the action performed on each line. The BEGIN and END blocks can also be used when executing awk directly from the command line.

After writing the file, run it from the command line with the -f option:

$ awk -f passwd.awk /etc/passwd
Before action
root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
After action
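The same BEGIN / per-line / END structure also fits on the command line. The following is a minimal sketch, not part of the original example: it assumes the four sample lines are held in a shell variable (passwd_sample, a name made up here) instead of reading the real /etc/passwd, and it uses the built-in NR variable in the END block to print a record count.

```shell
# Hypothetical sample data: the four /etc/passwd lines shown earlier.
passwd_sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# BEGIN runs once before any input, the middle block once per line,
# and END once after the last line (NR then holds the line count).
summary=$(printf '%s\n' "$passwd_sample" | awk -F ':' '
    BEGIN { print "users:" }
          { print "  " $1 }
    END   { print "total: " NR }')

printf '%s\n' "$summary"
```

Keeping the program inside single quotes stops the shell from expanding $1 before awk sees it.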

Variables

$ + number

$0 represents the whole line being scanned, $1 the first field split from that line, $2 the second field, and so on.

To print the user names in /etc/passwd (field 1), you can execute the following statement:

$ awk -F ':' '{print $1}' /etc/passwd
root
daemon
bin
sys

Take the first line, root:x:0:0:root:/root:/usr/bin/zsh, as an example. Awk first splits the line on the separator set with -F, here ":", into the fields root, x, 0, 0, root, /root and /usr/bin/zsh, and then prints the first field, root.

Special variables

  1. FS(field separator)

    FS is the input field separator, set to ":" in the examples above. It defaults to a single space, which awk interprets as any run of spaces or tabs. It can be set to a string or a regular expression, either on the command line with -F or in the BEGIN block via FS=. For example:

    $ awk -F ":" '{print $1,$2,$3}' /etc/passwd
    root x 0
    daemon x 1
    bin x 2
    sys x 3
  2. OFS(output field separator)

    OFS is the output field separator. The output in the example above joins fields with a space, which is the default; this can be changed by setting the OFS variable:

    $ awk -F ":" -v OFS="-" '{print $1,$2,$3}' /etc/passwd
    root-x-0
    daemon-x-1
    bin-x-2
    sys-x-3
  3. RS(record separator)

    In all of the previous examples, awk processes text line by line, treating each line as one record, because the default record separator RS is "\n". Some text, unlike CSV files, does not store one record per line. For example:

    # people.txt
    
    P1
    male
    15
    
    p2
    female
    20
    
    p3
    male
    19

    The file above separates records with "\n\n" and the fields within each record with "\n". It can be handled like this:

    $ awk -F "\n" -v RS="\n\n" '{print $1,$2,$3}' people.txt 
    P1 male 15
    p2 female 20
    p3 male 19
  4. ORS(output record separator)

    Similar to RS, ORS sets the record separator for output.

    $ awk -F "\n" -v RS="\n\n" -v ORS="\n***\n" '{print $1,$2,$3}' people.txt
    P1 male 15
    ***
    p2 female 20
    ***
    p3 male 19
    ***
  5. NR(number of records)

    NR is the number of the record currently being processed. In the END block, it holds the total number of records processed.

    $ awk -F ":" '{print "line" NR ":" $1,$2,$3}' /etc/passwd           
    line1:root x 0
    line2:daemon x 1
    line3:bin x 2
    line4:sys x 3

    If more than one file is processed in a single run, NR keeps counting across the files:

    $ awk -F ":" '{print "record" NR ":" $1,$2,$3}' people.txt /etc/passwd
    record1:P1  
    record2:male  
    record3:15  
    record4:  
    record5:p2  
    record6:female  
    record7:20  
    record8:  
    record9:p3  
    record10:male  
    record11:19  
    record12:root x 0
    record13:daemon x 1
    record14:bin x 2
    record15:sys x 3
  6. NF(number of fields)

    NF is the number of fields in the current record, so its value depends on how FS is set:

    # delimiter is ":"
    $ awk -F ":" '{print "record" NR " with " NF " fields:" $1,$2,$3}' /etc/passwd
    record1 with 7 fields:root x 0
    record2 with 7 fields:daemon x 1
    record3 with 7 fields:bin x 2
    record4 with 7 fields:sys x 3

    # delimiter is "o"
    $ awk -F "o" '{print "record" NR " with " NF " fields:" $1,$2,$3}' /etc/passwd
    record1 with 7 fields:r  t:x:0:0:r
    record2 with 5 fields:daem n:x:1:1:daem n:/usr/sbin:/usr/sbin/n
    record3 with 3 fields:bin:x:2:2:bin:/bin:/usr/sbin/n l gin
    record4 with 3 fields:sys:x:3:3:sys:/dev:/usr/sbin/n l gin
  7. FILENAME

    FILENAME is the name of the file currently being processed:

    $ awk -F ":" '{print FILENAME}' /etc/passwd people.txt
    /etc/passwd
    /etc/passwd
    /etc/passwd
    /etc/passwd
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt
    people.txt

    FILENAME has no meaningful value until awk starts reading input, so printing it in the BEGIN block yields an empty value.

  8. FNR

    While NR keeps accumulating across multiple files, FNR is the record number within the current file and restarts at 1 for each new file:

    $ awk -F ":" '{print "record" FNR ":" $1,$2,$3}' people.txt /etc/passwd
    record1:P1  
    record2:male  
    record3:15  
    record4:  
    record5:p2  
    record6:female  
    record7:20  
    record8:  
    record9:p3  
    record10:male  
    record11:19  
    record1:root x 0
    record2:daemon x 1
    record3:bin x 2
    record4:sys x 3 
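A note on the RS example in item 3: POSIX leaves the behavior of a multi-character RS such as "\n\n" unspecified, so that form relies on gawk (the version used in this article) or another awk that supports it. A sketch of a more portable alternative follows, assuming the same layout as people.txt but held in a shell variable (people, a name made up here). Setting RS to the empty string selects paragraph mode, in which blank lines separate records.

```shell
# Hypothetical sample data in the same layout as people.txt.
people='P1
male
15

p2
female
20

p3
male
19'

# RS = "" selects paragraph mode: blank lines separate records.
# FS = "\n" makes each line inside a record one field.
records=$(printf '%s\n' "$people" | awk '
    BEGIN { RS = ""; FS = "\n"; OFS = "," }
    { print NR, $1, $2, $3 }')

printf '%s\n' "$records"
```

Because the three special variables are set in BEGIN, no -F or -v options are needed on the command line.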

Built-in functions

Awk provides built-in functions for common text and arithmetic tasks, including length() for the length of a string, rand() for random numbers, and sin() and cos() for sines and cosines.

The full list of functions can be found in the official manual.
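As a small illustration of the functions just named, the sketch below calls each of them from a BEGIN block, so no input file is needed. The shell variable builtin_demo and the seed value 42 are arbitrary choices made for this example. Since rand() returns a different value depending on the seed and the implementation, only its range is checked.

```shell
# All calls run in BEGIN, so awk reads no input.
builtin_demo=$(awk 'BEGIN {
    print length("daemon")     # number of characters in the string
    print sin(0), cos(0)       # trigonometric functions take radians
    srand(42)                  # seed the generator (42 is arbitrary)
    r = rand()                 # pseudo-random number, 0 <= r < 1
    print (r >= 0 && r < 1)    # prints 1 when r is in range
}')

printf '%s\n' "$builtin_demo"
```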

Filtering records

All of the examples above operate on every record. Records can also be filtered with conditions.

Regular expression matching

Records can be pattern-matched using regular expressions:

$ awk -F ':' '/root/ {print $1,$2,$3}' /etc/passwd 
root x 0

Only the records that contain root are selected.
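A bare /root/ pattern matches anywhere in the record. To restrict the match to a single field, awk's ~ operator can be used. The following sketch assumes the four sample lines are held in a shell variable (passwd_sample, a name made up here) and selects the users whose login shell, field 7, ends in nologin.

```shell
# Hypothetical sample data: the four /etc/passwd lines shown earlier.
passwd_sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# $7 ~ /nologin$/ is true only when field 7 ends with "nologin".
nologin_users=$(printf '%s\n' "$passwd_sample" |
    awk -F ':' '$7 ~ /nologin$/ { print $1 }')

printf '%s\n' "$nologin_users"
```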

Conditional expressions

Filtering can also be done in combination with awk's built-in variables and functions:

# Print every record after the first whose first field is longer than 3 characters
$ awk -F ':' 'length($1)>3 && NR>1 {print $0}' /etc/passwd
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
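Conditions can also compare fields against values passed in from the shell with -v. In the sketch below, min_uid is a user-defined variable invented for this example, and passwd_sample is a hypothetical shell variable holding the four sample lines. The condition keeps the records whose third field is at least min_uid.

```shell
# Hypothetical sample data: the four /etc/passwd lines shown earlier.
passwd_sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# -v min_uid=2 sets the variable before awk reads any input;
# both sides look numeric, so the comparison is numeric.
high_uids=$(printf '%s\n' "$passwd_sample" |
    awk -F ':' -v min_uid=2 '$3 >= min_uid { print $1, $3 }')

printf '%s\n' "$high_uids"
```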

If statement

Awk also provides if statements:

# Print the records whose third field is 0
$ awk -F ':' '{if ($3==0)print $0}' /etc/passwd
root:x:0:0:root:/root:/usr/bin/zsh
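An if statement can be paired with else, as in C. The sketch below is an illustration only: the labels it prints and the shell variable names passwd_sample and labels are made up here. It tags each sample record according to whether its third field is 0.

```shell
# Hypothetical sample data: the four /etc/passwd lines shown earlier.
passwd_sample='root:x:0:0:root:/root:/usr/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin'

# One branch runs for every record, depending on field 3.
labels=$(printf '%s\n' "$passwd_sample" | awk -F ':' '{
    if ($3 == 0)
        print $1 ": uid is 0"
    else
        print $1 ": uid is not 0"
}')

printf '%s\n' "$labels"
```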

Awk also has a for statement, which is similar to the form in C:

$ awk -v ORS="," 'BEGIN{ for(i=1; i<5; i++) print i}'
1,2,3,4,
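A common use of the for statement is to walk over the fields of a record, with NF as the upper bound. The sketch below assumes a single sample line piped in with printf, and stores the result in a shell variable (fields, a name made up here). It prints each field of the line together with its position.

```shell
# Loop from field 1 to field NF; $i is the i-th field.
fields=$(printf 'root:x:0:0:root:/root:/usr/bin/zsh\n' |
    awk -F ':' '{ for (i = 1; i <= NF; i++) print i ": " $i }')

printf '%s\n' "$fields"
```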

Strings and numbers

Awk supports mathematical and logical operators. Strings and numbers convert to each other freely: adding 0 (+0) converts a value to a number, and concatenating an empty string ("") converts it to a string:

$ awk 'BEGIN{print "origin\tnumber\tstring"}{print $0,"\t",$0+0,"\t",$0 ""}' people.txt
origin  number  string
P1       0       P1
male     0       male
15       15      15
         0       
p2       0       p2
female   0       female
20       20      20
         0       
p3       0       p3
male     0       male
19       19      19

Here $0 "" concatenates the original record with an empty string. Concatenation is written simply by placing expressions next to each other, which can cause problems in some cases. Consider the following statement:

$ awk 'BEGIN { print -12 " " -24 }'
-12-24

Here we want a space between -12 and -24, but we don’t get the desired result. This is because the subtraction operator has higher precedence than concatenation, so the expression is parsed as follows:

-12 (" " - 24)
=> -12 (0 - 24)
=> -12 (-24)
=> -12-24

To get the correct result, use parentheses:

$ awk 'BEGIN { print -12 " " (-24) }'
-12 -24
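The conversion and precedence rules above can be checked side by side. The sketch below uses made-up sample strings and a shell variable named conv (also invented here). It shows how +0 treats strings with and without a leading number, and repeats the precedence pitfall with smaller values.

```shell
conv=$(awk 'BEGIN {
    print "3 apples" + 0    # leading numeric prefix is used
    print "apples" + 0      # no numeric prefix, so the value is 0
    print 1 " " -1          # parsed as 1 (" " - 1): the space is lost
    print 1 " " (-1)        # parentheses keep the minus unary
}')

printf '%s\n' "$conv"
```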