How do I use AWK to delete duplicate lines from a file

Learn how to use awk '!visited[$0]++' to remove duplicate lines without sorting them or changing their original order.

Suppose you have a text file and you need to delete all the duplicate lines.

TL;DR

To remove duplicate lines while preserving their original order, use:

awk '!visited[$0]++' your_file > deduplicated_file

How it works

This script maintains an associative array whose indices (keys) are the distinct lines of the file and whose values are the number of times each line has occurred so far. For each line of the file, if the line's count is 0, the count is incremented and the line is printed; otherwise the count is incremented and the line is not printed.

I was not familiar with awk, and I wanted to understand how so short a script could accomplish this. I did some research, and here is what I found:
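The logic described above can also be written out as an explicit awk program. The following is only a longer sketch of the same idea, not a different algorithm; the sample input fed in through printf is made up for illustration.

```shell
# Long-form equivalent of the one-liner, written with an explicit if.
# The sample input is hypothetical.
deduplicated=$(printf '%s\n' apple banana apple cherry banana | awk '
{
    if (visited[$0] == 0) {   # first time this line content is seen
        print $0
    }
    visited[$0]++             # count the occurrence either way
}')
printf '%s\n' "$deduplicated"
# prints apple, banana, cherry (one per line, in first-seen order)
```

The one-liner folds the test and the increment into a single expression, which is why it can be so much shorter.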

This awk "script", !visited[$0]++, is executed for each line of the input file.
visited[] is a variable of the associative array (also known as map) type. awk initializes it the first time we use it, so we don't need to initialize it ourselves.
The $0 variable holds the contents of the line currently being processed.
visited[$0] accesses the value stored in the map under the key equal to $0 (the line being processed), that is, the number of occurrences (which we set below).
! negates the number of occurrences:
- In awk, any nonzero number or any non-empty string is true.
- The default initial value of a variable is an empty string, or 0 if converted to a number.
- In other words:
  - If the value of visited[$0] is a number greater than 0, the negation evaluates to false.
  - If the value of visited[$0] is 0 or an empty string, the negation evaluates to true.
++ increments the variable visited[$0] by 1:
- If the value is empty, awk automatically converts it to 0 and then increments it.
- Note: the increment is performed after the value of the variable has been read.

Overall, the whole expression evaluates to:

true: if the number of occurrences is 0 or an empty string
false: if the number of occurrences is greater than 0
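To see the negation and the post-increment interact, the expression can be evaluated twice for the same key inside a BEGIN block, which runs without reading any input. This is a small sketch; the key "A" and the variable names first and second are chosen only for illustration.

```shell
# Evaluate the same expression twice for one key and record each result.
result=$(awk 'BEGIN {
    first  = !visited["A"]++   # value read is empty (0), so the negation is 1; the count becomes 1
    second = !visited["A"]++   # value read is 1, so the negation is 0; the count becomes 2
    print first, second, visited["A"]
}')
printf '%s\n' "$result"   # prints: 1 0 2
```

The first evaluation is true (1) and the second is false (0), which is exactly what decides whether a line is printed.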

An awk statement consists of a pattern or expression and an associated action:

<pattern/expression> { <action> }

If the pattern matches, the associated action is performed. If the action is omitted, awk prints the input line by default.

An omitted action is equivalent to { print $0 }.

Our script consists of a single awk statement with an expression and no action. So writing this:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to writing this:

awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

For each line of the file, if the expression evaluates to true, the line is printed to the output. Otherwise, the action is not performed and nothing is printed.
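A quick way to convince yourself that the implicit and explicit forms behave the same is to run both on the same input and compare the results. The sample lines below are hypothetical.

```shell
# Hypothetical sample input with repeated lines.
sample=$(printf '%s\n' one two one three two)

# Same expression, once with the action omitted and once with it spelled out.
implicit=$(printf '%s\n' "$sample" | awk '!visited[$0]++')
explicit=$(printf '%s\n' "$sample" | awk '!visited[$0]++ { print $0 }')

if [ "$implicit" = "$explicit" ]; then
    echo "same output"
fi
printf '%s\n' "$implicit"
# prints one, two, three (one per line)
```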

Why not use the uniq command?

The uniq command can deduplicate only adjacent lines. Here’s an example:

$ cat test.txt
A
A
A
B
B
B
A
A
C
C
C
B
B
A
$ uniq < test.txt
A
B
A
C
B
A
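For comparison, here is a sketch of the awk one-liner applied to the same 14 lines. The lines are supplied inline through printf rather than read from test.txt, so the snippet runs on its own.

```shell
# Same 14 lines as test.txt, supplied inline instead of read from the file.
awk_result=$(printf '%s\n' A A A B B B A A C C C B B A | awk '!visited[$0]++')
printf '%s\n' "$awk_result"
# prints A, B, C (one per line)
```

Where uniq leaves six lines because the repeats are not all adjacent, awk keeps each distinct line only once.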

Other methods

Using the sort command

We could also remove duplicate lines with the following sort command, but the original line order is not preserved.

sort -u your_file > sorted_deduplicated_file
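A small sketch of the reordering, using made-up input. LC_ALL=C is set here only to pin the collation order so the result is predictable; it is not part of the original command.

```shell
# Hypothetical input; LC_ALL=C pins the collation order so the result is predictable.
sorted=$(printf '%s\n' pear apple pear banana apple | LC_ALL=C sort -u)
printf '%s\n' "$sorted"
# prints apple, banana, pear: unique, but no longer in first-seen order (pear, apple, banana)
```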

Using cat + sort + cut

The previous approach produces a deduplicated file whose lines are sorted by content. Piping several commands together solves this problem:

cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-

How it works

Suppose we have the following file:

abc
ghi
abc
def
xyz
def
ghi
klm

cat -n test.txt prepends a line number to each line:

1       abc
2       ghi
3       abc
4       def
5       xyz
6       def
7       ghi
8       klm

sort -uk2 sorts the lines by the second column (the k2 option) and keeps only one line for each distinct second-column value (the u option):

1       abc
4       def
2       ghi
8       klm
5       xyz

sort -nk1 sorts the lines by the first column (the k1 option), treating the column values as numbers (the -n option):

1       abc
2       ghi
4       def
5       xyz
8       klm

Finally, cut -f2- prints each line from the second column to the end (the -f2- option: note the - suffix, which tells cut to include the rest of the line):

abc
ghi
def
xyz
klm
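The whole pipeline can be checked end to end against the awk one-liner. The sketch below recreates the sample file in a temporary location (the use of mktemp is an assumption made for illustration). Because the standard does not fix which of several equal lines sort -u keeps (GNU sort is documented to output the first of a run of equal lines), the snippet compares the two results instead of assuming they match.

```shell
# Recreate the sample file from this section in a temporary location.
tmpfile=$(mktemp)
printf '%s\n' abc ghi abc def xyz def ghi klm > "$tmpfile"

pipeline_result=$(cat -n "$tmpfile" | sort -uk2 | sort -nk1 | cut -f2-)
awk_output=$(awk '!visited[$0]++' "$tmpfile")
rm -f "$tmpfile"

printf '%s\n' "$awk_output"   # prints abc, ghi, def, xyz, klm (one per line)

# Which duplicate sort -u keeps is not guaranteed everywhere, so compare instead of assuming.
if [ "$pipeline_result" = "$awk_output" ]; then
    echo "pipeline and awk agree"
else
    echo "pipeline and awk differ on this system"
fi
```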

Via: opensource.com/article/19/…

Author: Lazarus Lazaridis lujun9972

This article was originally translated by LCTT and published by Linux China.