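The Excel table used in this article contains 12 records, each with a category, a title and an author. The number on the left is the pandas index; because row 1 of the sheet holds the column headers, index 0 corresponds to Excel row 2:

0   History  Historical Records                Sima Qian
1   Novel    Dream of the Red Chamber          Cao Xueqin
2   Prose    Cultural Journey                  Yu Qiuyu
3   History  Those Things in the Ming Dynasty  Dangnian Mingyue
4   Comics   Half-Hour Comics                  Hunzi Yue
5   Essay    Bacon's Essays                    Bacon
6   Novel    Dream of the Red Chamber          Cao Xueqin
7   Prose    Cultural Journey                  Yu Qiuyu
8   History  Those Things in the Ming Dynasty  Dangnian Mingyue
9   History  Zizhi Tongjian                    Sima Guang
10  Prose    Walker Without Boundaries         Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry     Hunzi Yue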

1 Field Selection

1.1 Removing rows that are duplicated across all fields

import pandas as pd

# Read the sheet, then drop rows that are identical in every column
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates())

result:

0   History  Historical Records                Sima Qian
1   Novel    Dream of the Red Chamber          Cao Xueqin
2   Prose    Cultural Journey                  Yu Qiuyu
3   History  Those Things in the Ming Dynasty  Dangnian Mingyue
4   Comics   Half-Hour Comics                  Hunzi Yue
5   Essay    Bacon's Essays                    Bacon
9   History  Zizhi Tongjian                    Sima Guang
10  Prose    Walker Without Boundaries         Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry     Hunzi Yue

By default, the drop_duplicates() method compares the values of all columns and, for each group of identical rows, keeps only the first occurrence.
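To see which rows the default call treats as duplicates, the following sketch rebuilds the table in memory instead of reading the Excel file. The column names category, title and author and the English spellings of the values are assumptions made for illustration. duplicated() is the pandas method that returns a boolean mask of the rows drop_duplicates() would remove.

```python
import pandas as pd

# In-memory stand-in for the Excel sheet; the column names are assumed.
rows = [
    ("History", "Historical Records", "Sima Qian"),
    ("Novel", "Dream of the Red Chamber", "Cao Xueqin"),
    ("Prose", "Cultural Journey", "Yu Qiuyu"),
    ("History", "Those Things in the Ming Dynasty", "Dangnian Mingyue"),
    ("Comics", "Half-Hour Comics", "Hunzi Yue"),
    ("Essay", "Bacon's Essays", "Bacon"),
    ("Novel", "Dream of the Red Chamber", "Cao Xueqin"),
    ("Prose", "Cultural Journey", "Yu Qiuyu"),
    ("History", "Those Things in the Ming Dynasty", "Dangnian Mingyue"),
    ("History", "Zizhi Tongjian", "Sima Guang"),
    ("Prose", "Walker Without Boundaries", "Yu Qiuyu"),
    ("Comics", "Half-Hour Comics: Tang Poetry", "Hunzi Yue"),
]
df = pd.DataFrame(rows, columns=["category", "title", "author"])

# True for every row that repeats an earlier row in all columns.
mask = df.duplicated()
dropped_labels = df.index[mask].tolist()

# drop_duplicates() keeps exactly the rows where the mask is False.
kept_labels = df.drop_duplicates().index.tolist()

print(dropped_labels)  # [6, 7, 8]
print(kept_labels)     # [0, 1, 2, 3, 4, 5, 9, 10, 11]
```

Note that the surviving rows keep their original index labels, which is why the printed result jumps from 5 to 9.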

1.2 Removing duplicates based on a single column

To deduplicate on a single column, pass the column name to the subset parameter.

import pandas as pd

# Only the "category" column decides whether a row is a duplicate
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates(subset="category"))

result:

0  History  Historical Records        Sima Qian
1  Novel    Dream of the Red Chamber  Cao Xueqin
2  Prose    Cultural Journey          Yu Qiuyu
4  Comics   Half-Hour Comics          Hunzi Yue
5  Essay    Bacon's Essays            Bacon

Here only the specified column is considered: once a category value has appeared, every later row with the same category is deleted, regardless of whether its other columns differ. For example, Excel rows 2, 5, 10 and 11 (index 0, 3, 8 and 9 in the output) are all in the History category. Even though most of them differ in title and author, only row 2 (index 0) is kept and the others are deleted.
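A minimal sketch of the same behavior on a small in-memory table. The column names and the five sample rows are assumptions chosen for illustration, not the contents of the Excel file.

```python
import pandas as pd

# Hypothetical sample: three History rows with different titles and authors.
df = pd.DataFrame(
    {
        "category": ["History", "Novel", "History", "Prose", "History"],
        "title": [
            "Historical Records",
            "Dream of the Red Chamber",
            "Those Things in the Ming Dynasty",
            "Cultural Journey",
            "Zizhi Tongjian",
        ],
        "author": ["Sima Qian", "Cao Xueqin", "Dangnian Mingyue", "Yu Qiuyu", "Sima Guang"],
    }
)

# Rows are compared on "category" only; the first row of each category survives.
by_category = df.drop_duplicates(subset="category")
kept_labels = by_category.index.tolist()

print(kept_labels)                          # [0, 1, 3]
print(by_category["category"].is_unique)    # True
```

The result always has one row per distinct value of the chosen column, so its length equals df["category"].nunique().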

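1.3 Removing duplicates based on multiple columns

import pandas as pd

# A row is a duplicate only if both its category and its author match an earlier row
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates(subset=["category", "author"]))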

result:

0  History  Historical Records                Sima Qian
1  Novel    Dream of the Red Chamber          Cao Xueqin
2  Prose    Cultural Journey                  Yu Qiuyu
3  History  Those Things in the Ming Dynasty  Dangnian Mingyue
4  Comics   Half-Hour Comics                  Hunzi Yue
5  Essay    Bacon's Essays                    Bacon
9  History  Zizhi Tongjian                    Sima Guang

In this case, because multiple columns are specified, a row counts as a duplicate only when the combination of values in all the specified columns has already appeared; a match in a single column is not enough. Excel rows 2, 5 and 11 (index 0, 3 and 9 in the output) all belong to the History category, but they are not treated as duplicates because their authors differ. Excel rows 4, 9 and 12 (index 2, 7 and 10) are all prose by Yu Qiuyu, so only row 4 is kept and rows 9 and 12 are deleted.
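The difference between a single-column and a multi-column subset can be sketched on a small in-memory table. The column names and the four sample rows below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: two History rows by different authors,
# two Prose rows by the same author.
df = pd.DataFrame(
    {
        "category": ["History", "History", "Prose", "Prose"],
        "title": [
            "Historical Records",
            "Zizhi Tongjian",
            "Cultural Journey",
            "Walker Without Boundaries",
        ],
        "author": ["Sima Qian", "Sima Guang", "Yu Qiuyu", "Yu Qiuyu"],
    }
)

# One column: the second History row and the second Prose row are dropped.
by_category = df.drop_duplicates(subset="category").index.tolist()

# Two columns: only the (Prose, Yu Qiuyu) pair repeats, so only row 3 is dropped.
by_category_author = df.drop_duplicates(subset=["category", "author"]).index.tolist()

print(by_category)         # [0, 2]
print(by_category_author)  # [0, 1, 2]
```

Columns that are not listed in subset, such as title here, play no part in the comparison.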

2 Choosing which record to keep

The keep parameter controls which of the duplicated rows, if any, is retained.

2.1 Default

import pandas as pd

# No keep argument: the default behavior applies
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates())

result:

0   History  Historical Records                Sima Qian
1   Novel    Dream of the Red Chamber          Cao Xueqin
2   Prose    Cultural Journey                  Yu Qiuyu
3   History  Those Things in the Ming Dynasty  Dangnian Mingyue
4   Comics   Half-Hour Comics                  Hunzi Yue
5   Essay    Bacon's Essays                    Bacon
9   History  Zizhi Tongjian                    Sima Guang
10  Prose    Walker Without Boundaries         Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry     Hunzi Yue

As the result shows, the default is to keep the first occurrence. For example, Excel rows 4 and 9 (index 2 and 7 in the output) are identical, so row 4 is kept and row 9 is deleted.

2.2 When keep is 'first'

import pandas as pd

# keep='first' is the default: the first occurrence of each duplicate survives
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates(keep='first'))

result:

0   History  Historical Records                Sima Qian
1   Novel    Dream of the Red Chamber          Cao Xueqin
2   Prose    Cultural Journey                  Yu Qiuyu
3   History  Those Things in the Ming Dynasty  Dangnian Mingyue
4   Comics   Half-Hour Comics                  Hunzi Yue
5   Essay    Bacon's Essays                    Bacon
9   History  Zizhi Tongjian                    Sima Guang
10  Prose    Walker Without Boundaries         Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry     Hunzi Yue

As the result shows, when keep is 'first', the first occurrence of each duplicated record is retained, which is the same as the default. For example, Excel rows 4 and 9 (index 2 and 7 in the output) are identical, so row 4 is kept and row 9 is deleted.

2.3 When keep is 'last'

import pandas as pd

# keep='last': the last occurrence of each duplicate survives
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates(keep='last'))

result:

0   History  Historical Records                Sima Qian
4   Comics   Half-Hour Comics                  Hunzi Yue
5   Essay    Bacon's Essays                    Bacon
6   Novel    Dream of the Red Chamber          Cao Xueqin
7   Prose    Cultural Journey                  Yu Qiuyu
8   History  Those Things in the Ming Dynasty  Dangnian Mingyue
9   History  Zizhi Tongjian                    Sima Guang
10  Prose    Walker Without Boundaries         Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry     Hunzi Yue

As the result shows, when keep is 'last', the last occurrence of each duplicated record is retained. For example, Excel rows 4 and 9 (index 2 and 7 in the output) are identical, so row 9 is kept and row 4 is deleted.

2.4 When keep is False

import pandas as pd

# keep=False: every row that has a duplicate is removed, including the first occurrence
df = pd.read_excel(r'C:\Users\admin\Desktop\data analysis test table.xlsx')
print(df.drop_duplicates(keep=False))

result:

0   History  Historical Records             Sima Qian
4   Comics   Half-Hour Comics               Hunzi Yue
5   Essay    Bacon's Essays                 Bacon
9   History  Zizhi Tongjian                 Sima Guang
10  Prose    Walker Without Boundaries      Yu Qiuyu
11  Comics   Half-Hour Comics: Tang Poetry  Hunzi Yue

As the result shows, when keep is False, every record that has a duplicate is removed and no copy is retained. For example, Excel rows 4 and 9 (index 2 and 7 in the output) are identical, so both rows are deleted.
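The three keep settings can be compared side by side on a small in-memory table. This is only a sketch: the column names and the four sample rows are assumptions, with rows 1 and 3 made identical on purpose.

```python
import pandas as pd

# Hypothetical sample in which rows 1 and 3 are exact duplicates.
df = pd.DataFrame(
    {
        "title": ["Historical Records", "Cultural Journey", "Bacon's Essays", "Cultural Journey"],
        "author": ["Sima Qian", "Yu Qiuyu", "Bacon", "Yu Qiuyu"],
    }
)

# Collect the surviving index labels for each keep setting.
kept = {
    option: df.drop_duplicates(keep=option).index.tolist()
    for option in ("first", "last", False)
}

for option, labels in kept.items():
    print(f"keep={option!r}: {labels}")
# keep='first': [0, 1, 2]
# keep='last': [0, 2, 3]
# keep=False: [0, 2]
```

Only keep=False changes how many rows survive; 'first' and 'last' differ only in which copy of the duplicated record remains.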

Note: keep accepts only 'first', 'last' or False. True is not a valid value, so do not assume that True is allowed just because False is.
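A quick way to confirm this is to pass keep=True and catch the resulting error. The sketch below uses a hypothetical two-row table and assumes that pandas rejects the value with a ValueError.

```python
import pandas as pd

# Hypothetical two-row sample; any non-empty DataFrame would do.
df = pd.DataFrame(
    {
        "category": ["Prose", "Prose"],
        "author": ["Yu Qiuyu", "Yu Qiuyu"],
    }
)

try:
    df.drop_duplicates(keep=True)
    error_message = None  # reached only if pandas accepted keep=True
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the goal is to remove every copy of a duplicated record, keep=False is the value to use.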