12. Handling duplicate values in Pandas

Pandas provides two functions for handling duplicate values:

  • duplicated(): marks which values are duplicates
  • drop_duplicates(): removes duplicate values
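As a minimal sketch of the two functions on a small hand-built Series (the data here is illustrative, not the article's):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

# duplicated(): marks repeats of an earlier value as True
print(s.duplicated().tolist())       # [False, False, True, False, True]

# drop_duplicates(): keeps only the first occurrence of each value
print(s.drop_duplicates().tolist())  # ['a', 'b', 'c']
```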

Pandas series

This article is part of a serialized Pandas series; the earlier installments are worth reading as well.

Simulated data

Two different datasets are simulated for this article:

1. A copy of order data, which will be used later

import pandas as pd
import numpy as np

Import the simulated order data and set it aside for later:

df1 = pd.read_excel("Order duplicate value.xlsx")
df1

2. Another simulated dataset:

df2 = pd.DataFrame(np.ones([10, 2]),  # generate 10*2 data, all 1s
                   columns=["col1", "col2"]
                  )
df2

Two more columns are then added, both filled by random sampling from a list:

# Add two columns
list1 = ["a", "b"]
list2 = [2, 3]

# Randomly select 10 elements from each list
df2["col3"] = np.random.choice(list1, 10)
df2["col4"] = np.random.choice(list2, 10)

df2

duplicated()

The duplicated() function checks whether there are duplicate values in the data. It marks whether values in a Series, or rows in a DataFrame, are duplicates.

Each row is compared against the records that come before it.

Syntax

For DataFrame type data:

pandas.DataFrame.duplicated(subset=None, keep='first')

Or for Series data:

pandas.Series.duplicated(keep='first')

The three values of the keep parameter behave as follows:

  • "first": marks duplicates as True, except for the first occurrence
  • "last": marks duplicates as True, except for the last occurrence
  • False: marks all duplicates as True

Identifying duplicate data

This function determines which data is duplicated: duplicates are marked True, everything else False.

The subset parameter

df2.duplicated(subset=["col3"])  # Check the col3 column on its own for duplicates

# the results
0    False
1     True
2    False
3     True
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool
  
df2.duplicated(subset=["col1"])  # col1 alone: all values are 1, so every row after the first is a duplicate
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool

The two examples above check a single column. The following example checks multiple columns together:

df2.duplicated(subset=["col3", "col4"])  # Check col3 and col4 together

0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9     True
dtype: bool

The keep parameter

df2.duplicated(subset=["col3"],keep="last") 

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False   # last occurrence
8     True
9    False   # last occurrence
dtype: bool

  • The keep argument above uses "last": the last occurrence of each value is treated as the original, and earlier matching rows are marked as duplicates of it
  • In the example below, keep uses "first" (the default): the first occurrence is treated as the original, and later matching rows are marked True
df2.duplicated(subset=["col3"],keep="first")   # Default is first

0    False  # first occurrence
1     True
2    False  # first occurrence
3     True
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool
  
df2.duplicated(subset=["col3"],keep=False)   # Mark all duplicate values as True

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool

drop_duplicates()

This function removes duplicate values from the data.

Syntax

  • subset: one or more column labels used to identify duplicates; defaults to all columns
  • keep: which occurrence to retain after deduplication; defaults to "first"
  • inplace: whether to modify the original data in place or return a new object; defaults to False
  • ignore_index: whether the result's index is reset to the natural numbers 0, 1, 2, …, n-1; defaults to False

The official pandas documentation gives the full parameter descriptions.

Using all default parameters

Calling drop_duplicates() with all defaults has the following characteristics:

  • the surviving rows keep their original index labels
  • for each group of duplicates, the first occurrence is retained (when duplicates exist)
  • all columns are used to decide whether rows are duplicates
  • rows are compared from top to bottom, so later matching rows are the ones dropped
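The original output for this step is not reproduced here, so below is a minimal sketch using a deterministic stand-in for df2 (the article's col3/col4 are randomly sampled):

```python
import pandas as pd

# Deterministic stand-in for the article's df2 (its col3/col4 are random)
df2 = pd.DataFrame({
    "col1": [1.0] * 6,
    "col2": [1.0] * 6,
    "col3": ["a", "a", "b", "b", "a", "b"],
    "col4": [2, 3, 2, 2, 3, 2],
})

# All defaults: every column is compared, the first occurrence is kept,
# and the surviving rows retain their original index labels
result = df2.drop_duplicates()
print(result.index.tolist())  # [0, 1, 2]
```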

The subset parameter

subset specifies which columns are used when dropping duplicates:

1. Deduplicate by a single column

2. Deduplicate by multiple columns
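Since the code for these two cases is not shown above, here is a sketch with a fixed stand-in for df2 (deterministic values in place of the article's random sampling):

```python
import pandas as pd

# Deterministic stand-in for the article's df2
df2 = pd.DataFrame({
    "col3": ["a", "a", "b", "b", "a", "b"],
    "col4": [2, 3, 2, 2, 3, 2],
})

# 1. Deduplicate by a single column: only col3 is compared
single = df2.drop_duplicates(subset=["col3"])
print(single.index.tolist())  # [0, 2] -- first "a" and first "b"

# 2. Deduplicate by multiple columns: the (col3, col4) pair is compared
multi = df2.drop_duplicates(subset=["col3", "col4"])
print(multi.index.tolist())   # [0, 1, 2]
```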

The keep parameter

The keep parameter controls which occurrence is retained: the first or the last.

1. keep="first"

2. keep="last"

Checking with duplicated() which rows are duplicates, you can see that the rows at indexes 7 and 9 are False because they are the last occurrences.
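A sketch of the two keep settings, again with a fixed stand-in for df2:

```python
import pandas as pd

df2 = pd.DataFrame({"col3": ["a", "a", "b", "b", "a", "b"]})

# keep="first" (the default): the first occurrence of each value survives
first = df2.drop_duplicates(subset=["col3"], keep="first")
print(first.index.tolist())  # [0, 2]

# keep="last": the last occurrence survives instead
last = df2.drop_duplicates(subset=["col3"], keep="last")
print(last.index.tolist())   # [4, 5]
```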

The ignore_index parameter

This parameter controls whether the result keeps the index of the original data or is renumbered from scratch.
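A sketch of the difference, with a fixed stand-in for df2:

```python
import pandas as pd

df2 = pd.DataFrame({"col3": ["a", "a", "b", "b", "a", "b"]})

# Default ignore_index=False: survivors keep their original index labels
kept = df2.drop_duplicates(subset=["col3"], keep="last")
print(kept.index.tolist())        # [4, 5]

# ignore_index=True: the result is renumbered 0, 1, ..., n-1
renumbered = df2.drop_duplicates(subset=["col3"], keep="last", ignore_index=True)
print(renumbered.index.tolist())  # [0, 1]
```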

The inplace parameter

With the default value False, drop_duplicates() returns a new object and leaves the original untouched.

If inplace=True, nothing is returned, because the original data is modified directly; look at df2 itself to see the result.
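A sketch with a small stand-in DataFrame:

```python
import pandas as pd

df2 = pd.DataFrame({"col3": ["a", "a", "b"]})

# inplace=True returns None; df2 itself is modified
result = df2.drop_duplicates(subset=["col3"], inplace=True)
print(result)                # None
print(df2["col3"].tolist())  # ['a', 'b']
```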

Practical cases

We imported the order data at the beginning of the article; the requirements are:

  • Each order may have several statuses, or only one
  • We want all the rows for orders whose final status is "passed"

For example, order S1 has 3 statuses, two of which are "passed", but we only want the most recent one: 2021-01-06.

Solution step 1: first find all the approved orders; only S7 failed.

The following code finds which orders were ultimately approved:

order_pass = df1.query("Status == 'passed'")["Order Number"].unique()
order_pass

Solution step 2: filter out the rows belonging to the approved orders. Two methods are available.
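The two filtering methods can be sketched as follows; since the Excel file is not available here, a small stand-in for df1 is built by hand (the "Date" column and the exact values are assumptions):

```python
import pandas as pd

# Hand-built stand-in for the order data read from Excel
df1 = pd.DataFrame({
    "Order Number": ["S1", "S1", "S1", "S7", "S7"],
    "Status": ["created", "passed", "passed", "created", "rejected"],
    "Date": ["2021-01-01", "2021-01-03", "2021-01-06",
             "2021-01-02", "2021-01-04"],
})

order_pass = df1.query("Status == 'passed'")["Order Number"].unique()

# Method 1: boolean mask with isin()
df3 = df1[df1["Order Number"].isin(order_pass)]

# Method 2: query() with backticks for the spaced column name
#           and @ to reference the local variable
df3 = df1.query("`Order Number` in @order_pass")

print(df3["Order Number"].unique().tolist())  # ['S1']
```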

Solution step 3: deduplicate df3, keeping only each order's last record.

df3.drop_duplicates(
  subset="Order Number",  # deduplicate by order number
  keep="last",            # keep the last occurrence
  inplace=True,           # modify df3 in place
  ignore_index=True       # renumber the index
)

df3  # no S7 in the result