This is the 7th day of my participation in the November Gwen Challenge. Check out the details: The last Gwen Challenge 2021
Freshly crawled data from the web usually needs cleaning before it can be used. The data here has four columns: name (the title), author, grade (the rating), and stats (how many people have viewed it).
Read the data
Use pandas' read_csv function to read the data. The usecols parameter selects specific columns to read; by default, all columns are read.
import pandas as pd
df = pd.read_csv("foodInfo.csv", usecols=['name', 'author', 'grade', 'stats'])
print(df.head())
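To see what usecols changes, here is a minimal sketch that reads the same CSV text with and without it. The sample rows and the extra url column are made up for illustration; they are not the contents of foodInfo.csv.

```python
import io
import pandas as pd

# Hypothetical sample standing in for foodInfo.csv; the "url" column is invented
csv_text = """name,author,grade,stats,url
Braised pork,Alice,4.5,1200,http://example.com/1
Tomato soup,Bob,4.0,800,http://example.com/2
"""

# Without usecols, every column in the file is read
all_cols = pd.read_csv(io.StringIO(csv_text))

# With usecols, only the listed columns are loaded
some_cols = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "author", "grade", "stats"],
)

print(list(all_cols.columns))
print(list(some_cols.columns))
```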
Duplicate removal
print(df.duplicated().value_counts())
The output shows that there are 103 rows and that one of them is a duplicate. To see which row it is, inspect the boolean Series returned by df.duplicated().
df.drop_duplicates(keep='first', inplace=True)
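drop_duplicates has three commonly used parameters for different deduplication scenarios:
subset: a list of column names; a row counts as a duplicate only if all the specified columns are identical (by default, all columns are compared)
keep: 'first', 'last', or False; the default is 'first'. 'first' keeps the first occurrence, 'last' keeps the last one, and False drops every duplicated row
inplace: True modifies the original data directly; with False (the default) the result is returned and must be assigned to a variable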
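The effect of subset and keep is easier to see on a small example. This is a sketch on made-up rows, not the crawled data:

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "name": ["Braised pork", "Braised pork", "Braised pork", "Tomato soup"],
    "author": ["Alice", "Alice", "Bob", "Bob"],
    "grade": [4.5, 4.5, 4.2, 4.0],
})

# Default: a row is a duplicate only if every column matches; the first copy is kept
full = sample.drop_duplicates()

# subset: compare only the "name" column and keep the last occurrence
by_name_last = sample.drop_duplicates(subset=["name"], keep="last")

# keep=False: drop every row whose "name" appears more than once
no_dupes = sample.drop_duplicates(subset=["name"], keep=False)

print(full)
print(by_name_last)
print(no_dupes)
```

Because inplace is left at its default here, sample itself is unchanged and each call returns a new DataFrame.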
Missing value handling
print(df.isnull().any())
Delete missing values: dropna
df.dropna(how='any', inplace=True)
axis: 0 drops rows, 1 drops columns; the default is 0 (rows)
subset: restricts the check to specific columns, so only missing values in those columns cause a row to be dropped
how: 'any' (the default) drops a row if it contains at least one missing value; 'all' drops a row only if every value in it is missing
thresh: the minimum number of non-missing values a row must have to be kept; rows with fewer are dropped
inplace: True modifies the original data directly; with False (the default) the result is returned and must be assigned to a variable
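The dropna parameters above can be compared side by side. This is a minimal sketch on invented rows; the dish and author names are placeholders:

```python
import numpy as np
import pandas as pd

# Invented rows with 0, 1, 2 and 3 missing values respectively
sample = pd.DataFrame({
    "name": ["Braised pork", "Tomato soup", "Fried rice", None],
    "author": ["Alice", None, None, None],
    "grade": [4.5, 4.0, np.nan, np.nan],
})

# how="any": drop a row if at least one value is missing
any_missing = sample.dropna(how="any")

# how="all": drop a row only if every value is missing
all_missing = sample.dropna(how="all")

# thresh=2: keep rows that have at least two non-missing values
at_least_two = sample.dropna(thresh=2)

# subset: only look at the "author" column when deciding
author_known = sample.dropna(subset=["author"])

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(author_known.index.tolist())
```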
Fill missing values: fillna
Here I specify a value to replace the missing value: the missing grade is filled with the average grade that the same author received in the rest of the data
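def fillByAuthor(author):
    # Average grade over this author's rows that have a grade
    count = 0
    total = 0.0
    for i in range(len(df)):
        # Skip rows with a missing grade (this condition was lost in the original and is reconstructed)
        if pd.isnull(df.grade.iloc[i]):
            continue
        if df.author.iloc[i] == author:
            count = count + 1
            total = total + df.grade.iloc[i]
    return round(total / count, 2)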
# author_name is a placeholder for the author whose grade is missing (the original argument was garbled)
a = fillByAuthor(author_name)
df.fillna(a, inplace=True)
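The same idea, filling a missing grade with its author's average, can also be written without an explicit loop. The following is a sketch of an alternative using groupby and transform, not the approach used in this post, and the sample rows are made up. Unlike a single df.fillna(a) call, it fills each missing grade with the average of its own author.

```python
import numpy as np
import pandas as pd

# Made-up rows: Alice and Bob each have one missing grade
sample = pd.DataFrame({
    "author": ["Alice", "Alice", "Alice", "Bob", "Bob"],
    "grade": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# Per-author mean aligned to every row; missing grades are skipped when averaging
author_mean = sample.groupby("author")["grade"].transform("mean").round(2)

# Fill each missing grade with the mean of the same author
sample["grade"] = sample["grade"].fillna(author_mean)

print(sample)
```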
inplace: True modifies the original data directly; with False (the default) the result is returned and must be assigned to a variable
method: 'pad'/'ffill' fills a missing value with the previous non-missing value; 'backfill'/'bfill' fills it with the next non-missing value
With method=None (the default), the value you pass is used to replace the missing values
limit: caps how many missing values are filled
axis: the axis along which to fill
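The filling options above can be sketched on a small Series. Newer pandas releases deprecate the method argument of fillna in favor of the dedicated ffill() and bfill() methods, so this sketch uses those; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up grades with gaps
grades = pd.Series([4.0, np.nan, np.nan, 3.0, np.nan])

# Replace every missing value with a fixed value
constant = grades.fillna(0.0)

# Forward fill: copy the previous non-missing value
forward = grades.ffill()

# Backward fill: copy the next non-missing value (the last gap has none)
backward = grades.bfill()

# limit=1: fill at most one consecutive missing value per gap
forward_limited = grades.ffill(limit=1)

print(constant.tolist())
print(forward.tolist())
print(backward.tolist())
print(forward_limited.tolist())
```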
Save as
df.to_csv("clean_data.csv")
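One detail worth knowing when saving: to_csv also writes the row index by default, and reading the file back then shows it as an extra unnamed column. The sketch below, on a made-up one-row frame and temporary file names, compares the default with index=False.

```python
import os
import tempfile
import pandas as pd

# Made-up cleaned data with a non-default index label
cleaned = pd.DataFrame({"name": ["Braised pork"], "grade": [4.5]}, index=[7])

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    # Default: the index is written as the first column
    cleaned.to_csv(with_index)
    # index=False: only the data columns are written
    cleaned.to_csv(without_index, index=False)

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```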