This is the 7th day of my participation in the November Gwen Challenge. Check out the details: The last Gwen Challenge 2021

Data freshly crawled from the web usually needs cleaning before it can be used. The data here has four fields: name (the title), author, grade (the rating), and stats (how many people have viewed it).

Read the data

Use pandas' read_csv function to read the data. The usecols parameter selects which columns to read; by default, all columns are read.

import pandas as pd
df = pd.read_csv("foodInfo.csv", usecols=['name', 'author', 'grade', 'stats'])

print(df.head())
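To see what usecols does without needing the real file, here is a minimal sketch that reads from an in-memory string instead. The sample rows and the extra url column are invented for illustration and are not part of foodInfo.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for foodInfo.csv; the "url" column is made up for the demo.
csv_text = """name,author,grade,stats,url
Braised pork,Alice,4.5,1200,http://example.com/1
Egg fried rice,Bob,4.0,800,http://example.com/2
"""

# Only the listed columns are parsed; "url" is skipped entirely.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "author", "grade", "stats"],
)
print(sample.columns.tolist())
print(sample.head())
```

Leaving usecols out would load all five columns, including url.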

Duplicate removal

print(df.duplicated().value_counts())

The output shows 103 rows, one of which is a duplicate. df.duplicated() returns a boolean Series with one entry per row, so it can also be used to check which row is the duplicate.
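As a rough illustration of inspecting duplicates before removing them, the sketch below uses a tiny invented DataFrame (not the crawled data). Passing keep=False to duplicated() marks every copy, which makes the repeated rows easy to look at side by side.

```python
import pandas as pd

# Invented sample data; rows 0 and 2 are identical.
dup_demo = pd.DataFrame({
    "name": ["Braised pork", "Egg fried rice", "Braised pork"],
    "author": ["Alice", "Bob", "Alice"],
    "grade": [4.5, 4.0, 4.5],
})

# duplicated() marks every repeat after the first occurrence as True.
mask = dup_demo.duplicated()
print(mask.value_counts())

# keep=False marks all copies of a duplicated row, including the first one.
print(dup_demo[dup_demo.duplicated(keep=False)])
```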

df.drop_duplicates(keep='first', inplace=True)

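drop_duplicates has three commonly used parameters for different scenarios:

subset: a list of column names; a row is treated as a duplicate only if all the specified columns are identical. By default, all columns are compared.

keep: one of 'first', 'last', or False; the default is 'first'. 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every copy.

inplace: True modifies the original DataFrame directly; False (the default) returns a new DataFrame, which must be assigned to a variable.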
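The effect of subset and keep is easier to see on a small example. This is only a sketch on invented data, where the first two rows share name and author but differ in grade.

```python
import pandas as pd

# Invented sample data; rows 0 and 1 differ only in grade.
recipes = pd.DataFrame({
    "name": ["Braised pork", "Braised pork", "Egg fried rice"],
    "author": ["Alice", "Alice", "Bob"],
    "grade": [4.5, 4.8, 4.0],
})

# Comparing all columns finds no duplicates, because the grades differ.
all_cols = recipes.drop_duplicates()

# Comparing only name and author treats rows 0 and 1 as duplicates.
keep_first = recipes.drop_duplicates(subset=["name", "author"], keep="first")
keep_last = recipes.drop_duplicates(subset=["name", "author"], keep="last")
keep_none = recipes.drop_duplicates(subset=["name", "author"], keep=False)

print(len(all_cols), len(keep_first), len(keep_last), len(keep_none))
```

None of these calls pass inplace=True, so recipes itself is left unchanged and each result is a new DataFrame.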

Missing value handling

print(df.isnull().any())
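isnull().any() only says whether a column has missing values. If the count matters too, isnull().sum() is a common companion. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Invented sample data with one missing author and one missing grade.
missing_demo = pd.DataFrame({
    "name": ["Braised pork", "Egg fried rice", "Tomato soup"],
    "author": ["Alice", None, "Bob"],
    "grade": [4.5, 4.0, np.nan],
    "stats": [1200, 800, 350],
})

# True for each column that contains at least one missing value.
has_missing = missing_demo.isnull().any()
# Number of missing values in each column.
missing_count = missing_demo.isnull().sum()

print(has_missing)
print(missing_count)
```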

Delete missing values: dropna

df.dropna(how='any', inplace=True)

axis: 0 drops rows, 1 drops columns; the default is 0.

subset: only missing values in the specified columns are considered when deciding which rows to drop.

how: 'any' drops a row if it contains at least one missing value; 'all' drops a row only if all of its values are missing.

thresh: the minimum number of non-missing values a row must have to be kept; rows with fewer are dropped.

inplace: True modifies the original DataFrame directly; False (the default) returns a new DataFrame, which must be assigned to a variable.
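The parameters above can be compared side by side on a small frame. This is a sketch on invented data: row 0 is complete, row 1 is missing only the author, and row 2 is missing everything.

```python
import numpy as np
import pandas as pd

# Invented sample data.
frame = pd.DataFrame({
    "name": ["Braised pork", "Egg fried rice", None],
    "author": ["Alice", None, None],
    "grade": [4.5, 4.0, np.nan],
})

any_dropped = frame.dropna(how="any")         # keeps only the complete row 0
all_dropped = frame.dropna(how="all")         # drops only row 2, which is entirely missing
thresh_kept = frame.dropna(thresh=2)          # keeps rows with at least 2 non-missing values
subset_kept = frame.dropna(subset=["grade"])  # only the grade column is checked

# axis=1 works on columns: in the first two rows only "author" has a gap.
cols_dropped = frame.head(2).dropna(axis=1)

print(any_dropped.index.tolist(), all_dropped.index.tolist())
print(thresh_kept.index.tolist(), subset_kept.index.tolist())
print(cols_dropped.columns.tolist())
```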

Fill missing values: fillna

Here I specify a value to replace the missing value: the average grade of the same author's other entries in the data.

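def fillByAuthor(author):
    count = 0
    total = 0.0
    for i in range(len(df)):
        # Skip rows whose grade is missing (assumed condition for the continue)
        if pd.isnull(df.grade.iloc[i]):
            continue
        if df.author.iloc[i] == author:
            count = count + 1
            total = total + df.grade.iloc[i]
    # Average grade of this author, rounded to 2 decimal places
    return round(total / count, 2)

# author holds the name of the author whose grade is missing
a = fillByAuthor(author)
df.fillna(a, inplace=True)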
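As an alternative to looping over the rows, the same idea can be sketched with groupby and transform. This is a different technique from the function above, not the original approach: it fills each missing grade with the mean of that row's own author rather than using one value for every gap. The data is invented for the demo.

```python
import numpy as np
import pandas as pd

# Invented sample data; Alice and Bob each have one missing grade.
ratings = pd.DataFrame({
    "author": ["Alice", "Alice", "Alice", "Bob", "Bob"],
    "grade": [4.0, 5.0, np.nan, 3.0, np.nan],
})

# Mean grade per author, aligned to every row; mean() skips missing values.
author_mean = ratings.groupby("author")["grade"].transform("mean").round(2)

# Fill each missing grade with the mean of the same author's other grades.
ratings["grade"] = ratings["grade"].fillna(author_mean)
print(ratings)
```

An author whose grades are all missing would still end up with a missing value, since there is nothing to average.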

inplace: True modifies the original DataFrame directly; False (the default) returns a new DataFrame, which must be assigned to a variable.

method: pad/ffill fills a missing value with the previous non-missing value; backfill/bfill fills it with the next non-missing value.

None (the default for method): the missing value is replaced with the specified value.

limit: limits how many missing values are filled; with a fill method, this is the maximum number of consecutive missing values to fill.

axis: the direction along which to fill.
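The fill options can be compared on a short Series. This sketch uses ffill() and bfill(), which do the same job as method='ffill' and method='bfill'; recent pandas releases deprecate the method argument of fillna in favor of these two methods. The values are invented.

```python
import numpy as np
import pandas as pd

# Invented sample data with gaps in the middle and at the end.
grades = pd.Series([4.0, np.nan, np.nan, 3.0, np.nan])

constant = grades.fillna(0.0)    # replace every missing value with a fixed value
forward = grades.ffill()         # copy the previous non-missing value forward
backward = grades.bfill()        # copy the next non-missing value backward
limited = grades.ffill(limit=1)  # fill at most one consecutive missing value

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

The last value stays missing after bfill() because no later value exists to copy from.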

Save as

df.to_csv("clean_data.csv")
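One thing to be aware of when saving: by default to_csv also writes the row index as an extra unnamed first column. If that column is not wanted, index=False leaves it out. A small sketch on an invented frame, writing to a string instead of a file:

```python
import io
import pandas as pd

# Invented sample data standing in for the cleaned DataFrame.
cleaned = pd.DataFrame({"name": ["Braised pork"], "grade": [4.5]})

# Default: the row index is written as an extra unnamed first column.
with_index = cleaned.to_csv()
# index=False writes only the data columns.
without_index = cleaned.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the second form back gives the same columns as the original frame.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```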