The article directories

- Recap
- DataFrame core analysis methods
  - Data cleaning
    - Check whether a row has missing data
    - Clean up rows/columns
    - Duplicate removal
    - Fill missing values
    - Remove whitespace from data
  - Select data
    - Selecting data by column
    - Selecting columns with the filter method
    - Selecting data by row
Recap

DataFrame core analysis methods

Data cleaning
pandas uses NaN (Not a Number) to represent missing data.
Let’s take a look at some sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
print(df)
     0    1    2
0  1.0  5.0  8.0
1  2.0  NaN  NaN
2  2.0  3.0  NaN
3  NaN  NaN  NaN
Check whether a row has missing data
In these calls, axis=0 aggregates down each column (one result per column) and axis=1 aggregates across each row (one result per row).

Check by row: df.isnull().any(axis=1); check by column: df.isnull().any(axis=0)
Here is the row-by-row result; take a look:
0    False
1     True
2     True
3     True
dtype: bool
Take a look at this again:
Check by row: df.notnull().all(axis=1); check by column: df.notnull().all(axis=0)
Here a row comes out True only when every value in it is non-null.

All of the above checks can be negated with the inversion operator “~”:
print(~df.isnull().any(axis = 1))
You can also do this with the .loc indexer.
For example, to fetch only the fully non-null rows:
df = df.loc[~df.isnull().any(axis = 1)]
     0    1    2
0  1.0  5.0  8.0
More on .loc later.
You can also check a single column for nulls:

print(df[1].isnull())                 # null test for one column
print(df[1].isnull().value_counts())  # count the nulls in one column
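One related shortcut the article doesn’t use, but which is standard pandas: since True counts as 1, isnull().sum() tallies the missing values of every column in one call. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

# True counts as 1, so summing the boolean mask counts the nulls
counts = df.isnull().sum()
print(counts)
```

Chain another .sum() onto it to get the total number of missing cells in the whole frame.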
Clean up rows/columns
The most straightforward tool is dropna, which clears out rows (or columns) containing null values:
df = pd.DataFrame([[1, 5, 8],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.dropna()
print(df)
With no extra arguments, any row that contains a null value is dropped.
     0    1    2
0  1.0  5.0  8.0
What if you want to do it by column? Then add axis=1:
df = pd.DataFrame([[1, 5, 8],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.dropna(axis=1)
print(df)
Well, sorry to tell you: everything got cleaned out, because every column contains at least one null value…
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Okay, now suppose you feel that one or two bad values in a row are actually tolerable. What then? Then I’ll have to do this for you:
# Keep a row as long as it has at least n non-null values:
df = pd.DataFrame([[1, 5, 8],
                   [np.nan, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.dropna(thresh=1)  # n = 1
print(df)
     0    1    2
0  1.0  5.0  8.0
2  2.0  3.0  NaN
See? If that’s still not what you want, there’s nothing more I can do.
What else: delete a specified column? Delete a specified row? Give drop a try and get a feel for it.
df = pd.DataFrame([[1, 5, 8],
                   [np.nan, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.drop(labels=1)  # with no axis given, labels refer to the row index
print(df)
     0    1    2
0  1.0  5.0  8.0
2  2.0  3.0  NaN
3  NaN  NaN  NaN
Now let me delete a column. Watch this!!
df = pd.DataFrame([[1, 5, 8],
                   [np.nan, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.drop(columns=2)
print(df)
To be clear: the previous example deleted a row, and this one deletes a column…
     0    1
0  1.0  5.0
1  NaN  NaN
2  2.0  3.0
3  NaN  NaN
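As an aside, drop also accepts index and columns together, so a row and a column can go in one call. A small sketch on the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8],
                   [np.nan, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

# Drop row label 1 and column label 2 in a single call
df = df.drop(index=1, columns=2)
print(df)
```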
Enjoy these while you can; who knows how many more posts I’ll get to publish.
Duplicate removal
What if you get a big data set and suspect it is full of duplicates you want to strip out?
There is little that one drop_duplicates call can’t handle.
Let’s try another data set; I’m tired of the old one.
df = pd.DataFrame({'Country': [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
                   'Income': [1, 1, 2, 10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000, 3000, 15666, 1],
                   'Age': [1, 1, 2, 50, 43, 34, 40, 25, 25, 45, 32, 12, 32, 1],
                   'group': [1, 1, 2, 'a', 'b', 's', 'd', 'f', 'g', 'h', 'a', 'd', 'a', 1]})
print(df)
Country Income Age group
0 1 1 1 1
1 1 1 1 1
2 2 2 2 2
3 12 10000 50 a
4 34 10000 43 b
5 23 5000 34 s
6 45 5002 40 d
7 34 40000 25 f
8 23 50000 25 g
9 12 8000 45 h
10 2 5000 32 a
11 3 3000 12 d
12 4 15666 32 a
13 1 1 1 1
Deduplicate straight away:

df.drop_duplicates(inplace=True)  # inplace=True modifies the original table
Country Income Age group
0 1 1 1 1
2 2 2 2 2
3 12 10000 50 a
4 34 10000 43 b
5 23 5000 34 s
6 45 5002 40 d
7 34 40000 25 f
8 23 50000 25 g
9 12 8000 45 h
10 2 5000 32 a
11 3 3000 12 d
12 4 15666 32 a
The duplicate rows are gone.
Notice that when drop_duplicates deletes a row, the row’s index label is deleted with it, so the index is no longer consecutive.
How do we fix that?
df.drop_duplicates(inplace=True)
df = df.reset_index(drop=True)
print(df)
Country Income Age group
0 1 1 1 1
1 2 2 2 2
2 12 10000 50 a
3 34 10000 43 b
4 23 5000 34 s
5 45 5002 40 d
6 34 40000 25 f
7 23 50000 25 g
8 12 8000 45 h
9 2 5000 32 a
10 3 3000 12 d
11 4 15666 32 a
If you want to choose which of the duplicate rows to keep (the default is the first), use the keep argument: 'first', 'last', or False to drop every copy.
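For completeness, here is keep=False in action (with made-up two-column data): it throws away every member of a duplicated group instead of keeping one representative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})

# keep=False drops all copies of a duplicated row,
# leaving only the rows that were unique to begin with
unique_rows = df.drop_duplicates(keep=False)
print(unique_rows)
```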
Deduplicate on a specified column:

df.drop_duplicates(inplace=True, subset=['Age'], keep='last')
df = df.reset_index(drop=True)
print(df)
   Country  Income  Age group
0        2       2    2     2
1       12   10000   50     a
2       34   10000   43     b
3       23    5000   34     s
4       45    5002   40     d
5       23   50000   25     g
6       12    8000   45     h
7        3    3000   12     d
8        4   15666   32     a
9        1       1    1     1
What about several columns at once? What would you call that operation? Think of composite primary keys in a database:

df.drop_duplicates(inplace=True, subset=['Age', 'group'], keep='last')
df = df.reset_index(drop=True)
print(df)
Country Income Age group
0 2 2 2 2
1 12 10000 50 a
2 34 10000 43 b
3 23 5000 34 s
4 45 5002 40 d
5 34 40000 25 f
6 23 50000 25 g
7 12 8000 45 h
8 3 3000 12 d
9 4 15666 32 a
10 1 1 1 1
With that said, let’s fill in the missing values.
Fill missing values
Now let’s switch the data set back.
Then fill in the missing values:
df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
df = df.fillna(value=0)  # fill missing values with a specified value
print(df)
     0    1    2
0  1.0  5.0  0.0
1  2.0  0.0  0.0
2  2.0  3.0  0.0
3  0.0  0.0  0.0
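One more fillna option worth knowing: value can also be a dict mapping column labels to fill values, so every column gets its own replacement. A sketch on the same toy frame (the choice of 0 for column 0 is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])

# Column 0 gets 0, column 1 gets its own mean (4.0);
# column 2 is not listed, so its NaNs stay untouched
df = df.fillna(value={0: 0, 1: df[1].mean()})
print(df)
```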
Fill a column with that column’s mean:

df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
print(df)
df[1] = df[1].fillna(df[1].mean())
print(df)
     0    1   2
0  1.0  5.0 NaN
1  2.0  NaN NaN
2  2.0  3.0 NaN
3  NaN  NaN NaN

     0    1   2
0  1.0  5.0 NaN
1  2.0  4.0 NaN
2  2.0  3.0 NaN
3  NaN  4.0 NaN
Why don’t you try it on the second column?
And here is what happens when you don’t single out a column — every column is filled with its own mean:

df = pd.DataFrame([[1, 5, np.nan],
                   [2, np.nan, np.nan],
                   [2, 3, np.nan],
                   [np.nan, np.nan, np.nan]])
print(df)
df = df.fillna(df.mean())
print(df)
Top-down (forward) filling:

df = df.fillna(method='ffill')  # newer pandas versions prefer df.ffill()
print(df)
0 1 2
0 1.0 5.0 NaN
1 2.0 NaN NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
0 1 2
0 1.0 5.0 NaN
1 2.0 5.0 NaN
2 2.0 3.0 NaN
3 2.0 3.0 NaN
Where there is top-down, there is bottom-up:

df = df.fillna(method='bfill')  # or df.bfill()
print(df)
0 1 2
0 1.0 5.0 NaN
1 2.0 NaN NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
0 1 2
0 1.0 5.0 NaN
1 2.0 3.0 NaN
2 2.0 3.0 NaN
3 NaN NaN NaN
Here is another annoying kind of dirty data: whitespace.
Remove whitespace from data
# Create data containing whitespace (the exact spaces inside the
# city strings below are illustrative)
dict1 = {"name": ["Little red", "Xiao Ming", "Zhang"],
         "age": [16, 17, 18],
         "city": [" Beijing ", " Hangzhou", "Shanghai "]}
df2 = pd.DataFrame(dict1, columns=["name", "age", "city"])
print(df2)
# Strip the whitespace
df2["city"] = df2["city"].map(str.strip)
print(df2)
         name  age       city
0  Little red   16   Beijing 
1   Xiao Ming   17   Hangzhou
2       Zhang   18  Shanghai 

         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai
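A caveat about that map(str.strip) call: it raises a TypeError if the column contains a missing value, because NaN is not a string. The usual pandas idiom is the .str accessor, which passes NaN through. A sketch (the spaces inside the city strings are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["Little red", "Xiao Ming", "Zhang"],
                    "age": [16, 17, 18],
                    "city": [" Beijing ", " Hangzhou", "Shanghai "]})

# .str.strip() strips each value and leaves NaN untouched
df2["city"] = df2["city"].str.strip()
print(df2["city"].tolist())
```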
Select data
Selecting data by column

Let’s start with the most intuitive way: indexing with square brackets.
dict1 = {"name": ["Little red", "Xiao Ming", "Zhang"],
         "age": [16, 17, 18],
         "city": ["Beijing", "Hangzhou", "Shanghai"]}
df2 = pd.DataFrame(dict1, columns=["name", "age", "city"])
print(df2['name'])
0    Little red
1     Xiao Ming
2         Zhang
Name: name, dtype: object
Of course, you can’t do that if you don’t know the column names. Don’t know them? Look them up:
print(df2.columns)
Like so:

Index(['name', 'age', 'city'], dtype='object')
You’ll usually want more than one column, right? Right!
Okay, let’s select multiple columns:

print(df2[['name', 'age']])  # note: this is one list, not two separate strings
         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18
Select columns by data type
The current DataFrame column data type is as follows:
name object
age int64
city object
dtype: object
Grab the object columns:

print(df2.select_dtypes(include='object'))
         name      city
0  Little red   Beijing
1   Xiao Ming  Hangzhou
2       Zhang  Shanghai
So what if I want to select everything except the ‘object’ columns?
print(df2.select_dtypes(exclude='object'))
age
0 16
1 17
2 18
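By the way, include doesn’t have to be a concrete dtype name: the alias 'number' matches every numeric column at once, whatever its width. A sketch on the same data:

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["Little red", "Xiao Ming", "Zhang"],
                    "age": [16, 17, 18],
                    "city": ["Beijing", "Hangzhou", "Shanghai"]})

# 'number' covers int64, float64 and every other numeric dtype
print(df2.select_dtypes(include='number'))
```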
Selecting columns with the filter method
filter has three common arguments, which we’ll try one by one; note that they are mutually exclusive.
Select multiple columns using items:

df2 = df2.filter(items=['name', 'age'])
print(df2)
It’s the same thing as the one up here.
         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18
Select matching columns using like, which keeps every column whose name contains the given substring:
df2 = df2.filter(like='a')
print(df2)
         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18
Select columns using regular expressions:
df2 = df2.filter(regex='[a-z]')
print(df2)
         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai
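Since like only does substring matching, regex earns its keep when you need anchors. For example (a made-up condition), keep only the columns whose names end in 'e':

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["Little red", "Xiao Ming", "Zhang"],
                    "age": [16, 17, 18],
                    "city": ["Beijing", "Hangzhou", "Shanghai"]})

# '$' anchors the match to the end of the column name,
# so 'name' and 'age' match but 'city' does not
print(df2.filter(regex='e$'))
```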
Selecting data by row

Let’s look at the .loc indexer:
df2 = df2.loc[0:2]
print(df2)
         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai
You got it? Note that .loc slices by label, so 0:2 includes row 2.
You can also pick rows and columns at once:

df2 = df2.loc[0:2, ['name', 'age']]
print(df2)
         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18
I’ll just leave these two here without a word of explanation:

df2 = df2.loc[(df2['age'] > 16) & (df2['age'] < 18)]
df2 = df2.loc[(df2['age'] > 16) | (df2['age'] < 18)]
I won’t even print the results here; use your imagination.
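If you prefer, the same kind of boolean filter can be written with DataFrame.query, which this article doesn’t otherwise cover; a sketch on the same toy data:

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["Little red", "Xiao Ming", "Zhang"],
                    "age": [16, 17, 18],
                    "city": ["Beijing", "Hangzhou", "Shanghai"]})

# query() takes the condition as a string; chained comparisons work too
print(df2.query('16 < age < 18'))
```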
Almost done. Let me think what else there is. Ah, lambda expressions:
df2 = df2.loc[lambda x: x.city == 'Beijing']
Well, that’s what it looks like.
Barring surprises, this article is done. See you!!