Table of contents

    • Review
    • DataFrame core analysis methods
      • Data cleaning
        • Determine whether a row has missing data
        • Clean up rows/columns
        • Duplicate removal
        • Fill missing values
        • Remove whitespace from data
      • Select data
        • Selecting data by column
        • The filter method for selecting columns
        • Selecting data by row

Review

Pandas (2)

DataFrame core analysis methods

Data cleaning

Pandas uses NaN (Not a Number) to represent missing data.

Let’s take a look at some data:

df = pd.DataFrame([[1, 5, 8], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

     0    1    2
0  1.0  5.0  8.0
1  2.0  NaN  NaN
2  2.0  3.0  NaN
3  NaN  NaN  NaN

Determine whether a row has missing data

For these checks, axis=1 works row by row and axis=0 works column by column.

df.isnull().any(axis=1)	# check each row
df.isnull().any(axis=0)	# check each column

Here is the row-by-row result:

0    False
1     True
2     True
3     True
dtype: bool

Take a look at this one:

df.notnull().all(axis=1)	# check each row
df.notnull().all(axis=0)	# check each column

With all(), a row (or column) counts as True only when every value in it is non-null.
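To make the difference from any() concrete, here is a minimal sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

# all(axis=1) is True only for rows with no missing values at all
mask = df.notnull().all(axis=1)
print(mask.tolist())  # [True, False, False, False]
```

Only row 0 has no missing values, so only row 0 comes out True.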


All of the above checks can be negated with the inversion operator "~":

print(~df.isnull().any(axis=1))

The same mask works with the loc indexer.

For example, to keep only the fully non-null rows:

df = df.loc[~df.isnull().any(axis=1)]

     0    1    2
0  1.0  5.0  8.0

As for loc, I'll talk about it in more detail later.


You can also check a single column for nulls:

print(df[1].isnull())	# null-check one column
print(df[1].isnull().value_counts())	# tally null vs non-null values in that column

Clean up rows/columns

There is a fairly blunt way to drop every row that contains a missing value:

df = pd.DataFrame([[1, 5, 8], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.dropna()

print(df)

With no extra arguments, any row containing a single null value is dropped:

     0    1    2
0  1.0  5.0  8.0

What if you want to do it by column? Then add axis=1:

df = pd.DataFrame([[1, 5, 8], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.dropna(axis=1)

print(df)

Well, I'm sorry to tell you, everything has been dropped, because every column contains at least one null:

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

Okay, now you tell me that one or two bad values in a row are actually tolerable. Then the thresh argument is what you need:

# keep a row as long as it has at least n non-null values
df = pd.DataFrame([[1, 5, 8], [np.nan, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.dropna(thresh=1)	# here n = 1

print(df)

     0    1    2
0  1.0  5.0  8.0
2  2.0  3.0  NaN

Handy, isn't it? If that's still not what you want, dropna has done all it can.


What else? Delete a specified row? A specified column? Give it a try and feel it out.

df = pd.DataFrame([[1, 5, 8], [np.nan, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.drop(labels=1)

print(df)

     0    1    2
0  1.0  5.0  8.0
2  2.0  3.0  NaN
3  NaN  NaN  NaN

Now let me delete a column instead:

df = pd.DataFrame([[1, 5, 8], [np.nan, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.drop(columns=2)

print(df)

See the difference? labels=1 dropped a row, while columns=2 dropped a column:

     0    1
0  1.0  5.0
1  NaN  NaN
2  2.0  3.0
3  NaN  NaN
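For what it's worth, drop also accepts lists of labels, so several rows or columns can go in one call each; a quick sketch with the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 8], [np.nan, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.drop(labels=[1, 3])   # drop rows 1 and 3
df = df.drop(columns=[1, 2])  # drop columns 1 and 2
print(df)
#      0
# 0  1.0
# 2  2.0
```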

That's it for dropping rows and columns.


Duplicate removal

What if you get a data set, and it’s big, and you feel like there’s a lot of duplicates in it, and you want to do a wave of de-duplication?

The drop_duplicates method handles exactly this.

Let’s try another data set. I’m tired of using that one.

df = pd.DataFrame({'Country': [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
                   'Income': [1, 1, 2, 10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000, 3000, 15666, 1],
                   'Age': [1, 1, 2, 50, 43, 34, 40, 25, 25, 45, 32, 12, 32, 1],
                   'group': [1, 1, 2, 'a', 'b', 's', 'd', 'f', 'g', 'h', 'a', 'd', 'a', 1]})
 	Country  Income  Age group
0         1       1    1     1
1         1       1    1     1
2         2       2    2     2
3        12   10000   50     a
4        34   10000   43     b
5        23    5000   34     s
6        45    5002   40     d
7        34   40000   25     f
8        23   50000   25     g
9        12    8000   45     h
10        2    5000   32     a
11        3    3000   12     d
12        4   15666   32     a
13        1       1    1     1

Deduplicate directly:

df.drop_duplicates(inplace=True)	# inplace=True modifies the original frame
    Country  Income  Age group
0         1       1    1     1
2         2       2    2     2
3        12   10000   50     a
4        34   10000   43     b
5        23    5000   34     s
6        45    5002   40     d
7        34   40000   25     f
8        23   50000   25     g
9        12    8000   45     h
10        2    5000   32     a
11        3    3000   12     d
12        4   15666   32     a

Rows 1 and 13 (both duplicates of row 0) are gone.

Note that drop_duplicates keeps the surviving rows' original index labels, so after deletion the index is no longer contiguous.

So how do we fix that?

df.drop_duplicates(inplace=True)
df = df.reset_index(drop=True)
print(df)
	Country  Income  Age group
0         1       1    1     1
1         2       2    2     2
2        12   10000   50     a
3        34   10000   43     b
4        23    5000   34     s
5        45    5002   40     d
6        34   40000   25     f
7        23   50000   25     g
8        12    8000   45     h
9         2    5000   32     a
10        3    3000   12     d
11        4   15666   32     a

If you want to choose which of the duplicate rows to keep (the default is the first), use the keep argument: the choices are 'first', 'last', or False, which drops every copy.
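Here is a quick sketch of the three options on a tiny made-up frame; note how keep=False drops every copy of a duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})  # rows 0 and 1 are duplicates

print(df.drop_duplicates(keep='first').index.tolist())  # [0, 2]
print(df.drop_duplicates(keep='last').index.tolist())   # [1, 2]
print(df.drop_duplicates(keep=False).index.tolist())    # [2]
```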


Deduplicate on a specified column only (run on the freshly created frame):

df.drop_duplicates(inplace=True, subset=['Age'], keep='last')

df = df.reset_index(drop=True)

print(df)
   Country  Income  Age group
0        2       2    2     2
1       12   10000   50     a
2       34   10000   43     b
3       23    5000   34     s
4       45    5002   40     d
5       23   50000   25     g
6       12    8000   45     h
7        3    3000   12     d
8        4   15666   32     a
9        1       1    1     1

What about deduplicating on several columns at once? Think of a composite primary key in a database:

df.drop_duplicates(inplace=True, subset=['Age', 'group'], keep='last')

df = df.reset_index(drop=True)

print(df)
    Country  Income  Age group
0         2       2    2     2
1        12   10000   50     a
2        34   10000   43     b
3        23    5000   34     s
4        45    5002   40     d
5        34   40000   25     f
6        23   50000   25     g
7        12    8000   45     h
8         3    3000   12     d
9         4   15666   32     a
10        1       1    1     1

With that said, let’s fill in the missing values.


Fill missing values

Now let’s switch the data set back.

Then fill in the missing values:

df = pd.DataFrame([[1, 5, np.nan], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

df = df.fillna(value=0)	# fill every missing value with the given constant

print(df)
     0    1    2
0  1.0  5.0  0.0
1  2.0  0.0  0.0
2  2.0  3.0  0.0
3  0.0  0.0  0.0

Fill a column with that column's mean:

df = pd.DataFrame([[1, 5, np.nan], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

print(df)

df[1] = df[1].fillna(df[1].mean())

print(df)

     0    1   2
0  1.0  5.0 NaN
1  2.0  NaN NaN
2  2.0  3.0 NaN
3  NaN  NaN NaN

     0    1   2
0  1.0  5.0 NaN
1  2.0  4.0 NaN
2  2.0  3.0 NaN
3  NaN  4.0 NaN

Why don't you try column 2?
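If you do try column 2, note what happens: every value in it is missing, so its mean is NaN and fillna has nothing to fill with. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, np.nan], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

print(df[2].mean())  # the mean of an all-NaN column is itself NaN
df[2] = df[2].fillna(df[2].mean())
print(bool(df[2].isnull().all()))  # True -- nothing was filled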

Well, try not specifying a column; then every column is filled with its own mean:

df = pd.DataFrame([[1, 5, np.nan], [2, np.nan, np.nan], [2, 3, np.nan], [np.nan, np.nan, np.nan]])

print(df)

df = df.fillna(df.mean())

print(df)

Top-down filling:

df = df.fillna(method='ffill')

print(df)
     0    1   2
0  1.0  5.0 NaN
1  2.0  NaN NaN
2  2.0  3.0 NaN
3  NaN  NaN NaN

     0    1   2
0  1.0  5.0 NaN
1  2.0  5.0 NaN
2  2.0  3.0 NaN
3  2.0  3.0 NaN

Where there is top-down, there is bottom-up:

df = df.fillna(method='bfill')

print(df)
     0    1   2
0  1.0  5.0 NaN
1  2.0  NaN NaN
2  2.0  3.0 NaN
3  NaN  NaN NaN

     0    1   2
0  1.0  5.0 NaN
1  2.0  3.0 NaN
2  2.0  3.0 NaN
3  NaN  NaN NaN

Here's another kind of annoying dirty data: whitespace.

Remove whitespace from data

# create data containing spaces
dict1 = {"name": ["Little red", "Xiao Ming", "Zhang"], "age": [16, 17, 18], "city": [" Beijing ", " Hangzhou ", " Shanghai "]}
df2 = pd.DataFrame(dict1, columns=["name", "age", "city"])

print(df2)

# strip the spaces
df2["city"] = df2["city"].map(str.strip)

print(df2)

         name  age       city
0  Little red   16   Beijing 
1   Xiao Ming   17   Hangzhou 
2       Zhang   18   Shanghai 

         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai
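An equivalent route is the vectorized .str accessor; unlike map(str.strip), it passes missing values through instead of raising a TypeError. A small sketch with a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([" Beijing ", " Hangzhou ", np.nan])

stripped = s.str.strip()  # NaN stays NaN instead of crashing
print(stripped[0])                  # Beijing
print(stripped.isnull().tolist())   # [False, False, True]
```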

Select data

Selecting data by column

Let's start with the most intuitive way: square brackets.

# create the data again and strip the spaces
dict1 = {"name": ["Little red", "Xiao Ming", "Zhang"], "age": [16, 17, 18], "city": [" Beijing ", " Hangzhou ", " Shanghai "]}
df2 = pd.DataFrame(dict1, columns=["name", "age", "city"])
df2["city"] = df2["city"].map(str.strip)

print(df2['name'])

0    Little red
1     Xiao Ming
2         Zhang
Name: name, dtype: object

Of course, how can you select a column if you don't know its name? You can't pick what you can't see…

print(df2.columns)
Copy the code

Like this:

Index(['name', 'age', 'city'], dtype='object')

You usually need to select multiple columns, right? Right!

OK, let's select multi-column data:

print(df2[['name', 'age']])	# note: a list of names, not two separate strings

         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18

Select columns by data type

The current DataFrame column data type is as follows:

name    object
age      int64
city    object
dtype: object

Get the object columns:

print(df2.select_dtypes(include='object'))

         name      city
0  Little red   Beijing
1   Xiao Ming  Hangzhou
2       Zhang  Shanghai

So, what if I wanted to select something other than ‘object’?

print(df2.select_dtypes(exclude='object'))
   age
0   16
1   17
2   18
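select_dtypes also understands the generic 'number' alias (and accepts lists of dtypes), which helps when a frame mixes int64 and float64 columns; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [16, 17], "score": [1.5, 2.5]})

# 'number' matches every numeric dtype, int64 and float64 alike
print(df.select_dtypes(include='number').columns.tolist())  # ['age', 'score']
```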

The filter method for selecting columns

It has three common arguments, which we'll look at one by one; note that they cannot be combined in a single call.


Select multiple columns using items:

df2 = df2.filter(items=['name', 'age'])

print(df2)

This gives the same result as the bracket selection above.

         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18

Select columns whose name contains a substring using like:

df2 = df2.filter(like='a')

print(df2)

         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18

Select columns using a regular expression:

df2 = df2.filter(regex='[a-z]')

print(df2)

         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai

Selecting data by row

Let's look at the loc indexer first:

df2 = df2.loc[0:2]

print(df2)

         name  age      city
0  Little red   16   Beijing
1   Xiao Ming   17  Hangzhou
2       Zhang   18  Shanghai

Got it? Note that the loc slice 0:2 is label-based and includes both endpoints.

You can also select rows and columns at the same time:

df2 = df2.loc[0:2, ['name', 'age']]

         name  age
0  Little red   16
1   Xiao Ming   17
2       Zhang   18

I'll just put these here without further comment:


df2 = df2.loc[(df2['age'] > 16) & (df2['age'] < 18)]
df2 = df2.loc[(df2['age'] > 16) | (df2['age'] < 18)]

I won't even print the result here; use your imagination.
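If your imagination needs a hand, here is what the two masks come to on this data: & requires both conditions to hold, | requires either one.

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["Little red", "Xiao Ming", "Zhang"], "age": [16, 17, 18]})

both = df2.loc[(df2['age'] > 16) & (df2['age'] < 18)]    # only age 17 satisfies both
either = df2.loc[(df2['age'] > 16) | (df2['age'] < 18)]  # every row satisfies at least one
print(both['name'].tolist())  # ['Xiao Ming']
print(len(either))            # 3
```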


Almost done. Let me think about what else there is.

Ah yes: lambda expressions. loc also accepts a callable:

df2 = df2.loc[lambda x: x.city == 'Beijing']

Well, it looks like this.


If nothing unexpected comes up, this article ends here. See you!!