Introduction to Pandas: Skills You Must Master

To summarize the ways I most often use pandas:

  • Create DataFrame data
  • View data information
  • View head and tail rows
  • Select data
  • Select data by slicing
  • Common function usage

Import packages

import pandas as pd
import numpy as np

Tip 1 - Create DataFrame data

Method 1: Create the data yourself

df1 = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun", "Wang Xiao", "Guan Yu", "Liu Bei", "Zhang Fei"],
    "age": [20, 18, 27, 20, 28, 18, 25],
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female", "Female"],
    "score": [669, 570, 642, 590, 601, 619, 701],
    "address": ["Beijing", "Shenzhen", "Guangzhou", "Wuhan", "Shenzhen", "Guangzhou", "Changsha"]
})

df1

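The data are as follows:

        name  age     sex  score    address
0  Xiao Ming   20    Male    669    Beijing
1  Xiao Hong   18  Female    570   Shenzhen
2   Xiao Sun   27    Male    642  Guangzhou
3  Wang Xiao   20    Male    590      Wuhan
4    Guan Yu   28    Male    601   Shenzhen
5    Liu Bei   18  Female    619  Guangzhou
6  Zhang Fei   25  Female    701   Changsha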

Method 2: Read data from a local file. Suppose there is a local file named Student information.xlsx; read it directly with pd.read_excel():

df2 = pd.read_excel("Student information.xlsx")
df2

You can see that the result is the same.

Tip 2 - Data exploration

View data shape

The shape attribute shows how many rows and columns the data has:

df1.shape  # (7, 5)

View the column names

df1.columns

View the data type of each column

df1.dtypes

You can see that there are only two data types: int64 and object.

Check whether data is missing

df1.isnull()   # True if missing, False otherwise

df1.isnull().sum()  # Count the missing values in each column; each True counts as 1

The results show that there is no missing value in this data
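
Because this data set has no missing values, every count above is zero. As a minimal sketch of what the output looks like when values are missing, the following uses a small hypothetical DataFrame (the name demo and its values are made up for illustration and are not part of the data above):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two deliberately missing values
demo = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [20, np.nan, 27],
    "score": [669, 570, np.nan],
})

missing_mask = demo.isnull()             # Boolean DataFrame, True where a value is missing
missing_per_column = missing_mask.sum()  # Each True counts as 1, summed per column

print(missing_per_column)
```

Here age and score each report one missing value, while name reports none.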

View the data row index

df1.index

View the data description

df1.info()  # Column names, non-null counts, dtypes, and memory usage

View statistics

Only numeric data is displayed.

df1.describe()

The statistics include: the number of values (count), the mean (mean), the standard deviation (std), the minimum and maximum (min, max), the first quartile (25%), the median (50%), and the third quartile (75%).
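
To make the row labels concrete, here is a small sketch that rebuilds only the two numeric columns from the data above and reads individual statistics out of the describe() result. The variable names (stats, mean_score, and so on) are chosen here for illustration:

```python
import pandas as pd

# The two numeric columns from the example data
df = pd.DataFrame({
    "age": [20, 18, 27, 20, 28, 18, 25],
    "score": [669, 570, 642, 590, 601, 619, 701],
})

stats = df.describe()

# Individual statistics can be read with .loc[row_label, column_label]
mean_score = stats.loc["mean", "score"]
median_age = stats.loc["50%", "age"]
std_score = stats.loc["std", "score"]  # standard deviation, not variance

print(stats)
```

The std row matches df["score"].std(), which is the sample standard deviation.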

Tip 3 - View head and tail rows

The head and tail methods let you quickly view the first and last rows of the data.

head

df1.head()   # View the first 5 rows by default
df1.head(3)  # Specify the number of rows to display

tail

df1.tail()    # View the last 5 rows by default
df1.tail(3)   # View the last 3 rows

Tip 4 - Select data

Select the data we want from a pandas DataFrame and process it.

Select data from a single column

Let’s fetch the data from the name column:

name = df1["name"]
name

# Result
0    Xiao Ming
1    Xiao Hong
2     Xiao Sun
3    Wang Xiao
4      Guan Yu
5      Liu Bei
6    Zhang Fei
Name: name, dtype: object

Select data from multiple columns

For example, we fetch data from the name and age columns:

name_age = df1[["name", "age"]]
name_age

# Result
        name  age
0  Xiao Ming   20
1  Xiao Hong   18
2   Xiao Sun   27
3  Wang Xiao   20
4    Guan Yu   28
5    Liu Bei   18
6  Zhang Fei   25

Select data based on column type

For example, suppose we want to select the columns whose type is int64. From the data types viewed earlier, age and score are int64.

1. Select a single data type

# 1. Select a single data type

df1.select_dtypes(include='int64')

# the results
  age score
0	20	669
1	18	570
2	27	642
3	20	590
4	28	601
5	18	619
6	25	701

2. Select multiple types

df1.select_dtypes(include=['int64', 'object'])

# Result
        name  age     sex  score    address
0  Xiao Ming   20    Male    669    Beijing
1  Xiao Hong   18  Female    570   Shenzhen
2   Xiao Sun   27    Male    642  Guangzhou
3  Wang Xiao   20    Male    590      Wuhan
4    Guan Yu   28    Male    601   Shenzhen
5    Liu Bei   18  Female    619  Guangzhou
6  Zhang Fei   25  Female    701   Changsha

Since the data contains only int64 and object columns, all of them are selected.

3. Exclude certain data types:

# Select columns other than int64,
# that is, exclude the age and score columns
df1.select_dtypes(exclude='int64')

# Result
        name     sex    address
0  Xiao Ming    Male    Beijing
1  Xiao Hong  Female   Shenzhen
2   Xiao Sun    Male  Guangzhou
3  Wang Xiao    Male      Wuhan
4    Guan Yu    Male   Shenzhen
5    Liu Bei  Female  Guangzhou
6  Zhang Fei  Female   Changsha
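
The include and exclude arguments can be sketched together as follows. This example swaps in the generic "number" selector, which select_dtypes accepts for all numeric types, in place of 'int64', so the split does not depend on the exact integer width; the reduced three-row DataFrame is only for illustration:

```python
import pandas as pd

# A reduced version of the example data
df = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun"],
    "age": [20, 18, 27],
    "sex": ["Male", "Female", "Male"],
    "score": [669, 570, 642],
})

numeric_part = df.select_dtypes(include="number")  # "number" matches every numeric dtype
other_part = df.select_dtypes(exclude="number")    # everything that is not numeric

print(list(numeric_part.columns))
print(list(other_part.columns))
```

Both calls return new DataFrames and keep the columns in their original order.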

Select data by numeric comparison

1. Filter directly with a comparison:

df1[df1["age"] == 20]  # Age equal to 20
df1[df1["age"] != 20]  # Age not equal to 20
df1[df1["age"] >= 20]  # Age greater than or equal to 20

2. Using multiple conditions

The first time you combine conditions, you may get an error whose message contains the keyword "ambiguous" (for example, when joining two conditions with Python's and). In pandas, wrap each condition in parentheses and combine them with &:

df1[(df1["age"] >= 20) & (df1["age"] < 27)]
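
As a minimal sketch of why the parentheses and & matter, the following builds the Boolean mask explicitly and also captures the error raised by Python's and. The names mask, selected, and error_message are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun", "Wang Xiao", "Guan Yu", "Liu Bei", "Zhang Fei"],
    "age": [20, 18, 27, 20, 28, 18, 25],
})

# Element-wise AND of two Boolean Series; each condition needs its own parentheses
mask = (df["age"] >= 20) & (df["age"] < 27)
selected = df[mask]

# Python's "and" asks for a single True/False from a whole Series, which fails
try:
    df[(df["age"] >= 20) and (df["age"] < 27)]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(selected)
print(error_message)
```

The same pattern works with | for "or" and ~ for "not".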

Select data by string value

1. Filter with a single condition

# 1. Single condition
df1[df1["name"] == "Xiao Ming"]

# Result
        name  age   sex  score  address
0  Xiao Ming   20  Male    669  Beijing

2. Filter with multiple conditions

Select the rows whose name is Xiao Ming or whose age is over 25:

df1[(df1["name"] == "Xiao Ming") | (df1["age"] > 25)]

# Result
        name  age   sex  score    address
0  Xiao Ming   20  Male    669    Beijing
2   Xiao Sun   27  Male    642  Guangzhou
4    Guan Yu   28  Male    601   Shenzhen

3. String functions: starts with, ends with, contains

  • str.startswith(string)
  • str.endswith(string)
  • str.contains(string)

# 1. Select names that start with "Xiao"
df1[df1["name"].str.startswith("Xiao")]

# Result
        name  age     sex  score    address
0  Xiao Ming   20    Male    669    Beijing
1  Xiao Hong   18  Female    570   Shenzhen
2   Xiao Sun   27    Male    642  Guangzhou

# 2. Select names that start with "Guan"
df1[df1["name"].str.startswith("Guan")]

# Result
      name  age   sex  score   address
4  Guan Yu   28  Male    601  Shenzhen

# 3. Select names that end with "Fei"
df1[df1["name"].str.endswith("Fei")]

# Result
        name  age     sex  score   address
6  Zhang Fei   25  Female    701  Changsha

# 4. Select names that contain "Xiao", whether at the beginning or at the end
df1[df1["name"].str.contains("Xiao")]

# Result
        name  age     sex  score    address
0  Xiao Ming   20    Male    669    Beijing
1  Xiao Hong   18  Female    570   Shenzhen
2   Xiao Sun   27    Male    642  Guangzhou
3  Wang Xiao   20    Male    590      Wuhan

Wang Xiao above does not start with "Xiao" but does contain it, so that row is also selected.

4. Negating a string condition

The negation operator is the tilde: ~

For example, select the rows whose name does not contain "Xiao"; only three names qualify.

# Select rows whose name does not contain "Xiao"
df1[~df1["name"].str.contains("Xiao")]

# Result
        name  age     sex  score    address
4    Guan Yu   28    Male    601   Shenzhen
5    Liu Bei   18  Female    619  Guangzhou
6  Zhang Fei   25  Female    701   Changsha
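
The four string filters can be sketched side by side. This example assumes the names are written in romanized form ("Xiao Ming", "Wang Xiao", and so on), and the variable names are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun", "Wang Xiao", "Guan Yu", "Liu Bei", "Zhang Fei"],
    "age": [20, 18, 27, 20, 28, 18, 25],
})

starts = df[df["name"].str.startswith("Xiao")]       # prefix match
ends = df[df["name"].str.endswith("Fei")]            # suffix match
contains = df[df["name"].str.contains("Xiao")]       # match anywhere in the string
not_contains = df[~df["name"].str.contains("Xiao")]  # ~ flips the Boolean mask

print(list(contains["name"]))
print(list(not_contains["name"]))
```

Note that str.contains treats its argument as a regular expression by default; passing regex=False requests a literal match, which matters when the search text contains characters such as "." or "+".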

Tip 5 - Select data by slicing

Slicing is a Python concept that can also be used in pandas. A slice has three parts: start, stop, and step.

  • start: the start index, inclusive
  • stop: the end index, exclusive
  • step: the step size, which can be positive or negative

[start:stop:step]

The step size is positive

1. By default, the start index is 0 and the step is 1

2. If the start index is specified and the end index is not, data is fetched through the end

df1[4:]  # Start at index 4 and continue to the end

# Result
        name  age     sex  score    address
4    Guan Yu   28    Male    601   Shenzhen
5    Liu Bei   18  Female    619  Guangzhou
6  Zhang Fei   25  Female    701   Changsha

3. Change the step size

df1[0:4:2]  # Change the step: take every second row

# Result
        name  age   sex  score    address
0  Xiao Ming   20  Male    669    Beijing
2   Xiao Sun   27  Male    642  Guangzhou

The example above can also be written without a start index:

df1[:4:2]  # Starts at 0 by default

4. Specify only the step size

df1[::2]   # From start to finish, step size 2

# Result
        name  age     sex  score    address
0  Xiao Ming   20    Male    669    Beijing
2   Xiao Sun   27    Male    642  Guangzhou
4    Guan Yu   28    Male    601   Shenzhen
6  Zhang Fei   25  Female    701   Changsha

The step size is negative

1. With a step of -1, the rows are returned in reverse order

df1[::-1]  # Output in reverse order

# Result
        name  age     sex  score    address
6  Zhang Fei   25  Female    701   Changsha
5    Liu Bei   18  Female    619  Guangzhou
4    Guan Yu   28    Male    601   Shenzhen
3  Wang Xiao   20    Male    590      Wuhan
2   Xiao Sun   27    Male    642  Guangzhou
1  Xiao Hong   18  Female    570   Shenzhen
0  Xiao Ming   20    Male    669    Beijing

2. With a negative step, specify both indexes so that the start index is greater than the end index

df1[4:0:-1]

# Result
        name  age     sex  score    address
4    Guan Yu   28    Male    601   Shenzhen
3  Wang Xiao   20    Male    590      Wuhan
2   Xiao Sun   27    Male    642  Guangzhou
1  Xiao Hong   18  Female    570   Shenzhen

3. Start and end indexes are negative

df1[-1:-5:-1]  # The last row has index -1; the row at index -5 is not included

# Result
        name  age     sex  score    address
6  Zhang Fei   25  Female    701   Changsha
5    Liu Bei   18  Female    619  Guangzhou
4    Guan Yu   28    Male    601   Shenzhen
3  Wang Xiao   20    Male    590      Wuhan
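
The slicing cases above can be checked in one place by looking only at which row labels each slice returns. This is a sketch under the assumption that the DataFrame has the default 0..6 index; the helper rows is a hypothetical convenience function, not a pandas API:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun", "Wang Xiao", "Guan Yu", "Liu Bei", "Zhang Fei"],
})

def rows(frame):
    """Return the row labels of a sliced DataFrame as a plain list."""
    return list(frame.index)

tail_from_4 = rows(df[4:])               # start given, stop omitted
first_four_step_2 = rows(df[0:4:2])      # stop is exclusive, so row 4 is left out
every_second = rows(df[::2])             # only the step is given
reversed_rows = rows(df[::-1])           # negative step reverses the order
backwards_4_to_1 = rows(df[4:0:-1])      # start > stop with a negative step
last_four_reversed = rows(df[-1:-5:-1])  # negative start and stop
same_with_iloc = rows(df.iloc[4:0:-1])   # .iloc states the positional intent explicitly

print(backwards_4_to_1)
```

Writing the slice through .iloc gives the same rows here and makes it explicit that the numbers are positions rather than labels.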

Tip 6 - Common functions

Counting elements

Use the value_counts() method to count how many times each value appears.

⚠️ Note: a new column, class, has been added to df1; it will be used later.

Let’s say we want to count how many times each city appears:

# Count how many times each city appears

address = df1["address"].value_counts()
address

The result is a Series, automatically sorted by count in descending order.
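
A self-contained sketch of the counting step, using only the address values from the example data. The normalize=True call is an extra shown for illustration; it returns proportions instead of counts:

```python
import pandas as pd

address = pd.Series(
    ["Beijing", "Shenzhen", "Guangzhou", "Wuhan", "Shenzhen", "Guangzhou", "Changsha"],
    name="address",
)

counts = address.value_counts()                # occurrences per city, largest first
shares = address.value_counts(normalize=True)  # the same, as fractions of the total

print(counts)
```

Shenzhen and Guangzhou each appear twice, and the other three cities appear once.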

Resetting the index

To reset the index, use reset_index():

address_new = address.reset_index()
address_new

For example, suppose we want to extract the rows where sex is "Male":

fale = df1[df1["sex"] == "Male"]
fale

Notice that the rows keep their original index values, but we want the index to start at 0, which is more conventional:

fale_1 = fale.reset_index()
fale_1

The index is now what we want, but a new column has appeared that holds the original index values. We do not want that column, so we drop it:

fale_1 = fale.reset_index(drop=True)  # drop=True discards the old index
fale_1
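
The difference between the two reset_index() calls can be sketched as follows, using a reduced copy of the example data; the variable names are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Xiao Ming", "Xiao Hong", "Xiao Sun", "Wang Xiao", "Guan Yu", "Liu Bei", "Zhang Fei"],
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female", "Female"],
})

males = df[df["sex"] == "Male"]           # filtering keeps the original labels 0, 2, 3, 4
with_old_index = males.reset_index()      # old labels move into a new "index" column
clean = males.reset_index(drop=True)      # old labels are discarded

print(list(males.index))
print(list(with_old_index.columns))
print(list(clean.index))
```

Both calls return new DataFrames; the filtered males DataFrame itself is left unchanged.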

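Renaming columns

We use the rename function, passing the columns argument:

# Note: on pandas 2.0 and later, reset_index() on a value_counts() result already
# yields the columns "address" and "count", so use {"count": "number"} there
address_new = address_new.rename(columns={"index": "address",
                                          "address": "number"})
address_new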
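
Because the default column names produced by value_counts() followed by reset_index() differ between pandas versions, one way to sidestep the question is to name the columns explicitly. This is a sketch of that alternative, not the approach used above; address_table, renamed, and the column name "times" are illustrative:

```python
import pandas as pd

address = pd.Series(
    ["Beijing", "Shenzhen", "Guangzhou", "Wuhan", "Shenzhen", "Guangzhou", "Changsha"],
    name="address",
)

# rename_axis names the index and reset_index(name=...) names the count column,
# so the result does not depend on the version-specific defaults
address_table = (
    address.value_counts()
    .rename_axis("address")
    .reset_index(name="number")
)

# rename returns a new DataFrame; columns missing from the mapping are left unchanged
renamed = address_table.rename(columns={"number": "times"})

print(list(address_table.columns))
print(list(renamed.columns))
```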

Using groupby

groupby mainly implements grouped statistics:

1. Let’s say we want to get the total scores of men and women

# Total score of males and females: sum

sex_score = df1.groupby("sex")["score"].sum()
sex_score

2. Find the mean score of men and women respectively

# Mean score for males and females

sex_score = df1.groupby("sex")["score"].mean()
sex_score

3. Total score by sex and class

# Group by sex and class, then get the total score

sex_class = df1.groupby(["sex", "class"])["score"].sum()
sex_class

One line of code implements the above and resets the index:

# One-line implementation

df1.groupby(["sex", "class"])["score"].sum().reset_index()
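
The three groupby steps can be sketched on a self-contained copy of the data. The real values of the class column are not shown in this article, so the class labels below are hypothetical and exist only to make the example runnable:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female", "Female"],
    "score": [669, 570, 642, 590, 601, 619, 701],
    # Hypothetical class labels: the article's actual "class" values are not shown
    "class": [1, 2, 1, 2, 1, 2, 1],
})

total_by_sex = df.groupby("sex")["score"].sum()   # one total per sex
mean_by_sex = df.groupby("sex")["score"].mean()   # one mean per sex

# Grouping by two columns gives a two-level index; reset_index flattens it
total_by_sex_class = df.groupby(["sex", "class"])["score"].sum().reset_index()

print(total_by_sex)
print(total_by_sex_class)
```

The per-sex totals and means depend only on the sex and score columns, so they hold for the original data as well; the per-class breakdown reflects the made-up labels.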

The apply function

We use the same df1 data set as above:

1. Requirement 1: We want to change the male gender to 1 and the female gender to 0

# 1. Change sex: 1 for male, 0 for female

df2 = df1.copy()  # Make a copy

df2["sex"] = df2["sex"].apply(lambda x: 1 if x == "Male" else 0)  # Use an anonymous function
df2

We can also define a custom function to do this:

# Custom function

def apply_sex(x):
    return 1 if x == "Male" else 0

df3 = df1.copy()  # Make a copy

df3["sex"] = df3["sex"].apply(apply_sex)  # Use the custom function
df3

2. For example, we want to append "City" to each city name, giving Beijing City, Shenzhen City, and so on:

# 2. Append "City" to each city name: Beijing City, Shenzhen City, etc.

df4 = df1.copy()

df4["address"] = df4["address"].apply(lambda x: x + " City")
df4
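
For comparison, here is a sketch of the two apply tasks next to alternatives that pandas also supports: map with a dictionary for a fixed lookup, and direct string concatenation on a column. The reduced DataFrame and the variable names are for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "address": ["Beijing", "Shenzhen", "Guangzhou"],
})

# apply calls the function once per element
encoded_with_apply = df["sex"].apply(lambda x: 1 if x == "Male" else 0)

# map with a dictionary is an alternative when the mapping is a fixed lookup table
encoded_with_map = df["sex"].map({"Male": 1, "Female": 0})

# Adding a string to a text column appends it to every element
with_suffix = df["address"] + " City"
with_suffix_apply = df["address"].apply(lambda x: x + " City")

print(list(encoded_with_apply))
print(list(with_suffix))
```

apply remains the more general tool, since the function can contain arbitrary logic; the alternatives are simply shorter for these two cases.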

Conclusion

This article covered how to create pandas DataFrame data, how to explore common information about the data, and how to select the data we want from a DataFrame. It closed with the data-processing functions I use most often. Pandas is truly powerful, and learning it well will save us a lot of time when manipulating data.