A DataFrame is a two-dimensional data structure, a container of Series, with both a row index and a column index

1. Create DataFrame

1.1 Creating a DataFrame from a List

Pass data, index, and columns. index and columns are optional; if omitted, both default to 0, 1, 2, ...

data, index, and columns can each be a list or a NumPy array such as np.arange(...)

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, 3], [11, 12, 13]], index=['r_1', 'r_2'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame(data=[[1], [11]], index=['r_1', 'r_2'], columns=['A'])
df3 = pd.DataFrame(data=np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))

      A   B   C
r_1   1   2   3
r_2  11  12  13

      A
r_1   1
r_2  11

   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
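
As a small sketch of the point about index and columns (the names df_default and df_labeled are illustrative, not from the original): when the labels are left out, pandas numbers both axes from 0, and np.arange can also supply the labels.

```python
import numpy as np
import pandas as pd

# No index or columns given: pandas numbers both axes 0, 1, 2, ...
df_default = pd.DataFrame(data=np.arange(6).reshape(2, 3))
print(df_default)

# np.arange can also supply the labels themselves
df_labeled = pd.DataFrame(data=[[1, 2], [3, 4]], index=np.arange(10, 12), columns=list("AB"))
print(df_labeled.index.tolist())
```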

1.2 Creating a DataFrame using a dictionary

1.2.1 Method 1: Pass in a single dictionary. Note that each value should be a list (one key, multiple values); a single value must also be wrapped in [], unless an index is passed

data = {"name": ["jack", "HanMeimei"], "age": ["100", "100"]}
# data = {"name": "jack", "age": "100"}      # single values, not wrapped in []
# data = {"name": ["jack"], "age": ["100"]}  # a single value must be wrapped in []
df3 = pd.DataFrame(data, index=list("ab"))
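
To make the [] rule concrete, here is a minimal sketch (the names raised and df_one are made up for illustration). With plain single values and no index, pandas cannot tell how many rows to build and raises a ValueError.

```python
import pandas as pd

# Scalar values and no index: pandas cannot infer the number of rows
try:
    pd.DataFrame({"name": "jack", "age": "100"})
    raised = False
except ValueError:
    raised = True
print(raised)

# Wrapping each value in [] gives a one-row DataFrame
df_one = pd.DataFrame({"name": ["jack"], "age": ["100"]})
print(df_one.shape)
```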

1.2.2 Method 2: Pass in a list of dictionaries. Each dictionary is one row of data, and missing columns are filled with NaN

data = [{"name": "MaYun1", "age": 100}, {"name": "MaYun2", "age": 100}, {"name": "MaYun3", "age1": 100}]
# data = {"name": "jack", "age": "100"}
df4 = pd.DataFrame(data, index=list("abc"))

     age   age1    name
a  100.0    NaN  MaYun1
b  100.0    NaN  MaYun2
c    NaN  100.0  MaYun3

(Column order may differ depending on the pandas version.)
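
A quick way to see the NaN filling, as a sketch with a shortened, hypothetical list called rows: isna() marks the cells that were missing, and a numeric column that receives NaN is stored as float.

```python
import pandas as pd

rows = [{"name": "MaYun1", "age": 100}, {"name": "MaYun3", "age1": 100}]
df_rows = pd.DataFrame(rows, index=list("ab"))

# Each missing key shows up as NaN in that row
print(df_rows.isna())
# The integer column becomes float because NaN is a float value
print(df_rows["age"].dtype)
```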

2. DataFrame basic attributes

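data = {"name": ["jack", "HanMeimei", "Lucy"], "age": ["100", "90", "98"], "salary": [30000, 50000, 999000]}
df5 = pd.DataFrame(data)
print(df5)             # the whole DataFrame
print(df5.head(1))     # first row
print(df5.tail(1))     # last row
df5.info()             # prints a summary itself and returns None
print(df5.index)       # row index
print(df5.columns)     # column index
print(df5.values)      # the data as a NumPy array
print(df5.describe())  # summary statistics of the numeric columns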
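
The sketch below, on a tiny made-up frame df_demo, shows what a few of these attributes return. It is only an illustration of the calls listed above, not part of the original example.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["jack", "Lucy"], "salary": [30000, 999000]})

print(df_demo.index.tolist())      # row labels
print(df_demo.columns.tolist())    # column labels
print(type(df_demo.values))        # the data as a NumPy array
# describe() only summarises the numeric columns of a mixed frame
print(df_demo.describe().loc["mean", "salary"])
```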

3. Sort all data by a specified column

df5 = df5.sort_values(by='salary', ascending=True)
print(df5)
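
sort_values also accepts a descending order and several sort keys. A minimal sketch, using a made-up frame df_sort rather than the df5 above:

```python
import pandas as pd

df_sort = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"],
                        "age": [100, 90, 90],
                        "salary": [30000, 50000, 999000]})

# Highest salary first
by_salary = df_sort.sort_values(by="salary", ascending=False)
print(by_salary["name"].tolist())

# Sort by age, then break ties by salary in descending order
by_age_salary = df_sort.sort_values(by=["age", "salary"], ascending=[True, False])
print(by_age_salary["name"].tolist())
```

Note that the rows keep their original index labels after sorting.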

4. DataFrame simple row and column slices

data = {"name": ["jack", "HanMeimei", "Lucy", "Mr Green", "Mrs Han", "Lily"], "age": [100, 90, 98, 90, 100, 30], "salary": [30000, 50000, 999000, 90000, 80000, 75000]}
df6 = pd.DataFrame(data)
print(df6)

# Take the first five rows
print(df6[0:5])
# Select the name column
print(df6["name"])
# Select the name column of the first three rows
print(df6[0:3]["name"])
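
One detail worth checking for yourself: an integer slice inside [] selects rows by position, while a string inside [] selects a column. A small sketch with a hypothetical frame df_slice that has string row labels:

```python
import pandas as pd

df_slice = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"], "age": [100, 90, 98]},
                        index=list("xyz"))

first_two = df_slice[0:2]    # rows at positions 0 and 1; position 2 is excluded
names = df_slice["name"]     # a single column comes back as a Series
print(first_two.index.tolist())
print(type(names).__name__)
```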

5. Row and column selection with loc

5.1 The verbose version (can be skipped; reading 5.2 alone is enough)

5.1.1 Overview

data = {"name": ["jack", "HanMeimei", "Lucy", "Mr Green", "Mrs Han", "Lily"], "age": [100, 90, 98, 90, 100, 30], "salary": [30000, 50000, 999000, 90000, 80000, 75000]}
df7 = pd.DataFrame(data, index=list("abcdef"))
print(df7)

# Get the element with row label 'a' and column label 'name'
print(df7.loc['a', 'name'])

# Get the elements with row label 'f' and column labels ['name', 'age']
print(df7.loc['f', ['name', 'age']])

# Get the elements with row labels ['c', 'f'] and column labels ['name', 'age']
print(df7.loc[['c', 'f'], ['name', 'age']])

# Get the elements with row labels in the slice 'a':'e' and column labels ['name', 'age']
# Note: label slices are closed, so 'e' is included
print(df7.loc['a':'e', ['name', 'age']])

# Get the elements with row labels in the slice 'a':'e' and column labels in the slice 'age':'salary'
print(df7.loc['a':'e', 'age':'salary'])

5.1.2 Fetching a Single Row – Fetch all data from one row, such as 'a' or 'c'

# Either of the following forms can be used
df7.loc['a', :]
df7.loc['c']

> name        Lucy
  age           98
  salary    999000

5.1.3 Fetching Non-contiguous Multiple Rows – Fetch all data with row labels 'a', 'c'

df7.loc[['a', 'c']]  # note the nested []

5.1.4 Fetching Consecutive Multiple Rows with a Slice

df7['a':'c']

5.1.5 Fetching a Single Column – Fetch all data from the column labeled 'name'

# Either of the following forms can be used
print(df7.loc[:, 'name'])
print(df7['name'])

> a         jack
  b    HanMeimei
  c         Lucy
  d     Mr Green
  e      Mrs Han
  f         Lily

5.1.6 Fetching Non-contiguous Multiple Columns – Fetch all data with column labels 'name', 'age'

df7.loc[:, ['name', 'age']]
df7[['name', 'age']]

5.2 Just remember the following (memorizing too many variants only causes confusion)

The basic format is:

df7.loc[rows, columns]

For consecutive rows or columns, use a slice (:).

For non-contiguous rows or columns, use a list [].

Slices and lists can be mixed.

For example:

5.5.1 Consecutive multiple rows and columns
df7.loc['a':'c', 'name':'age']
>       name  age
a       jack  100
b  HanMeimei   90
c       Lucy   98

5.5.2 Non-contiguous multiple rows + consecutive multiple columns
df7.loc[['a', 'c'], 'name':'salary']
>  name  age  salary
a  jack  100   30000
c  Lucy   98  999000

5.5.3 Non-contiguous multiple rows + non-contiguous multiple columns
df7.loc[['a', 'c'], ['name', 'salary']]
>  name  salary
a  jack   30000
c  Lucy  999000

5.5.4 All rows + non-contiguous multiple columns (the same applies to all columns)
df7.loc[:, ['name', 'salary']]  # note: just write an empty slice : for the rows
>       name  salary
a       jack   30000
b  HanMeimei   50000
c       Lucy  999000
d   Mr Green   90000
e    Mrs Han   80000
f       Lily   75000

5.5.5 Non-contiguous multiple rows + single column (the same applies to a single row)
df7.loc[['a', 'c'], 'name']  # a single column label returns a Series
> a    jack
c    Lucy
Name: name, dtype: object
<class 'pandas.core.series.Series'>

df7.loc[['a', 'c'], ['name']]
type(df7.loc[['a', 'c'], ['name']])  # a list of column labels returns a DataFrame
>  name
a  jack
c  Lucy
<class 'pandas.core.frame.DataFrame'>

6. Row and column selection with iloc

iloc works just like loc, except that it selects by integer position

Note that, unlike loc, iloc slices do not include the end position

df7.iloc[[1, 3], 0:1]
>       name
b  HanMeimei
d   Mr Green

df7.iloc[1:3, 0:1]
>       name
b  HanMeimei
c       Lucy
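
The difference in slice endpoints is easy to verify. In this sketch (df_pos is a made-up frame), the label slice includes both ends while the position slice stops before its end, so the two selections come out identical.

```python
import pandas as pd

df_pos = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy", "Mr Green"],
                       "age": [100, 90, 98, 90]},
                      index=list("abcd"))

by_label = df_pos.loc["b":"c", "name":"age"]    # both ends included
by_position = df_pos.iloc[1:3, 0:2]             # end positions 3 and 2 excluded
print(by_label.equals(by_position))
```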

7. Changing data by assignment

You can use loc or iloc

df7.iloc[1:3, 1:3] = 99999999
print(df7)
>       name       age    salary
a       jack       100     30000
b  HanMeimei  99999999  99999999
c       Lucy  99999999  99999999
d   Mr Green        90     90000
e    Mrs Han       100     80000
f       Lily        30     75000
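
The same kind of assignment works with loc, either by label or with a boolean condition on the rows. A minimal sketch on a hypothetical frame df_assign:

```python
import pandas as pd

df_assign = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"],
                          "age": [100, 90, 98],
                          "salary": [30000, 50000, 999000]},
                         index=list("abc"))

# Label-based: overwrite one cell
df_assign.loc["a", "salary"] = 35000
# Condition-based: overwrite salary for every row where age is below 99
df_assign.loc[df_assign["age"] < 99, "salary"] = 0
print(df_assign)
```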

8. Boolean indexing

Let's look at an example

Create a DataFrame

score = {"Name": ["Zhang Wuji", "Zhao", "Joe", "Big Joe", "Yang Yuhuan", "The sable cicada", "Beauty", "The prince", "Ginger tooth", "Li bai", "Du fu", "Wang wei", "Li Xiaoyu"],
         "Chinese": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100, 100],
         "Mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85, 100],
         "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95, 100]}
df_score = pd.DataFrame(score)
print(df_score)

8.1 Take out the data of everyone with an English score greater than 90

# The comparison returns a boolean Series
loc_ = df_score.loc[:, "English"] > 90
print(loc_)
print(type(loc_))  # <class 'pandas.core.series.Series'>
# Boolean indexing on the DataFrame keeps the rows where the value is True
print(df_score[loc_])

# This can also be shortened to
print(df_score[df_score.loc[:, "English"] > 90])

8.2 Take out the data of everyone with an English score not greater than 90 (~)

Note: prefix the condition with ~ to negate it

print(df_score[~(df_score.loc[:, "English"] > 90)])

8.3 Take out the data of all students with an English score greater than 90 and a Chinese score less than 80

print(df_score[(df_score.loc[:, "English"] > 90) & (df_score.loc[:, "Chinese"] < 80)])
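
Conditions can be combined with & (and) and | (or). The sketch below uses a made-up three-row frame df_bool. The parentheses around each comparison matter, because & and | bind more tightly than > and <.

```python
import pandas as pd

df_bool = pd.DataFrame({"Name": ["A", "B", "C"],
                        "Chinese": [78, 90, 56],
                        "English": [91, 59, 30]})

# Both conditions must hold
both = df_bool[(df_bool["English"] > 90) & (df_bool["Chinese"] < 80)]
# At least one condition must hold
either = df_bool[(df_bool["English"] > 90) | (df_bool["Chinese"] > 80)]
print(both["Name"].tolist())
print(either["Name"].tolist())
```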


9. String methods

Create a DataFrame
student = {"Name": ["Zhang Wuji", "Zhao", "Joe", "Big Joe", "Yang Yuhuan", "The sable cicada", "Beauty", "The prince", "Ginger tooth", "Li bai", "Du fu", "Wang wei", "Li Xiaoyu"],
           "Chinese": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100, 100],
           "Mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85, 100],
           "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95, 100],
           "Class": ["Class 3, Grade 1", "Class 1, Grade 1", "Class 3, Grade 2", "Class 1, Grade 2", "Class 13, Grade 1", "Class 7, Grade 3", "Class 3, Grade 5", "Class 3, Grade 4", "Class 5, Grade 1", "Class 7, Grade 1", "Class 4, Grade 1", "Class 9, Grade 1", "Class 10, Grade 1"],
           }
df_student = pd.DataFrame(student)
print(df_student)

9.1 len – Select the rows whose "Class" string is longer than 5 characters

print(df_student[df_student["Class"].str.len() > 5])

9.2 replace – Change "Grade 1" in the "Class" column to "School Grade 1"

# Note: the right-hand side returns a Series, which is assigned back to the corresponding column of the original DataFrame
df_student["Class"] = df_student["Class"].str.replace("Grade 1", "School Grade 1")
print(df_student)
# The same thing using loc to select the column (use one form or the other; running both applies the replacement twice)
# df_student.loc[:, "Class"] = df_student.loc[:, "Class"].str.replace("Grade 1", "School Grade 1")

9.3 contains – Select the rows whose "Class" column contains both "School" and "1"

print(df_student[
          (df_student["Class"].str.contains("School"))
          &
          (df_student["Class"].str.contains("1"))])

9.4 split – Split character string
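
A minimal sketch of str.split, assuming a small stand-alone Series called classes with values in the same "Class x, Grade y" shape as the data above. By default the result is a Series of lists; with expand=True the pieces become separate columns.

```python
import pandas as pd

classes = pd.Series(["Class 3, Grade 1", "Class 1, Grade 2"])

# str.split returns a Series of lists
parts = classes.str.split(", ")
print(parts.tolist())

# expand=True spreads the pieces into separate columns of a DataFrame
columns = classes.str.split(", ", expand=True)
print(columns[1].tolist())
```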

9.5 get – Print the first character of each student's name

print(df_student["Name"].str.get(0))

9.6 match – Regular expression match: find the rows whose name starts with 'Wang' or 'Li' (match anchors the pattern at the beginning of the string)

reg = 'Wang|Li'
print(df_student[df_student["Name"].str.match(reg)])
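
To see how match differs from contains, here is a small sketch. The Series names is made up, and "Big Li" is a hypothetical name added only to show a pattern that appears in the middle of a string.

```python
import pandas as pd

names = pd.Series(["Li bai", "Wang wei", "Du fu", "Big Li"])

starts = names.str.match("Wang|Li")       # the pattern must match at the start
anywhere = names.str.contains("Wang|Li")  # the pattern may match anywhere
print(starts.tolist())
print(anywhere.tolist())
```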

9.7 pad – Pad with a fill character

# Note: width=10 is the total width, that is, the original characters plus the * padding
# Pad both sides of the string with * up to a total length of 10
df_student["Name"] = df_student["Name"].str.pad(width=10, side='both', fillchar='*')
# Pad the right side of the string with - up to a total length of 20
df_student["Name"] = df_student["Name"].str.pad(width=20, side='right', fillchar='-')
print(df_student)
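
A short sketch of how width behaves, using a stand-alone Series called names (not the df_student column): the result has exactly width characters, and strings that are already longer than width are left as they are.

```python
import pandas as pd

names = pd.Series(["Joe", "Li bai"])

both = names.str.pad(width=10, side="both", fillchar="*")
right = names.str.pad(width=10, side="right", fillchar="-")
# width smaller than the string: nothing is added or cut off
unchanged = names.str.pad(width=2, fillchar="*")

print(both.tolist())
print(right.tolist())
print(both.str.len().tolist())
print(unchanged.tolist())
```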


10. Adding a total column; the apply method

10.1 Direct Addition

df_student["Total"] = df_student["Chinese"] + df_student["Mathematics"] + df_student["English"]

10.2 Traversal using the apply method of Series (apply takes a function, which makes it more powerful)

# 1. To use Series.apply, first build a Series from the index of the DataFrame:
#    pd.Series(df_student.index.tolist())
# 2. Then pass it a lambda expression, or a named function when more processing is needed (see sum1 below)
df_student['total'] = pd.Series(df_student.index.tolist()).apply(
    lambda i: df_student.loc[i, "Chinese"] + df_student.loc[i, "Mathematics"] + df_student.loc[i, "English"])

# Add 1000 points to the Chinese score of everyone who scored above 90 in Chinese, then compute the total
def sum1(i):
    if df_student.loc[i, "Chinese"] > 90:
        df_student.loc[i, "Chinese"] = df_student.loc[i, "Chinese"] + 1000
    return df_student.loc[i, "Chinese"] + df_student.loc[i, "Mathematics"] + df_student.loc[i, "English"]


df_student['total'] = pd.Series(df_student.index.tolist()).apply(sum1)
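
As an alternative to looping over the index, DataFrame.apply with axis=1 passes each row to the function, and sum(axis=1) does the same addition without apply. This is a sketch of a different technique from the one above; df_marks, subjects and the two total column names are made up.

```python
import pandas as pd

df_marks = pd.DataFrame({"Chinese": [78, 90], "Mathematics": [91, 59], "English": [91, 59]})
subjects = ["Chinese", "Mathematics", "English"]

# Row-wise apply: the function receives one row (a Series) at a time
df_marks["total_apply"] = df_marks.apply(lambda row: row[subjects].sum(), axis=1)
# Vectorised equivalent without apply
df_marks["total_sum"] = df_marks[subjects].sum(axis=1)
print(df_marks)
```

For a plain sum the vectorised form is usually the simpler choice; apply is more useful when each row needs custom logic.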

11. Missing data processing

# Use NumPy to generate random integers from 0 to 99 (5 rows, 7 columns)
rand = np.random.randint(0, 100, (5, 7))
# Build a DataFrame from the NumPy data (as float, so that the columns can hold NaN)
df = pd.DataFrame(rand, columns=list("ABCDEFG")).astype(float)
# Set some values to NaN
df.loc[0:3, "A":"B"] = np.nan
print(df)

11.1 Checking for NaN

11.1.1 Checking the whole DataFrame for NaN

# is null
print(pd.isnull(df))  # the result is a DataFrame

# not null
print(pd.notnull(df))

11.1.2 Checking a specified column for NaN

# Print the rows where column A is null
print(df[pd.isnull(df["A"])])

# Print the rows where column A is not null
print(df[pd.notnull(df["A"])])
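
Because the null check returns booleans, summing it gives a count of missing values per column. A small sketch on a fixed, made-up frame df_nan, so the result does not depend on random data:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"A": [np.nan, np.nan, 3.0], "B": [1.0, np.nan, 3.0]})

nan_counts = df_nan.isnull().sum()              # NaN count per column
rows_with_a = df_nan[df_nan["A"].notnull()]     # rows where A is present
print(nan_counts)
print(rows_with_a)
```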

11.2 Deleting NaN data from df

# how is not given, so the default "any" applies:
# a row is deleted as soon as one of its values is NaN
print(df.dropna(axis=0))

# With how="all", a row is deleted only if all of its values are NaN
print(df.dropna(axis=0, how="all"))
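
The difference between the two settings shows up clearly on a fixed example. In this sketch (df_drop is a made-up frame), row 0 has one NaN and row 1 has only NaN values.

```python
import numpy as np
import pandas as pd

df_drop = pd.DataFrame({"A": [np.nan, np.nan, 3.0], "B": [1.0, np.nan, 3.0]})

any_dropped = df_drop.dropna(axis=0)              # rows 0 and 1 are removed
all_dropped = df_drop.dropna(axis=0, how="all")   # only row 1 is removed
print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(df_drop.shape)    # the original is unchanged: dropna returns a new DataFrame
```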