A DataFrame is two-dimensional data, a container of Series, with both a row index and a column index
1. Create DataFrame
1.1 Creating a DataFrame from a List
Specify data, index, and columns
Each of them can be a list or a NumPy array such as np.arange(...)
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3], [11, 12, 13]], index=['r_1', 'r_2'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame(data=[[1], [11]], index=['r_1', 'r_2'], columns=['A'])
df3 = pd.DataFrame(data=np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
df1:
      A   B   C
r_1   1   2   3
r_2  11  12  13

df2:
      A
r_1   1
r_2  11

df3:
   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
1.2 Creating a DataFrame using a dictionary
1.2.1 Method 1: Pass in a single dictionary. Each key should map to a list of values (a single value must also be wrapped in []); a dictionary of bare scalars raises a ValueError unless an index is passed
data = {"name": ["jack", "HanMeimei"], "age": ["100", "100"]}
# data = {"name": "jack", "age": "100"}      # bare scalars: only valid when an index is passed
# data = {"name": ["jack"], "age": ["100"]}  # a single value wrapped in []
df3 = pd.DataFrame(data, index=list("ab"))
1.2.2 Method 2: Pass in a list of dictionaries. Each dictionary is one row of data, and missing columns are filled with NaN
rows = [{"name": "MaYun1", "age": 100}, {"name": "MaYun2", "age": 100}, {"name": "MaYun3", "age1": 100}]
df4 = pd.DataFrame(rows, index=list("abc"))
print(df4)
Output (the column order can vary with the pandas version):
     age   age1    name
a  100.0    NaN  MaYun1
b  100.0    NaN  MaYun2
c    NaN  100.0  MaYun3
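The two dictionary forms can be compared side by side. The following is a minimal sketch, assuming pandas is imported as pd; the variable names (by_column, by_row, scalar_error) are illustrative and do not come from the original.

```python
import pandas as pd

# Form 1: one dictionary, each key maps to a list (one list per column)
by_column = pd.DataFrame({"name": ["jack", "HanMeimei"], "age": [100, 100]},
                         index=list("ab"))

# Form 2: a list of dictionaries (one dictionary per row);
# keys missing from a row are filled with NaN
by_row = pd.DataFrame([{"name": "MaYun1", "age": 100},
                       {"name": "MaYun3", "age1": 100}],
                      index=list("ab"))

# A dictionary of bare scalars fails when no index is supplied
try:
    pd.DataFrame({"name": "jack", "age": "100"})
    scalar_error = None
except ValueError as exc:
    scalar_error = str(exc)

print(by_column.shape)
print(by_row.isna().sum().sum())   # number of NaN cells
print(scalar_error)
```

In the row form, the union of all keys becomes the set of columns, which is why a key that appears in only one dictionary leaves NaN in the other rows.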
2. DataFrame basic attributes
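data = {"name": ["jack", "HanMeimei", "Lucy"], "age": ["100", "90", "98"], "salary": [30000, 50000, 999000]}
df5 = pd.DataFrame(data)
print(df5)             # the whole DataFrame
print(df5.head(1))     # the first row
print(df5.tail(1))     # the last row
df5.info()             # prints a summary itself and returns None, so no print() is needed
print(df5.index)       # row index
print(df5.columns)     # column index
print(df5.values)      # the data as a NumPy array
print(df5.describe())  # summary statistics of the numeric columns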
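Because the age values in this example are strings, describe() leaves that column out. A small sketch of the effect, assuming pandas is imported as pd (the name people is illustrative):

```python
import pandas as pd

people = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"],
                       "age": ["100", "90", "98"],      # strings, not numbers
                       "salary": [30000, 50000, 999000]})

print(people.shape)    # (rows, columns)
print(people.dtypes)   # one dtype per column

# describe() summarizes numeric columns by default, so only salary appears
summary = people.describe()
print(list(summary.columns))

# Converting age to integers makes it numeric as well
people["age"] = people["age"].astype(int)
summary_after = people.describe()
print(list(summary_after.columns))
```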
3. Sorting all data by a specified column
df5 = df5.sort_values(by='salary', ascending=True)
print(df5)
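sort_values also accepts descending order and several sort keys. A minimal sketch under the assumption of a small illustrative table (staff and its values are made up for the example):

```python
import pandas as pd

staff = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"],
                      "age": [100, 90, 100],
                      "salary": [30000, 50000, 999000]})

# Descending order on one column
by_salary_desc = staff.sort_values(by="salary", ascending=False)
print(by_salary_desc["name"].tolist())

# Two sort keys: age ascending, then salary descending within equal ages
by_age_then_salary = staff.sort_values(by=["age", "salary"], ascending=[True, False])
print(by_age_then_salary["name"].tolist())
```

sort_values returns a new DataFrame and leaves the original untouched, which is why the result is assigned back in the example above.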
4. DataFrame simple row and column slices
data = {"name": ["jack", "HanMeimei", "Lucy", "Mr Green", "Mrs Han", "Lily"], "age": [100, 90, 98, 90, 100, 30], "salary": [30000, 50000, 999000, 90000, 80000, 75000]}
df6 = pd.DataFrame(data)
print(df6)
# Take the first five rows
print(df6[0:5])
# Take the name column
print(df6["name"])
# Take the name column of the first three rows
print(df6[0:3]["name"])
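The square brackets behave differently depending on what is put inside them. A short sketch, assuming a small illustrative DataFrame, of what each form returns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"], "age": [100, 90, 98]})

rows = df[0:2]                    # a slice inside [] selects rows by position (end excluded)
column = df["name"]               # a single label inside [] selects one column as a Series
names_of_rows = df[0:2]["name"]   # chained: first the rows, then the column

print(type(rows).__name__, rows.shape)
print(type(column).__name__, len(column))
print(names_of_rows.tolist())
```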
5. Row and column slicing with loc
5.1 The verbose version (optional; you can skip straight to 5.2)
5.1.1 Comprehensive example
data = {"name": ["jack", "HanMeimei", "Lucy", "Mr Green", "Mrs Han", "Lily"], "age": [100, 90, 98, 90, 100, 30], "salary": [30000, 50000, 999000, 90000, 80000, 75000]}
df7 = pd.DataFrame(data, index=list("abcdef"))
print(df7)
# Take the element with row label 'a' and column label 'name'
print(df7.loc['a', 'name'])
# Take the elements with row label 'f' and column labels ['name', 'age']
print(df7.loc['f', ['name', 'age']])
# Take the elements with row labels ['c', 'f'] and column labels ['name', 'age']
print(df7.loc[['c', 'f'], ['name', 'age']])
# Take the elements with row labels (slice 'a':'e') and column labels ['name', 'age']
# Note: label slices are closed intervals, so 'e' is included
print(df7.loc['a':'e', ['name', 'age']])
# Take the elements with row labels (slice 'a':'e') and column labels (slice 'age':'salary')
print(df7.loc['a':'e', 'age':'salary'])
5.1.2 Fetching a single row: all data in the row labeled 'a'
Either of the following forms works (the second one fetches row 'c', whose output is shown below)
df7.loc['a', :]
df7.loc['c']
name        Lucy
age           98
salary    999000
5.1.3 Fetching multiple non-contiguous rows: all data with row labels 'a', 'c'
df7.loc[['a', 'c']]  # note the nested []
5.1.4 Fetching multiple contiguous rows with a slice
df7['a':'c']
5.1.5 Fetching a single column: all data in the column labeled 'name'
Either of the following forms works
print(df7.loc[:, 'name'])
print(df7['name'])
a         jack
b    HanMeimei
c         Lucy
d     Mr Green
e      Mrs Han
f         Lily
5.1.6 Fetching multiple non-contiguous columns: all data with column labels 'name', 'age'
df7.loc[:, ['name', 'age']]
df7[['name', 'age']]
5.2 The concise version: remember just the one form below, since memorizing too many variants only causes trouble
The basic format is:
df7.loc[row, column]
To take contiguous rows or columns, use a slice (:)
To take non-contiguous rows or columns, use a list ([])
Slices and lists can be mixed
For example:
5.5.1 Contiguous multiple rows and columns
df7.loc['a':'c', 'name':'age']
>
        name  age
a       jack  100
b  HanMeimei   90
c       Lucy   98
5.5.2 Non-contiguous multiple rows + contiguous multiple columns
df7.loc[['a', 'c'], 'name':'salary']
>
   name  age  salary
a  jack  100   30000
c  Lucy   98  999000
5.5.3 Non-contiguous multiple rows + non-contiguous multiple columns
df7.loc[['a', 'c'], ['name', 'salary']]
>
   name  salary
a  jack   30000
c  Lucy  999000
5.5.4 All rows + non-contiguous multiple columns (the same applies to all columns)
df7.loc[:, ['name', 'salary']]  # note: for all rows, just write an empty slice :
>
        name  salary
a       jack   30000
b  HanMeimei   50000
c       Lucy  999000
d   Mr Green   90000
e    Mrs Han   80000
f       Lily   75000
5.5.5 Non-contiguous multiple rows + a single column (the same applies to a single row)
print(df7.loc[['a', 'c'], 'name'])        # a single column label returns a Series
print(type(df7.loc[['a', 'c'], 'name']))
>
a    jack
c    Lucy
Name: name, dtype: object
<class 'pandas.core.series.Series'>
print(df7.loc[['a', 'c'], ['name']])        # a list of column labels returns a DataFrame
print(type(df7.loc[['a', 'c'], ['name']]))
>
   name
a  jack
c  Lucy
<class 'pandas.core.frame.DataFrame'>
6. Row and column slicing with iloc
iloc works just like loc, except that it selects by integer position
Just note that, unlike loc, iloc slices do not include the end position
df7.iloc[[1, 3], 0:1]
>
        name
b  HanMeimei
d   Mr Green
df7.iloc[1:3, 0:1]
>
        name
b  HanMeimei
c       Lucy
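The difference in slice endpoints between loc and iloc can be checked directly. A minimal sketch, assuming a four-row DataFrame with the labels a to d (an illustrative cut-down of the example data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy", "Mr Green"],
                   "age": [100, 90, 98, 90]},
                  index=list("abcd"))

by_label = df.loc["b":"c", "name":"age"]   # label slices include both endpoints
by_position = df.iloc[1:3, 0:2]            # position slices exclude the end position

print(by_label.equals(by_position))        # both select rows b, c and columns name, age

# A list of positions picks non-contiguous rows
picked = df.iloc[[1, 3], 0:1]
print(picked.index.tolist())
```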
7. Changing data by assignment
You can use loc or iloc
df7.iloc[1:3, 1:3] = 99999999
print(df7)
>
        name       age    salary
a       jack       100     30000
b  HanMeimei  99999999  99999999
c       Lucy  99999999  99999999
d   Mr Green        90     90000
e    Mrs Han       100     80000
f       Lily        30     75000
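The same cells can be addressed by label with loc. A sketch on a three-row illustrative copy of the data, comparing the two forms (the value 0 and the names by_position and by_label are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"name": ["jack", "HanMeimei", "Lucy"],
                   "age": [100, 90, 98],
                   "salary": [30000, 50000, 999000]},
                  index=list("abc"))

by_position = df.copy()
by_position.iloc[1:3, 1:3] = 0              # rows 1-2, columns 1-2 (end excluded)

by_label = df.copy()
by_label.loc["b":"c", "age":"salary"] = 0   # the same cells by label (end included)

print(by_position.equals(by_label))
print(by_label["salary"].tolist())
```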
8. Boolean indexing
Let's look at an example
Create a DataFrame
score = {
    "Name": ["Zhang Wuji", "Zhao", "Joe", "Big Joe", "Yang Yuhuan", "The sable cicada", "Beauty", "The prince", "Ginger tooth", "Li bai", "Du fu", "Wang wei", "Li Xiaoyu"],
    "Chinese": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100, 100],
    "Mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85, 100],
    "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95, 100],
}
df_score = pd.DataFrame(score)
print(df_score)
8.1 Take out the data of everyone with an English score greater than 90
# The comparison returns a Boolean Series
loc_ = df_score.loc[:, "English"] > 90
print(loc_)
print(type(loc_))  # <class 'pandas.core.series.Series'>
# Boolean indexing on the DataFrame keeps only the rows where the value is True
print(df_score[loc_])
# This can also be shortened to
print(df_score[df_score.loc[:, "English"] > 90])
8.2 Take out the data of everyone whose English score is not greater than 90 (~)
Note: prefix the condition with ~ to negate it
print(df_score[~(df_score.loc[:, "English"] > 90)])
8.3 Take out the data of all students with an English score greater than 90 and a Chinese score greater than 80
# Each condition needs its own parentheses because & binds more tightly than >
print(df_score[(df_score.loc[:, "English"] > 90) & (df_score.loc[:, "Chinese"] > 80)])
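Boolean masks combine with & (and), | (or), and ~ (not). A minimal sketch on a four-row illustrative table (the names A to D and their scores are made up for the example):

```python
import pandas as pd

scores = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                       "Chinese": [78, 90, 87, 56],
                       "English": [91, 59, 100, 30]})

# Naming each mask first avoids the parentheses needed when writing them inline
high_english = scores["English"] > 90
high_chinese = scores["Chinese"] > 80

both = scores[high_english & high_chinese]     # and
either = scores[high_english | high_chinese]   # or
not_high_english = scores[~high_english]       # not, i.e. English <= 90

print(both["Name"].tolist())
print(either["Name"].tolist())
print(not_high_english["Name"].tolist())
```

Python's own and, or, and not do not work element by element on a Series, which is why the operator forms are used here.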
9. String methods
Create a DataFrame
student = {
    "Name": ["Zhang Wuji", "Zhao", "Joe", "Big Joe", "Yang Yuhuan", "The sable cicada", "Beauty", "The prince", "Ginger tooth", "Li bai", "Du fu", "Wang wei", "Li Xiaoyu"],
    "Chinese": [78, 90, 87, 88, 56, 94, 92, 85, 93, 91, 59, 100, 100],
    "Mathematics": [91, 59, 100, 75, 30, 95, 91, 59, 100, 10, 95, 85, 100],
    "English": [91, 59, 100, 75, 30, 95, 10, 95, 85, 75, 30, 95, 100],
    "Class": ["Class 3, Grade 1", "Class 1, Grade 1", "Class 3, Grade 2", "Class 1, Grade 2", "Class 13, Grade 1", "Class 7, Grade 3", "Class 3, Grade 5", "Class 3, Grade 4", "Class 5, Grade 1", "Class 7, Grade 1", "Class 4, Grade 1", "Class 9, Grade 1", "Class 10, Grade 1"],
}
df_student = pd.DataFrame(student)
print(df_student)
9.1 len: select the rows where the string length of a column's elements is greater than 5
print(df_student[df_student["Class"].str.len() > 5])
9.2 replace: change "Grade 1" in the "Class" column to "School Grade 1"
Note that the right-hand side of the equals sign returns a Series, which is assigned back to the corresponding column of the original DataFrame
df_student["Class"] = df_student["Class"].str.replace("Grade 1", "School Grade 1")
print(df_student)
# The equivalent using loc to fetch the column (use one form or the other, not both)
# df_student.loc[:, "Class"] = df_student.loc[:, "Class"].str.replace("Grade 1", "School Grade 1")
9.3 contains: select the rows whose "Class" column contains both "School" and "1"
print(df_student[
    (df_student["Class"].str.contains("School"))
    &
    (df_student["Class"].str.contains("1"))])
9.4 split: split a string
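As a minimal sketch of split, assuming a Series of class strings shaped like the Class column (the two sample values are illustrative):

```python
import pandas as pd

classes = pd.Series(["Class 3, Grade 1", "Class 13, Grade 2"])

parts = classes.str.split(", ")                 # each element becomes a list of pieces
pieces = classes.str.split(", ", expand=True)   # expand=True spreads the pieces into columns

print(parts.tolist())
print(pieces[0].tolist())   # first piece of every string
print(pieces[1].tolist())   # second piece of every string
```

With expand=True the result is a DataFrame whose columns are numbered 0, 1, and so on, which makes it easy to assign the pieces to new columns.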
9.5 get: print the first character of each student's name (the surname)
print(df_student["Name"].str.get(0))
9.6 match: regular expression match, find the rows whose name starts with 'Wang' or 'Li'
reg = 'Wang|Li'
print(df_student[df_student["Name"].str.match(reg)])
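str.match only succeeds when the pattern matches at the start of the string, whereas str.contains searches anywhere in it. A short sketch showing the difference, on an illustrative list of names:

```python
import pandas as pd

names = pd.Series(["Wang wei", "Li bai", "Du fu", "Xiao Li"])

starts_with = names.str.match("Wang|Li")    # anchored at the start of each string
anywhere = names.str.contains("Wang|Li")    # matches anywhere in the string

print(starts_with.tolist())
print(anywhere.tolist())
```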
9.7 pad: pad with a fill character
# Note: width=10 is the total width, i.e. the existing characters plus the * padding
# Pad both sides of the string with * until its length is 10
df_student["Name"] = df_student["Name"].str.pad(width=10, side='both', fillchar='*')
# Pad the right side of the string with - until its length is 20
df_student["Name"] = df_student["Name"].str.pad(width=20, side='right', fillchar='-')
print(df_student)
10. Adding a total column, and the apply method
10.1 Direct addition
df_student["Total"] = df_student["Chinese"] + df_student["Mathematics"] + df_student["English"]
10.2 Traversal using the apply method of Series (apply takes a function, so it is more powerful)
df_student['total'] = pd.Series(df_student.index.tolist()).apply(
    lambda i: df_student.loc[i, "Chinese"] + df_student.loc[i, "Mathematics"] + df_student.loc[i, "English"])
# 1. To use the Series apply method, first build a Series from the index of the DataFrame:
#    pd.Series(df_student.index.tolist())
# 2. apply is then given a lambda expression; a named function can be passed instead (a function can do more processing), as in the following example
# Add 1000 points to the Chinese score of everyone whose Chinese score is greater than 90, then calculate the total score
def sum1(i):
    if df_student.loc[i, "Chinese"] > 90:
        df_student.loc[i, "Chinese"] = df_student.loc[i, "Chinese"] + 1000
    return df_student.loc[i, "Chinese"] + df_student.loc[i, "Mathematics"] + df_student.loc[i, "English"]
df_student['total'] = pd.Series(df_student.index.tolist()).apply(sum1)
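As an alternative sketch, not the approach used above, the total can be computed without going through the index: either with a vectorized sum across columns, or with DataFrame.apply and axis=1, which passes one row at a time to the function. The table and the helper name total_with_bonus are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"],
                   "Chinese": [78, 95],
                   "Mathematics": [91, 59],
                   "English": [91, 59]})

subjects = ["Chinese", "Mathematics", "English"]

# Vectorized: sum across the subject columns for every row
df["Total"] = df[subjects].sum(axis=1)

# Row-wise apply: the function receives one row (a Series) at a time
def total_with_bonus(row):
    # Hypothetical rule mirroring the example: +1000 when Chinese is above 90
    chinese = row["Chinese"] + 1000 if row["Chinese"] > 90 else row["Chinese"]
    return chinese + row["Mathematics"] + row["English"]

df["TotalWithBonus"] = df.apply(total_with_bonus, axis=1)

print(df["Total"].tolist())
print(df["TotalWithBonus"].tolist())
```

Unlike sum1, this helper does not modify the Chinese column, so running it twice gives the same result.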
11. Handling missing data
# Use numpy to generate random integers from 0 up to (but not including) 100, in 5 rows and 7 columns
rand = np.random.randint(0, 100, (5, 7))
# Build a DataFrame from the numpy data
df = pd.DataFrame(rand, columns=list("ABCDEFG"))
# Set some values to NaN
df.loc[0:3, "A":"B"] = np.nan
print(df)
11.1 Checking for NaN
11.1.1 Checking the entire DataFrame for NaN
# is null
print(pd.isnull(df))  # the result is a DataFrame
# not null
print(pd.notnull(df))
11.1.2 Checking a specified column of the DataFrame for NaN
# Print the rows where column A is null
print(df[pd.isnull(df["A"])])
# Print the rows where column A is not null
print(df[pd.notnull(df["A"])])
11.2 Deleting NaN data in df
# how is not passed here; the default is "any"
# A row is deleted if any value in it is NaN
print(df.dropna(axis=0))
# A row is deleted only if all of its values are NaN
print(df.dropna(axis=0, how="all"))
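Since the random data above changes on every run, the difference between how="any" and how="all" is easier to see on fixed values. A minimal sketch with an illustrative two-column DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan],
                   "B": [2.0, 5.0, np.nan]})

drop_any = df.dropna(axis=0)              # how="any" is the default
drop_all = df.dropna(axis=0, how="all")   # only rows that are entirely NaN are dropped

print(len(drop_any))             # rows with no NaN at all
print(len(drop_all))             # rows with at least one real value
print(df["A"].isnull().sum())    # number of NaN values in column A
```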