The Pandas library is designed for data analysis and is an important factor in making Python a powerful and efficient data analysis environment.
1. Pandas data structures
1. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. s1 = pd.Series(['a', 'b', 'c']). A Series is a data structure made up of a set of data and a set of row indexes.
3. pd.Series(['a', 'b', 'c'], index=[1, 3, 4]) creates a Series with a custom index.
4. s1 = pd.Series({1: 'a', 2: 'b', 3: 'c'}) creates a Series from a dictionary; the keys become the index.
5. s1.index returns the index (an attribute, not a method).
6. s1.values returns the values.
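A minimal sketch tying the Series basics together (the values and custom index are illustrative):

    import pandas as pd

    s1 = pd.Series(['a', 'b', 'c'], index=[1, 3, 4])  # custom row index
    print(s1.index)   # the index: [1, 3, 4]
    print(s1.values)  # the values: ['a' 'b' 'c']
    print(s1[3])      # 'b', since with an integer index [] looks up by label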
7. df = pd.DataFrame(['a', 'b', 'c']). A DataFrame is a data structure made up of a group of data and two sets of indexes (row and column).
8. df = pd.DataFrame([['a', 'A'], ['b', 'B'], ['c', 'C']], columns=['lowercase', 'uppercase'], index=[1, 2, 3])
columns is the column index and index is the row index.
9. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspider installs a package through the Tsinghua mirror.
10. data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']} turns a dictionary into a table:
df = pd.DataFrame(data)
11. df.index and df.columns return the row index and the column index.
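The two DataFrame constructors above produce the same kind of table; a quick sketch with illustrative values:

    import pandas as pd

    data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}
    df = pd.DataFrame(data, index=[1, 2, 3])
    print(df.columns)  # column index: ['lowercase', 'uppercase']
    print(df.index)    # row index: [1, 2, 3]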
2. Reading data
12. df = pd.read_excel(r'C:\Users\...xlsx', sheet_name='Sheet1')
pd.read_excel(r'C:\Users\...xlsx', sheet_name=0) also works; sheet_name takes a sheet name or a position. This reads an Excel table.
13. pd.read_excel(r'C:\Users\...xlsx', index_col=0, header=0)
index_col specifies which column becomes the row index, and header specifies which row becomes the column index.
14. pd.read_excel(r'C:\Users\...xlsx', usecols=[0, 1]) imports only the specified columns (used here without index_col or header).
15. pd.read_table(r'C:\Users\...txt', sep=' ') imports a TXT file; sep specifies the delimiter.
16. df.head(2) displays the first two rows; by default head() displays the first five.
17. df.shape returns the number of rows and columns, not counting the row and column indexes.
18. df.info() displays the data type of each column in the table.
19. df.describe() returns distribution statistics (count, mean, standard deviation, quantiles, etc.) for the numeric columns in a table.
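A sketch of a typical import and first inspection; the path, sheet and file contents are hypothetical:

    import pandas as pd

    df = pd.read_excel(r'C:\Users\me\Desktop\sales.xlsx',  # hypothetical path
                       sheet_name=0,   # first sheet
                       index_col=0,    # first column becomes the row index
                       header=0)       # first row becomes the column index
    print(df.head(2))     # first two rows
    print(df.shape)       # (rows, columns)
    df.info()             # dtypes and non-null counts per column
    print(df.describe())  # summary statistics for numeric columns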
3. Data preprocessing
20. df.info() also shows which columns contain null values.
21. df.isnull() flags missing values, returning True where a value is missing and False otherwise.
22. df.dropna() deletes rows containing missing values by default.
23. df.dropna(how='all') drops only rows in which every value is null.
24. df.fillna(0) fills all null values with 0.
25. df.fillna({'gender': 'male', 'age': 30}) fills nulls in the gender column with 'male' and nulls in the age column with 30.
26. df.drop_duplicates() checks all columns for duplicate rows by default and keeps the first occurrence.
27. df.drop_duplicates(subset='gender') checks for duplicates only in the gender column and keeps the first occurrence.
28. df.drop_duplicates(subset=['gender', 'company'], keep='last') checks for duplicates across the gender and company columns.
keep defaults to 'first'; it can be set to 'last' or to False, which drops every duplicated row.
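A sketch of the missing-value and duplicate handling above, on toy data (column names are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'gender': ['male', None, 'male'],
                       'age': [25, 30, np.nan],
                       'company': ['A', 'B', 'A']})
    df = df.fillna({'gender': 'male', 'age': 30})          # per-column fill values
    df = df.drop_duplicates(subset=['gender', 'company'],  # dedupe on two columns
                            keep='last')                   # keep the last occurrence
    print(df)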
29. df['ID'].dtype queries the data type of the ID column.
30. df['ID'].astype('float') converts the ID column to float.
31. Data types include int, float, object, string, unicode and datetime.
32. df['ID'][1] selects the second value in the ID column.
33. df.columns = ['uppercase', 'lowercase', 'Chinese'] adds a column index to a table that has none.
34. df.index = [1, 2, 3] sets the row index the same way.
35. df.set_index('number') specifies an existing column to use as the row index.
36. df.rename(index={'order number': 'new order number', 'customer name': 'new customer name'}) renames row index labels.
37. df.rename(columns={1: '1', 2: '2'}) renames columns the same way.
38. df.reset_index() converts all index levels back into columns by default.
39. df.reset_index(level=0) converts only index level 0 into a column.
40. df.reset_index(drop=True) discards the original index instead of turning it into a column.
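A sketch of round-tripping a column through the index (toy column names):

    import pandas as pd

    df = pd.DataFrame({'number': [101, 102], 'name': ['Ann', 'Bob']})
    df = df.set_index('number')        # use the 'number' column as the row index
    df = df.rename(index={101: 201})   # rename one row label
    df = df.reset_index()              # move the index back into a column
    print(df)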
4. Data selection
41. df[['ID', 'name']] selects several columns by name.
42. df.iloc[[1, 3], [2, 4]] selects data by row and column position.
43. df.iloc[1, 1] selects the value in the second data row and second column; counting the file's header row that is the third row, since the first row becomes the column index by default.
44. df.iloc[:, 0:4] gets the values of columns 1 through 4.
45. df.loc['一'] selects a row by its label; loc returns the row as a Series that can be accessed like a list.
46. df.loc['1'][0] or df.loc['1']['2'] picks a single value out of a selected row, by column position or by column label.
47. df.iloc[1] selects the second row by position.
48. df.iloc[[1, 3]] selects rows 2 and 4 by passing a list of positions.
49. df.iloc[1:3] selects rows 2 and 3 (the end of a positional slice is excluded).
50. df[df['age'] < 45] selects the rows where age is less than 45.
51. df[(df['age'] < 45) & (df['ID'] < 4)] combines conditions with & (and) and | (or).
52. df.loc[['一', '2'], ['age', 'ID']] and df.iloc[[1, 3], [2, 4]] make the same kind of selection: loc works by name, iloc by number.
53. df[df['age'] < 45][['age', 'ID']] first filters rows by the age condition, then selects the named columns.
54. df.iloc[1:3, 2:4] slices by both row and column position.
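A sketch contrasting the selection styles above (toy table):

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2, 3],
                       'name': ['Ann', 'Bob', 'Cai'],
                       'age': [30, 50, 40]})
    print(df[['ID', 'name']])                     # columns by name
    print(df.iloc[0:2, 0:2])                      # rows/columns by position (end-exclusive)
    print(df.loc[0, 'age'])                       # a single value by label
    print(df[(df['age'] < 45) & (df['ID'] < 3)])  # boolean filtering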
5. Numerical operations
55. df['age'].replace(100, 33) replaces 100 in the age column with 33.
56. df.replace(np.nan, 0) is equivalent to fillna(0); np.nan is the default representation of a missing value in Python.
57. df.replace([A, B], C) is a many-to-one replacement: both A and B are replaced with C.
58. df.replace({'A': 'a', 'B': 'b', 'C': 'c'}) is a many-to-many replacement.
59. df.sort_values(by=['order number'], ascending=False) sorts by the order number column in descending order.
60. df.sort_values(by=['order number'], na_position='first') puts missing values first when sorting;
missing values are placed last by default.
61. df.sort_values(by=['col1', 'col2'], ascending=[False, True]) sorts by several columns, each with its own direction.
62. df['sales'].rank(method='first') ranks the sales column; method='first' breaks ties by order of appearance.
63. df.drop(['sales', 'ID'], axis=1) drops the sales and ID columns by name.
64. df.drop(df.columns[[4, 5]], axis=1) drops columns by position.
65. df.drop(columns=['sales', 'ID']) drops columns with the columns keyword.
66. df.drop(['a', 'b'], axis=0) drops the rows labeled 'a' and 'b'.
67. df.drop(df.index[[4, 5]], axis=0) drops rows by position.
68. df.drop(index=['a', 'b']) drops rows with the index keyword.
69. df['ID'].value_counts() counts the occurrences of each value in the ID column.
70. df['ID'].value_counts(normalize=True, sort=False) returns proportions instead of counts, without sorting.
71. df['ID'].unique() returns the distinct values in the column.
72. df['age'].isin(['a', 11]) checks whether each value in the column is 'a' or 11.
73. pd.cut(df['ID'], bins=[70, ...]) cuts the ID column into bins at the given cut points.
74. pd.qcut(df['ID'], 3) cuts the ID column into 3 parts with as equal a number of values in each part as possible.
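A sketch contrasting cut (explicit bin edges) and qcut (equal-sized bins) on illustrative values:

    import pandas as pd

    s = pd.Series([1, 5, 7, 30, 60, 95])
    print(pd.cut(s, bins=[0, 10, 50, 100]))  # bins at explicit edges
    print(pd.qcut(s, 3))                     # 3 bins with roughly equal counts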
75. df.insert(2, 'merchandise', ['book', 'pen', 'calculator']) inserts a new column at position 2.
76. df['merchandise'] = ['book', 'pen', 'calculator'] inserts a new column at the end of the table.
77. df.T swaps rows and columns (transpose).
78. df.stack() converts table data into tree-shaped (long) data.
79. df.set_index(['ID', 'name']).stack().reset_index()
converts to tree data and resets the row index.
80. df.melt(id_vars=['ID', 'name'], var_name='year', value_name='sale'): id_vars names the columns to keep as identifiers;
var_name is the name of the new column that receives the old column index, and value_name is the name of the new column that receives the values.
81. df['C1'].apply(lambda x: x + 1) applies a function to every value in the C1 column.
82. df.applymap(lambda x: x + 1) performs the same operation on every value in the table.
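A sketch of melt plus an element-wise transform on a toy wide table:

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2], 'name': ['Ann', 'Bob'],
                       '2017': [10, 20], '2018': [30, 40]})
    long = df.melt(id_vars=['ID', 'name'], var_name='year', value_name='sale')
    print(long)                                            # one row per (ID, year)
    print(df[['2017', '2018']].applymap(lambda x: x + 1))  # element-wise +1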
6. Data operations
83. df['ID'] + df['ID'] adds two columns element by element; the other arithmetic operators work the same way.
84. df['ID'] > df['ID'] compares two columns element by element; >, <, == and the rest work the same way.
85. df.count() counts the number of non-null values in each column.
86. df.count(axis=1) counts the number of non-null values in each row.
87. df['ID'].count() counts the non-null values in the specified column.
88. df.sum() sums each column; df.sum(axis=1) sums each row.
89. df.mean(axis=1) computes the mean of each row (of each column without axis=1).
90. df.max(axis=1) finds the maximum of each row.
91. df.min(axis=1) finds the minimum of each row.
92. df.median(axis=1) finds the median of each row.
93. df.mode(axis=1) finds the most frequent value in each row.
94. df.var(axis=1) computes the variance of each row.
95. df.std(axis=1) computes the standard deviation of each row.
96. df.quantile(0.25) finds the 1/4 quantile; 0.5, 0.75 or any other quantile works the same way.
97. df.corr() computes the correlations between the columns of the DataFrame.
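A sketch of the summary statistics on a toy table:

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2, 3, 4], 'sales': [10, 20, 20, 40]})
    print(df.count())          # non-null values per column
    print(df.mean())           # mean of each column
    print(df.quantile(0.25))   # first quartile of each numeric column
    print(df.corr())           # pairwise correlation between columns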
7. Time series
98. from datetime import datetime
99. datetime.now() returns the current time.
100. datetime.now().year returns the year; .month and .day work the same way.
101. datetime.now().weekday() + 1 returns the day of the week (weekday() itself counts Monday as 0).
102. datetime.now().isocalendar() returns the ISO (year, week, weekday) triple;
103. for example, (2018, 41, 7) means week 41, day 7 of 2018.
104. datetime.now().date() returns only the date part.
105. datetime.now().time() returns only the time part.
106. datetime.now().strftime('%Y-%m-%d %H:%M:%S') returns a formatted string such as 2020-03-13 09:09:12.
107. from dateutil.parser import parse
108. parse(str_time) converts a time string into a datetime object.
109. pd.DatetimeIndex(['2020-02-03', '2020-03-05']) builds a time index.
110. data['2018'] gets the data for 2018 when the table has a DatetimeIndex.
111. data['2018-01'] gets the data for January 2018.
112. data['2018-01-05':'2018-01-15'] gets the data for that period.
113. Tables without a time index are filtered on a datetime column instead:
114. df[df['deal time'] == datetime(2018, 8, 5)]
115. df[df['deal time'] > datetime(2018, 8, 5)]
116. df[(df['deal time'] > datetime(2018, 8, 5)) & (df['deal time'] < datetime(2018, 8, 15))]
117. cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50) computes a time difference (a timedelta).
118. cha.days returns the day part of the difference.
119. cha.seconds returns the seconds part of the difference (the remainder after whole days).
120. cha.seconds / 3600 expresses that remainder in hours.
121. datetime(2018, 5, 21, 19, 50) + timedelta(days=1) shifts a time forward by one day (from datetime import timedelta).
122. datetime(2018, 5, 21, 19, 50) + timedelta(seconds=20) shifts it forward by 20 seconds.
123. datetime(2018, 5, 21, 19, 50) - timedelta(days=1) shifts it back by one day.
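A sketch of the timedelta arithmetic above; the printed values match items 117-120:

    from datetime import datetime, timedelta

    cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
    print(cha.days)            # 3, the whole days
    print(cha.seconds / 3600)  # 2.0, leftover hours beyond the whole days
    print(datetime(2018, 5, 21, 19, 50) + timedelta(days=1))  # shift forward one day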
8. Pivot tables
124. df.groupby('customer category').count() groups and counts.
125. df.groupby('customer category').sum() groups and sums.
126. df.groupby(['customer category', 'region category']).sum() groups by several columns, passed as a list.
127. df.groupby(['customer category', 'region category'])['ID'].sum() sums the ID column after grouping by several columns.
128. df['ID'] pulled out of the DataFrame is a Series, so
129. df['ID'].groupby(df['customer category']).sum() is equivalent to df.groupby('customer category')['ID'].sum().
130. df.groupby('customer category').aggregate(['sum', 'count']) applies several aggregations at once.
131. df.groupby('customer category').aggregate({'ID': 'count', 'sales': 'sum'})
132. aggregate can perform a different aggregation for each column.
133. df.groupby('customer category').sum().reset_index() resets the index to turn the result back into a standard DataFrame.
134. pd.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
135. data is the source DataFrame; values are the columns to aggregate; index and columns set the row and column groupings; aggfunc is the aggregation function; fill_value fills missing values in the result; margins adds total rows and columns; dropna drops columns whose values are all missing; margins_name sets the label of the totals row and column.
136. pd.pivot_table(df, values=['ID', 'sales'], index='customer category', columns='region', aggfunc={'ID': 'count', 'sales': 'sum'}, fill_value=0, margins=True, dropna=True, margins_name='total')
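A sketch of groupby versus pivot_table on toy data (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'customer category': ['A', 'A', 'B'],
                       'region': ['east', 'west', 'east'],
                       'ID': [1, 2, 3],
                       'sales': [100, 200, 300]})
    print(df.groupby('customer category').aggregate({'ID': 'count', 'sales': 'sum'}))
    print(pd.pivot_table(df, values='sales', index='customer category',
                         columns='region', aggfunc='sum',
                         fill_value=0, margins=True, margins_name='total'))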
9. Multi-table merging
137. pd.merge(df1, df2) merges the common columns of two tables automatically by default.
138. pd.merge(df1, df2, on='student number') specifies the join column with on.
139. pd.merge(df1, df2, on=['ID', 'name']) joins on several columns.
140. pd.merge(df1, df2, left_on='student number', right_on='number') joins key columns that have different names in the two tables.
141. pd.merge(df1, df2, left_index=True, right_index=True) joins on the row indexes of both tables (left_index and right_index are booleans).
142. pd.merge(df1, df2, left_index=True, right_on='number') joins the left table's index to the right table's column.
143. pd.merge(df1, df2, on='ID', how='inner') keeps only keys present in both tables.
144. pd.merge(df1, df2, on='ID', how='left') keeps all keys from the left table.
145. pd.merge(df1, df2, on='ID', how='right') keeps all keys from the right table.
146. pd.merge(df1, df2, on='ID', how='outer') keeps the keys from both tables.
147. pd.concat([df1, df2]) stacks two tables vertically.
148. pd.concat([df1, df2], ignore_index=True) renumbers the row index after stacking.
149. pd.concat([df1, df2], ignore_index=True).drop_duplicates() deletes duplicate rows after concatenation.
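A sketch of merge and concat on two toy tables:

    import pandas as pd

    df1 = pd.DataFrame({'ID': [1, 2], 'name': ['Ann', 'Bob']})
    df2 = pd.DataFrame({'ID': [2, 3], 'sales': [200, 300]})
    print(pd.merge(df1, df2, on='ID', how='outer'))  # keep keys from both sides
    print(pd.concat([df1, df1], ignore_index=True).drop_duplicates())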
10. Exporting files
150. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx') exports to the .xlsx format with the to_excel method; excel_writer takes the target path.
151. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document') names the sheet.
152. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False) drops the row index on export.
153. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'volume', 'name']) exports only the chosen columns.
154. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8') sets the export encoding.
155. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8', na_rep=0) fills missing values with 0 on export.
156. writer = pd.ExcelWriter(excelPath, engine='xlsxwriter') creates a writer that can hold several sheets.
157. df1.to_excel(writer, sheet_name='table 1')
158. df2.to_excel(writer, sheet_name='table 2')
159. writer.save() writes both sheets into one file.
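A compact sketch of the multi-sheet export; the output path is hypothetical, and engine='xlsxwriter' requires the xlsxwriter package:

    import pandas as pd

    df1 = pd.DataFrame({'ID': [1, 2]})
    df2 = pd.DataFrame({'ID': [3, 4]})
    with pd.ExcelWriter(r'C:\Users\me\Desktop\test.xlsx',  # hypothetical path
                        engine='xlsxwriter') as writer:
        df1.to_excel(writer, sheet_name='table 1', index=False)
        df2.to_excel(writer, sheet_name='table 2', index=False)
    # the context manager saves on exit (same effect as writer.save())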