The Pandas library is designed for data analysis and is an important factor in making Python a powerful and efficient data analysis environment.
1. Pandas data structures
1. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. s1 = pd.Series(['a', 'b', 'c']). A Series is a data structure made up of a set of data and a set of row indexes.
3. pd.Series(['a', 'b', 'c'], index=[1, 3, 4]) creates a Series with a custom index.
4. s1 = pd.Series({1: 'a', 2: 'b', 3: 'c'}) creates a Series from a dictionary; the keys become the index.
5. s1.index returns the index (an attribute, not a method).
6. s1.values returns the values.
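A minimal sketch tying the Series basics together (the values and custom index are illustrative):

    import pandas as pd

    s1 = pd.Series(['a', 'b', 'c'], index=[1, 3, 4])  # custom row index
    print(s1.index)   # the index: [1, 3, 4]
    print(s1.values)  # the values: ['a' 'b' 'c']
    print(s1[3])      # 'b', since with an integer index [] looks up by label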
7. df = pd.DataFrame(['a', 'b', 'c']). A DataFrame is a data structure made up of a group of data and two sets of indexes (row and column).
8. df = pd.DataFrame([['a', 'A'], ['b', 'B'], ['c', 'C']], columns=['lowercase', 'uppercase'], index=[1, 2, 3])
columns is the column index and index is the row index.
9. pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspider installs a package through the Tsinghua mirror.
10. data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']} turns a dictionary into a table:
df = pd.DataFrame(data)
11. df.index and df.columns return the row index and the column index.
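The two DataFrame constructors above produce the same kind of table; a quick sketch with illustrative values:

    import pandas as pd

    data = {'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']}
    df = pd.DataFrame(data, index=[1, 2, 3])
    print(df.columns)  # column index: ['lowercase', 'uppercase']
    print(df.index)    # row index: [1, 2, 3]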
2. Reading data
12. df = pd.read_excel(r'C:\Users\...xlsx', sheet_name='Sheet1')
pd.read_excel(r'C:\Users\...xlsx', sheet_name=0) also works; sheet_name takes a sheet name or a position. This reads an Excel table.
13. pd.read_excel(r'C:\Users\...xlsx', index_col=0, header=0)
index_col specifies which column becomes the row index, and header specifies which row becomes the column index.
14. pd.read_excel(r'C:\Users\...xlsx', usecols=[0, 1]) imports only the specified columns (used here without index_col or header).
15. pd.read_table(r'C:\Users\...txt', sep=' ') imports a TXT file; sep specifies the delimiter.
16. df.head(2) displays the first two rows; by default head() displays the first five.
17. df.shape returns the number of rows and columns, not counting the row and column indexes.
18. df.info() displays the data type of each column in the table.
19. df.describe() returns distribution statistics (count, mean, standard deviation, quantiles, etc.) for the numeric columns in a table.
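A sketch of a typical import and first inspection; the path, sheet and file contents are hypothetical:

    import pandas as pd

    df = pd.read_excel(r'C:\Users\me\Desktop\sales.xlsx',  # hypothetical path
                       sheet_name=0,   # first sheet
                       index_col=0,    # first column becomes the row index
                       header=0)       # first row becomes the column index
    print(df.head(2))     # first two rows
    print(df.shape)       # (rows, columns)
    df.info()             # dtypes and non-null counts per column
    print(df.describe())  # summary statistics for numeric columns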
3. Data preprocessing
20. df.info() also shows which columns contain null values.
21. df.isnull() flags missing values, returning True where a value is missing and False otherwise.
22. df.dropna() deletes rows containing missing values by default.
23. df.dropna(how='all') drops only rows in which every value is null.
24. df.fillna(0) fills all null values with 0.
25. df.fillna({'gender': 'male', 'age': 30}) fills nulls in the gender column with 'male' and nulls in the age column with 30.
26. df.drop_duplicates() checks all columns for duplicate rows by default and keeps the first occurrence.
27. df.drop_duplicates(subset='gender') checks for duplicates only in the gender column and keeps the first occurrence.
28. df.drop_duplicates(subset=['gender', 'company'], keep='last') checks for duplicates across the gender and company columns.
keep defaults to 'first'; it can be set to 'last' or to False, which drops every duplicated row.
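A sketch of the missing-value and duplicate handling above, on toy data (column names are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'gender': ['male', None, 'male'],
                       'age': [25, 30, np.nan],
                       'company': ['A', 'B', 'A']})
    df = df.fillna({'gender': 'male', 'age': 30})          # per-column fill values
    df = df.drop_duplicates(subset=['gender', 'company'],  # dedupe on two columns
                            keep='last')                   # keep the last occurrence
    print(df)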
29. df['ID'].dtype queries the data type of the ID column.
30. df['ID'].astype('float') converts the ID column to float.
31. Data types include int, float, object, string, unicode and datetime.
32. df['ID'][1] selects the second value in the ID column.
33. df.columns = ['uppercase', 'lowercase', 'Chinese'] adds a column index to a table that has none.
34. df.index = [1, 2, 3] sets the row index the same way.
35. df.set_index('number') specifies an existing column to use as the row index.
36. df.rename(index={'order number': 'new order number', 'customer name': 'new customer name'}) renames row index labels.
37. df.rename(columns={1: '1', 2: '2'}) renames columns the same way.
38. df.reset_index() converts all index levels back into columns by default.
39. df.reset_index(level=0) converts only index level 0 into a column.
40. df.reset_index(drop=True) discards the original index instead of turning it into a column.
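A sketch of round-tripping a column through the index (toy column names):

    import pandas as pd

    df = pd.DataFrame({'number': [101, 102], 'name': ['Ann', 'Bob']})
    df = df.set_index('number')        # use the 'number' column as the row index
    df = df.rename(index={101: 201})   # rename one row label
    df = df.reset_index()              # move the index back into a column
    print(df)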
4. Data selection
41. df[['ID', 'name']] selects several columns by name.
42. df.iloc[[1, 3], [2, 4]] selects data by row and column position.
43. df.iloc[1, 1] selects the value in the second data row and second column; counting the file's header row that is the third row, since the first row becomes the column index by default.
44. df.iloc[:, 0:4] gets the values of columns 1 through 4.
45. df.loc['一'] selects a row by its label; loc returns the row as a Series that can be accessed like a list.
46. df.loc['1'][0] or df.loc['1']['2'] picks a single value out of a selected row, by column position or by column label.
47. df.iloc[1] selects the second row by position.
48. df.iloc[[1, 3]] selects rows 2 and 4 by passing a list of positions.
49. df.iloc[1:3] selects rows 2 and 3 (the end of a positional slice is excluded).
50. df[df['age'] < 45] selects the rows where age is less than 45.
51. df[(df['age'] < 45) & (df['ID'] < 4)] combines conditions with & (and) and | (or).
52. df.loc[['一', '2'], ['age', 'ID']] and df.iloc[[1, 3], [2, 4]] make the same kind of selection: loc works by name, iloc by number.
53. df[df['age'] < 45][['age', 'ID']] first filters rows by the age condition, then selects the named columns.
54. df.iloc[1:3, 2:4] slices by both row and column position.
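A sketch contrasting the selection styles above (toy table):

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2, 3],
                       'name': ['Ann', 'Bob', 'Cai'],
                       'age': [30, 50, 40]})
    print(df[['ID', 'name']])                     # columns by name
    print(df.iloc[0:2, 0:2])                      # rows/columns by position (end-exclusive)
    print(df.loc[0, 'age'])                       # a single value by label
    print(df[(df['age'] < 45) & (df['ID'] < 3)])  # boolean filtering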
5. Numerical operations
55. df['age'].replace(100, 33) replaces 100 in the age column with 33.
56. df.replace(np.nan, 0) is equivalent to fillna(0); np.nan is the default representation of a missing value in Python.
57. df.replace([A, B], C) is a many-to-one replacement: both A and B are replaced with C.
58. df.replace({'A': 'a', 'B': 'b', 'C': 'c'}) is a many-to-many replacement.
59. df.sort_values(by=['order number'], ascending=False) sorts by the order number column in descending order.
60. df.sort_values(by=['order number'], na_position='first') puts missing values first when sorting;
missing values are placed last by default.
61. df.sort_values(by=['col1', 'col2'], ascending=[False, True]) sorts by several columns, each with its own direction.
62. df['sales'].rank(method='first') ranks the sales column; method='first' breaks ties by order of appearance.
63. df.drop(['sales', 'ID'], axis=1) drops the sales and ID columns by name.
64. df.drop(df.columns[[4, 5]], axis=1) drops columns by position.
65. df.drop(columns=['sales', 'ID']) drops columns with the columns keyword.
66. df.drop(['a', 'b'], axis=0) drops the rows labeled 'a' and 'b'.
67. df.drop(df.index[[4, 5]], axis=0) drops rows by position.
68. df.drop(index=['a', 'b']) drops rows with the index keyword.
69. df['ID'].value_counts() counts the occurrences of each value in the ID column.
70. df['ID'].value_counts(normalize=True, sort=False) returns proportions instead of counts, without sorting.
71. df['ID'].unique() returns the distinct values in the column.
72. df['age'].isin(['a', 11]) checks whether each value in the column is 'a' or 11.
73. pd.cut(df['ID'], bins=[70, ...]) cuts the ID column into bins at the given cut points.
74. pd.qcut(df['ID'], 3) cuts the ID column into 3 parts with as equal a number of values in each part as possible.
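A sketch contrasting cut (explicit bin edges) and qcut (equal-sized bins) on illustrative values:

    import pandas as pd

    s = pd.Series([1, 5, 7, 30, 60, 95])
    print(pd.cut(s, bins=[0, 10, 50, 100]))  # bins at explicit edges
    print(pd.qcut(s, 3))                     # 3 bins with roughly equal counts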
75. df.insert(2, 'merchandise', ['book', 'pen', 'calculator']) inserts a new column at position 2.
76. df['merchandise'] = ['book', 'pen', 'calculator'] inserts a new column at the end of the table.
77. df.T swaps rows and columns (transpose).
78. df.stack() converts table data into tree-shaped (long) data.
79. df.set_index(['ID', 'name']).stack().reset_index()
converts to tree data and resets the row index.
80. df.melt(id_vars=['ID', 'name'], var_name='year', value_name='sale'): id_vars names the columns to keep as identifiers;
var_name is the name of the new column that receives the old column index, and value_name is the name of the new column that receives the values.
81. df['C1'].apply(lambda x: x + 1) applies a function to every value in the C1 column.
82. df.applymap(lambda x: x + 1) performs the same operation on every value in the table.
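A sketch of melt plus an element-wise transform on a toy wide table:

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2], 'name': ['Ann', 'Bob'],
                       '2017': [10, 20], '2018': [30, 40]})
    long = df.melt(id_vars=['ID', 'name'], var_name='year', value_name='sale')
    print(long)                                            # one row per (ID, year)
    print(df[['2017', '2018']].applymap(lambda x: x + 1))  # element-wise +1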
6. Data operations
83. df['ID'] + df['ID'] adds two columns element by element; the other arithmetic operators work the same way.
84. df['ID'] > df['ID'] compares two columns element by element; >, <, == and the rest work the same way.
85. df.count() counts the number of non-null values in each column.
86. df.count(axis=1) counts the number of non-null values in each row.
87. df['ID'].count() counts the non-null values in the specified column.
88. df.sum() sums each column; df.sum(axis=1) sums each row.
89. df.mean(axis=1) computes the mean of each row (of each column without axis=1).
90. df.max(axis=1) finds the maximum of each row.
91. df.min(axis=1) finds the minimum of each row.
92. df.median(axis=1) finds the median of each row.
93. df.mode(axis=1) finds the most frequent value in each row.
94. df.var(axis=1) computes the variance of each row.
95. df.std(axis=1) computes the standard deviation of each row.
96. df.quantile(0.25) finds the 1/4 quantile; 0.5, 0.75 or any other quantile works the same way.
97. df.corr() computes the correlations between the columns of the DataFrame.
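A sketch of the summary statistics on a toy table:

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 2, 3, 4], 'sales': [10, 20, 20, 40]})
    print(df.count())          # non-null values per column
    print(df.mean())           # mean of each column
    print(df.quantile(0.25))   # first quartile of each numeric column
    print(df.corr())           # pairwise correlation between columns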
7. Time series
98. from datetime import datetime
99. datetime.now() returns the current time.
100. datetime.now().year returns the year; .month and .day work the same way.
101. datetime.now().weekday() + 1 returns the day of the week (weekday() itself counts Monday as 0).
102. datetime.now().isocalendar() returns the ISO (year, week, weekday) triple;
103. for example, (2018, 41, 7) means week 41, day 7 of 2018.
104. datetime.now().date() returns only the date part.
105. datetime.now().time() returns only the time part.
106. datetime.now().strftime('%Y-%m-%d %H:%M:%S') returns a formatted string such as 2020-03-13 09:09:12.
107. from dateutil.parser import parse
108. parse(str_time) converts a time string into a datetime object.
109. pd.DatetimeIndex(['2020-02-03', '2020-03-05']) builds a time index.
110. data['2018'] gets the data for 2018 when the table has a DatetimeIndex.
111. data['2018-01'] gets the data for January 2018.
112. data['2018-01-05':'2018-01-15'] gets the data for that period.
113. Tables without a time index are filtered on a datetime column instead:
114. df[df['deal time'] == datetime(2018, 8, 5)]
115. df[df['deal time'] > datetime(2018, 8, 5)]
116. df[(df['deal time'] > datetime(2018, 8, 5)) & (df['deal time'] < datetime(2018, 8, 15))]
117. cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50) computes a time difference (a timedelta).
118. cha.days returns the day part of the difference.
119. cha.seconds returns the seconds part of the difference (the remainder after whole days).
120. cha.seconds / 3600 expresses that remainder in hours.
121. datetime(2018, 5, 21, 19, 50) + timedelta(days=1) shifts a time forward by one day (from datetime import timedelta).
122. datetime(2018, 5, 21, 19, 50) + timedelta(seconds=20) shifts it forward by 20 seconds.
123. datetime(2018, 5, 21, 19, 50) - timedelta(days=1) shifts it back by one day.
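A sketch of the timedelta arithmetic above; the printed values match items 117-120:

    from datetime import datetime, timedelta

    cha = datetime(2018, 5, 21, 19, 50) - datetime(2018, 5, 18, 17, 50)
    print(cha.days)            # 3, the whole days
    print(cha.seconds / 3600)  # 2.0, leftover hours beyond the whole days
    print(datetime(2018, 5, 21, 19, 50) + timedelta(days=1))  # shift forward one day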
8. Pivot tables
124. df.groupby('customer category').count() groups and counts.
125. df.groupby('customer category').sum() groups and sums.
126. df.groupby(['customer category', 'region category']).sum() groups by several columns, passed as a list.
127. df.groupby(['customer category', 'region category'])['ID'].sum() sums the ID column after grouping by several columns.
128. df['ID'] pulled out of the DataFrame is a Series, so
129. df['ID'].groupby(df['customer category']).sum() is equivalent to df.groupby('customer category')['ID'].sum().
130. df.groupby('customer category').aggregate(['sum', 'count']) applies several aggregations at once.
131. df.groupby('customer category').aggregate({'ID': 'count', 'sales': 'sum'})
132. aggregate can perform a different aggregation for each column.
133. df.groupby('customer category').sum().reset_index() resets the index to turn the result back into a standard DataFrame.
134. pd.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
135. data is the source DataFrame; values are the columns to aggregate; index and columns set the row and column groupings; aggfunc is the aggregation function; fill_value fills missing values in the result; margins adds total rows and columns; dropna drops columns whose values are all missing; margins_name sets the label of the totals row and column.
136. pd.pivot_table(df, values=['ID', 'sales'], index='customer category', columns='region', aggfunc={'ID': 'count', 'sales': 'sum'}, fill_value=0, margins=True, dropna=True, margins_name='total')
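A sketch of groupby versus pivot_table on toy data (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({'customer category': ['A', 'A', 'B'],
                       'region': ['east', 'west', 'east'],
                       'ID': [1, 2, 3],
                       'sales': [100, 200, 300]})
    print(df.groupby('customer category').aggregate({'ID': 'count', 'sales': 'sum'}))
    print(pd.pivot_table(df, values='sales', index='customer category',
                         columns='region', aggfunc='sum',
                         fill_value=0, margins=True, margins_name='total'))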
9. Multi-table merging
137. pd.merge(df1, df2) merges the common columns of two tables automatically by default.
138. pd.merge(df1, df2, on='student number') specifies the join column with on.
139. pd.merge(df1, df2, on=['ID', 'name']) joins on several columns.
140. pd.merge(df1, df2, left_on='student number', right_on='number') joins key columns that have different names in the two tables.
141. pd.merge(df1, df2, left_index=True, right_index=True) joins on the row indexes of both tables (left_index and right_index are booleans).
142. pd.merge(df1, df2, left_index=True, right_on='number') joins the left table's index to the right table's column.
143. pd.merge(df1, df2, on='ID', how='inner') keeps only keys present in both tables.
144. pd.merge(df1, df2, on='ID', how='left') keeps all keys from the left table.
145. pd.merge(df1, df2, on='ID', how='right') keeps all keys from the right table.
146. pd.merge(df1, df2, on='ID', how='outer') keeps the keys from both tables.
147. pd.concat([df1, df2]) stacks two tables vertically.
148. pd.concat([df1, df2], ignore_index=True) renumbers the row index after stacking.
149. pd.concat([df1, df2], ignore_index=True).drop_duplicates() deletes duplicate rows after concatenation.
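A sketch of merge and concat on two toy tables:

    import pandas as pd

    df1 = pd.DataFrame({'ID': [1, 2], 'name': ['Ann', 'Bob']})
    df2 = pd.DataFrame({'ID': [2, 3], 'sales': [200, 300]})
    print(pd.merge(df1, df2, on='ID', how='outer'))  # keep keys from both sides
    print(pd.concat([df1, df1], ignore_index=True).drop_duplicates())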
10. Exporting files
150. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx') exports to the .xlsx format with the to_excel method; excel_writer takes the target path.
151. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document') names the sheet.
152. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False) drops the row index on export.
153. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'volume', 'name']) exports only the chosen columns.
154. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8') sets the export encoding.
155. df.to_excel(excel_writer=r'C:\Users\zhoulifu\Desktop\test.xlsx', sheet_name='document', index=False, columns=['ID', 'sales', 'name'], encoding='utf-8', na_rep=0) fills missing values with 0 on export.
156. writer = pd.ExcelWriter(excelPath, engine='xlsxwriter') creates a writer that can hold several sheets.
157. df1.to_excel(writer, sheet_name='table 1')
158. df2.to_excel(writer, sheet_name='table 2')
159. writer.save() writes both sheets into one file.
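A compact sketch of the multi-sheet export; the output path is hypothetical, and engine='xlsxwriter' requires the xlsxwriter package:

    import pandas as pd

    df1 = pd.DataFrame({'ID': [1, 2]})
    df2 = pd.DataFrame({'ID': [3, 4]})
    with pd.ExcelWriter(r'C:\Users\me\Desktop\test.xlsx',  # hypothetical path
                        engine='xlsxwriter') as writer:
        df1.to_excel(writer, sheet_name='table 1', index=False)
        df2.to_excel(writer, sheet_name='table 2', index=False)
    # the context manager saves on exit (same effect as writer.save())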