2.1.1 concat

  • Pandas top-level function: pd.concat()

  • Joins vertically (default, axis=0 or axis='index') or horizontally (axis=1 or axis='columns')

  • Can concatenate multiple objects at once (a list of DataFrames)

  • Vertical concatenation: N DataFrames stacked from top to bottom

    • Default outer join (join='outer'): data under the same column name is merged into one column, and columns that exist in only some of the DataFrames are filled with NaN
    • Inner join (join='inner') keeps only the columns common to all DataFrames
  • Horizontal concatenation: N DataFrames placed side by side from left to right

    • Default outer join (join='outer'): rows are aligned on each DataFrame's index, and missing positions are filled with NaN
    • Inner join (join='inner') keeps only the rows whose index appears in every DataFrame
  • API usage

pd.concat([df1, df2, df3, ...], axis=, join='', ignore_index=)
# The DataFrames are passed in as a list
# axis: default 0 or 'index' for vertical concatenation; 1 or 'columns' for horizontal concatenation
# join: default 'outer'; 'inner' keeps only the shared columns/rows
# ignore_index: default False, keeps each DataFrame's own index; True resets the index to 0, 1, 2, 3, 4, ...
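A minimal runnable sketch, using two made-up DataFrames, to show the vertical/horizontal and outer/inner combinations:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})  # made-up data
df2 = pd.DataFrame({'name': ['c'], 'rank': [3]})

pd.concat([df1, df2])                # vertical outer: columns name, score, rank; gaps are NaN
pd.concat([df1, df2], join='inner')  # vertical inner: only the shared 'name' column survives
pd.concat([df1, df2], axis=1)        # horizontal: rows aligned on the index, gaps are NaN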

2.1.2 append

  • Vertical concatenation only
  • Appends an object to an existing DataFrame (the object can be a DataFrame or a dict; appending a dict requires ignore_index=True)
  • API usage
df1.append(df2, ignore_index=)
# ignore_index: default False, keeps each DataFrame's own index; True resets the index to 0, 1, 2, 3, 4, ...

# Append a dict object; ignore_index must be True, otherwise an error is raised
df1.append(dict, ignore_index=True)
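A short sketch with made-up rows. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement:

import pandas as pd

df1 = pd.DataFrame({'name': ['a'], 'score': [1]})  # made-up data
df2 = pd.DataFrame({'name': ['b'], 'score': [2]})

df1.append(df2, ignore_index=True)                        # pandas < 2.0
df1.append({'name': 'c', 'score': 3}, ignore_index=True)  # dict needs ignore_index=True
pd.concat([df1, df2], ignore_index=True)                  # equivalent in pandas >= 2.0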

2.1.3 merge

  • DataFrame method, called as df1.merge(df2); also available as pd.merge(df1, df2)
  • Joins exactly two DataFrames horizontally, similar to a MySQL join
  • Merges df1 and df2 horizontally on columns whose values are equal and returns a new df3, which can in turn be merged with df4
  • The default is an inner join (can also be set to left join, outer join, or right join)
  • API usage
pd.merge(df1, df2, on='', how='')
df1.merge(df2[['column 1', 'column 2', ...]], on='', how='')
# on='column name': which column to base the merge on
# how='fixed value': how to handle the rows after merging; the accepted values are:
# how='left'   keep all rows of the left table df1
# how='right'  keep all rows of the right table df2
# how='outer'  keep all rows of both df1 and df2
# how='inner'  keep only the rows matched in both df1 and df2 (default)
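A runnable sketch with made-up tables that share a 'name' key:

import pandas as pd

left = pd.DataFrame({'name': ['a', 'b'], 'age': [20, 30]})    # made-up data
right = pd.DataFrame({'name': ['b', 'c'], 'city': ['x', 'y']})

left.merge(right, on='name')               # inner (default): only 'b' matches
left.merge(right, on='name', how='left')   # all of left; 'a' gets city NaN
left.merge(right, on='name', how='outer')  # 'a', 'b', 'c' -> 3 rows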

2.1.4 join

  • DataFrame method, called as df1.join(df2)
  • Horizontal merging only. Data can be merged on the row indexes of both DataFrames, or on a column of one DataFrame and the row index of the other
  • The default is a left join (can also be set to inner join, outer join, or right join)
  • API usage
# Merge on the row indexes of both DataFrames
df1.join(df2, lsuffix='', rsuffix='', how='')
# lsuffix / rsuffix: suffixes added to overlapping column names from df1 / df2
# Merge a column of df1 with the row index of df2
# on='column name': the column of df1 used as the join key
df1.join(df2, lsuffix='', rsuffix='', how='', on='')
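A runnable sketch with made-up index-aligned DataFrames:

import pandas as pd

left = pd.DataFrame({'age': [20, 30]}, index=['a', 'b'])     # made-up data
right = pd.DataFrame({'city': ['x', 'y']}, index=['b', 'c'])

left.join(right)               # default left join: 'a' gets city NaN
left.join(right, how='inner')  # only index 'b' survives

staff = pd.DataFrame({'name': ['b', 'c'], 'age': [30, 40]})
staff.join(right, on='name')   # column of one DataFrame vs. row index of the other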

2.2 Missing data processing

2.2.1 NaN introduction and loading

  • A missing value carries no meaning, unlike other data. NaN is not equal to 0, not equal to the empty string '', and two NaN values are not equal to each other
  • Detect whether a value is missing
# Return True if the value is missing
pd.isnull()
pd.isna()
# Return False if the value is missing
pd.notnull()
pd.notna()
  • Load data that contains missing values
pd.read_csv('path', na_values=, keep_default_na=)
# na_values: extra strings to recognize as missing values
# keep_default_na: default True; False disables pandas' default set of missing-value markers
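A minimal sketch, loading a tiny in-memory CSV (made up for illustration) where 'missing' marks absent values:

import io
import pandas as pd

csv_text = 'name,score\na,1\nb,missing\n'  # made-up CSV
df = pd.read_csv(io.StringIO(csv_text), na_values=['missing'])

df['score'].isna()        # 0: False, 1: True
df['score'].isna().sum()  # -> 1 missing value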

Causes of missing values

  • The original data contains missing values
  • Missing values are generated during data merge (join)

2.2.2 Viewing missing values

  • pip install missingno
import missingno as msno
msno.bar(df)     # bar chart of the non-missing count per column
msno.matrix(df)  # matrix plot showing where the missing values fall

2.2.3 Missing Value processing

  • Delete missing values

    • Deleting missing values loses information and is not recommended. It is worth trying only when the proportion of missing data is relatively low
    • Drop by row
df.dropna(axis=0, how='any', subset=['column name', ...], inplace=True, thresh=n)
# axis: 0 or 'index' drops rows that contain missing values;
#       1 or 'columns' drops columns that contain missing values; default 0
# how: 'any' drops the row/column if any NA value is present (default);
#      'all' drops the row/column only if all of its values are NA
# subset: a list of column names to check; not used together with thresh
# thresh=n: optional, int; keep the row only if it still has at least n non-NaN values
# inplace: default False; True modifies df in place
  • Drop by column (a combined sketch of dropna and drop follows the code below)
df.drop(['column name', ...], axis=1, inplace=)
# axis=0 drops by row; axis=1 drops by column
# inplace: default False, returns a new DataFrame; True modifies df in place
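A runnable sketch of both styles on a made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', None],
                   'score': [1.0, np.nan, 3.0]})  # made-up data

df.dropna()                  # drop every row that contains any NaN
df.dropna(subset=['score'])  # drop rows only where 'score' is NaN
df.dropna(thresh=2)          # keep rows with at least 2 non-NaN values
df.drop(['score'], axis=1)   # drop the 'score' column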
  • Fill missing values

    • For non-time-series data, fill missing values with a constant or with a statistic of the column
# Fill with a constant
df.fillna(0, inplace=True)
# Fill with a statistic of the column that has missing values (mean, median, mode, ...)
df['column name'].fillna(df['column name'].mean(), inplace=True)  # or median(), mode()[0], ...
    • Handling missing values in time-series data

      • Fill a missing value with the previous non-null value in the time series
df.fillna(method='ffill',inplace=True)
    • Fill a missing value with the next non-null value in the time series
df.fillna(method='bfill',inplace=True)
  • Linear interpolation method

    • For most time-series data, column values change over time, so bfill and ffill are not the optimal way to fill missing values.
    • Linear interpolation is a missing-value imputation technique that assumes a strictly linear relationship between data points and computes a missing value from the non-missing values of its neighbors.
df.interpolate(limit_direction="both", inplace=True)
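A side-by-side sketch on a made-up daily series with one gap:

import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, 3.0],
               index=pd.date_range('2024-01-01', periods=3))  # made-up data

ts.fillna(method='ffill')  # gap becomes 1.0 (previous value)
ts.fillna(method='bfill')  # gap becomes 3.0 (next value)
ts.interpolate()           # gap becomes 2.0 (linear between neighbors)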

2.3 Data Reshaping [Know]

There are three ways to transform a wide dataset into a long dataset:

  • melt converts one or more specified column names into the values of a column

  • stack converts all column names into the values of a column

    • unstack is the inverse operation of stack
  • wide_to_long splits apart columns whose names share a common prefix

2.3.1 The melt function converts one or more specified column names into the values of a column

pd.melt(
    df,
    id_vars=['column name'],      # columns to keep unchanged
    value_vars=['column name'],   # columns to unpivot: their names become the values of a new column (variable),
                                  # and their data becomes the values of a second new column (value)
    var_name='new column name',   # rename the variable column
    value_name='new column name', # rename the value column
)
# With only id_vars given, the returned df keeps those columns and melts all the others
# With value_vars given, only the listed columns are melted
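A runnable sketch on a made-up wide table, one column per subject:

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'],
                     'math': [90, 80],
                     'english': [70, 60]})  # made-up data

pd.melt(wide, id_vars=['name'], value_vars=['math', 'english'],
        var_name='subject', value_name='score')
# -> 4 rows with columns: name, subject, score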

2.3.2 The stack function converts all column names into the values of a column

s = df.stack()     # returns a Series with a hierarchical (MultiIndex) row index
s.reset_index()    # convert back to a DataFrame
s.rename_axis()    # name the individual row index levels
s.unstack()        # inverse operation of stack, returns a DataFrame
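A short sketch with a made-up wide table:

import pandas as pd

wide = pd.DataFrame({'math': [90, 80], 'english': [70, 60]},
                    index=['a', 'b'])  # made-up data

s = wide.stack()   # MultiIndex Series: (student, subject) -> score
s.reset_index()    # back to a flat DataFrame
s.unstack()        # back to the original wide shape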

2.3.3 The wide_to_long function processes wide data whose column names carry numeric suffixes

pd.wide_to_long(
    df,
    stubnames=['column name prefix'], # extract the columns whose names begin with the given string(s); multiple prefixes are allowed
    i=['column name'],                # column(s) to use as the row index
    j='new column name',              # the remainder of each column name (the numeric suffix) becomes a column with this name
    sep='',                           # separator between the prefix and the suffix inside the column names
)
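A runnable sketch with made-up score columns suffixed by year:

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'],
                     'score2023': [90, 80],
                     'score2024': [95, 85]})  # made-up data

pd.wide_to_long(wide, stubnames=['score'], i=['name'], j='year')
# -> MultiIndex (name, year) with a single 'score' column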

2.4 Pandas data types

2.4.1 What data types does pandas have

  • Pandas is a package built on top of NumPy, so the data types in pandas are implemented on top of NumPy's ndarray type

  • Pandas data structure objects and data types:

    • DataFrame: table

      • Series: column

        • object --> Python str (string)
        • int64 --> Python int (integer)
        • float64 --> Python float (floating point)
        • datetime64 --> Python datetime
        • bool --> Python bool (True / False)
        • timedelta64[ns] --> the difference between two points in time, in nanoseconds
        • category --> categorical data type, e.g. gender divided into male, female, other

2.4.2 Type Conversion

  • astype()

    • Note a special case: when converting a column to a numeric type, astype requires that every value in the column be convertible. If some values are missing but stored as strings rather than NaN (e.g. 'missing', 'null'), the whole column remains a string (object) column instead of a numeric one

      • In this case, the to_numeric function is used
# Convert to string
df['column name'].astype(str)
# Convert to numeric types
df['column name'].astype(int)
df['column name'].astype(float)
  • to_numeric()

    • Converts the data in a Series object to a numeric type, such as int or float
pd.to_numeric(df['column name'], errors='', downcast='')
  • Parameter description

    • errors: determines what the function does when it encounters a value that cannot be converted

      • 'raise' (default): to_numeric raises an error when it encounters a value that cannot be converted
      • 'coerce': values that cannot be converted are replaced with NaN
      • 'ignore': values that cannot be converted are left unchanged, and nothing is done
    • downcast: reduces the amount of memory used to store the data

      • The default is None; the accepted values include 'integer' and 'float'
      • Take downcast='float' as an example: float has the subtypes float16, float32, and float64, so setting downcast='float' converts the data to the smallest of these subtypes that can hold the values (such as float16), which stores a floating-point number in fewer bytes
    • Note: the downcast and errors parameters operate independently; if an error occurs during downcasting, an exception is raised even when errors is set to ignore (a short sketch of both parameters follows)
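A sketch of both parameters on made-up values:

import pandas as pd

s = pd.Series(['1', '2', 'missing'])  # made-up data with one bad value

pd.to_numeric(s, errors='coerce')                          # -> 1.0, 2.0, NaN
pd.to_numeric(pd.Series(['1', '2']), downcast='integer')   # -> dtype int8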

2.4.3 The category data type

  • What is the category type

    • A special data type consisting of a fixed, limited set of values, e.g. gender divided into male, female, and undisclosed. A column can be converted to it with the astype function
    • The categories in category-type data can be ordered
  • Create data of category type

  • Method 1

s = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "d"],        # the Series data
        categories=["c", "b", "a"],  # the allowed categories; values not in this list become NaN
    )
)
# The output is as follows:
0      a
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['c', 'b', 'a']
  • Method 2
series_cat = pd.Series(['B', 'D', 'C', 'A'], dtype='category')  # specify the category type via dtype
series_cat
# The output is as follows:
0    B
1    D
2    C
3    A
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
  • Specify the order of the categories via CategoricalDtype
from pandas.api.types import CategoricalDtype
# Construct an ordered category data type
cat = CategoricalDtype(categories=['B', 'D', 'A', 'C'], ordered=True)
# Apply the category data type to series_cat
series_cat = series_cat.astype(cat)
  • Modify the category order via the .cat.reorder_categories() method of a category-type Series
series_cat.cat.reorder_categories(
    ['D', 'B', 'C', 'A'],  # the new category order
    ordered=True,          # treat the categories as ordered
    inplace=True,          # modify series_cat in place
)
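A short sketch showing why the order matters (made-up sizes):

import pandas as pd

sizes = pd.Series(['M', 'S', 'L'], dtype='category')
sizes = sizes.cat.reorder_categories(['S', 'M', 'L'], ordered=True)

sizes.sort_values()  # S, M, L: sorting follows the category order
sizes > 'S'          # comparisons use the category order as well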