2.1.1 concat

  • Pandas top-level function: pd.concat()

  • Joins vertically (default, axis=0 or axis='index') or horizontally (axis=1 or axis='columns')

  • Can concatenate multiple objects at once (a list of DataFrames)

  • Vertical concatenation: N DataFrames stacked from top to bottom

    • Default outer join (join='outer'): data under the same column name is merged into one column, and columns that exist in only some of the DataFrames are filled with NaN
    • Inner join (join='inner') keeps only the columns common to all DataFrames
  • Horizontal concatenation: N DataFrames placed side by side from left to right

    • Default outer join (join='outer'): rows are aligned on each DataFrame's index, and missing positions are filled with NaN
    • Inner join (join='inner') keeps only the rows whose index appears in every DataFrame
  • API usage

pd.concat([df1, df2, df3, ...], axis=, join='', ignore_index=)
# The DataFrames are passed in as a list
# axis: default 0 or 'index' for vertical concatenation; 1 or 'columns' for horizontal concatenation
# join: default 'outer'; 'inner' keeps only the shared columns/rows
# ignore_index: default False, keeps each DataFrame's own index; True resets the index to 0, 1, 2, 3, 4, ...
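A minimal runnable sketch, using two made-up DataFrames, to show the vertical/horizontal and outer/inner combinations:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})  # made-up data
df2 = pd.DataFrame({'name': ['c'], 'rank': [3]})

pd.concat([df1, df2])                # vertical outer: columns name, score, rank; gaps are NaN
pd.concat([df1, df2], join='inner')  # vertical inner: only the shared 'name' column survives
pd.concat([df1, df2], axis=1)        # horizontal: rows aligned on the index, gaps are NaN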

2.1.2 append

  • Vertical concatenation only
  • Appends an object to an existing DataFrame (the object can be a DataFrame or a dict; appending a dict requires ignore_index=True)
  • API usage
df1.append(df2, ignore_index=)
# ignore_index: default False, keeps each DataFrame's own index; True resets the index to 0, 1, 2, 3, 4, ...

# Append a dict object; ignore_index must be True, otherwise an error is raised
df1.append(dict, ignore_index=True)
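A short sketch with made-up rows. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement:

import pandas as pd

df1 = pd.DataFrame({'name': ['a'], 'score': [1]})  # made-up data
df2 = pd.DataFrame({'name': ['b'], 'score': [2]})

df1.append(df2, ignore_index=True)                        # pandas < 2.0
df1.append({'name': 'c', 'score': 3}, ignore_index=True)  # dict needs ignore_index=True
pd.concat([df1, df2], ignore_index=True)                  # equivalent in pandas >= 2.0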

2.1.3 merge

  • DataFrame method, called as df1.merge(df2); also available as pd.merge(df1, df2)
  • Joins exactly two DataFrames horizontally, similar to a MySQL join
  • Merges df1 and df2 horizontally on columns whose values are equal and returns a new df3, which can in turn be merged with df4
  • The default is an inner join (can also be set to left join, outer join, or right join)
  • API usage
pd.merge(df1, df2, on='', how='')
df1.merge(df2[['column 1', 'column 2', ...]], on='', how='')
# on='column name': which column to base the merge on
# how='fixed value': how to handle the rows after merging; the accepted values are:
# how='left'   keep all rows of the left table df1
# how='right'  keep all rows of the right table df2
# how='outer'  keep all rows of both df1 and df2
# how='inner'  keep only the rows matched in both df1 and df2 (default)
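A runnable sketch with made-up tables that share a 'name' key:

import pandas as pd

left = pd.DataFrame({'name': ['a', 'b'], 'age': [20, 30]})    # made-up data
right = pd.DataFrame({'name': ['b', 'c'], 'city': ['x', 'y']})

left.merge(right, on='name')               # inner (default): only 'b' matches
left.merge(right, on='name', how='left')   # all of left; 'a' gets city NaN
left.merge(right, on='name', how='outer')  # 'a', 'b', 'c' -> 3 rows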

2.1.4 join

  • DataFrame method, called as df1.join(df2)
  • Horizontal merging only. Data can be merged on the row indexes of both DataFrames, or on a column of one DataFrame and the row index of the other
  • The default is a left join (can also be set to inner join, outer join, or right join)
  • API usage
# Merge on the row indexes of both DataFrames
df1.join(df2, lsuffix='', rsuffix='', how='')
# lsuffix / rsuffix: suffixes added to overlapping column names from df1 / df2
# Merge a column of df1 with the row index of df2
# on='column name': the column of df1 used as the join key
df1.join(df2, lsuffix='', rsuffix='', how='', on='')
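A runnable sketch with made-up index-aligned DataFrames:

import pandas as pd

left = pd.DataFrame({'age': [20, 30]}, index=['a', 'b'])     # made-up data
right = pd.DataFrame({'city': ['x', 'y']}, index=['b', 'c'])

left.join(right)               # default left join: 'a' gets city NaN
left.join(right, how='inner')  # only index 'b' survives

staff = pd.DataFrame({'name': ['b', 'c'], 'age': [30, 40]})
staff.join(right, on='name')   # column of one DataFrame vs. row index of the other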

2.2 Missing data processing

2.2.1 NaN introduction and loading

  • A missing value carries no meaning, unlike other data. NaN is not equal to 0, not equal to the empty string '', and two NaN values are not equal to each other
  • Detect whether a value is missing
# Return True if the value is missing
pd.isnull()
pd.isna()
# Return False if the value is missing
pd.notnull()
pd.notna()
  • Load data that contains missing values
pd.read_csv('path', na_values=, keep_default_na=)
# na_values: extra strings to recognize as missing values
# keep_default_na: default True; False disables pandas' default set of missing-value markers
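A minimal sketch, loading a tiny in-memory CSV (made up for illustration) where 'missing' marks absent values:

import io
import pandas as pd

csv_text = 'name,score\na,1\nb,missing\n'  # made-up CSV
df = pd.read_csv(io.StringIO(csv_text), na_values=['missing'])

df['score'].isna()        # 0: False, 1: True
df['score'].isna().sum()  # -> 1 missing value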

Causes of missing values

  • The original data contains missing values
  • Missing values are generated during data merge (join)

2.2.2 Viewing missing values

  • pip install missingno
import missingno as msno
msno.bar(df)     # bar chart of the non-missing count per column
msno.matrix(df)  # matrix plot showing where the missing values fall

2.2.3 Missing Value processing

  • Delete missing values

    • Deleting missing values loses information and is not recommended. It is worth trying only when the proportion of missing data is relatively low
    • Drop by row
df.dropna(axis=0, how='any', subset=['column name', ...], inplace=True, thresh=n)
# axis: 0 or 'index' drops rows that contain missing values;
#       1 or 'columns' drops columns that contain missing values; default 0
# how: 'any' drops the row/column if any NA value is present (default);
#      'all' drops the row/column only if all of its values are NA
# subset: a list of column names to check; not used together with thresh
# thresh=n: optional, int; keep the row only if it still has at least n non-NaN values
# inplace: default False; True modifies df in place
  • Drop by column (a combined sketch of dropna and drop follows the code below)
df.drop(['column name', ...], axis=1, inplace=)
# axis=0 drops by row; axis=1 drops by column
# inplace: default False, returns a new DataFrame; True modifies df in place
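A runnable sketch of both styles on a made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', None],
                   'score': [1.0, np.nan, 3.0]})  # made-up data

df.dropna()                  # drop every row that contains any NaN
df.dropna(subset=['score'])  # drop rows only where 'score' is NaN
df.dropna(thresh=2)          # keep rows with at least 2 non-NaN values
df.drop(['score'], axis=1)   # drop the 'score' column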
  • Fill missing values

    • For non-time-series data, fill missing values with a constant or with a statistic of the column
# Fill with a constant
df.fillna(0, inplace=True)
# Fill with a statistic of the column that has missing values (mean, median, mode, ...)
df['column name'].fillna(df['column name'].mean(), inplace=True)  # or median(), mode()[0], ...
    • Handling missing values in time-series data

      • Fill a missing value with the previous non-null value in the time series
df.fillna(method='ffill',inplace=True)
    • Fill a missing value with the next non-null value in the time series
df.fillna(method='bfill',inplace=True)
  • Linear interpolation method

    • For most time-series data, column values change over time, so bfill and ffill are not the optimal way to fill missing values.
    • Linear interpolation is a missing-value imputation technique that assumes a strictly linear relationship between data points and computes a missing value from the non-missing values of its neighbors.
df.interpolate(limit_direction="both", inplace=True)
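A side-by-side sketch on a made-up daily series with one gap:

import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, 3.0],
               index=pd.date_range('2024-01-01', periods=3))  # made-up data

ts.fillna(method='ffill')  # gap becomes 1.0 (previous value)
ts.fillna(method='bfill')  # gap becomes 3.0 (next value)
ts.interpolate()           # gap becomes 2.0 (linear between neighbors)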

2.3 Data Reshaping [Know]

There are three ways to transform a wide dataset into a long dataset:

  • melt converts one or more specified column names into the values of a column

  • stack converts all column names into the values of a column

    • unstack is the inverse operation of stack
  • wide_to_long splits apart columns whose names share a common prefix

2.3.1 The melt function converts one or more specified column names into the values of a column

pd.melt(
    df,
    id_vars=['column name'],      # columns to keep unchanged
    value_vars=['column name'],   # columns to unpivot: their names become the values of a new column (variable),
                                  # and their data becomes the values of a second new column (value)
    var_name='new column name',   # rename the variable column
    value_name='new column name', # rename the value column
)
# With only id_vars given, the returned df keeps those columns and melts all the others
# With value_vars given, only the listed columns are melted
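A runnable sketch on a made-up wide table, one column per subject:

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'],
                     'math': [90, 80],
                     'english': [70, 60]})  # made-up data

pd.melt(wide, id_vars=['name'], value_vars=['math', 'english'],
        var_name='subject', value_name='score')
# -> 4 rows with columns: name, subject, score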

2.3.2 The stack function converts all column names into the values of a column

s = df.stack()     # returns a Series with a hierarchical (MultiIndex) row index
s.reset_index()    # convert back to a DataFrame
s.rename_axis()    # name the individual row index levels
s.unstack()        # inverse operation of stack, returns a DataFrame
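A short sketch with a made-up wide table:

import pandas as pd

wide = pd.DataFrame({'math': [90, 80], 'english': [70, 60]},
                    index=['a', 'b'])  # made-up data

s = wide.stack()   # MultiIndex Series: (student, subject) -> score
s.reset_index()    # back to a flat DataFrame
s.unstack()        # back to the original wide shape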

2.3.3 The wide_to_long function processes wide data whose column names carry numeric suffixes

pd.wide_to_long(
    df,
    stubnames=['column name prefix'], # extract the columns whose names begin with the given string(s); multiple prefixes are allowed
    i=['column name'],                # column(s) to use as the row index
    j='new column name',              # the remainder of each column name (the numeric suffix) becomes a column with this name
    sep='',                           # separator between the prefix and the suffix inside the column names
)
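A runnable sketch with made-up score columns suffixed by year:

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'],
                     'score2023': [90, 80],
                     'score2024': [95, 85]})  # made-up data

pd.wide_to_long(wide, stubnames=['score'], i=['name'], j='year')
# -> MultiIndex (name, year) with a single 'score' column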

2.4 Pandas data types

2.4.1 What data types does pandas have

  • Pandas is a package built on top of NumPy, so the data types in pandas are implemented on top of NumPy's ndarray type

  • Pandas data structure objects and data types:

    • DataFrame: table

      • Series: column

        • object --> Python str (string)
        • int64 --> Python int (integer)
        • float64 --> Python float (floating point)
        • datetime64 --> Python datetime
        • bool --> Python bool (True / False)
        • timedelta64[ns] --> the difference between two points in time, in nanoseconds
        • category --> categorical data type, e.g. gender divided into male, female, other

2.4.2 Type Conversion

  • astype()

    • Note a special case: when converting a column to a numeric type, astype requires that every value in the column be convertible. If some values are missing but stored as strings rather than NaN (e.g. 'missing', 'null'), the whole column remains a string (object) column instead of a numeric one

      • In this case, the to_numeric function is used
# Convert to string
df['column name'].astype(str)
# Convert to numeric types
df['column name'].astype(int)
df['column name'].astype(float)
  • to_numeric()

    • Converts the data in a Series object to a numeric type, such as int or float
pd.to_numeric(df['column name'], errors='', downcast='')
  • Parameter description

    • errors: determines what the function does when it encounters a value that cannot be converted

      • 'raise' (default): to_numeric raises an error when it encounters a value that cannot be converted
      • 'coerce': values that cannot be converted are replaced with NaN
      • 'ignore': values that cannot be converted are left unchanged, and nothing is done
    • downcast: reduces the amount of memory used to store the data

      • The default is None; the accepted values include 'integer' and 'float'
      • Take downcast='float' as an example: float has the subtypes float16, float32, and float64, so setting downcast='float' converts the data to the smallest of these subtypes that can hold the values (such as float16), which stores a floating-point number in fewer bytes
    • Note: the downcast and errors parameters operate independently; if an error occurs during downcasting, an exception is raised even when errors is set to ignore (a short sketch of both parameters follows)
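A sketch of both parameters on made-up values:

import pandas as pd

s = pd.Series(['1', '2', 'missing'])  # made-up data with one bad value

pd.to_numeric(s, errors='coerce')                          # -> 1.0, 2.0, NaN
pd.to_numeric(pd.Series(['1', '2']), downcast='integer')   # -> dtype int8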

2.4.3 The category data type

  • What is the category type

    • A special data type consisting of a fixed, limited set of values, e.g. gender divided into male, female, and undisclosed. A column can be converted to it with the astype function
    • The categories in category-type data can be ordered
  • Create data of category type

  • Method 1

s = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "d"],        # the Series data
        categories=["c", "b", "a"],  # the allowed categories; values not in this list become NaN
    )
)
# The output is as follows:
0      a
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['c', 'b', 'a']
  • Method 2
series_cat = pd.Series(['B', 'D', 'C', 'A'], dtype='category')  # specify the category type via dtype
series_cat
# The output is as follows:
0    B
1    D
2    C
3    A
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
  • Specify the order of the categories via CategoricalDtype
from pandas.api.types import CategoricalDtype
# Construct an ordered category data type
cat = CategoricalDtype(categories=['B', 'D', 'A', 'C'], ordered=True)
# Apply the category data type to series_cat
series_cat = series_cat.astype(cat)
  • Modify the category order via the .cat.reorder_categories() method of a category-type Series
series_cat.cat.reorder_categories(
    ['D', 'B', 'C', 'A'],  # the new category order
    ordered=True,          # treat the categories as ordered
    inplace=True,          # modify series_cat in place
)
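A short sketch showing why the order matters (made-up sizes):

import pandas as pd

sizes = pd.Series(['M', 'S', 'L'], dtype='category')
sizes = sizes.cat.reorder_categories(['S', 'M', 'L'], ordered=True)

sizes.sort_values()  # S, M, L: sorting follows the category order
sizes > 'S'          # comparisons use the category order as well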