1. Generate data tables

First, import the numpy and pandas libraries.

import numpy as np
import pandas as pd

Import a CSV or XLSX file

df = pd.read_csv('name.csv', header=1)   # read_csv already returns a DataFrame
df = pd.read_excel('name.xlsx')

Create tables with pandas

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

2. Viewing data table information

Dimension check

df.shape

Basic data table information (dimension, column name, data format, occupied space, etc.)

df.info()

Data type of each column

df.dtypes

Data type of a single column

df['price'].dtype

Check for null values

df.isnull()

View the unique values of a column

df['city'].unique()

View the values of the data table

df.values

View column names

df.columns

View the first and last rows

df.head()  # first 5 rows by default
df.tail()  # last 5 rows by default

3. Data table cleaning

Fill the null values with the number 0

df.fillna(value=0)

Fill NA values with the mean of the price column

df['price'].fillna(df['price'].mean())

Strip whitespace from the city field

df['city']=df['city'].map(str.strip)

Case conversion

df['city']=df['city'].str.lower()

Changing the data type

df['price'].astype('int')

Changing column names

df.rename(columns={'category': 'category-size'})

Drop duplicate values (keeps the first occurrence by default)

df['city'].drop_duplicates()

Drop duplicates, keeping the last occurrence

df['city'].drop_duplicates(keep='last')

Replace data

df['city'].replace('sh', 'shanghai')

4. Data preprocessing

df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

Table merge

df_inner = pd.merge(df, df1, how='inner')  # inner join: intersection of matching keys
df_left = pd.merge(df, df1, how='left')    # left join
df_right = pd.merge(df, df1, how='right')  # right join
df_outer = pd.merge(df, df1, how='outer')  # outer join: union of keys

Set index column

df_inner.set_index('id')

Sort by the value of a particular column

df_inner.sort_values(by=['age'])

Sort by index column

df_inner.sort_index()

If the price column has a value greater than 3000, the group column displays high; otherwise it displays low

df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low')

Tag data that meets multiple conditions

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign']=1

Split the category field values into separate columns, creating a table indexed like df_inner with columns named category and size

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

Merge the split table back into the original df_inner table

df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)

5. Data extraction

Three functions are mainly used: loc, iloc, and ix. loc extracts by label, iloc extracts by position, and ix could extract by label and position at the same time (ix is deprecated and removed in modern pandas; prefer loc and iloc).
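The difference between label- and position-based extraction is easiest to see on a frame whose index labels differ from their positions; a minimal sketch (the small frame here is made up for illustration, not the df_inner used below):

```python
import pandas as pd

# Tiny illustrative frame with non-default index labels
demo = pd.DataFrame({'price': [1200, 2133, 4432]}, index=[10, 20, 30])

demo.loc[10]     # by label: the row labeled 10 (here, the first row)
demo.iloc[0]     # by position: also the first row, regardless of its label
demo.loc[10:20]  # label slicing includes both endpoints -> 2 rows
demo.iloc[0:2]   # position slicing excludes the endpoint -> 2 rows
```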

Extracts the value of a single row by index

df_inner.loc[3]

Extract a range of rows by position

df_inner.iloc[0:5]

Reset the index

df_inner.reset_index()

Set the date to index

df_inner=df_inner.set_index('date')

Extract all data up to January 4

df_inner[:'2013-01-04']

Extract data by position range using iloc

df_inner.iloc[:3, :2]  # the numbers around the colons are positions starting from 0, not index labels: first 3 rows, first 2 columns

Use iloc to pick out data at scattered positions

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2 and 5; columns 4 and 5

Extract data by a mix of index label and position using ix

df_inner.ix[:'2013-01-03', :4]  # rows up to the label 2013-01-03, first 4 columns (ix is deprecated in modern pandas)

Check whether the value of the city column is Beijing

df_inner['city'].isin(['beijing'])

Determine whether the city column contains Beijing and Shanghai, and then extract the data that meets the conditions

df_inner.loc[df_inner['city'].isin(['beijing','shanghai'])]

Extract the first three characters and generate a data table

pd.DataFrame(df_inner['category'].str[:3])

6. Data filtering

The data is filtered with combinations of greater-than, less-than, and not-equal conditions, then counted and summed.
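A point worth noting before the examples: pandas boolean filtering uses the elementwise operators & and | (not Python's `and`/`or`), and each condition needs its own parentheses. A minimal self-contained sketch (the frame is illustrative):

```python
import pandas as pd

# Illustrative frame; the condition syntax is the point, not the data
t = pd.DataFrame({'city': ['beijing', 'sh', 'beijing'],
                  'age': [23, 44, 54]})

# Each condition parenthesized, combined with the elementwise & operator
result = t.loc[(t['age'] > 25) & (t['city'] == 'beijing')]
```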

Filter with and

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']]

Use or to filter

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(['age'])

Filter with the “not” condition

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id'])

The filtered data is counted in the city column

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(['id']).city.count()

Use query for filtering

df_inner.query('city == ["beijing", "shanghai"]')

Sum the price column of the filtered results

df_inner.query('city == ["beijing", "shanghai"]').price.sum()

7. Data summary

The main functions are groupby and pivot_table.

Count all columns, grouped by city

df_inner.groupby('city').count()

Count the ID field by city

df_inner.groupby('city')['id'].count()

Perform a summary count grouped by two fields

df_inner.groupby(['city','size'])['id'].count()

Group by the city field and calculate the count, total, and mean of price

df_inner.groupby('city')['price'].agg([len,np.sum, np.mean])
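pivot_table is named at the start of this section but not demonstrated; a minimal sketch on a made-up frame that mirrors the city/size/price columns used in this article (the data itself is an assumption, not the actual df_inner):

```python
import pandas as pd

# Made-up data reusing this article's column names
sales = pd.DataFrame({'city': ['beijing', 'beijing', 'shanghai', 'shanghai'],
                      'size': ['A', 'B', 'A', 'B'],
                      'price': [1200, 2133, 4432, 5433]})

# Rows = city, columns = size, cell values = sum of price
pivot = pd.pivot_table(sales, values='price', index='city',
                       columns='size', aggfunc='sum')
```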

8. Data statistics

This section covers sampling data and calculating standard deviation, covariance, and correlation coefficients.

Simple data sampling

df_inner.sample(n=3)

Manually set the sampling weight

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

Sampling without replacement

df_inner.sample(n=6, replace=False)

Sampling with replacement

df_inner.sample(n=6, replace=True)

Table descriptive statistics

df_inner.describe().round(2).T  # round() sets the number of decimal places shown; T transposes the table

Calculate the standard deviation of the columns

df_inner['price'].std()

Calculate the covariance between the two fields

df_inner['price'].cov(df_inner['m-point'])

The covariance between all fields in a data table

df_inner.cov()

Correlation analysis of two fields

df_inner['price'].corr(df_inner['m-point'])  # correlation coefficient is between -1 and 1; close to 1 means positive correlation, close to -1 negative

Correlation analysis of data tables

df_inner.corr()

9. Data output

The analyzed data can be output in XLSX or CSV format

Write to Excel

df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

Write to CSV

df_inner.to_csv('excel_to_python.csv')

These are some basic uses of Pandas.


Original link: www.jb51.net/article/136…