Pandas provides fast, flexible, and expressive data structures and is a powerful Python library for data analysis.

This article is included in the pre-machine learning tutorial series.

1. Series and DataFrame

(The previous article in this series: Machine learning: learn NumPy in 30 minutes.) Pandas is particularly good at handling tabular data (such as SQL tables and Excel spreadsheets), ordered and unordered time series, and arbitrary matrix data with row and column labels.

Open a Jupyter Notebook and import NumPy and Pandas to begin our tutorial:

import numpy as np
import pandas as pd

1. pandas.Series

Series is a one-dimensional ndarray with an index. Index values need not be unique, but they must be hashable.

pd.Series([1, 3, 5, np.nan, 6, 8])

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

We can see that the default index values are the numbers 0, 1, 2, 3, 4, and 5. Now let's pass the index parameter and specify the labels 'c', 'a', 'i', 'yong', 'j', 'i'.

pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])

In the output below, we can see that index labels can repeat.

c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64

2. pandas.DataFrame

A DataFrame is a table structure with rows and columns. You can think of it as a dictionary of Series objects that share the same index.

pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i','ii','iii'], columns=['A', 'B', 'C'])

The output table is as follows: index labels the rows and columns labels the columns.

A B C
i 1 2 3
ii 4 5 6
iii 7 8 9
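
To make the dictionary-of-Series analogy concrete, here is a minimal sketch that builds the same table from three Series sharing an index (the names s_a, s_b, s_c are just illustrative):

# each dictionary key becomes a column; the shared index becomes the rows
s_a = pd.Series([1, 4, 7], index=['i', 'ii', 'iii'])
s_b = pd.Series([2, 5, 8], index=['i', 'ii', 'iii'])
s_c = pd.Series([3, 6, 9], index=['i', 'ii', 'iii'])
pd.DataFrame({'A': s_a, 'B': s_b, 'C': s_c})   # identical to the table above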

2. Common operations

1. Access data

Let's prepare the data: randomly generate a 2-D array with 6 rows and 4 columns, with the dates 2021-01-01 through 2021-01-06 as row labels and A, B, C, D as column labels.

import numpy as np
import pandas as pd
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.1 head() and tail()

Look at the first few rows of the table:

df.head(2)

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841

Look at the last few rows of the table:

df.tail(3)

The display table is as follows:

A B C D
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.2 describe()

The describe method generates descriptive statistics for a DataFrame, making it easy to see how the data set is distributed. Note that NaN values are excluded from the statistics.

df.describe()

The display is as follows:

A B C D
count 6 6 6 6
mean 0.0825402 0.0497552 -0.181309 0.22896
std 0.551412 1.07834 0.933155 1.13114
min -0.816064 -1.40384 -1.64592 -1.2718
25% -0.18 -0.553043 -0.737194 -0.587269
50% 0.298188 -0.134555 0.106933 0.287363
75% 0.342885 0.987901 0.556601 1.16805
max 0.696541 1.30197 0.656281 1.48804

Let's quickly review the relevant formulas.

Mean:

\bar x = \frac{\sum_{i=1}^{n}{x_i}}{n}

Variance:

s^2 = \frac{\sum_{i=1}^{n}{(x_i - \bar x)^2}}{n}

Standard deviation (std):

s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i - \bar x)^2}}{n}}

(One caveat: the std reported by describe is actually the sample standard deviation, which divides by n - 1 rather than n; pandas uses ddof=1 by default.)

Now let's explain the meaning of each statistic describe produces, taking column A as an example. (You can verify these numbers by hand, as shown in the sketch after this list.)

  • count: the number of non-null values. Column A has six non-null values.
  • mean: the average. The mean of the non-null values in column A is 0.0825402.
  • std: the standard deviation. The standard deviation of column A is 0.551412.
  • min: the minimum. The minimum of column A is -0.816064; no value is smaller.
  • 25%: the first quartile. The first quartile of column A is -0.18; that is, 25% of the data is less than -0.18.
  • 50%: the median. The median of column A is 0.298188; that is, 50% of the data is less than 0.298188.
  • 75%: the third quartile. The third quartile of column A is 0.342885; that is, 75% of the data is less than 0.342885.
  • max: the maximum. The maximum of column A is 0.696541; no value is larger.
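
As a quick sanity check, here is a minimal sketch that reproduces a few of these statistics directly (assuming the df defined above):

a = df['A']
a.mean()            # 0.082540
a.std()             # 0.551412, the sample standard deviation (ddof=1)
np.sqrt(((a - a.mean()) ** 2).sum() / (len(a) - 1))   # same value, computed by hand
a.quantile(0.25)    # -0.18, the first quartile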

1.3 T

T stands for transpose: it converts rows into columns and columns into rows.

df.T

The display table is as follows:

2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-05 2021-01-06
A 0.270961 0.696541 0.325415 -0.33032 0.348708 -0.816064
B -0.405463 0.136352 -0.602236 -1.40384 1.27175 1.30197
C 0.348373 -1.64592 -0.134508 -0.93809 0.626011 0.656281
D 0.828572 -0.69841 1.28121 1.48804 -0.253845 -1.2718

1.4 sort_values()

Specify a column to sort by; the following code sorts in ascending order by column C.

df.sort_values(by='C')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.5 nlargest()

Select the n rows with the largest values in a given column. df.nlargest(2, 'A') returns the 2 rows with the largest values in column A.

df.nlargest(2,'A')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-05 0.348708 1.27175 0.626011 -0.253845

1.6 sample()

The sample method returns a random sample of rows.

df.sample(5) returns 5 random rows of data.

df.sample(5)

The frac parameter stands for fraction: frac=0.01 returns a random sample of 1% of the rows.

df.sample(frac=0.01)
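
For reproducible samples, sample also accepts a random_state seed; a small sketch (the seed value 42 is just an example):

df.sample(3, random_state=42)   # the same 3 rows on every run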

2. Select data

2.1 Selection based on labels

We type df['A'] to select column A.

df['A']

This outputs the data of column A, which is itself a Series object:

2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64

df[0:3] is equivalent to df.head(3). It is NumPy-style slice selection, which shows how well Pandas interoperates with NumPy.

df[0:3]

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121

The loc method selects data by row and column labels.

df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

The display table is as follows:

A B
2021-01-01 0.270961 -0.405463
2021-01-02 0.696541 0.136352
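
For a single cell, loc also accepts one row label and one column label, and at is a faster equivalent for scalar access. A minimal sketch (assuming the df above):

df.loc['2021-01-02', 'B']                 # 0.136352
df.at[pd.Timestamp('2021-01-02'), 'B']    # same value, optimized for scalar lookups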

2.2 Selection based on location

iloc differs from loc: loc selects by label, while iloc selects by integer position. df.iloc[3:5, 0:3] selects the rows at positions 3 and 4 and the columns at positions 0, 1, and 2 — that is, the 4th and 5th rows and the 1st, 2nd, and 3rd columns. Note that positions start at 0. The colon denotes a slice, with the start and end on either side; the slice is half-open, so 3:5 covers positions 3 and 4 but excludes 5.

df.iloc[3:5, 0:3]
A B C
2021-01-04 -0.33032 -1.40384 -0.93809
2021-01-05 0.348708 1.27175 0.626011
df.iloc[:, 1:3]
B C
2021-01-01 -0.405463 0.348373
2021-01-02 0.136352 -1.64592
2021-01-03 -0.602236 -0.134508
2021-01-04 -1.40384 -0.93809
2021-01-05 1.27175 0.626011
2021-01-06 1.30197 0.656281

2.3 Boolean indexing

A DataFrame can be filtered with a condition: rows where the condition is True are kept, and rows where it is False are filtered out.

We build a filter that checks whether column A is greater than 0.

filter = df['A'] > 0
filter

The output is as follows; you can see that the rows 2021-01-04 and 2021-01-06 are False.

2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool

We view the data set through a filter.

df[filter]
# df[df['A'] > 0]

Looking at the table, we can see that the rows 2021-01-04 and 2021-01-06 have been filtered out.

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-05 0.348708 1.27175 0.626011 -0.253845
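
Multiple conditions can be combined with & (and) and | (or); each condition must be parenthesized because of operator precedence. A small sketch (assuming the df above):

df[(df['A'] > 0) & (df['D'] > 0)]    # rows where both A and D are positive
df[(df['A'] < 0) | (df['D'] < 0)]    # rows where either A or D is negative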

3. Handling missing values

Prepare data.

df2 = df.copy()
f_series = {'2021-01-02': 1.0, '2021-01-03': 2.0, '2021-01-04': 3.0,
            '2021-01-05': 4.0, '2021-01-06': 5.0}
# convert the string keys to Timestamps so the Series aligns with df2's date index
df2['F'] = pd.Series({pd.Timestamp(k): v for k, v in f_series.items()})
df2.loc['2021-01-01':'2021-01-03', 'E'] = 1.0
df2

The display table is as follows:

A B C D F E
2021-01-01 0.270961 -0.405463 0.348373 0.828572 NaN 1.0
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 3.0 NaN
2021-01-05 0.348708 1.27175 0.626011 -0.253845 4.0 NaN
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 5.0 NaN
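
Before cleaning anything, it is useful to count the missing values per column; a minimal sketch:

df2.isna().sum()    # A/B/C/D: 0, F: 1, E: 3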

3.1 dropna()

Use the dropna method to drop rows containing NaN values. Note: dropna returns a new DataFrame and does not alter the original DataFrame.

df2.dropna(how='any')

The code above drops a row if any of its values is null.

A B C D F E
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0

3.2 fillna()

Use the fillna method to fill in NaN values.

df2.fillna(df2.mean())

The code above fills each gap with the mean of its column. Like dropna, fillna does not update the original DataFrame; to update it, assign the result back: df2 = df2.fillna(df2.mean()).

The display table is as follows:

A B C D F E
2021-01-01 0.270961 -0.405463 0.348373 0.828572 3.0 1.0
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 3.0 1.0
2021-01-05 0.348708 1.27175 0.626011 -0.253845 4.0 1.0
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 5.0 1.0

4. Operations

4.1 agg()

agg is short for aggregate.

Common aggregation methods are as follows:

  • mean(): Compute mean of groups
  • sum(): Compute sum of group values
  • size(): Compute group sizes
  • count(): Compute count of group
  • std(): Standard deviation of groups
  • var(): Compute variance of groups
  • sem(): Standard error of the mean of groups
  • describe(): Generates descriptive statistics
  • first(): Compute first of group values
  • last(): Compute last of group values
  • nth() : Take nth value, or a subset if n is a list
  • min(): Compute min of group values
  • max(): Compute max of group values
df.mean()

This returns the mean of each column:

A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64

Add the axis parameter to compute row means instead.

df.mean(axis=1)

Output:

2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64

What if we want several aggregate statistics for one column? That is where the agg method comes in:

df.agg(['std','mean'])['A']

The result shows the standard deviation (std) and the mean:

std     0.551412
mean    0.082540
Name: A, dtype: float64

Apply different aggregate functions to different columns:

df.agg({'A':['max','mean'],'B':['mean','std','var']})

The result is as follows:

A B
max 0.696541 NaN
mean 0.0825402 0.0497552
std NaN 1.07834
var NaN 1.16281

4.2 apply()

apply() applies a function to the data. For example, df.apply(np.sum) applies np.sum to each column and returns the column sums.

df.apply(np.sum)

The output is:

A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64

The apply method also accepts lambda expressions.

df.apply(lambda n: n*2)
A B C D
2021-01-01 0.541923 -0.810925 0.696747 1.65714
2021-01-02 1.39308 0.272704 -3.29185 -1.39682
2021-01-03 0.65083 -1.20447 -0.269016 2.56242
2021-01-04 -0.66064 -2.80768 -1.87618 2.97607
2021-01-05 0.697417 2.5435 1.25202 -0.50769
2021-01-06 -1.63213 2.60393 1.31256 -2.5436
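
apply can also work row by row with axis=1. A minimal sketch computing the range (max minus min) of each row, assuming the df above:

df.apply(lambda row: row.max() - row.min(), axis=1)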

4.3 value_counts()

The value_counts method counts how many times each value occurs in a column. We regenerate some integer data to make sure there are duplicates.

np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0, 9, size=(6, 4)), columns=list('ABCD'))
df3
A B C D
0 1 6 7 8
1 4 8 5 0
2 5 8 1 3
3 8 3 3 2
4 8 3 7 0
5 7 8 4 3

Call the value_counts() method.

df3['A'].value_counts()

From the output we can see that the value 8 appears twice in column A, while every other value appears once.

8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
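To see relative frequencies rather than counts, value_counts accepts normalize=True; a small sketch:

df3['A'].value_counts(normalize=True)   # 8 -> 0.3333..., every other value -> 0.1666...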

4.4 str

Pandas has built-in string handling methods, available through the .str accessor.

names = pd.Series(['andrew','bobo','claire','david','4'])
names.str.upper()

The code above converts all strings in the Series to uppercase.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object

Capitalize the first letter:

names.str.capitalize()

The output is:

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

Check whether each string is a number:

names.str.isdigit()

The output is:

0    False
1    False
2    False
3    False
4     True
dtype: bool

String splitting:

tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]

Splitting the string with commas results in:

0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
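The .str accessor has many more vectorized methods. A small sketch using str.contains and str.replace on the names Series from above (the patterns here are just illustrative):

names.str.contains('a')                   # True where the letter a appears
names.str.replace('a', '@', regex=False)  # replace it with another character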

5. Merging

5.1 concat()

concat is used to concatenate data sets. Let's prepare some data first.

data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'],'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

We use the concat method to concatenate the two data sets.

pd.concat([one, two])

Get the table:

Col1 Col2
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
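
Notice that the row labels 0-3 repeat. Two useful variations: ignore_index=True builds a fresh 0..n-1 index, and axis=1 concatenates side by side as columns. A small sketch:

pd.concat([one, two], ignore_index=True)   # rows relabeled 0..7
pd.concat([one, two], axis=1)              # columns placed side by side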

5.2 merge()

merge is the equivalent of the JOIN operation in SQL; it joins two data sets on some relationship.

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

We merge the two tables on the name column, using an outer join:

pd.merge(left=registrations, right=logins, how='outer',on='name')

The return result is:

reg_id name log_id
0 1 Andrew 2
1 2 Bobo 4
2 3 Claire NaN
3 4 David NaN
4 NaN Xavier 1
5 NaN Yolanda 3

Note that the how parameter accepts four join types: 'left', 'right', 'outer', and 'inner'. It determines which rows are kept and where NaN values appear. For example, how='left' keeps all rows of the left table; where the right table has no match, its columns are filled with NaN. In simple terms, think of the two tables as sets (an inner-join sketch follows this list):

  • left: all keys of the left table (the left set plus the intersection)
  • right: all keys of the right table (the right set plus the intersection)
  • outer: the union of the two tables' keys
  • inner: the intersection of the two tables' keys
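
As a quick check, an inner join keeps only the names present in both tables:

pd.merge(left=registrations, right=logins, how='inner', on='name')
# only Andrew and Bobo, who appear in both tables, survive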

6. Grouping (GroupBy)

The grouping feature in Pandas closely mirrors the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. It doesn't matter if you're not familiar with SQL: grouping means splitting the table data by some criterion, applying a computation to each group, and combining the results.

Prepare data.

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

As you can see, columns A and B contain a lot of duplicate values. We can now group by foo/bar or by one/two/three.

A B C D
0 foo one 0.270961 0.325415
1 bar one -0.405463 -0.602236
2 foo two 0.348373 -0.134508
3 bar three 0.828572 1.28121
4 foo two 0.696541 -0.33032
5 bar two 0.136352 -1.40384
6 foo one -1.64592 -0.93809
7 foo three -0.69841 1.48804

6.1 Single-column Grouping

We call the groupby method to group the data in the table above.

df.groupby('A')

Running the code above returns a DataFrameGroupBy object. There is nothing to see until we apply an aggregate function to it; see section 4.1 of this article.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>

Let's try the aggregate function sum. (Depending on your pandas version, you may need df.groupby('A').sum(numeric_only=True) to exclude the string column B.)

df.groupby('A').sum()

The display table is as follows:

A C D
bar 0.559461 -0.724868
foo -1.02846 0.410533

6.2 Multi-column grouping

The groupby method supports passing in multiple columns as arguments.

df.groupby(['A', 'B']).sum()

The following information is displayed after grouping:

A B C D
bar one -0.405463 -0.602236
three 0.828572 1.28121
two 0.136352 -1.40384
foo one -1.37496 -0.612675
three -0.69841 1.48804
two 1.04491 -0.464828

6.3 Applying multiple aggregate functions

We call agg() and pass a list of aggregate functions as the argument. The following code groups by A and aggregates only the values of column C.

df.groupby('A')['C'].agg([np.sum, np.mean, np.std])

You can see the results of the aggregate functions for group bar and group foo as follows:

A sum mean std
bar 0.559461 0.186487 0.618543
foo -1.02846 -0.205692 0.957242

6.4 Different aggregation statistics for different columns

The following code applies different aggregations to columns C and D: column C is summed, while column D gets a (sample) standard deviation.

df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})

The output is as follows:

A C D
bar 0.559461 1.37837
foo -1.02846 0.907422
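
Pandas also supports named aggregation, which lets you name the output columns directly; a minimal sketch (the output names c_sum and d_std are just examples):

df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))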

6.5 More

For more about the Pandas groupby method, please refer to the official documentation (pandas.pydata.org/pandas-docs…

3. Advanced operations

1. Reshaping

Reshaping means transforming the structure of a table. Complex tables often need to be transformed into a shape we can digest more easily, such as separate statistics grouped by certain attributes.

1.1 stack() and unstack()

The stack method compresses a table: the index is retained, and the column labels are stacked into a new innermost index level.

Prepare data.

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

The code above creates a composite (multi-level) index:

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

We create a DataFrame with a composite index.

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

The output is as follows:

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

We call the stack method.

stacked = df.stack()
stacked

The stacked (compressed) output is shown below. Note: the layout has been adjusted slightly to display better in Markdown.

first  second
bar    one     A    0.270961
               B   -0.405463
       two     A    0.348373
               B    0.828572
baz    one     A    0.696541
               B    0.136352
       two     A   -1.64592
               B   -0.69841
foo    one     A    0.325415
               B   -0.602236
       two     A   -0.134508
               B    1.28121
qux    one     A   -0.33032
               B   -1.40384
       two     A   -0.93809
               B    1.48804
dtype: float64

We unstack to expand the data.

stacked.unstack()

Output the original table.

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

Let’s add the level parameter.

stacked.unstack(level=0)
#stacked.unstack(level=1)

With level=0, the first level of the index is pivoted into the columns, producing the output below; try level=1 to pivot the second level instead.

second first bar baz foo qux
one A 0.270961 0.696541 0.325415 -0.33032
one B -0.405463 0.136352 -0.602236 -1.40384
two A 0.348373 -1.64592 -0.134508 -0.93809
two B 0.828572 -0.69841 1.28121 1.48804

1.2 pivot_table()

pivot_table creates a pivot table: a table that dynamically arranges and summarizes data.

We generate a DataFrame with categorical-looking columns and no particular index.

np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                    'B': ['A', 'B', 'C'] * 4,
                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D': np.random.randn(12),
                    'E': np.random.randn(12)})
df

The display table is as follows:

A B C D E
0 one A foo 0.142359 0.0235001
1 one B foo 2.05722 0.456201
2 two C foo 0.283262 0.270493
3 three A bar 1.32981 1.43501
4 one B bar 0.154622 0.882817
5 one C bar 0.0690309 0.580082
6 two A foo 0.75518 0.501565
7 three B foo 0.825647 0.590953
8 one C foo 0.113069 0.731616
9 one A bar 2.36784 0.261755
10 two B bar 0.167049 0.855796
11 three C bar 0.685398 0.187526

Looking at the data, we can see that columns A, B, and C carry categorical meaning. Let's run the pivot_table method.

pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])

This uses columns D and E as the value columns, A and B as a composite row index, and the values of C as column labels.

D D E E
C bar foo bar foo
A B
one A 2.36784 0.142359 0.261755 0.0235001
one B 0.154622 2.05722 0.882817 0.456201
one C 0.0690309 0.113069 0.580082 0.731616
three A 1.32981 NaN 1.43501 NaN
three B NaN 0.825647 NaN 0.590953
three C 0.685398 NaN 0.187526 NaN
two A NaN 0.75518 NaN 0.501565
two B 0.167049 NaN 0.855796 NaN
two C NaN 0.283262 NaN 0.270493
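
By default pivot_table aggregates duplicate entries with the mean. The aggfunc parameter changes the aggregation, and fill_value fills the holes; a small sketch:

pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum', fill_value=0)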

2. Time series

date_range is the Pandas method for generating sequences of dates. Let's execute the following code:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

The date_range method generates 100 timestamps at 1-second intervals, starting at 2021-01-01 00:00:00. The output is as follows:

2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ...
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32

Let's try changing the freq parameter from S (second) to M (month end).

rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

Output:

2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ...
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32

We can also generate dates at a quarterly frequency.

prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)

This outputs all quarters between Q1 2018 and Q4 2020.

2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
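
A common follow-up on date-indexed data is resampling to a coarser frequency; a minimal sketch reusing a per-second series like the one above (the names ts_rng and ts are just illustrative):

ts_rng = pd.date_range('1/1/2021', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(ts_rng)), index=ts_rng)
ts.resample('1Min').mean()    # mean of each minute's samples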

3. Categorical data

Pandas has a special data type for categorical variables (dtype="category"). Setting a column to this type lets you work with its values as categories.

Prepare data.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

We add a new column grade and set its data type to category.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

We can see that the grade column has only three categories: a, b, and e.

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

We replace the categories a, b, and e, in that order, with very good, good, and very bad. (Recent pandas versions no longer allow assigning to cat.categories directly, so we use rename_categories.)

df["grade"].cat.categories = ["very good", "good", "very bad"]

The table is as follows:

id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

Let's sort the table in descending order of grade. Note that a categorical column sorts by its category order (very good, good, very bad), not alphabetically:

df.sort_values(by="grade", ascending=False)
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

Count the rows in each category:

df.groupby("grade").size()

The above code output is:

grade
very good    3
good         2
very bad     1
dtype: int64
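If the categories have a natural order, you can make the column an ordered categorical and then compare against a category value; a minimal sketch (assuming the grade column above):

df["grade"] = df["grade"].cat.set_categories(["very bad", "good", "very good"], ordered=True)
df[df["grade"] > "good"]    # rows rated strictly better than good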

4. IO

Pandas supports reading and writing files in CSV, JSON, Excel, and many other formats. The supported formats are listed below.

Format Type Data Description Reader Writer
text CSV read_csv to_csv
text Fixed-Width Text File read_fwf
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary OpenDocument read_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Parquet Format read_parquet to_parquet
binary ORC Format read_orc
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary SPSS read_spss
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google BigQuery read_gbq to_gbq

Let’s just use a CSV file as an example. For other formats, please refer to the table above.

We import data from a CSV file. Don't worry too much about the domain name in the URL below.

df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")

View the first 5 rows of data:

df.head(5)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41 880 129 322 126 8.3252 452600 NEAR BAY
1 -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500 NEAR BAY
2 -122.24 37.85 52 1467 190 496 177 7.2574 352100 NEAR BAY
3 -122.25 37.85 52 1274 235 558 219 5.6431 341300 NEAR BAY
4 -122.25 37.85 52 1627 280 565 259 3.8462 342200 NEAR BAY
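
Writing goes through the matching to_* method; a minimal sketch (the output file name here is just an example):

df.to_csv('housing_copy.csv', index=False)   # index=False skips the row labels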

5. Plotting

Pandas integrates with Matplotlib, a powerful Python visualization tool. This section only briefly introduces the plotting methods Pandas supports; we will cover Matplotlib in detail in the next article. To not miss the update, feel free to follow me.

np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

We plot by calling the plot method directly. There are two caveats:

  1. The plot method here is called on the Pandas object, not on Matplotlib directly.
  2. Python does not require semicolons to end statements. In Jupyter, the trailing semicolon suppresses the text representation of the return value, so only the rendered figure is displayed.
df.plot();

df.plot.bar();

df.plot.bar(stacked=True);
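Other plot kinds hang off the same accessor; for example, a scatter plot of two columns (a and b are columns of the random df above):

df.plot.scatter(x='a', y='b');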

4. More

We will cover Matplotlib in the next post. Please follow the pre-machine learning tutorial series, or my personal blog: blog.caiyongji.com/.