Pandas provides fast, flexible, and expressive data structures and is a powerful Python library for data analysis.

This article is included in the pre-machine learning tutorial series.

1. Series and DataFrame

(The previous article in this series: Machine learning: learn NumPy in 30 minutes.) Pandas is particularly good at handling tabular data (such as SQL tables and Excel spreadsheets), ordered and unordered time series, and arbitrary matrix data with row and column labels.

Open a Jupyter Notebook and import NumPy and Pandas to begin our tutorial:

import numpy as np
import pandas as pd

1. pandas.Series

Series is a one-dimensional ndarray with an index. Index values need not be unique, but they must be hashable.

pd.Series([1, 3, 5, np.nan, 6, 8])

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

We can see that the default index values are the numbers 0, 1, 2, 3, 4, and 5. Now let's pass the index parameter and specify the labels 'c', 'a', 'i', 'yong', 'j', 'i'.

pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])

In the output below, we can see that index labels can repeat.

c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64

2. pandas.DataFrame

A DataFrame is a table structure with rows and columns. You can think of it as a dictionary of Series objects that share the same index.

pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i','ii','iii'], columns=['A', 'B', 'C'])

The output table is as follows: index labels the rows and columns labels the columns.

A B C
i 1 2 3
ii 4 5 6
iii 7 8 9
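
To make the dictionary-of-Series analogy concrete, here is a minimal sketch that builds the same table from three Series sharing an index (the names s_a, s_b, s_c are just illustrative):

# each dictionary key becomes a column; the shared index becomes the rows
s_a = pd.Series([1, 4, 7], index=['i', 'ii', 'iii'])
s_b = pd.Series([2, 5, 8], index=['i', 'ii', 'iii'])
s_c = pd.Series([3, 6, 9], index=['i', 'ii', 'iii'])
pd.DataFrame({'A': s_a, 'B': s_b, 'C': s_c})   # identical to the table above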

2. Common operations

1. Access data

Let's prepare the data: randomly generate a 2-D array with 6 rows and 4 columns, with the dates 2021-01-01 through 2021-01-06 as row labels and A, B, C, D as column labels.

import numpy as np
import pandas as pd
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.1 head() and tail()

Look at the first few rows of the table:

df.head(2)

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841

Look at the last few rows of the table:

df.tail(3)

The display table is as follows:

A B C D
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.2 describe()

The describe method generates descriptive statistics for a DataFrame, making it easy to see how the data set is distributed. Note that NaN values are excluded from the statistics.

df.describe()

The display is as follows:

A B C D
count 6 6 6 6
mean 0.0825402 0.0497552 -0.181309 0.22896
std 0.551412 1.07834 0.933155 1.13114
min -0.816064 -1.40384 -1.64592 -1.2718
25% -0.18 -0.553043 -0.737194 -0.587269
50% 0.298188 -0.134555 0.106933 0.287363
75% 0.342885 0.987901 0.556601 1.16805
max 0.696541 1.30197 0.656281 1.48804

Let's quickly review the relevant formulas.

Mean:

\bar x = \frac{\sum_{i=1}^{n}{x_i}}{n}

Variance:

s^2 = \frac{\sum_{i=1}^{n}{(x_i - \bar x)^2}}{n}

Standard deviation (std):

s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i - \bar x)^2}}{n}}

(One caveat: the std reported by describe is actually the sample standard deviation, which divides by n - 1 rather than n; pandas uses ddof=1 by default.)

Now let's explain the meaning of each statistic describe produces, taking column A as an example. (You can verify these numbers by hand, as shown in the sketch after this list.)

  • count: the number of non-null values. Column A has six non-null values.
  • mean: the average. The mean of the non-null values in column A is 0.0825402.
  • std: the standard deviation. The standard deviation of column A is 0.551412.
  • min: the minimum. The minimum of column A is -0.816064; no value is smaller.
  • 25%: the first quartile. The first quartile of column A is -0.18; that is, 25% of the data is less than -0.18.
  • 50%: the median. The median of column A is 0.298188; that is, 50% of the data is less than 0.298188.
  • 75%: the third quartile. The third quartile of column A is 0.342885; that is, 75% of the data is less than 0.342885.
  • max: the maximum. The maximum of column A is 0.696541; no value is larger.
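
As a quick sanity check, here is a minimal sketch that reproduces a few of these statistics directly (assuming the df defined above):

a = df['A']
a.mean()            # 0.082540
a.std()             # 0.551412, the sample standard deviation (ddof=1)
np.sqrt(((a - a.mean()) ** 2).sum() / (len(a) - 1))   # same value, computed by hand
a.quantile(0.25)    # -0.18, the first quartile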

1.3 T

T stands for transpose: it converts rows into columns and columns into rows.

df.T

The display table is as follows:

2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-05 2021-01-06
A 0.270961 0.696541 0.325415 -0.33032 0.348708 -0.816064
B -0.405463 0.136352 -0.602236 -1.40384 1.27175 1.30197
C 0.348373 -1.64592 -0.134508 -0.93809 0.626011 0.656281
D 0.828572 -0.69841 1.28121 1.48804 -0.253845 -1.2718

1.4 sort_values()

Specify a column to sort by; the following code sorts in ascending order by column C.

df.sort_values(by='C')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.5 nlargest()

Select the n rows with the largest values in a given column. df.nlargest(2, 'A') returns the 2 rows with the largest values in column A.

df.nlargest(2,'A')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-05 0.348708 1.27175 0.626011 -0.253845

1.6 sample()

The sample method returns a random sample of rows.

df.sample(5) returns 5 random rows of data.

df.sample(5)

The frac parameter stands for fraction: frac=0.01 returns a random sample of 1% of the rows.

df.sample(frac=0.01)
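
For reproducible samples, sample also accepts a random_state seed; a small sketch (the seed value 42 is just an example):

df.sample(3, random_state=42)   # the same 3 rows on every run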

2. Select data

2.1 Selection based on labels

We type df['A'] to select column A.

df['A']

This outputs the data of column A, which is itself a Series object:

2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64

df[0:3] is equivalent to df.head(3). It is NumPy-style slice selection, which shows how well Pandas interoperates with NumPy.

df[0:3]

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121

The loc method selects data by row and column labels.

df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

The display table is as follows:

A B
2021-01-01 0.270961 -0.405463
2021-01-02 0.696541 0.136352
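
For a single cell, loc also accepts one row label and one column label, and at is a faster equivalent for scalar access. A minimal sketch (assuming the df above):

df.loc['2021-01-02', 'B']                 # 0.136352
df.at[pd.Timestamp('2021-01-02'), 'B']    # same value, optimized for scalar lookups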

2.2 Selection based on location

iloc differs from loc: loc selects by label, while iloc selects by integer position. df.iloc[3:5, 0:3] selects the rows at positions 3 and 4 and the columns at positions 0, 1, and 2 — that is, the 4th and 5th rows and the 1st, 2nd, and 3rd columns. Note that positions start at 0. The colon denotes a slice, with the start and end on either side; the slice is half-open, so 3:5 covers positions 3 and 4 but excludes 5.

df.iloc[3:5, 0:3]
A B C
2021-01-04 -0.33032 -1.40384 -0.93809
2021-01-05 0.348708 1.27175 0.626011
df.iloc[:, 1:3]
B C
2021-01-01 -0.405463 0.348373
2021-01-02 0.136352 -1.64592
2021-01-03 -0.602236 -0.134508
2021-01-04 -1.40384 -0.93809
2021-01-05 1.27175 0.626011
2021-01-06 1.30197 0.656281

2.3 Boolean indexing

A DataFrame can be filtered with a condition: rows where the condition is True are kept, and rows where it is False are filtered out.

We build a filter that checks whether column A is greater than 0.

filter = df['A'] > 0
filter

The output is as follows; you can see that the rows 2021-01-04 and 2021-01-06 are False.

2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool

We view the data set through a filter.

df[filter]
# df[df['A'] > 0]

Looking at the table, we can see that the rows 2021-01-04 and 2021-01-06 have been filtered out.

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-05 0.348708 1.27175 0.626011 -0.253845
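
Multiple conditions can be combined with & (and) and | (or); each condition must be parenthesized because of operator precedence. A small sketch (assuming the df above):

df[(df['A'] > 0) & (df['D'] > 0)]    # rows where both A and D are positive
df[(df['A'] < 0) | (df['D'] < 0)]    # rows where either A or D is negative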

3. Handling missing values

Prepare data.

df2 = df.copy()
f_series = {'2021-01-02': 1.0, '2021-01-03': 2.0, '2021-01-04': 3.0,
            '2021-01-05': 4.0, '2021-01-06': 5.0}
# convert the string keys to Timestamps so the Series aligns with df2's date index
df2['F'] = pd.Series({pd.Timestamp(k): v for k, v in f_series.items()})
df2.loc['2021-01-01':'2021-01-03', 'E'] = 1.0
df2

The display table is as follows:

A B C D F E
2021-01-01 0.270961 -0.405463 0.348373 0.828572 NaN 1.0
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 3.0 NaN
2021-01-05 0.348708 1.27175 0.626011 -0.253845 4.0 NaN
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 5.0 NaN
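
Before cleaning anything, it is useful to count the missing values per column; a minimal sketch:

df2.isna().sum()    # A/B/C/D: 0, F: 1, E: 3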

3.1 dropna()

Use the dropna method to drop rows containing NaN values. Note: dropna returns a new DataFrame and does not alter the original DataFrame.

df2.dropna(how='any')

The code above drops a row if any of its values is null.

A B C D F E
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0

3.2 fillna()

Use the fillna method to fill in NaN values.

df2.fillna(df2.mean())

The code above fills each gap with the mean of its column. Like dropna, fillna does not update the original DataFrame; to update it, assign the result back: df2 = df2.fillna(df2.mean()).

The display table is as follows:

A B C D F E
2021-01-01 0.270961 -0.405463 0.348373 0.828572 3.0 1.0
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 2.0 1.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 3.0 1.0
2021-01-05 0.348708 1.27175 0.626011 -0.253845 4.0 1.0
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 5.0 1.0

4. Operations

4.1 agg()

agg is short for aggregate.

Common aggregation methods are as follows:

  • mean(): Compute mean of groups
  • sum(): Compute sum of group values
  • size(): Compute group sizes
  • count(): Compute count of group
  • std(): Standard deviation of groups
  • var(): Compute variance of groups
  • sem(): Standard error of the mean of groups
  • describe(): Generates descriptive statistics
  • first(): Compute first of group values
  • last(): Compute last of group values
  • nth() : Take nth value, or a subset if n is a list
  • min(): Compute min of group values
  • max(): Compute max of group values
df.mean()

This returns the mean of each column:

A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64

Add the axis parameter to compute row means instead.

df.mean(axis=1)

Output:

2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64

What if we want several aggregate statistics for one column? That is where the agg method comes in:

df.agg(['std','mean'])['A']

The result shows the standard deviation (std) and the mean:

std     0.551412
mean    0.082540
Name: A, dtype: float64

Apply different aggregate functions to different columns:

df.agg({'A':['max','mean'],'B':['mean','std','var']})

The result is as follows:

A B
max 0.696541 NaN
mean 0.0825402 0.0497552
std NaN 1.07834
var NaN 1.16281

4.2 apply()

apply() applies a function to the data. For example, df.apply(np.sum) applies np.sum to each column and returns the column sums.

df.apply(np.sum)

The output is:

A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64

The apply method also accepts lambda expressions.

df.apply(lambda n: n*2)
A B C D
2021-01-01 0.541923 -0.810925 0.696747 1.65714
2021-01-02 1.39308 0.272704 -3.29185 -1.39682
2021-01-03 0.65083 -1.20447 -0.269016 2.56242
2021-01-04 -0.66064 -2.80768 -1.87618 2.97607
2021-01-05 0.697417 2.5435 1.25202 -0.50769
2021-01-06 -1.63213 2.60393 1.31256 -2.5436
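
apply can also work row by row with axis=1. A minimal sketch computing the range (max minus min) of each row, assuming the df above:

df.apply(lambda row: row.max() - row.min(), axis=1)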

4.3 value_counts()

The value_counts method counts how many times each value occurs in a column. We regenerate some integer data to make sure there are duplicates.

np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0, 9, size=(6, 4)), columns=list('ABCD'))
df3
A B C D
0 1 6 7 8
1 4 8 5 0
2 5 8 1 3
3 8 3 3 2
4 8 3 7 0
5 7 8 4 3

Call the value_counts() method.

df3['A'].value_counts()

From the output we can see that the value 8 appears twice in column A, while every other value appears once.

8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
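To see relative frequencies rather than counts, value_counts accepts normalize=True; a small sketch:

df3['A'].value_counts(normalize=True)   # 8 -> 0.3333..., every other value -> 0.1666...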

4.4 str

Pandas has built-in string handling methods, available through the .str accessor.

names = pd.Series(['andrew','bobo','claire','david','4'])
names.str.upper()

The code above converts all strings in the Series to uppercase.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object

Capitalize the first letter:

names.str.capitalize()

The output is:

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

Check whether each string is a number:

names.str.isdigit()

The output is:

0    False
1    False
2    False
3    False
4     True
dtype: bool

String splitting:

tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]

Splitting the string with commas results in:

0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
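The .str accessor has many more vectorized methods. A small sketch using str.contains and str.replace on the names Series from above (the patterns here are just illustrative):

names.str.contains('a')                   # True where the letter a appears
names.str.replace('a', '@', regex=False)  # replace it with another character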

5. Merging

5.1 concat()

concat is used to concatenate data sets. Let's prepare some data first.

data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'],'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

We use the concat method to concatenate the two data sets.

pd.concat([one, two])

Get the table:

Col1 Col2
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
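
Notice that the row labels 0-3 repeat. Two useful variations: ignore_index=True builds a fresh 0..n-1 index, and axis=1 concatenates side by side as columns. A small sketch:

pd.concat([one, two], ignore_index=True)   # rows relabeled 0..7
pd.concat([one, two], axis=1)              # columns placed side by side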

5.2 merge()

merge is the equivalent of the JOIN operation in SQL; it joins two data sets on some relationship.

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

We merge the two tables on the name column, using an outer join:

pd.merge(left=registrations, right=logins, how='outer',on='name')

The return result is:

reg_id name log_id
0 1 Andrew 2
1 2 Bobo 4
2 3 Claire NaN
3 4 David NaN
4 NaN Xavier 1
5 NaN Yolanda 3

Note that the how parameter accepts four join types: 'left', 'right', 'outer', and 'inner'. It determines which rows are kept and where NaN values appear. For example, how='left' keeps all rows of the left table; where the right table has no match, its columns are filled with NaN. In simple terms, think of the two tables as sets (an inner-join sketch follows this list):

  • left: all keys of the left table (the left set plus the intersection)
  • right: all keys of the right table (the right set plus the intersection)
  • outer: the union of the two tables' keys
  • inner: the intersection of the two tables' keys
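
As a quick check, an inner join keeps only the names present in both tables:

pd.merge(left=registrations, right=logins, how='inner', on='name')
# only Andrew and Bobo, who appear in both tables, survive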

6. Grouping (GroupBy)

The grouping feature in Pandas closely mirrors the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. It doesn't matter if you're not familiar with SQL: grouping means splitting the table data by some criterion, applying a computation to each group, and combining the results.

Prepare data.

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

As you can see, columns A and B contain a lot of duplicate values. We can now group by foo/bar or by one/two/three.

A B C D
0 foo one 0.270961 0.325415
1 bar one -0.405463 -0.602236
2 foo two 0.348373 -0.134508
3 bar three 0.828572 1.28121
4 foo two 0.696541 -0.33032
5 bar two 0.136352 -1.40384
6 foo one -1.64592 -0.93809
7 foo three -0.69841 1.48804

6.1 Single-column Grouping

We call the groupby method to group the data in the table above.

df.groupby('A')

Running the code above returns a DataFrameGroupBy object. There is nothing to see until we apply an aggregate function to it; see section 4.1 of this article.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>

Let's try the aggregate function sum. (Depending on your pandas version, you may need df.groupby('A').sum(numeric_only=True) to exclude the string column B.)

df.groupby('A').sum()

The display table is as follows:

A C D
bar 0.559461 -0.724868
foo -1.02846 0.410533

6.2 Multi-column grouping

The groupby method supports passing in multiple columns as arguments.

df.groupby(['A', 'B']).sum()

The following information is displayed after grouping:

A B C D
bar one -0.405463 -0.602236
three 0.828572 1.28121
two 0.136352 -1.40384
foo one -1.37496 -0.612675
three -0.69841 1.48804
two 1.04491 -0.464828

6.3 Applying multiple aggregate functions

We call agg() and pass a list of aggregate functions as the argument. The following code groups by A and aggregates only the values of column C.

df.groupby('A')['C'].agg([np.sum, np.mean, np.std])

You can see the results of the aggregate functions for group bar and group foo as follows:

A sum mean std
bar 0.559461 0.186487 0.618543
foo -1.02846 -0.205692 0.957242

6.4 Different aggregation statistics for different columns

The following code applies different aggregations to columns C and D: column C is summed, while column D gets a (sample) standard deviation.

df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})

The output is as follows:

A C D
bar 0.559461 1.37837
foo -1.02846 0.907422
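
Pandas also supports named aggregation, which lets you name the output columns directly; a minimal sketch (the output names c_sum and d_std are just examples):

df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))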

6.5 More

For more about the Pandas groupby method, please refer to the official documentation (pandas.pydata.org/pandas-docs…

3. Advanced operations

1. Reshaping

Reshaping means transforming the structure of a table. Complex tables often need to be transformed into a shape we can digest more easily, such as separate statistics grouped by certain attributes.

1.1 stack() and unstack()

The stack method compresses a table: the index is retained, and the column labels are stacked into a new innermost index level.

Prepare data.

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

The code above creates a composite (multi-level) index:

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

We create a DataFrame with a composite index.

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

The output is as follows:

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

We call the stack method.

stacked = df.stack()
stacked

The stacked (compressed) output is shown below. Note: the layout has been adjusted slightly to display better in Markdown.

first  second
bar    one     A    0.270961
               B   -0.405463
       two     A    0.348373
               B    0.828572
baz    one     A    0.696541
               B    0.136352
       two     A   -1.64592
               B   -0.69841
foo    one     A    0.325415
               B   -0.602236
       two     A   -0.134508
               B    1.28121
qux    one     A   -0.33032
               B   -1.40384
       two     A   -0.93809
               B    1.48804
dtype: float64

We unstack to expand the data.

stacked.unstack()

Output the original table.

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

Let’s add the level parameter.

stacked.unstack(level=0)
#stacked.unstack(level=1)

With level=0, the first level of the index is pivoted into the columns, producing the output below; try level=1 to pivot the second level instead.

second first bar baz foo qux
one A 0.270961 0.696541 0.325415 -0.33032
one B -0.405463 0.136352 -0.602236 -1.40384
two A 0.348373 -1.64592 -0.134508 -0.93809
two B 0.828572 -0.69841 1.28121 1.48804

1.2 pivot_table()

pivot_table creates a pivot table: a table that dynamically arranges and summarizes data.

We generate a DataFrame with categorical-looking columns and no particular index.

np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                    'B': ['A', 'B', 'C'] * 4,
                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D': np.random.randn(12),
                    'E': np.random.randn(12)})
df

The display table is as follows:

A B C D E
0 one A foo 0.142359 0.0235001
1 one B foo 2.05722 0.456201
2 two C foo 0.283262 0.270493
3 three A bar 1.32981 1.43501
4 one B bar 0.154622 0.882817
5 one C bar 0.0690309 0.580082
6 two A foo 0.75518 0.501565
7 three B foo 0.825647 0.590953
8 one C foo 0.113069 0.731616
9 one A bar 2.36784 0.261755
10 two B bar 0.167049 0.855796
11 three C bar 0.685398 0.187526

Looking at the data, we can see that columns A, B, and C carry categorical meaning. Let's run the pivot_table method.

pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])

This uses columns D and E as the value columns, A and B as a composite row index, and the values of C as column labels.

D D E E
C bar foo bar foo
A B
one A 2.36784 0.142359 0.261755 0.0235001
one B 0.154622 2.05722 0.882817 0.456201
one C 0.0690309 0.113069 0.580082 0.731616
three A 1.32981 NaN 1.43501 NaN
three B NaN 0.825647 NaN 0.590953
three C 0.685398 NaN 0.187526 NaN
two A NaN 0.75518 NaN 0.501565
two B 0.167049 NaN 0.855796 NaN
two C NaN 0.283262 NaN 0.270493
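
By default pivot_table aggregates duplicate entries with the mean. The aggfunc parameter changes the aggregation, and fill_value fills the holes; a small sketch:

pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum', fill_value=0)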

2. Time series

date_range is the Pandas method for generating sequences of dates. Let's execute the following code:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

The date_range method generates 100 timestamps at 1-second intervals, starting at 2021-01-01 00:00:00. The output is as follows:

2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ...
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32

Let's try changing the freq parameter from S (second) to M (month end).

rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

Output:

2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ...
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32

We can also generate dates at a quarterly frequency.

prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)

This outputs all quarters between Q1 2018 and Q4 2020.

2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
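
A common follow-up on date-indexed data is resampling to a coarser frequency; a minimal sketch reusing a per-second series like the one above (the names ts_rng and ts are just illustrative):

ts_rng = pd.date_range('1/1/2021', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(ts_rng)), index=ts_rng)
ts.resample('1Min').mean()    # mean of each minute's samples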

3. Categorical data

Pandas has a special data type for categorical variables (dtype="category"). Setting a column to this type lets you work with its values as categories.

Prepare data.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

We add a new column grade and set its data type to category.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

We can see that the grade column has only three categories: a, b, and e.

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

We replace the categories a, b, and e, in that order, with very good, good, and very bad. (Recent pandas versions no longer allow assigning to cat.categories directly, so we use rename_categories.)

df["grade"].cat.categories = ["very good", "good", "very bad"]

The table is as follows:

id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

Let's sort the table in descending order of grade. Note that a categorical column sorts by its category order (very good, good, very bad), not alphabetically:

df.sort_values(by="grade", ascending=False)
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

Count the rows in each category:

df.groupby("grade").size()

The above code output is:

grade
very good    3
good         2
very bad     1
dtype: int64
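If the categories have a natural order, you can make the column an ordered categorical and then compare against a category value; a minimal sketch (assuming the grade column above):

df["grade"] = df["grade"].cat.set_categories(["very bad", "good", "very good"], ordered=True)
df[df["grade"] > "good"]    # rows rated strictly better than good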

4. IO

Pandas supports reading and writing files in CSV, JSON, Excel, and many other formats. The supported formats are listed below.

Format Type Data Description Reader Writer
text CSV read_csv to_csv
text Fixed-Width Text File read_fwf
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary OpenDocument read_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Parquet Format read_parquet to_parquet
binary ORC Format read_orc
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary SPSS read_spss
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google BigQuery read_gbq to_gbq

Let’s just use a CSV file as an example. For other formats, please refer to the table above.

We import data from a CSV file. Don't worry too much about the domain name in the URL below.

df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")

View the first 5 rows of data:

df.head(5)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41 880 129 322 126 8.3252 452600 NEAR BAY
1 -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500 NEAR BAY
2 -122.24 37.85 52 1467 190 496 177 7.2574 352100 NEAR BAY
3 -122.25 37.85 52 1274 235 558 219 5.6431 341300 NEAR BAY
4 -122.25 37.85 52 1627 280 565 259 3.8462 342200 NEAR BAY
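
Writing goes through the matching to_* method; a minimal sketch (the output file name here is just an example):

df.to_csv('housing_copy.csv', index=False)   # index=False skips the row labels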

5. Plotting

Pandas integrates with Matplotlib, a powerful Python visualization tool. This section only briefly introduces the plotting methods Pandas supports; we will cover Matplotlib in detail in the next article. To not miss the update, feel free to follow me.

np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

We plot by calling the plot method directly. There are two caveats:

  1. The plot method here is called on the Pandas object, not on Matplotlib directly.
  2. Python does not require semicolons to end statements. In Jupyter, the trailing semicolon suppresses the text representation of the return value, so only the rendered figure is displayed.
df.plot();

df.plot.bar();

df.plot.bar(stacked=True);
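Other plot kinds hang off the same accessor; for example, a scatter plot of two columns (a and b are columns of the random df above):

df.plot.scatter(x='a', y='b');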

4. More

We will cover Matplotlib in the next post. Please follow the pre-machine learning tutorial series, or my personal blog: blog.caiyongji.com/.