Pandas provides fast, flexible, and expressive data structures and is a powerful Python library for data analysis.
This article is part of the pre-machine-learning tutorial series.
1. Series and DataFrame
Prerequisite reading: Machine Learning: Learn NumPy in 30 Minutes. Pandas is particularly good at handling tabular data, such as SQL tables and Excel sheets; ordered and unordered time series; and arbitrary matrix data with row and column labels.
Open a Jupyter Notebook and import NumPy and Pandas to begin our tutorial:

```python
import numpy as np
import pandas as pd
```
1. pandas.Series
A Series is a one-dimensional ndarray with an index. Index values need not be unique, but they must be hashable.
```python
pd.Series([1, 3, 5, np.nan, 6, 8])
```
Output:
```
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
```
We can see that the default index values are the integers 0 through 5. Now add the index argument to specify the labels 'c', 'a', 'i', 'yong', 'j', 'i'.
```python
pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])
```
In the output below, we can see that index labels can repeat.

```
c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64
```
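Because labels may repeat, selecting a duplicated label returns every matching row. A minimal sketch (the variable name s is just for illustration):

```python
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['c', 'a', 'i', 'yong', 'j', 'i'])
s['i']  # returns both rows labeled 'i': 5.0 and 8.0
```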
2. pandas.DataFrame
A DataFrame is a two-dimensional table structure with rows and columns. You can think of it as a dictionary of multiple Series objects sharing a common index.
```python
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i','ii','iii'], columns=['A', 'B', 'C'])
```
The output table is as follows, where index corresponds to its rows and columns to its columns.
 | A | B | C |
---|---|---|---|
i | 1 | 2 | 3 |
ii | 4 | 5 | 6 |
iii | 7 | 8 | 9 |
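To make the dictionary-of-Series view concrete, here is a small sketch that builds the same table from three Series sharing one index:

```python
d = {'A': pd.Series([1, 4, 7], index=['i', 'ii', 'iii']),
     'B': pd.Series([2, 5, 8], index=['i', 'ii', 'iii']),
     'C': pd.Series([3, 6, 9], index=['i', 'ii', 'iii'])}
pd.DataFrame(d)  # produces the same table as above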
2. Common Pandas operations
1. Access data
Prepare the data: randomly generate a 6×4 two-dimensional array, with row labels running from 2021-01-01 to 2021-01-06 and column labels A, B, C, and D.
```python
import numpy as np
import pandas as pd

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |
1.1 head() and tail()
Look at the first few rows of the table:
```python
df.head(2)
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
Look at the last few rows of the table:
```python
df.tail(3)
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |
1.2 describe()
The describe method generates descriptive statistics for a DataFrame, making it easy to see how the data set is distributed. Note that NaN values are excluded from the statistics.
```python
df.describe()
```
The display is as follows:
 | A | B | C | D |
---|---|---|---|---|
count | 6 | 6 | 6 | 6 |
mean | 0.0825402 | 0.0497552 | -0.181309 | 0.22896 |
std | 0.551412 | 1.07834 | 0.933155 | 1.13114 |
min | -0.816064 | -1.40384 | -1.64592 | -1.2718 |
25% | -0.18 | -0.553043 | -0.737194 | -0.587269 |
50% | 0.298188 | -0.134555 | 0.106933 | 0.287363 |
75% | 0.342885 | 0.987901 | 0.556601 | 1.16805 |
max | 0.696541 | 1.30197 | 0.656281 | 1.48804 |
Let's start by reviewing the mathematical formulas.

Mean: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

Variance: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Standard deviation (std): $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

(These are the sample versions, with n - 1 in the denominator, which is what pandas uses by default.)
Let's go over the meaning of each statistic in the describe output, taking column A as the example.

- count: the number of non-null values. Column A has 6 non-null entries.
- mean: the average. The mean of the non-null values in column A is 0.0825402.
- std: the standard deviation. For column A it is 0.551412.
- min: the minimum. The minimum of column A is -0.816064; that is, 0% of the data is smaller than -0.816064.
- 25%: the first quartile. For column A it is -0.18; that is, 25% of the data is smaller than -0.18.
- 50%: the median. For column A it is 0.298188; that is, 50% of the data is smaller than 0.298188.
- 75%: the third quartile. For column A it is 0.342885; that is, 75% of the data is smaller than 0.342885.
- max: the maximum. The maximum of column A is 0.696541; that is, 100% of the data is at most 0.696541.
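As a sanity check, here is a minimal sketch that recomputes count, mean, and std for column A by hand, following the formulas above (pandas uses the sample standard deviation, ddof=1):

```python
a = df['A']
n = a.count()                                      # 6 non-null values
mean = a.sum() / n                                 # approximately 0.082540
std = np.sqrt(((a - mean) ** 2).sum() / (n - 1))   # approximately 0.551412
print(n, mean, std)
```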
1.3 T
T stands for transpose: converting rows to columns and columns to rows.
```python
df.T
```
The display table is as follows:
 | 2021-01-01 | 2021-01-02 | 2021-01-03 | 2021-01-04 | 2021-01-05 | 2021-01-06 |
---|---|---|---|---|---|---|
A | 0.270961 | 0.696541 | 0.325415 | -0.33032 | 0.348708 | -0.816064 |
B | -0.405463 | 0.136352 | -0.602236 | -1.40384 | 1.27175 | 1.30197 |
C | 0.348373 | -1.64592 | -0.134508 | -0.93809 | 0.626011 | 0.656281 |
D | 0.828572 | -0.69841 | 1.28121 | 1.48804 | -0.253845 | -1.2718 |
1.4 sort_values()
Specify a column to sort by; the following code sorts in ascending order by column C.
```python
df.sort_values(by='C')
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |
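sort_values sorts in ascending order by default; a couple of common variations, as a sketch:

```python
df.sort_values(by='C', ascending=False)  # descending by column C
df.sort_values(by=['A', 'B'])            # sort by A, breaking ties with B
```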
1.5 nlargest()
Select the n rows with the largest values in a given column. df.nlargest(2, 'A') returns the 2 rows with the largest values in column A.
```python
df.nlargest(2,'A')
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
1.6 sample()
The sample method returns a random sample of the data. df.sample(5) returns 5 random rows.
```python
df.sample(5)
```
The parameter frac stands for fraction; frac=0.01 means a random sample of 1% of the rows is returned for display.

```python
df.sample(frac=0.01)
```
2. Select data
2.1 Selection based on labels
We enter df['A'] to select column A.
```python
df['A']
```
Output the data of column A, which is itself a Series object:

```
2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64
```
The code df[0:3] is equivalent to df.head(3). It is NumPy-style array slicing, which shows Pandas's good support for NumPy conventions.
```python
df[0:3]
```
The display table is as follows:
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
Specify row and column labels with the loc method.
```python
df.loc['2021-01-01':'2021-01-02', ['A', 'B']]
```
The display table is as follows:
 | A | B |
---|---|---|
2021-01-01 | 0.270961 | -0.405463 |
2021-01-02 | 0.696541 | 0.136352 |
2.2 Selection based on location
iloc differs from loc: loc selects by specific labels, while iloc selects by the integer positions of the labels. df.iloc[3:5, 0:3] selects the rows at index positions 3 and 4 and the columns at positions 0, 1, and 2; that is, rows 4 and 5 and columns 1 to 3. Note that index positions start at 0. The colon denotes a range whose two sides are the start and end; for example, 3:5 denotes the half-open interval [3, 5), which includes 3 but not 5.
```python
df.iloc[3:5, 0:3]
```
 | A | B | C |
---|---|---|---|
2021-01-04 | -0.33032 | -1.40384 | -0.93809 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 |
```python
df.iloc[:, 1:3]
```
 | B | C |
---|---|---|
2021-01-01 | -0.405463 | 0.348373 |
2021-01-02 | 0.136352 | -1.64592 |
2021-01-03 | -0.602236 | -0.134508 |
2021-01-04 | -1.40384 | -0.93809 |
2021-01-05 | 1.27175 | 0.626011 |
2021-01-06 | 1.30197 | 0.656281 |
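For reading a single cell, at (label-based) and iat (position-based) are the scalar counterparts of loc and iloc; a brief sketch:

```python
df.at[pd.Timestamp('2021-01-01'), 'A']  # one scalar, by row and column label
df.iat[0, 0]                            # the same scalar, by integer position
```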
2.3 Boolean indexing
A DataFrame can be filtered with a condition: rows for which the condition is True are returned, and rows for which it is False are filtered out.
We set up a filter to check whether column A is greater than 0.
```python
filter = df['A'] > 0
filter
```
The output is as follows; you can see that the rows for 2021-01-04 and 2021-01-06 are False.
```
2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool
```
We view the data set through a filter.
```python
df[filter]
# df[df['A'] > 0]
```
Looking at the table, we can see that the rows 2021-01-04 and 2021-01-06 have been filtered out.
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
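Multiple conditions can be combined with & (and), | (or), and ~ (not); each condition must sit in its own parentheses. A short sketch:

```python
df[(df['A'] > 0) & (df['D'] > 0)]  # rows where A and D are both positive
df[~(df['A'] > 0)]                 # rows where A is not positive
```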
3. Handling missing values
Prepare the data.

```python
df2 = df.copy()
df2.loc[df2.index[:3], 'E'] = 1.0  # E is 1.0 for the first three rows, NaN elsewhere
f_series = {'2021-01-02': 1.0, '2021-01-03': 2.0, '2021-01-04': 3.0,
            '2021-01-05': 4.0, '2021-01-06': 5.0}
df2['F'] = pd.Series({pd.Timestamp(k): v for k, v in f_series.items()})
df2
```
The display table is as follows:
 | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 | 1 | nan |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |
2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 | nan | 3 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 | nan | 4 |
2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 | nan | 5 |
3.1 dropna()
Use the dropna method to drop rows with NaN values. Note: dropna returns a new DataFrame and does not alter the original DataFrame.
```python
df2.dropna(how='any')
```

The code above drops a row whenever any of its values is null.
 | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |
3.2 fillna()
Use the fillna method to fill in NaN values.
```python
df2.fillna(df2.mean())
```
The code above fills the gaps with the mean of each column. Likewise, fillna does not update the original DataFrame; to update it, use df2 = df2.fillna(df2.mean()).
The display table is as follows:
 | A | B | C | D | E | F |
---|---|---|---|---|---|---|
2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 | 1 | 3 |
2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |
2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 | 1 | 3 |
2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 | 1 | 4 |
2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 | 1 | 5 |
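Besides filling with column means, other common strategies include a constant or the previous valid value; a brief sketch:

```python
df2.fillna(0)     # replace every NaN with 0
df2.ffill()       # propagate the last valid value in each column forward
df2.isna().sum()  # count the NaN values per column to verify the result
```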
4. Operations
4.1 agg()
agg is short for aggregate.
Common aggregation methods are as follows:
- mean(): Compute mean of groups
- sum(): Compute sum of group values
- size(): Compute group sizes
- count(): Compute count of group
- std(): Standard deviation of groups
- var(): Compute variance of groups
- sem(): Standard error of the mean of groups
- describe(): Generates descriptive statistics
- first(): Compute first of group values
- last(): Compute last of group values
- nth() : Take nth value, or a subset if n is a list
- min(): Compute min of group values
- max(): Compute max of group values
```python
df.mean()
```
Returns the mean of each column:

```
A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64
```
You can view the row average by adding the parameter axis.
```python
df.mean(axis=1)
```
Output:

```
2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64
```
What if we want to see multiple aggregate statistics for one column at once? We can call the agg method:
```python
df.agg(['std','mean'])['A']
```
The returned result shows the standard deviation std and the mean:

```
std     0.551412
mean    0.082540
Name: A, dtype: float64
```
Apply different aggregate functions to different columns:
```python
df.agg({'A':['max','mean'],'B':['mean','std','var']})
```
The result is as follows:
 | A | B |
---|---|---|
max | 0.696541 | nan |
mean | 0.0825402 | 0.0497552 |
std | nan | 1.07834 |
var | nan | 1.16281 |
4.2 apply()
apply() applies a function to the data. For example, df.apply(np.sum) calls np.sum on each column and returns the sum of each column's values.
```python
df.apply(np.sum)
```
The output is:

```
A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64
```
The apply method also supports lambda expressions.
```python
df.apply(lambda n: n*2)
```
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.541923 | -0.810925 | 0.696747 | 1.65714 |
2021-01-02 | 1.39308 | 0.272704 | -3.29185 | -1.39682 |
2021-01-03 | 0.65083 | -1.20447 | -0.269016 | 2.56242 |
2021-01-04 | -0.66064 | -2.80768 | -1.87618 | 2.97607 |
2021-01-05 | 0.697417 | 2.5435 | 1.25202 | -0.50769 |
2021-01-06 | -1.63213 | 2.60393 | 1.31256 | -2.5436 |
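By default apply works column by column; passing axis=1 applies the function to each row instead. For example, a sketch computing the range (max minus min) of every row:

```python
df.apply(lambda row: row.max() - row.min(), axis=1)
```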
4.3 value_counts()
The value_counts method counts how many times each value occurs in a column. We regenerate some integer data to make sure there are duplicates.

```python
np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0, 9, size=(6, 4)), columns=list('ABCD'))
df3
```
 | A | B | C | D |
---|---|---|---|---|
0 | 1 | 6 | 7 | 8 |
1 | 4 | 8 | 5 | 0 |
2 | 5 | 8 | 1 | 3 |
3 | 8 | 3 | 3 | 2 |
4 | 8 | 3 | 7 | 0 |
5 | 7 | 8 | 4 | 3 |
Call the value_counts() method.
```python
df3['A'].value_counts()
```
Looking at the output, we can see that the value 8 appears twice in column A while every other value appears once.
```
8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
```
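value_counts can also report proportions instead of raw counts:

```python
df3['A'].value_counts(normalize=True)  # relative frequency of each value in A
```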
4.4 str
Pandas has built-in vectorized string-handling methods.
```python
names = pd.Series(['andrew','bobo','claire','david','4'])
names.str.upper()
```
The code above converts all strings in the Series to uppercase.
```
0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object
```
Capitalize the first letter:
```python
names.str.capitalize()
```
The output is:
```
0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object
```
Check whether each string is a number:
```python
names.str.isdigit()
```
The output is:
```
0    False
1    False
2    False
3    False
4     True
dtype: bool
```
String splitting:
```python
tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]
```
Splitting the string with commas results in:
```
0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
```
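Other frequently used vectorized string methods include contains and replace; a short sketch (the AAPL correction is only an illustrative example):

```python
tickers.str.contains('GOOG')         # boolean mask: does each element contain 'GOOG'?
tickers.str.replace('APPL', 'AAPL')  # replace a substring in every element
```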
5. Merging
5.1 concat()
concat is used to concatenate data sets. Let's prepare the data first.
```python
data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'], 'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)
```
Use the concat method to concatenate the two data sets:

```python
pd.concat([one, two])
```
Get the table:
 | Col1 | Col2 |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
0 | C0 | D0 |
1 | C1 | D1 |
2 | C2 | D2 |
3 | C3 | D3 |
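Note that both pieces keep their original 0-3 indexes. Passing ignore_index=True rebuilds a clean 0-7 index:

```python
pd.concat([one, two], ignore_index=True)
```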
5.2 merge()
merge is the equivalent of the JOIN operation in SQL; it joins two data sets by some relationship.

```python
registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})
```
We merge the two tables on the name column, specifying how='outer' to take the union:
```python
pd.merge(left=registrations, right=logins, how='outer',on='name')
```
The return result is:
 | reg_id | name | log_id |
---|---|---|---|
0 | 1 | Andrew | 2 |
1 | 2 | Bobo | 4 |
2 | 3 | Claire | nan |
3 | 4 | David | nan |
4 | nan | Xavier | 1 |
5 | nan | Yolanda | 3 |
Note that the how parameter accepts 4 join types: {'left', 'right', 'outer', 'inner'}. It determines which keys are kept and where NaN values appear: for example, 'left' keeps all rows of the left table, and right-table columns without a match are filled with NaN. In simple terms, think of the left and right tables as two sets.
- left: all keys of the left table (the left set, including its intersection with the right)
- right: all keys of the right table (the right set, including its intersection with the left)
- outer: the union of the two tables
- inner: the intersection of the two tables
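For comparison, here is a sketch of an inner join on the same data, which keeps only the names that appear in both tables:

```python
pd.merge(left=registrations, right=logins, how='inner', on='name')
# only Andrew and Bobo are present in both tables
```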
6. Grouping with groupby
The grouping functionality in Pandas closely mirrors the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. It doesn't matter if you're not familiar with SQL: grouping is simply the process of splitting table data by certain columns, computing statistics per group, and combining the results.
Prepare data.
```python
np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
```
As you can see, columns A and B contain many duplicate values. We can now group by foo/bar or by one/two/three.
 | A | B | C | D |
---|---|---|---|---|
0 | foo | one | 0.270961 | 0.325415 |
1 | bar | one | -0.405463 | -0.602236 |
2 | foo | two | 0.348373 | -0.134508 |
3 | bar | three | 0.828572 | 1.28121 |
4 | foo | two | 0.696541 | -0.33032 |
5 | bar | two | 0.136352 | -1.40384 |
6 | foo | one | -1.64592 | -0.93809 |
7 | foo | three | -0.69841 | 1.48804 |
6.1 Single-column Grouping
We apply the groupby method to group the data in the table above.

```python
df.groupby('A')
```
Executing the code above, the groupby method returns an object of type DataFrameGroupBy. We can't view its contents directly, so we need to apply an aggregate function to it (see section 4.1 of this article).
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>
```
Let’s try the aggregate function sum.
```python
df.groupby('A').sum()
```
The display table is as follows:
A | C | D |
---|---|---|
bar | 0.559461 | -0.724868 |
foo | -1.02846 | 0.410533 |
6.2 Multi-column grouping
The groupby method supports passing in multiple columns as arguments:

```python
df.groupby(['A', 'B']).sum()
```
The following information is displayed after grouping:
A | B | C | D |
---|---|---|---|
bar | one | -0.405463 | -0.602236 |
 | three | 0.828572 | 1.28121 |
 | two | 0.136352 | -1.40384 |
foo | one | -1.37496 | -0.612675 |
 | three | -0.69841 | 1.48804 |
 | two | 1.04491 | -0.464828 |
6.3 Applying multiple aggregate functions
We call agg(), passing a list of aggregate functions as the argument. The following code groups by A and aggregates only the values of column C:

```python
df.groupby('A')['C'].agg([np.sum, np.mean, np.std])
```
You can see the results of the aggregate functions for group bar and group foo as follows:
A | sum | mean | std |
---|---|---|---|
bar | 0.559461 | 0.186487 | 0.618543 |
foo | -1.02846 | -0.205692 | 0.957242 |
6.4 Different aggregation statistics for different columns
The following code applies different aggregate statistics to columns C and D: summing column C and computing the sample standard deviation of column D.

```python
df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})
```
The output is as follows:
A | C | D |
---|---|---|
bar | 0.559461 | 1.37837 |
foo | -1.02846 | 0.907422 |
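Recent pandas versions also support named aggregation, which lets you name the output columns directly; a sketch (the column names c_sum and d_std are arbitrary):

```python
df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))
```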
6.5 More
For more on the Pandas groupby method, refer to the official documentation (pandas.pydata.org/pandas-docs…).
3. Advanced usage
1. Reshaping
Reshaping transforms the structure of a table. Complex tables may need to be transformed into a form we can better understand, such as separate statistics grouped by certain attributes.
1.1 stack() and unstack()
The stack method compresses a table's columns into the row index: the index columns are retained and the data values are stacked vertically.
Prepare data.
```python
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
```
Based on the above code, we create a composite index.
```
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
```
We create a DataFrame with a composite index.
```python
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
```
The output is as follows:
first | second | A | B |
---|---|---|---|
bar | one | 0.270961 | -0.405463 |
 | two | 0.348373 | 0.828572 |
baz | one | 0.696541 | 0.136352 |
 | two | -1.64592 | -0.69841 |
foo | one | 0.325415 | -0.602236 |
 | two | -0.134508 | 1.28121 |
qux | one | -0.33032 | -1.40384 |
 | two | -0.93809 | 1.48804 |
We apply the stack method.
```python
stacked = df.stack()
stacked
```
The stacked (compressed) output is shown in the following table. Note: Your output using Jupyter Notebook/Lab may be different from the following. The output below has been adjusted to make it easier to display in Markdown.
```
first  second
bar    one     A    0.942502
bar    one     B    0.060742
bar    two     A    1.340975
bar    two     B   -1.712152
baz    one     A    1.899275
baz    one     B    1.237799
baz    two     A   -1.589069
baz    two     B    1.288342
foo    one     A   -0.326792
foo    one     B    1.576351
foo    two     A    1.526528
foo    two     B    1.410695
qux    one     A    0.420718
qux    one     B   -0.288002
qux    two     A    0.361586
qux    two     B    0.177352
dtype: float64
```
We unstack to expand the data.
```python
stacked.unstack()
```
Output the original table.
first | second | A | B |
---|---|---|---|
bar | one | 0.270961 | -0.405463 |
 | two | 0.348373 | 0.828572 |
baz | one | 0.696541 | 0.136352 |
 | two | -1.64592 | -0.69841 |
foo | one | 0.325415 | -0.602236 |
 | two | -0.134508 | 1.28121 |
qux | one | -0.33032 | -1.40384 |
 | two | -0.93809 | 1.48804 |
Let’s add the level parameter.
```python
stacked.unstack(level=0)
# stacked.unstack(level=1)
```
With level=0, the first index level becomes the columns and the output is as follows (with level=1, the second level would become the columns instead):
second | | bar | baz | foo | qux |
---|---|---|---|---|---|
one | A | 0.942502 | 1.89927 | -0.326792 | 0.420718 |
one | B | 0.060742 | 1.2378 | 1.57635 | -0.288002 |
two | A | 1.34097 | -1.58907 | 1.52653 | 0.361586 |
two | B | -1.71215 | 1.28834 | 1.4107 | 0.177352 |
1.2 pivot_table()
pivot_table creates a pivot table: a table format that dynamically arranges and summarizes the data.
We generate a DataFrame whose grouping attributes are stored in ordinary columns rather than in the index.
```python
np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
```
The display table is as follows:
 | A | B | C | D | E |
---|---|---|---|---|---|
0 | one | A | foo | 0.142359 | 0.0235001 |
1 | one | B | foo | 2.05722 | 0.456201 |
2 | two | C | foo | 0.283262 | 0.270493 |
3 | three | A | bar | 1.32981 | 1.43501 |
4 | one | B | bar | 0.154622 | 0.882817 |
5 | one | C | bar | 0.0690309 | 0.580082 |
6 | two | A | foo | 0.75518 | 0.501565 |
7 | three | B | foo | 0.825647 | 0.590953 |
8 | one | C | foo | 0.113069 | 0.731616 |
9 | one | A | bar | 2.36784 | 0.261755 |
10 | two | B | bar | 0.167049 | 0.855796 |
11 | three | C | bar | 0.685398 | 0.187526 |
By observing the data, we can obviously conclude that columns A, B and C have certain attribute meanings. We perform the pivot_table method.
```python
pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])
```
Use columns D and E as data columns, A and B as composite row indexes, and C’s data values as column indexes.
 | ('D', 'bar') | ('D', 'foo') | ('E', 'bar') | ('E', 'foo') |
---|---|---|---|---|
('one', 'A') | 2.36784 | 0.142359 | 0.261755 | 0.0235001 |
('one', 'B') | 0.154622 | 2.05722 | 0.882817 | 0.456201 |
('one', 'C') | 0.0690309 | 0.113069 | 0.580082 | 0.731616 |
('three', 'A') | 1.32981 | nan | 1.43501 | nan |
('three', 'B') | nan | 0.825647 | nan | 0.590953 |
('three', 'C') | 0.685398 | nan | 0.187526 | nan |
('two', 'A') | nan | 0.75518 | nan | 0.501565 |
('two', 'B') | 0.167049 | nan | 0.855796 | nan |
('two', 'C') | nan | 0.283262 | nan | 0.270493 |
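pivot_table aggregates with the mean by default; the aggfunc parameter chooses a different function and fill_value replaces the NaN holes. A sketch:

```python
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
               aggfunc='sum', fill_value=0)
```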
2. Time series
date_range is the Pandas method for generating date ranges. We execute the following code:
```python
rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
```
The date_range method generated 100 timestamps at 1-second intervals, starting from 2021-01-01 00:00:00. The output is as follows:

```
2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ...
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32
```
Let’s try changing the freq parameter from S(second) to M(Month).
```python
rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
```
Output:

```
2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ...
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32
```
We set the date generation to be quarterly as a frequency.
```python
prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)
```
All quarters between Q1 2018 and Q4 2020 are output:

```
2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
```
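A date-indexed series can also be resampled to a coarser frequency; a minimal sketch that downsamples the earlier per-second series into per-minute sums:

```python
rng = pd.date_range('1/1/2021', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('1min').sum()  # two rows: the totals for minutes 00:00 and 00:01
```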
3. Categorical data
Pandas has a special data type for categories (dtype="category"). Setting certain columns to the category type lets you treat their values as classification labels.
Prepare data.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
Copy the code
 | id | raw_grade |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | b |
3 | 4 | a |
4 | 5 | a |
5 | 6 | e |
We add a new column grade and set its data type to category.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Copy the code
We can see that the grade column takes only three category values: a, b, and e.
```
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
```
We replace a, b, and e, in that order, with very good, good, and very bad.

```python
df["grade"].cat.categories = ["very good", "good", "very bad"]
```
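Note: assigning to .cat.categories directly, as above, was removed in pandas 2.0 (an assumption worth checking against your installed version); the equivalent there is rename_categories:

```python
# equivalent for newer pandas versions
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
```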
The table is as follows:
 | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
Let’s sort the table:
df.sort_values(by="grade", ascending=False)
Copy the code
 | id | raw_grade | grade |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
View the number of categories:
df.groupby("grade").size()
Copy the code
The output of the code above is:

```
grade
very good    3
good         2
very bad     1
dtype: int64
```
4. IO
Pandas supports reading and writing data in CSV, JSON, Excel, and other formats. The file formats supported by Pandas are listed below.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | Fixed-Width Text File | read_fwf | |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | OpenDocument | read_excel | |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | ORC Format | read_orc | |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | SPSS | read_spss | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google BigQuery | read_gbq | to_gbq |
Let’s just use a CSV file as an example. For other formats, please refer to the table above.
We import data from a CSV file. Don’t worry too much about the domain name of the url below.
```python
df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")
```
View the first 5 rows of data:
```python
df.head(5)
```
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
1 | -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
2 | -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
3 | -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
4 | -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
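Writing is symmetrical to reading: to_csv saves a DataFrame, and index=False omits the row index. The file name below is just an example:

```python
df.to_csv('housing_copy.csv', index=False)
```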
5. Plotting
Pandas supports Matplotlib, a powerful Python visualization tool. This section only briefly introduces the plotting methods that Pandas supports; we will cover Matplotlib in detail in the next article. Follow along so you don't miss the update.
```python
np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
```
Let's display a chart by calling the plot method directly. There are two caveats:
- The plot method here is called through Pandas, not through Matplotlib directly.
- We know that the Python language does not require semicolons to end statements. The trailing semicolon here suppresses the textual return value in Jupyter so that only the rendered chart is displayed.
```python
df.plot();
```

A bar chart:

```python
df.plot.bar();
```

A stacked bar chart:

```python
df.plot.bar(stacked=True);
```
4. More
We will cover Matplotlib in the next post. Follow the pre-machine-learning tutorial series, or my personal blog blog.caiyongji.com, to stay up to date.