After obtaining or generating a data source, and before analyzing it, we should get familiar with the data and form a general picture of its shape and structure.
The Excel file used in this article is as follows:
1. Preview data
When a file holds a large amount of data, or a database table has a large number of records, it is not advisable to load everything at once or to inspect the records one by one.
head(): previews the first few rows, 5 by default. Pass an integer, for example head(7), to change the number of rows.
import pandas as pd
df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.head())
result:
   Region         Province   City      Time        Indicator
0  …              …          …         2019-09-06  12
1  Northwest      Shaanxi    Xi'an     2019-09-07  87
2  South China    Guangdong  Shenzhen  …           23
3  North China    Beijing    Beijing   2021-05-13  45
4  Central China  Hubei      Wuhan     2012-04-13  21
df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.head(7))
result:
   Region         Province      City      Time        Indicator
0  …              …             …         2019-09-06  12
1  Northwest      Shaanxi       Xi'an     2019-09-07  87
2  South China    Guangdong     Shenzhen  …           23
3  North China    Beijing       Beijing   2021-05-13  45
4  Central China  Hubei         Wuhan     2012-04-13  21
5  Northeast      Heilongjiang  Harbin    2019-09-11  42
6  Northwest      Gansu         Lanzhou   2019-09-12  3
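To try head() without the Excel file, the sketch below builds a small stand-in DataFrame in memory. The rows are copied from the results shown above, and the variable names first_three and last_two are only illustrative. It also uses tail(), the counterpart of head() that previews rows from the end of the table.

```python
import pandas as pd

# Stand-in for data_test.xlsx: a few rows copied from the results above.
df = pd.DataFrame(
    {
        "Region": ["Northwest", "North China", "Central China", "Northeast", "Northwest"],
        "Province": ["Shaanxi", "Beijing", "Hubei", "Heilongjiang", "Gansu"],
        "City": ["Xi'an", "Beijing", "Wuhan", "Harbin", "Lanzhou"],
        "Time": pd.to_datetime(
            ["2019-09-07", "2021-05-13", "2012-04-13", "2019-09-11", "2019-09-12"]
        ),
        "Indicator": [87, 45, 21, 42, 3],
    }
)

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

Both calls return a new DataFrame and leave df unchanged, so they are safe to use for a quick look at any point in an analysis.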
2. Obtain the size of the data table
shape: gets the size of the table, that is, how many rows and columns it has. It is an attribute rather than a method, so it is written without parentheses. Its value is a tuple of (number of rows, number of columns)
df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.shape)
result:
(12, 5)
Note: the row count does not include the header row, so it is 1 less than the number of rows in the Excel sheet
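As a small illustration of how the shape tuple can be used, the sketch below unpacks it into two variables. The three-row DataFrame is made up for the example and is not the Excel file from this article.

```python
import pandas as pd

# Small made-up table: 3 data rows and 2 columns.
df = pd.DataFrame(
    {"City": ["Xi'an", "Shenzhen", "Beijing"], "Indicator": [87, 23, 45]}
)

n_rows, n_cols = df.shape  # unpack the (rows, columns) tuple
print(df.shape)            # (3, 2)

# len(df) gives the row count, len(df.columns) the column count.
print(len(df), len(df.columns))
```

The column names act as the header and are not counted as a row, which is why shape reports 3 rows here.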
3. Obtain the data type
info(): shows the data type of each column in the current table, together with the number of non-null values in each column
df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
# info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()
result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Region     12 non-null     object
 1   Province   12 non-null     object
 2   City       12 non-null     object
 3   Time       12 non-null     datetime64[ns]
 4   Indicator  12 non-null     int64
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 608.0+ bytes
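When only the column types are needed, the dtypes attribute and the select_dtypes() method are lighter alternatives to the full summary. The sketch below is a minimal example on a made-up three-row table that borrows the column names from this article. The names numeric_cols and datetime_cols are only illustrative.

```python
import pandas as pd

# Made-up table with one text, one datetime, and one integer column.
df = pd.DataFrame(
    {
        "City": ["Xi'an", "Beijing", "Wuhan"],
        "Time": pd.to_datetime(["2019-09-07", "2021-05-13", "2012-04-13"]),
        "Indicator": [87, 45, 21],
    }
)

# dtypes is a Series that maps each column name to its data type.
print(df.dtypes)

# select_dtypes() keeps only the columns of the requested kinds.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()
print(numeric_cols)
print(datetime_cols)
```

Knowing which columns are numeric is useful before computing statistics, because text and date columns need different handling.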
4. Obtain the numerical distribution
describe(): gets the distribution of the numeric columns in the current data table
df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.describe())
result:
       Indicator
count  12.000000
mean   34.916667
std    21.773455
min     3.000000
25%    22.500000
50%    32.000000
75%    42.750000
max    87.000000
Note: because Indicator is the only numeric column in the table, the results show statistics for that column only
Explanation of the output:
count is the number of non-null values; there are 12 records in this case
mean is the arithmetic mean
std is the standard deviation
min is the minimum value
max is the maximum value
The remaining three rows are percentiles:
25% is the 25th percentile (the first quartile)
50% is the 50th percentile (the median)
75% is the 75th percentile (the third quartile)
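To see how these rows relate to the individual pandas methods, the sketch below runs describe() on a single made-up column of five values and compares a few entries with the results of median(), std() and quantile(). The values are illustrative and do not reproduce the 12 records from the Excel file.

```python
import pandas as pd

# Made-up Indicator values, used only to illustrate describe().
s = pd.Series([3, 21, 42, 45, 87], name="Indicator")

summary = s.describe()  # Series indexed by count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# The 50% row is the median.
print(summary["50%"] == s.median())

# The std row matches Series.std(), which computes the sample standard deviation.
print(abs(summary["std"] - s.std()) < 1e-9)

# The 25% row matches the 0.25 quantile.
print(summary["25%"] == s.quantile(0.25))
```

Calling describe() on a single column like this is also a simple way to restrict the summary to one column of a larger table.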