After obtaining or generating a data source, and before analyzing it, we should get familiar with the data and form a general understanding of its shape and structure.

The Excel file used in this article is as follows:

1. Preview data

When a file holds a lot of data, or a database table has many records, it is not advisable to load everything at once or to inspect it record by record.

head(): previews the first few rows of the table; by default it returns the first 5 rows.

import pandas as pd


df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.head())

result:

          Region   Province      City        Time  Indicator
0              …          …         …  2019-09-06         12
1      Northwest    Shaanxi     Xi'an  2019-09-07         87
2    South China  Guangdong  Shenzhen           …         23
3    North China    Beijing   Beijing  2021-05-13         45
4  Central China      Hubei     Wuhan  2012-04-13         21

Passing a number to head() sets how many rows are previewed, for example 7:

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.head(7))

result:

          Region      Province      City        Time  Indicator
0              …             …         …  2019-09-06         12
1      Northwest       Shaanxi     Xi'an  2019-09-07         87
2    South China     Guangdong  Shenzhen           …         23
3    North China       Beijing   Beijing  2021-05-13         45
4  Central China         Hubei     Wuhan  2012-04-13         21
5      Northeast  Heilongjiang    Harbin  2019-09-11         42
6      Northwest         Gansu   Lanzhou  2019-09-12          3
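The behaviour of head() can also be tried without the Excel file. The following is a minimal sketch on a small in-memory table: the column names follow the article's table, but the values and the variable names (first_five, first_three, last_two) are made up for illustration. It also uses tail(), the counterpart of head() for the end of a table, which the article does not otherwise cover.

```python
import pandas as pd

# Hypothetical stand-in data; not the contents of data_test.xlsx.
df = pd.DataFrame({
    "Region": ["Northwest", "South China", "North China", "Central China",
               "Northeast", "Northwest", "East China", "Southwest"],
    "Indicator": [12, 87, 23, 45, 21, 42, 3, 30],
})

first_five = df.head()    # no argument: the first 5 rows
first_three = df.head(3)  # explicit row count
last_two = df.tail(2)     # the last 2 rows

print(first_five)
print(len(first_three), len(last_two))
```

Both methods return a new DataFrame, so the preview can be inspected or printed without touching the original table.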

2. Obtain the size of the data table

shape: gets the size of the table, that is, how many rows and columns it has. The result is a tuple of the form (number of rows, number of columns). Note that shape is an attribute, not a method, so it is written without parentheses.

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.shape)

result:

(12, 5)

Note: shape counts data rows only. The header row that holds the column names is not included, so the row count is 1 less than the number of rows in the Excel sheet.
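Because shape is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a small made-up table (the names n_rows and n_cols are illustrative) and shows that the header is not counted as a row.

```python
import pandas as pd

# Hypothetical three-row table: one header line plus three data rows.
df = pd.DataFrame({
    "City": ["Xi'an", "Shenzhen", "Beijing"],
    "Indicator": [12, 87, 23],
})

n_rows, n_cols = df.shape  # unpack (number of rows, number of columns)
print(n_rows, n_cols)

# len(df) gives the same row count; len(df.columns) gives the column count.
print(len(df) == n_rows, len(df.columns) == n_cols)
```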

3. Obtain the data type

info(): gets the data type of each column in the current table, along with each column's non-null count and the table's memory usage

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
# info() prints its report directly and returns None, so no print() is needed
df.info()

result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Region     12 non-null     object
 1   Province   12 non-null     object
 2   City       12 non-null     object
 3   Time       12 non-null     datetime64[ns]
 4   Indicator  12 non-null     int64
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 608.0+ bytes
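When only the column types are needed, the dtypes attribute returns them as a Series, and select_dtypes() filters columns by type. This is a minimal sketch on a made-up two-row table. The column names follow the article, while the values and the variable names (dtypes, numeric_cols) are illustrative.

```python
import pandas as pd

# Hypothetical stand-in data with a text, a datetime and an integer column.
df = pd.DataFrame({
    "City": ["Xi'an", "Shenzhen"],
    "Time": pd.to_datetime(["2019-09-06", "2019-09-07"]),
    "Indicator": [12, 87],
})

dtypes = df.dtypes  # Series: column name -> dtype
print(dtypes)

# Keep only the numeric columns; datetime columns are not counted as numeric.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```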

4. Obtain the numerical distribution

describe(): gets summary statistics that describe the distribution of the numeric columns in the current table

df = pd.read_excel(r'C:\Users\admin\Desktop\data_test.xlsx')
print(df.describe())

result:

       Indicator
count  12.000000
mean   34.916667
std    21.773455
min     3.000000
25%    22.500000
50%    32.000000
75%    42.750000
max    87.000000

Note: Because Indicator is the only numeric column in the table, the results show statistics for that column only.

Explanation of the output:

count is the number of non-null values; here there are 12 records.

mean is the arithmetic mean.

std is the sample standard deviation.

min is the minimum value.

max is the maximum value.

The remaining three rows are percentiles, which describe how the values are spread between the minimum and the maximum:

25% is the 25th percentile (the first quartile).

50% is the 50th percentile (the median).

75% is the 75th percentile (the third quartile).
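To see how these rows relate to the individual statistics methods, the sketch below recomputes some of them by hand. It uses only the seven Indicator values visible in the head(7) preview, not the full 12-row file, so the numbers differ from the output above. The variable names are illustrative.

```python
import pandas as pd

# The seven Indicator values shown in the head(7) preview (a subset of the file).
values = [12, 87, 23, 45, 21, 42, 3]
df = pd.DataFrame({"Indicator": values})

stats = df["Indicator"].describe()   # Series indexed by count, mean, std, ...

median = df["Indicator"].median()    # should match the 50% row
q1 = df["Indicator"].quantile(0.25)  # should match the 25% row
q3 = df["Indicator"].quantile(0.75)  # should match the 75% row
sample_std = df["Indicator"].std()   # std() uses the sample formula (ddof=1)

print(stats)
print(stats["50%"] == median)
```

Individual entries can be read from the result by label, for example stats["mean"] or stats["75%"].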