Upcoming posts will focus on data processing and visualization with Pandas, NumPy, and Matplotlib. If you have suggestions for topics, leave them in the comments below or send me a private message.

In this article, I will introduce the basic usage of Pandas, a software package that is very useful for handling data in Python.

1. Read data

Most data can be read with the read_csv() function. Its sep parameter sets the field separator and defaults to "," (since most CSV files are comma-separated).

import pandas as pd

# Read the data; the fields in this file are separated by "|"
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user",
                    sep='|')
users

Original data:

Snipaste_2020-06-13_08-22-39.png

Data after reading:

Snipaste_2020-06-13_08-26-03.png

In addition to read_csv, there is a commonly used read_table function that reads data in the same way; the main difference is that its default separator is a tab.
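As a minimal sketch of the sep parameter, the snippet below parses a few pipe-separated lines from an in-memory string rather than from the URL above. The sample rows are made up for illustration and only mimic the layout of u.user.

```python
import io
import pandas as pd

# Hypothetical sample rows that mimic the layout of u.user (made-up values).
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# read_csv splits on "," by default, so the separator is given explicitly.
users_csv = pd.read_csv(io.StringIO(raw), sep="|")

# read_table works the same way, but its default separator is a tab.
users_table = pd.read_table(io.StringIO(raw), sep="|")

print(users_csv.shape)                 # (3, 5)
print(users_csv.equals(users_table))   # True
```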

2. Change the index and show only the first few rows

The set_index() function changes the index; inplace=True modifies the DataFrame directly instead of returning a new one. Use the head(n) function to show only the first n rows of data

users.set_index('user_id',inplace = True)
users.head(25)
Snipaste_2020-06-13_08-26-13.png

tail(n) shows only the last n rows of data.
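The following sketch, built on a small made-up DataFrame, shows set_index() without inplace=True (it then returns a new DataFrame) together with head(n) and tail(n). The values are hypothetical and not taken from u.user.

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [24, 53, 23, 24, 33],
    "occupation": ["technician", "other", "writer", "technician", "other"],
})

# Without inplace=True, set_index returns a new DataFrame and leaves users unchanged.
indexed = users.set_index("user_id")

print(indexed.head(2))   # rows with user_id 1 and 2
print(indexed.tail(2))   # rows with user_id 4 and 5
```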

3. View basic information about rows and columns of data

1, shape returns the number of rows and columns of the data, as a tuple;

users.shape

# (943, 4)

2, columns returns the column names of the data;

users.columns

# Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

3, index returns the row labels;

users.index

# Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
#             ...
#             934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
#            dtype='int64', name='user_id', length=943)

4, dtypes returns the data type of each column;

users.dtypes


# age            int64
# gender        object
# occupation    object
# zip_code      object
# dtype: object
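The four items above are attributes rather than methods, so they are accessed without parentheses. A minimal sketch on a made-up three-row DataFrame (hypothetical values):

```python
import pandas as pd

# Hypothetical sample data with user_id as the index.
users = pd.DataFrame(
    {"age": [24, 53, 23], "gender": ["M", "F", "M"]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

n_rows, n_cols = users.shape   # a (rows, columns) tuple
print(n_rows, n_cols)          # 3 2
print(list(users.columns))     # ['age', 'gender']
print(list(users.index))       # [1, 2, 3]
print(users.dtypes["age"])     # int64
```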

4. Select only one or more columns of data

Pandas provides several ways to select columns. In the examples below, users is a DataFrame, the tabular data structure that Pandas works with.

1, users.column_name (returns a Series);

users.occupation

2, users[['column_name']] (returns a one-column DataFrame);

users[['occupation']]

3, users.loc[:, ['column_name']];

users.loc[:,['occupation']]
Snipaste_2020-06-13_10-39-00.png

To select multiple columns at once:

1, users[['column_1', 'column_2']];

users[['occupation', 'age']]

2, users.loc[:, ['column_1', 'column_2']];

users.loc[:, ['occupation', 'age']]
Snipaste_2020-06-13_20-49-34.png
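One practical difference between these forms is the type of the result: single brackets (or attribute access) give a Series, while a list of names gives a DataFrame. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({
    "age": [24, 53, 23],
    "occupation": ["technician", "other", "writer"],
})

as_series = users["occupation"]     # same result as users.occupation
as_frame = users[["occupation"]]    # a list of names keeps the DataFrame type
two_cols = users.loc[:, ["occupation", "age"]]   # columns in the order listed

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(list(two_cols.columns))     # ['occupation', 'age']
```

The bracket form also works when a column name contains spaces or clashes with a DataFrame method name, where attribute access does not.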

5. Count the distinct values in a column

1, users.column_name.nunique() returns the number of distinct values in a particular column;

users.occupation.nunique()


# 21

The same result can be obtained this way:

users.column_name.value_counts().count()

users.occupation.value_counts().count()


# 21

Building on 1, to see how many times each distinct value appears in the column, use the following statement:

users.column_name.value_counts()

users.occupation.value_counts().head()


# student          196
# other            105
# educator          95
# administrator     79
# engineer          67
# Name: occupation, dtype: int64
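To make the relationship between nunique() and value_counts() concrete, here is a sketch on a short made-up Series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical occupations (made-up values).
occupation = pd.Series(
    ["student", "other", "student", "educator", "student", "other"],
    name="occupation",
)

counts = occupation.value_counts()   # sorted by frequency, highest first

print(occupation.nunique())   # 3
print(counts.count())         # 3
print(counts["student"])      # 3
print(counts.index[0])        # student
```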

6. Compute simple summary statistics for the data

By default, describe() summarizes only the numeric columns (those whose values are numbers)

users.describe()
Snipaste_2020-06-13_20-49-55.png

To summarize all columns, add the argument include='all':

users.describe(include = 'all')
Snipaste_2020-06-13_20-50-02.png

describe() can also be applied to a specific column:

users.occupation.describe()

# count         943
# unique         21
# top       student
# freq          196
# Name: occupation, dtype: object
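The difference between the default call and include='all' can be sketched on a tiny made-up DataFrame with one numeric and one text column (hypothetical values):

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({
    "age": [20, 30, 40],
    "occupation": ["student", "student", "other"],
})

numeric_summary = users.describe()             # numeric columns only
full_summary = users.describe(include="all")   # every column

print(list(numeric_summary.columns))           # ['age']
print(numeric_summary.loc["mean", "age"])      # 30.0
print(full_summary.loc["top", "occupation"])   # student
print(full_summary.loc["freq", "occupation"])  # 2
```

With include='all', statistics that do not apply to a column (for example the mean of a text column) are reported as NaN.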

7. Group and aggregate the data

The groupby() function groups the rows by the values of a column and returns a GroupBy object. This is similar to the methods in section 5; the difference is that after grouping you can compute statistics on the other columns within each group

c = users.groupby("occupation")
c

# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017673002788>

GroupBy.head(n) returns the first n rows of each group

c.head(5)

GroupBy.count() counts, for each group, the non-null values in each of the other columns

c.count()

GroupBy.size() returns the number of rows in each group

c.size()

GroupBy objects support other operations as well:

Snipaste_2020-06-13_10-33-50.png

For details, see the official documentation: https://pandas.pydata.org/docs/reference/groupby.html
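The difference between count() and size() only shows up when a group contains missing values. The sketch below uses a made-up DataFrame with one missing age (hypothetical values) and also computes a per-group mean as one example of the other available operations:

```python
import pandas as pd

# Hypothetical sample data; one age is missing on purpose.
users = pd.DataFrame({
    "occupation": ["student", "other", "student", "other", "student"],
    "age": [20, 40, 22, None, 24],
})

grouped = users.groupby("occupation")

print(grouped.size()["other"])               # 2   (rows in the group)
print(grouped.count().loc["other", "age"])   # 1   (non-null ages in the group)
print(grouped["age"].mean()["student"])      # 22.0
```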

8. Sort the data by a column

The sort_values() function sorts in ascending order by default. Set ascending=False to sort in descending order.

users.sort_values(["age"],ascending = False)

You can also sort by referring to multiple columns:

users.sort_values(["age", "zip_code"], ascending=False)
double_columns_sort.png
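When sorting by several columns, ascending can also be a list with one flag per column. A minimal sketch on made-up data (hypothetical values; zip_code is kept as text, so it is compared as a string):

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({
    "age": [30, 25, 30, 25],
    "zip_code": ["20000", "10000", "10000", "30000"],
})

# Oldest first; ties on age are broken by zip_code in ascending order.
ordered = users.sort_values(["age", "zip_code"], ascending=[False, True])

print(ordered["zip_code"].tolist())   # ['10000', '20000', '10000', '30000']
print(users["age"].tolist())          # [30, 25, 30, 25]  (original is unchanged)
```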

9. Create a new column

Adding a new column is easy. Create a Series with the same number of rows as the original data and assign it to the DataFrame under a new column name:

data['column_name'] = newly created Series. Next, I normalize the age column and store the result in a new column, age_normalize

Snipaste_2020-06-13_10-57-10.png
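As a rough sketch of this step in text form: the snippet below assumes min-max scaling, which may differ from the formula used in the screenshot above, and runs on made-up ages.

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({"age": [20, 30, 40]})

# Min-max scaling is an assumption here; the article's own formula is not shown in the text.
age = users["age"]
users["age_normalize"] = (age - age.min()) / (age.max() - age.min())

print(users["age_normalize"].tolist())   # [0.0, 0.5, 1.0]
```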

10. Delete the specified column

The drop() function removes the specified column. It returns a new DataFrame and leaves the source data unchanged unless inplace=True is passed

users.drop(['age'],axis = 1)

The axis parameter specifies whether rows or columns are deleted. The default is 0; 0 means rows and 1 means columns. You can also use the following command:

users.drop(columns =['age'])
drop_columns.png
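A short sketch on made-up data, showing that the two forms are equivalent, that the original DataFrame is not modified, and what the default axis=0 does (hypothetical values):

```python
import pandas as pd

# Hypothetical sample data (made-up values).
users = pd.DataFrame({
    "age": [24, 53],
    "gender": ["M", "F"],
    "occupation": ["technician", "other"],
})

without_age = users.drop(columns=["age"])   # same as users.drop(["age"], axis=1)

print(list(without_age.columns))   # ['gender', 'occupation']
print(list(users.columns))         # ['age', 'gender', 'occupation']  (unchanged)

# axis=0 (the default) drops rows by index label instead.
without_first_row = users.drop([0])
print(len(without_first_row))      # 1
```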
