In upcoming posts, I will focus on data processing and visualization with Pandas, NumPy, and Matplotlib. If you have suggestions for what future posts should cover, leave them in the comments below or send me a private message.
In this article, I will introduce the basic usage of Pandas, a software package that is very useful for handling data in Python.
1. Read data
Most data can be read with the read_csv() function. Its sep parameter sets the field separator and defaults to "," (since most CSV files are comma-separated). The file used here is separated by "|", so sep is set explicitly.
import pandas as pd
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user", sep='|')  # read the data
users
Original data:
Data after reading:
In addition to read_csv, there is a common read_table function that reads data in a similar way.
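To try the sep parameter without downloading anything, here is a minimal sketch that reads a small in-memory string. The sample rows are hypothetical and only mimic the pipe-separated layout of the file above.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the pipe-separated layout of u.user
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# sep="|" tells read_csv to split fields on the pipe character
sample = pd.read_csv(io.StringIO(raw), sep="|")

# read_table is the same kind of reader with a tab as its default separator,
# so passing sep explicitly should give an identical result
sample_table = pd.read_table(io.StringIO(raw), sep="|")

print(sample.shape)
print(sample.equals(sample_table))
```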
2. Change the index and show only the first few rows
The set_index() function changes the index; passing inplace=True modifies the DataFrame itself instead of returning a new one. Use the head(n) function to show only the first n rows of data.
users.set_index('user_id',inplace = True)
users.head(25)
tail(n) shows only the last n rows of data.
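As a rough illustration of what inplace changes, the sketch below calls set_index without it on a hypothetical four-row table (the demo data is made up, not taken from the article's file), then takes the first and last rows.

```python
import pandas as pd

# Hypothetical miniature version of the users table
demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [24, 53, 23, 24],
    "occupation": ["technician", "other", "writer", "technician"],
})

# Without inplace=True, set_index returns a new DataFrame and leaves demo unchanged
indexed = demo.set_index("user_id")

first_two = indexed.head(2)  # first 2 rows
last_two = indexed.tail(2)   # last 2 rows
```

Assigning the result to a new name, as here, keeps the original table around, which can be handy while experimenting.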
3. View basic information about rows and columns of data
1, shape returns the number of rows and columns of the data as a tuple;
users.shape
# (943, 4)
2, columns returns the column names;
users.columns
# Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')
3, index returns the row labels;
users.index
# Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
#             ...
#             934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
#            dtype='int64', name='user_id', length=943)
4, dtypes returns the data type of each column;
users.dtypes
# age            int64
# gender        object
# occupation    object
# zip_code      object
# dtype: object
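The four attributes above can be tried together on a small table. This is only a sketch on hypothetical two-row data, not the article's dataset.

```python
import pandas as pd

# Hypothetical two-row table indexed by user_id
demo = pd.DataFrame(
    {"age": [24, 53], "gender": ["M", "F"]},
    index=pd.Index([1, 2], name="user_id"),
)

n_rows, n_cols = demo.shape        # shape is an attribute, so no parentheses
column_names = list(demo.columns)  # column labels
row_labels = list(demo.index)      # row labels
age_dtype = demo.dtypes["age"]     # data type of a single column
```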
4. Select only one or more columns of data
Pandas offers several ways to select columns. Here, users is a DataFrame.
1, users.column_name;
users.occupation
2, users[['column name']];
users[['occupation']]
3, users.loc[:, ['column name']];
users.loc[:,['occupation']]
To select multiple columns at the same time:
1, users[['column name 1', 'column name 2']];
users[['occupation', 'age']]
2, users.loc[:, ['column name 1', 'column name 2']];
users.loc[:, ['occupation', 'age']]
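One detail worth knowing is that these forms do not all return the same type. The sketch below, on hypothetical data, shows that single brackets (or attribute access) give a Series while a list of names gives a DataFrame.

```python
import pandas as pd

# Hypothetical three-row table
demo = pd.DataFrame({
    "age": [24, 53, 23],
    "occupation": ["technician", "other", "writer"],
})

as_series = demo["occupation"]    # single brackets: a Series, like demo.occupation
as_frame = demo[["occupation"]]   # list of names: a one-column DataFrame

# All rows, two columns, returned in the order listed
two_cols = demo.loc[:, ["occupation", "age"]]
```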
5. Count the values in a column
1, users.column_name.nunique() returns how many distinct values a column contains;
users.occupation.nunique()
# 21
It can also be done this way:
users.column_name.value_counts().count()
users.occupation.value_counts().count()
# 21
If, building on 1, you want to see how many times each distinct value appears in the column, use the following statement:
users.column_name.value_counts()
users.occupation.value_counts().head()
# student          196
# other            105
# educator          95
# administrator     79
# engineer          67
# Name: occupation, dtype: int64
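The relationship between nunique() and value_counts() can be checked on a short, made-up Series. This is a sketch with hypothetical values, not the article's data.

```python
import pandas as pd

# Hypothetical occupation column
occupation = pd.Series(
    ["student", "other", "student", "educator", "student", "other"]
)

n_unique = occupation.nunique()     # number of distinct values
counts = occupation.value_counts()  # occurrences per value, most frequent first

most_common = counts.index[0]       # label of the most frequent value
```

By default both methods ignore missing values, so the two counts agree.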
6. Compute summary statistics for the numeric columns
By default, describe() summarizes only the numeric columns.
users.describe()
It is also possible to summarize all columns by adding the argument include = 'all';
users.describe(include = 'all')
describe() can also be applied to a specific column:
users.occupation.describe()
# count         943
# unique         21
# top       student
# freq          196
# Name: occupation, dtype: object
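The three describe() variants can be compared side by side on a tiny table. The data below is hypothetical and chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical table with one numeric and one text column
demo = pd.DataFrame({
    "age": [20, 30, 40],
    "occupation": ["student", "student", "other"],
})

numeric_summary = demo.describe()                   # numeric columns only
full_summary = demo.describe(include="all")         # every column
occupation_summary = demo["occupation"].describe()  # a single text column

mean_age = numeric_summary.loc["mean", "age"]
```

For a text column the summary reports count, unique, top, and freq instead of mean and quantiles.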
7. Group the data
The groupby function groups the rows by the values in a column and returns a GroupBy object. It is similar to the methods in section 5; the difference is that after grouping by one column you can compute statistics on the other columns within each group.
c = users.groupby("occupation")
c
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017673002788>
GroupBy.head(n) shows the first n rows of each group
c.head(5)
GroupBy.count() counts, for each group, the non-missing values in each of the other columns
c.count()
GroupBy.size() counts the number of rows in each group
c.size()
There are other functions available on GroupBy objects.
For details, see the official documentation: https://pandas.pydata.org/docs/reference/groupby.html
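The difference between count() and size() only shows up when values are missing. The sketch below uses hypothetical data with one missing age to make it visible.

```python
import pandas as pd

# Hypothetical data; one age is deliberately missing
demo = pd.DataFrame({
    "occupation": ["student", "student", "other"],
    "age": [20, None, 40],
})

grouped = demo.groupby("occupation")

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per group, for each other column
```

Here the student group has two rows but only one recorded age, so size() and count() disagree for that group.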
8. Sort the data by a column
The sort_values() function sorts in ascending order by default. Set ascending = False to sort in descending order.
users.sort_values(["age"],ascending = False)
You can also sort by referring to multiple columns:
users.sort_values(["age", "zip_code"], ascending = False)
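When sorting by several columns, ascending can also be a list with one flag per column. The sketch below uses hypothetical data and sorts by age descending, breaking ties by zip_code ascending.

```python
import pandas as pd

# Hypothetical data with a tie on age
demo = pd.DataFrame({
    "age": [30, 20, 30],
    "zip_code": ["10001", "94043", "55105"],
})

# One ascending flag per sort column: age descending, zip_code ascending
ordered = demo.sort_values(["age", "zip_code"], ascending=[False, True])

ordered_zips = list(ordered["zip_code"])
```

sort_values returns a new DataFrame, so demo keeps its original row order.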
9. Create a new column
Adding a new column is easy. Create a Series with the same number of rows as the original data and assign it to the source data:
data['column name'] = newly created Series; for example, the age column can be normalized and stored in a new column, age_normalize.
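The normalization code itself is not shown here, so the following is only a sketch of one possible approach. It assumes min-max scaling (an assumption, other formulas would work equally well) and uses hypothetical ages.

```python
import pandas as pd

# Hypothetical ages
demo = pd.DataFrame({"age": [20, 30, 40]})

# Assumption: min-max scaling to the range [0, 1];
# the original formula is not given in the text
age = demo["age"]
demo["age_normalize"] = (age - age.min()) / (age.max() - age.min())

normalized = list(demo["age_normalize"])
```

The arithmetic produces a Series with the same index as demo, so the assignment lines up row by row.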
10. Delete the specified column
The drop() function drops a specified column from the source data
users.drop(['age'],axis = 1)
The axis parameter specifies whether rows or columns are deleted: 0 (the default) for rows and 1 for columns. You can also use the following form:
users.drop(columns =['age'])
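Note that drop() returns a new DataFrame by default and leaves the original untouched. The sketch below, on hypothetical data, shows this along with dropping a row through the default axis.

```python
import pandas as pd

# Hypothetical two-row table with the default 0, 1 index
demo = pd.DataFrame({"age": [24, 53], "gender": ["M", "F"]})

without_age = demo.drop(columns=["age"])  # new DataFrame without the age column
without_first_row = demo.drop(0)          # axis=0 (default): drop the row labeled 0
```

To keep the change, assign the result back (for example to the same variable name) or pass inplace=True.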