My goal is not to turn you into a Pandas expert. Instead, I will introduce Pandas through a series of examples to help you get started.
Disclaimer: The tutorial is free and you can keep following it if it’s helpful.
The previous article, "Data structures in Pandas | Easily play with Pandas (1)", introduced the two data structures commonly used in Pandas, Series and DataFrame. Here we look at the functions commonly used with them.
# import related libraries
import numpy as np
import pandas as pd
Common basic functions
Once we have built a Series or DataFrame, which features do we use on a regular basis? Let's take a look. Following the scenario in the previous article, we have some user information stored in a DataFrame.
Since DataFrame is more commonly used than Series in most cases, DataFrame is used here as an example, but many of the common features apply to Series as well.
index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
data = {
    "age": [18, 30, 25, 40],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    "sex": ["male", "male", "female", "male"]
}
user_info = pd.DataFrame(data=data, index=index)
user_info
name | age | city | sex |
---|---|---|---|
Tom | 18 | BeiJing | male |
Bob | 30 | ShangHai | male |
Mary | 25 | GuangZhou | female |
James | 40 | ShenZhen | male |
In general, the first step after getting data is to understand its overall shape, which can be viewed using the info method.
user_info.info()
Index: 4 entries, Tom to James
Data columns (total 3 columns):
age     4 non-null int64
city    4 non-null object
sex     4 non-null object
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
If the data set is very large and we only want a glimpse of what it looks like, we can look at just the first or last n rows. Use the head method to view the first n rows and the tail method to view the last n rows.
user_info.head(2)
name | age | city | sex |
---|---|---|---|
Tom | 18 | BeiJing | male |
Bob | 30 | ShangHai | male |
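As a complement to head, here is a minimal sketch of tail on the same data. The frame is rebuilt so the snippet runs on its own; the variable names last_two and first_default are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from this article.
user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# tail(n) returns the last n rows, here Mary and James.
last_two = user_info.tail(2)
print(last_two)

# Without an argument, head() asks for 5 rows; the frame only has 4,
# so all of them come back.
first_default = user_info.head()
print(len(first_default))
```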
In addition, the data structures in Pandas share common attributes with the NumPy ndarray, such as .shape for the shape of the data and .T for its transpose.
user_info.shape
(4, 3)
user_info.T
name | Tom | Bob | Mary | James |
---|---|---|---|---|
age | 18 | 30 | 25 | 40 |
city | BeiJing | ShangHai | GuangZhou | ShenZhen |
sex | male | male | female | male |
If we want to get the raw data from a DataFrame, we can use .values, which returns a NumPy ndarray.
user_info.values
array([[18, 'BeiJing', 'male'],
       [30, 'ShangHai', 'male'],
       [25, 'GuangZhou', 'female'],
       [40, 'ShenZhen', 'male']], dtype=object)
Description and statistics
Sometimes, after obtaining data, we want to view simple statistical indicators of the data (maximum, minimum, average, median, etc.). For example, we want to view the maximum age. How can we achieve this?
Call max directly on the age column.
user_info.age.max()
40
Similarly, the minimum, mean, median, and sum are available through the min, mean, median (or quantile, whose default is the 0.5 quantile), and sum methods. As you can see, calling these methods on a Series returns a single aggregated value.
cumsum also computes a sum, but a cumulative one: each position holds the running total up to that point, so the result has the same size as the original Series or DataFrame.
user_info.age.cumsum()
name
Tom 18
Bob 48
Mary 73
James 113
Name: age, dtype: int64
As you can see, each value in the cumsum result is the previous running total plus the current original value. For example, 73 above is 48 + 25. cumsum can also be applied to string objects, where it concatenates them.
user_info.sex.cumsum()
name
Tom male
Bob malemale
Mary malemalefemale
James malemalefemalemale
Name: sex, dtype: object
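The single-value aggregations mentioned earlier (min, mean, median, sum) can be sketched in the same way as max. This is only an illustration on the article's data; the summary dictionary is a hypothetical helper, not part of Pandas.

```python
import pandas as pd

# Rebuild the age column from this article.
age = pd.Series(
    [18, 30, 25, 40],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="age",
)

# Each call collapses the Series to one value.
summary = {
    "min": age.min(),        # 18
    "mean": age.mean(),      # 28.25
    "median": age.median(),  # 27.5
    "q50": age.quantile(),   # quantile defaults to q=0.5, so also 27.5
    "sum": age.sum(),        # 113
}
print(summary)
```

Note that the last value of cumsum (113) equals sum, since the running total ends at the overall total.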
Descriptive statistics
Although there are methods for the various common statistics, getting more than one indicator means calling several methods, which is a little cumbersome.
The designers of Pandas had this in mind: to get multiple statistics at once, simply call the describe method.
user_info.describe()
 | age |
---|---|
count | 4.000000 |
mean | 28.250000 |
std | 9.251126 |
min | 18.000000 |
25% | 23.250000 |
50% | 27.500000 |
75% | 32.500000 |
max | 40.000000 |
As you can see, calling describe directly displays statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and the 25%, 50%, and 75% quantiles. To see statistics for non-numeric columns, set include=["object"].
user_info.describe(include=["object"])
 | city | sex |
---|---|---|
count | 4 | 4 |
unique | 4 | 2 |
top | BeiJing | male |
freq | 1 | 3 |
The results above show some statistics for non-numeric columns: the count, the number of unique values, the most common value, and the frequency of the most common value.
Also, what if I want to count how many times each value occurs in a column? Call the value_counts method to quickly get the number of occurrences of each value in a Series.
user_info.sex.value_counts()
male 3
female 1
Name: sex, dtype: int64
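When proportions are more useful than raw counts, value_counts accepts normalize=True. A small sketch on the sex column from this article (the name shares is only illustrative):

```python
import pandas as pd

# Rebuild the sex column from this article.
sex = pd.Series(
    ["male", "male", "female", "male"],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="sex",
)

# normalize=True returns relative frequencies instead of counts.
shares = sex.value_counts(normalize=True)
print(shares)  # male 0.75, female 0.25
```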
If you want to get the index corresponding to the maximum or minimum value of a column, you can use the idxmax or idxmin methods to do so.
user_info.age.idxmax()
'James'
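The label returned by idxmax or idxmin can be passed straight to .loc to fetch the whole row. A minimal sketch, with the frame rebuilt so it runs on its own (youngest and oldest_row are illustrative names):

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# idxmin returns the index label of the smallest value.
youngest = user_info.age.idxmin()
print(youngest)  # Tom

# Use the label from idxmax to look up the full row of the oldest user.
oldest_row = user_info.loc[user_info.age.idxmax()]
print(oldest_row)
```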
Discretization
Sometimes we need to discretize ages (bucketing), which simply means dividing the ages into intervals. Here we want to divide the ages into three intervals. To do this, use the Pandas cut function.
pd.cut(user_info.age, 3)
name
Tom      (17.978, 25.333]
Bob      (25.333, 32.667]
Mary     (17.978, 25.333]
James      (32.667, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]
As you can see, cut automatically generates equal-width intervals. You can also define the interval edges yourself.
pd.cut(user_info.age, [1, 18, 30, 50])
name
Tom (1, 18]
Bob (18, 30]
Mary (18, 30]
James (30, 50]
Name: age, dtype: category
Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]
Sometimes after discretization, if you want to give each interval a name, you can specify the labels parameter.
pd.cut(user_info.age, [1, 18, 30, 50], labels=["childhood", "youth", "middle"])
name
Tom childhood
Bob youth
Mary youth
James middle
Name: age, dtype: category
Categories (3, object): [childhood < youth < middle]
Besides cut, qcut can also be used for discretization. cut splits by the values themselves (equal-width intervals by default), while qcut splits by quantiles, so that each interval holds roughly the same number of values.
pd.qcut(user_info.age, 3)
name
Tom      (17.999, 25.0]
Bob        (25.0, 30.0]
Mary     (17.999, 25.0]
James      (30.0, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
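The difference between cut and qcut is easier to see on skewed data. The sketch below uses a made-up Series with one outlier (not data from this article) and illustrative labels, then counts how many values fall into each bucket.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["low", "mid", "high"]

# cut: three equal-width intervals over the range 1..100,
# so almost everything lands in the first bucket.
equal_width_counts = pd.cut(values, 3, labels=labels).value_counts().to_dict()

# qcut: three quantile-based intervals, so each bucket gets
# the same number of values here.
quantile_counts = pd.qcut(values, 3, labels=labels).value_counts().to_dict()

print(equal_width_counts)  # low: 8, mid: 0, high: 1
print(quantile_counts)     # low: 3, mid: 3, high: 3
```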
Sorting
In data analysis, data sorting is indispensable. Pandas supports two kinds of sorting: by axis labels (index or columns) and by actual values.
Sorting by axis labels: sort_index sorts by the row index by default.
user_info.sort_index()
name | age | city | sex |
---|---|---|---|
Bob | 30 | ShangHai | male |
James | 40 | ShenZhen | male |
Mary | 25 | GuangZhou | female |
Tom | 18 | BeiJing | male |
To sort by column labels in descending order, set axis=1 and ascending=False.
user_info.sort_index(axis=1, ascending=False)
name | sex | city | age |
---|---|---|---|
Tom | male | BeiJing | 18 |
Bob | male | ShangHai | 30 |
Mary | female | GuangZhou | 25 |
James | male | ShenZhen | 40 |
What if you want to sort by actual values, for example by age?
Use the sort_values method and set by="age".
user_info.sort_values(by="age")
name | age | city | sex |
---|---|---|---|
Tom | 18 | BeiJing | male |
Mary | 25 | GuangZhou | female |
Bob | 30 | ShangHai | male |
James | 40 | ShenZhen | male |
Sometimes we may need to sort by multiple values, for example, to sort by age and city together, we can set by to a list.
Note: The order of each element in the list affects the sorting priority.
user_info.sort_values(by=["age", "city"])
name | age | city | sex |
---|---|---|---|
Tom | 18 | BeiJing | male |
Mary | 25 | GuangZhou | female |
Bob | 30 | ShangHai | male |
James | 40 | ShenZhen | male |
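Because all ages in the example are distinct, the second sort key never comes into play above. The following sketch sorts by sex first and then by age, and passes a list to ascending to give each key its own direction (the name ordered is only illustrative).

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# First key: sex ascending ("female" before "male").
# Second key: age descending, used to order rows with the same sex.
ordered = user_info.sort_values(by=["sex", "age"], ascending=[True, False])
print(ordered)
```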
After sorting, we often only want the n largest or smallest values. The nlargest and nsmallest methods do this directly, which is much faster than sorting everything and then calling head(n).
user_info.age.nlargest(2)
name
James 40
Bob 30
Name: age, dtype: int64
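For completeness, here is a sketch of nsmallest on a Series and of nlargest on the whole DataFrame, where the column to rank by is passed as the second argument. The variable names are illustrative.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# The two youngest users, as a Series of ages.
two_youngest = user_info.age.nsmallest(2)
print(two_youngest)

# The two oldest users, keeping all columns.
two_oldest_rows = user_info.nlargest(2, "age")
print(two_oldest_rows)
```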
Function application
Although Pandas provides a very rich set of functions for us, there may be times when we need to customize some functions and apply them to a DataFrame or Series. The commonly used functions are map, apply, and applymap.
map is a Series method that transforms every element in the Series.
If I want to determine whether a user is middle-aged by age (30 or over counts as middle-aged), map can do it easily.
# map accepts a function, here a lambda
user_info.age.map(lambda x: "yes" if x >= 30 else "no")
name
Tom no
Bob yes
Mary no
James yes
Name: age, dtype: object
map also accepts a dict. For example, if I want to determine whether a city is in the north or the south, I can do this.
city_map = {
    "BeiJing": "north",
    "ShangHai": "south",
    "GuangZhou": "south",
    "ShenZhen": "south"
}
# pass in a dict
user_info.city.map(city_map)
name
Tom north
Bob south
Mary south
James south
Name: city, dtype: object
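One thing worth knowing when mapping with a dict: values that have no matching key become NaN. The sketch below uses a deliberately incomplete mapping (partial_map is a hypothetical name) and fills the gaps afterwards.

```python
import pandas as pd

# Rebuild the city column from this article.
city = pd.Series(
    ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="city",
)

# Only one city is covered by this mapping.
partial_map = {"BeiJing": "north"}

# Cities missing from the dict are mapped to NaN.
region = city.map(partial_map)
print(region)

# Replace the missing results with a placeholder.
region_filled = region.fillna("unknown")
print(region_filled)
```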
The apply method supports both Series and DataFrame. On a Series it is applied to each value; on a DataFrame it is applied to every row or column (controlled by the axis parameter).
# For a Series, apply is not very different from map.
user_info.age.apply(lambda x: "yes" if x >= 30 else "no")
name
Tom no
Bob yes
Mary no
James yes
Name: age, dtype: object
For a DataFrame, apply passes each column (axis=0) or each row (axis=1) to the function as a Series.
user_info.apply(lambda x: x.max(), axis=0)
age 40
city ShenZhen
sex male
dtype: object
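The example above works column by column (axis=0). With axis=1 the function receives one row at a time instead. A minimal sketch that builds a short description per user; the output format is made up for illustration.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# With axis=1 each row is passed as a Series; row.name is the index label
# and columns are accessed by name.
described = user_info.apply(
    lambda row: f"{row.name} ({row['city']}, {row['age']})", axis=1
)
print(described)
```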
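The applymap method applies to a DataFrame and operates on each element, similar to what apply does on a Series. Note that recent pandas versions (2.1 and later) deprecate applymap in favor of DataFrame.map, so the call below may emit a warning there.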
user_info.applymap(lambda x: str(x).lower())
name | age | city | sex |
---|---|---|---|
Tom | 18 | beijing | male |
Bob | 30 | shanghai | male |
Mary | 25 | guangzhou | female |
James | 40 | shenzhen | male |
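Since pandas 2.1 the element-wise method is also available as DataFrame.map, and applymap is deprecated. A hedged sketch that picks whichever method the installed version offers; the method_name helper variable is an assumption for illustration, not something from the original article.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Prefer DataFrame.map when it exists, otherwise fall back to applymap.
method_name = "map" if hasattr(pd.DataFrame, "map") else "applymap"
lowered = getattr(user_info, method_name)(lambda x: str(x).lower())
print(lowered)
```

Note that every element goes through str, so the age column holds strings in the result.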
Modify column/index names
When using DataFrame, it is common to change column names, index names, and so on. This is easily done using rename.
To change the column name, set the columns parameter.
user_info.rename(columns={"age": "Age", "city": "City", "sex": "Sex"})
name | Age | City | Sex |
---|---|---|---|
Tom | 18 | BeiJing | male |
Bob | 30 | ShangHai | male |
Mary | 25 | GuangZhou | female |
James | 40 | ShenZhen | male |
Similarly, to change index labels you only need to set the index parameter.
user_info.rename(index={"Tom": "tom", "Bob": "bob"})
name | age | city | sex |
---|---|---|---|
tom | 18 | BeiJing | male |
bob | 30 | ShangHai | male |
Mary | 25 | GuangZhou | female |
James | 40 | ShenZhen | male |
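rename also accepts a function instead of a dict, which is handy when every label follows the same rule, and rename_axis changes the name of the index itself. A small sketch; both calls return new objects and leave user_info unchanged. The new axis name "user" is made up for illustration.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Apply a function to every column label and every index label.
renamed = user_info.rename(columns=str.upper, index=str.lower)
print(renamed)

# Change the name of the index (the "name" header), not its labels.
reindexed = user_info.rename_axis("user")
print(reindexed.index.name)
```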
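Type operations
If you want to get the number of columns of each type, use the get_dtype_counts method. Note that this method was deprecated in pandas 0.25 and removed in 1.0; on newer versions, user_info.dtypes.value_counts() gives the same information.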
user_info.get_dtype_counts()
int64 1
object 2
dtype: int64
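On pandas versions where get_dtype_counts is no longer available, the same per-type count can be sketched with dtypes and value_counts. The exact dtype names printed depend on the installed pandas version, so no fixed output is shown.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# dtypes is a Series mapping column name to dtype;
# value_counts then counts how many columns share each dtype.
dtype_counts = user_info.dtypes.value_counts()
print(dtype_counts)
```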
If you want to convert data types, you can do so with astype.
user_info["age"].astype(float)
name
Tom      18.0
Bob      30.0
Mary     25.0
James    40.0
Name: age, dtype: float64
Pandas also provides the to_numeric, to_datetime, and to_timedelta functions for converting data to numeric, datetime, and timedelta types respectively.
I’m going to add some height information to all of these users.
user_info["height"] = ["178", "168", "178", "180cm"]
user_info
name | age | city | sex | height |
---|---|---|---|---|
Tom | 18 | BeiJing | male | 178 |
Bob | 30 | ShangHai | male | 168 |
Mary | 25 | GuangZhou | female | 178 |
James | 40 | ShenZhen | male | 180cm |
Now convert the height column to numbers. Obviously, 180cm is not a number. To force the conversion, we can pass the errors argument, which controls what happens when a conversion fails.
By default, errors='raise', which means an exception is thrown as soon as a conversion fails. Setting errors='coerce' assigns pd.NaT (for datetime and timedelta) or np.nan (for numbers) to the offending elements. Setting errors='ignore' returns the original data when a conversion fails; note that recent pandas versions (2.2 and later) deprecate this option.
pd.to_numeric(user_info.height, errors="coerce")
name
Tom      178.0
Bob      168.0
Mary     178.0
James      NaN
Name: height, dtype: float64
pd.to_numeric(user_info.height, errors="ignore")
name
Tom 178
Bob 168
Mary 178
James 180cm
Name: height, dtype: object
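Rather than losing the value 180cm to NaN, one option is to strip the unit before converting. The sketch below assumes the only unwanted text is the literal suffix "cm". It also shows errors="coerce" with to_datetime on a made-up Series of date strings, which are not data from this article.

```python
import pandas as pd

# Rebuild the height column from this article.
height = pd.Series(
    ["178", "168", "178", "180cm"],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="height",
)

# Remove the literal "cm" suffix, then convert; coerce guards against
# anything else that still fails to parse.
cleaned = pd.to_numeric(height.str.replace("cm", "", regex=False), errors="coerce")
print(cleaned)

# The same errors="coerce" idea for dates: unparseable entries become NaT.
raw_dates = pd.Series(["2021-01-01", "not a date"])
dates = pd.to_datetime(raw_dates, errors="coerce")
print(dates)
```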