My goal is not to teach Pandas formally. Instead, I will introduce it through a series of examples that help you learn it.

Disclaimer: The tutorial is free and you can keep following it if it’s helpful.

The previous article, "Easily Play with Pandas (1): Data Structures", introduced the two data structures most commonly used in Pandas, Series and DataFrame. Here we look at the functions commonly used with these data structures.

# import related libraries
import numpy as np
import pandas as pd

Common basic functions

Once we’ve built Series and DataFrame, what features do we use on a regular basis? Come and see with me. Referring to the scenario in the previous chapter, we have some information about the user and store it in the DataFrame.

Since DataFrame is more commonly used than Series in most cases, DataFrame is used here as an example, but in fact many of the common features apply to Series as well.

index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")

data = {
    "age": [18, 30, 25, 40],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    "sex": ["male", "male", "female", "male"]
}

user_info = pd.DataFrame(data=data, index=index)
user_info

       age       city     sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  GuangZhou  female
James   40   ShenZhen    male

In general, the first step after getting data is to understand its overall shape, which you can view with the info method.

user_info.info()
Index: 4 entries, Tom to James
Data columns (total 3 columns):
age     4 non-null int64
city    4 non-null object
sex     4 non-null object
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes

If the dataset is very large and we only want a sense of what the data looks like, we do not need to print all of it. The head method returns the first n rows and the tail method returns the last n rows.

user_info.head(2)
      age      city   sex
name
Tom    18   BeiJing  male
Bob    30  ShangHai  male
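
The tail method works the same way from the other end. A minimal sketch, which rebuilds the same user_info table so that it runs on its own (the variable name last_two is only illustrative):

```python
import pandas as pd

# Rebuild the example table from this article.
user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# tail(2) returns the last two rows, in their original order.
last_two = user_info.tail(2)
print(last_two)
```

Calling head() or tail() without an argument returns five rows by default.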

In addition, the data structures in Pandas share the usual attributes of a NumPy ndarray, such as .shape for the shape of the data and .T for the transpose of the data.

user_info.shape
(4, 3)

user_info.T

name      Tom       Bob       Mary     James
age        18        30         25        40
city  BeiJing  ShangHai  GuangZhou  ShenZhen
sex      male      male     female      male

If we want the underlying data of a DataFrame, we can get it through .values, which returns an ndarray.

user_info.values
array([[18, 'BeiJing', 'male'],
       [30, 'ShangHai', 'male'],
       [25, 'GuangZhou', 'female'],
       [40, 'ShenZhen', 'male']], dtype=object)

Description and statistics

Sometimes, after obtaining data, we want to view simple statistical indicators of the data (maximum, minimum, average, median, etc.). For example, we want to view the maximum age. How can we achieve this?

Call max directly on the age column.

user_info.age.max()
40

Similarly, the minimum, mean, median, and sum can be obtained by calling the min, mean, quantile (which defaults to the 0.5 quantile, i.e. the median), and sum methods. As you can see, calling these methods on a Series returns a single aggregated value.
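
As a quick check of the aggregations just mentioned, here is a minimal sketch. It rebuilds the same user_info table so that it runs on its own; the stats dict is only an illustrative container.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

age = user_info["age"]
stats = {
    "min": age.min(),            # 18
    "mean": age.mean(),          # 28.25
    "median": age.median(),      # 27.5
    "quantile": age.quantile(),  # 27.5, same as the median (q defaults to 0.5)
    "sum": age.sum(),            # 113
}
print(stats)
```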

cumsum also computes a sum, but a cumulative one, so the result has the same size as the original Series or DataFrame.

user_info.age.cumsum()
name
Tom       18
Bob       48
Mary      73
James    113
Name: age, dtype: int64

As you can see, each value in the cumsum result is the running total so far plus the current original value. For example, 73 above is 48 + 25. cumsum can also be applied to string objects.

user_info.sex.cumsum()
name
Tom                    male
Bob                malemale
Mary         malemalefemale
James    malemalefemalemale
Name: sex, dtype: object

Descriptive statistics

There is a method for each common statistic, but getting several indicators means calling several methods, which is a little cumbersome.

The designers of Pandas anticipated this: to get multiple indicators at once, simply call the describe method.

user_info.describe()
             age
count   4.000000
mean   28.250000
std     9.251126
min    18.000000
25%    23.250000
50%    27.500000
75%    32.500000
max    40.000000

As you can see, calling describe directly shows statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and the 25%/50%/75% quantiles. To see statistics for non-numeric columns, set include=["object"].

user_info.describe(include=["object"])
           city   sex
count         4     4
unique        4     2
top     BeiJing  male
freq          1     3

The results above show statistics for the non-numeric columns: the count, the number of unique values, the most common value, and the frequency of the most common value.

Also, what if I want to count how often each value occurs in a column? Call the value_counts method to quickly get the number of occurrences of each value in a Series.

user_info.sex.value_counts()
male      3
female    1
Name: sex, dtype: int64
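
value_counts can also report proportions instead of raw counts through its normalize parameter. A minimal sketch, rebuilding the same user_info table so that it runs on its own (counts and shares are illustrative names):

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Raw counts per value.
counts = user_info["sex"].value_counts()

# Fraction of rows per value (the fractions sum to 1).
shares = user_info["sex"].value_counts(normalize=True)

print(counts)
print(shares)
```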

If you want to get the index corresponding to the maximum or minimum value of a column, you can use the idxmax or idxmin methods to do so.

user_info.age.idxmax()
'James'
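
The label returned by idxmax or idxmin can be passed straight to .loc to fetch the whole row. A minimal sketch under the same example data (youngest, oldest, and oldest_row are illustrative names):

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

youngest = user_info["age"].idxmin()  # index label of the smallest age
oldest = user_info["age"].idxmax()    # index label of the largest age

# Use the label to look up the full row for the oldest user.
oldest_row = user_info.loc[oldest]
print(youngest, oldest)
print(oldest_row)
```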

Discretization

Sometimes we need to discretize (bucket) the ages, which simply means dividing them into intervals. Here we want to divide the ages into three intervals. To do this, use the Pandas cut method.

pd.cut(user_info.age, 3)
name
Tom      (17.978, 25.333]
Bob      (25.333, 32.667]
Mary     (17.978, 25.333]
James      (32.667, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]

As you can see, cut automatically generates equal-width intervals. You can also define the bin edges yourself.

pd.cut(user_info.age, [1, 18, 30, 50])
name
Tom       (1, 18]
Bob      (18, 30]
Mary     (18, 30]
James    (30, 50]
Name: age, dtype: category
Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]

Sometimes after discretization, if you want to give each interval a name, you can specify the labels parameter.

pd.cut(user_info.age, [1, 18, 30, 50], labels=["childhood", "youth", "middle"])
name
Tom      childhood
Bob          youth
Mary         youth
James       middle
Name: age, dtype: category
Categories (3, object): [childhood < youth < middle]

Besides cut, you can also discretize with qcut. cut bins by value, producing intervals of equal width, while qcut bins by quantile, so that each interval holds roughly the same number of values.

pd.qcut(user_info.age, 3)
name
Tom      (17.999, 25.0]
Bob        (25.0, 30.0]
Mary     (17.999, 25.0]
James      (30.0, 40.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
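
The difference between cut and qcut is easier to see on skewed data. The sketch below uses a small made-up Series (the ages values are hypothetical, not from this article) and counts how many values land in each of two bins. Passing labels=False makes both functions return integer bin numbers instead of intervals.

```python
import pandas as pd

# Hypothetical, deliberately skewed data: five young ages and one outlier.
ages = pd.Series([18, 19, 20, 21, 22, 80])

# cut: two bins of equal width over the range 18..80.
cut_counts = pd.cut(ages, 2, labels=False).value_counts().sort_index().tolist()

# qcut: two bins split at the median, so each holds half of the values.
qcut_counts = pd.qcut(ages, 2, labels=False).value_counts().sort_index().tolist()

print(cut_counts)   # [5, 1]
print(qcut_counts)  # [3, 3]
```

With equal-width bins the outlier ends up alone in its own bin, while quantile-based bins stay balanced.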

Sorting

In data analysis, sorting is indispensable. Pandas supports two kinds of sorting: by axis (index or columns) and by actual values.

To sort by axis, use sort_index, which sorts by the row index by default.

user_info.sort_index()
       age       city     sex
name
Bob     30   ShangHai    male
James   40   ShenZhen    male
Mary    25  GuangZhou  female
Tom     18    BeiJing    male

If you want to sort the columns in reverse order, set the parameters axis=1 and ascending=False.

user_info.sort_index(axis=1, ascending=False)
          sex       city  age
name
Tom      male    BeiJing   18
Bob      male   ShangHai   30
Mary   female  GuangZhou   25
James    male   ShenZhen   40

If you want to sort by actual value, for example, if you want to sort by age, how do you do that?

Use the sort_values method and set the parameter by="age".

user_info.sort_values(by="age")
       age       city     sex
name
Tom     18    BeiJing    male
Mary    25  GuangZhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male

Sometimes we may need to sort by multiple values, for example, to sort by age and city together, we can set by to a list.

Note: The order of each element in the list affects the sorting priority.

user_info.sort_values(by=["age", "city"])

       age       city     sex
name
Tom     18    BeiJing    male
Mary    25  GuangZhou  female
Bob     30   ShangHai    male
James   40   ShenZhen    male
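
Because every age in this table is unique, the second key never gets a chance to break a tie above. The sketch below sorts by sex first and age second, and also passes a list to ascending so that each key gets its own direction. It rebuilds the same user_info table so that it runs on its own; by_sex_then_age is an illustrative name.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Sort by sex (ascending) first; within each sex, sort by age (descending).
by_sex_then_age = user_info.sort_values(by=["sex", "age"], ascending=[True, False])
print(by_sex_then_age)
```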

If we only want the n largest or n smallest values, we can use the nlargest and nsmallest methods, which is much faster than sorting and then calling head(n).

user_info.age.nlargest(2)
name
James    40
Bob      30
Name: age, dtype: int64
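
nsmallest works the same way, and a DataFrame has its own nlargest and nsmallest methods that take the column to rank by. A minimal sketch on the same example data (the variable names are illustrative), which also checks that the result matches sorting followed by head:

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# The two smallest ages, as a Series.
two_youngest = user_info["age"].nsmallest(2)

# The two rows with the largest age, as a DataFrame.
two_oldest = user_info.nlargest(2, "age")

# Same rows as sorting in descending order and taking the first two.
via_sort = user_info.sort_values(by="age", ascending=False).head(2)

print(two_youngest)
print(two_oldest)
```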

Function application

Although Pandas provides a very rich set of functions for us, there may be times when we need to customize some functions and apply them to a DataFrame or Series. The commonly used functions are map, apply, and applymap.

map is a Series method that transforms every element in the Series.

If I want to decide from the age whether a user is middle-aged (30 or older counts as middle-aged), map does it easily.

# receive a lambda function
user_info.age.map(lambda x: "yes" if x >= 30 else "no")
name
Tom       no
Bob      yes
Mary      no
James    yes
Name: age, dtype: object

For example, if I want to determine whether it’s north or south by city, I can do that.

city_map = {
    "BeiJing": "north",
    "ShangHai": "south",
    "GuangZhou": "south",
    "ShenZhen": "south"
}

# pass in a dict
user_info.city.map(city_map)
name
Tom      north
Bob      south
Mary     south
James    south
Name: city, dtype: object
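
One thing to watch when passing a dict to map: values that have no matching key become NaN. The sketch below uses a deliberately incomplete dict (partial_map is a hypothetical name) and fills the gaps with a placeholder string.

```python
import pandas as pd

city = pd.Series(
    ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
    name="city",
)

# Hypothetical incomplete mapping: only one city is listed.
partial_map = {"BeiJing": "north"}

# Cities missing from the dict are mapped to NaN.
region = city.map(partial_map)

# Replace the missing results with a placeholder.
region_filled = region.fillna("unknown")
print(region_filled)
```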

The apply method works on both Series and DataFrame. On a Series it is applied to each value; on a DataFrame it is applied to each row or each column, depending on the axis parameter.

# For Series, the Apply method is not very different from the map method.
user_info.age.apply(lambda x: "yes" if x >= 30 else "no")
name
Tom       no
Bob      yes
Mary      no
James    yes
Name: age, dtype: object

# For a DataFrame, apply passes each column (axis=0) or each row (axis=1) to the function as a Series.
user_info.apply(lambda x: x.max(), axis=0)
age           40
city    ShenZhen
sex         male
dtype: object
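
With axis=1 the function receives one row at a time, which is handy for combining several columns. A minimal sketch on the same example data (the output format and the name summary are only illustrative):

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# Each row arrives as a Series indexed by column name.
summary = user_info.apply(lambda row: f"{row['city']}-{row['sex']}", axis=1)
print(summary)
```

The result is a Series with one value per row, indexed by the original row labels.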

The applymap method is specific to DataFrame. It is applied to every element of the DataFrame, much as apply works on every element of a Series. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)

user_info.applymap(lambda x: str(x).lower())
      age       city     sex
name
Tom    18    beijing    male
Bob    30   shanghai    male
Mary   25  guangzhou  female
James  40   shenzhen    male

Modify column/index names

When using DataFrame, it is common to change column names, index names, and so on. This is easily done using rename.

To change the column name, set the columns parameter.

user_info.rename(columns={"age": "Age", "city": "City", "sex": "Sex"})

       Age       City     Sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  GuangZhou  female
James   40   ShenZhen    male

Similarly, you only need to set the index parameter to change the index name.

user_info.rename(index={"Tom": "tom", "Bob": "bob"})

       age       city     sex
name
tom     18    BeiJing    male
bob     30   ShangHai    male
Mary    25  GuangZhou  female
James   40   ShenZhen    male

Type operations

If you want to get the number of columns of each type, use the get_dtype_counts method. (This method was deprecated in pandas 0.25 and removed in 1.0, so it only works on older versions.)

user_info.get_dtype_counts()
int64     1
object    2
dtype: int64
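
On pandas versions where get_dtype_counts is no longer available, counting the entries of the dtypes attribute gives the same information. A minimal sketch on the same example data (dtype_counts is an illustrative name; the exact dtype names printed depend on the pandas version and platform):

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# dtypes is a Series mapping column name -> dtype; value_counts tallies each dtype.
dtype_counts = user_info.dtypes.value_counts()
print(dtype_counts)
```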

If you want to convert data types, you can do so with astype.

user_info["age"].astype(float)
name
Tom      18.0
Bob      30.0
Mary     25.0
James    40.0
Name: age, dtype: float64

Pandas also provides the to_numeric, to_datetime, and to_timedelta functions, which convert data to numeric, datetime, and timedelta types respectively.

I’m going to add some height information to all of these users.

user_info["height"] = ["178", "168", "178", "180cm"]
user_info

       age       city     sex height
name
Tom     18    BeiJing    male    178
Bob     30   ShangHai    male    168
Mary    25  GuangZhou  female    178
James   40   ShenZhen    male  180cm

Now convert the height column to a number. Obviously, 180cm is not a number. To cast, we can pass in the errors argument, which is what to do if a cast fails.

By default, errors='raise', which means an exception is raised as soon as a conversion fails. Setting errors='coerce' replaces each value that cannot be converted with pd.NaT (for datetime and timedelta) or np.nan (for numbers). Setting errors='ignore' returns the original data unchanged if a conversion fails.

pd.to_numeric(user_info.height, errors="coerce")
name
Tom      178.0
Bob      168.0
Mary     178.0
James      NaN
Name: height, dtype: float64

pd.to_numeric(user_info.height, errors="ignore")
name
Tom        178
Bob        168
Mary       178
James    180cm
Name: height, dtype: object
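
The errors parameter behaves the same way for to_datetime, except that values that cannot be converted become pd.NaT. A minimal sketch with made-up input (the raw_dates values are hypothetical, not from this article):

```python
import pandas as pd

# Hypothetical input: one valid date string and one that cannot be parsed.
raw_dates = pd.Series(["2020-01-01", "not a date"])

# With errors="coerce", the unparseable entry becomes NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)
```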
