Categorical data in Pandas
This article introduces the pandas Categorical type, which stores data as integer codes that refer to a set of categories, giving better performance and lower memory usage.
Background: counting repeated values
A Series often contains repeated values. We need to extract the distinct values and count how often each one occurs:
import numpy as np
import pandas as pd
data = pd.Series(["Chinese", "Mathematics", "English", "Mathematics", "English", "Geography", "Chinese", "Chinese"])
data
0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: object
# 1. Extract different values
pd.unique(data)
array(['Chinese', 'Mathematics', 'English', 'Geography'], dtype=object)
# 2. Count the occurrences of each value
data.value_counts()  # the top-level pd.value_counts(data) is deprecated in recent pandas versions
Chinese        3
Mathematics    2
English        2
Geography      1
dtype: int64
Categorical, or dictionary-encoded, representation
Representing values as integers that index into an array of distinct values is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data.
df = pd.Series([0, 1, 1, 0] * 2)
df
0 0
1 1
2 1
3 0
4 0
5 1
6 1
7 0
dtype: int64
# dim serves as the dimension table
dim = pd.Series(["Chinese", "Mathematics"])
dim
0        Chinese
1    Mathematics
dtype: object
How do we map each code in df to its label (0 to Chinese, 1 to Mathematics)? Use the take method:
df1 = dim.take(df)
df1
0        Chinese
1    Mathematics
1    Mathematics
0        Chinese
0        Chinese
1    Mathematics
1    Mathematics
0        Chinese
dtype: object
type(df1) # Series data
pandas.core.series.Series
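The codes and the dimension table do not have to be written by hand. As a minimal sketch, pd.factorize (not used in the article) produces both from the raw values, and take maps the codes back to labels. The variable names are illustrative.

```python
import pandas as pd

values = pd.Series(["Chinese", "Mathematics", "English", "Mathematics", "Chinese"])

# factorize returns one integer code per row plus the distinct values
codes, uniques = pd.factorize(values)

# take() looks up each code in the distinct values, like dim.take(df)
restored = pd.Series(uniques).take(codes).reset_index(drop=True)

print(list(codes))
print(list(uniques))
print(restored.tolist())
```

The restored Series holds the same labels as the input, which is the property that makes dictionary encoding lossless.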
Creating the Categorical type
Creating a Categorical instance
The following examples show how the Categorical type is used.
subjects = ["Chinese", "Mathematics", "Chinese", "Chinese"] * 2
N = len(subjects)
df2 = pd.DataFrame({
    "subject": subjects,
    "id": np.arange(N),  # consecutive integers
    "score": np.random.randint(3, 15, size=N),  # random integers in [3, 15)
    "height": np.random.uniform(165, 180, size=N)  # uniformly distributed floats
},
columns=["id", "subject", "score", "height"])  # specify the column order
df2
We can convert subject to Categorical:
subject_cat = df2["subject"].astype("category")
subject_cat
Two things stand out about subject_cat:
- Its values are stored as a pandas Categorical with the category dtype, not as a NumPy array
- It has two categories: Chinese and Mathematics
s = subject_cat.values
s
['Chinese', 'Mathematics', 'Chinese', 'Chinese', 'Chinese', 'Mathematics', 'Chinese', 'Chinese']
Categories (2, object): ['Chinese', 'Mathematics']
type(s)
pandas.core.arrays.categorical.Categorical
s.categories  # view the categories
Index(['Chinese', 'Mathematics'], dtype='object')
s.codes  # view the category codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
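The categories and codes together fully describe the data. The following sketch, which assumes the same subjects list as above, rebuilds the labels from the two pieces and then reassembles an equivalent Categorical with from_codes.

```python
import pandas as pd

subjects = ["Chinese", "Mathematics", "Chinese", "Chinese"] * 2
s = pd.Series(subjects).astype("category").values

# Index each code into the categories to recover the original labels
rebuilt = s.categories.take(s.codes)

# from_codes assembles an equivalent Categorical from codes and categories
same = pd.Categorical.from_codes(s.codes, categories=s.categories)

print(list(rebuilt))
print(list(same))
```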
How to create a Categorical object
There are three main ways:
- Convert a DataFrame column with astype("category")
- Construct one directly with pandas.Categorical
- Use the from_codes constructor, which requires the category codes up front
Method 1
df2["subject"] = df2["subject"].astype("category")
df2.subject
0        Chinese
1    Mathematics
2        Chinese
3        Chinese
4        Chinese
5    Mathematics
6        Chinese
7        Chinese
Name: subject, dtype: category
Categories (2, object): ['Chinese', 'Mathematics']
Method 2
fruit = pd.Categorical(["Apple", "Banana", "Grapes", "Apple", "Apple", "Banana"])
fruit
['Apple', 'Banana', 'Grapes', 'Apple', 'Apple', 'Banana']
Categories (3, object): ['Apple', 'Banana', 'Grapes']
Method 3
categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]
my_data = pd.Categorical.from_codes(codes, categories)
my_data
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height', 'score', 'subject']
By default, a categorical conversion does not assume any ordering of the categories. To give them a meaningful order, pass the ordered argument:
pd.Categorical.from_codes(codes, categories, ordered=True)
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
In the output above, height < score < subject shows the order of the categories.
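An explicit order can also be attached when converting existing data. This is a sketch using pd.CategoricalDtype, which the article does not cover; the answers data and the low/medium/high labels are hypothetical.

```python
import pandas as pd

# Hypothetical survey answers with a natural order
answers = pd.Series(["low", "high", "medium", "low"])

# Spell out the categories and their order explicitly
level_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
levels = answers.astype(level_type)

print(levels.cat.ordered)
print(levels.sort_values().tolist())  # sorted by category order, not alphabetically
print(levels.max())
```

Without the explicit dtype, the categories would be sorted alphabetically (high, low, medium), which is rarely the intended order.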
# my_data is unordered; as_ordered() returns an ordered copy
my_data.as_ordered()
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
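Ordering matters because it enables comparisons and min/max. A small sketch, reusing the codes and categories from Method 3:

```python
import pandas as pd

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]
my_data = pd.Categorical.from_codes(codes, categories)

# as_ordered() returns a new object; my_data itself stays unordered
ordered = my_data.as_ordered()

print(my_data.ordered, ordered.ordered)
print(ordered.min(), ordered.max())
print(list(ordered > "height"))  # element-wise comparison by category order
```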
Computing with Categorical objects
Statistical calculations
np.random.seed(12345)
data1 = np.random.randn(100)
data1[:10]
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303, 1.96578057, 1.39340583, 0.09290788, 0.28174615, 0.76902257, 1.24643474])
# Bin data1 into quartiles and inspect the result
bins_1 = pd.qcut(data1, 4)
bins_1
[(-0.717, 0.106], (0.106, 0.761], (-0.717, 0.106], (-0.717, 0.106], (0.761, 3.249], ..., (0.761, 3.249], (0.106, 0.761], (-2.371, -0.717], (0.106, 0.761], (0.106, 0.761]]
Length: 100
Categories (4, interval[float64]): [(-2.371, -0.717] < (-0.717, 0.106] < (0.106, 0.761] < (0.761, 3.249]]
The result is a Categorical object:
- It has four categories, one per quartile
- The minimum and maximum of the whole data set fall at the lower edge of the first bin and the upper edge of the last bin
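To see what quartile binning means in practice, the sketch below counts the members of each bin and contrasts qcut with pd.cut, which the article does not use and which splits the value range into equal-width bins instead.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data1 = np.random.randn(100)

quartiles = pd.qcut(data1, 4)   # bins with (roughly) equal counts
equal_width = pd.cut(data1, 4)  # bins with equal width, for contrast

quartile_counts = pd.Series(quartiles).value_counts().sort_index()
width_counts = pd.Series(equal_width).value_counts().sort_index()

print(quartile_counts)
print(width_counts)
```

With 100 distinct values, each quartile bin holds 25 of them, while the equal-width bins are typically unbalanced for normally distributed data.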
# Label the quartiles Q1, Q2, Q3, Q4
bins_2 = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])
bins_2
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q4', 'Q3', 'Q1', 'Q3', 'Q3']
Length: 100
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins_2.codes[:10]
array([1, 2, 1, 1, 3, 3, 1, 2, 3, 3], dtype=int8)
Use groupby to compute summary statistics:
bins_2 = pd.Series(bins_2, name="quartile")  # quartile labels
bins_2
0 Q2
1 Q3
2 Q2
3 Q2
4 Q4
..
95 Q4
96 Q3
97 Q1
98 Q3
99 Q3
Name: quartile, Length: 100, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
The following example groups data1 by bins_2 and computes three statistics: count, min, and max.
results = pd.Series(data1).groupby(bins_2).agg(["count", "min", "max"]).reset_index()
results
results["quartile"]  # the quartile column keeps the original categorical information
0 Q1
1 Q2
2 Q3
3 Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
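Because the grouping key is categorical, groupby can report categories that have no rows. The sketch below drops every Q4 row and compares the two settings of the observed parameter; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
values = pd.Series(np.random.randn(100))
quartile = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"]).rename("quartile")

keep = quartile != "Q4"  # discard every row in the top quartile

# observed=False keeps all categories, including the now-empty Q4
all_cats = values[keep].groupby(quartile[keep], observed=False).count()
# observed=True reports only categories that actually occur
seen_only = values[keep].groupby(quartile[keep], observed=True).count()

print(all_cats)
print(seen_only)
```

Passing observed explicitly also avoids depending on the default, which has changed across pandas versions.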
Reduced memory usage after categorical conversion
N = 10_000_000  # ten million rows
data3 = pd.Series(np.random.randn(N))
labels3 = pd.Series(["foo", "bar", "baz", "quz"] * (N // 4))
categories3 = labels3.astype("category")  # convert to category
# Compare the memory usage of the string labels and their categorical version
print("labels3:", labels3.memory_usage())
print("categories3:", categories3.memory_usage())
labels3: 80000128
categories3: 10000332
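By default, memory_usage counts only the 8-byte references of an object-dtype column, not the strings themselves. As a rough sketch on a smaller, hypothetical data set, deep=True gives a fuller picture of the saving:

```python
import pandas as pd

n = 100_000  # much smaller than the article's N so the sketch runs quickly
labels = pd.Series(["foo", "bar", "baz", "qux"] * (n // 4))
cats = labels.astype("category")

# deep=True inspects the stored string data as well as the array itself
label_bytes = labels.memory_usage(deep=True)
cat_bytes = cats.memory_usage(deep=True)

print(label_bytes, cat_bytes)
print(cats.cat.codes.dtype)  # int8 is enough for four categories
```

The exact byte counts depend on the pandas version and string dtype, so only the relative difference is meaningful here.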
Categorical methods
Accessing categorical information
Categorical methods are accessed mainly through the special cat accessor attribute.
data
0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: object
cat_data = data.astype("category")
cat_data  # categorical data
0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: category
Categories (4, object): ['Chinese', 'English', 'Geography', 'Mathematics']
Setting new categories
Suppose the actual set of categories is larger than the four values observed in the data:
actual_cat = ["Chinese", "Mathematics", "English", "Geography", "Biology"]
cat_data2 = cat_data.cat.set_categories(actual_cat)
cat_data2
The "Biology" category now appears, even though no row has that value.
cat_data.value_counts()
Chinese        3
Mathematics    2
English        2
Geography      1
dtype: int64
cat_data2.value_counts()  # "Biology" appears in the result below
Chinese        3
Mathematics    2
English        2
Geography      1
Biology        0
dtype: int64
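set_categories replaces the whole category set, so it can also drop categories. The sketch below contrasts it with add_categories, which only appends; the sample data is illustrative.

```python
import pandas as pd

cat_data = pd.Series(["Chinese", "Mathematics", "English", "Mathematics"], dtype="category")

# add_categories appends a category and keeps the existing ones
added = cat_data.cat.add_categories(["Biology"])

# set_categories replaces the whole set; values outside it become NaN
narrowed = cat_data.cat.set_categories(["Chinese", "Mathematics"])

print(added.cat.categories.tolist())
print(narrowed.isna().sum())  # the "English" row is now missing
```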
Removing categories
cat_data3 = cat_data[cat_data.isin(["Chinese", "Mathematics"])]  # keep only Chinese and Mathematics
cat_data3
0        Chinese
1    Mathematics
3    Mathematics
6        Chinese
7        Chinese
dtype: category
Categories (4, object): ['Chinese', 'English', 'Geography', 'Mathematics']
cat_data3.cat.remove_unused_categories()  # drop categories that no longer appear
0        Chinese
1    Mathematics
3    Mathematics
6        Chinese
7        Chinese
dtype: category
Categories (2, object): ['Chinese', 'Mathematics']
Creating dummy variables
Categorical data can be converted into dummy variables, also known as one-hot encoding. The resulting DataFrame has one column per distinct category, as in the following example:
data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
data4
0 col1
1 col2
2 col3
3 col4
4 col1
5 col2
6 col3
7 col4
dtype: category
Categories (4, object): ['col1', 'col2', 'col3', 'col4']
pd.get_dummies(data4)  # get_dummies converts one-dimensional categorical data into a DataFrame of dummy variables
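A short sketch of what the resulting frame looks like. The prefix and dtype arguments are optional extras chosen here for readability, not something the article uses.

```python
import pandas as pd

data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(data4, prefix="is", dtype=int)

print(dummies.columns.tolist())
print(dummies.shape)
print((dummies.sum(axis=1) == 1).all())  # exactly one indicator is set per row
```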
Summary of categorical methods
- add_categories: appends new (unused) categories at the end of the existing ones
- as_ordered: makes the categories ordered
- as_unordered: makes the categories unordered
- remove_categories: removes the given categories and sets any values in them to null (NaN)
- remove_unused_categories: removes all categories that do not appear in the data
- rename_categories: replaces the category names without changing the number of categories
- reorder_categories: reorders the categories; the new list must contain exactly the existing categories
- set_categories: replaces the categories with a specified set of new categories, which can add or remove categories
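A few of the methods listed above, sketched on a small hypothetical Series (the sizes data and its labels are made up for illustration). Each call returns a new Series and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical data for illustration
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")

# rename_categories: same codes, new labels
renamed = sizes.cat.rename_categories({"small": "S", "medium": "M", "large": "L"})

# reorder_categories: same categories in a new, optionally ordered, sequence
reordered = sizes.cat.reorder_categories(["small", "medium", "large"], ordered=True)

# remove_categories: values in the removed category become NaN
removed = sizes.cat.remove_categories(["large"])

print(renamed.tolist())
print(reordered.cat.categories.tolist())
print(removed.isna().tolist())
```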