Categorical Data in pandas

This article introduces the pandas Categorical type, which stores data with repeated values as integer category codes to give better performance and lower memory usage.

Background: counting repeated values

A Series often contains repeated values. A common task is to extract the distinct values and count how often each one occurs:

import numpy as np
import pandas as pd

data = pd.Series(["Chinese", "Mathematics", "English", "Mathematics", "English", "Geography", "Chinese", "Chinese"])
data

0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: object

# 1. Extract the distinct values
pd.unique(data)

array(['Chinese', 'Mathematics', 'English', 'Geography'], dtype=object)

# 2. Count the occurrences of each value
data.value_counts()

Chinese        3
Mathematics    2
English        2
Geography      1
dtype: int64

Categorical (dictionary) encoding

Storing each value as an integer that points into an array of distinct values is called categorical or dictionary encoding. The array of distinct values can be called the categories, the dictionary, or the levels of the data.
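To make the idea concrete, here is a minimal sketch that derives the integer codes and the array of distinct values with pd.factorize. The variable names (values, codes, uniques, restored) are illustrative and do not come from the article.

```python
import pandas as pd

values = pd.Series(["Chinese", "Mathematics", "English", "Mathematics"])

# factorize returns integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(values)

# Looking each code up in the distinct values restores the original data
restored = uniques.take(codes)

print(codes)
print(list(uniques))
print(list(restored))
```

Note that factorize keeps the distinct values in order of first appearance, whereas the category conversions shown later sort them by default.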

df = pd.Series([0, 1, 1, 0] * 2)
df

0    0
1    1
2    1
3    0
4    0
5    1
6    1
7    0
dtype: int64

# dim is the dimension table holding the distinct values
dim = pd.Series(["Chinese", "Mathematics"])
dim

0        Chinese
1    Mathematics
dtype: object

How do we map the codes in df to their labels, 0 to Chinese and 1 to Mathematics? The take method does this:

df1 = dim.take(df)
df1

0        Chinese
1    Mathematics
1    Mathematics
0        Chinese
0        Chinese
1    Mathematics
1    Mathematics
0        Chinese
dtype: object

type(df1)  # a Series

pandas.core.series.Series

Creating the Categorical type

Creating a Categorical instance

The following example illustrates how the Categorical type is used:

subjects = ["Chinese", "Mathematics", "Chinese", "Chinese"] * 2

N = len(subjects)

df2 = pd.DataFrame({
    "subject": subjects,
    "id": np.arange(N),  # consecutive integers
    "score": np.random.randint(3, 15, size=N),  # random integers in [3, 15)
    "height": np.random.uniform(165, 180, size=N)  # uniformly distributed floats
    },
    columns=["id", "subject", "score", "height"])  # specify the column order

df2

We can convert subject to Categorical:

subject_cat = df2["subject"].astype("category")
subject_cat

subject_cat has two notable characteristics:

  • Its values are not a NumPy array but an instance of the Categorical type
  • It has two categories: Chinese and Mathematics

s = subject_cat.values
s

['Chinese', 'Mathematics', 'Chinese', 'Chinese', 'Chinese', 'Mathematics', 'Chinese', 'Chinese']
Categories (2, object): ['Chinese', 'Mathematics']

type(s)

pandas.core.arrays.categorical.Categorical

s.categories  # inspect the categories

Index(['Chinese', 'Mathematics'], dtype='object')

s.codes  # inspect the category codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
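The two attributes fit together as a lookup: each code is a position in categories. A small sketch of that relationship, using an illustrative Categorical named cat rather than the DataFrame column above:

```python
import pandas as pd

cat = pd.Categorical(["Chinese", "Mathematics", "Chinese", "Chinese"])

# By default the categories are the sorted distinct values
print(list(cat.categories))   # ['Chinese', 'Mathematics']
print(cat.codes.tolist())     # [0, 1, 0, 0]

# Each code indexes into categories, so take() rebuilds the original values
rebuilt = cat.categories.take(cat.codes)
print(list(rebuilt))
```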

Ways to create a Categorical object

There are three main ways:

  • Convert a DataFrame column with astype("category")
  • Call the pandas.Categorical constructor directly
  • Use the from_codes constructor, which requires the category codes up front

Method 1

df2["subject"] = df2["subject"].astype("category")
df2.subject

0        Chinese
1    Mathematics
2        Chinese
3        Chinese
4        Chinese
5    Mathematics
6        Chinese
7        Chinese
Name: subject, dtype: category
Categories (2, object): ['Chinese', 'Mathematics']

Method 2

fruit = pd.Categorical(["apple", "banana", "grapes", "apple", "apple", "banana"])
fruit

['apple', 'banana', 'grapes', 'apple', 'apple', 'banana']
Categories (3, object): ['apple', 'banana', 'grapes']

Method 3

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]

my_data = pd.Categorical.from_codes(codes, categories)
my_data

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height', 'score', 'subject']

By default, converting to a categorical does not assign any order to the categories. To give them a meaningful order, pass the ordered argument:
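The call that uses the ordered argument is not shown in the text. One plausible sketch, assuming the same codes and categories as in the from_codes example (the name ordered_data is made up here):

```python
import pandas as pd

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]

# ordered=True tells pandas that the categories list is ranked, lowest first
ordered_data = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_data)
```

Displaying ordered_data gives output of the following form.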

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']

The output above shows the order height < score < subject.

# my_data is unordered; as_ordered() returns an ordered copy
my_data.as_ordered()

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
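Ordering matters because ordered categoricals support comparisons and min/max, which unordered ones reject. A small sketch with a made-up sizes example (the names sizes and ranked are illustrative, not from the article):

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],  # listed from lowest to highest
)

# as_ordered() returns a new object; sizes itself stays unordered
ranked = sizes.as_ordered()

print(sizes.ordered, ranked.ordered)   # False True
print(ranked.min(), ranked.max())      # small large
print((ranked > "small").tolist())     # [False, True, True]
```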

Computing with Categorical objects

Statistical calculations

np.random.seed(12345)

data1 = np.random.randn(100)
data1[:10]

array([-0.20470766, 0.47894334, -0.51943872, -0.5557303, 1.96578057, 1.39340583, 0.09290788, 0.28174615, 0.76902257, 1.24643474])

# Cut data1 into quartile bins
bins_1 = pd.qcut(data1, 4)
bins_1

[(-0.717, 0.106], (0.106, 0.761], (-0.717, 0.106], (-0.717, 0.106], (0.761, 3.249], ..., (0.761, 3.249], (0.106, 0.761], (-2.371, -0.717], (0.106, 0.761], (0.106, 0.761]]
Length: 100
Categories (4, interval[float64]): [(-2.371, -0.717] < (-0.717, 0.106] < (0.106, 0.761] < (0.761, 3.249]]

The result is a Categorical object:

  • It has four categories, one interval per quartile
  • The outer edges of the first and last intervals correspond to the minimum and maximum of the whole data set

# Label the quartiles Q1, Q2, Q3, Q4 instead of using the intervals
bins_2 = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])
bins_2

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q4', 'Q3', 'Q1', 'Q3', 'Q3']
Length: 100
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

bins_2.codes[:10]

array([1, 2, 1, 1, 3, 3, 1, 2, 3, 3], dtype=int8)
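The defining property of qcut, bins that hold equal numbers of points, can be checked directly. This is a rough sketch on a separate sample; the seed, the generator and the names sample, quartiles and counts are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seed chosen arbitrarily for this sketch
sample = rng.standard_normal(100)

quartiles = pd.qcut(sample, 4)

# qcut places the bin edges at sample quantiles, so 100 distinct values
# split into four bins of 25 points each
counts = pd.Series(quartiles).value_counts(sort=False)
print(counts)
```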

Combine with groupby to compute summary statistics:

bins_2 = pd.Series(bins_2, name="quartile")  # quartile labels as a Series
bins_2

0     Q2
1     Q3
2     Q2
3     Q2
4     Q4
      ..
95    Q4
96    Q3
97    Q1
98    Q3
99    Q3
Name: quartile, Length: 100, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

The following example groups data1 by bins_2 and computes three statistics for each group:

results = pd.Series(data1).groupby(bins_2).agg(["count", "min", "max"]).reset_index()
results

results["quartile"]  # the quartile column keeps the original categorical information

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
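The same pattern can be verified on a tiny input where the statistics are easy to check by hand. This is only a sketch: the names values, key and summary and the labels "low" and "high" are invented, and observed=False is passed explicitly on the assumption that every category should appear in the result.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(8, dtype=float))                    # 0.0 .. 7.0
key = pd.qcut(values, 2, labels=["low", "high"]).rename("half")  # categorical key

# Group by the categorical key, aggregate, then move the key back into a column
summary = (
    values.groupby(key, observed=False)
    .agg(["count", "min", "max"])
    .reset_index()
)

print(summary)
print(summary["half"].dtype)   # the key column is still categorical
```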

Reduced memory usage with categoricals

N = 10000000  # ten million rows

data3 = pd.Series(np.random.randn(N))
labels3 = pd.Series(["foo", "bar", "baz", "quz"] * (N // 4))

categories3 = labels3.astype("category")  # convert the labels to categorical

# Compare the memory usage of the labels before and after conversion
print("labels3: ", labels3.memory_usage())
print("categories3: ", categories3.memory_usage())

labels3:  80000128
categories3:  10000332
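A scaled-down sketch of the same comparison, which runs quickly and also passes deep=True so that the string data itself is counted. The size n and the variable names are illustrative choices, not values from the article.

```python
import pandas as pd

n = 100_000  # much smaller than the article's N, to keep the sketch quick
labels = pd.Series(["foo", "bar", "baz", "quz"] * (n // 4))
as_category = labels.astype("category")

# deep=True includes the memory held by the strings themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(as_category.cat.codes.dtype)   # four categories fit in int8 codes
```

The saving comes from storing one small integer per row plus a single copy of each distinct label.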

Categorical methods

Accessing categorical information

Categorical methods are accessed mainly through the special cat accessor.

data

0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: object

cat_data = data.astype("category")
cat_data  # categorical data

0        Chinese
1    Mathematics
2        English
3    Mathematics
4        English
5      Geography
6        Chinese
7        Chinese
dtype: category
Categories (4, object): ['Chinese', 'English', 'Geography', 'Mathematics']
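As a quick illustration of the cat accessor, the sketch below reads the categories, the codes and the ordered flag from a small categorical Series. The Series name subjects_cat is made up for this example.

```python
import pandas as pd

subjects_cat = pd.Series(["Chinese", "Mathematics", "English", "Chinese"]).astype("category")

print(list(subjects_cat.cat.categories))  # ['Chinese', 'English', 'Mathematics']
print(subjects_cat.cat.codes.tolist())    # [0, 2, 1, 0]
print(subjects_cat.cat.ordered)           # False
```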

Adding categories

Suppose the actual set of categories is larger than the four values observed in the data:

actual_cat = ["Chinese", "Mathematics", "English", "Geography", "Biology"]

cat_data2 = cat_data.cat.set_categories(actual_cat)
cat_data2

Note that "Biology" now appears among the categories.

cat_data.value_counts()

Chinese        3
Mathematics    2
English        2
Geography      1
dtype: int64

cat_data2.value_counts()  # "Biology" appears in the result with a count of 0

Chinese        3
Mathematics    2
English        2
Geography      1
Biology        0
dtype: int64

Removing categories

cat_data3 = cat_data[cat_data.isin(["Chinese", "Mathematics"])]  # keep only Chinese and Mathematics

cat_data3

0        Chinese
1    Mathematics
3    Mathematics
6        Chinese
7        Chinese
dtype: category
Categories (4, object): ['Chinese', 'English', 'Geography', 'Mathematics']

cat_data3.cat.remove_unused_categories()  # drop the categories that no longer occur

0        Chinese
1    Mathematics
3    Mathematics
6        Chinese
7        Chinese
dtype: category
Categories (2, object): ['Chinese', 'Mathematics']

Creating dummy variables

Categorical data can be converted into dummy variables, also known as one-hot encoding. The resulting DataFrame has one column per distinct category, as in the following example:

data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
data4

0    col1
1    col2
2    col3
3    col4
4    col1
5    col2
6    col3
7    col4
dtype: category
Categories (4, object): ['col1', 'col2', 'col3', 'col4']

pd.get_dummies(data4)  # convert one-dimensional categorical data into a DataFrame of dummy variables
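The shape of the get_dummies result can be inspected with a short sketch. The names labels and dummies are illustrative, and the cast to int is there only because the element type of the result (boolean or integer) depends on the pandas version.

```python
import pandas as pd

labels = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
dummies = pd.get_dummies(labels)

print(list(dummies.columns))   # one column per category
print(dummies.shape)           # (8, 4)

# Each row has exactly one "hot" column
row_sums = dummies.astype(int).sum(axis=1)
print(row_sums.tolist())
```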

Summary of categorical methods

  • add_categories: appends new categories at the end of the existing ones
  • as_ordered: makes the categories ordered
  • as_unordered: makes the categories unordered
  • remove_categories: removes categories and sets the values that belonged to them to NaN
  • remove_unused_categories: removes all categories that do not appear in the data
  • rename_categories: replaces the category names without changing the number of categories
  • reorder_categories: reorders the categories
  • set_categories: replaces the categories with a specified new set, which may add or remove categories
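Several of these methods can be tried on a tiny Series. This is a sketch rather than a full tour; the Series s and the category names are made up for illustration. Each call returns a new Series and leaves s unchanged.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

added = s.cat.add_categories(["c"])                        # categories: a, b, c
renamed = s.cat.rename_categories({"a": "A", "b": "B"})    # values become A, B, A
reordered = s.cat.reorder_categories(["b", "a"], ordered=True)
removed = s.cat.remove_categories(["b"])                   # "b" values become NaN
replaced = s.cat.set_categories(["a", "z"])                # "b" dropped, "z" added

print(list(added.cat.categories))
print(renamed.tolist())
print(list(reordered.cat.categories), reordered.cat.ordered)
print(removed.isna().tolist())
print(list(replaced.cat.categories), replaced.isna().tolist())
```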