Preface

In the last section, we learned PCA, an unsupervised dimensionality-reduction method. While preparing PCA's input data, we also learned several data preprocessing methods, including ordinal encoding, one-hot encoding and binary encoding. This section describes the Categorical Embedder method, which encodes categorical variables with a neural network, mapping unstructured indices to floating-point tensors and thus satisfying the input requirements of a neural network.

1. Data preprocessing for neural networks

Machine learning models are inseparable from data preprocessing, which is essential to building a network model and often determines the training result. Preprocessing methods have their own limitations and particularities for different data sets, and neural networks in particular only accept floating-point tensors as input. Whatever data is being processed (sound, images, or text) must be converted into tensors before it is fed to the neural network model. When we encounter categorical data, it can be converted to numbers by ordinal encoding, one-hot encoding, binary encoding, and so on. Next, a more advanced neural-network-based encoding method is introduced.

2. What is an embedding

Embedding is a way to transform discrete variables into continuous vector representations. As the official Google tutorials describe, embeddings make machine learning on large inputs, such as sparse vectors representing words, easier. They help us work with unstructured data: converting these discrete variables into numbers is more conducive to model training. Say a company has three products, with an average of 50,000 comments per product and a total of 1 million unique words in the corpus. That gives a matrix of shape (150K, 1M), which is very large and very sparse for any model. Suppose instead we reduce the dimension to 15 (a 15-dimensional ID for each product), take the average embedding of each product, and color the values. Embedding means that each category is represented by a fixed set of numbers, so a (3, 15) matrix is enough to indicate the degree of similarity between the products: more visual and less complex.
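To make this concrete, here is a minimal sketch of a (3, 15) embedding matrix used as a lookup table; the numbers are random placeholders standing in for learned values.

import numpy as np

emb = np.random.randn(3, 15).astype("float32")   # one 15-dimensional vector per product

# cosine similarity between the three product vectors, a stand-in for the
# "degree of similarity" mentioned above
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print((norm @ norm.T).round(2))                   # (3, 3) similarity matrix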

Each category maps to a different vector, and the values of that vector are adjusted, or learned, while the neural network is trained. The resulting vector space places categories that are close to or related to each other near one another. To learn embeddings, we build a model that uses them as features, interacting with the other features, to learn the task described above. Before going further, one related concept is worth introducing: word vectors.

2.1 Word vectors

A word vector is the embedding vector of a word in a language. The whole idea of word vectors is that words that appear close together in sentences are usually related, so their vectors should also be close to each other. Embeddings are n-dimensional vectors; each dimension captures some attribute of a word, so words with similar attributes end up with nearby vectors. To learn word vectors, we create a set of word pairs that appear together within a small window (say 5 words) as positive examples, and a set of word pairs that do not appear within that window as negative examples.
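As a rough illustration (the sentence and the window size here are made up), such positive and negative word pairs could be generated like this:

import random

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 5  # words within this window form positive pairs

positive_pairs = []
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window // 2), min(len(sentence), i + window // 2 + 1)
    positive_pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)

# negative pairs: word combinations that never co-occur inside the window
vocab = sorted(set(sentence))
positive_set = set(positive_pairs)
negative_pairs = []
while len(negative_pairs) < len(positive_pairs):
    a, b = random.sample(vocab, 2)
    if (a, b) not in positive_set:
        negative_pairs.append((a, b))

print(positive_pairs[:3])
print(negative_pairs[:3])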

When we train the neural network above on a sufficiently large data set, the model learns to predict whether two words are related. A by-product of this model is the embedding matrix, an information-rich vector representation of each word in the vocabulary. (Refer to Categorical Embedding and Transfer Learning.)

2.2 Traditional Methods
  • Ordinal Encoding is usually used for categorical data where there is an order (size relationship) between categories.
  • One-hot Encoding uses sparse vectors to save space, often combined with feature selection to reduce dimensionality.
  • Binary Encoding has two main steps: first assign each category an ID using ordinal encoding, then take the binary representation of that ID as the result. A short comparison of all three encodings follows below.
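To make the three encodings concrete, here is a small, self-contained sketch using pandas on a toy column; the binary step is hand-rolled for illustration (packages such as category_encoders offer a ready-made BinaryEncoder).

import pandas as pd

df = pd.DataFrame({"fuel": ["petrol", "diesel", "electric", "petrol"]})

# Ordinal encoding: each category gets an integer id
df["fuel_ordinal"] = df["fuel"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["fuel"], prefix="fuel")

# Binary encoding: write the ordinal id in binary digits, one column per bit
n_bits = int(df["fuel_ordinal"].max()).bit_length()
binary = df["fuel_ordinal"].apply(
    lambda x: pd.Series([int(b) for b in format(int(x), f"0{n_bits}b")])
)
binary.columns = [f"fuel_bin_{i}" for i in range(n_bits)]

print(pd.concat([df, one_hot, binary], axis=1))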
2.3 How categorical_embedder works

First, each category of a categorical variable is mapped to an N-dimensional vector. This mapping is learned by the neural network during standard supervised training. If we want to use the 15-dimensional IDs above as features, we train the neural network in a supervised manner, taking one vector per category and producing a 3×15 matrix. (A figure drawing every neuron would be too cluttered to read, so any such diagram is for reference only.)

Next, each category in the data is replaced with its corresponding vector, as sketched below. The advantage is that we limit the number of columns needed for each categorical variable, which is useful when a column has high cardinality (a large number of distinct values). The embeddings obtained from the neural network also reveal intrinsic properties of the categorical variable, meaning that similar categories end up with similar embeddings.
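A minimal sketch of that replacement step, assuming the embedding matrix has already been learned (it is random here, and the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"product_id": [0, 1, 2, 1, 0]})
emb_dim = 4                               # small dimension so the print stays readable
emb_matrix = np.random.randn(3, emb_dim)  # one row per unique category (learned in practice)

emb_cols = pd.DataFrame(
    emb_matrix[df["product_id"].values],
    columns=[f"product_emb_{i}" for i in range(emb_dim)],
)
df_embedded = pd.concat([df.drop(columns="product_id"), emb_cols], axis=1)
print(df_embedded)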

2.4 Learning the embedding matrix

The embedding matrix is a floating-point N×M matrix, where N is the number of unique categories and M is the embedding dimension. How do we choose M? M is usually set close to the square root of N and then increased or decreased as needed. In effect, the embedding matrix is a lookup table of vectors: each row is the vector of one unique category.
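A one-line version of that rule of thumb, as a sketch:

import math

def suggest_embedding_dim(n_categories: int) -> int:
    """Rule of thumb from the text: start near sqrt(N) and adjust as needed."""
    return max(1, round(math.sqrt(n_categories)))

for n in (3, 50, 1000, 100000):
    print(n, suggest_embedding_dim(n))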

Back to the company example: it has three products, each with 50,000 reviews full of unique values. To build categorical embeddings, we train a deep learning model on a meaningful task and use the embedding matrix to represent the categorical variable within that task. Here we use 15-dimensional vectors to predict the relevance of the company's products; which products are related can then be read off by color, which is one analysis idea used in recommender systems. Furthermore, the attributes of the samples can be grouped into classes to build the corresponding embedding matrices, and these embedding matrices are fed into the neural network for training through a Flatten layer.

3. Python-based categorical_embedder

3.1 Reproducing the neural network encoding code
pip install categorical_embedder

Note: This library requires TensorFlow version 2.1 or lower; higher versions will cause unknown errors.
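If your environment already has a newer TensorFlow, one way to respect that ceiling (adjust as needed for your setup) is to pin the version when installing, for example:

pip install "tensorflow<2.2" categorical_embedder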

The categorical_embedder library contains a few important functions, which we describe in detail below.

  • ce.get_embedding_info(data, categorical_variables=None): The purpose of this function is to identify all categorical variables in data and determine their embedding size. The embedding size of a categorical variable is the minimum of 50 and half the number of its unique values, i.e. embedding size of a column = min(50, # unique values of that column). We can pass an explicit list of categorical variables in the categorical_variables parameter; if None, the function automatically takes all variables whose data type is object.
def get_embedding_info(data, categorical_variables=None):
    '''
    this function identifies categorical variables and its embedding size

    :data: input data [dataframe]
    :categorical_variables: list of categorical_variables [default: None]
        if None, it automatically takes the variables with data type 'object'

    embedding size of categorical variables are determined by minimum of 50
    or half of the no. of its unique values,
    i.e. embedding size of a column = min(50, # unique values of that column)
    '''
    if categorical_variables is None:
        categorical_variables = data.select_dtypes(include='object').columns

    return {col: (data[col].nunique(), min(50, (data[col].nunique() + 1) // 2)) for col in categorical_variables}
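On a hypothetical toy dataframe (the column names here are made up), the returned dictionary maps each categorical column to (number of unique values, embedding size):

import pandas as pd
import categorical_embedder as ce

toy = pd.DataFrame({
    "brand":   ["audi", "bmw", "audi", "toyota"],
    "gearbox": ["manual", "auto", "auto", "manual"],
    "mileage": [3.2, 1.1, 5.4, 2.0],   # numeric column, ignored by the function
})
print(ce.get_embedding_info(toy))
# expected, per the source above: {'brand': (3, 2), 'gearbox': (2, 1)}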
  • ce.get_label_encoded_data(data, categorical_variables=None): This function label-encodes (integer-encodes) all categorical variables using sklearn.preprocessing.LabelEncoder and returns a label-encoded dataframe for training. Keras/TensorFlow, or any other deep learning library, expects data in this format.
def get_label_encoded_data(data, categorical_variables=None):
    '''
    this function label encodes all the categorical variables using
    sklearn.preprocessing.LabelEncoder and returns a label encoded
    dataframe for training

    :data: input data [dataframe]
    :categorical_variables: list of categorical_variables [default: None]
        if None, it automatically takes the variables with data type 'object'
    '''
    encoders = {}

    df = data.copy()

    if categorical_variables is None:
        categorical_variables = [col for col in df.columns if df[col].dtype == 'object']

    for var in categorical_variables:
        #print(var)
        encoders[var] = __LabelEncoder__()
        df.loc[:, var] = encoders[var].fit_transform(df[var])

    return df, encoders
  • ce.get_embeddings(X_train, y_train, categorical_embedding_info=embedding_info, is_classification=True, epochs=100, batch_size=256): This function trains a shallow neural network and returns the embeddings of the categorical variables. Under the hood it is a two-layer architecture with 1000 and 500 ReLU-activated neurons. It requires four inputs: X_train, y_train, categorical_embedding_info (the output of the get_embedding_info function), and is_classification (True for classification tasks, False for regression tasks).

    For classification: loss = 'binary_crossentropy'; metrics = 'accuracy'. For regression: loss = 'mean_squared_error'; metrics = 'r2'.

def get_embeddings(X_train, y_train, categorical_embedding_info, is_classification, epochs=100, batch_size=256):
    '''
    this function trains a shallow neural network and returns embeddings of
    categorical variables

    :X_train: training data [dataframe]
    :y_train: target variable
    :categorical_embedding_info: output of get_embedding_info function
        [dictionary of categorical variable and its embedding size]
    :is_classification: True for classification tasks; False for regression tasks
    :epochs: num of epochs to train [default: 100]
    :batch_size: batch size to train [default: 256]

    It is a 2 layer neural network architecture with 1000 and 500 neurons
    with 'ReLU' activation

    for classification: loss = 'binary_crossentropy'; metrics = 'accuracy'
    for regression: loss = 'mean_squared_error'; metrics = 'r2'
    '''

    numerical_variables = [x for x in X_train.columns if x not in list(categorical_embedding_info.keys())]

    inputs = []
    flatten_layers = []

    for var, sz in categorical_embedding_info.items():
        input_c = Input(shape=(1,), dtype='int32')
        embed_c = Embedding(*sz, input_length=1)(input_c)
        flatten_c = Flatten()(embed_c)
        inputs.append(input_c)
        flatten_layers.append(flatten_c)
        #print(inputs)

    input_num = Input(shape=(len(numerical_variables),), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)

    flatten = concatenate(flatten_layers, axis=-1)

    fc1 = Dense(1000, kernel_initializer='normal')(flatten)
    fc1 = Activation('relu')(fc1)



    fc2 = Dense(500, kernel_initializer='normal')(fc1)
    fc2 = Activation('relu')(fc2)


    if is_classification:
        output = Dense(1, activation='sigmoid')(fc2)

    else:
        output = Dense(1, kernel_initializer='normal')(fc2)


    nnet = Model(inputs=inputs, outputs=output)

    x_inputs = []
    for col in categorical_embedding_info.keys():
        x_inputs.append(X_train[col].values)

    x_inputs.append(X_train[numerical_variables].values)

    if is_classification:
        loss = 'binary_crossentropy'
        metrics='accuracy'
    else:
        loss = 'mean_squared_error'
        metrics=r2



    nnet.compile(loss=loss, optimizer='adam', metrics=[metrics])
    nnet.fit(x_inputs, y_train.values, batch_size=batch_size, epochs=epochs, validation_split=0.2, callbacks=[TQDMNotebookCallback()], verbose=0)

    embs = list(map(lambda x: x.get_weights()[0], [x for x in nnet.layers if 'Embedding' in str(x)]))
    embeddings = {var: emb for var, emb in zip(categorical_embedding_info.keys(), embs)}
    return embeddings

Note that the code above is the library's source, shown only for reference.

3.2 Case study: used-car price prediction

The task is to build a used-car retail price prediction model from the given used-car transaction samples, which serve as training, validation and test data. First, look at the structure of the whole dataset and the meaning of each field. Observing the data, we find that some fields, such as dates and anonymous features, are unstructured.

Before we can build a machine learning (neural network) model, we therefore need to transform this unstructured data with the categorical embedding method. First, import the related libraries:

import pandas as pd
import numpy as np
import categorical_embedder as ce
from sklearn.model_selection import train_test_split
from keras import models
from keras import layers
import matplotlib.pyplot as plt
import csv

Observe the shape of the entire dataset and print the first five rows, as shown below:

train_data = pd.read_csv('train_estimate.csv')
train_data.shape
train_data.head()

We can see that fields such as 'tradeTime' and 'anonymousFeature12' are unstructured. We drop the first column, the order id carid, and the price column to prepare the input features, then identify the categorical variables and use the ce.get_embedding_info function to get their embedding information.

X = train_data.drop(['carid', 'price'], axis=1)
y = train_data['price']
embedding_info = ce.get_embedding_info(X)
embedding_info

The results are shown below:

Next, call ce.get_label_encoded_data. This function label-encodes (integer-encodes) all categorical variables using sklearn.preprocessing.LabelEncoder and returns a label-encoded dataframe, together with the fitted encoders, for training.

# Label encode the categorical variables
X_encoded, encoders = ce.get_label_encoded_data(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)

# ce.get_embeddings trains NN, extracts embeddings and return a dictionary containing the embeddings
embeddings = ce.get_embeddings(X_train, y_train, categorical_embedding_info=embedding_info, is_classification=True,epochs=100, batch_size=256)
embeddings
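The returned embeddings object is a dictionary mapping each categorical column to a numpy array of shape (number of unique categories, embedding size); a quick way to inspect it:

for col, matrix in embeddings.items():
    print(col, matrix.shape)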

As described above, this trains a two-layer neural network with 1000 and 500 ReLU-activated neurons. We can then check how the data has changed after encoding, as shown below:

X_encoded['tradeTime']

The next step is to merge the learned embeddings, currently stored in a dictionary, back into the dataset:

data = ce.fit_transform(X, embeddings=embeddings, encoders=encoders, drop_categorical_vars=True)
data.head()

At this point, the neural network encoding steps are complete. From the head of the data we can clearly see that everything has been converted to floating-point values, which satisfies the input requirements of a machine learning model. To further improve the model, additional processing such as standardization is still needed.
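As a minimal sketch of that standardization step (assuming the data frame returned by ce.fit_transform above), sklearn's StandardScaler could be used:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)  # data is the frame returned by ce.fit_transform above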

4. Summary

Machine learning models are sensitive to numerical variables but cannot work directly with unstructured data. Traditional categorical encoding methods limit the algorithm's ability to some extent. Under the right conditions, we can learn entirely new embeddings to improve model performance. Categorical embeddings usually perform well and help the model generalize better.

5. References

  1. Categorical Embedder: Encoding Categorical Variables via Neural Networks
  2. A Deep-Learned Embedding Technique for Categorical Features Encoding
  3. Deep embedding’s for categorical variables (Cat2Vec)
  4. Categorical Embedding and Transfer Learning

Recommended reading

  • Differential operator method
  • PyTorch is used to build a neural network model for handwriting recognition
  • PyTorch was used to build neural network models and back-propagation calculations
  • How to optimize model parameters and integrate models
  • TORCHVISION Target detection fine tuning tutorial
  • Neural network development recipes
  • Principal component analysis (PCA) method steps and code details