0x00 Abstract

In this article, we introduce HugeCTR, an industry-oriented recommendation system training framework optimized for large-scale CTR models with model parallel embedding and data-parallel dense networks.

This reading is based on the HugeCTR source code; thanks to the authors of this excellent framework.

Other articles in this series are as follows:

NVIDIA HugeCTR, GPU version parameter server –(1)

NVIDIA HugeCTR, GPU version parameter server — (2)

NVIDIA HugeCTR, GPU version parameter server –(3)

NVIDIA HugeCTR, GPU version parameter server — (4)

0x01 Previous review

Now that we’ve finished analyzing the HugeCTR pipeline, it’s time to look at the implementation of the embedding layer, which is the essence of HugeCTR.

Embedding plays a key role in modern deep learning-based recommendation architectures, encoding individual information for billions of entities (users, products, and their characteristics). As the amount of data grows, so does the size of the embedding tables, which now span from multiple gigabytes to terabytes. Training this type of DL system presents unique challenges because of its huge embedding tables and sparse access patterns that can span multiple GPUs.

HugeCTR contains an optimized embedding implementation that is up to 8 times faster than that of other frameworks. This optimized implementation is also available as a TensorFlow plug-in that works seamlessly with TensorFlow and serves as a convenient alternative to TensorFlow’s native embedding layer.

0x02 Embedding

2.1 Concept

We first briefly introduce the concept of embedding. Embedding is a machine learning technique that automatically transforms an ID/categorical feature into a feature vector to be optimized. In this way, an algorithm can be extended from exact matching to fuzzy matching, which improves its ability to generalize. Another way to look at it is that an embedding table is a specific type of key-value store, where the key is an ID that uniquely identifies an object and the value is a vector of real numbers.

Embedding is very popular in NLP, where it is used to represent words as dense numerical vectors. Words with similar meanings have similar embedding vectors.

Another advantage of embedding is that it transforms data from high dimensions to low dimensions.

2.1.1 One-hot encoding

For comparison, let’s first look at one-hot encoding. One-hot encoding ensures that, for a single feature in each sample, only one bit is in state 1 and all the others are 0. A concrete example follows. In the corpus, each of Hangzhou, Shanghai, Ningbo and Beijing corresponds to a vector in which only one value is 1 and the rest are 0.

Hangzhou  [0, 0, 0, 0, 0, 0, 0, 1, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Shanghai  [0, 0, 0, 0, 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Ningbo    [0, 0, 0, 1, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0]
Beijing   [0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 1, 0, 0, 0, 0, 0, 0]

Its disadvantages are:

  • The dimensions of a vector increase with the number of words; if the vectors corresponding to the names of all the cities in the world were put together into a matrix, the matrix would be far too sparse and would cause a dimensional disaster.
  • Because the city codes are assigned randomly, the vectors are independent of each other and cannot represent any semantic relationship between words.

Therefore, people want to make the following improvements to one-hot encoding:

  • Change each element of the vector from an integer to a floating-point value over the entire range of real numbers;
  • Transform the original sparse vector into a low-dimensional, continuous-valued vector, that is, a dense vector. This can be thought of as embedding the formerly sparse, huge-dimensional vector into a smaller-dimensional space, where words with similar meanings are mapped to nearby positions.

To put it simply, the goal is to find a spatial mapping that embeds a high-dimensional word vector into a lower-dimensional space.

2.1.2 Distributed representation

The basic idea of distributed representation is to express each word as a dense, continuous n-dimensional real vector. The relationships between these real vectors can then represent the similarity between words, for example through the angle between vectors or the Euclidean distance. In terms of a vocabulary, one-hot encoding is equivalent to numbering the words, while distributed representation compresses the words from a sparse, huge dimension into a lower-dimensional vector space.

The greatest contribution of distributed representation is that related or similar words end up closer to each other. Another difference from the one-hot representation is that the dimension is much lower: for a vocabulary of 1 million words, we can use a 100-dimensional real vector to represent a word, whereas one-hot requires 1 million dimensions. For example, the nine cities Hangzhou, Shanghai, Ningbo, Beijing, Guangzhou, Shenzhen, Shenyang, Xi’an and Luoyang are one-hot encoded as follows:

Hangzhou  [1, 0, 0, 0, 0, 0, 0, 0, 0]
Shanghai  [0, 1, 0, 0, 0, 0, 0, 0, 0]
Ningbo    [0, 0, 1, 0, 0, 0, 0, 0, 0]
Beijing   [0, 0, 0, 1, 0, 0, 0, 0, 0]
Guangzhou [0, 0, 0, 0, 1, 0, 0, 0, 0]
Shenzhen  [0, 0, 0, 0, 0, 1, 0, 0, 0]
Shenyang  [0, 0, 0, 0, 0, 0, 1, 0, 0]
Xi'an     [0, 0, 0, 0, 0, 0, 0, 1, 0]
Luoyang   [0, 0, 0, 0, 0, 0, 0, 0, 1]

The sparse one-hot representation consumes too much memory. Suppose we instead have the following 9 x 3 embedding table:


$$\left[ \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 14 & 15 & 16 \\ 17 & 18 & 19 \\ 21 & 22 & 23 \\ 24 & 25 & 26 \\ 27 & 28 & 29 \end{matrix} \right]$$

If we want to look up Hangzhou and Shanghai, their one-hot vectors are:


$$\left[ \begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$$

Using matrix multiplication, we can get:


$$\left[ \begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right] \times \left[ \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 14 & 15 & 16 \\ 17 & 18 & 19 \\ 21 & 22 & 23 \\ 24 & 25 & 26 \\ 27 & 28 & 29 \end{matrix} \right] = \left[ \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{matrix} \right]$$

This compresses two high-dimensional, discrete, sparse 1 x 9 vectors into two low-dimensional, dense 1 x 3 vectors. The position of the "1" in a one-hot vector is called the sparse ID, which is just a number. Multiplying the one-hot vectors by the embedding table is therefore equivalent to a table lookup by sparse ID: according to the sparse IDs of Hangzhou and Shanghai (0 and 1), the corresponding vectors (rows 0 and 1) are extracted from the embedding table. This turns the higher dimension into a lower one.
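
To make this equivalence concrete, here is a minimal NumPy sketch (the table values are the ones from the matrices above); multiplying the one-hot rows by the table returns exactly the rows selected by the sparse IDs:

import numpy as np

# 9 x 3 embedding table (same values as the matrix above)
table = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                  [10, 11, 12], [14, 15, 16], [17, 18, 19],
                  [21, 22, 23], [24, 25, 26], [27, 28, 29]])

# One-hot rows for Hangzhou (sparse ID 0) and Shanghai (sparse ID 1)
one_hot = np.zeros((2, 9))
one_hot[0, 0] = 1
one_hot[1, 1] = 1

sparse_ids = np.array([0, 1])

# Matrix multiplication and direct row indexing give the same result
assert np.array_equal(one_hot @ table, table[sparse_ids])
print(table[sparse_ids])  # [[1 2 3], [4 5 6]]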

2.1.3 The recommendation domain

In the field of recommendation systems, embedding represents each object of interest (user, product, category, and so on) as a dense vector of values. The simplest recommendation system is based on users and products: which products should you recommend to a user? You have the user ID and product ID as keys; the corresponding values are the user and product vectors, so you use two embedding tables.

Figure 1. An embedding table is a dense representation of sparse categories. Each category is represented by a vector; here the embedding dimension is 4. From Using Neural Networks for Your Recommender System.

2.2 Lookup

You can see another concept in the figure above: Lookup, which we’ll examine next.

The number of neurons in the embedding layer is determined by the embedding vector size and field_size, that is, the number of neurons is embedding_vec_size * field_size. The dense vector is the horizontal concatenation of the outputs of the embedding layer. Since the input feature is one-hot encoded, the embedding vectors are also the weights from the input layer to the Dense Embeddings layer, i.e., the weights of a fully connected layer.

The embedding weight matrix can be a dense matrix of shape [item_size, embedding_size], where item_size is the number of objects to be embedded and embedding_size is the length of the mapped vector, so the size of the matrix is the number of features * the embedding dimension. Each row of the embedding weight matrix corresponds to one dimension of the input (the dimension after one-hot encoding). The user can use an index to indicate which feature was selected.

Embedding_lookup is the operation that retrieves, from the embedding weight matrix, the embedding vector corresponding to a very high-dimensional input; that vector is the weight itself. Embedding_lookup can be implemented as the matrix multiplication V = WX + b, and since the input X is one-hot, this is equivalent to retrieving the corresponding row of the weight matrix, which looks like looking up an index table, hence the name lookup. Embedding can therefore be regarded as a fully connected layer in nature. For example:


$$\left[ \begin{matrix} 0 & 0 & 1 & 0 \end{matrix} \right] \times \left[ \begin{matrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \\ 16 & 17 & 18 & 19 & 20 \end{matrix} \right] = \left[ \begin{matrix} 11 & 12 & 13 & 14 & 15 \end{matrix} \right]$$

The weights of the embedding layer are learned by the neural network itself. The weight matrix is generally randomly initialized and is a variable to be optimized. When the neural network is trained, each embedding vector is updated, that is, the most suitable representation is gradually found during training. So embedding_lookup also needs to support back-propagation, i.e., automatic differentiation and updates of the weight matrix.
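
As a cross-check of this description, here is a minimal NumPy sketch (names and shapes are made up) of how a lookup layer and its backward pass touch only the selected rows of the weight matrix:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10, 4
W = rng.normal(size=(vocab_size, emb_dim))  # randomly initialized, trainable weight matrix

ids = np.array([2, 7, 2])                   # sparse IDs appearing in one batch

# Forward: embedding_lookup is just row selection (equivalent to one_hot @ W)
emb = W[ids]                                # shape (3, 4)

# Backward: suppose dL/d(emb) comes back from the layers above
grad_emb = np.ones_like(emb)

# Only the looked-up rows of W receive gradient; duplicated IDs accumulate
grad_W = np.zeros_like(W)
np.add.at(grad_W, ids, grad_emb)

lr = 0.1
W -= lr * grad_W                            # rows 2 and 7 are updated, all other rows untouched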

2.3 Embedding layer

The embedding layer is a key module of modern deep learning recommendation systems. It usually sits after the input layer and before the feature-interaction and dense layers. Like the other layers of a deep neural network, the embedding layer is learned from data through end-to-end training. Let’s look at how the embedding layer is used.

2.3.1 Dot product

We can compute the dot product between the user embedding and the item embedding to get the final score, the likelihood that the user interacts with the item. You can apply the sigmoid activation function as a final step to convert the output to a probability between 0 and 1.


$$\text{dot product}: u \cdot v = \sum_i u_i \cdot v_i$$

Figure 2. Neural network with two embedded tables and dot product outputs. From Using Neural Networks for Your Recommender System.

This method is equivalent to matrix factorization or alternating least squares (ALS).
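
A minimal NumPy sketch of this scoring scheme (the table sizes and IDs are made up for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
emb_dim = 4
user_table = rng.normal(size=(1000, emb_dim))  # user embedding table
item_table = rng.normal(size=(5000, emb_dim))  # item embedding table

user_ids = np.array([3, 42])
item_ids = np.array([7, 99])

u = user_table[user_ids]          # (2, emb_dim)
v = item_table[item_ids]          # (2, emb_dim)

score = np.sum(u * v, axis=1)     # dot product: sum_i u_i * v_i
prob = sigmoid(score)             # probability that the user interacts with the item
print(prob)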

2.3.2 Fully connected layers

The neural network performs better if multiple nonlinear layers are used to build a deep structure. You can extend the previous model by feeding the output of the embedding layer to multiple fully connected layers with ReLU activations. One design choice here is how to combine the two embedding vectors: you can concatenate the embedding vectors, or you can multiply them element-wise, similar to the dot product. The combined output is then followed by multiple hidden layers.

Figure 3. A neural network with two embedding tables and multiple fully connected layers; the user and product embeddings are concatenated or multiplied element-wise, and multiple hidden layers process the resulting vector. From Using Neural Networks for Your Recommender System.

2.3.3 Metadata Information

So far, we’ve only used user IDs and product IDs as input, but we usually have more information available. Other information about the user could be gender, age, city (address), time since last visit, or the credit card used to pay. An item usually has a brand, price, category, or the quantity sold in the last seven days. This auxiliary information can help the model generalize better. We can modify the neural network to use the additional features as inputs.

Figure 4. A neural network with meta information and multiple fully connected layers. We add more information to the neural network architecture, such as auxiliary information like city, age, brand, category, and price. From Using Neural Networks for Your Recommender System.

So far, we know that we can get a basic recommender system network based on the embedding layer.

2.3.4 Classical Architecture

Next, let’s look at Google’s Wide and Deep in 2016 and Facebook’s DLRM in 2019.

2.3.4.1 Google’s Wide and Deep

Google’s Wide and Deep contains two components:

  • Compared with the previous neural networks, the Wide part is a new component: a linear combination of the input features, with logic similar to linear/logistic regression.
  • The Deep part processes the high-dimensional, sparse categorical features into low-dimensional dense vectors through the embedding layer and concatenates the output with the continuous-valued features. The concatenated vector is then passed to the MLP layers.

The outputs of the two parts are weighted together to obtain the final predicted value.

For a recommender system model, the ideal is to have both memorization ability and generalization ability.

  • Generalization ability refers to the model’s ability to make reasonable predictions for feature combinations that are rare or have never been observed.

    • For example, a learned rule might be: things with wings can fly, so the model can infer that a magpie can fly. This is generalization ability.
    • The Deep part provides excellent generalization by generalizing from user/item characteristics.
  • Memorization ability means the model can remember a large number of historical behavior features, learn the commonalities of these features from the historical data, and use them as the basis for recommendation.

    • Generalization has blind spots; for example, penguins cannot fly.
    • The Wide part provides excellent memorization through information such as history, which corrects such exceptions. If a simple model such as logistic regression finds a "strong feature", it will adjust the corresponding weight to a very large value during training, thereby memorizing this feature; this is the so-called "memorization ability" of the model.
    • In the Wide & Deep paper (arxiv.org/abs/1606.07…), a cross-product feature (for example, one involving impression_app = Pandora) is set to 1, in the hope that the model will memorize the rule "if the user has already installed application A, will they install application B?". A minimal sketch of such a cross-product feature follows this list.
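
As referenced above, here is a minimal sketch of such a cross-product (memorization) feature, with made-up field names and values:

def cross_feature(sample, field_a, value_a, field_b, value_b):
    # AND(field_a == value_a, field_b == value_b) style cross-product:
    # the feature is 1 only when both conditions hold, 0 otherwise
    return int(sample.get(field_a) == value_a and sample.get(field_b) == value_b)

sample = {"installed_app": "A", "impression_app": "Pandora"}
print(cross_feature(sample, "installed_app", "A",
                    "impression_app", "Pandora"))  # 1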

Therefore, the Wide and Deep model combines the advantages of logistic regression and deep neural networks: it can memorize a large number of historical behaviors and also has strong expressive power.

In HugeCTR, you can see that models are organized in the following way.

2.3.4.2 Facebook’s DLRM

Facebook’s Deep Learning Recommendation Model (DLRM) has a structure similar to the neural network architecture with metadata above, but with some differences. A dataset can contain multiple categorical features. DLRM requires that all categorical inputs be fed through embedding layers with the same output dimension.

Next, the continuous inputs are concatenated and fed through multiple fully connected layers, called the bottom multilayer perceptron (MLP). The last layer of the bottom MLP has the same dimension as the embedding vectors.

DLRM uses a new combination layer. It applies element-wise multiplication between all pairs of embedding vectors and the bottom MLP output, which is why every vector has the same dimension. The resulting vectors are concatenated and sent to another set of fully connected layers (the top MLP).

Figure 5. The Wide and Deep architecture is visualized on the left and the DLRM architecture on the right.

Using Neural Networks for Your Recommender System.

2.4 Embedding layer of recommendation system

In the CTR domain, features are high-dimensional and sparse, so discrete features are generally one-hot encoded. However, feeding one-hot features directly into a DNN results in too many network parameters. For example, if there are 10 million nodes in the input layer and 500 nodes in the hidden layer, there will be 5 billion parameters.
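
The parameter count follows directly from multiplying the two layer sizes:

$$10{,}000{,}000 \times 500 = 5 \times 10^{9}$$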

Therefore, an embedding layer is added to reduce the dimension, which compresses the sparse vector of a single feature into a dense one. But sometimes this is still not enough, so features can further be grouped into different fields.

2.4.1 Characteristics

Compared to other types of DL models, the embedding layers of DL recommendation models are special: they contribute a huge number of parameters to the model but require little computation, whereas the computation-intensive dense layers have a much smaller number of parameters.

To take a specific example: the original Wide and Deep model has dense layers of size [1024, 512, 256], so these dense layers have only a few million parameters, while the embedding layer can have billions of entries and therefore billions of parameters. This contrasts with, for example, the BERT model architecture popular in NLP, where the embedding layer has only tens of thousands of entries, totaling millions of parameters, but the dense feed-forward and attention layers consist of hundreds of millions of parameters. This difference also leads to another observation: the amount of computation per byte of input data for a DL recommendation network is typically much smaller than for other types of DL models.

2.4.2 Importance of optimized embedding

For a recommendation system, optimization of the embedding layer is very important. To understand why optimization of the embedding layer and related operations matters, we first need to look at the challenges of embedding layer training for recommendation systems: data volume and access speed.

2.4.2.1 Data volume

As online platforms and services gain hundreds of millions or even billions of users and offer billions of unique products, it’s not surprising that embedding tables keep getting bigger.

Instagram has reportedly been working on a 10-terabyte recommendation model. Similarly, an advertising ranking model at Baidu has also reached the 10-TB realm. Across the industry, models from hundreds of gigabytes to terabytes are becoming increasingly common, such as Pinterest’s 4-TB model and Google’s 1.2-TB model. Fitting a terabyte-scale model on a single compute node, let alone a single compute accelerator such as a GPU, is therefore a major challenge; an NVIDIA A100 GPU comes with only 80 GB of HBM.
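
To see why a single GPU cannot hold such a table, consider a hypothetical embedding table with 1 billion rows and an embedding dimension of 128 stored in fp32:

$$10^{9} \times 128 \times 4\ \text{bytes} = 512\ \text{GB} \gg 80\ \text{GB (A100 HBM)}$$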

2.4.2.2 Access speed

Training recommendation systems is essentially a memory-bandwidth-intensive task. This is because each training sample or batch usually involves only a small number of entries in the embedding table. These entries must be retrieved to compute the forward pass and then updated in the backward pass.

CPU main memory has large capacity but limited bandwidth; high-end servers typically sit in the range of tens of GB/s. GPUs, on the other hand, have limited memory capacity but high bandwidth: an NVIDIA A100 80-GB GPU provides 2 TB/s of memory bandwidth.

2.4.3 Solution

Some solutions have been found to these challenges, but they all have some problems, such as:

  • Keeping the entire embedding table in main memory solves the size problem. However, it typically results in extremely slow training throughput, which is often dwarfed by the volume and velocity of new data, so the system cannot be retrained in a timely manner.
  • Alternatively, the embedding layer can be distributed across multiple GPUs and multiple nodes, but this leads to a communication bottleneck, resulting in low GPU compute utilization and training performance comparable to pure CPU training.

Therefore, the embedding layer is one of the main bottlenecks of recommendation systems, and optimizing it is the key to unlocking the high compute throughput of GPUs.

0x03 DeepFM

Since we are using DeepFM as an example, we need to cover the basics.

We choose the content of IJCAI 2017 paper DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction for analysis.

3.1 Characteristics of CTR data

CTR estimation data has the following characteristics:

  • The input data can be categorical or continuous. Categorical data is one-hot encoded; continuous data can be discretized first and then one-hot encoded, or the original value can be kept.
  • The dimensions of data are very high.
  • The data is very sparse.
  • Features are grouped by Field.

CTR estimation focuses on learning feature interactions (combined features). The conclusion of the Google paper is that both high-order and low-order feature combinations are very important and should be learned at the same time. So the key question is how to extract these combined features efficiently.

3.2 DeepFM

DeepFM improves on Google’s Wide & Deep model:

  • The Wide part of Wide & Deep is changed from manual feature engineering + LR to an FM model. FM extracts low-order feature combinations and the Deep part extracts high-order combinations; this avoids manual feature engineering and improves the generalization ability of the model.
  • The FM model and the Deep part share the same embeddings, which makes DeepFM an end-to-end model and improves training efficiency: training is faster and more accurate. (A small sketch of this shared-embedding FM computation follows below.)
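
A minimal NumPy sketch (with made-up shapes) of this shared-embedding idea: the FM second-order term is computed from the same embedding vectors that feed the Deep part, using the identity 0.5 * ((sum_i e_i)^2 - sum_i e_i^2):

import numpy as np

rng = np.random.default_rng(0)
batch, num_fields, dim = 2, 26, 16
emb = rng.normal(size=(batch, num_fields, dim))        # shared embeddings, one vector per field

# FM second-order interactions: 0.5 * ((sum_i e_i)^2 - sum_i e_i^2), then summed over dim
sum_sq = np.square(emb.sum(axis=1))                    # (batch, dim)
sq_sum = np.square(emb).sum(axis=1)                    # (batch, dim)
fm_second_order = 0.5 * (sum_sq - sq_sum).sum(axis=1)  # (batch,)

# The Deep part consumes the same embeddings, flattened and fed to an MLP
deep_input = emb.reshape(batch, num_fields * dim)      # (batch, 416)
print(fm_second_order.shape, deep_input.shape)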

The specific model architecture is as follows:

0x04 HugeCTR Embedding Layer

To overcome the embedding challenges and enable faster training, HugeCTR implements its own embedding layer, which includes a GPU-accelerated hash table, an efficient sparse optimizer implemented in a memory-saving manner, and various embedding distribution strategies. It uses NCCL as its inter-GPU communication primitive.

4.1 Hash table

The hash table implementation is based on RAPIDS cuDF, a GPU DataFrame library that is part of NVIDIA’s RAPIDS data science platform. The cuDF GPU hash table can be up to 35 times faster than a concurrent_hash_map based on Threading Building Blocks (TBB).

4.2 Model Parallelism

HugeCTR provides a model-parallel embedding table that is distributed across all GPUs in a cluster consisting of multiple nodes and multiple GPUs. The dense layers, on the other hand, use data parallelism, with one copy on each GPU.

For extensibility, HugeCTR supports model parallelism at the embedding layer by default.

Figure 6. HugeCTR model and data parallel architecture. From developer.nvidia.com/blog/announ…

4.3 communication

HugeCTR uses NCCL for fast, scalable inter-node and intra-node communication. For cases with many input features, a HugeCTR embedding table can be split into multiple slots. Features belonging to the same slot are independently converted into their embedding vectors and then reduced into a single embedding vector. This allows users to effectively reduce the number of effective features in each slot to a manageable level.

4.4 Verification

From github.com/NVIDIA/Deep… we can see hybrid parallelism, which can be cross-checked against HugeCTR (DLRM is used here only for verification, because we use DeepFM for the source analysis).

4.4.1 DLRM

In DLRM, to handle categorical data, the embedding layer maps each category to a dense representation, which is then fed into a multilayer perceptron (MLP). Numerical features can be fed directly into an MLP. At the next level, second-order interactions of different features are computed explicitly by taking the dot product between all pairs of embedding vectors and the processed dense features. These pairwise interactions are fed into a top-level MLP to compute the likelihood of interaction between the user and item pair.

DLRM differs from other DL-based recommendation methods in two ways. First, it computes feature interactions explicitly and limits the order of interaction to pairwise interactions. Second, DLRM treats each embedding feature vector (corresponding to a categorical feature) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit, resulting in different cross terms. These design choices help reduce computation and memory cost while providing considerable accuracy.
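
A rough NumPy sketch of the pairwise dot-product interaction described above (shapes are illustrative, not DLRM's real configuration):

import numpy as np

rng = np.random.default_rng(0)
batch, num_emb, dim = 2, 3, 4

# num_emb categorical embeddings plus the bottom-MLP output, all with the same dimension
emb = rng.normal(size=(batch, num_emb, dim))
bottom_mlp_out = rng.normal(size=(batch, 1, dim))
features = np.concatenate([emb, bottom_mlp_out], axis=1)  # (batch, num_emb + 1, dim)

# Dot product between every pair of vectors -> (batch, num_emb + 1, num_emb + 1)
interactions = features @ features.transpose(0, 2, 1)

# Keep each pair once (upper triangle), then feed the result to the top MLP
iu = np.triu_indices(num_emb + 1, k=1)
pairwise = interactions[:, iu[0], iu[1]]                  # (batch, number_of_pairs)
top_mlp_input = np.concatenate([bottom_mlp_out[:, 0, :], pairwise], axis=1)
print(top_mlp_input.shape)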

Figure from developer.nvidia.com/blog/announ…

4.4.2 Mixed parallelism

Many recommendation models contain very large embedding tables. As a result, the model is often too large to fit on a single device. This can be easily solved by using CPUs or other GPUs as "memory donors" and training in a model-parallel manner. However, this approach is suboptimal because the compute capability of the "memory donor" devices is not utilized.

For DLRM, we use the model-parallel approach for the bottom part of the model (embedding tables + bottom MLP) and the usual data-parallel approach for the top part of the model (dot interaction + top MLP). This way, we can train much larger models than would normally fit on a single GPU, while making training faster by using multiple GPUs at the same time. We call this approach hybrid parallelism.

Figure from github.com/NVIDIA/Deep… .

4.5 Usage

The HugeCTR embedding layer can be used in two ways:

  • Use the native NVIDIA Merlin HugeCTR framework for training and inference.
  • Use the NVIDIA Merlin HugeCTR TensorFlow plug-in, which is designed to work seamlessly with TensorFlow.

0x05 Initialization

5.1 Configuration

As mentioned above, the network configuration can be written in code. As can be seen below, this DeepFM model has two embedding layers: one maps the wide_data sparse input to dense vectors, and the other maps the deep_data sparse input to dense vectors.

model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
                        hugectr.DataReaderSparseParam("deep_data", 2, False, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 23,
                            embedding_vec_size = 1,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "wide_data",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 358,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "deep_data",
                            optimizer = optimizer))

5.1.1 DataReaderSparseParam

The DataReaderSparseParam is defined as follows, where slot_num represents the number of slots in this layer. For example, hugectr.DataReaderSparseParam("deep_data", 2, False, 26) means there are 26 slots.

struct DataReaderSparseParam {
  std::string top_name;
  std::vector<int> nnz_per_slot;
  bool is_fixed_length;
  int slot_num;

  DataReaderSparse_t type;
  int max_feature_num;
  int max_nnz;

  DataReaderSparseParam() {}
  DataReaderSparseParam(const std::string& top_name_, const std::vector<int>& nnz_per_slot_,
                        bool is_fixed_length_, int slot_num_)
      : top_name(top_name_),
        nnz_per_slot(nnz_per_slot_),
        is_fixed_length(is_fixed_length_),
        slot_num(slot_num_),
        type(DataReaderSparse_t::Distributed) {
    if (static_cast<size_t>(slot_num_) != nnz_per_slot_.size()) {
      CK_THROW_(Error_t::WrongInput, "slot num != nnz_per_slot.size().");
    }
    max_feature_num = std::accumulate(nnz_per_slot.begin(), nnz_per_slot.end(), 0);
    max_nnz = *std::max_element(nnz_per_slot.begin(), nnz_per_slot.end());
  }

  DataReaderSparseParam(const std::string& top_name_, const int nnz_per_slot_,
                        bool is_fixed_length_, int slot_num_)
      : top_name(top_name_),
        nnz_per_slot(slot_num_, nnz_per_slot_),
        is_fixed_length(is_fixed_length_),
        slot_num(slot_num_),
        type(DataReaderSparse_t::Distributed) {
    max_feature_num = std::accumulate(nnz_per_slot.begin(), nnz_per_slot.end(), 0);
    max_nnz = *std::max_element(nnz_per_slot.begin(), nnz_per_slot.end());
  }
};

5.1.2 The slot concept

As we know from the documentation, the concept of slot is a feature field or table.

In HugeCTR, a slot is a feature field or table. The features in a slot can be one-hot or multi-hot. The number of features in different slots can be various. You can specify the number of slots (`slot_num`) in the data layer of your configuration file.

A field or slot (also called a feature group in some articles) is a collection of several associated features. Its main function is to put related features together into a feature field and then convert the field into a dense vector, which reduces the size of the DNN input layer and the number of model parameters.

For example, the products a user has viewed and the products a user has purchased are two fields, and each product is a feature. These products share the same product list, i.e., the same vocabulary. If each product were embedded separately and the tensors concatenated, the DNN input layer would be too large. Therefore, the purchased products are grouped into the same field, and a field vector is obtained by pooling the embedding vectors of these products, so the number of inputs becomes much smaller.

5.1.2.1 FLEN

The paper FLEN: Leveraging Field for Scalable CTR Prediction gives several examples, along with some nice illustrations, that help explain the concept:

The data in a CTR prediction task is multi-field categorical data, that is, every feature is categorical and belongs to exactly one field. For example, the feature "gender=Female" belongs to the field "gender", the feature "age=24" belongs to the field "age", and the feature "item category=cosmetics" belongs to the field "item category". The value of the feature "gender" is "male" or "female". The feature "age" is divided into several age groups: "0-18", "18-25", "25-30", and so on. Feature conjunctions are the key to predicting clicks correctly. An example of an informative feature conjunction is the age group "18-25" combined with the gender "female" for the item category "cosmetics": it indicates that young women are more likely to click on cosmetics products.

The FLEN model uses field-wise embedding vectors: the embedding vectors within the same field (such as the user field or the item field) are summed to obtain the corresponding field-wise embedding vector.

For example, the feature $x_n$ is first converted into an embedding vector $e_n$:


$$e_n = V_n x_n$$

Secondly, field-wise embedding vectors are obtained by sum-pooling.


$$e_m = \sum_{n|F(n)=m} e_n$$


Finally, all the field-wise embedding vectors are concatenated together.

The overall system architecture is as follows:

5.1.2.2 Pooling

How is the pooling done? HugeCTR supports two operations, sum and mean, selected by the combiner parameter, for example:

// do sum reduction
if (combiner == 0) {
  forward_sum(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
              hash_value_index.get_ptr(), hash_table_value.get_ptr(),
              embedding_feature.get_ptr(), stream);
} else if (combiner == 1) {
  forward_mean(batch_size, slot_num, embedding_vec_size, row_offset.get_ptr(),
               hash_value_index.get_ptr(), hash_table_value.get_ptr(),
               embedding_feature.get_ptr(), stream);
} 

Combined with the previous figure, the category feature has a total of M slots, corresponding to M embedded tables.

For example, in test/pybind_test/din_matmul_fp32_1gpu.py, CateID has 11 slots.

model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 0, dense_name = "dense",
                        data_reader_sparse_param_array =
                        [hugectr.DataReaderSparseParam("UserID", 1, True, 1),
                        hugectr.DataReaderSparseParam("GoodID", 1, True, 11),
                        hugectr.DataReaderSparseParam("CateID", 1, True, 11)]))

For example, in the figure below, a sample has seven keys, which are divided into three fields, i.e., three slots: four keys fall into the first slot, three keys fall into the second slot, and the third slot has no keys. During the lookup, the values corresponding to these keys are retrieved. A sum or mean operation is performed on the values in the first slot to obtain V1, and a sum or mean operation is performed on the three values in the second slot to obtain V2. Finally, V1 and V2 are concatenated and sent to the subsequent layers.
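
A small NumPy sketch of the process described above (the keys, slot assignment, and table values are made up):

import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4
table = rng.normal(size=(100, emb_dim))       # embedding table with 100 keys

# One sample: 7 keys split into slots (the third slot happens to be empty)
slots = [np.array([11, 25, 3, 70]),           # slot 1: 4 keys
         np.array([5, 8, 42]),                # slot 2: 3 keys
         np.array([], dtype=np.int64)]        # slot 3: no key

def combine(keys, combiner="sum"):
    if keys.size == 0:
        return np.zeros(emb_dim)              # an empty slot contributes a zero vector
    values = table[keys]                      # look up the values for these keys
    return values.sum(axis=0) if combiner == "sum" else values.mean(axis=0)

v = [combine(keys, "sum") for keys in slots]  # V1, V2, V3
sample_vector = np.concatenate(v)             # concat and send to the subsequent layers
print(sample_vector.shape)                    # (12,) = 3 slots * emb_dim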

5.1.2.3 TensorFlow

We can see some uses of pooling in the comments of the TF source code.

tensorflow/python/ops/embedding_ops.py describes the use of embedding:

    combiner: A string specifying the reduction op. Currently "mean", "sqrtn"
      and "sum" are supported. "sum" computes the weighted sum of the embedding
      results for each row. "mean" is the weighted sum divided by the total
      weight. "sqrtn" is the weighted sum divided by the square root of the sum
      of the squares of the weights. Defaults to `mean`.

tensorflow/python/feature_column/feature_column.py describes the use of feature columns:

sparse_combiner: A string specifying how to reduce if a categorical column
  is multivalent. Except `numeric_column`, almost all columns passed to
  `linear_model` are considered as categorical columns.  It combines each
  categorical column independently. Currently "mean", "sqrtn" and "sum" are
  supported, with "sum" the default for linear model. "sqrtn" often achieves
  good accuracy, in particular with bag-of-words columns.
    * "sum": do not normalize features in the column
    * "mean": do l1 normalization on features in the column
    * "sqrtn": do l2 normalization on features in the column

The comments in tensorflow/lite/kernels/embedding_lookup_sparse.cc directly describe how the lookup is performed on the embedding table:

// Op that looks up items from a sparse tensor in an embedding matrix.
// The sparse lookup tensor is represented by three individual tensors: lookup,
// indices, and dense_shape. The representation assume that the corresponding
// dense tensor would satisfy:
// * dense.shape = dense_shape
// * dense[tuple(indices[i])] = lookup[i]
//
// By convention, indices should be sorted.
//
// Options:
// combiner: The reduction op (SUM, MEAN, SQRTN).
// * SUM computes the weighted sum of the embedding results.
// * MEAN is the weighted sum divided by the total weight.
// * SQRTN is the weighted sum divided by the square root of the sum of the
// squares of the weights.
//
// Input:
// Tensor[0]: Ids to lookup, dim.size == 1, int32.
// Tensor[1]: Indices, int32.
// Tensor[2]: Dense shape, int32.
// Tensor[3]: Weights to use for aggregation, float.
// Tensor[4]: Params, a matrix of multi-dimensional items,
// dim.size >= 2, float.
//
// Output:
// A (dense) tensor representing the combined embeddings for the sparse ids.
// For each row in the sparse tensor represented by (lookup, indices, shape)
// the op looks up the embeddings for all ids in that row, multiplies them by
// the corresponding weight, and combines these embeddings as specified in the
// last dimension.
//
// Output.dim = [l0, ... , ln-1, e1, ..., em]
// Where dense_shape == [l0, ..., ln] and Tensor[4].dim == [e0, e1, ..., em]
//
// For instance, if params is a 10x20 matrix and ids, weights are:
//
// [0, 0]: id 1, weight 2.0
// [0, 1]: id 3, weight 0.5
// [1, 0]: id 0, weight 1.0
// [2, 3]: id 1, weight 3.0
//
// with combiner=MEAN, then the output will be a (3, 20) tensor where:
//
// output[0, :] = (params[1, :] * 2.0 + params[3, :] * 0.5) / (2.0 + 0.5)
// output[1, :] = (params[0, :] * 1.0) / 1.0
// output[2, :] = (params[1, :] * 3.0) / 3.0
//
// When indices are out of bound, the op will not succeed.
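
To make the combiner semantics concrete, here is a small NumPy reproduction of the MEAN example from the comment above (params is just an arbitrary 10x20 matrix):

import numpy as np

params = np.arange(200, dtype=np.float64).reshape(10, 20)  # an arbitrary 10x20 embedding matrix

# (row, id, weight) triples taken from the example in the comment
rows    = np.array([0, 0, 1, 2])
ids     = np.array([1, 3, 0, 1])
weights = np.array([2.0, 0.5, 1.0, 3.0])

output = np.zeros((3, 20))
for r in range(3):
    mask = rows == r
    weighted_sum = (params[ids[mask]] * weights[mask][:, None]).sum(axis=0)
    output[r] = weighted_sum / weights[mask].sum()          # combiner=MEAN

# output[0] == (params[1] * 2.0 + params[3] * 0.5) / (2.0 + 0.5), and so on
print(output[0][:3])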

In addition, other frameworks and model implementations use weighted averages (e.g., attention) or add temporal information.

5.2 Build

This is a complicated process. As we saw earlier, the DataReader ends up copying the various inputs into its member variable output_.

So how does an embedded layer take advantage of sparse features in output_? We need to look at it step by step.

5.2.1 Pipeline

The pipeline is built in parser.cpp with the following code (much of it omitted). You can see that create_datareader is called first to build the reader, and then the embeddings are built.

template <typename TypeKey>
void Parser::create_pipeline_internal(std::shared_ptr<IDataReader>& init_data_reader,
                                      std::shared_ptr<IDataReader>& train_data_reader,
                                      std::shared_ptr<IDataReader>& evaluate_data_reader,
                                      std::vector<std::shared_ptr<IEmbedding>>& embeddings,
                                      std::vector<std::shared_ptr<Network>>& networks,
                                      const std::shared_ptr<ResourceManager>& resource_manager,
                                      std::shared_ptr<ExchangeWgrad>& exchange_wgrad) {
  try {
    std::map<std::string, SparseInput<TypeKey>> sparse_input_map;
    {
      // Create Data Reader
      {
        create_datareader<TypeKey>()(j, sparse_input_map, train_tensor_entries_list,
                                     evaluate_tensor_entries_list, init_data_reader,
                                     train_data_reader, evaluate_data_reader, batch_size_,
                                     batch_size_eval_, use_mixed_precision_, repeat_dataset_,
                                     enable_overlap, resource_manager);
      }  // Create Data Reader

      // Create Embedding ----
      {
        for (unsigned int i = 1; i < j_layers_array.size(); i++) {
          if (use_mixed_precision_) {
            create_embedding<TypeKey, __half>()(
                sparse_input_map, train_tensor_entries_list, evaluate_tensor_entries_list,
                embeddings, embedding_type, config_, resource_manager, batch_size_,
                batch_size_eval_, exchange_wgrad, use_mixed_precision_, scaler_, j, use_cuda_graph_,
                grouped_all_reduce_);
          } else {
            create_embedding<TypeKey, float>()(
                sparse_input_map, train_tensor_entries_list, evaluate_tensor_entries_list,
                embeddings, embedding_type, config_, resource_manager, batch_size_,
                batch_size_eval_, exchange_wgrad, use_mixed_precision_, scaler_, j, use_cuda_graph_,
                grouped_all_reduce_);
          }
        }  // for ()
      }    // Create Embedding

      // create network
      for (size_t i = 0; i < resource_manager->get_local_gpu_count(); i++) {
        networks.emplace_back(Network::create_network(
            j_layers_array, j_optimizer, train_tensor_entries_list[i],
            evaluate_tensor_entries_list[i], total_gpu_count, exchange_wgrad,
            resource_manager->get_local_cpu(), resource_manager->get_local_gpu(i),
            use_mixed_precision_, enable_tf32_compute_, scaler_, use_algorithm_search_,
            use_cuda_graph_, false, grouped_all_reduce_));
      }
    }
    exchange_wgrad->allocate();
  } catch (const std::runtime_error& rt_err) {
    std::cerr << rt_err.what() << std::endl;
    throw;
  }
}

5.2.2 Building the DataReader

Let’s see how the reader is built (again, most of the code is omitted). The sparse input is mainly handled as follows:

  • The parameter sparse_input_map of create_datareader is passed by reference.
  • sparse_input_map is populated based on the configuration.
  • sparse_input is found in the map by its parameter name.
  • The sparse_tensors_map of the reader's output_ is assigned to sparse_input. That is this line of code: copy(data_reader_tk->get_sparse_tensors(sparse_name), sparse_input->second.train_sparse_tensors);
  • In this way, sparse_input->second.train_sparse_tensors is set to point at the reader's data.
  • sparse_input_map is then passed to create_embedding.

The key points are shown in the code comments below.

template <typename TypeKey>
void create_datareader<TypeKey>::operator() (const nlohmann::json& j, std::map<std::string, SparseInput<TypeKey>>& sparse_input_map,
    std::vector<TensorEntry>* train_tensor_entries_list,
    std::vector<TensorEntry>* evaluate_tensor_entries_list,
    std::shared_ptr<IDataReader>& init_data_reader, std::shared_ptr<IDataReader>& train_data_reader,
    std::shared_ptr<IDataReader>& evaluate_data_reader, size_t batch_size, size_t batch_size_eval,
    bool use_mixed_precision, bool repeat_dataset, bool enable_overlap,
    const std::shared_ptr<ResourceManager> resource_manager) {

  std::vector<DataReaderSparseParam> data_reader_sparse_param_array;

  // Set sparse_input_map based on the configuration
  for (unsigned int i = 0; i < j_sparse.size(); i++) {
    DataReaderSparseParam param{sparse_name, nnz_per_slot_vec, is_fixed_length, slot_num};
    data_reader_sparse_param_array.push_back(param);
    SparseInput<TypeKey> sparse_input(param.slot_num, param.max_feature_num);
    sparse_input_map.emplace(sparse_name, sparse_input);
    sparse_names.push_back(sparse_name);
  }

  if (format == DataReaderType_t::RawAsync) {
    // Omitted
  } else {
    DataReader<TypeKey>* data_reader_tk = new DataReader<TypeKey>(......);
    for (unsigned int i = 0; i < j_sparse.size(); i++) {
      const std::string& sparse_name = sparse_names[i];
      // Find sparse input by name
      const auto& sparse_input = sparse_input_map.find(sparse_name);

      auto copy = [](const std::vector<SparseTensorBag>& tensorbags,
                     SparseTensors<TypeKey>& sparse_tensors) {
        sparse_tensors.resize(tensorbags.size());
        for (size_t j = 0; j < tensorbags.size(); ++j) {
          sparse_tensors[j] = SparseTensor<TypeKey>::stretch_from(tensorbags[j]);
        }
      };
      // The key is to assign sparse_tensors_map of reader output_ to sparse_input
      copy(data_reader_tk->get_sparse_tensors(sparse_name),
           sparse_input->second.train_sparse_tensors);
      copy(data_reader_eval_tk->get_sparse_tensors(sparse_name),
           sparse_input->second.evaluate_sparse_tensors);
    }
  }
}

The get_sparse_tensors code is as follows:

const std::vector<SparseTensorBag> &get_sparse_tensors(const std::string &name) {
  if (output_->sparse_tensors_map.find(name) == output_->sparse_tensors_map.end()) {
    CK_THROW_(Error_t::IllegalCall, "no such sparse output in data reader:" + name);
  }
  return output_->sparse_tensors_map[name];
}

So the logic chain is as follows:

  • A) create_pipeline_internal generates sparse_input_map and passes it as a parameter to create_datareader.
  • B) create_datareader points sparse_input_map at DataReader.output_.sparse_tensors_map.
  • C) sparse_input_map is then passed to create_embedding, as in: create_embedding<TypeKey, float>()(sparse_input_map, ......)
  • D) create_embedding uses sparse_input_map to build the embedding on the GPU.

5.2.3 Building the embedding

The following code builds the embedding.

        create_embedding<TypeKey, float>()(
            sparse_input_map, train_tensor_entries_list, evaluate_tensor_entries_list,
            embeddings, embedding_type, config_, resource_manager, batch_size_,
            batch_size_eval_, exchange_wgrad, use_mixed_precision_, scaler_, j, use_cuda_graph_,
            grouped_all_reduce_);

Here several kinds of embeddings can be built, such as DistributedSlotSparseEmbeddingHash; the embeddings are finally saved in the Session's embeddings_ member variable. There are several key points here:

  • sparse_input information is obtained from sparse_input_map.
  • sparse_input.train_sparse_tensors is used to construct DistributedSlotSparseEmbeddingHash, so the DataReader.output_ member variable is tied to DistributedSlotSparseEmbeddingHash. This is the input of the embedding.
  • The output parameter train_tensor_entries_list, which is a pointer, is filled in and returned as the output of the embedding.
template <typename TypeKey, typename TypeFP>
static void create_embeddings(std::map<std::string, SparseInput<TypeKey>>& sparse_input_map,
                              std::vector<TensorEntry>* train_tensor_entries_list,
                              std::vector<TensorEntry>* evaluate_tensor_entries_list,
                              std::vector<std::shared_ptr<IEmbedding>>& embeddings,
                              Embedding_t embedding_type, const nlohmann::json& config,
                              const std::shared_ptr<ResourceManager>& resource_manager,
                              size_t batch_size, size_t batch_size_eval, 
                              std::shared_ptr<ExchangeWgrad>& exchange_wgrad,
                              bool use_mixed_precision,
                              float scaler, const nlohmann::json& j_layers,
                              bool use_cuda_graph = false.bool grouped_all_reduce = false) {
  auto j_optimizer = get_json(config, "optimizer");
  auto embedding_name = get_value_from_json<std::string>(j_layers, "type");

  auto bottom_name = get_value_from_json<std::string>(j_layers, "bottom");
  auto top_name = get_value_from_json<std::string>(j_layers, "top");

  auto& embed_wgrad_buff = (grouped_all_reduce) ?
    std::dynamic_pointer_cast<GroupedExchangeWgrad<TypeFP>>(exchange_wgrad)->get_embed_wgrad_buffs() :
    std::dynamic_pointer_cast<NetworkExchangeWgrad<TypeFP>>(exchange_wgrad)->get_embed_wgrad_buffs();

  auto j_hparam = get_json(j_layers, "sparse_embedding_hparam");
  size_t max_vocabulary_size_per_gpu = 0;
  if (embedding_type == Embedding_t::DistributedSlotSparseEmbeddingHash) {
    max_vocabulary_size_per_gpu =
        get_value_from_json<size_t>(j_hparam, "max_vocabulary_size_per_gpu");
  } else if (embedding_type == Embedding_t::LocalizedSlotSparseEmbeddingHash) {
    if (has_key_(j_hparam, "max_vocabulary_size_per_gpu")) {
      max_vocabulary_size_per_gpu =
          get_value_from_json<size_t>(j_hparam, "max_vocabulary_size_per_gpu");
    } else if (!has_key_(j_hparam, "slot_size_array")) {
      CK_THROW_(Error_t::WrongInput,
                "No max_vocabulary_size_per_gpu or slot_size_array in: " + embedding_name);
    }
  }

  auto embedding_vec_size = get_value_from_json<size_t>(j_hparam, "embedding_vec_size");
  auto combiner = get_value_from_json<int>(j_hparam, "combiner");

  SparseInput<TypeKey> sparse_input;
  if (!find_item_in_map(sparse_input, bottom_name, sparse_input_map)) {
    CK_THROW_(Error_t::WrongInput, "Cannot find bottom");
  }

  OptParams<TypeFP> embedding_opt_params;
  if (has_key_(j_layers, "optimizer")) {
    embedding_opt_params = get_optimizer_param<TypeFP>(get_json(j_layers, "optimizer"));
  } else {
    embedding_opt_params = get_optimizer_param<TypeFP>(j_optimizer);
  }
  embedding_opt_params.scaler = scaler;

  switch (embedding_type) {
    case Embedding_t::DistributedSlotSparseEmbeddingHash: {
      const SparseEmbeddingHashParams<TypeFP> embedding_params = {
          batch_size,
          batch_size_eval,
          max_vocabulary_size_per_gpu,
          {},
          embedding_vec_size,
          sparse_input.max_feature_num_per_sample,
          sparse_input.slot_num,
          combiner,  // combiner: 0-sum, 1-mean
          embedding_opt_params};

      embeddings.emplace_back(new DistributedSlotSparseEmbeddingHash<TypeKey, TypeFP>(
          sparse_input.train_row_offsets, sparse_input.train_values, sparse_input.train_nnz,
          sparse_input.evaluate_row_offsets, sparse_input.evaluate_values,
          sparse_input.evaluate_nnz, embedding_params, resource_manager));
      break;
    }
    case Embedding_t::LocalizedSlotSparseEmbeddingHash: {
#ifndef NCCL_A2A

      auto j_plan = get_json(j_layers, "plan_file");
      std::string plan_file;
      if (j_plan.is_array()) {
        int num_nodes = j_plan.size();
        plan_file = j_plan[resource_manager->get_process_id()].get<std::string>();
      } else {
        plan_file = get_value_from_json<std::string>(j_layers, "plan_file");
      }

      std::ifstream ifs(plan_file);
#else
      std::string plan_file = "";
#endif
      std::vector<size_t> slot_size_array;
      if (has_key_(j_hparam, "slot_size_array")) {
        auto slots = get_json(j_hparam, "slot_size_array");
        for (auto slot : slots) {
          slot_size_array.emplace_back(slot.get<size_t>());
        }
      }

      const SparseEmbeddingHashParams<TypeFP> embedding_params = {
          batch_size,
          batch_size_eval,
          max_vocabulary_size_per_gpu,
          slot_size_array,
          embedding_vec_size,
          sparse_input.max_feature_num_per_sample,
          sparse_input.slot_num,
          combiner,  // combiner: 0-sum, 1-mean
          embedding_opt_params};

      embeddings.emplace_back(new LocalizedSlotSparseEmbeddingHash<TypeKey, TypeFP>(
          sparse_input.train_row_offsets, sparse_input.train_values, sparse_input.train_nnz,
          sparse_input.evaluate_row_offsets, sparse_input.evaluate_values,
          sparse_input.evaluate_nnz, embedding_params, plan_file, resource_manager));

      break;
    }
    case Embedding_t::LocalizedSlotSparseEmbeddingOneHot: {
 			// Omit some code
      break;
    }

    case Embedding_t::HybridSparseEmbedding: {
			// Omit some code
      break;
    }
  }  // switch
  
  // Set the output here
  for (size_t i = 0; i < resource_manager->get_local_gpu_count(); i++) {
    train_tensor_entries_list[i].push_back(
        {top_name, (embeddings.back()->get_train_output_tensors())[i]});
    evaluate_tensor_entries_list[i].push_back(
        {top_name, (embeddings.back()->get_evaluate_output_tensors())[i]});
  }
}

5.2.4 How to Obtain the output

When the create_embeddings method is called above, train_tensor_entries_list is passed in as an argument.

At the end of create_embeddings, the output of the embedding is stored in train_tensor_entries_list.

  // If you have a dense tensor, you would put it here
  for (size_t i = 0; i < resource_manager->get_local_gpu_count(); i++) {
    train_tensor_entries_list[i].push_back(
        {top_name, (embeddings.back()->get_train_output_tensors())[i]});
    evaluate_tensor_entries_list[i].push_back(
        {top_name, (embeddings.back()->get_evaluate_output_tensors())[i]});
  }

Where is the output? It is in embedding_data.train_output_tensors_, which we will look at later.

  std::vector<TensorBag2> get_train_output_tensors() const override {          \
    std::vector<TensorBag2> bags;                                              \
    for (const auto& t : embedding_data.train_output_tensors_) {               \
      bags.push_back(t.shrink());                                              \
    }                                                                          \
    return bags;                                                               \
  }

Therefore, for the embedding, sparse_input_map and train_tensor_entries_list constitute the input and output data flows.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

Using Neural Networks for Your Recommender System

Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin

Developer.nvidia.com/blog/introd…

Developer.nvidia.com/blog/announ…

Developer.nvidia.com/blog/accele…

Introduction to NVIDIA Merlin HugeCTR: A training framework dedicated to recommendation systems

Read HugeCTR source code

How does embedding propagate back

Web.eecs.umich.edu/~justincj/t…

Sparse matrix storage format summary + storage efficiency comparison :COO,CSR,DIA,ELL,HYB

Out of Thin Air: On the Embedding idea in recommendation algorithm

Principle of the tf.nn.embedding_lookup function

Tensorflow’s embedding_lookup interface is embedding_lookup.

How does embedding do in the recommended scene of big factory

Can CTR preestimation be applied to stitching of sequence feature embedding and input MLP? With pooling

Depth matching model in recommendation system

Local processing: How is Embedding realized?

Depth Matching Model in Search of Asymmetrical Two-bar Model (Part 2)

Depth feature fast ox strategy about high and low layer feature fusion

Deep learning introduction to DeepFM with Pytorch code explanation

deepFM in pytorch

Recommended algorithm 7 — DeepFM model

DeepFM Parameter Understanding (2)

Recommendation system meets deep learning (III)– Theory and Practice of DeepFM model

Deep learning introduction to DeepFM with Pytorch code explanation

Docs.nvidia.com/deeplearnin…

Introduce you to the key algorithm of large model training: distributed training Allreduce algorithm

FLEN: Leveraging Field for Scalable CTR Prediction