• Author: Han Xinzi @Showmeai, Joan @Tencent
  • Address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author, and cite the source.

The two-tower model is a classic structure widely used in recommendation, search, advertising, and other fields. In practical applications, companies often upgrade the structure of each tower, for example by substituting newer CTR network structures for the fully connected DNN originally used inside the towers. This article looks at how the Tencent Browser team, in its recommendation scenario, ingeniously applies parallel CTR models within the two-tower structure.

Read the full text in one picture

The implementation code

For the realization of DCN/FM/DeepFM/FFM/CIN(xDeepFM) and other CTR prediction methods mentioned in this paper, please go to GitHub to check: github.com/ShowMeAI-Hu…

Paper Download & Data set download

For relevant data, please reply to “Recommendation and CTR Data Set” on the official account (AI Algorithm Institute), and obtain the download link.

Technical implementations of multi-task and multi-objective modeling in CTR estimation at major companies

  • “Multi-objective optimization and application (including code implementation)” www.showmeai.tech/article-det…
  • “Multi-objective optimization practice in iQiyi short video recommendation business” www.showmeai.tech/article-det…

1. Two-tower model structure

1.1 Introduction to model structure

The two-tower model is widely used in the recall and ranking stages of recommendation, search, advertising and other fields. The model structure is shown as follows:

In the two-tower model structure, the User tower is on the left and the Item tower is on the right. Correspondingly, features can be divided into two categories:

  • User-related features: basic User information, group statistical attributes, the sequence of Items the user has interacted with, etc.
  • Item-related features: basic Item information, attribute information, etc.

Context features can also be added to the User-side tower. In the initial version of the structure, both towers are classic DNN models (i.e., fully connected structures): the feature Embeddings pass through several MLP hidden layers, and the two towers output the User Embedding and the Item Embedding respectively.

During training, the User Embedding and Item Embedding are compared via inner product or cosine similarity, pulling the current User closer to positive Items in the Embedding space and pushing it farther from negative Items. The loss function can be standard cross-entropy loss (treating the task as classification), or BPR or Hinge loss (treating the task as representation learning).
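As a concrete reference, below is a minimal sketch of this classic two-tower setup in TensorFlow/Keras. It is not the team's production code; the tower widths, input handling, and the pointwise sigmoid cross-entropy loss are illustrative assumptions.

# Minimal two-tower sketch (tower widths, inputs, and loss are assumptions).
import tensorflow as tf

def make_tower(hidden_dims=(256, 128), out_dim=64):
    """A plain MLP tower: several fully connected layers ending in an Embedding."""
    layers = [tf.keras.layers.Dense(h, activation="relu") for h in hidden_dims]
    layers.append(tf.keras.layers.Dense(out_dim))
    return tf.keras.Sequential(layers)

user_tower = make_tower()   # consumes concatenated User feature Embeddings
item_tower = make_tower()   # consumes concatenated Item feature Embeddings

def cosine_scores(user_feats, item_feats):
    u = tf.math.l2_normalize(user_tower(user_feats), axis=-1)  # User Embedding
    v = tf.math.l2_normalize(item_tower(item_feats), axis=-1)  # Item Embedding
    return tf.reduce_sum(u * v, axis=-1)                       # cosine similarity

def pointwise_loss(user_feats, pos_item_feats, neg_item_feats):
    # Pull positives closer, push negatives away (cross-entropy formulation).
    pos = cosine_scores(user_feats, pos_item_feats)
    neg = cosine_scores(user_feats, neg_item_feats)
    logits = tf.concat([pos, neg], axis=0)
    labels = tf.concat([tf.ones_like(pos), tf.zeros_like(neg)], axis=0)
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))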

1.2 Advantages and disadvantages of the two-tower model

The advantages of the two-tower model are obvious:

  • The structure is clear: User and Item are modeled and learned separately, and interaction happens only at the final matching step.
  • After training, online inference is efficient and performs very well. During online serving, the Item vectors are pre-computed; the User vector is computed once from the current features, followed by a single inner-product or cosine calculation (see the serving sketch after this list).
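Continuing the sketch from section 1.1, online serving would look roughly like this; `all_item_feats` and `current_user_feats` are hypothetical pre-built feature tensors.

# Online-serving sketch, continuing the towers defined in 1.1 (illustrative).
item_matrix = tf.math.l2_normalize(item_tower(all_item_feats), axis=-1)   # precomputed offline
user_vec = tf.math.l2_normalize(user_tower(current_user_feats), axis=-1)  # once per request
scores = tf.matmul(user_vec, item_matrix, transpose_b=True)               # one matmul scores all items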

The two-tower model also has drawbacks:

  • The original two-tower structure limits the features that can be used; in particular, it cannot take advantage of User-Item cross features.
  • Constrained by the model structure, the User and Item representations are built separately and interact only through the final inner product, which is not conducive to learning User-Item interactions.

1.3 Optimization of the two-tower model

The Tencent information-flow team (the novel recommendation scenario in QQ Browser) optimized the two-tower structure under the above constraints and obtained good gains in both model structure and effect. The specific approach is as follows:

  • Replace the simple DNN structure inside each tower with a “parallel” combination of effective CTR modules (MLP, DCN, FM, FFM, CIN). Widening the model in this way makes full use of the different cross-feature strengths of the structures and relieves the bottleneck at the inner product of the two towers.
  • Use LR to learn the weights of the multiple “parallel” two-tower branches. The LR weights are finally folded into the User Embedding, so the final model still keeps the inner-product form.

2. Parallel two-tower model structure

The parallel two-tower model can be divided into three layers: the input layer, the representation layer, and the matching layer. The processing and operations in each of the three layers (shown in the figure) are as follows.

2.1 Input Layer

In the Tencent QQ Browser novel scenario, there are two categories of features:

  • User features: User ID, User portrait (age, gender, city), behavior sequence (click, read, favorites), external behavior (browser information, Tencent video, etc.);
  • Item features: novel content features (novel ID, classification, label, etc.), statistical features, etc.

Both User and Item features are discretized and mapped into feature Embeddings, which makes it convenient to build the networks of the representation layer.
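A minimal sketch of this input layer follows; the vocabulary sizes and Embedding dimensions are illustrative assumptions, not the team's actual configuration.

# Input-layer sketch: discretized ID features looked up in Embedding tables.
import tensorflow as tf

user_id = tf.keras.Input(shape=(), dtype=tf.int64, name="user_id")
novel_id = tf.keras.Input(shape=(), dtype=tf.int64, name="novel_id")

user_id_emb = tf.keras.layers.Embedding(input_dim=1_000_000, output_dim=16)(user_id)
novel_id_emb = tf.keras.layers.Embedding(input_dim=200_000, output_dim=16)(novel_id)
# Further User/Item features would be embedded the same way and concatenated
# per side before entering the representation-layer towers.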

2.2 Representation Layer

  • CTR modules based on deep neural networks (MLP, DCN, FM, CIN, etc.) are applied to learn from the input; different modules fuse and cross the input-layer features in different ways.
  • The representations learned by the different modules are arranged in a parallel structure for the matching-layer computation.
  • User-User and Item-Item feature interactions (information crossing within a tower) can be realized inside each tower branch, while User-Item feature interactions can only be realized by the layers above.

2.3 Matching Layer

  • For each parallel model, the Hadamard product of its User and Item vectors from the representation layer is computed; the results are concatenated and fused through LR to obtain the final score.
  • For online serving, the per-dimension LR weights can be pre-folded into the User Embedding, so online scoring remains a single inner-product operation (see the numerical check below).
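Below is a small numerical check of the matching layer described above; the branch widths are assumptions. It also verifies that folding the LR weights into the User side leaves a plain inner product.

# Matching-layer sketch: per-branch Hadamard products, concat, LR fusion,
# and the equivalent "fold LR weights into the User Embedding" serving form.
import numpy as np

rng = np.random.default_rng(0)
branch_dims = (64, 64, 32)                          # assumed widths of 3 parallel branches
user_vecs = [rng.normal(size=d) for d in branch_dims]
item_vecs = [rng.normal(size=d) for d in branch_dims]

hadamard = np.concatenate([u * v for u, v in zip(user_vecs, item_vecs)])
lr_w = rng.normal(size=hadamard.size)               # one LR weight per dimension
score = np.dot(lr_w, hadamard)                      # training-time score

# Serving: pre-multiply the LR weights into the concatenated User vector.
fused_user = lr_w * np.concatenate(user_vecs)
fused_item = np.concatenate(item_vecs)
assert np.allclose(score, np.dot(fused_user, fused_item))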

3. MLP/DCN structure of the two towers

The MLP structure (multi-layer fully connected network) is commonly used inside the two towers. The Tencent QQ Browser team also introduced the Cross Network structure from DCN to explicitly construct higher-order feature interactions; the reference structure is Google's improved DCN-Mix/DCN-V2.

3.1 The DCN structure

DCN is characterized by the Cross Network structure it introduces to extract cross-combination features, avoiding the manual feature engineering of traditional machine learning. The network structure is simple and its complexity controllable, and higher-order cross features are obtained as the depth increases.

The specific structure of the DCN model is shown in the figure:

  • The bottom is the Embedding layer, and the feature Embeddings are stacked (concatenated) together.
  • Above it, the Cross Network and the Deep Network run in parallel.
  • At the head is a Combination Layer that concatenates the outputs of the Cross Network and the Deep Network to produce the final output.

3.2 Introduction of optimized DCN-V2 structure

On the basis of DCN, Google proposed the improved DCN-Mix/DCN-V2, whose main change is to the Cross Network. Here we focus on the change in how the Cross Network is computed:

3.2.1 Calculation method of original Cross Network

The problem is that the final $k$-th order interaction result $x_{k}$ can be shown to equal $x_{0}$ multiplied by a scalar (the scalar differs for different $x_{0}$, so $x_{0}$ and $x_{k}$ are not linearly related); under this calculation the expressive power of the Cross Network is limited.
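For contrast with the DCN-V2 code shown later, here is a minimal sketch of the original cross-layer computation x_{l+1} = x0 * (x_l^T w) + b + x_l; the layer class and initializer choices are assumptions for illustration.

# Original DCN (v1) cross layer sketch: the weight is a vector, so the update
# multiplies x0 by the scalar x_l^T w computed per example.
import tensorflow as tf

class CrossV1(tf.keras.layers.Layer):
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")

    def call(self, x0, x):
        # x_{l+1} = x0 * (x_l . w) + b + x_l
        scalar = tf.matmul(x, self.w)        # [batch, 1], one scalar per example
        return x0 * scalar + self.b + x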

3.2.2 Improved Cross Network computing mode

Google’s improved version of DCN-Mix does the following:

  • W is changed from a vector to a matrix; the larger number of parameters brings stronger expressive power (in practice the W matrix can also be factorized into low-rank matrices).
  • The feature interaction mode is changed: the Hadamard product is applied instead of the cross (outer) product.

3.2.3 DCN-V2 code reference

For the DCN-V2 code implementation and a CTR application example, refer to Google's official implementation: github.com/tensorflow/…

The core of the improved deep Cross Layer code is as follows:

from typing import Optional, Text, Union

import tensorflow as tf


class Cross(tf.keras.layers.Layer):
  """Cross Layer in Deep & Cross Network to learn explicit feature interactions. A layer that creates explicit and bounded-degree feature interactions efficiently. The `call` method accepts `inputs` as a tuple of size 2 tensors. The first input `x0` is the base layer that contains the original features (usually the embedding layer); the second input `xi` is the output of the previous `Cross` layer in the stack, i.e., the i-th `Cross` layer. For the first `Cross` layer in the stack, x0 = xi. The output is x_{i+1} = x0 .* (W * xi + bias + diag_scale * xi) + xi, where .* designates elementwise multiplication, W could be a full-rank matrix, or a low-rank matrix U*V to reduce the computational cost, and diag_scale increases the diagonal of W to improve training stability ( especially for the low-rank case). References: 1. [R. Wang et al.] (https://arxiv.org/pdf/2008.13535.pdf) See Eq. (1) for full - rank and Eq. (2) for low rank version. 2. [r. Wang et al.] (https://arxiv.org/pdf/1708.05123.pdf) Example: ```python # after embedding layer in a functional model: input = tf.keras.Input(shape=(None,), name='index', dtype=tf.int64) x0 = tf.keras.layers.Embedding(input_dim=32, output_dim=6) x1 = Cross()(x0, x0) x2 = Cross()(x0, x1) logits = tf.keras.layers.Dense(units=10)(x2) model = tf.keras.Model(input, logits) ``` Args: projection_dim: project dimension to reduce the computational cost. Default is `None` such that a full (`input_dim` by `input_dim`) matrix W is used. If enabled, a low-rank matrix W = U*V will be used, where U is of size `input_dim` by `projection_dim` and V is of size `projection_dim` by `input_dim`. `projection_dim` need to be smaller than `input_dim`/2 to improve the model efficiency. In practice, we've observed that `projection_dim` = d/4 consistently preserved the accuracy of a full-rank version. diag_scale: a non-negative float used to increase the diagonal of the kernel W by `diag_scale`, that is, W + diag_scale * I, where I is an identity matrix. use_bias: whether to add a bias term for this layer. If set to False, no bias term will be used. kernel_initializer: Initializer to use on the kernel matrix. bias_initializer: Initializer to use on the bias vector. kernel_regularizer: Regularizer to use on the kernel matrix. bias_regularizer: Regularizer to use on bias vector. Input shape: A tuple of 2 (batch_size, `input_dim`) dimensional inputs. Output shape: A single (batch_size, `input_dim`) dimensional output. """

  def __init__(
      self,
      projection_dim: Optional[int] = None,
      diag_scale: Optional[float] = 0.0,
      use_bias: bool = True,
      kernel_initializer: Union[
          Text, tf.keras.initializers.Initializer] = "truncated_normal",
      bias_initializer: Union[Text,
                              tf.keras.initializers.Initializer] = "zeros",
      kernel_regularizer: Union[Text, None,
                                tf.keras.regularizers.Regularizer] = None,
      bias_regularizer: Union[Text, None,
                              tf.keras.regularizers.Regularizer] = None,
      **kwargs):

    super(Cross, self).__init__(**kwargs)

    self._projection_dim = projection_dim
    self._diag_scale = diag_scale
    self._use_bias = use_bias
    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
    self._input_dim = None

    self._supports_masking = True

    if self._diag_scale < 0:
      raise ValueError(
          "`diag_scale` should be non-negative. Got `diag_scale` = {}".format(
              self._diag_scale))

  def build(self, input_shape):
    last_dim = input_shape[-1]

    if self._projection_dim is None:
      self._dense = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    else:
      self._dense_u = tf.keras.layers.Dense(
          self._projection_dim,
          kernel_initializer=self._kernel_initializer,
          kernel_regularizer=self._kernel_regularizer,
          use_bias=False,
      )
      self._dense_v = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    self.built = True

  def call(self, x0: tf.Tensor, x: Optional[tf.Tensor] = None) -> tf.Tensor:
    """Computes the feature cross. Args: x0: The input tensor x: Optional second input tensor. If provided, the layer will compute crosses between x0 and x; if not provided, the layer will compute crosses between x0 and itself. Returns: Tensor of crosses. """

    if not self.built:
      self.build(x0.shape)

    if x is None:
      x = x0

    if x0.shape[-1] != x.shape[-1]:
      raise ValueError(
          "`x0` and `x` dimension mismatch! Got `x0` dimension {}, and x "
          "dimension {}. This case is not supported yet.".format(
              x0.shape[-1], x.shape[-1]))

    if self._projection_dim is None:
      prod_output = self._dense(x)
    else:
      prod_output = self._dense_v(self._dense_u(x))

    if self._diag_scale:
      prod_output = prod_output + self._diag_scale * x

    return x0 * prod_output + x

  def get_config(self):
    config = {
        "projection_dim":
            self._projection_dim,
        "diag_scale":
            self._diag_scale,
        "use_bias":
            self._use_bias,
        "kernel_initializer":
            tf.keras.initializers.serialize(self._kernel_initializer),
        "bias_initializer":
            tf.keras.initializers.serialize(self._bias_initializer),
        "kernel_regularizer":
            tf.keras.regularizers.serialize(self._kernel_regularizer),
        "bias_regularizer":
            tf.keras.regularizers.serialize(self._bias_regularizer),
    }
    base_config = super(Cross, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))


4. Representation-layer structures of the two towers: FM/FFM/CIN

Another family of structures commonly used in CTR prediction is the FM series; typical models include FM, FFM, DeepFM, and xDeepFM. Their particular modeling approach can also mine effective information, and the Tencent QQ Browser team's final model uses substructures of these models as well.

Whereas the feature interactions in the MLP and DCN above are implicit and specific feature interactions cannot be designated explicitly, the FM/FFM/CIN structures of the FM family operate explicitly at feature granularity, and their formulas take a convenient inner-product form, which makes it easy and direct to realize User-Item feature-level interactions in two-tower modeling.

4.1 Introduction to the FM structure


$$y = \omega_{0}+\sum_{i=1}^{n} \omega_{i} x_{i}+\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\left\langle v_{i}, v_{j}\right\rangle x_{i} x_{j}$$

FM is the most common structure in CTR prediction models. It uses matrix factorization to build second-order feature interactions: the second-order term of the formula sums the inner products of the feature vectors $v_{i}$ and $v_{j}$ (in deep learning this can be viewed as pairwise inner products among feature Embeddings). By the distributive law, this sum of inner products can be converted into an inner product of sums.


$$y=\sum_{i} \sum_{j}\left\langle V_{i}, V_{j}\right\rangle=\left\langle\sum_{i} V_{i}, \sum_{j} V_{j}\right\rangle, \qquad i \in \text{user fea}, \; j \in \text{item fea}$$

In the Tencent QQ Browser novel recommendation scenario, only the User-Item interactions are considered (the second-order interactions of features within the User or within the Item have already been captured by the models above). In the formula, $i$ indexes User-side features and $j$ indexes Item-side features. Converting with the distributive law, the User-Item second-order interactions can be computed by first summing the User and Item feature vectors (sum pooling, in neural-network terms) and then taking a single inner product, which fits conveniently into a two-tower structure (see the check below).
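A quick numerical check of this identity, with purely illustrative sizes:

# FM-as-two-towers check: the sum of all User x Item inner products equals
# the inner product of the two sum-pooled vectors.
import numpy as np

rng = np.random.default_rng(1)
user_emb = rng.normal(size=(5, 8))   # 5 User-side feature Embeddings, dim 8
item_emb = rng.normal(size=(3, 8))   # 3 Item-side feature Embeddings, dim 8

pairwise = sum(np.dot(u, v) for u in user_emb for v in item_emb)
pooled = np.dot(user_emb.sum(axis=0), item_emb.sum(axis=0))
assert np.allclose(pairwise, pooled)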

4.2 Introduction to the FFM structure

The FFM model is an upgraded version of FM that adds the concept of a field: features of the same nature are assigned to the same field, and the latent vector constructed for a feature depends not only on the feature itself but also on the field of the feature it interacts with. FFM can also be converted into a two-tower inner-product structure by certain means.


$$y(\mathbf{x})=w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+\sum_{i=1}^{n} \sum_{j=i+1}^{n}\left\langle\mathbf{v}_{i, f_{j}}, \mathbf{v}_{j, f_{i}}\right\rangle x_{i} x_{j}$$

An example of a transformation is as follows:

In the example, User has 2 feature fields and Item has 3 feature fields. Every pair of interacting features in the figure has an independent Embedding vector. According to the FFM formula, computing the User-Item second-order interaction requires computing all of these inner products and summing them.

If we reorder the User and Item feature Embeddings and then concatenate them, FFM can also be converted into the two-tower inner-product form (see the sketch below). The User-User and Item-Item terms of FFM stay within their towers, so they can be pre-computed and folded into the first-order terms.
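A sketch of the "reorder + concatenate" conversion; the field counts and dimensions are assumed.

# FFM User-Item cross expressed as a single long inner product after reordering.
import numpy as np

rng = np.random.default_rng(2)
n_user_fields, n_item_fields, dim = 2, 3, 4

# v_user[i, j]: Embedding of User field i used when interacting with Item field j.
v_user = rng.normal(size=(n_user_fields, n_item_fields, dim))
# v_item[j, i]: Embedding of Item field j used when interacting with User field i.
v_item = rng.normal(size=(n_item_fields, n_user_fields, dim))

ffm_cross = sum(np.dot(v_user[i, j], v_item[j, i])
                for i in range(n_user_fields) for j in range(n_item_fields))

# Reorder both sides so matching pairs line up, then concatenate per tower.
user_concat = np.concatenate([v_user[i, j] for i in range(n_user_fields)
                              for j in range(n_item_fields)])
item_concat = np.concatenate([v_item[j, i] for i in range(n_user_fields)
                              for j in range(n_item_fields)])
assert np.allclose(ffm_cross, np.dot(user_concat, item_concat))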

In practical application, the Tencent QQ Browser team found that the FFM two-tower model improved AUC significantly on the training data, but the increase in the number of parameters caused serious overfitting. In addition, after the structural adjustment above the towers become extremely wide (possibly on the order of ten thousand dimensions), which has a large impact on serving efficiency. The following further optimizations were tried:

  • Manually screen the User and Item feature fields that participate in FFM feature interactions, to control the tower width (around 1,000).
  • Adjust FFM's Embedding initialization (close to 0) and the learning rate (reduced); a possible Keras configuration is sketched below.
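The two tweaks might look like the following in Keras; the exact values are assumptions, as the team did not publish them.

# Hedged example of "initialize near 0" and "reduce the learning rate".
import tensorflow as tf

ffm_emb_initializer = tf.keras.initializers.RandomUniform(minval=-1e-4, maxval=1e-4)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)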

The final result was still not good, so the team did not use FFM in production.

4.3 Introduction to the CIN structure

The FM and FFM above capture second-order feature interactions, while the CIN structure proposed in the xDeepFM model can realize higher-order feature interactions (e.g., user-user-item, user-user-item-item, user-item-item). The Tencent QQ Browser team tried two ways of applying CIN to the two-tower structure:

4.3.1 CIN(User) * CIN(Item)

The multi-order CIN results for User and Item are produced inside each tower, then sum pooling generates the User and Item vectors respectively, and finally the inner product of the two vectors is taken.

Expanding the sum-pooling inner product by the distributive law shows that this calculation internally realizes multi-order User-Item interactions:


$$\left(U^{1}+U^{2}+U^{3}\right) \cdot \left(I^{1}+I^{2}+I^{3}\right) = U^{1} I^{1}+U^{1} I^{2}+U^{1} I^{3}+U^{2} I^{1}+U^{2} I^{2}+U^{2} I^{3}+U^{3} I^{1}+U^{3} I^{2}+U^{3} I^{3}$$

The implementation of this variant is relatively simple: CIN runs inside each of the two towers to produce results of each order, those results are sum-pooled, and the final inner product then realizes each order of User-Item interaction, in the same spirit as FM (a minimal sketch of one CIN layer follows).
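Below is a minimal sketch of one CIN layer plus the sum pooling used here. The shapes, layer width, and the choice to pool over feature maps (so that a D-dimensional vector remains for the inner product) are assumptions about this two-tower adaptation.

# One CIN layer: every new feature map is a weighted sum of Hadamard products
# between rows of the current map and rows of the base field-Embedding matrix.
import numpy as np

rng = np.random.default_rng(3)
num_fields, dim, next_maps = 4, 8, 6          # assumed sizes

x0 = rng.normal(size=(num_fields, dim))       # base field Embeddings X^0
xk = x0                                       # current feature map X^k (start at X^0)
W = rng.normal(size=(next_maps, xk.shape[0], num_fields))   # layer parameters

# X^{k+1}[h, d] = sum_{i,j} W[h, i, j] * X^k[i, d] * X^0[j, d]
xk1 = np.einsum("hij,id,jd->hd", W, xk, x0)

# Sum pooling over feature maps gives this tower's vector for the inner product.
tower_vec = xk1.sum(axis=0)                   # shape (dim,)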

This approach has a drawback: the second- and higher-order User-Item interactions it generates suffer from limitations similar to FM. For example, $U^{1}$ is the sum pooling of multiple User-side features, so the inner product of $U^{1}$ with the Item-side result is constrained by that pooling: every User feature ends up contributing with equal importance.

4.3.2 CIN(CIN(User), CIN(Item))

The second approach is: after the multi-order CIN results of User and Item are produced inside each tower, the CIN results of User and Item interact explicitly again in pairs (instead of taking the inner product after sum pooling), and the result is converted into a two-tower inner product, as shown in the figure below:

The figure below gives the CIN calculation formula; after sum pooling over the multiple convolution results, the form remains unchanged (a weighted sum of pairwise Hadamard products).

The resulting form is similar to FFM and can likewise be converted into a two-tower inner product by the "reorder + concatenate" operation. The resulting towers are also very wide (on the order of ten thousand dimensions), but unlike FFM, the feature Embeddings at the bottom of CIN are shared across all feature interactions, whereas FFM keeps an independent Embedding for every second-order interaction. Accordingly, the Tencent QQ Browser team saw essentially no overfitting in practice, and its experimental effect was slightly better than that of the first method.

5. Tencent’s business effect

Below are the experimental results on the Tencent QQ Browser novel recommendation service (comparing various single-structure two-tower CTR models with the parallel two-tower structures):

5.1 Some of the team’s analyses are as follows

  • Among the single-structure two-tower models, CIN2 performs best, followed by the DCN and CIN1 two-tower models;
  • Compared with any single two-tower structure, the parallel two-tower structures bring a further significant improvement;
  • Parallel scheme 2 uses the CIN2 structure and its tower width exceeds 20,000, which challenges online-serving performance; weighing overall effect against deployment efficiency, parallel two-tower scheme 1 can be selected.

5.2 Some training details and experience provided by the team

  • Considering the computational complexity of structures such as FM/FFM/CIN, only a selected subset of features is used to train them, mainly high-dimensional categorical features such as user ID, behavior-history IDs, novel ID, and tag IDs, plus a few statistical features; there are fewer than roughly 20 feature fields on each of the User and Item sides.
  • The parallel two-tower branches do not share the underlying feature Embeddings; each branch trains its own Embeddings separately.
  • Feature Embedding dimensions for MLP/DCN: 16 for categorical features and 32 for non-categorical features.
  • The feature Embedding dimension for FM/FFM/CIN is uniformly 32.

6. Tencent team's online experiment results

An A/B test was launched in the pre-ranking (rough ranking) stage of the novel recommendation scenario. The experimental group used “Parallel two-tower scheme 1” for the CTR and reading-conversion models, while the control group used the “MLP two-tower model”. As shown below, business indicators improved significantly:

  • Clicks +6.8752%
  • Reading conversion rate +6.2250%
  • Book conversion rate +6.5775%
  • Reading time +3.3796%

7. Relevant code implementation references

The implementation code

For the realization of DCN/FM/DeepFM/FFM/CIN(xDeepFM) and other CTR prediction methods mentioned in this paper, please go to GitHub to check: github.com/ShowMeAI-Hu…

Paper Download & Data set download

For relevant data, please reply to “Recommendation and CTR Data Set” on the official account (AI Algorithm Institute), and obtain the download link.

8. References

  • [1] Huang, Po-Sen, et al. “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data.” Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM 2013).
  • [2] S. Rendle. “Factorization Machines.” Proceedings of the IEEE International Conference on Data Mining (ICDM 2010), pp. 995-1000.
  • [3] Yuchin Juan, et al. “Field-aware Factorization Machines for CTR Prediction.” Proceedings of the 10th ACM Conference on Recommender Systems (RecSys 2016).
  • [4] Jianxun Lian, et al. “xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018), pp. 1754-1763.
  • [5] Ruoxi Wang, et al. “Deep & Cross Network for Ad Click Predictions.” Proceedings of ADKDD'17 (August 2017), Article No. 12.
  • [6] Wang, Ruoxi, et al. “DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems.” Proceedings of the Web Conference 2021 (WWW '21). doi:10.1145/3442381.3450078
