0x00 Introduction
DIEN stands for Alibaba’s Deep Interest Evolution Network.
This article analyzes the overall design of the DIEN source code. Because DIEN evolved from DIN, the two share a lot of code.
This article is based on the implementation at github.com/mouna99/die… .
0x01 File Overview
Data files mainly include:
- uid_voc.pkl: user dictionary, mapping each user name to an id;
- mid_voc.pkl: movie dictionary, mapping each item to an id;
- cat_voc.pkl: category dictionary, mapping each category to an id;
- item-info: category information of each item;
- reviews-info: review metadata in the format userID, itemID, score, timestamp; used for negative sampling;
- local_train_splitByUser: training data in the format label, user name, target item, target item category, history items, and the categories of the history items;
- local_test_splitByUser: test data in the same format as the training data.
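For illustration, here is how one line of the training data might be parsed. This is a sketch, not code from the repo: the sample line is made up, and the '\x02' separator for the multi-value history fields is an assumption based on the repo's data iterator.

```python
# A hypothetical tab-separated line of local_train_splitByUser:
# label \t user \t target item \t target category \t history items \t history categories
line = "1\tAZ3F8E\tB00JNHU0T2\tBooks\t0446605239\x020743469917\tBooks\x02Books"

label, user, item, cat, hist_items, hist_cats = line.strip().split("\t")
item_history = hist_items.split("\x02")  # assumed multi-value separator
cat_history = hist_cats.split("\x02")
print(label, user, item, cat, item_history, cat_history)
```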
The code mainly includes:
- rnn.py: modifies the original RNN implementation in TensorFlow so that attention can be combined with the RNN;
- vecattgrucell.py: modifies the GRU source code to add attention to it, implementing the AUGRU structure;
- data_iterator.py: data iterator for continuously feeding data;
- utils.py: auxiliary functions, such as the dice activation function and attention score calculation;
- model.py: the DIEN model file;
- train.py: entry point of the model, for training on the training data, saving the model, and testing on the test data.
0x02 Overall Architecture
We start from the architecture diagram in the paper. DIEN, the Deep Interest Evolution Network, is divided into several layers, from bottom to top:
- Behavior Layer: converts the items the user browsed into the corresponding embeddings, ordered by browsing time; that is, the raw id-based behavior sequence features are converted into an embedding behavior sequence.
- Interest Extractor Layer: extracts the user's interest sequence from the behavior sequence by simulating the user's interest migration process.
- Interest Evolving Layer: adds an attention mechanism on top of the interest extractor to simulate, and model, the evolution of the interest related to the current target ad.
- Finally, the interest representation is concatenated with the embedding vectors of the ad, the user profile, and the context, and an MLP completes the final prediction.
0x03 Overall Code
The DIEN code starts from train.py. train.py evaluates the test set with the initial model and then calls train(), which:
- obtains the training data and the test data, both as data iterators for continuous data input;
- generates the corresponding model according to model_type;
- trains batch by batch, evaluating on the test set every test_iter batches.
The code is as follows:
```python
def train(
        train_file = "local_train_splitByUser",
        test_file = "local_test_splitByUser",
        uid_voc = "uid_voc.pkl",
        mid_voc = "mid_voc.pkl",
        cat_voc = "cat_voc.pkl",
        batch_size = 128,
        maxlen = 100,
        test_iter = 100,
        save_iter = 100,
        model_type = 'DNN',
        seed = 2,
):
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        ## Training data
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        ## Test data
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()
        ......
        elif model_type == 'DIEN':
            model = Model_DIN_V2_Gru_Vec_attGru_Neg(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        ......
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())

        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                if (iter % test_iter) == 0:
                    eval(sess, test_data, model, best_model_path)
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    model.save(sess, model_path + "--" + str(iter))
            lr *= 0.5
```
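prepare_data pads each batch into rectangular matrices and builds the mask. Below is a simplified sketch of that padding logic, under the assumption that each src sample stores its item-history list at index 3; negative sampling and the other fields of the real function are omitted.

```python
import numpy as np

def prepare_data_sketch(src, maxlen=None):
    """Pad variable-length item histories to one [B, T] matrix and build the mask."""
    seqs_mid = [s[3] for s in src]                  # assumed: item history list per sample
    if maxlen is not None:
        seqs_mid = [s[-maxlen:] for s in seqs_mid]  # keep only the most recent behaviors
    lengths = [len(s) for s in seqs_mid]

    n, t = len(seqs_mid), max(lengths)
    mid_his = np.zeros((n, t), dtype='int64')       # padded histories -> mid_his_batch_ph
    mid_mask = np.zeros((n, t), dtype='float32')    # 1.0 on real positions -> mask
    for i, s in enumerate(seqs_mid):
        mid_his[i, :lengths[i]] = s
        mid_mask[i, :lengths[i]] = 1.0
    return mid_his, mid_mask, np.array(lengths)     # lengths feed seq_len_ph
```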
0x04 Model Base Class
The base class of all models is Model. Its constructor __init__ can be understood as the Behavior Layer: it converts the products the user browsed into the corresponding embeddings, ordered by browsing time; that is, the raw id-based behavior sequence is converted into an embedding behavior sequence.
4.1 Basic Logic
The basic logic is as follows:
- Construct the various placeholder variables under the 'Inputs' scope;
- Build the embedding lookup tables for user and item under the 'Embedding_layer' scope, converting the input data into the corresponding embeddings;
- Concatenate the various embedding vectors; for example, the embedding of an item's id and the embedding of the item's category id are joined together as the item embedding.
4.2 Module Analysis
In what follows, B is the batch size, T the sequence length, and H the hidden size. The initialization constants in the program are as follows:
```python
EMBEDDING_DIM = 18
HIDDEN_SIZE = 18 * 2
ATTENTION_SIZE = 18 * 2
best_auc = 0.0
```
4.2.1 Building variables
First, the placeholder variables are built.
```python
with tf.name_scope('Inputs'):
    # shape: [B, T], history item id sequence; T is the (padded) length of the sequence
    self.mid_his_batch_ph = tf.placeholder(tf.int32, [None, None], name='mid_his_batch_ph')
    # shape: [B, T], history category id sequence
    self.cat_his_batch_ph = tf.placeholder(tf.int32, [None, None], name='cat_his_batch_ph')
    # shape: [B], user id sequence (B: batch size)
    self.uid_batch_ph = tf.placeholder(tf.int32, [None, ], name='uid_batch_ph')
    # shape: [B], movie (item) id sequence
    self.mid_batch_ph = tf.placeholder(tf.int32, [None, ], name='mid_batch_ph')
    # shape: [B], category id sequence
    self.cat_batch_ph = tf.placeholder(tf.int32, [None, ], name='cat_batch_ph')
    # shape: [B, T], marks the valid (non-padded) positions of each behavior sequence
    self.mask = tf.placeholder(tf.float32, [None, None], name='mask')
    # shape: [B]; sl: the actual (unpadded) length of each user behavior sequence
    self.seq_len_ph = tf.placeholder(tf.int32, [None], name='seq_len_ph')
    # shape: [B, 2] in practice; one-hot label of the target item
    # (1 for a positive sample, 0 for a negative sample)
    self.target_ph = tf.placeholder(tf.float32, [None, None], name='target_ph')
    # learning rate
    self.lr = tf.placeholder(tf.float64, [])
    self.use_negsampling = use_negsampling
    if use_negsampling:
        # negative item ids generated by negative sampling; 3 ids per position are fed in
        self.noclk_mid_batch_ph = tf.placeholder(tf.int32, [None, None, None], name='noclk_mid_batch_ph')
        self.noclk_cat_batch_ph = tf.placeholder(tf.int32, [None, None, None], name='noclk_cat_batch_ph')
```
See the run-time variables below for details of the various shapes:

```
self = {Model_DIN_V2_Gru_Vec_attGru_Neg}
 cat_batch_ph = {Tensor} Tensor("Inputs/cat_batch_ph:0", shape=(?,), dtype=int32)
 uid_batch_ph = {Tensor} Tensor("Inputs/uid_batch_ph:0", shape=(?,), dtype=int32)
 mid_batch_ph = {Tensor} Tensor("Inputs/mid_batch_ph:0", shape=(?,), dtype=int32)
 cat_his_batch_ph = {Tensor} Tensor("Inputs/cat_his_batch_ph:0", shape=(?, ?), dtype=int32)
 mid_his_batch_ph = {Tensor} Tensor("Inputs/mid_his_batch_ph:0", shape=(?, ?), dtype=int32)
 lr = {Tensor} Tensor("Inputs/Placeholder:0", shape=(), dtype=float64)
 mask = {Tensor} Tensor("Inputs/mask:0", shape=(?, ?), dtype=float32)
 seq_len_ph = {Tensor} Tensor("Inputs/seq_len_ph:0", shape=(?,), dtype=int32)
 target_ph = {Tensor} Tensor("Inputs/target_ph:0", shape=(?, ?), dtype=float32)
 noclk_cat_batch_ph = {Tensor} Tensor("Inputs/noclk_cat_batch_ph:0", shape=(?, ?, ?), dtype=int32)
 noclk_mid_batch_ph = {Tensor} Tensor("Inputs/noclk_mid_batch_ph:0", shape=(?, ?, ?), dtype=int32)
 use_negsampling = {bool} True
```
4.2.2 Building embeddings
Next, the embedding lookup tables for user and item are constructed, and the input data is converted into the corresponding embeddings; that is, sparse features are converted into dense features. (The principle and code of the embedding layer will be analyzed later in this series.)
In what follows, U is the hash bucket size of user_id, I the hash bucket size of item_id, and C the hash bucket size of cat_id.
Note that a variable like self.mid_his_batch_ph holds the user's historical behavior sequence and has size [B, T], so after embedding_lookup the output has size [B, T, H/2].
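A minimal standalone sketch of this shape behavior (demo sizes, not from the DIEN code):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as used by the DIEN repo

B, T, E = 2, 3, 4  # batch size, sequence length, embedding dim (the H/2 above)
table = tf.get_variable("demo_table", [100, E])  # [vocab_size, E] lookup table
ids = tf.placeholder(tf.int32, [None, None])     # [B, T] id sequence
emb = tf.nn.embedding_lookup(table, ids)         # output: [B, T, E]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(emb, feed_dict={ids: np.arange(B * T).reshape(B, T)})
    print(out.shape)  # (2, 3, 4)
```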
```python
# Embedding layer
with tf.name_scope('Embedding_layer'):
    # shape: [U, H/2], uid embedding weight
    self.uid_embeddings_var = tf.get_variable("uid_embedding_var", [n_uid, EMBEDDING_DIM])
    # look up the uid embedding vector from the uid embedding weight
    self.uid_batch_embedded = tf.nn.embedding_lookup(self.uid_embeddings_var, self.uid_batch_ph)

    # shape: [I, H/2], mid embedding weight
    self.mid_embeddings_var = tf.get_variable("mid_embedding_var", [n_mid, EMBEDDING_DIM])
    # look up the mid embedding vector from the mid embedding weight
    self.mid_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.mid_batch_ph)
    # look up the mid history embedding vectors.
    # Note: self.mid_his_batch_ph holds the user's historical behavior sequence with
    # size [B, T], so the embedding_lookup output size is [B, T, H/2]
    self.mid_his_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.mid_his_batch_ph)
    # look up the negative-sample mid history embedding vectors
    if self.use_negsampling:
        self.noclk_mid_his_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.noclk_mid_batch_ph)

    # shape: [C, H/2], cate_id embedding weight; C is the hash bucket size of cat_id
    self.cat_embeddings_var = tf.get_variable("cat_embedding_var", [n_cat, EMBEDDING_DIM])
    # look up the cid embedding vector from the cid embedding weight
    self.cat_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.cat_batch_ph)
    # look up the cid history embedding vectors
    self.cat_his_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.cat_his_batch_ph)
    # look up the negative-sample cid history embedding vectors
    if self.use_negsampling:
        self.noclk_cat_his_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.noclk_cat_batch_ph)
```
See the run-time variables below for details of the various shapes:

```
self = {Model_DIN_V2_Gru_Vec_attGru_Neg}
 cat_embeddings_var = {Variable} <tf.Variable 'cat_embedding_var:0' shape=(1601, 18) dtype=float32_ref>
 uid_embeddings_var = {Variable} <tf.Variable 'uid_embedding_var:0' shape=(543060, 18) dtype=float32_ref>
 mid_embeddings_var = {Variable} <tf.Variable 'mid_embedding_var:0' shape=(367983, 18) dtype=float32_ref>
 cat_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_4:0", shape=(?, 18), dtype=float32)
 mid_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_1:0", shape=(?, 18), dtype=float32)
 uid_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup:0", shape=(?, 18), dtype=float32)
 cat_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_5:0", shape=(?, ?, 18), dtype=float32)
 mid_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_2:0", shape=(?, ?, 18), dtype=float32)
 noclk_cat_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_6:0", shape=(?, ?, ?, 18), dtype=float32)
 noclk_mid_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_3:0", shape=(?, ?, ?, 18), dtype=float32)
```
4.2.3 Concatenating embeddings
This part concatenates the various embedding vectors. For example, the embedding of an item's id and the embedding of the item's category id are joined together as the item embedding.
Regarding shapes:
- As noted in the previous step, a variable like self.mid_his_batch_ph holds the user's historical behavior sequence and has size [B, T], so after embedding_lookup the output has size [B, T, H/2].
- Here the goods embedding and the cate embedding are concatenated to get [B, T, H]. Note that the axis parameter of tf.concat has the value 2.
A note on the logic:
- First, self.item_eb = tf.concat([self.mid_batch_embedded, self.cat_batch_embedded], 1) concatenates the target item's goods embedding and category embedding; this is the i_emb of the architecture diagram.
- Second, self.item_his_eb = tf.concat([self.mid_his_batch_embedded, self.cat_his_batch_embedded], 2) concatenates the two history matrices. Each holds the user's historical behavior sequence, embedded to size [B, T, H/2]; concatenating goods and cate gives size [B, T, H]. Note that the axis parameter of tf.concat is 2, as sketched below.
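A minimal standalone sketch of the shape arithmetic (sizes made up to match the article's H = 36):

```python
import tensorflow as tf

B, T, half_H = 2, 5, 18
mid_his = tf.zeros([B, T, half_H])           # [B, T, H/2] goods history embedding
cat_his = tf.zeros([B, T, half_H])           # [B, T, H/2] category history embedding
item_his = tf.concat([mid_his, cat_his], 2)  # concat on axis 2 -> [B, T, H]
print(item_his.shape)                        # (2, 5, 36)
```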
The specific code is as follows:
```python
# Positive-sample embedding concatenation; a sample consists of item and cate.
# The target item's goods embedding and category embedding are concatenated:
self.item_eb = tf.concat([self.mid_batch_embedded, self.cat_batch_embedded], 1)
# The goods and cate history embeddings are concatenated to [B, T, H]; note axis=2:
self.item_his_eb = tf.concat([self.mid_his_batch_embedded, self.cat_his_batch_embedded], 2)
self.item_his_eb_sum = tf.reduce_sum(self.item_his_eb, 1)

# Negative-sample embedding concatenation; a sample consists of item and cate.
if self.use_negsampling:
    # index 0 means only the first negative item id is used; 3 ids per position were fed in
    self.noclk_item_his_eb = tf.concat(
        [self.noclk_mid_his_batch_embedded[:, :, 0, :], self.noclk_cat_his_batch_embedded[:, :, 0, :]], -1)
    # cate embedding (18) concatenated with item embedding (18) -> 36
    self.noclk_item_his_eb = tf.reshape(self.noclk_item_his_eb,
                                        [-1, tf.shape(self.noclk_mid_his_batch_embedded)[1], 36])
    self.noclk_his_eb = tf.concat([self.noclk_mid_his_batch_embedded, self.noclk_cat_his_batch_embedded], -1)
    self.noclk_his_eb_sum_1 = tf.reduce_sum(self.noclk_his_eb, 2)
    self.noclk_his_eb_sum = tf.reduce_sum(self.noclk_his_eb_sum_1, 1)
```
See the run-time variables below for details of the various shapes:

```
self = {Model_DIN_V2_Gru_Vec_attGru_Neg}
 item_eb = {Tensor} Tensor("concat:0", shape=(?, 36), dtype=float32)
 item_his_eb = {Tensor} Tensor("concat_1:0", shape=(?, ?, 36), dtype=float32)
 item_his_eb_sum = {Tensor} Tensor("Sum:0", shape=(?, 36), dtype=float32)
 noclk_item_his_eb = {Tensor} Tensor("Reshape:0", shape=(?, ?, 36), dtype=float32)
 noclk_his_eb = {Tensor} Tensor("concat_3:0", shape=(?, ?, ?, 36), dtype=float32)
 noclk_his_eb_sum = {Tensor} Tensor("Sum_2:0", shape=(?, 36), dtype=float32)
 noclk_his_eb_sum_1 = {Tensor} Tensor("Sum_1:0", shape=(?, ?, 36), dtype=float32)
```
0x05 Model_DIN_V2_Gru_Vec_attGru_Neg
Model_DIN_V2_Gru_Vec_attGru_Neg is the model corresponding to DIEN. A user's behavior history is a time series; if it is fed into an RNN, the last state can be considered to contain all of the historical information, so the author uses a two-layer GRU structure to model user interest.
The __init__ function of Model_DIN_V2_Gru_Vec_attGru_Neg builds this two-layer GRU structure with the following logic:
- The first layer, 'rnn_1', corresponds to the yellow part of the architecture diagram, the Interest Extractor Layer;
- The second layer, 'Attention_layer_1', corresponds to the red part of the architecture diagram, the Interest Evolving Layer, whose main component is AUGRU.
5.1 Layer 1: 'rnn_1'
This Layer corresponds to the yellow part in the architecture diagram, namely the Interest Extractor Layer, whose main component is GRU.
Its main function is to extract the user's interest sequence from the behavior sequence by simulating the user's interest migration process. Concretely, the item embeddings of the user's behavior history are fed into a dynamic RNN (the first-layer GRU) while the auxiliary loss is computed at the same time; the output is the user's interest at each time step.
```python
# RNN layer(-s)
with tf.name_scope('rnn_1'):
    rnn_outputs, _ = dynamic_rnn(GRUCell(HIDDEN_SIZE), inputs=self.item_his_eb,
                                 sequence_length=self.seq_len_ph, dtype=tf.float32,
                                 scope="gru1")
    aux_loss_1 = self.auxiliary_loss(rnn_outputs[:, :-1, :], self.item_his_eb[:, 1:, :],
                                     self.noclk_item_his_eb[:, 1:, :],
                                     self.mask[:, 1:], stag="gru")
    self.aux_loss = aux_loss_1
```
5.1.1 GRU
The GRU equations are reconstructed below; RNNs themselves will be introduced in another article.
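Reconstructed from the formulation in the DIEN paper ($\sigma$ is the sigmoid, $\circ$ the element-wise product, and $i_t$ the input, i.e. the behavior embedding at step $t$):

$$u_t = \sigma(W^u i_t + U^u h_{t-1} + b^u)$$
$$r_t = \sigma(W^r i_t + U^r h_{t-1} + b^r)$$
$$\tilde{h}_t = \tanh(W^h i_t + r_t \circ U^h h_{t-1} + b^h)$$
$$h_t = (1 - u_t) \circ h_{t-1} + u_t \circ \tilde{h}_t$$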
5.1.2 Auxiliary loss
The calculation of the auxiliary loss is essentially a binary classification model. Specifically, the behavior b(t+1) at step t+1 is used as supervision for the hidden state h_t learned at step t; the positive and negative samples are, respectively, item vectors that the user did and did not click.
- The real next behavior is used as the positive sample;
- Negative samples are selected randomly from items the user has not interacted with, or from items that were shown to the user but not clicked.
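The auxiliary loss from the paper, reconstructed here for reference ($\sigma(\cdot,\cdot)$ is the click probability produced by the auxiliary network for a hidden-state/embedding pair, $e_b^i[t+1]$ the clicked item embedding, and $\hat{e}_b^i[t+1]$ a sampled non-clicked item embedding):

$$L_{aux} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t}\Big[\log\sigma\big(h_t^i, e_b^i[t+1]\big) + \log\big(1-\sigma(h_t^i, \hat{e}_b^i[t+1])\big)\Big]$$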
The code uses tf.concat([h_states, click_seq], -1), where -1 means the concatenation happens along the last dimension: that dimension grows while the others stay the same. For example, concatenating two tensors of shape (3, 2, 4) along -1 yields a tensor of shape (3, 2, 8).
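A quick standalone check of this behavior:

```python
import tensorflow as tf

a = tf.zeros([3, 2, 4])
b = tf.ones([3, 2, 4])
print(tf.concat([a, b], -1).shape)  # (3, 2, 8): only the last dimension grows
```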
The specific code is as follows:
```python
def auxiliary_loss(self, h_states, click_seq, noclick_seq, mask, stag=None):
    mask = tf.cast(mask, tf.float32)
    # concat along the last dimension; the other dimensions stay unchanged
    click_input_ = tf.concat([h_states, click_seq], -1)
    noclick_input_ = tf.concat([h_states, noclick_seq], -1)
    # y_hat of the positive samples
    click_prop_ = self.auxiliary_net(click_input_, stag=stag)[:, :, 0]
    # y_hat of the negative samples
    noclick_prop_ = self.auxiliary_net(noclick_input_, stag=stag)[:, :, 0]
    # logarithmic loss, masked to the real historical behavior
    click_loss_ = - tf.reshape(tf.log(click_prop_), [-1, tf.shape(click_seq)[1]]) * mask
    noclick_loss_ = - tf.reshape(tf.log(1.0 - noclick_prop_), [-1, tf.shape(noclick_seq)[1]]) * mask
    loss_ = tf.reduce_mean(click_loss_ + noclick_loss_)
    return loss_

def auxiliary_net(self, in_, stag='auxiliary_net'):
    bn1 = tf.layers.batch_normalization(inputs=in_, name='bn1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.layers.dense(bn1, 100, activation=None, name='f1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.nn.sigmoid(dnn1)
    dnn2 = tf.layers.dense(dnn1, 50, activation=None, name='f2' + stag, reuse=tf.AUTO_REUSE)
    dnn2 = tf.nn.sigmoid(dnn2)
    dnn3 = tf.layers.dense(dnn2, 2, activation=None, name='f3' + stag, reuse=tf.AUTO_REUSE)
    y_hat = tf.nn.softmax(dnn3) + 0.00000001
    return y_hat
```
5.1.3 The role of the mask
To review the role of the mask, recall how Transformer uses masks:
A mask hides certain values so that they have no effect when parameters are updated. The Transformer model involves two kinds of mask, the padding mask and the sequence mask. The padding mask is used in all scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention.
Padding Mask
What is a padding mask? Because input sequence lengths differ from batch to batch, we must align the input sequences: short sequences are padded with zeros, and if a sequence is too long, the content on the left is truncated and the excess discarded. Since the padded positions are meaningless, the attention mechanism should not attend to them, and some processing is needed.
To do this, the values at these positions are set to a very large negative number (approaching negative infinity), so that after SoftMax their probabilities approach 0. The padding mask is in fact a tensor of booleans, where false marks the positions to be processed.
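A minimal numpy sketch of the idea (standalone, not DIEN code):

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0, 0.0, 0.0])       # raw scores; last two are padding
mask = np.array([True, True, True, False, False])  # False marks padded positions

masked = np.where(mask, scores, -2.0 ** 32 + 1)    # huge negative, as in the DIEN code
weights = np.exp(masked - masked.max())
weights /= weights.sum()                           # softmax
print(weights.round(3))                            # padded positions get ~0 weight
```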
Sequence Mask
The sequence mask is designed to prevent the decoder from seeing future information: for a sequence at time step t, the decoder's output should depend only on the outputs before t, not on those after t, so the information after t must be hidden.
How is that done? It is simple: generate a triangular matrix whose upper triangle is all 0 and whose lower triangle (including the diagonal) is all 1, and apply this matrix to each sequence.
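A minimal numpy sketch (standalone): position t may only attend to positions <= t.

```python
import numpy as np

T = 4
seq_mask = np.tril(np.ones((T, T)))  # lower triangle incl. diagonal = 1, upper = 0
print(seq_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```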
For the decoder's self-attention, both the padding mask and the sequence mask are needed; the implementation adds the two masks together as attn_mask.
In all other cases, attn_mask is simply the padding mask.
DIN uses the padding mask.
5.2 Layer 2: 'Attention_layer_1'
The second Layer ‘Attention_layer_1’ corresponds to the Interest Evolving Layer in the red part of the architecture diagram, whose main component is AUGRU.
5.2.1 Attention mechanism
The Attention mechanism works as follows: imagine the constituent elements of Source as a series of <Key, Value> pairs. Given an element Query from Target, compute the similarity or correlation between the Query and each Key to obtain a weight coefficient for each Key's corresponding Value, then take the weighted sum of the Values to obtain the final Attention value. So essentially the Attention mechanism is a weighted sum of the Values of the elements in Source, with Query and Key used to compute the weight coefficients. The essential idea can be written as the formula below.
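The formula referenced above, reconstructed from the standard formulation ($L_x$ is the length of Source):

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i) \cdot Value_i$$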
Conceptually, Attention can still be understood as selectively filtering a small amount of important information out of a large amount of information and focusing on it while ignoring most of the unimportant information. The focusing is reflected in the computation of the weight coefficients: the larger the weight, the more focus falls on the corresponding Value; the weight represents the importance of the information, and the Value is the information itself.
Another way to think about it is to view the Attention mechanism as soft addressing: Source is the content of a memory whose elements consist of an address Key and a Value. Given a Query, addressing is performed by matching Key=Query to retrieve the corresponding Value, namely the Attention value. Unlike ordinary addressing, which retrieves exactly one item, soft addressing may retrieve content from every Key's address: the importance of each retrieved piece depends on the similarity between the Query and that Key, and the weighted sum of the Values yields the final Attention value. It is therefore reasonable that many researchers view the Attention mechanism as a special case of soft addressing.
Abstracting over most current methods, the computation of the Attention mechanism can be summarized as two procedures:
- The first procedure computes the weight coefficients from the Query and the Keys;
- The second procedure takes the weighted sum of the Values using those coefficients.
The first procedure can be subdivided into two stages:
- The first stage computes the similarity or correlation between the Query and each Key;
- The second stage normalizes the raw scores of the first stage.
In this way, the computation of Attention can be abstracted into three stages.
In the first stage, different functions and computation mechanisms can be used to compute the similarity or correlation between the Query and a particular Key_i; the most common choices are the vector dot product, cosine similarity, or an additional small neural network.
The scale of the scores produced in the first stage varies with the method used.
In the second stage, a SoftMax-style computation converts the first-stage scores: on the one hand it normalizes them into a probability distribution whose weights sum to 1; on the other hand, SoftMax's own mechanism accentuates the weights of the important elements. The formula generally used is reconstructed below.
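Reconstructed normalization formula ($Sim_i$ is the first-stage score):

$$a_i = \mathrm{Softmax}(Sim_i) = \frac{e^{Sim_i}}{\sum_{j=1}^{L_x} e^{Sim_j}}$$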
The second-stage result a_i is the weight coefficient for Value_i; the weighted sum then yields the Attention value:
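Reconstructed weighted sum:

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} a_i \cdot Value_i$$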
Through the computation of these three stages, the Attention value for the Query is obtained. Most concrete Attention computations today follow this three-stage abstraction.
5.2.2 Attention layer
In DIEN, the 'Attention_layer_1' layer adds an attention mechanism on top of the interest extraction layer to simulate, and model, the evolution of the interest related to the current target ad. The output of the first GRU layer is fed into the second-layer GRU, whose update gate is controlled by the attention score (computed from the first layer's output vectors and the candidate item).
We first need to compute the attention score, which is later fed into the GRU as part of its input.
This corresponds to the attention score formula in the paper, reconstructed below.
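Reconstructed from the DIEN paper for reference ($h_t$ is a hidden state of the first GRU layer, $e_a$ the concatenated embedding of the target ad, and $W$ a trainable matrix):

$$a_t = \frac{\exp(h_t W e_a)}{\sum_{j=1}^{T} \exp(h_j W e_a)}$$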
The code is as follows:
```python
# Attention layer
with tf.name_scope('Attention_layer_1'):
    att_outputs, alphas = din_fcn_attention(self.item_eb, rnn_outputs, ATTENTION_SIZE, self.mask,
                                            softmax_stag=1, stag='1_1', mode='LIST', return_alphas=True)
```
The logic of din_fcn_attention is as follows. Here query is the embedding of the target item and facts are the hidden states output by the first GRU layer:
- If facts comes from a bi-RNN, the forward and backward outputs are concatenated; if time_major, the layout is transposed: (T, B, D) => (B, T, D);
- Convert the mask: tf.ones_like(mask) builds a tensor of ones with the same shape as mask, and tf.equal converts the mask from int to bool (True where the two inputs are equal, False otherwise);
- Transform the query to the same shape as facts. T varies with each concrete training sample; for example, one user's time series may have length 5 and another's 15. query is [B, H]; tf.tile(query, [1, tf.shape(facts)[1]]) tiles it into [B, T * H] (tf.shape(facts)[1] is T), and a reshape then yields queries of shape [B, T, H], so that a weight can be computed between pos_item and every element of the user behavior sequence;
- Before the MLP, extra operations capture the relation between the behavior items and the candidate item: besides the raw vectors, their difference and product are used. This concat yields the input of the Local Activation Unit: the candidate ad's embedding queries, the user history embedding facts, plus the cross features between them;
- The attention operation computes the correlation between query and key: the weight of each position in queries and facts is obtained through a three-layer neural network whose output has a single node;
- d_layer_3_all's shape is [B, T, 1];
- It is then reshaped to [B, 1, T]; the dimension of size T on axis 2 holds the weight of each position of the user behavior sequence;
- This is the attention output scores, of shape [B, 1, T];
- To obtain truly meaningful scores, key_masks = tf.expand_dims(mask, 1) expands the mask from [B, T] to [B, 1, T], and tf.ones_like(scores) builds a tensor of ones with the same shape as scores;
- The padding positions are filled with a very large negative number, so that e^x is approximately 0 in the softmax that follows;
- The [B, 1, T] padding operation: to ignore the effect of padding on the whole, the code uses tf.where(key_masks, scores, paddings) to reset the weight of the padded positions (the vacant items of each sample sequence) to a minimum value (-2 ** 32 + 1) rather than 0, yielding the truly meaningful scores;
- Scaling is the standard attention operation: after scaling, the scores are fed into softmax to obtain the final weights. It is present in the code but commented out;
- SoftMax normalization then produces the normalized weights;
- Having obtained the correct weights scores and the user's historical behavior sequence facts, a weighted sum produces the representation of the user's interest;
- In SUM mode, the interest representation is obtained by matrix multiplication: scores is [B, 1, T] (the weight of each historical behavior) and facts is the history sequence of size [B, T, H]; tf.matmul of the two gives an output of [B, 1, H];
- Otherwise, the Hadamard (element-wise) product is used: scores is first reshaped from [B, 1, T] to [B, T], then expand_dims adds a trailing dimension of 1; the Hadamard product [B, T, H] x [B, T, 1] = [B, T, H] follows, and a final reshape keeps Batch * Time * Hidden Size.
The specific code is as follows:
```python
def din_fcn_attention(query, facts, attention_size, mask, stag='null', mode='SUM', softmax_stag=1,
                      time_major=False, return_alphas=False, forCnn=False):
    '''
    query: shape [B, H], i_emb;
    facts: shape [B, T, H], h_emb; T is the padded length, each emb of size H represents an item;
    mask: marks which behaviors in the batch are real, shape [B, T];
    '''
    if isinstance(facts, tuple):
        # In case of Bi-RNN, concatenate the forward and the backward RNN outputs.
        facts = tf.concat(facts, 2)
    if len(facts.get_shape().as_list()) == 2:
        facts = tf.expand_dims(facts, 1)
    if time_major:
        # (T,B,D) => (B,T,D)
        facts = tf.array_ops.transpose(facts, [1, 0, 2])
    # Trainable parameters
    mask = tf.equal(mask, tf.ones_like(mask))
    facts_size = facts.get_shape().as_list()[-1]   # D value - hidden size of the RNN layer
    querry_size = query.get_shape().as_list()[-1]  # H, here 36
    # Different from DIN attention
    query = tf.layers.dense(query, facts_size, activation=None, name='f1' + stag)
    query = prelu(query)
    # 1. Tile the query to the history length T: query is [B, H]; it becomes queries of
    #    shape [B, T, H], so that a weight can be computed between pos_item and every
    #    element of the user behavior sequence
    queries = tf.tile(query, [1, tf.shape(facts)[1]])  # [B, H] -> [B, T * H]
    queries = tf.reshape(queries, tf.shape(facts))     # [B, T * H] -> [B, T, H]
    # 2. Before the MLP, capture the relation between behavior items and the candidate
    #    item with extra operations (difference and product). This concat is the Local
    #    Activation Unit input: the candidate ad (queries), the user history (facts),
    #    plus the cross features between them
    din_all = tf.concat([queries, facts, queries - facts, queries * facts], axis=-1)
    # 3. Attention operation: a 3-layer MLP with a single output node produces the weights
    d_layer_1_all = tf.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att' + stag)
    d_layer_2_all = tf.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att' + stag)
    d_layer_3_all = tf.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att' + stag)
    # d_layer_3_all has shape [B, T, 1]; reshape to [B, 1, T]: the dimension of size T
    # on axis 2 holds the weight of each position of the user behavior sequence
    d_layer_3_all = tf.reshape(d_layer_3_all, [-1, 1, tf.shape(facts)[1]])
    scores = d_layer_3_all  # [B, 1, T]
    # 4. Get truly meaningful scores
    # key_masks = tf.sequence_mask(facts_length, tf.shape(facts)[1])  # [B, T]
    key_masks = tf.expand_dims(mask, 1)  # [B, T] -> [B, 1, T]
    # padding positions get a very large negative number so that e^x ~ 0 in the softmax
    paddings = tf.ones_like(scores) * (-2 ** 32 + 1)
    # To ignore the effect of padding, tf.where resets the weight of padded positions
    # (the vacant items of each sample sequence) to the minimum (-2 ** 32 + 1), not 0
    if not forCnn:
        scores = tf.where(key_masks, scores, paddings)  # [B, 1, T]
    # 5. Scale: the standard attention operation before softmax; commented out here
    # scores = scores / (facts.get_shape().as_list()[-1] ** 0.5)
    # 6. Activation: obtain the normalized weights
    if softmax_stag:
        scores = tf.nn.softmax(scores)  # [B, 1, T]
    # 7. With the correct weights scores and the user history facts,
    #    a weighted sum gives the user interest representation
    if mode == 'SUM':
        # scores is [B, 1, T], the weight of each historical behavior;
        # facts is the history sequence, [B, T, H];
        # matmul: B * (1 * T) * (T * H) -> [B, 1, H]
        # output is the attention-weighted result, i.e. w in formula (3) of the paper
        output = tf.matmul(scores, facts)  # [B, 1, H]
        # output = tf.reshape(output, [-1, tf.shape(facts)[-1]])
    else:
        # reshape scores from [B, 1, T] to [B, T]
        scores = tf.reshape(scores, [-1, tf.shape(facts)[1]])
        # Hadamard product: [B, T, H] x [B, T, 1] = [B, T, H]
        output = facts * tf.expand_dims(scores, -1)
        output = tf.reshape(output, tf.shape(facts))  # Batch * Time * Hidden Size
    # also return the attention weights when requested, matching the call site above
    if return_alphas:
        return output, scores
    return output
```
5.2.3 VecAttGRUCell
Next comes the structure of AUGRU, for which a new VecAttGRUCell is designed. The principles of the RNN cell will be discussed in another article.
Deep-learning-based text classification faces a similar problem: how to compress the multiple word vectors of a paragraph into one vector that represents the paragraph. A common method is to feed the word vectors into an RNN and take the RNN's output at the last time step as the "combination" of the word vectors. DIEN borrows this idea and modifies the GRU structure, using the attention score to control the update gate.
The change Ali made here is mainly in the call function, centered on att_score:

```python
u = (1.0 - att_score) * u
new_h = u * state + (1 - u) * c
return new_h, new_h
```
The specific code is:

```python
def call(self, inputs, state, att_score=None):
    ...
    c = self._activation(self._candidate_linear([inputs, r_state]))
    u = (1.0 - att_score) * u          # newly added: the attention score scales the update gate
    new_h = u * state + (1 - u) * c    # newly added
    return new_h, new_h
```
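A minimal numpy sketch of one AUGRU step (the names and shapes are made up; it only illustrates how att_score scales the update gate). Note the gate convention of this code: u multiplies the previous state, and the attention score down-weights u, so a high score pushes the state toward the candidate.

```python
import numpy as np

def augru_step(h_prev, u, c, att_score):
    """One AUGRU update: u is the original GRU update gate, c the candidate state,
    att_score the attention weight in [0, 1] for this step."""
    u_hat = (1.0 - att_score) * u      # attention scales the update gate
    return u_hat * h_prev + (1 - u_hat) * c

h_prev, u, c = np.zeros(4), np.full(4, 0.5), np.ones(4)
print(augru_step(h_prev, u, c, att_score=0.0))  # [0.5 ...] weak attention: partial update
print(augru_step(h_prev, u, c, att_score=1.0))  # [1. ...]  strong attention: full move to candidate
```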
5.2.4 Computing the evolution of interest
With the new GRU cell in place, the evolution of interest can be computed; that is what 'rnn_2' does.

```python
with tf.name_scope('rnn_2'):
    rnn_outputs2, final_state2 = dynamic_rnn(VecAttGRUCell(HIDDEN_SIZE), inputs=rnn_outputs,
                                             att_scores=tf.expand_dims(alphas, -1),
                                             sequence_length=self.seq_len_ph, dtype=tf.float32,
                                             scope="gru2")
```
5.2.5 Computing the input of the fully connected layer
After the interest-evolution result final_state2 is obtained, it is concatenated with the other embeddings to form the input of the fully connected layer:

```python
inp = tf.concat([self.uid_batch_embedded, self.item_eb, self.item_his_eb_sum,
                 self.item_eb * self.item_his_eb_sum, final_state2], 1)
```
0x06 Fully Connected Layer
Now that we have the concatenated dense representation vector, the next step is to use fully connected layers to automatically learn nonlinear combinations of the features.
The final CTR estimate is then obtained through a multi-layer neural network, which is a single function call:

```python
# Fully connected layer
self.build_fcn_net(inp, use_dice=True)
```

This corresponds to the final MLP in the paper.
The logic is as follows:
- First apply batch normalization;
- Add a fully connected layer: tf.layers.dense(bn1, 200, activation=None, name='f1');
- Activate with dice or prelu;
- Add a fully connected layer: tf.layers.dense(dnn1, 80, activation=None, name='f2');
- Activate with dice or prelu;
- Add a fully connected layer: tf.layers.dense(dnn2, 2, activation=None, name='f3');
- Get the output: y_hat = tf.nn.softmax(dnn3) + 0.00000001;
- Initialize the cross-entropy loss and the optimizer:
- the cross entropy is - tf.reduce_mean(tf.log(self.y_hat) * self.target_ph);
- if negative sampling is used, the auxiliary loss is added;
- AdamOptimizer is used;
- Compute the accuracy.
The specific code is as follows:
```python
def build_fcn_net(self, inp, use_dice=False):
    bn1 = tf.layers.batch_normalization(inputs=inp, name='bn1')
    dnn1 = tf.layers.dense(bn1, 200, activation=None, name='f1')
    if use_dice:
        dnn1 = dice(dnn1, name='dice_1')
    else:
        dnn1 = prelu(dnn1, 'prelu1')
    dnn2 = tf.layers.dense(dnn1, 80, activation=None, name='f2')
    if use_dice:
        dnn2 = dice(dnn2, name='dice_2')
    else:
        dnn2 = prelu(dnn2, 'prelu2')
    dnn3 = tf.layers.dense(dnn2, 2, activation=None, name='f3')
    self.y_hat = tf.nn.softmax(dnn3) + 0.00000001

    with tf.name_scope('Metrics'):
        # Cross-entropy loss and optimizer initialization
        ctr_loss = - tf.reduce_mean(tf.log(self.y_hat) * self.target_ph)
        self.loss = ctr_loss
        if self.use_negsampling:
            self.loss += self.aux_loss
        tf.summary.scalar('loss', self.loss)
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(self.loss)
        # Accuracy metric
        self.accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(self.y_hat), self.target_ph), tf.float32))
        tf.summary.scalar('accuracy', self.accuracy)

    self.merged = tf.summary.merge_all()
```
0x07 Training the Model
The model is trained with model.train.
The inputs of model.train are:
- the user id;
- the target item id;
- the category id of the target item;
- the item id list of the user's historical behaviors;
- the category id list corresponding to the historical behavior items;
- the mask of the historical behaviors;
- the target label;
- the length of the historical behaviors;
- the learning rate;
- the negative-sampled data.
The train code is as follows:
```python
def train(self, sess, inps):
    if self.use_negsampling:
        loss, accuracy, aux_loss, _ = sess.run([self.loss, self.accuracy, self.aux_loss, self.optimizer], feed_dict={
            self.uid_batch_ph: inps[0],
            self.mid_batch_ph: inps[1],
            self.cat_batch_ph: inps[2],
            self.mid_his_batch_ph: inps[3],
            self.cat_his_batch_ph: inps[4],
            self.mask: inps[5],
            self.target_ph: inps[6],
            self.seq_len_ph: inps[7],
            self.lr: inps[8],
            self.noclk_mid_batch_ph: inps[9],
            self.noclk_cat_batch_ph: inps[10],
        })
        return loss, accuracy, aux_loss
    else:
        loss, accuracy, _ = sess.run([self.loss, self.accuracy, self.optimizer], feed_dict={
            self.uid_batch_ph: inps[0],
            self.mid_batch_ph: inps[1],
            self.cat_batch_ph: inps[2],
            self.mid_his_batch_ph: inps[3],
            self.cat_his_batch_ph: inps[4],
            self.mask: inps[5],
            self.target_ph: inps[6],
            self.seq_len_ph: inps[7],
            self.lr: inps[8],
        })
        return loss, accuracy, 0
```
Stay tuned for the RNN Cell in the next article.
0xEE Personal Information
★★★★★ Thoughts on life and technology ★★★★★
WeChat official account: Rosie's Thoughts
If you want timely news of my articles, or want to see the technical material I recommend, please follow this account.