0x00 Overview
Deep Interest Network (DIN) was proposed by Alibaba's Alimama precise targeting and basic algorithm team in June 2017. It is a CTR prediction model for the e-commerce industry that focuses on making full use of, and mining information from, historical user behavior data.
This series of articles reviews some deep-learning concepts and their TensorFlow implementation by interpreting the DIN paper and source code. This second article analyzes how the training data is generated and how user sequences are modeled.
0x01 What data is required for DIN
Let's first summarize DIN's approach:
- CTR models generally abstract the user's behavior sequence into a feature, referred to here as the behavior embedding.
- Earlier prediction models treat all behaviors in a user's sequence equally, for example by simple pooling or by adding time decay.
- DIN digs deeper into user behavior intent: the relevance between each historical behavior and the candidate item is different. Based on this observation, DIN introduces a module that computes this relevance (later called attention) and uses it to perform weighted pooling over the behavior sequence, obtaining the desired embedding (a minimal sketch of such weighted pooling follows this list).
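To make this weighted pooling concrete, here is a minimal numpy sketch. It is illustrative only: the real DIN attention unit is a small feed-forward network over the behavior embedding, the candidate embedding, and their interactions, whereas this sketch simply uses a dot-product score followed by softmax.
import numpy as np

def weighted_pool(behavior_embs, candidate_emb):
    # behavior_embs: (seq_len, dim) embeddings of historical behaviors
    # candidate_emb: (dim,) embedding of the candidate item
    scores = behavior_embs @ candidate_emb            # a stand-in relevance score per behavior
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the sequence
    return weights @ behavior_embs                    # weighted sum -> user interest vector, shape (dim,)

behavior_embs = np.random.rand(5, 8)   # 5 historical behaviors, embedding dim 8
candidate_emb = np.random.rand(8)
print(weighted_pool(behavior_embs, candidate_emb).shape)   # (8,)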
It can be seen that the user behavior sequence is the core input. Around it, a series of other data is needed: users, items, and item attributes. So DIN requires the following data:
- A user dictionary, mapping user names to user IDs;
- A movie (item) dictionary, mapping items to item IDs;
- A category dictionary, mapping categories to category IDs;
- The category information corresponding to each item;
- Training data, in the format: label, user name, target item, target item category, history items, and the categories of the history items;
- Test data, in the same format as the training data.
0x02 How to Generate the Data
The prepare_data.sh script performs the data processing and generates the various data files. Its content is as follows.
export PATH="~/anaconda4/bin:$PATH"
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books.json.gz
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz
gunzip reviews_Books.json.gz
gunzip meta_Books.json.gz
python script/process_data.py meta_Books.json reviews_Books_5.json
python script/local_aggretor.py
python script/split_by_user.py
python script/generate_voc.py
These processing scripts do the following:
- process_data.py: generates the metadata files, builds negative samples, and separates samples;
- local_aggretor.py: generates the user behavior sequences;
- split_by_user.py: splits the data into training and test sets;
- generate_voc.py: generates three data dictionaries, for users, movies, and categories respectively.
2.1 Basic Data
This paper uses Amazon Product Data, which contains two files: reviews_Electronics_5.json and meta_Electronics.json.
Among them:
- The reviews file contains the contextual information generated when users review products, including the product ID, time, review text, etc.
- The meta file contains information about the products themselves, including the product ID, name, categories, related products, etc.
The specific format is as follows:
Reviews_Electronics data | Description
---|---
reviewerID | Reviewer ID, e.g. [A2SUAM1J3GNN3B]
asin | Product ID, e.g. [0000013714]
reviewerName | Reviewer nickname
helpful | Helpfulness rating of the review, e.g. 2/3
reviewText | Review text
overall | Product rating
summary | Review summary
unixReviewTime | Review time (Unix timestamp)
reviewTime | Review time (raw)
Meta_Electronics data | Description
---|---
asin | Product ID
title | Product name
imUrl | Product image URL
categories | List of categories the product belongs to
description | Product description
The user behavior in this dataset is rich: every user and every item has more than five reviews. Features include goods_id, cate_id, and the user's reviewed goods_id_list and cate_id_list. All behaviors of a user form the sequence (b1, b2, ..., bk, ..., bn).
The task is to predict the (k+1)-th reviewed item using the first k reviewed items. The training dataset is built with k = 1, 2, ..., n-2 for each user. A small sketch of this construction follows.
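As a small illustration (this helper is hypothetical, not code from the repository), one user's review history can be unrolled into such prediction tasks like this:
def unroll_user_history(reviewed_items):
    # use the first k reviewed items to predict the (k+1)-th, for k = 1, 2, ..., n-2
    tasks = []
    n = len(reviewed_items)
    for k in range(1, n - 1):
        hist = reviewed_items[:k]        # the first k reviewed items
        target = reviewed_items[k]       # the (k+1)-th item to be predicted
        tasks.append((hist, target))
    return tasks

print(unroll_user_history(["b1", "b2", "b3", "b4", "b5"]))
# [(['b1'], 'b2'), (['b1', 'b2'], 'b3'), (['b1', 'b2', 'b3'], 'b4')]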
2.2 Data Processing
2.2.1 Generating Metadata
By processing these two JSON files, we can generate two metadata files: item-info and review-info.
python script/process_data.py meta_Books.json reviews_Books_5.json
The specific code is as follows; it simply extracts the required fields:
def process_meta(file):
    fi = open(file, "r")
    fo = open("item-info", "w")
    for line in fi:
        obj = eval(line)
        cat = obj["categories"][0][-1]   # the finest-grained category of the first category path
        print >> fo, obj["asin"] + "\t" + cat

def process_reviews(file):
    fi = open(file, "r")
    user_map = {}
    fo = open("reviews-info", "w")
    for line in fi:
        obj = eval(line)
        userID = obj["reviewerID"]
        itemID = obj["asin"]
        rating = obj["overall"]
        time = obj["unixReviewTime"]
        print >> fo, userID + "\t" + itemID + "\t" + str(rating) + "\t" + str(time)
The generated files are as follows.
The format of reviews-info is: userID, itemID, rating, timestamp
A2S166WSCFIFP5 000100039X 5.0 1071100800
A1BM81XB4QHOA3 000100039X 5.0 1390003200
A1MOSTXNIO5MPJ 000100039X 5.0 1317081600
A2XQ5LZHTD4AFT 000100039X 5.0 1033948800
A3V1MKC2BVWY48 000100039X 5.0 1390780800
A12387207U8U24 000100039X 5.0 1206662400
item-info maps a product ID to the category the product belongs to; it acts like a mapping table. For example, product 0001048791 corresponds to the category Books.
0001048791 Books
0001048775 Books
0001048236 Books
0000401048 Books
0001019880 Books
0001048813 Books
2.2.2 Build a sample list
Negative samples are constructed by the manual_join function. The specific logic is:
- Build item_list, the list of all item IDs appearing in reviews-info (used to draw random negatives);
- Build the behavior sequence of every user; each sequence element is a tuple ("userID\titemID\trating\ttimestamp", timestamp);
- Iterate over each user:
  - Sort the user's behavior sequence by timestamp;
  - For each sorted behavior, build two samples:
    - A negative sample: the item ID of the behavior is replaced with a randomly selected item ID, and the label (click) is set to 0;
    - A positive sample: the behavior itself, with the label (click) set to 1;
  - Write both samples to the output file.
Such as:
The list of goods is:
item_list =
0000000 = {str} '000100039X'
0000001 = {str} '000100039X'
0000002 = {str} '000100039X'
0000003 = {str} '000100039X'
0000004 = {str} '000100039X'
0000005 = {str} '000100039X'
The sequence of user behaviors is:
user_map = {dict: 603668}
 'A1BM81XB4QHOA3' = {list: 6}
  0 = {tuple: 2} ('A1BM81XB4QHOA3\t000100039X\t5.0\t1390003200', 1390003200.0)
  1 = {tuple: 2} ('A1BM81XB4QHOA3\t0060838582\t5.0\t1190851200', 1190851200.0)
  2 = {tuple: 2} ('A1BM81XB4QHOA3\t0743241924\t4.0\t1143158400', 1143158400.0)
  3 = {tuple: 2} ('A1BM81XB4QHOA3\t0848732391\t2.0\t1300060800', 1300060800.0)
  4 = {tuple: 2} ('A1BM81XB4QHOA3\t0884271781\t5.0\t1403308800', 1403308800.0)
  5 = {tuple: 2} ('A1BM81XB4QHOA3\t1885535104\t5.0\t1390003200', 1390003200.0)
 'A1MOSTXNIO5MPJ' = {list: 9}
  0 = {tuple: 2} ('A1MOSTXNIO5MPJ\t000100039X\t5.0\t1317081600', 1317081600.0)
  1 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0143142941\t4.0\t1211760000', 1211760000.0)
  2 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0310325366\t1.0\t1259712000', 1259712000.0)
  3 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0393062112\t5.0\t1179964800', 1179964800.0)
  4 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0872203247\t3.0\t1211760000', 1211760000.0)
  5 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1455504181\t5.0\t1398297600', 1398297600.0)
  6 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1596917024\t5.0\t1369440000', 1369440000.0)
  7 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1600610676\t5.0\t1276128000', 1276128000.0)
  8 = {tuple: 2} ('A1MOSTXNIO5MPJ\t9380340141\t3.0\t1369440000', 1369440000.0)
The specific code is as follows:
import random   # needed for negative sampling

def manual_join():
    f_rev = open("reviews-info", "r")
    user_map = {}
    item_list = []
    for line in f_rev:
        line = line.strip()
        items = line.split("\t")
        if items[0] not in user_map:
            user_map[items[0]] = []
        user_map[items[0]].append(("\t".join(items), float(items[-1])))
        item_list.append(items[1])
    f_meta = open("item-info", "r")
    meta_map = {}
    for line in f_meta:
        arr = line.strip().split("\t")
        if arr[0] not in meta_map:
            meta_map[arr[0]] = arr[1]
    fo = open("jointed-new", "w")
    for key in user_map:
        sorted_user_bh = sorted(user_map[key], key=lambda x: x[1])  # order the user's behaviors by timestamp
        for line, t in sorted_user_bh:
            # for every user behavior
            items = line.split("\t")
            asin = items[1]
            j = 0
            while True:
                asin_neg_index = random.randint(0, len(item_list) - 1)  # get a random item index
                asin_neg = item_list[asin_neg_index]                    # get the random item ID
                if asin_neg == asin:    # if it happens to be the positive item, draw again
                    continue
                items[1] = asin_neg
                # write the negative sample
                print >> fo, "0" + "\t" + "\t".join(items) + "\t" + meta_map[asin_neg]
                j += 1
                if j == 1:              # negative sampling frequency
                    break
            # write the positive sample
            if asin in meta_map:
                print >> fo, "1" + "\t" + line + "\t" + meta_map[asin]
            else:
                print >> fo, "1" + "\t" + line + "\t" + "default_cat"
An extract of the resulting file is shown below; it contains alternating negative and positive samples.
0 A10000012B7CGYKOMPQ4L 140004314X 5.0 1355616000 Books
1 A10000012B7CGYKOMPQ4L 000100039X 5.0 1355616000 Books
0 A10000012B7CGYKOMPQ4L 1477817603 5.0 1355616000 Books
1 A10000012B7CGYKOMPQ4L 0393967972 5.0 1355616000 Books
0 A10000012B7CGYKOMPQ4L 0778329933 5.0 1355616000 Books
1 A10000012B7CGYKOMPQ4L 0446691437 5.0 1355616000 Books
0 A10000012B7CGYKOMPQ4L B006P5CH1O 4.0 1355616000 Collections & Anthologies
2.2.3 Sample separation
This step separates the samples in order to mark the last two samples on each user's timeline:
- Read jointed-new generated in the previous step;
- Count the number of records per user in user_count;
- Walk through jointed-new again:
  - If the line is one of the last two records of the user, prefix it with 20190119;
  - Otherwise (the earlier records of the user), prefix it with 20180118;
- Write the new records to jointed-new-split-info.
Thus, in jointed-new-split-info, the two records prefixed with 20190119 are the last two records of a user's behavior; they happen to be one positive sample and one negative sample, and the last two in time.
The code is as follows:
def split_test():
    fi = open("jointed-new", "r")
    fo = open("jointed-new-split-info", "w")
    user_count = {}
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user not in user_count:
            user_count[user] = 0
        user_count[user] += 1
    fi.seek(0)
    i = 0
    last_user = "A26ZDKC53OP6JD"
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user == last_user:
            if i < user_count[user] - 2:  # everything except the last positive/negative pair
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        else:
            last_user = user
            i = 0
            if i < user_count[user] - 2:
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        i += 1
The resulting file looks like this:
20180118 0 A10000012B7CGYKOMPQ4L 140004314X 5.0 1355616000 Books
20180118 1 A10000012B7CGYKOMPQ4L 000100039X 5.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L 1477817603 5.0 1355616000 Books
20180118 1 A10000012B7CGYKOMPQ4L 0393967972 5.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L 0778329933 5.0 1355616000 Books
20180118 1 A10000012B7CGYKOMPQ4L 0446691437 5.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L B006P5CH1O 4.0 1355616000 Collections & Anthologies
20180118 1 A10000012B7CGYKOMPQ4L 0486227081 4.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L B00HWI5OP4 4.0 1355616000 United States
20180118 1 A10000012B7CGYKOMPQ4L 048622709X 4.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L 1475005873 4.0 1355616000 Books
20180118 1 A10000012B7CGYKOMPQ4L 0486274268 4.0 1355616000 Books
20180118 0 A10000012B7CGYKOMPQ4L 098960571X 4.0 1355616000 Books
20180118 1 A10000012B7CGYKOMPQ4L 0486404730 4.0 1355616000 Books
20190119 0 A10000012B7CGYKOMPQ4L 1495459225 4.0 1355616000 Books
20190119 1 A10000012B7CGYKOMPQ4L 0830604790 4.0 1355616000 Books
2.2.4 Generating behavior sequences
local_aggretor.py is used to generate the user behavior sequences.
For example, for a user with reviewerID=0 whose pos_list is [13179, 17993, 28326, 29247, 62275], the generated training samples have the format (reviewerID, hist, pos_item, 1) and (reviewerID, hist, neg_item, 0).
Note that hist contains neither pos_item nor neg_item; it only contains the items clicked before pos_item. DIN uses an attention-like mechanism in which only past behavior can influence what comes later, so it makes sense that hist contains only the items clicked before pos_item. A minimal sketch of this construction is shown below.
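Here build_samples is a hypothetical helper, not the repository's code (which builds the same pairs while streaming over jointed-new-split-info):
import random

def build_samples(reviewer_id, pos_list, all_items):
    samples = []
    for i in range(1, len(pos_list)):
        hist = pos_list[:i]                  # only items clicked before pos_item
        pos_item = pos_list[i]
        neg_item = random.choice(all_items)
        while neg_item in pos_list:          # re-draw until it is a true negative
            neg_item = random.choice(all_items)
        samples.append((reviewer_id, hist, pos_item, 1))
        samples.append((reviewer_id, hist, neg_item, 0))
    return samples

samples = build_samples(0, [13179, 17993, 28326, 29247, 62275], list(range(100000)))
print(samples[0])   # (0, [13179], 17993, 1)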
The specific logic is:
- Traverse all lines of jointed-new-split-info;
- For each line, emit a sample consisting of the accumulated click history plus the current candidate item: if the line starts with 20180118 the sample is written to local_train, if it starts with 20190119 it is written to local_test;
- After emitting, if the current record is a click (label 1), append its item ID and category ID to the accumulated history.
Because 20190119 marks the last two records in time, the final local_test file contains, for each user, two fully accumulated behavior sequences, i.e. sequences that cover the user's whole history from beginning to end.
The file naming here is a bit odd, because the actual training and testing later both use data derived from the local_test file.
The two sequences form one positive sample and one negative sample; they are identical except for the last (candidate) item ID and the click label.
The specific code is as follows:
fin = open("jointed-new-split-info", "r")
ftrain = open("local_train", "w")
ftest = open("local_test", "w")

last_user = "0"
common_fea = ""
line_idx = 0
for line in fin:
    items = line.strip().split("\t")
    ds = items[0]
    clk = int(items[1])
    user = items[2]
    movie_id = items[3]
    dt = items[5]
    cat1 = items[6]

    if ds == "20180118":
        fo = ftrain
    else:
        fo = ftest
    if user != last_user:
        movie_id_list = []
        cate1_list = []
    else:
        history_clk_num = len(movie_id_list)
        cat_str = ""
        mid_str = ""
        # NOTE: the history items/categories are joined with the non-printable separator "\x02",
        # which shows up as an invisible character in the original listing
        for c1 in cate1_list:
            cat_str += c1 + "\x02"
        for mid in movie_id_list:
            mid_str += mid + "\x02"
        if len(cat_str) > 0: cat_str = cat_str[:-1]
        if len(mid_str) > 0: mid_str = mid_str[:-1]
        if history_clk_num >= 1:    # 8 is the average length of user behavior
            print >> fo, items[1] + "\t" + user + "\t" + movie_id + "\t" + cat1 + "\t" + mid_str + "\t" + cat_str
    last_user = user
    if clk:                              # only clicked (label 1) records are accumulated
        movie_id_list.append(movie_id)   # accumulate the clicked movie IDs
        cate1_list.append(cat1)          # accumulate the corresponding category IDs
    line_idx += 1
Finally, the local_test data looks like the following (the items inside the sequence fields are joined by the non-printable separator, so they appear concatenated here):
0 A10000012B7CGYKOMPQ4L 1495459225 Books 000100039X039396797204466914370486227081048622709X04862742680486404730 BooksBooksBooksBooksBooksBooksBooks
1 A10000012B7CGYKOMPQ4L 0830604790 Books 000100039X039396797204466914370486227081048622709X04862742680486404730 BooksBooksBooksBooksBooksBooksBooks
2.2.5 Splitting into Training and Test Sets
split_by_user.py splits the dataset.
For each pair of samples, a random integer between 1 and 10 is drawn; if it is exactly 2, the pair is used as the test (validation) set.
import random   # needed for the random split

fi = open("local_test", "r")
ftrain = open("local_train_splitByUser", "w")
ftest = open("local_test_splitByUser", "w")

while True:
    rand_int = random.randint(1, 10)
    noclk_line = fi.readline().strip()
    clk_line = fi.readline().strip()
    if noclk_line == "" or clk_line == "":
        break
    if rand_int == 2:
        print >> ftest, noclk_line
        print >> ftest, clk_line
    else:
        print >> ftrain, noclk_line
        print >> ftrain, clk_line
Examples are as follows:
The format is: label, user ID, candidate item ID, candidate item category, behavior sequence, category sequence
0 A3BI7R43VUZ1TY B00JNHU0T2 Literature & Fiction 0989464105B00B01691C14778097321608442845 BooksLiterature & FictionBooksBooks
1 A3BI7R43VUZ1TY 0989464121 Books 0989464105B00B01691C14778097321608442845 BooksLiterature & FictionBooksBooks
2.2.6 Generating the Data Dictionaries
generate_voc.py generates three data dictionaries, for users, movies, and categories. The three dictionaries cover all user IDs, all movie IDs, and all category IDs respectively; essentially, each set of keys is simply numbered.
Specifically, the movie IDs, categories, and reviewerIDs are used to produce three maps (mid_voc, cat_voc, uid_voc). The key is the original value and the value is an index: uid_voc starts at 0, while mid_voc and cat_voc reserve index 0 for a default entry and start at 1, with keys ordered by frequency. The corresponding columns of the raw data are then converted to these indices.
import cPickle

f_train = open("local_train_splitByUser", "r")
uid_dict = {}
mid_dict = {}
cat_dict = {}

iddd = 0
for line in f_train:
    arr = line.strip("\n").split("\t")
    clk = arr[0]
    uid = arr[1]
    mid = arr[2]
    cat = arr[3]
    mid_list = arr[4]
    cat_list = arr[5]
    if uid not in uid_dict:
        uid_dict[uid] = 0
    uid_dict[uid] += 1
    if mid not in mid_dict:
        mid_dict[mid] = 0
    mid_dict[mid] += 1
    if cat not in cat_dict:
        cat_dict[cat] = 0
    cat_dict[cat] += 1
    if len(mid_list) == 0:
        continue
    # the sequence fields are joined with the non-printable "\x02" separator
    for m in mid_list.split("\x02"):
        if m not in mid_dict:
            mid_dict[m] = 0
        mid_dict[m] += 1
    iddd += 1
    for c in cat_list.split("\x02"):
        if c not in cat_dict:
            cat_dict[c] = 0
        cat_dict[c] += 1

sorted_uid_dict = sorted(uid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_mid_dict = sorted(mid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_cat_dict = sorted(cat_dict.iteritems(), key=lambda x: x[1], reverse=True)

uid_voc = {}
index = 0
for key, value in sorted_uid_dict:
    uid_voc[key] = index
    index += 1

mid_voc = {}
mid_voc["default_mid"] = 0
index = 1
for key, value in sorted_mid_dict:
    mid_voc[key] = index
    index += 1

cat_voc = {}
cat_voc["default_cat"] = 0
index = 1
for key, value in sorted_cat_dict:
    cat_voc[key] = index
    index += 1

cPickle.dump(uid_voc, open("uid_voc.pkl", "w"))
cPickle.dump(mid_voc, open("mid_voc.pkl", "w"))
cPickle.dump(cat_voc, open("cat_voc.pkl", "w"))
Finally, we obtain the files consumed by the DIN model:
- uid_voc.pkl: user dictionary, mapping user names to IDs;
- mid_voc.pkl: movie (item) dictionary, mapping items to IDs;
- cat_voc.pkl: category dictionary, mapping categories to IDs;
- item-info: the category information of each item;
- reviews-info: review metadata in the format userID, itemID, rating, timestamp; used for negative sampling;
- local_train_splitByUser: training data, in the format label, user name, target item, target item category, history items, and the categories of the history items;
- local_test_splitByUser: test data, in the same format as the training data.
0x03 How to Use the Data
3.1 Training data
train.py first evaluates the test set once with the initial model, and then evaluates it again every test_iter batches during training.
The code for the lite version is as follows:
def train(
        train_file = "local_train_splitByUser",
        test_file = "local_test_splitByUser",
        uid_voc = "uid_voc.pkl",
        mid_voc = "mid_voc.pkl",
        cat_voc = "cat_voc.pkl",
        batch_size = 128,
        maxlen = 100,
        test_iter = 100,
        save_iter = 100,
        model_type = 'DNN',
        seed = 2,
):
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        # get the training data and the test data
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()

        # build the model
        model = Model_DIN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)

        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                # prepare one batch of data
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                # train on the batch
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                if (iter % test_iter) == 0:
                    eval(sess, test_data, model, best_model_path)
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    model.save(sess, model_path + "--" + str(iter))
            lr *= 0.5
3.2 Iterative Reading
DataIterator is an iterator that returns the next batch of data on each call. This part of the code deals with how the data is split into batches and how the iterator is constructed.
As mentioned above, the format of each training record is: label, user ID, candidate item ID, candidate item category, behavior sequence, category sequence. An example of parsing one such line is sketched below.
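For illustration, here is a minimal parsing sketch (not part of the repository); it assumes the two sequence fields are joined with the non-printable "\x02" separator produced by the preprocessing scripts above:
def parse_line(line):
    label, uid, mid, cat, mid_seq, cat_seq = line.rstrip("\n").split("\t")
    return {
        "label": int(label),
        "uid": uid,
        "mid": mid,                          # candidate item ID
        "cat": cat,                          # candidate item category
        "mid_list": mid_seq.split("\x02"),   # historical item IDs
        "cat_list": cat_seq.split("\x02"),   # historical category IDs
    }

sample = "1\tA3BI7R43VUZ1TY\t0989464121\tBooks\t0989464105\x02B00B01691C\tBooks\x02Literature & Fiction"
print(parse_line(sample)["mid_list"])        # ['0989464105', 'B00B01691C']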
3.2.1 Initialization
The basic logic is:
In the __init__ function:
- self.source_dicts holds the three dictionaries [uid_voc, mid_voc, cat_voc];
- self.meta_id_map stores the category ID corresponding to each movie ID, i.e. it builds the movie ID → category ID mapping; the key line is self.meta_id_map[mid_idx] = cat_idx;
- reviews-info is read to build the list of IDs needed for negative sampling (mid_list_for_random);
- various basic values are recorded, such as the number of users, the number of movies, the number of categories, and so on.
The code is as follows:
class DataIterator:

    def __init__(self, source,
                 uid_voc,
                 mid_voc,
                 cat_voc,
                 batch_size=128,
                 maxlen=100,
                 skip_empty=False,
                 shuffle_each_epoch=False,
                 sort_by_length=True,
                 max_batch_size=20,
                 minlen=None):
        if shuffle_each_epoch:
            self.source_orig = source
            self.source = shuffle.main(self.source_orig, temporary=True)
        else:
            self.source = fopen(source, 'r')
        self.source_dicts = []
        # self.source_dicts holds [uid_voc, mid_voc, cat_voc]
        for source_dict in [uid_voc, mid_voc, cat_voc]:
            self.source_dicts.append(load_dict(source_dict))

        # build the movie-id -> category-id mapping: self.meta_id_map[mid_idx] = cat_idx
        f_meta = open("item-info", "r")
        meta_map = {}
        for line in f_meta:
            arr = line.strip().split("\t")
            if arr[0] not in meta_map:
                meta_map[arr[0]] = arr[1]
        self.meta_id_map = {}
        for key in meta_map:
            val = meta_map[key]
            if key in self.source_dicts[1]:
                mid_idx = self.source_dicts[1][key]
            else:
                mid_idx = 0
            if val in self.source_dicts[2]:
                cat_idx = self.source_dicts[2][val]
            else:
                cat_idx = 0
            self.meta_id_map[mid_idx] = cat_idx

        # read reviews-info to build the id list used for negative sampling
        f_review = open("reviews-info", "r")
        self.mid_list_for_random = []
        for line in f_review:
            arr = line.strip().split("\t")
            tmp_idx = 0
            if arr[1] in self.source_dicts[1]:
                tmp_idx = self.source_dicts[1][arr[1]]
            self.mid_list_for_random.append(tmp_idx)

        # basic values such as the number of users, movies, categories, etc.
        self.batch_size = batch_size
        self.maxlen = maxlen
        self.minlen = minlen
        self.skip_empty = skip_empty

        self.n_uid = len(self.source_dicts[0])
        self.n_mid = len(self.source_dicts[1])
        self.n_cat = len(self.source_dicts[2])

        self.shuffle = shuffle_each_epoch
        self.sort_by_length = sort_by_length

        self.source_buffer = []
        self.k = batch_size * max_batch_size

        self.end_of_data = False
The final data is as follows:
self = {DataIterator} <data_iterator.DataIterator object at 0x000001F56CB44BA8>
 batch_size = {int} 128
 k = {int} 2560
 maxlen = {int} 100
 meta_id_map = {dict: 367983} {0: 1572, 115840: 1, 282448: 1, 198250: 1, 4275: 1, 260890: 1, 260584: 1, 110331: 1, 116224: 1, 2704: 1, 298259: 1, 47792: 1, 186701: 1, 121548: 1, 147230: 1, 238085: 1, 367828: 1, 270505: 1, 354813: 1, ...}
 mid_list_for_random = {list: 8898041} [4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, ...]
 minlen = {NoneType} None
 n_cat = {int} 1601
 n_mid = {int} 367983
 n_uid = {int} 543060
 shuffle = {bool} False
 skip_empty = {bool} False
 sort_by_length = {bool} True
 source = {TextIOWrapper} <_io.TextIOWrapper name='local_train_splitByUser' mode='r' encoding='cp936'>
 source_buffer = {list: 0} []
 source_dicts = {list: 3}
  0 = {dict: 543060} {'ASEARD9XL1EWO': 449136, 'AZPJ9LUT0FEPY': 0, 'A2NRV79GKAU726': 16, 'A2GEQVDX2LL4V3': 266686, 'A3R04FKEYE19T6': 354817, 'A3VGDQOR56W6KZ': 4, ...}
  1 = {dict: 367983} {'1594483752': 47396, '0738700797': 159716, '1439110239': 193476, ...}
  2 = {dict: 1601} {'Residential': 1281, 'Poetry': 250, 'Winter Sports': 1390, ...}
3.2.2 Iterative Reading
When iteratively reading, the logic is as follows:
- If self.source_buffer has no data, read up to k lines from the file; this can be understood as filling the maximum buffer in one go. If sort_by_length is set, sort the buffer by the length of each user's history behavior.
- The internal iteration then pops records from self.source_buffer one at a time:
  - Map the user's historical movie IDs into mid_list;
  - Map the historical category IDs into cat_list;
  - For each pos_mid in mid_list, generate 5 negatively sampled historical behaviors, i.e. draw 5 IDs from mid_list_for_random (re-drawing whenever a drawn ID equals pos_mid); in other words, 5 negative samples are drawn for every historical behavior of the user;
  - Append [uid, mid, cat, mid_list, cat_list, noclk_mid_list, noclk_cat_list] to source as the training data;
  - Append [float(ss[0]), 1 - float(ss[0])] to target as the label;
  - Once batch_size records have been collected, break out of the internal iteration and return the batch, i.e. a list of at most 128 records.
See the specific code below:
    def __next__(self):
        if self.end_of_data:
            self.end_of_data = False
            self.reset()
            raise StopIteration

        source = []
        target = []

        # if self.source_buffer has no data, read up to k lines, i.e. fill the maximum buffer at once
        if len(self.source_buffer) == 0:
            #for k_ in xrange(self.k):
            for k_ in range(self.k):
                ss = self.source.readline()
                if ss == "":
                    break
                self.source_buffer.append(ss.strip("\n").split("\t"))

            # sort by history behavior length, if enabled
            # (the sequence fields are joined with the non-printable "\x02" separator)
            if self.sort_by_length:
                his_length = numpy.array([len(s[4].split("\x02")) for s in self.source_buffer])
                tidx = his_length.argsort()
                _sbuf = [self.source_buffer[i] for i in tidx]
                self.source_buffer = _sbuf
            else:
                self.source_buffer.reverse()

        if len(self.source_buffer) == 0:
            self.end_of_data = False
            self.reset()
            raise StopIteration

        try:
            # actual work here: the internal iteration begins
            while True:
                # read from source file and map to word index
                try:
                    ss = self.source_buffer.pop()
                except IndexError:
                    break

                uid = self.source_dicts[0][ss[1]] if ss[1] in self.source_dicts[0] else 0
                mid = self.source_dicts[1][ss[2]] if ss[2] in self.source_dicts[1] else 0
                cat = self.source_dicts[2][ss[3]] if ss[3] in self.source_dicts[2] else 0

                # map the user's historical movie IDs into mid_list
                tmp = []
                for fea in ss[4].split("\x02"):
                    m = self.source_dicts[1][fea] if fea in self.source_dicts[1] else 0
                    tmp.append(m)
                mid_list = tmp

                # map the user's historical category IDs into cat_list
                tmp1 = []
                for fea in ss[5].split("\x02"):
                    c = self.source_dicts[2][fea] if fea in self.source_dicts[2] else 0
                    tmp1.append(c)
                cat_list = tmp1

                # read from source file and map to word index
                #if len(mid_list) > self.maxlen:
                #    continue
                if self.minlen != None:
                    if len(mid_list) <= self.minlen:
                        continue
                if self.skip_empty and (not mid_list):
                    continue

                # for each pos_mid in mid_list, build 5 negatively sampled behaviors:
                # draw 5 ids from mid_list_for_random, re-drawing when a drawn id equals pos_mid
                noclk_mid_list = []
                noclk_cat_list = []
                for pos_mid in mid_list:
                    noclk_tmp_mid = []
                    noclk_tmp_cat = []
                    noclk_index = 0
                    while True:
                        noclk_mid_indx = random.randint(0, len(self.mid_list_for_random) - 1)
                        noclk_mid = self.mid_list_for_random[noclk_mid_indx]
                        if noclk_mid == pos_mid:
                            continue
                        noclk_tmp_mid.append(noclk_mid)
                        noclk_tmp_cat.append(self.meta_id_map[noclk_mid])
                        noclk_index += 1
                        if noclk_index >= 5:
                            break
                    noclk_mid_list.append(noclk_tmp_mid)
                    noclk_cat_list.append(noclk_tmp_cat)

                source.append([uid, mid, cat, mid_list, cat_list, noclk_mid_list, noclk_cat_list])
                target.append([float(ss[0]), 1 - float(ss[0])])

                if len(source) >= self.batch_size or len(target) >= self.batch_size:
                    break
        except IOError:
            self.end_of_data = True

        # all sentence pairs in maxibatch filtered out because of length
        if len(source) == 0 or len(target) == 0:
            source, target = self.next()

        return source, target
3.2.3 Data processing
After a batch of data is obtained from the iterator, it needs further processing.
uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, return_neg=True)
This can be understood as regrouping the batch of data (say 128 records): for example, the 128 uids, mids, and cats, and the 128 history sequences are each aggregated separately and finally fed to the model for training.
The important point here is the generation of the mask. Its meaning is as follows.
A mask hides certain values so that they have no effect when parameters are updated; the padding mask used here is one kind of mask.
- What is a padding mask? The input sequences differ in length from batch to batch, so they have to be aligned: short sequences are padded with zeros, and overly long sequences are truncated, with the excess discarded. Since the padded positions carry no meaning, the attention mechanism should not attend to them, and they need special handling.
- To achieve this, a very large negative number (effectively negative infinity) is added to the scores at those positions, so that after softmax their probabilities approach 0. The padding mask is essentially a tensor of booleans marking which positions need this treatment. A minimal sketch follows.
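A minimal numpy sketch of this trick (illustrative only, not the repository's attention code):
import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])        # raw attention scores; the sequence is padded to length 5
mask = np.array([1, 1, 1, 0, 0], dtype=np.float32)  # real length is 3, so the last two positions are padding

masked_scores = np.where(mask > 0, scores, -2.0 ** 32)   # padded positions get a huge negative score
weights = np.exp(masked_scores - masked_scores.max())
weights /= weights.sum()
print(weights)   # the weights at the padded positions are ~0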
In DIN's case, the user behavior sequences within a batch generally have different lengths; the real length of each sequence is stored in keys_length, so a mask is generated to select the actual historical behaviors:
- First, the mask is initialized to 0;
- Then, every position that holds real data is set to 1.
The specific code is as follows:
def prepare_data(input, target, maxlen = None, return_neg = False):
    # x: a list of sentences
    # s[4] is a history list; each input record has a history of different length
    lengths_x = [len(s[4]) for s in input]
    seqs_mid = [inp[3] for inp in input]
    seqs_cat = [inp[4] for inp in input]
    noclk_seqs_mid = [inp[5] for inp in input]
    noclk_seqs_cat = [inp[6] for inp in input]

    if maxlen is not None:
        new_seqs_mid = []
        new_seqs_cat = []
        new_noclk_seqs_mid = []
        new_noclk_seqs_cat = []
        new_lengths_x = []
        for l_x, inp in zip(lengths_x, input):
            if l_x > maxlen:
                new_seqs_mid.append(inp[3][l_x - maxlen:])
                new_seqs_cat.append(inp[4][l_x - maxlen:])
                new_noclk_seqs_mid.append(inp[5][l_x - maxlen:])
                new_noclk_seqs_cat.append(inp[6][l_x - maxlen:])
                new_lengths_x.append(maxlen)
            else:
                new_seqs_mid.append(inp[3])
                new_seqs_cat.append(inp[4])
                new_noclk_seqs_mid.append(inp[5])
                new_noclk_seqs_cat.append(inp[6])
                new_lengths_x.append(l_x)
        lengths_x = new_lengths_x
        seqs_mid = new_seqs_mid
        seqs_cat = new_seqs_cat
        noclk_seqs_mid = new_noclk_seqs_mid
        noclk_seqs_cat = new_noclk_seqs_cat

    if len(lengths_x) < 1:
        return None, None, None, None

    # lengths_x holds the real length of each user's behavior sequence; maxlen_x is the largest one
    n_samples = len(seqs_mid)
    maxlen_x = numpy.max(lengths_x)    # the largest mid_list length, 583 in this example
    neg_samples = len(noclk_seqs_mid[0][0])

    # because the history lengths vary, the mid_his matrix fixes the sequence length to maxlen_x;
    # sequences shorter than maxlen_x are padded with 0 (note that mid_his and the other matrices are zero-initialized)
    mid_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')                      # shape (128, 583)
    cat_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')
    noclk_mid_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')   # shape (128, 583, 5)
    noclk_cat_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')   # shape (128, 583, 5)
    mid_mask = numpy.zeros((n_samples, maxlen_x)).astype('float32')

    # zip packs the corresponding elements of the iterables into tuples
    for idx, [s_x, s_y, no_sx, no_sy] in enumerate(zip(seqs_mid, seqs_cat, noclk_seqs_mid, noclk_seqs_cat)):
        mid_mask[idx, :lengths_x[idx]] = 1.
        mid_his[idx, :lengths_x[idx]] = s_x
        cat_his[idx, :lengths_x[idx]] = s_y
        # noclk_mid_his and noclk_cat_his both have shape (128, 583, 5)
        noclk_mid_his[idx, :lengths_x[idx], :] = no_sx    # direct assignment
        noclk_cat_his[idx, :lengths_x[idx], :] = no_sy    # direct assignment

    uids = numpy.array([inp[0] for inp in input])
    mids = numpy.array([inp[1] for inp in input])
    cats = numpy.array([inp[2] for inp in input])

    # gather uid, mid, cat from the input (a list of 128 records), aggregate them, and return
    if return_neg:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x), noclk_mid_his, noclk_cat_his
    else:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x)
3.2.4 Feeding the Model
Finally, the data is fed into model training, which is this step in train.py:
loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
0xEE Personal information
★★★★★ Thoughts on life and technology ★★★★★
Wechat official account: Rosie's Thoughts
If you want timely updates on my articles, or want to see the technical materials I recommend, please follow the account.
0xFF References
Deep Interest Network interpretation
Deep Interest Network (DIN)
DIN paper official implementation analysis
Ali DIN source code how to model user sequence (1) : Base scheme