Author: Yishui Hancheng, CSDN blog expert, personal research interests: machine learning, deep learning, NLP, CV
Blog: yishuihancheng.blog.csdn.net
Recommendation systems play a very important role in our daily lives. I believe most people who have worked on recommendation-related engineering projects have read at least part of the book “Recommendation System Practice”. I am one of those readers, and I personally think it is a good introduction to the field. The recommendation systems at many e-commerce platforms and big tech companies are complex and powerful, and most of them are large-scale computing systems built on deep learning. In my last article,
The schematic diagram of the overall architecture of the project is shown below:
The music data in this article comes from NetEase Cloud Music. The project consists of the following parts: crawling NetEase Cloud Music data, data preprocessing, text vectorization, dataset construction and deep learning model training, and song recommendation.
Each step is explained below. Since crawling NetEase Cloud Music data is not permitted, the crawler is not covered here.
1. Data preprocessing
This part is relatively simple. The code is as follows:
import jieba

def dataPre(one_line):
    '''Get rid of dirty, invalid data'''
    # Punctuation marks and symbols to strip (representative list; extend it to match your data)
    sigmod_list=[',','.','。','(',')','（','）','-','—','\n','“','”','*','#','《','》','、',
                 '[',']','【','】','/','…','!','！',':','：','@','~','「','」','?','？','"',
                 '_',';','；','&quot;','=','|',
                 '①','②','③','④','⑤','⑥','⑦','⑧','⑨','⑩','⑾','⑿','⒀','⒁','⒂',
                 '⑴','⑵','⑶','⑷','⑸','⑹']
    for one_sigmod in sigmod_list:
        one_line=one_line.replace(one_sigmod,' ')
    return one_line

def loadStopwords(path='stopwords.txt'):
    '''Read the stop word list, one word per line'''
    with open(path,encoding='utf-8') as f:
        return [one.strip() for one in f if one.strip()]

def seg(one_content,stopwords=()):
    '''one_content: a single song name; stopwords: list of stop words to drop'''
    segs=jieba.cut(one_content,cut_all=False)
    seg_set=set(segs)-set(stopwords)
    return list(seg_set)
This step removes dirty data and invalid characters from the crawled text, then segments the Chinese song names into words.
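To make the cleaning step concrete, here is a minimal sketch that needs neither jieba nor a stop-word file. It assumes the text can simply be split on whitespace, which is not true for real Chinese song names. The names `clean_tokens`, `PUNCTUATION` and `STOPWORDS` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical, tiny stand-ins for the project's symbol list and stopwords.txt.
PUNCTUATION = ",.!?()[]【】《》-*#@~"
STOPWORDS = {"the", "of", "a"}

def clean_tokens(line):
    """Replace punctuation with spaces, split on whitespace, drop stop words and duplicates."""
    pattern = "[" + re.escape(PUNCTUATION) + "]"
    cleaned = re.sub(pattern, " ", line)
    tokens = cleaned.split()
    kept = []
    for tok in tokens:
        # Stop words are matched case-insensitively; duplicates are matched exactly.
        if tok.lower() not in STOPWORDS and tok not in kept:
            kept.append(tok)
    return kept

print(clean_tokens("【Live】The Sound of Silence (Live) - the Remix!"))
```

Unlike the set-based de-duplication in `seg`, this sketch keeps the words in their original order, which makes the output easier to inspect.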
2. Text vectorization
This part uses the word2vec tool to train word vectors on the corpus produced in the previous stage, and then computes a weighted vector for each text. The code is as follows:
from gensim.models import word2vec

def word2vecModel(con_list,model_path='my.model'):
    '''1. sentences: can be a list; for a large corpus, BrownCorpus, Text8Corpus or LineSentence is recommended.
    2. sg: selects the training algorithm. The default 0 is CBOW; sg=1 uses skip-gram.
    3. size: dimension of the output word vectors, default 100. A larger size needs more training data but generally gives better results. Recommended values range from tens to hundreds.
    4. window: size of the training window, default 5. A value of 8 means up to 8 words before and 8 words after each word are considered (the implementation randomly samples an effective window no larger than this value).'''
    # `size` is the gensim < 4.0 parameter name; gensim 4.0 renamed it to `vector_size`.
    model=word2vec.Word2Vec(con_list,sg=1,size=100,window=5,min_count=1,
                            negative=3,sample=0.001,hs=1,workers=4)
    model.save(model_path)
    return con_list,model

def song2Vec(data='music/songName.txt',model_path='music/song2Vec.model'):
    '''Build word2vec model for song name segmentation'''
    with open(data) as f:
        data_list=[one.strip() for one in f.readlines() if one.strip()]
    data=[]
    for i in range(len(data_list)):
        musicId,content=data_list[i].split('#||')
        con_list=content.split('/')
        data.append(con_list)
    word2vecModel(data,model_path=model_path)
The code above trains the word vectors. The weighted vector of a single text is then computed as follows:
import numpy as np

def getDocVec(model,word_list=['很','good','soft shelled turtle'],w_list=[0.12,0.53,0.35]):
    '''Generate the vector of a single text (weighted sum of all its word vectors)'''
    vec=np.zeros(100,dtype='float32')  # must match the word2vec vector dimension
    for i in range(len(word_list)):
        # model.wv holds the trained word vectors
        vec+=model.wv[word_list[i]]*w_list[i]
    return vec
At this point, the vectorization of the text is completed.
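As a worked example of the weighted sum, the sketch below uses plain Python lists and made-up three-dimensional vectors instead of a trained word2vec model. The function name `weighted_doc_vector`, the toy vectors and the weights are all assumptions for illustration. The article does not say how the weights are obtained, so they are simply given here.

```python
def weighted_doc_vector(word_vectors, words, weights):
    """Sum each word's vector scaled by its weight; words without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    doc = [0.0] * dim
    for word, weight in zip(words, weights):
        vec = word_vectors.get(word)
        if vec is None:
            continue  # unlike a direct model lookup, unknown words do not raise an error here
        for i, value in enumerate(vec):
            doc[i] += weight * value
    return doc

# Toy 3-dimensional "word vectors" (hypothetical values).
toy_vectors = {"blue": [1.0, 0.0, 2.0], "sky": [0.0, 4.0, 2.0]}
doc = weighted_doc_vector(toy_vectors, ["blue", "sky", "unknown"], [0.5, 0.25, 0.25])
print(doc)  # [0.5, 1.0, 1.5]
```

Skipping out-of-vocabulary words is a design choice of this sketch. A direct lookup in a trained model would raise a `KeyError` for a word that was never seen in training.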
3. Dataset construction and deep learning model training
A recommendation system needs labeled samples so that the model can learn in the early stage, and song recommendation is no exception. For a neural network to recommend songs accurately, it must first be trained on a batch of labeled recommendation data.
The specific implementation of data set creation is as follows:
import json

def createVector(songVec='music/song2Vec.json',save_path='music/dataset.json'):
    '''Build sample set'''
    with open(songVec) as S:
        song_vector=json.load(S)
    with open('music/score.csv') as f:
        data_list=[one.strip().split(',') for one in f.readlines() if one.strip()]
    vector=[]
    for i in range(len(data_list)):
        one_list=[]
        userId,songId,rating,T=data_list[i]
        try:
            songV=song_vector[songId]
            one_list+=songV
            # The label is the rating divided by 20, truncated to an integer
            one_list.append(int(float(rating)/20))
            vector.append(one_list)
        except (KeyError,ValueError):
            # Skip songs without a vector and rows with an unparsable rating
            pass
    with open(save_path,'w') as f:
        f.write(json.dumps(vector))
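The row layout produced by the dataset step can be seen on a toy example. The sketch below works on in-memory data instead of files. The function name `build_samples`, the toy vectors and the score lines are hypothetical. It also assumes that ratings lie on a 0 to 100 scale and that the fourth column is a timestamp, neither of which the article states explicitly.

```python
def build_samples(score_lines, song_vectors):
    """Turn 'userId,songId,rating,timestamp' lines into [features..., label] rows."""
    samples = []
    for line in score_lines:
        user_id, song_id, rating, _timestamp = line.strip().split(",")
        features = song_vectors.get(song_id)
        if features is None:
            continue  # no vector for this song, so the record is skipped
        label = int(float(rating) / 20)  # assuming a 0-100 rating, this yields 0-5
        samples.append(list(features) + [label])
    return samples

# Hypothetical toy data: two known songs and one record for an unknown song.
toy_vectors = {"s1": [0.1, 0.2], "s2": [0.3, 0.4]}
toy_scores = ["u1,s1,85,1500000000", "u1,s9,40,1500000001", "u2,s2,100,1500000002"]
print(build_samples(toy_scores, toy_vectors))  # [[0.1, 0.2, 4], [0.3, 0.4, 5]]
```

Each row ends with the label, so a later loading step can split features and target by slicing off the last column.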
Once the music recommendation dataset has been created, a deep learning model can be built and trained. The model is built as follows:
import os
from keras.models import Sequential
from keras.layers import Dense,Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping,ModelCheckpoint

def deepModel(data='dataset.json',saveDir='model/'):
    '''Deep Learning Network Model'''
    if not os.path.exists(saveDir):
        os.makedirs(saveDir)
    # getVector is a helper not shown in this article; it returns the scaler and the train/test split
    scaler,X_train,X_test,y_train,y_test=getVector(data=data)
    model=Sequential()
    model.add(Dense(1024,input_dim=X_train.shape[1]))
    model.add(Dropout(0.3))
    model.add(Dense(1024,activation='linear'))
    model.add(Dropout(0.3))
    model.add(Dense(1024,activation='sigmoid'))
    model.add(Dropout(0.3))
    model.add(Dense(1,activation='tanh')) #softmax relu tanh
    # Written against the Keras 2.x API; newer releases name the first argument learning_rate
    optimizer=Adam(lr=0.002,beta_1=0.9,beta_2=0.999,epsilon=1e-08)
    model.compile(loss='mae',optimizer=optimizer)
    early_stopping=EarlyStopping(monitor='val_loss',patience=20)
    checkpointer=ModelCheckpoint(filepath=saveDir+'checkpointer.hdf5',verbose=1,save_best_only=True)
    history=model.fit(X_train,y_train,batch_size=128,epochs=50,validation_split=0.3,verbose=1,shuffle=True,
                      callbacks=[checkpointer,early_stopping]) #validation_data=(X_validation,y_validation)
    model.save(saveDir+'music.model')
    model.summary()
Building the neural network with Keras is quick and convenient. You can freely change the number of neurons or the depth of the network, and you can swap in any optimizer, activation function, or loss function you like. Different combinations give different results.
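One setting worth understanding before changing it is the `patience` value passed to `EarlyStopping` in the training code. The sketch below is a simplified, pure-Python imitation of that rule and not the Keras implementation itself. The function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    wait = 0  # epochs since the validation loss last improved
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
print(early_stop_epoch(losses, patience=3))   # 5
print(early_stop_epoch(losses, patience=20))  # None
```

With `patience=20` and only 50 epochs, as in the training code, early stopping triggers only if the validation loss stalls for a long stretch, so most runs will likely use nearly all epochs.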
Depending on the amount of data, training may take a very short or a very long time, so it is advisable to run the training on a server. Below is a screenshot of my training process:
4. Song recommendation
With all of the above in place, we reach the last step of the music recommendation system: the recommendation itself.
The specific implementation is as follows:
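import numpy as np
from keras.models import load_model

def singleUserRecommend(userId='2230728513',model_path='results/music/DL/DL.model'):
    '''Enter user ID, output recommended content'''
    # user_song, song, user_vector, song_vector and song_dict are lookup tables loaded elsewhere
    model=load_model(model_path)  # loading step implied by the model_path argument
    one_song_list=user_song[userId]
    no_listen_list=[one for one in song if one not in one_song_list]
    userVec=user_vector[userId]
    one_no_dict={}
    for one_no in no_listen_list:
        # Input features: user vector followed by the candidate song vector
        one=[]
        one+=userVec
        one+=song_vector[one_no]
        X=np.array([one])
        score=model.predict(X)
        y_pre=score.tolist()[0]
        one_no_dict[one_no]=y_pre
    # Sort candidates by predicted score, highest first, and keep the top 10
    one_no_sorted=sorted(one_no_dict.items(),key=lambda e:e[1],reverse=True)
    recommend_id_list=[one[0] for one in one_no_sorted][:10]
    for oneId in recommend_id_list:
        print('songId: ',oneId)
        print('songName: ',song_dict[oneId])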
The function above recommends, for a given user ID, the songs the user has not yet listened to that they are most likely to enjoy. Sample test output is shown below:
This user seems to prefer Japanese songs
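The ranking logic of the recommendation step can be isolated from the model. The sketch below replaces the neural network with an arbitrary scoring function. `recommend_top_n`, `score_fn` and the toy scores are hypothetical names and values used only to show the filter, score, sort and truncate sequence.

```python
def recommend_top_n(all_songs, listened, score_fn, n=10):
    """Score every unheard song and return the n highest-scoring (song_id, score) pairs."""
    listened = set(listened)  # set membership is faster than scanning a list
    candidates = [s for s in all_songs if s not in listened]
    scored = [(s, score_fn(s)) for s in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical predicted scores standing in for model.predict.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}
top = recommend_top_n(toy_scores.keys(), listened=["b"], score_fn=toy_scores.get, n=2)
print(top)  # [('d', 0.7), ('c', 0.5)]
```

With a real model, stacking all candidate feature rows into one array and calling `predict` once is usually much faster than predicting song by song.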
That concludes this article. I am glad to have reviewed this material and written something to share. If you found the content useful or enlightening, I would appreciate your encouragement and support. Thank you!