A Hands-On Guide to Building a Great E-commerce Search Engine (1): Category Prediction
Introduction
E-commerce plays a pivotal role in our lives today, and the search engine is the main traffic entry point of an e-commerce system, so the search experience is crucial to the whole system. Optimizing search so that users can find what they want better and faster has become a required course for any mature e-commerce system.
The essence of search
What is a good e-commerce search engine?
In fact, the essence of an excellent e-commerce search engine is to understand user needs, help users quickly find the goods they want, and close the deal.
So how do search engines understand users’ intentions? There are three common methods:
1. Rule-based approach using dictionaries and templates
Manually categorize users' search terms; for example, when a user searches for iPhone, manually map the query to the "mobile phone" category. This approach is accurate and works very well for popular goods and special nouns, but as users and keywords (especially long-tail queries) grow more numerous and complex, manual effort can no longer keep up with the workload.
2. Statistical methods based on user behavior
Infer categories from user behavior. For example, when users search for apple, most click on mobile phones and a few click on fruit, so statistically the category ranking for apple is: mobile phone > fruit. This method relies on accumulated user behavior and, like method 1, struggles with complex queries. This is especially true in Chinese, where a query with scrambled word order can still be perfectly understandable to a human but will not match any recorded behavior.
3. Machine learning models that recognize user intent
With the development of artificial intelligence, Natural Language Processing (NLP) and Natural Language Understanding (NLU) have become key components of the AI era. Here, using machine learning and deep learning, we train on a labeled corpus from the domain to obtain an intent recognition model. Given new input, the model quickly predicts the corresponding category along with a confidence score. One advantage of this approach is that the model's accuracy improves as the corpus grows richer.
Today we’re going to focus on the third approach.
Search term processing steps
When an e-commerce search engine receives a user's search keywords, it generally performs the following processing steps:
1. Text normalization
Common operations are as follows:
(1) Remove stop words, such as special symbols and punctuation marks accidentally entered by users
(2) Unify case, e.g. NIKE / Nike, iphone XR / iPhone XR
(3) Convert equivalent variants of the same term, e.g. iphone / iPhone, Adidas / adidas
2. Text correction, such as iphoe => iPhone
3. Word segmentation (分词), e.g. "Men's sports hoodie Li Ning" => "men's / sports / hoodie / Li Ning"
4. Intent recognition / head-word recognition, e.g.
Query: "Men's sports hoodie Li Ning", segmentation result: "men's / sports / hoodie / Li Ning"
Recognition results:
Gender word: men's; category word: hoodie; modifier: sports; brand: Li Ning
5. Category prediction/text classification, e.g.
Men’s sportswear hoodie Li Ning => sportswear
Pajama woman autumn/winter => lingerie/home wear
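The normalization, correction, and recognition steps above can be sketched in a few lines of Python. This is a toy illustration only; the vocabularies and the edit-distance correction are placeholder assumptions of mine, not the production dictionaries or algorithms:

```python
import re

# Toy vocabularies -- placeholders, not real production dictionaries
KNOWN_WORDS = {"iphone", "nike", "adidas", "hoodie"}
BRAND_DICT = {"nike", "adidas", "li ning"}
CATEGORY_DICT = {"hoodie", "pants"}

def normalize(query: str) -> str:
    """Step 1: strip stray punctuation and unify case."""
    query = re.sub(r"[^\w\s]", " ", query)
    return re.sub(r"\s+", " ", query).strip().lower()

def correct(word: str, max_dist: int = 1) -> str:
    """Step 2: naive spelling correction by edit distance to known words."""
    def edit_distance(a, b):
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]
    best = min(KNOWN_WORDS, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

def tag(tokens):
    """Step 4: dictionary-based slot tagging (brand / category / other)."""
    slots = {"brand": [], "category": [], "other": []}
    for t in tokens:
        if t in BRAND_DICT:
            slots["brand"].append(t)
        elif t in CATEGORY_DICT:
            slots["category"].append(t)
        else:
            slots["other"].append(t)
    return slots
```

For instance, `correct("iphoe")` lands on `iphone` because its edit distance is 1, while everything else in the vocabulary is further away. Real systems replace each dictionary with a large curated lexicon.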
As one of the most classic application scenarios in NLP, text classification has accumulated many implementations, such as Facebook's open-source fastText, convolutional neural networks (CNN), and recurrent neural networks (RNN). Here we focus on text classification based on deep learning.
We won’t go into the details of the principles, but today we will mainly use FastText to practice category prediction.
Let’s do it!
FastText category prediction practice
System architecture
Data preparation
So let’s start collecting data.
As we all know, one of the major difficulties in machine learning is collecting a large number of labeled samples. Ensuring the samples are easy to process, updated in time, and comprehensive in coverage is not easy.
Since the main source of goods in our system is Taobao, I'll take Taobao goods as an example: the category labels use Taobao's first-level categories, and the text data uses Taobao product titles.
Something like this:
| Category | Title |
|---|---|
| Beauty care/body/essential oil | Genuine Skin Care Lancome Moisturizer Soothing Moisturizer 50ml Medium sample Super Moisturizer Gel Gel |
We need a word segmentation tool. There are many options, such as third-party services from Alibaba Cloud and Tencent Cloud, as well as open-source tools such as jieba (结巴) segmentation. Here we use jieba, via nodejieba, to build an HTTP word segmentation interface.
$ npm init
$ npm install nodejieba --save
$ vim fenci.js
fenci.js contains the following:
"use strict";
(function () {
  const queryString = require('querystring'),
        http = require('http'),
        nodejieba = require('nodejieba')
  const port = process.env.PORT || 8800
  const host = process.env.HOST || '0.0.0.0'
  const requestHandler = (request, response) => {
    let query = request.url.split('?')[1]
    let queryObj = queryString.parse(query)
    let result = nodejieba.cut(queryObj.text, true)
    let res = { rs: result }
    response.setHeader('Content-Type', 'application/json')
    response.end(JSON.stringify(res, null, 2))
  }
  const server = http.createServer(requestHandler)
  server.listen(port, host, (error) => {
    if (error) {
      console.error(error)
    }
    console.log(`server is listening on ${port}`)
  })
}).call(this)
Running with PM2:
$ pm2 start fenci.js
Verify the word segmentation interface (note that the parameter must be urlencoded):
$ curl 'http://127.0.0.1:8800/?text=%E4%BF%9D%E6%9A%96%E5%86%85%E8%A1%A3'
If the following output is displayed, the interface is working properly:
{"rs": ["warm", "underwear"]}
After segmentation, a product title looks something like this:
Authentic Skin Care Lancome Moisturizer Moisturizer 50 ml Medium sample Super Moisturizer Gel Gel
If segmentation is inaccurate, we can collect common e-commerce brand words and terms, save them in the user-defined dictionary (the user.dict.utf8 file), and restart the segmentation service.
Text sample library
The training sample table in MySQL is defined as follows:
CREATE TABLE `tb_text_train` (
  `item_id` bigint(20) NOT NULL COMMENT 'item ID',
  `title` varchar(255) DEFAULT NULL COMMENT 'original title',
  `level_one_category_name` varchar(255) DEFAULT NULL COMMENT 'first-level category name',
  `title_split` varchar(1000) DEFAULT NULL COMMENT 'segmented title',
  `done` tinyint(1) DEFAULT '0' COMMENT 'whether annotation is done, default 0',
  `updatetime` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  PRIMARY KEY (`item_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
In this way, a scheduled script can periodically process the product titles in the database (removing stop words, segmenting) and sync them to the sample table. You can also build an admin backend and annotate the samples manually there.
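The core of that scheduled script is a transformation from raw product records into rows for the tb_text_train table. A minimal sketch, with the segmenter injected as a callable so it is decoupled from the HTTP interface (the function and field names mirror the table above, but the helper itself is my own illustration):

```python
def build_sample_rows(items, segment):
    """Turn raw (item_id, title, category) records into rows for the
    tb_text_train sample table. `segment` is any str -> list[str]
    callable, e.g. a wrapper around the jieba HTTP interface."""
    rows = []
    for item_id, title, category in items:
        tokens = segment(title)
        rows.append({
            "item_id": item_id,
            "title": title,
            "level_one_category_name": category,
            "title_split": " ".join(tokens),
            "done": 0,  # awaiting manual annotation review
        })
    return rows
```

The returned dicts can then be bulk-inserted with any MySQL client; keeping the segmenter injected also makes the script easy to unit-test.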
Preparing the training text
After annotation, we can use a SQL statement to export the annotated data in the desired format as a data.txt file, which looks like this:
__label__ Traditional nourishing nutrients Fu Yi Zhi Wet cream Fu Yi Wet tea Zhi Wet tea
__label__ Men's autumn pants men's casual pants Korean fashion students loose leg pants versatile sports pants men's nine minutes pants
__label__ Residential furniture marble surface kung fu tea tea table simple modern Chinese style multi-functional tea making one table living room office tea
__label__ Dress accessories/belt/hat/scarf bow tie basic hat summer sun protection fisherman's hat go on a holiday at the seaside day series literary women's straw hat
__label__ Women's shoes autumn and winter leather warm bean shoes flat bottom ladies' shoes large size mother's shoes pregnant women's white nurse's cotton-padded shoes
__label__ Household use ultrasonic mosquito repellent household mosquito repellent intelligent electronic mosquito repellent indoor rodent repellent cockroach
__label__ Women's underwear/men's underwear/home wear men's boxer underwear pure cotton middle-aged and old dad boxers full cotton shorts loose old man fat increase pants
Format description, one line:
Format: one sample per line, consisting of __label__, the category name, and the processed title string.
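Producing lines in this format from labeled pairs is a one-line transformation each way. A small sketch (the helper names are my own, not from the article's codebase; spaces inside a category name are replaced with underscores so the label stays a single token):

```python
def to_fasttext_line(category: str, title_tokens) -> str:
    """Render one training sample in fastText's supervised format:
    __label__<category> followed by the space-separated title tokens."""
    label = "__label__" + category.replace(" ", "_")
    return label + " " + " ".join(title_tokens)

def parse_fasttext_line(line: str):
    """Inverse: split a line back into (category, tokens)."""
    label, *tokens = line.split()
    return label.removeprefix("__label__"), tokens
```

Having both directions makes it easy to round-trip the exported data.txt for sanity checks before training.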
Then we split the data set into two parts, 90% as the training set and 10% as the test set, with the following code:
import pandas as pd
import numpy as np

# path where the corpus data set is stored
data_path = "/usr/local/webdata/fastText/"
# read the corpus data set
train = pd.read_csv(data_path + "data.txt", header=0, sep='\r\n', engine='python')
ts = train.shape
# shuffle the rows
df = pd.DataFrame(train)
new_train = df.reindex(np.random.permutation(df.index))
# index at the 90% mark
indice_90_percent = int((ts[0] / 100.0) * 90)
# split into 2 files
new_train[:indice_90_percent].to_csv(data_path + 'train.txt', index=False)
new_train[indice_90_percent:].to_csv(data_path + 'test.txt', index=False)
Use fastText for training
1. Install fasttext:
Fasttext installation is very simple:
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
$ pip install .
2. Fasttext training:
Run the training using the command line:
$ ./fasttext supervised -input train.txt -output model -label __label__ -epoch 50 -wordNgrams 3 -dim 100 -lr 0.5 -loss hs
Or write a Python program to call the Fasttext module:
import fasttext
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=100, wordNgrams=3, dim=100, loss='hs')
After model training, model file model.bin and text vector file model.vec will be generated.
3. Validate the data set
Training looks done. Now let's check it against the test set and see whether this horse can actually run.
$ ./fasttext test model.bin test.txt
The following results occur:
N 1892209
P@1 0.982
R@1 0.982
Well, it looks good.
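For readers unfamiliar with the metrics: P@1 and R@1 are precision and recall at rank one. A toy computation of P@1 (my own illustration, not fastText code) looks like this:

```python
def precision_at_1(samples, predict):
    """P@1 over labeled samples: the fraction whose top prediction matches
    the true label. With exactly one true label per sample, as in our
    category data, R@1 equals P@1, which is why the two printed numbers
    are identical."""
    hits = sum(1 for text, label in samples if predict(text) == label)
    return hits / len(samples)
```

So the 0.982 above means 98.2% of the test titles got the correct category as the top prediction.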
Let's try it with an actual user search term such as "thermal underwear". After segmentation it becomes the tokens "thermal" and "underwear", which we submit to fastText for prediction like this:
$ echo 'thermal underwear' | ./fasttext predict model.bin -
Wait a moment and you get something like this:
__label__ Women's underwear/men's underwear/home wear
Bingo, the prediction was perfect!
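In a real service you rarely accept the top label blindly. A minimal post-processing step, with a confidence threshold you would tune on held-out data (the 0.5 default here is a placeholder assumption):

```python
def pick_category(predictions, min_score=0.5):
    """Given [(label, score), ...] from the classifier, return the top
    label if its score clears the threshold, else None, signalling the
    caller to fall back to an unfiltered search."""
    if not predictions:
        return None
    label, score = max(predictions, key=lambda p: p[1])
    return label if score >= min_score else None
```

Returning None on low confidence matters: filtering search results by a wrong category hurts far more than not filtering at all.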
The model.bin file generated by default is large, and you can use the quantize command to compress the model file:
$ ./fasttext quantize -output model
The resulting model.ftz file is dramatically smaller. In my case it went from 4.7G down to 616M:
4.7G model.bin
616M model.ftz
60K model.o
6.1G model.vec
Use it the same way as model.bin, for example:
$ ./fasttext test model.ftz test.txt
$ echo 'thermal underwear' | ./fasttext predict model.ftz -
4. Provide Web services
To make the prediction service easier to consume, let's wrap the prediction function in an HTTP interface as well, again using Node.js.
$ npm install fasttext.js --save
$ vim predict.js
Predict.js is as follows:
"use strict";
(function () {
  const queryString = require('querystring'),
        FastText = require('fasttext.js'),
        http = require('http')
  const port = process.env.PORT || 8801
  const host = process.env.HOST || '0.0.0.0'
  const fastText = new FastText({
    loadModel: '/usr/local/webdata/fastText/model.ftz'
  })
  const requestHandler = (request, response) => {
    let query = request.url.split('?')[1]
    let queryObj = queryString.parse(query)
    fastText.predict(queryObj.text)
      .then(labels => {
        let res = { predict: labels }
        response.setHeader('Content-Type', 'application/json')
        response.end(JSON.stringify(res, null, 2))
      })
      .catch(error => {
        console.error("predict error", error)
      })
  }
  const server = http.createServer(requestHandler)
  fastText.load()
    .then(done => {
      console.log("model loaded")
      server.listen(port, host, (error) => {
        if (error) {
          console.error(error)
        }
        console.log(`server is listening on ${port}`)
      })
    })
    .catch(error => {
      console.error("load error", error)
    })
}).call(this)
Again, we use PM2 to start the service:
$ pm2 start predict.js
Verify the effect (note: first run the text through the word segmentation interface above, then urlencode it):
$ curl 'http://127.0.0.1:8801/?text=%E4%BF%9D%E6%9A%96%20%E5%86%85%E8%A1%A3'
The result looks like this:
{"predict": [{"label": "Women's underwear/men's underwear/home wear", "score": "1.00005"}, {"label": "Children's clothing/baby clothes/parent-child outfit", "score": "0.0543005"}]}
The results look good: picking the label with the highest score basically meets our needs.
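End to end, a search request chains the two services: segment the query, then classify the segmented text. A sketch with the two calls injected as callables, so it runs without the HTTP servers (the function names here are my own, not from the article's codebase):

```python
def classify_query(query, segment, predict):
    """Chain segmentation and category prediction.
    `segment`: str -> list[str], e.g. a call to the jieba HTTP service.
    `predict`: str -> list of (label, score) pairs, e.g. a call to the
    fastText HTTP service. Returns the top label, or None if the
    classifier produced nothing."""
    tokens = segment(query)
    preds = predict(" ".join(tokens))
    return max(preds, key=lambda p: p[1])[0] if preds else None
```

In production the two callables would be thin wrappers around the :8800 and :8801 endpoints built above.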
Conclusion
With the simple steps above, we have successfully built a Taobao product category prediction service, including a word segmentation system and a classification/prediction system. Integrating these services into our search will make results more accurate and our users happier.
The code is available on my GitHub.
References
fastText
fasttext.js
jieba Chinese word segmentation