Abstract
This talk mainly covers how ES helps us with NLP tasks. For NLP-related work, ES's similarity algorithm alone is not enough to support user search, and semantics-based methods are needed on top of it. Still, ES offers many features that help us optimize the search experience.
On June 10, 2017, Wenjun Yang, machine learning engineer in Trend Micro's Consumer division, spoke at the Elastic Meetup in Nanjing about ElasticSearch-assisted intelligent customer service robots. IT Dajia Shuo (IT大咖说), as the exclusive video partner, publishes this with the authorization of the organizers and the speaker.
Word count: 1605 | Reading time: 4 minutes
t.cn/RQAEw96
Introduction: Dr. Cleaner / Dr.X series products
Our main products are Mac apps: Dr. Cleaner and the Dr.X series.
In many countries and regions, Dr. Cleaner ranks first among Mac cleaning apps, with close to one million daily active users.
A nice problem to have: customer service
Multi-language, cross-time-zone: our app is not well known in China; most of its current users are overseas, primarily in the United States, plus users from other countries and regions.
Headcount can't keep up: as the user base grows rapidly, the customer service team cannot scale with it.
Solution: Customer service robots
A customer service robot first handles product-related questions, then macOS/iOS technical questions. Non-English questions are translated into English through a translation API before we try to provide a solution.
Composition of the knowledge base
No matter how powerful its algorithms are, any intelligent customer service system will fail without enough knowledge to back it. So we crawled a large number of Mac-related websites and loaded them into our database.
All kinds of spiders
The StackExchange Apple sub-forum (open data source), Apple Discussions, Macworld, WikiHow…
Document search
When a user problem comes in, how do we find what we need in the document repository? We first tried using ES directly, but its matching is too far from semantics to work well.
WMD (Word Mover's Distance) also has an obvious disadvantage: its algorithmic complexity is high, so it is very slow to compute. WMD is not a silver bullet; even after WMD, some results can still be bad.
So our knowledge base first goes through an ES filter layer. The raw knowledge base is on the order of hundreds of thousands of documents; running WMD over all of it directly would be far too slow. ES guarantees, to some degree, that the literal match is not too bad: the candidates that survive are at least lexically close.
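As a minimal sketch of that two-stage pipeline, assuming an index named kb with a content field and a pre-trained Word2vec model (all names here are illustrative, not from the talk):

```python
# Stage 1: ES narrows hundreds of thousands of documents to a small,
# lexically close candidate set. Stage 2: WMD reranks only those few.
from elasticsearch import Elasticsearch
from gensim.models import KeyedVectors

es = Elasticsearch()
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical model file

def answer_candidates(question, index="kb", top_k=50):
    hits = es.search(index=index,
                     query={"match": {"content": question}},
                     size=top_k)["hits"]["hits"]
    q_tokens = question.lower().split()
    # wmdistance is gensim's Word Mover's Distance (needs the POT package);
    # lower distance = more similar, so sort ascending.
    return sorted(hits, key=lambda h: wv.wmdistance(
        q_tokens, h["_source"]["content"].lower().split()))
```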
Concrete ES operations
This is the initial mapping, and we optimize step by step on top of it.
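The original slide is not reproduced here; a minimal baseline mapping in that spirit might look like the following (index and field names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Plain text fields with the default analyzer: the simplest possible
# starting point, before any of the optimizations below.
es.indices.create(
    index="kb",
    mappings={
        "properties": {
            "title":   {"type": "text"},
            "content": {"type": "text"},
        }
    },
)
```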
Optimization: BM25 or TF-IDF
With BM25, term frequency saturates: once a word's frequency in a document passes a certain threshold, additional occurrences add very little to the score.
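This saturation falls out of the standard BM25 scoring formula (textbook form, not shown in the talk):

```latex
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t)\cdot
  \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

As the term frequency f(t,d) grows, the fraction approaches its cap of k1 + 1, so repeated occurrences quickly stop adding score; classic TF-IDF has no such cap.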
We ran an experiment with two versions of the mapping, one using BM25 and one using TF-IDF as the similarity. We randomly selected 100 questions, had ES return the top 10 answers from the knowledge base for each, and compared the results on both sides.
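Similarity is configured per field in the mapping; a sketch of the two variants ("classic" is Lucene's TF-IDF similarity, available in the ES versions of that era and since removed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
# Two indices that differ only in the similarity used to score "content".
for name, sim in [("kb_bm25", "BM25"), ("kb_tfidf", "classic")]:
    es.indices.create(
        index=name,
        mappings={"properties": {
            "content": {"type": "text", "similarity": sim}}},
    )
```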
We ran 10 rounds, with 100 results per round. As the figure above shows, the two algorithms' results overlap by about 91%.
In this experiment BM25's advantage was fairly clear, so we finally adopted BM25 as our similarity algorithm.
Optimization: spell check and error correction
Our solution: Term Suggester + Custom Analyzer
Use the Term Suggester
You can feed it a whole sentence directly, e.g. "How to replace Macbookk SSD?" (note the deliberate typo).
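A sketch of what that request looks like through the term suggester (index and field names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
resp = es.search(index="kb", suggest={
    "spellcheck": {
        "text": "How to replace Macbookk SSD?",
        "term": {"field": "content"},
    }
})
# Each input token gets its own entry with correction candidates;
# "Macbookk" should come back with "macbook"-like options.
for entry in resp["suggest"]["spellcheck"]:
    for opt in entry["options"]:
        print(entry["text"], "->", opt["text"], opt["score"])
```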
Tuning the Term Suggester
We set the minimum number of occurrences to 3 and changed "string_distance" to "jarowinkler". The default string distance is a customized variant of edit distance, which outputs integers and therefore ranks candidates coarsely; Jaro-Winkler produces finer-grained, fractional similarity scores.
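Building on the previous sketch, a tuned version (min_doc_freq is our reading of "minimum number of occurrences"):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
resp = es.search(index="kb", suggest={
    "spellcheck": {
        "text": "How to replace Macbookk SSD?",
        "term": {
            "field": "content",
            "min_doc_freq": 3,                  # suggestion must appear in >= 3 docs
            "string_distance": "jaro_winkler",  # fractional scores; spelled
                                                # "jarowinkler" in older ES versions
        },
    }
})
```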
Further improvements
Incorporate user behavior data: a large part of Google's ranking quality comes from user behavior data.
Consider context: look at the words before and after a term, and take the relationship between neighboring words into account.
Optimization: Input standardization
The solution
First, Gensim is used to generate candidate phrases, and then rules filter them down to more accurate candidates. Once we have a correct phrase, we can generate its common misspellings. Finally, user input is processed in real time, while the knowledge base stored in ES is processed in batch (see the sketch after the rules below).
The rules
The rule: keep candidates made of pure English letters, with no digits; these are mainly brand and version names.
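A minimal sketch of the phrase-generation plus rule-filter step, using Gensim's Phrases model (the corpus path and thresholds are assumptions):

```python
from gensim.models.phrases import Phrases, Phraser

# Hypothetical corpus: one sentence per line, whitespace-tokenizable.
sentences = [line.lower().split() for line in open("kb_corpus.txt")]
bigrams = Phraser(Phrases(sentences, min_count=5, threshold=10.0))

# Collect detected phrases (Gensim joins their words with "_").
candidates = {tok for sent in bigrams[sentences] for tok in sent if "_" in tok}

# Rule filter from the talk: pure English letters, no digits.
phrases = {p for p in candidates if all(w.isalpha() for w in p.split("_"))}
```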
POS Tagging + POS filtering
Why?
WMD is computationally expensive. If we remove unimportant words from the input, the amount of WMD computation goes down.
The same idea is often worded differently across our knowledge base; keeping only the important words lets differently worded expressions still match, which improves accuracy.
The solution
Our current solution uses Python NLTK for analysis and filtering: it tags the part of speech of each word, and the results are stored in ES.
We would prefer a one-stop solution where ES handles analysis, filtering, and storage, but that requires writing a custom ES POS plugin.
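A sketch of that NLTK step; the exact set of tags kept is an assumption (here: nouns, verbs, adjectives):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

KEEP_PREFIXES = ("NN", "VB", "JJ")  # noun, verb, adjective tag families

def content_words(text):
    """POS-tag the text and drop words whose tags are not kept."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith(KEEP_PREFIXES)]

# Fewer tokens in means less work for WMD downstream.
print(content_words("How do I replace the SSD in my MacBook?"))
```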
Advantages of the preferred scheme
Performance: a Java implementation is generally faster than pure Python, especially for CPU-bound work.
Simplicity: the logic does not have to be maintained in both ES and Python.
Space savings: NLTK model files are fairly large, and bundling them into multiple Docker images multiplies the memory and disk they occupy.
Optimization: Synonyms
Synonyms based on Word2vec
It is hard to define synonyms by hand, so we generate "synonyms" from Word2vec instead.
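A sketch of mining such "synonyms" as nearest neighbors in the vector space (model path, seed word, and cutoff are assumptions):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
# Words close to "macbook" in the embedding space, above a similarity cutoff.
synonyms = [w for w, sim in wv.most_similar("macbook", topn=10) if sim > 0.7]
print(synonyms)
```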
Query rewriting scheme
Concretely, our scheme applies the synonyms by rewriting the query.
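A sketch of such a rewrite: the original wording and a synonym-substituted variant are combined in a bool/should query, with the original boosted (boost values and field names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
resp = es.search(index="kb", query={
    "bool": {
        "should": [
            # Original user wording, weighted higher.
            {"match": {"content": {"query": "clean my macbook", "boost": 2.0}}},
            # Variant rewritten with a Word2vec "synonym".
            {"match": {"content": "clean my laptop"}},
        ]
    }
})
```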
Other optimizations
Machine-learning-based reranking: results are reordered by the model's predicted click probability.
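The talk does not detail the model; as a purely illustrative sketch, a classifier's predicted click probability could reorder candidates like this (features and data are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature row per (query, answer) pair; label = clicked.
X_train = np.array([[0.9, 1.0], [0.2, 0.0], [0.7, 1.0], [0.1, 0.0]])
y_train = np.array([1, 0, 1, 0])
model = LogisticRegression().fit(X_train, y_train)

candidates = ["answer A", "answer B", "answer C"]
X = np.array([[0.8, 1.0], [0.3, 0.0], [0.6, 1.0]])
probs = model.predict_proba(X)[:, 1]  # predicted P(click) per candidate
order = np.argsort(-probs)            # highest probability first
reranked = [candidates[i] for i in order]
```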
That's all for today's talk. Thank you!