Elasticsearch is a very popular search engine. It tokenizes text in order to provide full-text search. In practice, the text we index often contains emoji, such as smiley faces and animals. How should we search for these emoji?
🏻 => 🏻, light skin tone, skin tone, type 1–2
🏼 => 🏼, medium-light skin tone, skin tone, type 3
🏽 => 🏽, medium skin tone, skin tone, type 4
🏾 => 🏾, medium-dark skin tone, skin tone, type 5
🏿 => 🏿, dark skin tone, skin tone, type 6
♪ => ♪, eighth, music, note
♭ => ♭, bemolle, flat, music, note
♯ => ♯, diese, diesis, music, note, sharp
😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
🤣 => 🤣, face, floor, laugh, rofl, rolling, rolling on the floor laughing, rotfl
😂 => 😂, face, face with tears of joy, joy, laugh, tear
🙂 => 🙂, face, slightly smiling face, smile
🙃 => 🙃, face, upside-down
😉 => 😉, face, wink, winking face
🐅 => 🐅, tiger
🐆 => 🐆, leopard
🐴 => 🐴, face, horse
🐎 => 🐎, equestrian, horse, racehorse, racing
🦄 => 🦄, face, unicorn
🦓 => 🦓, stripe, zebra
🦌 => 🦌, deer
The list above maps all kinds of emoji to descriptive keywords. With mappings like these, a search for "grin" can also find documents that contain the 😀 emoji. In this article, we show how to search for emoji.
Installation
If you haven’t installed Elasticsearch and Kibana yet, see the earlier article “Elastic: A Beginner’s Guide”. We also need the ICU analysis plugin; for details, see the earlier article “Elasticsearch: An Introduction to the ICU Analyzer”. Run the following command from the Elasticsearch root directory:
./bin/elasticsearch-plugin install analysis-icu
Once the plugin is installed, restart Elasticsearch so that it takes effect. To verify the installation, run:
./bin/elasticsearch-plugin list
The output of the two commands looks like this:
$ ./bin/elasticsearch-plugin install analysis-icu
-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%
-> Installed analysis-icu
$ ./bin/elasticsearch-plugin list
analysis-icu
After installing ICU Analyzer, we must restart Elasticsearch.
Search for Emoji
Let’s start with a simple experiment:
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "I live in 🇨🇳 and I'm 👩‍🚀"
}
The request above uses icu_tokenizer to tokenize “I live in 🇨🇳 and I’m 👩‍🚀”. The 👩‍🚀 emoji is special because it is a combination of the more classic 👩 and 🚀 emoji. The national flag of China is also special: it is a combination of 🇨 and 🇳. So this is not only about splitting Unicode code points correctly; the tokenizer really has to understand emoji.
The result of the above request is:
{
  "tokens" : [
    { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "live", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "in", "start_offset" : 7, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : """🇨🇳""", "start_offset" : 10, "end_offset" : 14, "type" : "<EMOJI>", "position" : 3 },
    { "token" : "and", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "I'm", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : """👩‍🚀""", "start_offset" : 24, "end_offset" : 29, "type" : "<EMOJI>", "position" : 6 }
  ]
}
Clearly, the emoji are tokenized correctly, each as a single token, so they can be searched.
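To see why the flag and the astronaut count as composite emoji, it can help to look at the code points behind them. The following Python sketch is only an illustration and is not part of the Elasticsearch setup; the helper names `code_points` and `utf16_units` are made up for this example. It lists the code points of each emoji and counts their UTF-16 code units, which should match the difference between end_offset and start_offset reported for the two emoji tokens (4 and 5).

```python
# Illustration only: inspect the code points behind two composite emoji.
# The helper names below are hypothetical and not part of Elasticsearch.

FLAG_CN = "\U0001F1E8\U0001F1F3"          # regional indicators C + N
ASTRONAUT = "\U0001F469\u200D\U0001F680"  # woman + zero-width joiner + rocket


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf16_units(text):
    """Count the UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for name, emoji in [("flag", FLAG_CN), ("astronaut", ASTRONAUT)]:
        print(name, code_points(emoji), utf16_units(emoji))
```

The flag consists of two code points and the astronaut of three, yet the tokenizer emits each of them as one token.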
In practice, we may want more than searching for the emoji characters themselves. For example, suppose we want to find the following document:
PUT emoji-capable/_doc/1
{
  "content": "I like 🐅"
}
The document above contains 🐅, a tiger. We want a search for "tiger" to find this document. How do we do that?
On GitHub, there is a project at github.com/jolicode/em… . One of its directories, github.com/jolicode/em… , is essentially a collection of synonym files. We download one of these files, github.com/jolicode/em… , into the local Elasticsearch installation directory:
config
├── analysis
│   └── cldr-emoji-annotation-synonyms-en.txt
└── ...
On my computer:
$ pwd
/Users/liuxg/elastic1/elasticsearch-7.11.0/config
$ tree -L 3
.
├── analysis
│   └── cldr-emoji-annotation-synonyms-en.txt
├── elasticsearch.keystore
├── elasticsearch.yml
├── jvm.options
└── ...
The file cldr-emoji-annotation-synonyms-en.txt shown above contains synonyms for common emoji. For example:
😀 => 😀, face, grin, grinning face
😃 => 😃, face, grinning face with big eyes, mouth, open, smile
😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
...
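Each line of the synonyms file has the form "emoji => emoji, keyword, keyword, ...". As a rough illustration of that format, the following Python sketch parses such lines and builds a lookup from keyword to emoji. It does not reproduce how Elasticsearch loads the file; `parse_synonym_line`, `build_keyword_index`, and `SAMPLE` are hypothetical names, and the two sample lines are taken from the entries for 😀 and 🐅.

```python
# Sketch: parse lines of the form "emoji => emoji, keyword, keyword".
# parse_synonym_line and build_keyword_index are hypothetical helpers,
# not part of Elasticsearch or of the synonyms project.


def parse_synonym_line(line):
    """Split one mapping line into (emoji, [replacement terms])."""
    left, _, right = line.partition("=>")
    terms = [term.strip() for term in right.split(",") if term.strip()]
    return left.strip(), terms


def build_keyword_index(lines):
    """Map each keyword to the set of emoji that list it."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # skip blanks, comments, and lines without a mapping
        emoji, terms = parse_synonym_line(line)
        for term in terms:
            if term != emoji:
                index.setdefault(term, set()).add(emoji)
    return index


# Two sample lines: grinning face (U+1F600) and tiger (U+1F405).
SAMPLE = [
    "\U0001F600 => \U0001F600, face, grin, grinning face",
    "\U0001F405 => \U0001F405, tiger",
]

if __name__ == "__main__":
    index = build_keyword_index(SAMPLE)
    for keyword in sorted(index):
        print(keyword, index[keyword])
```

Note that the emoji itself appears on the right-hand side of each mapping, so the original emoji token is kept alongside its keywords.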
With the synonyms file in place, we create the index as follows:
PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}
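If the index is created from a script rather than from the Kibana console, the same request body can be assembled programmatically. The following Python sketch only builds and prints the body; sending it to Elasticsearch is left out. The function name `build_index_body` and its parameters are assumptions made for this example, while the filter, analyzer, and field names are the ones used in the request above.

```python
import json


def build_index_body(synonyms_path, field="content"):
    """Build the settings and mappings for an emoji-aware text field.

    Hypothetical helper: synonyms_path is the path of the synonyms file
    as used in the request above, relative to the config directory.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Synonym filter that reads the emoji mappings from a file.
                    "english_emoji": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    # ICU tokenizer followed by the emoji synonym filter.
                    "english_with_emoji": {
                        "tokenizer": "icu_tokenizer",
                        "filter": ["english_emoji"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "english_with_emoji"}
            }
        },
    }


if __name__ == "__main__":
    body = build_index_body("analysis/cldr-emoji-annotation-synonyms-en.txt")
    print(json.dumps(body, indent=2))
```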
Above, we defined the english_with_emoji analyzer and assigned it to the content field. We can test it with the _analyze API:
GET emoji-capable/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "I like 🐅"
}
The command above returns:
{
  "tokens" : [
    { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "like", "start_offset" : 2, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : """🐅""", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 },
    { "token" : "tiger", "start_offset" : 7, "end_offset" : 9, "type" : "SYNONYM", "position" : 2 }
  ]
}
The output contains both the 🐅 token and the tiger token at the same position, so a search for either one will match this document. In the same way:
GET emoji-capable/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "😀 means happy"
}
It returns:
{
  "tokens" : [
    { "token" : """😀""", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "face", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "grin", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "grinning", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 },
    { "token" : "means", "start_offset" : 3, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "face", "start_offset" : 3, "end_offset" : 8, "type" : "SYNONYM", "position" : 1 },
    { "token" : "happy", "start_offset" : 9, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
This shows that a search for face, grin, or grinning will also return this document.
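The effect of the synonym filter can be mimicked with a small toy model. The Python sketch below is not how Elasticsearch works internally: it assumes plain whitespace tokenization in place of icu_tokenizer, holds only two synonym entries in a dictionary, and ignores positions and scoring. All names in it (`SYNONYMS`, `analyze`, `search`, `DOCS`) are hypothetical. It is meant only to show why a keyword and its emoji end up matching the same document.

```python
# Toy model of synonym expansion at analysis time. Assumptions: whitespace
# tokenization stands in for icu_tokenizer, and SYNONYMS holds two entries
# taken from the synonyms file. All names here are hypothetical.

SYNONYMS = {
    "\U0001F600": ["face", "grin", "grinning face"],  # grinning face emoji
    "\U0001F405": ["tiger"],                          # tiger emoji
}


def analyze(text):
    """Return the set of tokens produced for text, including synonyms."""
    tokens = set()
    for word in text.split():
        tokens.add(word)
        for synonym in SYNONYMS.get(word, []):
            # A multi-word synonym contributes one token per word.
            tokens.update(synonym.split())
    return tokens


def search(docs, query):
    """Return the ids of documents sharing at least one token with the query."""
    query_tokens = analyze(query)
    return sorted(
        doc_id for doc_id, text in docs.items() if analyze(text) & query_tokens
    )


DOCS = {1: "I like \U0001F405", 2: "\U0001F600 means happy"}

if __name__ == "__main__":
    for query in ["tiger", "\U0001F405", "grin", "face"]:
        print(query, search(DOCS, query))
```

In this model, both "tiger" and the tiger emoji find document 1, and both "grin" and the grinning face emoji find document 2.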
Now we index the following two documents:
PUT emoji-capable/_doc/1
{
  "content": "I like 🐅"
}

PUT emoji-capable/_doc/2
{
  "content": "😀 means happy"
}
We search the documents as follows:
GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "🐅"
    }
  }
}
Or:
GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "tiger"
    }
  }
}
Both searches return the first document:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.8514803,
    "hits" : [
      {
        "_index" : "emoji-capable",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.8514803,
        "_source" : { "content" : """I like 🐅""" }
      }
    ]
  }
}
Similarly, we can run the following search:
GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "😀"
    }
  }
}
Or:
GET emoji-capable/_search
{
  "query": {
    "match": {
      "content": "grin"
    }
  }
}
Both searches return the second document:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.8514803,
    "hits" : [
      {
        "_index" : "emoji-capable",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8514803,
        "_source" : { "content" : """😀 means happy""" }
      }
    ]
  }
}