In many cases, the built-in analyzers suffice for our business needs, but there are also cases where we need to customize an analyzer to meet specific requirements. We know that to implement full-text search, every text field is analyzed when a document is ingested into Elasticsearch. This is where analyzers come in. If you are not familiar with analysis, please refer to my previous article “Elasticsearch: Analyzer”.
Once a document has been imported into Elasticsearch, you can search its fields. Elasticsearch scores the relevance of each document using the default BM25 algorithm. See the article “Elasticsearch: Distributed Scoring” to learn more about how each document’s score is computed. This score affects the order in which search results are returned: the document with the highest score comes first, followed by the second highest, and so on. Although the default BM25 scoring satisfies most needs, in practice it sometimes falls short. For example, we may want an artist’s popularity ranking to influence the final score, or, for news, we may want recent articles ranked ahead of articles from many years ago. For these special needs, we have to customize the scoring algorithm.
In today’s article, I’ll show you how to implement a custom analyzer and custom relevance.
Installation
If you don’t already have your own Elasticsearch and Kibana installed, see my previous post “Elastic: A Beginner’s Guide”.
Custom analyzer
By default, Elasticsearch uses the standard analyzer to analyze input text if no custom settings are applied. For example:
POST _analyze {"text": "Helene Segara it's! < > # "}Copy the code
The string above is a French artist’s name. It looks untidy, and besides letters it contains some punctuation symbols. The result is as follows:
{" tokens ": [{" token" : "Helene," "start_offset" : 0, "end_offset" : 6, "type" : "< ALPHANUM >", "position" : 0}, {" token ":" segara ", "start_offset" : 7, "end_offset" : 13, "type" : "< ALPHANUM >", "position" : 1}, {" token ": "it's", "start_offset" : 14, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 } ] }Copy the code
The text above was analyzed using the standard analyzer. When we search, documents are matched against these tokens.
Let’s enter four documents:
POST content/_bulk
{"index":{"_id":"a1"}}
{"type":"ARTIST","artist_id":"a1","artist_name":"Sezen Aksu","ranking":10}
{"index":{"_id":"a2"}}
{"type":"ARTIST","artist_id":"a2","artist_name":"Selena Gomez","ranking":100}
{"index":{"_id":"a3"}}
{"type":"ARTIST","artist_id":"a3","artist_name":"Shakira","ranking":10}
{"index":{"_id":"a4"}}
{"type":"ARTIST","artist_id":"a4","artist_name":"Hélène Ségara","ranking":1000}
Above are the documents of a hypothetical music library. Each contains the artist’s ID, the artist’s name, and the artist’s ranking. Execute the command above to import the four documents into Elasticsearch.
Suppose we have a search screen on our phone that looks like this:
In the screen above, as soon as we type a letter such as ‘c’, a list of all artists whose names start with ‘c’ appears for us to choose from. In our case, we run the following command to search for artist_name values starting with ‘s’:
POST content/_search
{
"query": {
"multi_match": {
"query": "s",
"fields": [
"artist_name"
]
}
}
}
The above search will not return anything. This result is not unusual at all, because none of the documents produces a token that is exactly ‘s’.
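To see why, we can check which tokens the standard analyzer actually produced for one of the artist names:

POST _analyze
{
  "analyzer": "standard",
  "text": "Sezen Aksu"
}

This yields only the tokens sezen and aksu. The query text ‘s’ is itself analyzed into the single token s, and no document contains that token, so nothing matches.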
Let’s start with an example to show how to create a custom analyzer. First, we can strip some of the characters with the pattern_replace char_filter:
POST _analyze {"text": "Helene Segara it's! <>#", "char_filter": [ { "type": "pattern_replace", "pattern": "[^\\s\\p{L}\\p{N}]", "replacement": "" } ], "tokenizer": "standard" }Copy the code
The pattern is a regular expression; see the Java Regular Expressions documentation to learn more. It replaces every character that is not whitespace, a letter, or a digit with an empty string. The result of running the command above is:
{" tokens ": [{" token" : "Helene," "start_offset" : 0, "end_offset" : 6, "type" : "< ALPHANUM >", "position" : 0}, {" token ":" Segara ", "start_offset" : 7, "end_offset" : 13, "type" : "< ALPHANUM >", "position" : 1}, {" token ": "its", "start_offset" : 14, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 } ] }Copy the code
Above, we can see that the tokens still contain uppercase letters. We can use the following method to change all letters to lowercase:
POST _analyze {"text": "Helene Segara it's! <>#", "char_filter": [ { "type": "pattern_replace", "pattern": "[^\\s\\p{L}\\p{N}]", "replacement": "" } ], "tokenizer": "standard", "filter": [ "lowercase" ] }Copy the code
In the request above, we added the lowercase filter, which changes all letters to lowercase. The output now looks like this:
{" tokens ": [{" token" : "Helene," "start_offset" : 0, "end_offset" : 6, "type" : "< ALPHANUM >", "position" : 0}, {" token ":" segara ", "start_offset" : 7, "end_offset" : 13, "type" : "< ALPHANUM >", "position" : 1}, {" token ": "its", "start_offset" : 14, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 } ] }Copy the code
From the above we can see that all tokens are lowercase. We can also see some unusual letters, such as é, which do not exist in the English alphabet. We can handle them with the asciifolding filter:
POST _analyze {"text": "Helene Segara it's! <>#", "char_filter": [ { "type": "pattern_replace", "pattern": "[^\\s\\p{L}\\p{N}]", "replacement": "" } ], "tokenizer": "standard", "filter": [ "lowercase", "asciifolding" ] }Copy the code
Above, we added the asciifolding filter, so the output is:
{
"tokens" : [
{
"token" : "helene",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "segara",
"start_offset" : 7,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "its",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
From the output above, we can see that all letters are lowercase English letters. Next, we want to further split each token into searchable prefixes such as ‘h’, ‘he’, ‘hel’, ‘hele’. We need the edge_ngram filter for this. You can refer to my previous article “Elasticsearch: Ngrams, Edge Ngrams, and Shingles”.
POST _analyze {"text": "Helene Segara it's! <>#", "char_filter": [ { "type": "pattern_replace", "pattern": """[^\s\p{L}\p{N}]""", "replacement": "" } ], "tokenizer": "standard", "filter": [ "lowercase", "asciifolding", { "type": "edge_ngram", "min_gram": "1", "max_gram": "12" } ] }Copy the code
The command output is as follows:
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hele",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "helen",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
...
This means that when we type ‘h’, the document can be found; likewise when we type ‘he’.
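One caveat: the edge_ngram filter belongs only on the indexing side. If the same filter were applied to the query text, even a short query would be expanded into several prefix tokens, as a quick test shows:

POST _analyze
{
  "text": "se",
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding",
    {
      "type": "edge_ngram",
      "min_gram": "1",
      "max_gram": "12"
    }
  ]
}

The query ‘se’ would itself become the tokens s and se, matching far more loosely than intended. That is why the index we define next uses one analyzer for indexing and a different one for searching.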
It is now time to define our custom analyzer. We first delete the previous content index:
DELETE /content
We then use the following command to create the content index:
PUT content
{
"settings": {
"analysis": {
"filter": {
"front_ngram": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "12"
}
},
"analyzer": {
"i_prefix": {
"filter": [
"lowercase",
"asciifolding",
"front_ngram"
],
"tokenizer": "standard"
},
"q_prefix": {
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"properties": {
"type": {
"type": "keyword"
},
"artist_id": {
"type": "keyword"
},
"ranking": {
"type": "double"
},
"artist_name": {
"type": "text",
"analyzer": "standard",
"index_options": "offsets",
"fields": {
"prefix": {
"type": "text",
"term_vector": "with_positions_offsets",
"index_options": "docs",
"analyzer": "i_prefix",
"search_analyzer": "q_prefix"
}
},
"position_increment_gap": 100
}
}
}
}
Above, there are two sections: settings and mappings. The settings section defines two analyzers: i_prefix and q_prefix. i_prefix is the index-time analyzer, used when documents are imported, while q_prefix is the search-time analyzer, used on query text. If this is not clear, please refer to my previous article “Elasticsearch: Analyzer”. In mappings, artist_name is a multi-field: besides artist_name itself, which is searched normally, we add the artist_name.prefix sub-field. This sub-field uses the i_prefix analyzer at index time and the q_prefix analyzer on search text.
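We can verify the difference between the two analyzers directly against the new index:

GET content/_analyze
{
  "analyzer": "i_prefix",
  "text": "Ségara"
}

This returns the prefix tokens s, se, seg, sega, segar, and segara, while the same request with "analyzer": "q_prefix" returns only segara. We index every prefix but analyze the query as a single token, so a two-letter query like ‘se’ matches exactly one indexed token.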
Next, we re-import the four documents from before using the same bulk command:
POST content/_bulk
{"index":{"_id":"a1"}}
{"type":"ARTIST","artist_id":"a1","artist_name":"Sezen Aksu","ranking":10}
{"index":{"_id":"a2"}}
{"type":"ARTIST","artist_id":"a2","artist_name":"Selena Gomez","ranking":100}
{"index":{"_id":"a3"}}
{"type":"ARTIST","artist_id":"a3","artist_name":"Shakira","ranking":10}
{"index":{"_id":"a4"}}
{"type":"ARTIST","artist_id":"a4","artist_name":"Hélène Ségara","ranking":1000}
Once the data has been imported, we can search the documents using the following command:
POST content/_search
{
"query": {
"multi_match": {
"query": "s",
"fields": [
"artist_name.prefix"
]
}
}
}
We can see that all four documents are returned. We can also try the following search:
POST content/_search
{
"query": {
"multi_match": {
"query": "se",
"fields": [
"artist_name.prefix"
]
}
}
}
The command output is as follows:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 3, "base" : "eq"}, "max_score" : 0.3359957, "hits" : [{" _index ":" content ", "_type" : "_doc", "_id" : "a1", "_score" : 0.3359957, "_source" : {" type ":" ARTIST ", "artist_id" : "a1", "artist_name" : "Sezen Aksu", "ranking" : 10 } }, { "_index" : "content", "_type" : "_doc", "_id" : "a2", "_score" : 0.30920535, "_source" : {"type" : "ARTIST", "artist_id" : "a2", "artist_name" : "Selena Gomez", "ranking" : 100}}, {" _index ":" content ", "_type" : "_doc", "_id" : "a4", "_score" : 0.29735082, "_source" : {" type ": "ARTIST", "artist_id" : "a4", "artist_name" : "Helene Segara", "ranking" : 1000}}}}]Copy the code
From the results returned above, the names of artists starting with ‘se’ are correctly found and returned. But there are some downsides: Sezen Aksu has the highest score even though her ranking is only 10, whereas Hélène Ségara has the lowest score even though her ranking is very high. The scores of the returned results are obviously different from what we need. If you read the article “Elasticsearch: Distributed Scoring” carefully, you will see why: “Sezen” is a shorter string than “Ségara”, so its analyzed field is shorter, and BM25 rewards shorter fields. Next, let’s customize the relevance with our own algorithm.
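By the way, if you want to see exactly how each score was computed, you can add "explain": true to the search request:

POST content/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "se",
      "fields": [
        "artist_name.prefix"
      ]
    }
  }
}

Each hit then carries an _explanation tree showing the BM25 components, including the field-length norm that favors the shorter “Sezen Aksu”.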
Custom relevance
Relevance is simply the score computed when a search returns results: the higher the relevance, the higher the score, and the earlier the document appears in the results. We can customize it with function_score. To combine the query score and the ranking field effectively, we want a higher ranking value to have a bigger impact on the final score. We can write it like this:
POST content/_search
{
"from": 0,
"size": 10,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "se",
"fields": [
"artist_name.prefix"
]
}
},
"functions": [
{
"filter": {
"match_all": {
"boost": 1
}
},
"script_score": {
"script": {
"source": "Math.max(((!doc['ranking'].empty)? Math.log10(doc['ranking'].value) : 1), 1)",
"lang": "painless"
}
}
}
],
"boost": 1,
"boost_mode": "multiply",
"score_mode": "multiply"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
}
]
}
In the first part above:
"query": {
"multi_match": {
"query": "se",
"fields": [
"artist_name.prefix"
]
}
},
We search for documents containing tokens that begin with ‘se’. In the second part:
"functions": [
{
"filter": {
"match_all": {
"boost": 1
}
},
"script_score": {
"script": {
"source": "Math.max(((!doc['ranking'].empty)? Math.log10(doc['ranking'].value) : 1), 1)",
"lang": "painless"
}
}
}
],
We apply our own formula to the ranking field and compute a score. In the third part:
"boost": 1,
"boost_mode": "multiply",
"score_mode": "multiply"
We take the score we just computed and multiply it by the original query score to get the final score. With this algorithm, the higher the ranking value, the higher the final score, so to some extent the ranking affects the final ordering.
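Concretely, the Math.log10 call in the script gives our four documents these multipliers:

ranking 1000 -> Math.log10(1000) = 3
ranking  100 -> Math.log10(100)  = 2
ranking   10 -> Math.log10(10)   = 1

So Hélène Ségara’s query score is tripled while Sezen Aksu’s is unchanged, which is enough to reorder the results. The Math.max(..., 1) wrapper also guarantees the multiplier is never below 1, so a missing or small ranking can never push a document’s score down.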
After the above modifications, the final ranking is:
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 3, "base" : "eq"}, "max_score" : 0.9777223, "hits" : [{" _index ":" content ", "_type" : "_doc", "_id" : "a4", "_score" : 0.9777223, "_source" : {" type ":" ARTIST ", "artist_id" : "a4", "artist_name" : "Helene Segara", "ranking" : 1000}}, {" _index ":" content ", "_type" : "_doc", "_id" : "a2" and "_score" : 0.6778009, "_source" : {"type" : "ARTIST", "artist_id" : "a2", "artist_name" : "Selena ", "ranking" : 100}}, {" _index ":" content ", "_type" : "_doc", "_id" : "a1", "_score" : 0.36826363, "_source" : {" type ": "ARTIST", "artist_id" : "a1", "artist_name" : "Sezen Aksu", "ranking" : 10 } } ] } }Copy the code
This time, we see Hélène Ségara in first place.
In practical use, script computation slows down searches when the data volume is large. We can restrict the script to just the artists we care about with a filter:
POST /content/_search
{
"from": 0,
"size": 10,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "s",
"fields": [
"artist_name.prefix"
]
}
},
"functions": [
{
"filter": {
"terms": {
"artist_id": [
"a4",
"a3"
]
}
},
"script_score": {
"script": {
"source": "params.boosts.get(doc[params.artistIdFieldName].value)",
"lang": "painless",
"params": {
"artistIdFieldName": "artist_id",
"boosts": {
"a4": 5,
"a3": 2
}
}
}
}
}
],
"boost": 1,
"boost_mode": "multiply",
"score_mode": "multiply"
}
},
"sort": [
{
"_score": {
"order": "desc"
}
}
]
}
In the example above, the list of artist_id values would change for different users. With this modification, the script runs only on the filtered documents, saving script computations and thus speeding up the search.
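As a side note, if each artist in the filter only needs a fixed boost, the script can be dropped entirely in favor of function_score’s built-in weight function. Here is a minimal sketch of the two-artist case above, assuming the same index and documents:

POST content/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "s",
          "fields": [
            "artist_name.prefix"
          ]
        }
      },
      "functions": [
        {
          "filter": { "term": { "artist_id": "a4" } },
          "weight": 5
        },
        {
          "filter": { "term": { "artist_id": "a3" } },
          "weight": 2
        }
      ],
      "boost_mode": "multiply",
      "score_mode": "multiply"
    }
  }
}

This avoids script execution altogether, at the cost of one filter clause per distinct boost value.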