Elasticsearch master (19)
In-depth search techniques: proximity matching with the slop parameter, how it works, and related experiments
slop
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "java spark",
        "slop": 1
      }
    }
  }
}
The meaning of slop
- slop is the number of position moves that the terms of the query string (the search text) need in order to match a document.
If the query string could match a document after its terms are moved a few positions, set slop to allow for that.
Take this doc:
hello world, java is very good, spark is also very good.
A plain match_phrase search for java spark cannot find it.
If we specify slop, the terms of java spark are allowed to move in order to try to match the doc:
steps | java | is | very | good | spark |
---|---|---|---|---|---|
0 | java | spark | | | |
1 | java | | spark | | |
2 | java | | | spark | |
3 | java | | | | spark |
The slop needed here is 3: starting from the phrase java spark, spark has to move three times before it lines up with the doc.
slop is not the exact number of moves, though. It is the maximum number of moves the query terms are allowed to make while trying to match a doc.
With slop set to 3, the match succeeds:
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 3
      }
    }
  }
}

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
The doc matches and is returned as a result.
If slop were set to 2 for the example sentence above (java is very good spark), spark could move at most twice. That is not enough to line up with the doc, so the doc would not be returned.
Experiments to verify slop
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "spark data",
        "slop": 3
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.21824157,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.21824157,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
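Here data has to move three times to line up with the doc, so slop 3 is enough:
steps | spark | is | best | big | data |
---|---|---|---|---|---|
0 | spark | data | | | |
1 | spark | | data | | |
2 | spark | | | data | |
3 | spark | | | | data |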
# Example with the term order reversed
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "data spark",
        "slop": 5
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.154366,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.154366,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
- Steps of the change
steps | spark | is | best | big | data |
---|---|---|---|---|---|
0 | data | spark | | | |
1 | data/spark | | | | |
2 | spark | data | | | |
3 | spark | | data | | |
4 | spark | | | data | |
5 | spark | | | | data |
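The move counting in the examples above can be modelled in a few lines of Python. This is only a simplified sketch, not Lucene's actual implementation: the helper names `tokenize` and `required_slop` are made up for illustration, the tokenizer only approximates the standard analyzer, and repeated query terms are not handled. The idea is to compare each term's position in the doc with its position in the phrase and measure how far those offsets spread apart.

```python
import re
from itertools import product


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumerics,
    # which roughly mimics the standard analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def required_slop(query, doc):
    """Smallest slop for which the phrase would match, or None if a term is missing."""
    q_terms = tokenize(query)
    d_terms = tokenize(doc)
    # All positions at which each query term occurs in the doc.
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Offset = position in the doc minus position inside the phrase.
        offsets = [doc_pos - phrase_pos for phrase_pos, doc_pos in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        if best is None or spread < best:
            best = spread
    return best


sentence = "hello world, java is very good, spark is also very good."
print(required_slop("java spark", sentence))                            # 3
print(required_slop("spark data", "spark is best big data solution"))  # 3
print(required_slop("data spark", "spark is best big data solution"))  # 5
```

A query matches when its slop is at least the value returned here, which is why the reversed phrase data spark needs slop 5 while spark data only needs 3.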
- In a slop search, the closer together the keywords are, the higher the relevance score
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java best",
        "slop": 15
      }
    }
  }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
In fact, a phrase match with slop is a proximity match
- java spark must appear in the doc as an exact phrase: phrase match
- java and spark may be some distance apart, but the closer together they are, the higher the doc scores: proximity match
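A rough sketch of why closer terms score higher in the java best experiment above. It assumes the per-match weight is 1 / (1 + distance), which is how Lucene's sloppy phrase frequency is commonly described; treat the formula as an assumption rather than the exact scoring code. The distances 2 (doc 2) and 13 (doc 5) were worked out by hand from the term positions in the two content fields, and `sloppy_weight` is a name invented for this example.

```python
def sloppy_weight(distance):
    # Assumed weighting: an exact phrase (distance 0) counts fully,
    # and the contribution shrinks as the terms drift apart.
    return 1.0 / (1.0 + distance)


# Hand-computed distances for the query "java best" (assumption: simple tokenization).
distances = {"doc 2": 2, "doc 5": 13}

for doc_id, distance in distances.items():
    print(doc_id, round(sloppy_weight(distance), 4))
```

The real _score also depends on term statistics and field length, so these weights will not equal the scores in the response. They only show the direction: doc 2, where the terms are closer, gets the larger weight.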
Elasticsearch master (20)
In-depth search techniques: mixing match and proximity match to balance recall and precision
Recall
- For example, if you search for java spark and there are 100 docs in total, recall is about how many of those docs are returned as results
Precision
- For example, if you search for java spark, precision is about whether the docs that contain both java and spark, with the two terms close together, rank as high as possible
Searching with match_phrase alone requires every term to be present in the doc's field, and within the slop limit.
Both match_phrase and proximity match require a doc to contain all the terms before it can be returned as a result. A doc that is missing even one term is not returned.
- java spark --> hello world java --> cannot be returned
- java spark --> hello world, java spark --> can be returned
So with proximity matching alone, precision is very high but recall is low.
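The recall gap can be seen on the two sample docs above with a small pure-Python sketch. It does not call Elasticsearch: `contains_any_term` stands in for a default match query (which matches when at least one term is present), and `contains_all_terms` stands in for the all-terms requirement that match_phrase imposes. Both helper names and the tokenizer are assumptions made for this illustration.

```python
import re


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumerics.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def contains_any_term(query, doc):
    # Stand-in for a default match query: one shared term is enough.
    return bool(tokenize(query) & tokenize(doc))


def contains_all_terms(query, doc):
    # Necessary condition for match_phrase: every query term must be present.
    return tokenize(query) <= tokenize(doc)


docs = ["hello world java", "hello world, java spark"]
any_hits = [d for d in docs if contains_any_term("java spark", d)]
all_hits = [d for d in docs if contains_all_terms("java spark", d)]

print(len(any_hits), len(all_hits))  # 2 1
```

The looser condition returns both docs, while the all-terms condition drops hello world java.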
Requirements:
Sometimes, however, we want a doc to count as a result when it matches only some of the terms, which improves recall. At the same time we still want match_phrase to raise the score according to distance, so that the closer together the terms are, the higher the doc scores and the earlier it is returned.
For java spark, a doc is returned if it contains java, if it contains spark, or if it contains both. For precision, docs that contain both java and spark come first, and among them those where java and spark are closer together.
A bool query that combines a match query with a match_phrase query achieves this effect.
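The bool combination described above can also be assembled programmatically. The sketch below only builds the request bodies as Python dicts and prints one as JSON; the function names `match_only` and `match_with_proximity_boost` are hypothetical helpers, not part of any Elasticsearch client.

```python
import json


def match_only(field, text):
    # must + match: a doc qualifies if it contains any of the terms (good recall).
    return {"query": {"bool": {"must": [{"match": {field: text}}]}}}


def match_with_proximity_boost(field, text, slop):
    # Same must clause, plus an optional should clause: docs whose terms
    # fall within `slop` moves of the phrase get extra score.
    body = match_only(field, text)
    body["query"]["bool"]["should"] = [
        {"match_phrase": {field: {"query": text, "slop": slop}}}
    ]
    return body


body = match_with_proximity_boost("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

The printed body can be sent to /waws/article/_search with any HTTP client. Because the match_phrase clause sits under should, it never removes a doc from the results; it only adds score to the docs it matches.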
- The example on the title field does not show the effect very clearly
GET /waws/article/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": {
            "query": "java spark"
          }
        }
      },
      "should": {
        "match_phrase": {
          "title": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "1",
        "_score": 0.26742277,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": ["java", "hadoop"],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article",
          "sub_title": "learning more courses",
          "author_first_name": "Peter",
          "author_last_name": "Smith",
          "new_author_last_name": "Smith",
          "new_author_first_name": "Peter"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.19856805,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "4",
        "_score": 0.155468,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": ["java", "elasticsearch"],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner",
          "sub_title": "both of them are good",
          "author_first_name": "Robbin",
          "author_last_name": "Li",
          "new_author_last_name": "Li",
          "new_author_first_name": "Robbin"
        }
      }
    ]
  }
}
Example on the content field
- In the first case below (match only), recall is high but precision is low: doc 2, which contains only java, ranks first
- In the second case below (match plus match_phrase with slop), the same two docs are returned, but doc 5, where java and spark appear together, now ranks first
GET /waws/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
- The second case
GET /waws/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}