Part 4: In-depth search
1. Term-based and full-text search
1.1 Term-based queries
- Why terms matter
  - A term is the smallest unit that carries meaning. Both search and natural language processing with statistical language models operate on terms.
- Characteristics
  - Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
  - In ES, a term query does not analyze its input. The input is treated as a single term and looked up exactly in the inverted index, and the relevance formula is used to score every document that contains that term, for example "Apple Store".
  - Wrapping the query in Constant Score converts it to a filter, which skips scoring and lets ES use caching to improve performance.
1.2 Term query examples
1.2.1 Inserting data
# Term query examples: think about what each query below will return
POST /products/_bulk
{"index":{"_id":1}}
{"productID":"XHDK-A-1293-#fJ3","desc":"iPhone"}
{"index":{"_id":2}}
{"productID":"KDKE-B-9947-#kL5","desc":"iPad"}
{"index":{"_id":3}}
{"productID":"JODL-X-1937-#pV7","desc":"MBP"}

GET /products
1.2.2 Example 1
POST /products/_search
{
"query": {
"term": {
"desc": {
"value": "iPhone"
}
}
}
}
This returns nothing. Why? A term query does no processing of its input, so we are searching for "iPhone" exactly as typed, with the uppercase P. At index time, however, ES runs text fields through the default analyzer, which tokenizes the value and lowercases it. The inverted index therefore contains iphone, not iPhone, and nothing matches.
POST /products/_search
{
"query": {
"term": {
"desc": {
"value": "iphone"
}
}
}
}
With the lowercase term, the document is returned.
1.2.3 Example 2
POST /products/_search
{
"query": {
"term": {
"productID": {
"value": "XHDK-A-1293-#fJ3"
}
}
}
}
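This returns nothing either. Why? The term query looks up XHDK-A-1293-#fJ3 as a single unanalyzed term, but the productID text field was analyzed at index time, so the inverted index holds only the lowercase tokens. The _analyze API shows what was actually indexed: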
POST /_analyze
{
"analyzer": "standard",
"text": ["XHDK-A-1293-#fJ3"]
}
The result is:
{
"tokens" : [
{
"token" : "xhdk",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "1293",
"start_offset" : 7,
"end_offset" : 11,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "fj3",
"start_offset" : 13,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
A term query for one of those tokens does match:
POST /products/_search
{
"query": {
"term": {
"productID": {
"value": "xhdk"
}
}
}
}
The lowercase xhdk matches one of the tokens produced by analysis, so the document is found. How, then, do we match the whole value exactly?
Query the productID.keyword sub-field with the original value XHDK-A-1293-#fJ3:
POST /products/_search
{
"query": {
"term": {
"productID.keyword": {
"value": "XHDK-A-1293-#fJ3"
}
}
}
}
For an exact match, use the multi-field feature in ES: with dynamic mapping, a string field is indexed as text and gets a keyword sub-field by default. The keyword sub-field stores the original value without analysis, so it supports exact matching.
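The behavior in the three examples above can be reproduced with a toy model. The following is a minimal sketch, not how ES is implemented: `standard_like_analyze` is a simplified stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and all function names are hypothetical.

```python
import re

def standard_like_analyze(text):
    # Toy stand-in for the standard analyzer: split on anything that is
    # not a letter or digit, then lowercase each token.
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(docs, analyze):
    # Map each term produced by the analyzer to the set of document ids.
    index = {}
    for doc_id, value in docs.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_lookup(index, value):
    # A term query looks the input up as-is, with no analysis.
    return sorted(index.get(value, set()))

product_ids = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5", 3: "JODL-X-1937-#pV7"}

# Behaves like the analyzed productID text field.
text_index = build_inverted_index(product_ids, standard_like_analyze)
# Behaves like productID.keyword: the whole value is a single term.
keyword_index = build_inverted_index(product_ids, lambda value: [value])

print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))     # no match on the text field
print(term_lookup(text_index, "xhdk"))                 # matches document 1
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3"))  # matches document 1
```

The only difference between the two indexes is the analyzer applied at index time, which is exactly what the field mapping (text vs. keyword) controls.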
1.3 Compound query: converting to a filter with Constant Score
A term query still returns a score. What if we want to skip scoring?
- Wrap the query in a Constant Score filter to skip the TF-IDF calculation and avoid the overhead of relevance scoring
- Filters can make effective use of caching
1.4 Full-text queries
- Full-text queries: Match Query / Match Phrase Query / Query String Query
- Characteristics
  - Text is analyzed both at index time and at search time: the query string is passed to the appropriate analyzer, which produces the list of terms to search for
  - At query time the input is first split into terms, each term is looked up individually, and the results are merged, with a score computed for each document. For example, searching for "Matrix reloaded" returns every document that contains Matrix or reloaded
1.5 The Match Query process
1.6 Summary
- Term-based search vs. full-text search
- The field mapping controls whether a field is analyzed: Text vs. Keyword
- Query parameters control Precision & Recall
- Compound query: Constant Score
  - Even a Term query on a Keyword field computes a score
  - A query can be converted to a filter, which skips relevance scoring and improves performance
2. Structured search
2.1 Structured Data
2.2 Structured search in ES
- Structured data such as booleans, dates and times, and numbers have precise formats that we can operate on logically, for example comparing ranges of numbers or dates, or deciding which of two values is larger
- Structured text can be matched exactly or partially: Term query / Prefix query
- A structured search result is either yes or no: a document matches or it does not
- Depending on the scenario, you decide whether a structured search needs scoring
2.3 Examples
2.3.1 Inserting data
# Structured search: recreate the index and insert data
DELETE products
POST /products/_bulk
{"index":{"_id":1}}
{"price":10,"avaliable":true,"date":"2018-01-01","productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":2}}
{"price":20,"avaliable":true,"date":"2019-01-01","productID":"KDKE-B-9947-#kL5"}
{"index":{"_id":3}}
{"price":30,"avaliable":true,"productID":"XHDK-A-1293-#fJ3"}
{"index":{"_id":4}}
{"price":10,"avaliable":false,"productID":"XHDK-A-1293-#fJ3"}

GET products/_mapping
2.3.2 Term query on a boolean field (a score is computed)
POST /products/_search
{
  "profile": "true",
  "query": {
    "term": {
      "avaliable": true
    }
  }
}
2.3.3 Term query on a boolean field converted to a filter with constant_score (no scoring)
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}
2.3.4 Numeric range query
GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 20,
            "lte": 30
          }
        }
      }
    }
  }
}
2.3.4 Date range query
GET products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}
now-1y means the current time minus one year: now is the current time, and y is the unit for years. The supported units are:
Unit | Meaning |
---|---|
y | years |
M | months |
w | weeks |
d | days |
H/h | hours |
m | minutes |
s | seconds |
2.3.5 Exists query: find documents that contain a field
GET /products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}
To find documents that do not contain the field, put the same exists clause under must_not in a bool query.
2.3.6 Multi-value Field Query
POST /movies/_bulk
{"index":{"_id":1}}
{"title":"Father of the Bridge Part II","year":1995,"gener":"Comedy"}
{"index":{"_id":2}}
{"title":"Dave","year":1993,"gener":["Comedy","Romance"]}
2.3.6.1 On a multi-valued field, a term query means "contains", not "equals"
POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "gener.keyword": "Comedy"
        }
      }
    }
  }
}
Every document whose field contains "Comedy" is returned. What if we want an exact match on a multi-valued field? One solution is to add a genre_count field that stores the number of values, and then combine both conditions in a bool query, as the next subsection shows.
2.3.6.2 Exact matching on a multi-valued field with term queries (skip this on a first read if it is unclear)
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
GET /my_index/my_type/_search
{
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "tags" : "search" } },
{ "term" : { "tag_count" : 1 } }
]
}
}
}
}
}
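The "contains, not equals" semantics and the count-field workaround can be illustrated in a few lines. This is a toy sketch over plain Python dictionaries, not ES itself; `term_matches` is a hypothetical helper that mimics how a term query treats a multi-valued field.

```python
docs = [
    {"tags": ["search"], "tag_count": 1},
    {"tags": ["search", "open_source"], "tag_count": 2},
]

def term_matches(doc, field, value):
    # A term query on a multi-valued field succeeds if ANY stored value
    # equals the input, so it behaves like "contains".
    stored = doc[field]
    values = stored if isinstance(stored, list) else [stored]
    return value in values

# Term query alone: both documents contain "search".
contains = [d for d in docs if term_matches(d, "tags", "search")]

# bool/must with the count field: only the document whose tags are
# exactly ["search"] remains.
exact = [
    d for d in docs
    if term_matches(d, "tags", "search") and term_matches(d, "tag_count", 1)
]

print(len(contains), len(exact))
```

The count field has to be maintained by the application at index time; ES does not derive it automatically in this approach.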
2.4 Summary
- Structured data & structured search
  - If scoring is not needed, use Constant Score to convert the query to a filter
- Range queries and Date Math
- Use the Exists query to handle non-null and NULL values
- Exact values & exact matching on multi-valued fields
  - Term queries mean "contains", not "equals", so take extra care when querying multi-valued fields
3. Search relevance scoring
3.1 Relevance and relevance scoring
3.2 Term frequency (TF)
3.3 Inverse document frequency (IDF)
3.4 The concept of TF-IDF
3.5 The TF-IDF scoring formula in Lucene
3.6 BM25
3.7 Custom similarity
3.8 Viewing TF-IDF through the Explain API
3.9 Boosting relevance
3.10 Summary
- What relevance is, and an introduction to relevance scoring
- TF-IDF / BM25
- Customizing the parameters of the relevance algorithm in Elasticsearch
- ES lets you set boosting parameters on indexes and on fields
4. Query & Filtering and multi-string multi-field queries
4.1 Query Context & Filter Context
Many systems support searching across multiple fields, and search engines usually also offer filters such as date and price. ES supports this as well, and this section introduces these advanced queries.
- Advanced search in ES: supports multiple text inputs searched against multiple fields
- In ES there are two different contexts, Query and Filter (contexts are covered later):
  - Query Context: the search results are scored for relevance
  - Filter Context: the results are not scored, so caching can be used for better performance
4.2 Combining conditions
Suppose we need to run the following query:
- Find movies whose reviews mention Guitar, with a user rating above 3 and a release date between 1993 and 2000.
This search contains three pieces of logic, each against a different field: the review contains Guitar, the user rating is greater than 3, and the release date falls within a given range. All three must hold, and the query should still perform well. How do we do this?
This calls for a compound query in ES: the bool query.
4.3 Bool query
- A bool query is a combination of one or more query clauses
  - There are four clause types in total: two affect the score and two do not
- Relevance is not only for full-text search. It also applies to yes/no clauses: the more clauses a document matches, the higher its relevance score. When multiple query clauses are merged into one compound query such as a bool query, the score computed for each clause is combined into the overall relevance score.
Clause | Description |
---|---|
must | Must match. Contributes to the score |
should | Optional match. Contributes to the score |
must_not | Filter Context. Must not match |
filter | Filter Context. Must match, but does not contribute to the score |
4.3.1 bool query syntax
- Sub-queries in a bool query can appear in any order
- Multiple queries can be nested
- If a bool query has no must (or filter) clause, at least one of the should clauses must match
With this in hand, it is easy to go back and reread section 2.3.6.2.
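As an illustration of the syntax, the scenario from section 4.2 could be expressed roughly as follows. This is a sketch only: the field names `review`, `rating`, and `release_date` and the exact date bounds are assumptions, since the source does not show the mapping. The request body is built as a Python dictionary and printed as the JSON you would send to `_search`.

```python
import json

# Hypothetical field names; adjust to the real mapping.
bool_query = {
    "query": {
        "bool": {
            # Query Context: contributes to the relevance score.
            "must": [
                {"match": {"review": "Guitar"}}
            ],
            # Filter Context: must match, but is not scored and can be cached.
            "filter": [
                {"range": {"rating": {"gt": 3}}},
                {"range": {"release_date": {"gte": "1993-01-01", "lte": "2000-12-31"}}}
            ]
        }
    }
}

print(json.dumps(bool_query, indent=2))
```

Putting the two range conditions under `filter` rather than `must` keeps them out of the score, which matches the performance goal stated in section 4.2.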
4.3.2 Nested bool queries
- Nesting gives us the effect of should_not: ES has no should_not clause, but nesting a bool query that contains must_not inside a should clause achieves the same logic.
4.3.3 The structure of a bool query affects the relevance score
- Clauses at the same level carry the same weight
- Nesting bool queries changes how much each clause contributes to the score
4.3.3.1 Controlling field Boosting
- Boosting is a way to control relevance
- Boosting can be applied to indexes, fields, or query sub-clauses
- Meaning of the boost parameter
  - When boost > 1, the relative weight of the score increases
  - When 0 < boost < 1, the relative weight of the score decreases
  - When boost < 0, the clause contributes a negative score (older versions only: ES 7.0 and later reject negative boosts)
- Insert data
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}
- Query 1 (the title field has the higher boost)
POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": {"query": "apple, ipad", "boost": 1.1}}},
        {"match": {"content": {"query": "apple, ipad", "boost": 1}}}
      ]
    }
  }
}
Because the title field has the higher boost, it carries more weight, so document 2 ranks first: its title contains "Apple iPad" twice.
- Query 2 (the content field has the higher boost)
POST blogs/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "apple, ipad",
"boost": 1
}
}
},
{
"match": {
"content": {
"query": "apple, ipad",
"boost": 2
}
}
}
]
}
}
}
4.4 Summary
- Query Context vs. Filter Context
- Bool Query: combining multiple conditions (similar to multiple conditions after WHERE in SQL)
- Query structure and relevance scoring
- How to control the precision of a query
- Boosting & Boosting Query (the example is not recorded here; it is available in the video)
5. Single-string multi-field queries: Dis Max Query
5.1 A single-string query example
Here we run a single-string multi-field query: the single string Brown fox is matched against several fields, in this example title and content.
# Insert data
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keppping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}

# Query
POST /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"content": "Brown fox"}}
      ]
    }
  }
}
Let's analyze the documents:
- title
  - Brown appears only in document 1
- content
  - Brown appears in document 1
  - Brown fox appears in full in document 2, in the same order as the query, so document 2 looks the most relevant
Strange: why does document 1 come first and score higher than document 2, when document 2 should be more relevant?
5.2 How a bool query scores its should clauses
- Run both queries in the should clause
- Add the scores of the two queries together
- Multiply by the number of matching clauses
- Divide by the total number of clauses
Analysis: both title and content in document 1 contain a term from the query, so both should sub-queries match. Document 2 contains the full query string, but only in content and not in title, so only one should sub-query matches. That is why document 1 scores higher than document 2.
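The four steps listed above can be written out as a small function. This is a sketch of the procedure exactly as described in this section, with made-up per-clause scores; the scoring actually performed by a given ES/Lucene version may differ in detail. `should_score` is a hypothetical name.

```python
def should_score(clause_scores):
    # clause_scores has one entry per should clause;
    # None means the clause did not match the document.
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    total = sum(matched)                        # add the matching scores
    return total * len(matched) / len(clause_scores)  # x matching / total

# Made-up clause scores for [title, content]:
doc1 = should_score([0.6, 0.4])    # a query term matches in both fields
doc2 = should_score([None, 1.2])   # the full string matches, but only in content

print(doc1 > doc2)  # document 1 wins despite the weaker content match
```

Document 2 has the strongest single clause, yet it loses because only one of the two clauses matched, which is the effect the analysis above describes.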
5.3 Disjunction Max Query
- In this example, title and content compete with each other
  - Rather than simply adding the scores up, we should use the score of the single best-matching field
- Disjunction Max Query
  - Any document that matches any of the queries is returned, and the score of the best-matching field is used as the final score
POST /blogs/_search
{
"query": {
"dis_max": {
"queries": [
{"match": {"title": "Quick fox"}},
{"match": {"content": "Quick fox"}}
]
}
}
}
Because a Disjunction Max Query uses only the best-matching field's score as the final score, two documents whose best fields match equally well end up with the same score. What happens in that case?
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Quick brown rabbits","content":"Brown rabbits are commonly seen"}
{"index":{"_id":2}}
{"title":"Keppping pets healthy","content":"My quick brown fox eats rabbits on a regular basis"}
POST /blogs/_search
{
"query": {
"dis_max": {
"queries": [
{"match": {"title": "Quick pets"}},
{"match": {"content": "Quick pets"}}
]
}
}
}
Neither document contains the exact string Quick pets, so their scores should be the same. Let's see.
5.3.1 The tie_breaker parameter
- tie_breaker is a floating-point number between 0 and 1. 0 means only the best-matching clause counts; 1 means all clauses are equally important
- With tie_breaker set, Disjunction Max Query computes the score as follows:
  - Take the _score of the best-matching clause
  - Multiply the scores of the other matching clauses by tie_breaker
  - Sum the results and normalize
POST /blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Quick pets"}},
        {"match": {"content": "Quick pets"}}
      ],
      "tie_breaker": 0.1
    }
  }
}
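The tie_breaker steps above amount to "best score plus tie_breaker times the sum of the other scores". Here is a minimal sketch with made-up clause scores; the normalization step is left out, and `dis_max_score` is a hypothetical name rather than an ES API.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    # clause_scores: scores of the clauses that matched the document.
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others

# Made-up scores: document 1 matches one clause, document 2 matches two.
doc1_clauses = [0.7]
doc2_clauses = [0.7, 0.5]

# Without a tie breaker only the best clause counts, so the documents tie.
print(dis_max_score(doc1_clauses) == dis_max_score(doc2_clauses))

# With tie_breaker=0.1 the extra matching clause lifts document 2.
print(dis_max_score(doc2_clauses, 0.1) > dis_max_score(doc1_clauses, 0.1))
```

Setting tie_breaker to 1 makes the function return the plain sum of all clause scores, which is why the text describes that value as "all clauses are equally important".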
6. Single-string multi-field queries: Multi Match
6.1 Three scenarios for single-string multi-field queries
- Best fields (Best Fields)
  - Fields compete with each other yet are related, for example title and body (see the previous section). The score comes from the best-matching field
- Most fields (Most Fields)
  - When handling English content, a common approach is to apply the English Analyzer to the main field, extracting stems and adding synonyms so that more documents match, and to add a sub-field that applies the Standard Analyzer to the same text for more precise matching. The other fields act as signals that raise a matching document's relevance: the more fields match, the better
- Cross fields (Cross Fields)
  - For certain entities such as names, addresses, or book information, the information is spread across multiple fields and each field holds only part of the whole. The goal is to find as many of the query words as possible in any of the listed fields
6.2 Multi Match Query syntax
- Best Fields is the default type and does not need to be specified
- Parameters such as minimum_should_match can be passed through to the generated query
6.3 A Multi Match most_fields example
6.3.1 Defining the index and inserting data
DELETE titles
PUT /titles
{
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "english"
}
}
}
}
POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}
6.3.2 Using a plain match query
GET titles/_search
{
"query": {
"match": {
"title": "barking dogs"
}
}
}
Reading the documents, the second is clearly more relevant, yet with a plain match query the first document ranks first. Why? The mapping uses the english analyzer, which stems barking and barks to bark, and dogs to dog, so both documents match both query terms equally. Because the first document is shorter, it scores higher. We need to tune for this case.
6.3.3 Redefining the mapping and inserting data
DELETE titles
PUT /titles
{
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "english",
"fields": {
"std": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
}
POST titles/_bulk
{"index":{"_id":1}}
{"title":"My dog barks"}
{"index":{"_id":2}}
{"title":"I see a lot of barking dogs on the road"}
- Looking at the mapping definition:
  - A sub-field std is added, of type text, using the standard analyzer
  - The english analyzer tokenizes and stems according to English grammar; the standard analyzer does no English-specific stemming, so it keeps the original word forms for precise matching
6.3.4 Querying with Multi Match
GET /titles/_search
{
"query": {
"multi_match": {
"query": "barking dogs",
"type": "most_fields",
"fields": ["title","title.std"]
}
}
}
6.3.5 Field weights in Multi Match
- Use the broad-matching field title to include as many documents as possible (improving recall), and use the title.std field as a signal that pushes the more relevant documents to the top of the results
- Each field's contribution to the final score can be controlled with a custom boost. For example, boosting title makes it more important, which also reduces the effect of the other signal fields
GET /titles/_search
{
"query": {
"multi_match": {
"query": "barking dogs",
"type": "most_fields",
"fields": ["title^10","title.std"]
}
}
}
6.4 A Multi Match cross_fields (cross-field search) example
- When we want to query across multiple fields, we might first reach for most_fields
- most_fields does meet the need to some extent, but it falls short in some cases. For example, when every query term must appear and the terms are spread over different fields, most_fields cannot express that, and adding "operator": "and" does not help either, because the operator is applied per field (I am not entirely clear on this point myself; the examples below make it more concrete). We could use copy_to (mentioned earlier), but it requires extra storage space
At this point we can use cross_fields.
6.4.1 Inserting data
POST address/_doc
{
  "street": "5 Poland Street",
  "city": "London",
  "country": "United Kingdom",
  "postcode": "W1V 3DG"
}
6.4.2 Querying with most_fields
POST address/_search
{
"query": {
"multi_match": {
"query": "Poland Street W1V",
"type": "most_fields",
"fields": ["street","city","country","postcode"]
}
}
}
This returns the document.
If we require every query term to match, we can try most_fields with "operator": "and":
POST address/_search
{
"query": {
"multi_match": {
"query": "Poland Street W1V",
"type": "most_fields",
"operator": "and",
"fields": ["street","city","country","postcode"]
}
}
}
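This returns nothing, because with most_fields the and operator requires all terms to appear in the same field, and no single field contains Poland, Street, and W1V. If we want every word in the query text to appear in the document and do not mind which field contains it, we use cross_fields with and: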
POST address/_search
{
"query": {
"multi_match": {
"query": "Poland Street W1V",
"type": "cross_fields",
"operator": "and",
"fields": ["street","city","country","postcode"]
}
}
}
6.5 Field-centric vs. term-centric queries
www.cnblogs.com/jiangtao121…
- best_fields
  - Suitable for querying the same text across multiple fields
  - The score is the highest score among the fields
  - tie_breaker (0 to 1) adds the scores of the lower-scoring fields into the final score
  - best_fields is interchangeable with a dis_max query; ES converts it to dis_max internally
  - operator (use with caution in this query) and minimum_should_match apply inside the sub-query for each field
For example:
"query": "complete conan doyle"
"fields": ["title", "author", "characters"]
"type": "best_fields"
"operator": "and"
is equivalent to:
(+title:complete +title:conan +title:doyle) | (+author:complete +author:conan +author:doyle) | (+characters:complete +characters:conan +characters:doyle)
- cross_fields
  - Use it when you expect every word in the query text to appear in the document, regardless of which field each word appears in
  - operator applies to the join between sub-queries
  - Use case: information split across fields, such as an address or a first and last name. In most cases operator is set to and
With cross_fields, the query above is equivalent to:
+(title:complete author:complete characters:complete) +(title:conan author:conan characters:conan) +(title:doyle author:doyle characters:doyle)
- most_fields
  - Useful for retrieving documents that contain the same text in several fields that are analyzed differently
  - In most cases operator is set to or; ES converts the query to a bool query internally
  - Use case: multi-language processing
7. Search Template and Index Alias
7.1 Search Template: decoupling the application from the search DSL
- Parameterize the query so that everyone can focus on their own job: application developers write the business logic while search engineers tune the DSL
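A search template could look roughly like the following. This is a sketch under assumptions: the template id `product_search`, the parameter `q`, and the field `desc` are hypothetical, and `render_locally` is only a naive local stand-in to show what parameter substitution produces (ES renders Mustache templates on the server).

```python
import json

# Body for: POST _scripts/product_search   (the template id is hypothetical)
stored_template = {
    "script": {
        "lang": "mustache",
        "source": {
            "query": {"match": {"desc": "{{q}}"}}
        }
    }
}

# Body for: POST products/_search/template
# The application only supplies parameters; the DSL lives in the template.
search_request = {"id": "product_search", "params": {"q": "iphone"}}

def render_locally(source, params):
    # Naive placeholder substitution for illustration only; it does not
    # escape values the way a real Mustache renderer would.
    text = json.dumps(source)
    for name, value in params.items():
        text = text.replace("{{" + name + "}}", str(value))
    return json.loads(text)

rendered = render_locally(stored_template["script"]["source"], search_request["params"])
print(json.dumps(rendered))
```

Because the application refers to the template by id, the DSL inside it can be tuned without redeploying the application, which is the decoupling described above.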
7.2 Index Alias: zero-downtime operations
- We can create aliases for indexes
- For example, we create a new index every day but want reads and writes to go through a single name; an alias provides exactly that
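Pointing an alias at a new daily index might be done as sketched below. The index and alias names are hypothetical, and `alias_swap_actions` is a helper invented for this illustration; it only builds the request body that would be sent to `POST _aliases`.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    # Body for: POST _aliases
    # Both actions go in one request, so ES applies them together and
    # readers of the alias never see a moment without a target index.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names for a daily index rotation.
body = alias_swap_actions("logs-today", "logs-2019-01-01", "logs-2019-01-02")
print(json.dumps(body, indent=2))
```

Applications keep querying `logs-today` and never need to know the name of the underlying index.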