Document data format

Document-oriented search and analysis engine

Elasticsearch uses JavaScript Object Notation (or JSON) as the serialization format for documents. JSON serialization is supported by most programming languages and has become the standard format in the NoSQL world. It’s simple, concise and easy to read.

Simple cluster management

Elasticsearch (ES) provides a set of APIs, called the cat APIs, for viewing a wide variety of information about the cluster.

GET /_cat/health?v

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1606447321 03:22:01  elasticsearch yellow 1          1         7      7   0    0    1        0
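Because the ?v output is a header row followed by whitespace-separated values, it is easy to turn into records. The following is a minimal Python sketch, assuming no column value contains a space. The helper name parse_cat_table and the trimmed sample text are illustrative and not part of any Elasticsearch client.

```python
from typing import Dict, List


def parse_cat_table(text: str) -> List[Dict[str, str]]:
    """Turn `_cat` output requested with ?v into one dict per data row."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # zip() pairs each header name with the value in the same column position.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Illustrative sample limited to the first columns of the health output.
SAMPLE = """
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign
1606447321 03:22:01  elasticsearch yellow 1          1         7      7   0    0    1
"""

rows = parse_cat_table(SAMPLE)
print(rows[0]["status"], rows[0]["unassign"])
```

Every value stays a string here; converting counters such as shards to integers is left to the caller.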

The status column has three possible values:

  • green: every primary shard and every replica shard of each index is active.
  • yellow: every primary shard of each index is active, but some replica shards are not active and are unavailable.
  • red: not all primary shards of the indexes are active, so some data is unavailable.

Why is the cluster yellow here? We are on a laptop with a single ES process running, which is a cluster of just one node. The only index in ES at this point is the one Kibana creates for itself. In versions before 7.0 the default is to allocate 5 primary shards and 5 replica shards (one replica per primary) to each index, and a primary shard and its replica are never placed on the same machine, for fault tolerance. Kibana's own index has 1 primary shard and 1 replica shard. With only one node, the primary shard is allocated and started, but the replica shard has no second machine to start on. As soon as a second ES process is started, the cluster has two nodes, the replica shard is allocated automatically, and the cluster status becomes green.
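The three status rules can be written down as a small function. This is a toy Python model of the rules as described here, not Elasticsearch's own implementation; the ShardCopy type and the ".kibana" index name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShardCopy:
    index: str
    primary: bool  # True for a primary shard, False for a replica
    active: bool   # True once the shard copy is allocated and started


def cluster_status(shards: List[ShardCopy]) -> str:
    """Apply the green / yellow / red rules to a list of shard copies."""
    if any(s.primary and not s.active for s in shards):
        return "red"      # at least one primary shard is not active
    if any(not s.primary and not s.active for s in shards):
        return "yellow"   # all primaries are active, some replica is not
    return "green"        # every primary and every replica is active


# One node: the replica has nowhere to go, so it stays inactive.
single_node = [ShardCopy(".kibana", True, True), ShardCopy(".kibana", False, False)]
# Two nodes: the replica can be allocated on the second node.
two_nodes = [ShardCopy(".kibana", True, True), ShardCopy(".kibana", False, True)]

print(cluster_status(single_node))
print(cluster_status(two_nodes))
```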

The cat command

/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/indices/{index}     # view information about the specified index in the cluster
/_cat/segments            # view the segment details of every index, including segment name, shard, and memory usage
/_cat/segments/{index}    # view the segment details of the specified index
/_cat/count               # view the document count of the current cluster
/_cat/count/{index}       # view the document count of the specified index
/_cat/recovery            # view the recovery progress of each shard in the cluster
/_cat/recovery/{index}
/_cat/health              # view the cluster health: red, yellow, or green
/_cat/pending_tasks       # view the pending tasks of the current cluster
/_cat/aliases             # view index alias information
/_cat/thread_pool
/_cat/fielddata/{fields}
...                       # output information about the existing templates

Simple CRUD operations

Create an index: PUT /test_index?pretty
List indices: GET /_cat/indices?v
Delete an index: DELETE /test_index?pretty

The v parameter adds column headers to the output; pretty pretty-prints the JSON response.

ES creates the index and type automatically when the first document is indexed, so neither has to be created in advance. By default, ES also builds an inverted index for every field of the document so that each field can be searched.

Add data:

PUT /phone/_doc/1
{
  "brand": "xiaomi",
  "price": 1999,
  "title": "xiaomi",
  "tag": ["5G dual mode", "Snapdragon 865", "120 Hz refresh rate", "120x telephoto lens", "120 W fast charging", "NFC"]
}

PUT /phone/_doc/2
{
  "brand": "vivo",
  "price": 1099,
  "title": "vivo Y30",
  "tag": ["smart beauty", "5G"]
}

PUT /phone/_doc/3
{
  "brand": "huawei",
  "price": 2999,
  "title": "HUAWEI P40 Pro",
  "tag": ["Kirin 990", "50x digital zoom"]
}

PUT /phone/_doc/4
{
  "brand": "Apple",
  "price": 3999,
  "title": "Apple iPhone 11",
  "tag": ["4G", "dual sim/dual standby"]
}

Modify the document (a partial update goes under the "doc" key):

POST /phone/_update/2
{
  "doc": { "price": 1199 }
}

Delete the document:

DELETE /phone/_doc/2
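The shape of the three write requests can be captured in a few helper functions. This is a Python sketch that only builds the method, path, and body, assuming the typeless Elasticsearch 7.x endpoints; it sends nothing, and the helper names are made up for the example.

```python
import json
from typing import Any, Dict, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[str]]


def index_doc(index: str, doc_id: int, doc: Dict[str, Any]) -> Request:
    """Create or fully replace a document."""
    return ("PUT", f"/{index}/_doc/{doc_id}", json.dumps(doc))


def update_doc(index: str, doc_id: int, changes: Dict[str, Any]) -> Request:
    """Partial update: the changed fields are wrapped in a "doc" object."""
    return ("POST", f"/{index}/_update/{doc_id}", json.dumps({"doc": changes}))


def delete_doc(index: str, doc_id: int) -> Request:
    """Delete a document; no request body is needed."""
    return ("DELETE", f"/{index}/_doc/{doc_id}", None)


for request in (
    index_doc("phone", 2, {"brand": "vivo", "price": 1099}),
    update_doc("phone", 2, {"price": 1199}),
    delete_doc("phone", 2),
):
    print(request)
```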

Elasticsearch Search syntax

Timeout:

(1) Setting: there is no timeout by default. If timeout is set, the timeout mechanism applies.
(2) Mechanism: suppose a query matches 10,000 documents and needs 10 seconds to complete, but the user sets a timeout of 1 second. ES stops the query after 1 second and returns whatever it has collected so far, however much that is.
(3) Usage: GET /_search?timeout=1s (the value takes a time unit such as ms, s, or m)
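The effect of the timeout can be imitated with a deterministic toy in Python. It assumes a fixed, made-up cost per document and is only meant to show the "return what has been collected so far" behaviour; the real engine does not check the clock per document like this. The function name is hypothetical.

```python
from typing import Any, Dict, List, Optional


def search_with_timeout(docs: List[Any], cost_ms_per_doc: int,
                        timeout_ms: Optional[int] = None) -> Dict[str, Any]:
    """Collect hits until the simulated time budget is used up."""
    hits: List[Any] = []
    elapsed_ms = 0
    for doc in docs:
        # Stop before collecting a document that would exceed the budget.
        if timeout_ms is not None and elapsed_ms + cost_ms_per_doc > timeout_ms:
            return {"timed_out": True, "hits": hits}
        hits.append(doc)
        elapsed_ms += cost_ms_per_doc
    return {"timed_out": False, "hits": hits}


docs = list(range(10_000))  # 10,000 documents at 1 ms each: 10 s in total
partial = search_with_timeout(docs, cost_ms_per_doc=1, timeout_ms=1_000)
full = search_with_timeout(docs, cost_ms_per_doc=1)
print(partial["timed_out"], len(partial["hits"]))
print(full["timed_out"], len(full["hits"]))
```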

1. Query string search

Query string search gets its name from the fact that the search parameters are attached to the query string of the HTTP request. It is handy for quick, ad hoc requests from a command-line tool such as curl, but it is rarely used in production environments.

Query all: GET /phone/_search
With a parameter: GET /phone/_search?q=brand:xiaomi
Paging and sorting: GET /phone/_search?from=0&size=2&sort=price:asc
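Since everything in this style of search lives in the URL, a request is just a path plus encoded parameters. A minimal Python sketch using only the standard library; search_url is a hypothetical helper, and keeping ":" unescaped is a readability choice for the example.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode


def search_url(index: str,
               params: Optional[Dict[str, Union[str, int]]] = None) -> str:
    """Build the path and query string of a query string search."""
    # safe=":" keeps expressions such as brand:xiaomi readable in the URL.
    query = urlencode(params or {}, safe=":")
    return f"/{index}/_search" + (f"?{query}" if query else "")


print(search_url("phone"))
print(search_url("phone", {"q": "brand:xiaomi"}))
print(search_url("phone", {"from": 0, "size": 2, "sort": "price:asc"}))
print(search_url("phone", {"timeout": "1s"}))
```

A dict is used for the parameters because from is a reserved word in Python and cannot be passed as a keyword argument.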

2. DSL (domain-specific language)

The query DSL puts the search in the HTTP request body as JSON, which is better suited to complex queries than a query string.

// match all documents
GET /phone/_search
{
  "query": { "match_all": {} }
}

// match on a field
GET /phone/_search
{
  "query": { "match": { "brand": "huawei" } }
}

// sort by price (ascending here; use "desc" for descending)
GET /phone/_search
{
  "query": { "match_all": {} },
  "sort": [ { "price": { "order": "asc" } } ]
}

// match against several fields
GET /phone/_search
{
  "query": {
    "multi_match": { "query": "apple", "fields": ["brand", "title"] }
  }
}

// _source: return only the specified fields
GET /phone/_search
{
  "query": { "match_all": {} },
  "_source": ["title", "price"]
}

// paging
GET /phone/_search
{
  "query": { "match_all": {} },
  "from": 2,
  "size": 4
}

// filter
GET /phone/_search
{
  "query": {
    "bool": {
      "must": [ { "match": { "brand": "apple" } } ],
      "filter": [ { "range": { "price": { "gte": 3000, "lte": 4000 } } } ]
    }
  }
}
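Because a DSL request body is plain JSON, client code usually builds it as a nested dictionary and serializes it. A small Python sketch of the brand-plus-price-range query; the function name is invented for the example, and nothing is sent to a server.

```python
import json
from typing import Any, Dict


def brand_in_price_range(brand: str, low: int, high: int) -> Dict[str, Any]:
    """Build a bool query: match on brand, then filter by a price range."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"brand": brand}}],
                "filter": [{"range": {"price": {"gte": low, "lte": high}}}],
            }
        }
    }


body = brand_in_price_range("apple", 3000, 4000)
# This string is what would go in the body of GET /phone/_search.
print(json.dumps(body, indent=2))
```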

3. Full-text search

GET /phone/_search
{
  "query": {
    "match": {
      "title": "11"
    }
  }
}

GET /phone/_search
{
  "query": {
    "match": {
      "tag": "5G"
    }
  }
}

4. Phrase search

Phrase search is the opposite of full-text search. Full-text search splits the search string into terms and matches each one against the inverted index; a document is returned as long as it matches any one of the terms. Phrase search requires the specified field to contain the search string exactly, with the same terms in the same order, before the document counts as a match and is returned. For example, the phrase "iphone 11" matches the title "Apple iPhone 11", while the phrase "apple 11" does not, because the two terms are not adjacent in the title.

GET /phone/_search
{
  "query": {
    "match_phrase": {
      "title": "apple 11"
    }
  }
}
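The difference between the two kinds of search can be shown with a toy model in Python. It assumes a very simple analyzer (lowercase, split on non-word characters) and is not how Elasticsearch is implemented; analyze, match, and match_phrase here are plain functions written for the illustration.

```python
import re
from typing import List


def analyze(text: str) -> List[str]:
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match(field_text: str, query: str) -> bool:
    """Full-text match: any one query term found in the field is enough."""
    doc_terms = set(analyze(field_text))
    return any(term in doc_terms for term in analyze(query))


def match_phrase(field_text: str, query: str) -> bool:
    """Phrase match: all query terms must appear adjacent and in order."""
    doc, phrase = analyze(field_text), analyze(query)
    if not phrase:
        return False
    last_start = len(doc) - len(phrase)
    return any(doc[i:i + len(phrase)] == phrase for i in range(last_start + 1))


title = "Apple iPhone 11"
print(match(title, "apple 11"))          # both terms occur somewhere
print(match_phrase(title, "apple 11"))   # "apple" and "11" are not adjacent
print(match_phrase(title, "iphone 11"))  # adjacent and in order
```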

5. Highlight search

GET /phone/_search
{
  "query": { "match": { "title": "apple" } },
  "highlight": {
    "fields": { "title": {} }
  }
}

Result:

{
  "took": 125,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 1.3940738,
    "hits": [
      {
        "_index": "phone",
        "_type": "_doc",
        "_id": "4",
        "_score": 1.3940738,
        "_source": {
          "brand": "apple",
          "price": 3999,
          "title": "Apple iPhone 11",
          "tag": ["4G", "dual sim/dual standby"]
        },
        "highlight": {
          "title": [ "<em>Apple</em> iPhone 11" ]
        }
      }
    ]
  }
}
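What the highlight section of the response contains can be imitated in a few lines of Python. This is a toy sketch that wraps every word of the field that equals a query term (ignoring case) in tags; it assumes whole-word matching only, and the highlight function is written for this example rather than taken from Elasticsearch.

```python
import re


def highlight(text: str, query: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap each word of `text` that matches a query term in pre/post tags."""
    terms = {term.lower() for term in re.findall(r"\w+", query)}

    def wrap(found: "re.Match[str]") -> str:
        word = found.group(0)
        return f"{pre}{word}{post}" if word.lower() in terms else word

    return re.sub(r"\w+", wrap, text)


print(highlight("Apple iPhone 11", "apple"))
```

The original casing of the field is preserved, which is why the result shows Apple even though the query was apple.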

Special topics

The difference between match and term queries

match

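  • A match query analyzes the query string and splits it into terms
  • match_phrase also analyzes the query string, but the terms must all appear in the field, adjacent and in the same order
  • multi_match runs a match query against multiple fields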

term

  • term means exact match: the query value is not analyzed
  • A field queried with term should be mapped as a non-analyzed type (such as keyword); otherwise the field value is split into terms at index time, and a term query for the full original string finds no data
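Why a term query can come back empty is easier to see with a toy index in Python. The sketch assumes a simple lowercase word-splitting analyzer for text fields and none for keyword fields; the three functions are illustrative stand-ins, not Elasticsearch APIs.

```python
import re
from typing import Set


def index_text(value: str) -> Set[str]:
    """Analyzed field: the stored terms are lowercase word tokens."""
    return set(re.findall(r"\w+", value.lower()))


def index_keyword(value: str) -> Set[str]:
    """Non-analyzed field: the whole value is stored as a single term."""
    return {value}


def term_query(indexed_terms: Set[str], term: str) -> bool:
    """Exact lookup: the query value is compared as-is, without analysis."""
    return term in indexed_terms


title = "Apple iPhone 11"
print(term_query(index_text(title), "Apple iPhone 11"))     # no such single term
print(term_query(index_text(title), "apple"))               # one of the stored tokens
print(term_query(index_keyword(title), "Apple iPhone 11"))  # exact value is stored
```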

Bool joint query

Query and filter: queries and filters

  • bool: combines multiple query conditions. The bool query takes a more-matches-is-better approach, so the scores of every matching must and should clause are added together to give the document's final score.

1) must: the clause (query) must appear in matching documents and contributes to the score.
2) filter: the clause (query) must appear in matching documents, but unlike must, its score is ignored. Filter clauses are executed in filter context, which means that scoring is skipped and the clause is considered for caching.
3) should: the clause (query) should appear in the matching document; this works like OR.
4) must_not: the clause (query) must not appear in matching documents. The clause is executed in filter context, which means that scoring is skipped and the clause is considered for caching. Because scoring is skipped, a score of 0 is returned for all documents.
5) minimum_should_match: the number or percentage of should clauses that a returned document must match. If the bool query contains at least one should clause and no must or filter clause, the default value is 1. Otherwise, the default value is 0.
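The matching rules of the four clause types, including the default for minimum_should_match, can be sketched as a small Python function. This is a toy that only decides whether a document matches: it ignores scoring, accepts minimum_should_match as an integer only, and uses plain predicates as stand-ins for real queries. The argument is named filter_ to avoid shadowing the Python built-in.

```python
from typing import Callable, Dict, Optional, Sequence

Doc = Dict[str, object]
Clause = Callable[[Doc], bool]


def bool_matches(doc: Doc,
                 must: Sequence[Clause] = (),
                 filter_: Sequence[Clause] = (),
                 should: Sequence[Clause] = (),
                 must_not: Sequence[Clause] = (),
                 minimum_should_match: Optional[int] = None) -> bool:
    """Decide whether `doc` matches a bool query built from predicates."""
    if minimum_should_match is None:
        # Default: 1 when there are only should clauses, otherwise 0.
        minimum_should_match = 1 if (should and not must and not filter_) else 0
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filter_):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


phone = {"brand": "apple", "price": 3999}
is_apple = lambda d: d["brand"] == "apple"
is_vivo = lambda d: d["brand"] == "vivo"
in_range = lambda d: 3000 <= d["price"] <= 4000

print(bool_matches(phone, must=[is_apple], filter_=[in_range]))
print(bool_matches(phone, should=[is_vivo]))                   # default is 1
print(bool_matches(phone, must=[is_apple], should=[is_vivo]))  # default is 0
```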

Deep paging

When paging too deep, each node has to hold a large amount of data and sort it before the requested page can be taken out. This consumes network bandwidth, memory, and CPU. This is the performance problem of deep paging, and it should be avoided as much as possible.
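A little arithmetic shows how quickly the cost grows. The Python sketch below assumes the usual from/size model, in which every shard has to produce its own top from + size hits and the coordinating node merges all of them to return a single page; the function name and the shard count of 5 are assumptions for the example.

```python
from typing import Dict


def deep_paging_cost(from_: int, size: int, shards: int) -> Dict[str, int]:
    """Estimate how many hits are sorted to return one page of results."""
    per_shard = from_ + size           # each shard sorts and returns this many
    coordinator = per_shard * shards   # the coordinating node merges all of them
    return {"per_shard": per_shard, "coordinator": coordinator, "returned": size}


print(deep_paging_cost(from_=0, size=10, shards=5))     # first page
print(deep_paging_cost(from_=9990, size=10, shards=5))  # page 1000
```

In both calls the user receives 10 documents, but for page 1000 the coordinating node has to merge 50,000 candidate hits instead of 50.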

One way to handle deep paging is scroll. Elasticsearch returns a scroll_id after each query, and the next page is fetched with that scroll_id. You can think of the scroll_id as a cursor in an ordinary relational database. The drawback of scroll is that you cannot go back: only the next page can be fetched, not the previous one.

GET /phone/_search?scroll=1m
{
  "query": { "match_all": {} },
  "size": 2
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFnZVX0YyNURsU3Z5eVNQWmY2cWc5VFEAAAAAAAAfwRZYNjJ6X0I0c1J2Q21"
}
  • How it works

    - At initialization, all search results that match the search criteria are cached; you can think of this as a snapshot.
    - During traversal, data is read from this snapshot. Inserts, deletes, and updates made to the index after initialization therefore do not affect the traversal results.

A cursor improves performance because deep paging has to re-sort the data on every search, which is very wasteful. Scroll sorts all the data once and then hands it out in batches.
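The snapshot behaviour can be mimicked with a small Python class. This is a toy model: it copies a list when the cursor is created and only ever moves forward. The class name and the document names are invented, and a real scroll context lives on the server and expires after the scroll timeout.

```python
from typing import List


class ScrollCursor:
    """Forward-only cursor over a snapshot taken at creation time."""

    def __init__(self, docs: List[str], size: int) -> None:
        self._snapshot = list(docs)  # copy: later index changes are not seen
        self._size = size
        self._position = 0

    def next_page(self) -> List[str]:
        """Return the next batch; an empty list means the scroll is done."""
        page = self._snapshot[self._position:self._position + self._size]
        self._position += len(page)
        return page


index = ["doc1", "doc2", "doc3"]
cursor = ScrollCursor(index, size=2)

first = cursor.next_page()
index.append("doc4")          # written after the snapshot was taken
second = cursor.next_page()
third = cursor.next_page()
print(first, second, third)
```

The document added after the cursor was created never shows up, and there is no way to ask the cursor for an earlier page.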

The filter principle

Aggregate analysis

Elasticsearch has a feature called Aggregations that allows us to generate some fine-grained analysis results based on the data. Aggregation is similar to but more powerful than GROUP BY in SQL.

Count the number of items under each tag

// Set the fielddata attribute of the tag field to true
// (if the phone index already exists, send the "properties" block to PUT /phone/_mapping instead)
PUT /phone
{
  "mappings": {
    "properties": {
      "tag": { "type": "text", "fielddata": true }
    }
  }
}

// Aggregation query
GET /phone/_search
{
  "aggs": {
    "all_tag": {
      "terms": { "field": "tag", "size": 100000 }
    }
  }
}

For products whose brand matches xiaomi, count the number of products under each tag

GET /phone/_search
{
  "query": {
    "match": {
      "brand": "xiaomi"
    }
  }, 
  "aggs": {
    "all_tag": {
      "terms": {
        "field": "tag",
        "size": 100000
      }
    }
  }
}

Group first, then average within each group: calculate the average price of the products under each tag

GET /phone/_search
{
  "aggs": {
    "all_tag": {
      "terms": {
        "field": "tag",
        "size": 100000
      },
      "aggs": {
        "all_avg": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

Calculate the average price of the items under each tag and sort them in descending order by average price

GET /phone/_search
{
  "aggs": {
    "all_tag": {
      "terms": {
        "field": "tag",
        "order": {
          "all_avg": "desc"
        }, 
        "size": 100000
      },
      "aggs": {
        "all_avg": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
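What this nested aggregation computes can be reproduced in plain Python: bucket the documents by tag, average the price in each bucket, and sort the buckets by that average. The sketch assumes each tag is treated as one whole value (as a keyword field would be) and uses a shortened, made-up version of the sample data, so the buckets differ from what an analyzed text field with fielddata would produce.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Shortened, illustrative version of the phone documents.
PHONES = [
    {"brand": "xiaomi", "price": 1999, "tag": ["5G", "NFC"]},
    {"brand": "vivo", "price": 1099, "tag": ["5G"]},
    {"brand": "apple", "price": 3999, "tag": ["4G"]},
]


def avg_price_by_tag(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Terms bucket per tag with an average-price metric, ordered descending."""
    prices = defaultdict(list)
    for doc in docs:
        for tag in doc["tag"]:  # a document falls into one bucket per tag
            prices[tag].append(doc["price"])
    buckets = [
        {"key": tag, "doc_count": len(values), "all_avg": sum(values) / len(values)}
        for tag, values in prices.items()
    ]
    buckets.sort(key=lambda bucket: bucket["all_avg"], reverse=True)
    return buckets


for bucket in avg_price_by_tag(PHONES):
    print(bucket)
```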
