I. Basic theoretical knowledge

  1. Introduction

    ElasticSearch is a near real-time search platform: there is a slight delay (usually about one second) between indexing a document and that document becoming searchable. In the company's LogCenter, the latency comes from the Kafka2ES layer, not from the indexing layer.

    ElasticSearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene(TM). ElasticSearch is more than just Lucene, though. Beyond full-text search, it provides the following:

    • A distributed real-time document store where every field can be indexed and searched

    • A distributed real-time analytics search engine

    • The ability to scale out to hundreds of server nodes and handle petabytes of structured or unstructured data


  2. Basic concepts

    ElasticSearch is a document-oriented database: each piece of data is a document, and JSON is the document serialization format. MySQL terms correspond to ElasticSearch terms as follows:

    MySQL term -> ElasticSearch term: meaning

    • Database -> Index: A collection of documents, similar to a database in MySQL.

    • Table -> Type: Different types can be defined within an index. A type is similar to a table in MySQL: a collection of data with the same characteristics.

    • Row -> Document: A document is similar to a stored record in MySQL. It is in JSON format, and an index can hold many documents under its different types.

    • Column -> Field

    • Index -> Everything is indexed by default

    • SQL (Structured Query Language) -> Query DSL (Domain Specific Language)

    Shards

    When the amount of data is large, an index is split into shards so that it can scale horizontally and search performance improves.

    A shard holds only a fraction of all the data in the index.

    Documents are stored and indexed in shards, but our application does not communicate with the shards directly. Instead, it communicates with the index.

    Documents are stored in shards, which are then distributed to nodes in your cluster. When your cluster expands or shrinks, Elasticsearch will automatically migrate shards between your nodes to keep the cluster in balance.

    Replicas

    Replicas are copies of a shard. They prevent data loss when a shard fails, and searches can also run against the replicas in parallel, which improves performance.

  3. Advantages

    Advantage: meaning

    • Horizontal scalability: You can add a server to a cluster. Make sure the cluster name is the same.

    • Sharding mechanism: Distributes data better and improves processing efficiency through a divide-and-conquer approach.

    • High availability: Provided by the replica mechanism.

    • Real time: Queries are sped up by putting files on disk into the file system cache.

    Horizontal scaling versus vertical scaling: For most databases, horizontal scaling means that your program will have to change a lot to take advantage of these new devices. By contrast, Elasticsearch is naturally distributed and knows how to manage nodes to provide high scalability and availability. This means that your program doesn’t need to care. As long as you follow certain rules when creating clusters, nodes, and shards, you can scale according to your needs and keep your data safe in the event of hardware failures.

    To illustrate what primary shards and replica shards mean, and how a cluster scales, here is an example:

    (1) Start an empty node with no index or document data.

    (2) Add an index named log with three primary shards. (By default, ES versions before 7.0 create 5 primary shards.)

    (3) A single node is a single point of failure and risks data loss, so we add a second node. Setting the same cluster.name lets it join the same cluster. We also configure the log index so that each primary shard has one replica shard. After that, the cluster can still serve client queries if either node fails.

    (4) When we add a third node, the ES cluster reorganizes itself and redistributes the shards to balance the load. Each node now holds only two shards, one fewer than before, so each shard gets a larger share of its node's resources, such as CPU and I/O.

    (5) Once a node holds only one shard, that shard has all of the node's resources to itself. What if we need to scale beyond 6 nodes? The number of primary shards is fixed when the index is created, so instead we increase the number of replica shards, for example by setting the number of replicas per primary shard to two.

    (6) ES can cope with a single point of failure. If we kill the process of the node that is currently the master, ES promptly elects a new master node, and the cluster can still serve queries and access to all data.
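The shard counts in the walkthrough above follow from simple arithmetic. The sketch below is only a back-of-the-envelope model: the function names are made up for illustration, and it assumes shards are spread as evenly as possible, which is a simplification of how ES actually allocates them.

```python
import math


def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies in an index: every primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


def max_shards_per_node(primaries: int, replicas_per_primary: int, nodes: int) -> int:
    """Largest number of shards any node holds under an even spread (assumption)."""
    return math.ceil(total_shards(primaries, replicas_per_primary) / nodes)


# The "log" index from the walkthrough: 3 primaries, 1 replica each.
print(total_shards(3, 1))            # 6 shards in total
print(max_shards_per_node(3, 1, 2))  # 3 shards per node on two nodes
print(max_shards_per_node(3, 1, 3))  # 2 shards per node on three nodes
print(total_shards(3, 2))            # 9 shards after raising replicas to 2
```

With 6 shards, a seventh node would have nothing to hold, which is why step (5) raises the replica count rather than the primary count.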

  4. Indexes

    Relational databases use B-tree/B+tree indexes, whereas ES uses an inverted index. Everything in the design of the ES index aims to improve search performance.

    B-tree/B+tree index:

    This index structure is designed with writes in mind. A lookup takes O(log N), as in binary search, and inserting a new node does not require moving all existing nodes.

    Inverted index:

    Inverted index structure:



    Here’s an example:

    ID    Name    Age    Sex
    1     Kate    24     female
    2     John    24     male
    3     Bill    29     male

    ES then builds the following indexes:

    Name:

    Term    Posting List
    Kate    [1]
    John    [2]
    Bill    [3]

    Age:

    Term    Posting List
    24      [1, 2]
    29      [3]

    Sex:

    Term      Posting List
    male      [2, 3]
    female    [1]

    As shown above, Elasticsearch creates an inverted index for each field. Kate, John, 24, and female are called terms, while [1, 2] is a posting list: an array of integers holding the IDs of all documents that contain a given term. ES applies optimization and compression techniques to each part of this structure. The term dictionary stores the terms in blocks, and the term index is a tree-like structure over it, much like the index pages of a dictionary. A lookup first finds the matching node in the term index, then goes to the corresponding block in the term dictionary, and from there finds the term on disk.
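The per-field term-to-posting-list mapping described above can be sketched in a few lines. This is an in-memory illustration only, not how Lucene stores its index; the function names `build_inverted_index` and `search` are made up for the example.

```python
from collections import defaultdict

# The three records from the example table.
records = [
    {"ID": 1, "Name": "Kate", "Age": 24, "Sex": "female"},
    {"ID": 2, "Name": "John", "Age": 24, "Sex": "male"},
    {"ID": 3, "Name": "Bill", "Age": 29, "Sex": "male"},
]


def build_inverted_index(docs, id_field="ID"):
    """Map field -> term -> sorted list of document IDs (the posting list)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc in sorted(docs, key=lambda d: d[id_field]):
        for field, term in doc.items():
            if field != id_field:
                index[field][term].append(doc[id_field])
    # Convert to plain dicts so that lookups of unknown terms raise KeyError.
    return {field: dict(terms) for field, terms in index.items()}


def search(index, **conditions):
    """Intersect the posting lists of every field=term condition (logical AND)."""
    postings = [set(index[field].get(term, [])) for field, term in conditions.items()]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(records)
print(index["Age"][24])                     # posting list for Age = 24
print(search(index, Age=24, Sex="male"))    # documents matching both terms
```

A query on several fields then reduces to intersecting posting lists, which is why keeping those lists small and sorted matters.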



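    To keep the term index, term dictionary, and posting lists compact enough for memory, each part is compressed. The term index is compressed with an FST (finite state transducer). The posting list is compressed with delta encoding: large numbers are turned into small ones by storing the differences between consecutive document IDs, and the small numbers are then packed into as few bytes as possible.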

    Compression of Posting List:
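A minimal sketch of the delta-encoding step, assuming a sorted posting list. The sample IDs are invented for illustration, and the sketch stops at the delta step plus a bit-width count; it is not the full scheme used by Lucene.

```python
def delta_encode(doc_ids):
    """Store the first ID as-is, then each gap to the previous ID."""
    deltas, previous = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild the original IDs by accumulating the gaps."""
    doc_ids, running_total = [], 0
    for delta in deltas:
        running_total += delta
        doc_ids.append(running_total)
    return doc_ids


def bits_needed(values):
    """Bits required to store the largest value in the list."""
    return max(value.bit_length() for value in values)


posting_list = [73, 300, 302, 332, 343, 372]   # hypothetical sorted document IDs
deltas = delta_encode(posting_list)

print(deltas)                      # [73, 227, 2, 30, 11, 29]
print(bits_needed(posting_list))   # 9 bits per value before encoding
print(bits_needed(deltas))         # 8 bits per value after encoding
assert delta_decode(deltas) == posting_list
```

The gain grows with longer and denser posting lists, where most gaps are far smaller than the IDs themselves.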



II. The query

How you interact with ES depends on whether you use Java. The Java API is introduced first, followed by the RESTful API, which the query examples below use.

Java API

If you are using Java, you can use two of ElasticSearch’s built-in clients in your code: the Node Client and the Transport Client. The Java client defaults to port 9300 and uses ES’s native transport protocol to interact with the cluster.

RESTful API with JSON over HTTP

Introduction

All other languages can talk to ES through the RESTful API on port 9200. A request has the following form:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The parts marked by < > have the following meanings:

• VERB: The appropriate HTTP method (verb): GET, POST, PUT, HEAD, or DELETE.

• PROTOCOL: http or https (if you have an HTTPS proxy in front of ElasticSearch).

• HOST: The host name of any node in the ElasticSearch cluster, or localhost for a node on the local machine.

• PORT: The port on which the ElasticSearch HTTP service runs. The default is 9200.

• PATH: The API endpoint path (for example, _count returns the number of documents in the cluster). The path may contain multiple components, such as _cluster/stats and _nodes/stats/jvm.

• QUERY_STRING: Optional query string parameters (for example, ?pretty formats the JSON response to make it easier to read).

• BODY: A JSON-formatted request body (if the request needs one).
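To show how these parts fit together, here is a small sketch that assembles a curl command line from them. The helper name `build_curl` and its defaults are made up for this illustration; the function only builds a string and sends nothing over the network.

```python
import json
import shlex


def build_curl(verb, host, path, protocol="http", port=9200,
               query_string="", body=None):
    """Assemble a curl command from the VERB/PROTOCOL/HOST/PORT/PATH/... parts."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query_string:
        url += "?" + query_string
    parts = ["curl", f"-X{verb}", shlex.quote(url)]
    if body is not None:
        # Serialize the request body as JSON and quote it for the shell.
        parts += ["-d", shlex.quote(json.dumps(body))]
    return " ".join(parts)


command = build_curl("GET", "localhost", "_count",
                     query_string="pretty",
                     body={"query": {"match_all": {}}})
print(command)
```

Quoting the URL and the body with `shlex.quote` keeps characters such as `?` and `{` from being interpreted by the shell.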

Here are two examples:

(1) To count the number of documents in the cluster:

curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}'

ES returns an HTTP status code (for example, 200 OK) and, except for HEAD requests, a response body in JSON format. The JSON body looks like this:

{
    "count": 0,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    }
}

(2) A query against the Xinmeida (Meituan-Dianping) log center:

curl -XPOST 'http://es.data.sankuai.com/log.mapi-log-service.userbehaviours_all/_search?pretty' -d '{"query": {//TODO: query statement}, "size": "100"}'

Retrieval

There are two methods: sending HTTP GET requests for retrieval and using ES query expressions (DSL) for retrieval.

The first method is to construct an HTTP GET request. The example below uses an abbreviated format that omits the parts common to every request, such as the host name, the port number, and the curl command itself.

Reference: www.elastic.co/guide/cn/el…

www.elastic.co/guide/cn/el…

GET /megacorp/employee/1

The response contains metadata about the document, along with the original JSON document in the _source field:

{
  "_index": "megacorp",
  "_type": "employee",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"]
  }
}

The second method is to retrieve with the ES query DSL: the request parameters are built as a JSON request body in the format that ES defines.

Reference: www.elastic.co/guide/cn/el…

Common queries are listed below, each with its meaning, usage, and notes:

term filter

term is mainly used for exact matching on values such as numbers, dates, Boolean values, or not_analyzed strings.

Note: a not_analyzed string is a text field whose value has not been tokenized.

{
  "query": {
    "term": {
      "title": "Inner Mongolia"
    }
  }
}

terms filter

terms is similar to term, but it lets you specify multiple values to match. If the field contains any of the specified values, the document matches.

{
  "query": {
    "terms": {
      "title": ["Inner Mongolia", "Heilongjiang"]
    }
  }
}

range filter

The range filter finds the documents whose field value falls within a specified range.

{
  "query": {
    "range": {
      "pubTime": {
        "gt": "2017-06-25",
        "lt": "2017-07-01"
      }
    }
  }
}

Range operators include:

Keyword    Meaning
gt         Greater than
gte        Greater than or equal to
lt         Less than
lte        Less than or equal to
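The semantics of the four range operators can be sketched locally as follows. This is an in-memory illustration, not ES code; the name `in_range` is made up, and comparing dates as strings works here only on the assumption that they are all in ISO yyyy-MM-dd form.

```python
import operator

# Each range keyword maps onto an ordinary comparison.
RANGE_OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, bounds):
    """Return True if value satisfies every bound, e.g. {"gt": a, "lt": b}."""
    return all(RANGE_OPERATORS[keyword](value, bound)
               for keyword, bound in bounds.items())


bounds = {"gt": "2017-06-25", "lt": "2017-07-01"}
print(in_range("2017-06-28", bounds))   # True: strictly inside the range
print(in_range("2017-06-25", bounds))   # False: gt excludes the boundary
print(in_range("2017-06-25", {"gte": "2017-06-25", "lt": "2017-07-01"}))  # True
```

Choosing gt versus gte (or lt versus lte) only changes whether the boundary value itself is included.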

exists and missing filters

The exists and missing filters find documents that do or do not contain a specified field, similar to the IS NOT NULL and IS NULL conditions in SQL.

These two filters are typically used on a batch of data that has already been retrieved, when you want to distinguish whether a particular field is present.

{
    "exists": {
        "field": "title"
    }
}

bool filter

The bool filter applies Boolean logic to merge the results of multiple filter conditions. It contains the following operators:

must: all of the listed conditions must match, equivalent to AND.

must_not: none of the listed conditions may match, equivalent to NOT.

should: at least one of the listed conditions must match, equivalent to OR.

Each of these parameters accepts either a single filter condition or an array of filter conditions.

{
    "bool": {
        "must": {"term": {"folder": "inbox"}},
        "must_not": {"term": {"tag": "spam"}},
        "should": [
            {"term": {"starred": true}},
            {"term": {"unread": true}}
        ]
    }
}
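To make the must / must_not / should semantics concrete, the sketch below evaluates the filter above against plain Python dictionaries. It is a local model that follows the description in this section (should requires at least one match); the helper names are made up, only term clauses are supported, and a real cluster can behave differently depending on settings such as minimum_should_match.

```python
def matches_term(doc, clause):
    """Evaluate one {"term": {field: value}} clause against a document dict."""
    field, expected = next(iter(clause["term"].items()))
    return doc.get(field) == expected


def as_list(clauses):
    """Accept a single clause or a list of clauses, as the bool filter does."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]


def matches_bool(doc, bool_filter):
    must = as_list(bool_filter.get("must"))
    must_not = as_list(bool_filter.get("must_not"))
    should = as_list(bool_filter.get("should"))
    return (all(matches_term(doc, c) for c in must)                          # AND
            and not any(matches_term(doc, c) for c in must_not)              # NOT
            and (not should or any(matches_term(doc, c) for c in should)))   # OR


inbox_filter = {
    "must": {"term": {"folder": "inbox"}},
    "must_not": {"term": {"tag": "spam"}},
    "should": [{"term": {"starred": True}}, {"term": {"unread": True}}],
}

print(matches_bool({"folder": "inbox", "tag": "work", "starred": True}, inbox_filter))   # True
print(matches_bool({"folder": "inbox", "tag": "spam", "starred": True}, inbox_filter))   # False
print(matches_bool({"folder": "inbox", "tag": "work", "starred": False}, inbox_filter))  # False
```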

III. References

www.elastic.co/guide/en/el…

www.elastic.co/guide/cn/el…

juejin.cn/post/684490…