ElasticSearch basics: a beginner's primer


Hopefully by the end of this article you will have a basic and comprehensive understanding of ElasticSearch.

(Note that this article mainly covers ES 7 and later.)

1 What is ElasticSearch?

ElasticSearch is a distributed full-text search and real-time data analytics engine that is scalable, highly available, and accessed through a RESTful API.

2 ES Concept, architecture, and principles

2.1 Basic Concepts

Node: a single server with ES installed.
Cluster: an ES cluster consists of one or more nodes, which are peers in a decentralized relationship.
Document: the basic unit of information that can be indexed.
Index: a collection of documents with similar characteristics.
Type: within an index you could define one or more types (removed in 7.x and later).
Field: the smallest unit of data in ES, equivalent to a column in a table.
Shards: ES divides an index into multiple pieces, each of which is a shard.
Replicas: ES can keep one or more copies of each shard; each copy is a replica.

2.2 Comparison of ES and relational database concepts

Concept comparison before version 7.x

Concept comparison from version 7.x onward (no type)

2.3 ES architecture

Gateway: the persistent storage mode for ElasticSearch indexes. By default, ElasticSearch keeps indexes in memory first and persists them to the Gateway when memory is full. When the ES cluster shuts down or restarts, it reads index data back from the Gateway. Supported gateways include the local file system, HDFS, and Amazon S3.
Distributed Lucene Directory: the directory of Lucene index files. It manages these files, including reading and writing data, adding to indexes, and merging them.
River: a data source, which exists as a plug-in in ElasticSearch.
Mapping: very similar to data types in a statically typed language. For example, if we declare a variable of type int, that variable can only store int data.
Search Module: the search module.
Index Module: the index module.
Discovery: mainly responsible for discovering cluster nodes and electing the master. For example, when a node suddenly leaves or joins, shards are reallocated.
Scripting: support for scripting languages such as MVEL, JS, and Python. There are many of them, so they are not covered here.
Transport: how ElasticSearch nodes communicate internally and how clients interact with the cluster. Supported protocols include Thrift, Memcached, and HTTP.
RESTful Style API: API programming through a RESTful interface.
3rd Plugins: third-party plug-ins.
Java (Netty): the development framework.
JMX: used for monitoring.

2.4 Primary Shards and Replicas

If an index is set to 3 shards and 1 replica, it might be stored in a cluster of 3 data nodes like the one below.
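
One way to picture such a layout is the following Python sketch. The round-robin placement and the node names are assumptions made purely for illustration; the real allocator weighs many more factors. The one rule it shares with ES is that a replica is never placed on the same node as its primary.

```python
def layout(num_primaries, num_replicas, nodes):
    """Place each primary round-robin, then its replicas on the following nodes."""
    if num_replicas >= len(nodes):
        raise ValueError("not enough nodes to keep every copy on a distinct node")
    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        primary_node = shard % len(nodes)
        placement[nodes[primary_node]].append(f"P{shard}")
        for r in range(1, num_replicas + 1):
            # Offsetting by r guarantees the replica lands on a different node.
            replica_node = (primary_node + r) % len(nodes)
            placement[nodes[replica_node]].append(f"R{shard}")
    return placement


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
for node, shards in layout(3, 1, nodes).items():
    print(node, shards)
```

With this layout, losing any single node still leaves one copy of every shard available on the remaining two nodes.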

2.5 ES Data Writing Principle

2.6 Node Types

The node type is determined by the node.roles setting in the configuration file. A cluster must have at least master and data nodes.

As the cluster grows, especially when there are a lot of machine learning jobs or continuous transforms, it makes sense to separate dedicated master nodes from dedicated data, machine learning, and transform nodes.

In practice, however, most of these node types are rarely needed.

A. Master-eligible node

The master node is responsible for lightweight cluster-wide operations, such as creating or deleting indexes, tracking which nodes are part of the cluster, and deciding which shards to assign to which nodes. A stable master node is essential for cluster health. The master is elected from the master-eligible nodes.

Any node that has the master role can be elected master during the election process.

node.roles: [ master ] or node.roles: [ master, data ]

B. Voting-only master-eligible node

A voting-only master-eligible node takes part in elections but will never itself be elected master (only nodes with the master role can also be given the voting_only role).

node.roles: [ data, master, voting_only ] or node.roles: [ master, voting_only ]

C. Data node

Data nodes hold the shards that contain indexed documents. They handle data-related operations such as CRUD, search, and aggregation. These operations are I/O-, memory-, and CPU-intensive, so it is important to monitor these resources and to add more data nodes for horizontal scaling if they become overloaded.

node.roles: [ data ]

D. Content data node

Content data nodes hold user-created content. They support operations such as CRUD, search, and aggregation.

node.roles: [ data_content ]

E. Hot data node

Hot data nodes store time series data. The hot tier needs fast reads and writes and therefore more hardware resources, such as SSDs.

node.roles: [ data_hot ]

F. Warm data node

Warm data nodes store indexes that are no longer updated regularly but are still being queried. Query volume is usually lower than while the index was in the hot tier, so lower-performance hardware can typically be used for nodes in this tier.

node.roles: [ data_warm ]

G. Cold data node

Cold data nodes store read-only indexes that are rarely accessed. This tier uses lower-performance hardware and can use searchable snapshot indexes to minimize the resources required.

node.roles: [ data_cold ]

H. Frozen data node

The frozen tier stores only partially mounted indices. Dedicated nodes are recommended for this tier.

node.roles: [ data_frozen ]

I. Ingest node

An ingest node can execute preprocessing pipelines made up of one or more ingest processors. Depending on the type of operations the processors perform and the resources they need, it may make sense to have dedicated ingest nodes that perform only this task.

node.roles: [ ingest ]

J. Coordinating-only node

If a node does not take on master duties, hold data, or preprocess documents (that is, it has no roles at all), it is a coordinating-only node: it can only route requests, handle the search reduce phase, and distribute bulk indexing. Essentially, it behaves like an intelligent load balancer.

Coordinating-only nodes can benefit large clusters. They join the cluster and receive the full cluster state, just like every other node, and use it to route requests directly to the appropriate nodes.

node.roles: [ ]
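
The "search reduce phase" of a coordinating node can be pictured with a small Python sketch. The per-shard results and scores below are made up, and merging pre-sorted lists with heapq is only a simplified model of the idea, not how ES implements it.

```python
import heapq
from itertools import islice


def reduce_hits(per_shard_hits, size):
    """Merge per-shard hit lists (each sorted by descending score) into the global top `size`."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


# Hypothetical (score, doc_id) results returned by three shards.
shard_0 = [(9.1, "a"), (3.2, "d")]
shard_1 = [(7.5, "b"), (7.0, "c")]
shard_2 = [(1.0, "e")]

print(reduce_hits([shard_0, shard_1, shard_2], size=3))
```

Each shard only needs to return its own best hits; the coordinating node combines them into the final result, which is why it needs CPU and memory but no local data.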

K. Remote-eligible node

A remote-eligible node acts as a cross-cluster client and connects to remote clusters. Once connected, you can search remote clusters with cross-cluster search and synchronize data between clusters with cross-cluster replication.

node.roles: [ remote_cluster_client ]

L. Machine learning node

Machine learning nodes run jobs and handle machine learning API requests.

node.roles: [ ml, remote_cluster_client ]

M. Transform node

Transform nodes run transforms and handle transform API requests.

node.roles: [ transform, remote_cluster_client ]

3. ES installation and construction

3.1. Install JDK

Note: Elasticsearch 7 and later ships with a bundled JDK, so there is no need to install one separately.

Download the JDK at jdk.java.net/19/

Each ES version corresponds to particular JDK versions. If the two do not match, the installation will fail.

For the version correspondence, please refer to the official documentation.

The installation process

Unzip the JDK package (the .tar.gz archive), then configure the environment variables by editing /etc/profile (vim /etc/profile) and adding:

export JAVA_HOME=[your Java path]
export JAVA_BIN=[your Java path]/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH

Then run java -version to check that the installation succeeded.

3.2. Install ES

Note: ES cannot be run as the root account, so create a dedicated account for it. Download the version you need from the official website.

Unzip ES package:

tar -zxvf elasticsearch-xxx-linux-x86_64.tar.gz

A brief introduction to the ES directories:

Go to the config directory and modify the elasticsearch.yml file

All security and authentication features are disabled here; how to configure authentication will be covered in a separate article.

cluster.name: my-es
node.name: es-node-1
node.roles: [ data, master ]
path.data: /data/es/es_data
path.logs: /data/es/es_log
network.host: x.x.x.97
http.port: 9200
transport.port: 9300
ingest.geoip.downloader.enabled: false

discovery.seed_hosts: ["x.x.x.97:9300","x.x.x.162:9300","x.x.x.165:9300"]
cluster.initial_master_nodes: ["es-node-1","es-node-2","es-node-3"]

xpack.security.enabled: false

xpack.security.enrollment.enabled: false

xpack.security.http.ssl:
  enabled: false
  keystore.path: certs/http.p12

xpack.security.transport.ssl:
  enabled: false
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12

Then go to the bin directory under the decompressed directory and start ES in the background:

./elasticsearch -d

You can then see the ES process running.

Then access port 9200 of the server; if the node information is returned, the installation was successful.

Install ES on three servers in the same way and they can form a cluster.

3.3 ES Configuration

Configuration file introduction

elasticsearch.yml: ES configuration
jvm.options: JVM settings used by ES
log4j2.properties: ES logging configuration

This section describes how to configure elasticsearch.yml

Configuration items can be written in two equivalent ways, nested:

path:
    data: /var/lib/elasticsearch
    logs: /var/log/elasticsearch

or flattened:

path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
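
That the two notations describe the same settings can be checked with a short Python sketch. The flatten helper is hypothetical and written only for illustration; it is not part of ES or of any YAML library.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. {"path": {"data": ...}} -> {"path.data": ...}."""
    flat = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


nested = {"path": {"data": "/var/lib/elasticsearch", "logs": "/var/log/elasticsearch"}}
flat = {"path.data": "/var/lib/elasticsearch", "path.logs": "/var/log/elasticsearch"}
print(flatten(nested) == flat)
```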

The most important common configuration items are as follows (for version 8.1):

More details are available on the official website.

4. Index

4.1. What is an inverted index

Also known as a reverse index, an inverted index maps each word to its locations in a document or a set of documents, which is what makes full-text search fast. It is the most commonly used data structure in document retrieval systems.

The usual way of building an index:

A [document -> keywords] mapping (forward index). Its disadvantage is low efficiency: a search has to scan through every document.

Inverted index:

A [keyword -> documents] mapping, obtained by restructuring the forward index.

Suppose there are five documents. First, each document is split into words (word segmentation).

Then both the words and the documents are numbered.

A matrix is formed from the word set and the document set:

Horizontal (by word): (Jobs, (document 1, document 2, document 3)); (Rebus, (document 1, document 2, document 3, document 4))

Vertical (by document): (document 1, (Jobs, Rebus)); (document 5, (Yellow Chapter, Apple))

This matrix is then converted into an inverted list. For example, "Jobs" appears in document 1 at positions 3 and 11, in document 2 at position 7, and in document 3 at position 9.

More information can be stored as well, such as word frequency: "Jobs" appears at positions 3 and 11 of document 1, so its frequency in document 1 is 2.
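
The construction described above can be sketched in a few lines of Python. The sample documents are made up, and splitting on whitespace is only a stand-in for a real analyzer, so treat this as an illustration of the data structure rather than of what Lucene does internally.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]}; the term frequency is the length of that list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical documents, numbered as in the description above.
docs = {
    1: "Jobs founded Apple",
    2: "Apple released a new phone",
    3: "Jobs left Apple and later returned to Apple",
}
index = build_inverted_index(docs)
print(index["jobs"])
print(index["apple"])
```

A search for a word is then a single dictionary lookup instead of a scan over every document, and the stored positions and frequencies are available for phrase matching and scoring.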

4.2. Index operations

A. Initialize index settings

Create tt_index with 5 primary shards and 1 replica


PUT /tt_index/
{
  "settings":{
      "number_of_shards":5,
      "number_of_replicas":1
  }
}

Note:

number_of_shards can only be specified when the index is created and cannot be changed afterwards.

number_of_replicas can be changed later.
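
The reason the shard count is fixed is that a document's shard is derived from its routing value (by default the document ID) modulo the number of primary shards. The Python sketch below illustrates this under a stated assumption: zlib.crc32 stands in for the hash function ES actually uses, so the concrete shard numbers are not what a real cluster would produce.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard as hash(routing) % number_of_shards (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(100)]
# Compare where each document lives with 5 shards and where it would be looked up with 6.
moved = [d for d in doc_ids if shard_for(d, 5) != shard_for(d, 6)]
print(f"{len(moved)} of {len(doc_ids)} documents would be looked up on a different shard")
```

If the divisor changed on a live index, existing documents would no longer be found on the shard the formula points to. Replicas are only copies of existing shards, so their number does not affect routing and can be changed freely.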

B. View index settings

View the settings of a single index

GET /tt_index/_settings

View the settings of multiple indexes

GET /tt_index,tt_index2/_settings

View the settings of all indexes

GET /_all/_settings

C. Add data to an index

Without specifying an ID (ES generates a random one)

POST /tt_index2/_doc/
{
  "name":"dragon",
  "age":18,
  "like":{
    "food":"apple",
    "color":"white"
  }
}

With a specified ID; if the ID already exists, the document is overwritten

PUT /tt_index2/_doc/2
{
  "name":"dragon",
  "age":18,
  "like":{
    "food":"apple",
    "color":"white"
  }
}

With a specified ID; if the ID already exists, an error is returned

PUT /tt_index2/_create/2
{
  "name":"dragon",
  "age":18,
  "like":{
    "food":"apple",
    "color":"white"
  }
}

D. Query index data

Query all the data in an index

GET /tt_index2/_search

Count the documents in an index

GET /tt_index2/_count

Query a document by ID

GET /tt_index2/_doc/2

Filter index data by a condition (here, documents whose age is less than 19)

POST /tt_index2/_search
{
    "query": {
        "range": {
            "age": {
                "lt": 19
            }
        }
    }
}

View how the shards of an index are distributed across nodes

GET /_cat/shards/tt_index

E. Update index data

A PUT to an existing ID replaces the whole document

PUT /tt_index2/_doc/2
{
  "name":"dragon23333",
  "age":18,
  "like":{
    "food":"apple",
    "color":"white"
  }
}

F. Delete index data

Delete the document with the specified ID

DELETE /tt_index2/_doc/2

Delete the documents that match a condition

POST /tt_index2/_delete_by_query
{
    "query": {
        "range": {
            "age": {
                "gt": 19
            }
        }
    }
}

G. Delete an index

DELETE /tt_index2

5. Mapping

5.1. What is Mapping

Mapping is the process of defining how a document and the fields it contains are stored and indexed.

5.2 Types of Mapping

A. Dynamic mapping

ES is quite smart here: you can create an index without defining any mapping, and the mapping will be inferred automatically when you insert data.

Right after the index is created, the mapping is empty.

After data is inserted, the mapping is inferred automatically.

Note that once a field has been mapped, its type cannot be changed.

The ES dynamic mapping rules are as follows
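
The default rules can be summarized with a small Python sketch. It is a simplification under stated assumptions: the date pattern is a stand-in for ES date detection (which accepts more formats), numeric detection of strings is left off as it is by default, and the function name is hypothetical.

```python
import re

# Simplified stand-in for ES date detection; the real default formats are broader.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?$")


def infer_type(value):
    """Return the field type dynamic mapping would pick, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # checked before int because bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        first = next((v for v in value if v is not None), None)
        return infer_type(first)
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text (with a keyword sub-field)"
    raise TypeError(f"unsupported JSON value: {value!r}")


doc = {"name": "dragon", "age": "18", "score": 9.5, "visits": 3,
       "active": True, "joined": "2022-03-01"}
for field, value in doc.items():
    print(field, "->", infer_type(value))
```

Note that a number sent in quotes, such as "age": "18", is mapped as text rather than as a number, which matters for range queries and sorting.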

B. Explicit mapping

Explicit mapping lets you define exactly how fields are mapped, instead of relying on what ES infers automatically. For example, you can specify:

Which string fields should be treated as full-text fields
Which fields contain numbers, dates, or geolocations
The format of date fields
Whether all fields of the document should be indexed into the _all field (note that _all was removed in ES 7)

A simple example of the mapping is shown below:

5.3. Field Type

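Common field types include text, keyword, date, long, double, boolean, and ip. As you can see, there is no string type: strings are stored as text (or keyword), and every text field is full-text indexed. The most commonly used numeric types are long and double, although others such as integer and float exist too.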

More information about the types can be found in the official documentation

5.4 Mapping-related APIs

Here are some common ones

A. Define the mapping when creating an index

PUT /mapping_index2
{
  "mappings": {
    "properties": {
      "age":    { "type": "integer" },  
      "email":  { "type": "keyword"  }, 
      "name":   { "type": "text"  }     
    }
  }
}

B. View the mapping of an index

GET /mapping_index2/_mapping

C. Add a field to an existing mapping

PUT /mapping_index2/_mapping
{
  "properties":{
    "add-field":{
      "type": "keyword",
      "index": false
    }
  }
}
 

6. Aliases

6.1. Aliases for indexes

An alias is a secondary name for a group of data streams or indexes. Most Elasticsearch APIs accept an alias in place of a data stream or index name.

The data streams or indexes behind an alias can be changed at any time. If your application uses an alias in its Elasticsearch requests, you can reindex data without downtime and without changing your application's code.

A. Create an alias

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    }
  ]
}

B. Delete the alias

POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    }
  ]
}

C. Switch the index an alias points to in a single operation

POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    },
    {
      "add": {
        "index": "logs-nginx2",
        "alias": "logs"
      }
    }
  ]
}
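
When this swap is done from application code, the request body can be generated rather than written by hand. The helper below is a hypothetical Python sketch that only builds the body of the _aliases request; it does not send anything to a cluster.

```python
import json


def alias_swap(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap("logs", "logs-nginx", "logs-nginx2")
print(json.dumps(body, indent=2))
```

Because both actions travel in one request, clients querying the alias never see a moment where it points to no index.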

D. Add an alias when creating an index

PUT index-aliases-1
{
  "aliases": {
    "index-aliases": {}
  }
}

E. View the alias

View aliases for all indexes

GET _alias


View the alias for the specified index

GET index-aliases-1/_alias


View the index corresponding to the specified alias

GET _alias/index-aliases

6.2. Aliases for fields

alias is also one of the available field types.

An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests.

The target of an alias has some restrictions:

1. The target must be a specific field, not an object or another field alias.
2. The target field must already exist when the alias is created.
3. If nested objects are defined, the field alias must have the same nested scope as its target.

In addition, a field alias can have only one target. This means a field alias cannot be used to query multiple target fields in a single clause.

Define an alias for a field in the mapping

PUT trips
{
  "mappings": {
    "properties": {
      "distance": {
        "type": "long"
      },
      "route_length_miles": {
        "type": "alias",
        "path": "distance" 
      },
      "transit_mode": {
        "type": "keyword"
      }
    }
  }
}
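
The restrictions on alias targets can be expressed as a short Python sketch operating on the properties of the trips mapping. The resolve_field function is hypothetical and handles only top-level fields, so it does not cover the nested-scope rule; it illustrates the checks and is not how ES resolves aliases internally.

```python
def resolve_field(properties, name):
    """Follow a field alias to its target, enforcing the restrictions on alias targets."""
    field = properties[name]
    if field.get("type") != "alias":
        return name  # a regular field resolves to itself
    target = properties.get(field["path"])
    if target is None:
        raise ValueError(f"alias target {field['path']!r} does not exist")
    if target.get("type") == "alias" or "properties" in target:
        raise ValueError("alias target must be a concrete field, not an alias or an object")
    return field["path"]


properties = {
    "distance": {"type": "long"},
    "route_length_miles": {"type": "alias", "path": "distance"},
    "transit_mode": {"type": "keyword"},
}
print(resolve_field(properties, "route_length_miles"))
```

A search on route_length_miles is therefore evaluated against distance, and because path holds a single name, an alias can never fan out to several fields.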

Note that field aliases are not supported by some APIs.

End of this chapter

Continuous love, continuous curiosity — Uncle Long