Hopefully by the end of this article you will have a basic and comprehensive understanding of ElasticSearch.
(Note that this article mainly covers the content after ES 7.)
1 What is ElasticSearch?
A scalable, highly available full-text search and real-time data analytics engine that is accessed through a RESTful API.
2 ES Concept, architecture, and principles
2.1 Basic Concepts
- Node: a single server with ES installed.
- Cluster: an ES cluster consists of one or more nodes (the nodes are peers in a decentralized relationship).
- Document: a document is the basic unit of information that can be indexed.
- Index: an index is a collection of documents with similar characteristics.
- Type: in an index, you can define one or more types (removed from 7.x and later).
- Field: a field is the smallest unit of data in ES, which is equivalent to a column of data.
- Shards: ES divides an index into multiple pieces, each of which is a shard.
- Replicas: ES can keep one or more copies of each shard; each copy is a replica.
2.2 Comparison of ES and relational database concepts
Comparison before version 7.x
Comparison from version 7.x onward (without type)
2.3 ES architecture
- Gateway is the persistent storage layer for ElasticSearch indexes. By default, ElasticSearch keeps an index in memory first and persists it to the Gateway when memory is full. When the ES cluster shuts down or restarts, it reads the index data back from the Gateway. Supported storage includes the local file system, HDFS, and Amazon S3.
- Distributed Lucene Directory is the directory of index files in Lucene. It is responsible for managing these index files, including reading and writing data and adding and merging indexes.
- River represents a data source. It exists as a plug-in in ElasticSearch.
- Mapping is very similar to data types in statically typed languages. For example, if we declare a variable of type int, that variable can only store data of type int.
- Search Module: the search module.
- Index Module: the index module.
- Discovery is mainly responsible for discovering the master node of the cluster. It also reacts when a node suddenly leaves or joins, in which case shards are reallocated.
- Scripting: scripting language support, such as MVEL, JS, and Python. There are a lot of them, so I won't go into them here.
- Transport represents how ElasticSearch nodes communicate internally and how clients interact with the cluster. The supported protocols include Thrift, Memcached, and HTTP.
- RESTful Style API: API programming in a RESTful way.
- 3rd Plugins: third-party plug-ins.
- Java (Netty): the development framework.
- JMX: used for monitoring.
2.4 Primary Shards and Replicas
If an index is set to 3 shards and 1 replica, it might be stored in a cluster of 3 data nodes like the one below.
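To make the 3 shards / 1 replica layout a little more concrete, here is a minimal Python sketch. It is only a toy round-robin placement written for this article (the `allocate` function and the node names are hypothetical), not the real ES allocation algorithm. The one rule it does respect is that a replica is never placed on the same node as its primary.

```python
from collections import defaultdict


def allocate(num_primaries, num_replicas, nodes):
    """Toy placement: primaries round-robin, replicas on the following nodes."""
    if num_replicas >= len(nodes):
        raise ValueError("each copy of a shard needs a node of its own")
    layout = defaultdict(list)
    for shard in range(num_primaries):
        # Primary copy of this shard.
        layout[nodes[shard % len(nodes)]].append(f"P{shard}")
        # Replica copies go to the next nodes, never to the primary's node.
        for offset in range(1, num_replicas + 1):
            layout[nodes[(shard + offset) % len(nodes)]].append(f"R{shard}")
    return dict(layout)


layout = allocate(3, 1, ["node-1", "node-2", "node-3"])
for node in sorted(layout):
    print(node, layout[node])
```

Each of the three nodes ends up holding one primary and one replica of a different shard, so losing any single node still leaves a full copy of the index.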
2.5 ES Data Writing Principle
2.6 Node Types
The node type is determined by the node.roles setting in the configuration file. A cluster must have at least master and data nodes.
As the cluster grows, especially when there are a lot of machine learning tasks or continuous transformations, it makes sense to consider separating dedicated master nodes from dedicated data, machine learning, and transformation nodes.
However, most node types are not generally used.
A. Master-eligible node
The master node is responsible for lightweight cluster-wide operations, such as creating or deleting indexes, tracking which nodes are part of the cluster, and deciding which shards to assign to which nodes. Cluster health requires a stable master node. The master node is elected from the master-eligible nodes.
Any node that has the master role can be elected as the master node during the election process.
node.roles: [ master ] or node.roles: [ master, data ]
B. Voting-only master-eligible node
A voting-only master-eligible node only votes in elections and is never elected as the master node itself (only nodes with the master role can be given the voting_only role).
node.roles: [ data, master, voting_only ] or node.roles: [ master, voting_only ]
C. Data node
Data nodes hold the shards that contain indexed documents and handle data-related operations such as CRUD, search, and aggregation. These operations are I/O, memory, and CPU intensive, so it is important to monitor these resources and add more data nodes for horizontal scaling if they become overloaded.
node.roles: [ data ]
D. Content data node
Content data nodes hold user-created content. They support operations such as CRUD, search, and aggregation.
node.roles: [ data_content ]
E. Hot data node
Hot data nodes store time series data. The hot tier requires high read/write speed and therefore more hardware resources, such as SSDs.
node.roles: [ data_hot ]
F. Warm data node
Warm data nodes store indexes that are no longer updated regularly but are still being queried. The query volume is usually lower than when the index was in the hot tier, so lower-performance hardware can typically be used for nodes in this tier.
node.roles: [ data_warm ]
G. Cold data node
Cold data nodes store read-only indexes that are rarely accessed. This tier uses lower-performance hardware and can minimize the resources required by using searchable snapshots.
node.roles: [ data_cold ]
H. Frozen data node
The frozen tier stores only partially mounted indexes. You are advised to use dedicated nodes.
node.roles: [ data_frozen ]
I. Ingest node
An ingest node can execute a preprocessing pipeline consisting of one or more ingest processors. Depending on the type of operation the ingest processors perform and the resources required, it may make sense to use dedicated ingest nodes that only perform this particular task.
node.roles: [ ingest ]
J. Coordinating only node
If a node does not handle master duties, does not hold data, does not pre-process documents, and has no role at all, it is a coordinating only node that can only route requests, handle the search reduce phase, and distribute bulk indexing requests. Essentially, it behaves like an intelligent load balancer.
Coordinating nodes can benefit large clusters. They join the cluster and receive the full cluster state, just like every other node, and use the cluster state to route requests directly to the appropriate node.
node.roles: [ ]
K. Remote-eligible node
Remote-eligible nodes act as cross-cluster clients and connect to remote clusters. Once connected, you can use cross-cluster search to search the remote clusters. You can also use cross-cluster replication to synchronize data between clusters.
node.roles: [ remote_cluster_client ]
L. Machine learning node
Machine learning nodes run jobs and handle machine learning API requests.
node.roles: [ ml, remote_cluster_client ]
M. Transform node
Transform nodes run transforms and handle transform API requests.
node.roles: [ transform, remote_cluster_client ]
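As a quick way to check your understanding of the roles above, the following Python sketch classifies a node from its node.roles list. The `describe_node` helper is hypothetical and written only for this article; it encodes just two rules taken from the text: an empty role list means a coordinating only node, and voting_only is only valid together with master.

```python
# Role names that make a node hold data (generic role plus the tier roles).
DATA_ROLES = {"data", "data_content", "data_hot", "data_warm", "data_cold", "data_frozen"}
OTHER_ROLES = ("ingest", "remote_cluster_client", "ml", "transform")


def describe_node(roles):
    """Return human-readable labels for a node.roles list."""
    role_set = set(roles)
    if "voting_only" in role_set and "master" not in role_set:
        raise ValueError("voting_only requires the master role")
    if not role_set:
        return ["coordinating only"]
    labels = []
    if "master" in role_set:
        if "voting_only" in role_set:
            labels.append("voting-only master-eligible")
        else:
            labels.append("master-eligible")
    if role_set & DATA_ROLES:
        labels.append("data")
    labels.extend(role for role in OTHER_ROLES if role in role_set)
    return labels


print(describe_node(["master", "data"]))
print(describe_node(["data", "master", "voting_only"]))
print(describe_node([]))
```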
3. ES installation and setup
3.1. Install JDK
A note of caution: Elasticsearch 7 and later ships with a bundled JDK, so there is no need to install the JDK separately.
Download the JDK at jdk.java.net/19/
The ES version corresponds to the JDK version. If the ES version does not match the JDK version, the installation will be abnormal.
For the corresponding relationship, please refer to the official document:
The installation process
Unzip the JDK package (the .tar.gz archive), then configure the environment variables with vim /etc/profile and add:
export JAVA_HOME=[your Java path]
export JAVA_BIN=[your Java path]/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH
Run java -version to check whether the installation is OK.
3.2. Install ES
Note: ES cannot be run as the root account, so create a dedicated account for ES. Then download the version you need from the official website.
Unzip ES package:
tar -zxvf elasticsearch-xxx-linux-x86_64.tar.gz
A brief introduction to the ES directories:
Go to the config directory and modify the elasticsearch.yml file
All security-related features are disabled here. How to configure authentication will be covered separately
cluster.name: my-es
node.name: es-node-1
node.roles: [ data, master ]
path.data: /data/es/es_data
path.logs: /data/es/es_log
network.host: x.x.x.97
http.port: 9200
transport.port: 9300
ingest.geoip.downloader.enabled: false
discovery.seed_hosts: ["x.x.x.97:9300","x.x.x.162:9300","x.x.x.165:9300"]
cluster.initial_master_nodes: ["es-node-1","es-node-2","es-node-3"]
xpack.security.enabled: false
xpack.security.enrollment.enabled: false
xpack.security.http.ssl:
  enabled: false
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: false
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
Then go to the bin directory under the decompressed directory and run the background startup command:
./elasticsearch -d
You can then see the ES process running.
Then access port 9200 of the server. If the node information is returned, the installation is successful.
With three nodes installed in the same way, you can form a cluster.
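If you would rather check port 9200 from a script than from a browser, a small Python sketch like the one below could be used. The `check_node` and `summarize` helpers are hypothetical, and the sample body is a trimmed, made-up response that only keeps three fields (name, cluster_name, version.number); the real response contains more.

```python
import json
from urllib.request import urlopen


def summarize(body):
    """Pull the node name, cluster name and version out of the root response."""
    info = json.loads(body)
    return f'{info["name"]} in cluster {info["cluster_name"]} runs ES {info["version"]["number"]}'


def check_node(base_url):
    """Send GET / to a running node. Only call this against a live server."""
    with urlopen(base_url, timeout=5) as response:
        return summarize(response.read().decode("utf-8"))


# Hypothetical, trimmed response used so the sketch runs without a cluster.
sample = '{"name": "es-node-1", "cluster_name": "my-es", "version": {"number": "8.1.0"}}'
print(summarize(sample))
```

Against a real node you would call something like check_node("http://x.x.x.97:9200"), using the address from network.host and http.port in the configuration above.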
3.3 ES Configuration
Configuration File Introduction
- elasticsearch.yml: ES configuration
- jvm.options: JVM settings used by ES
- log4j2.properties: ES log configuration
This section describes how to configure elasticsearch.yml
The configuration item format can be written in two ways:
path:
  data: /var/lib/elasticsearch
  logs: /var/log/elasticsearch
or
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
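The nested form and the dotted form of a setting describe the same keys. The Python sketch below shows that equivalence by flattening a nested dictionary into dotted keys. The `flatten` function is only an illustration written for this article, not how ES itself parses the file.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. path -> data becomes path.data."""
    flat = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


nested = {"path": {"data": "/var/lib/elasticsearch", "logs": "/var/log/elasticsearch"}}
dotted = {"path.data": "/var/lib/elasticsearch", "path.logs": "/var/log/elasticsearch"}
print(flatten(nested) == dotted)
```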
Common important configuration items are as follows (this is version 8.1) :
There will be more details on the official website
4. Index
4.1. What is an inverted index
An inverted index (also known as a reverse index) is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems.
General index creation method:
[Document -> keywords] mapping (forward index; its disadvantage is low efficiency, because every document has to be scanned)
The following figure
Inverted index:
[Keyword -> documents] mapping: the result of the forward index is reorganized into an inverted index
As shown in the figure below, there are five documents. First, each document is split into words (word segmentation)
Then both the words and the documents are numbered
A matrix is formed from the word set and the document set:
Horizontal: (jobs, (document 1, document 2, document 3)); (Rebus, (document 1, document 2, document 3, document 4))
Vertical: (document 1, (jobs, rebus)); (document 5, (Yellow Chapter, Apple))
This is then converted into an inverted list, as shown below. For example, “Jobs” appears in document 1 at positions 3 and 11, in document 2 at position 7, and in document 3 at position 9
You can also store more information, such as word frequency. For example, “Jobs” appears at positions 3 and 11 of document 1, so its frequency in document 1 is 2
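The idea of an inverted list with positions and word frequency can be sketched in a few lines of Python. This is a toy built for this article, with whitespace splitting standing in for real word segmentation and made-up sample sentences; it is not how Lucene stores its postings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {document id: [positions]}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Whitespace splitting is a stand-in for a real analyzer.
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: "jobs founded apple and jobs left",
    2: "apple rehired jobs",
}
index = build_inverted_index(docs)

for doc_id, positions in index["jobs"].items():
    # The word frequency in a document is simply the number of positions.
    print(f"document {doc_id}: positions {positions}, frequency {len(positions)}")
```

Looking up a word is now a single dictionary access, instead of scanning every document as the forward index requires.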
4.2. Index CRUD operations
A. Initialize index settings
Create tt_index with 5 primary shards and 1 replica
PUT /tt_index/
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
Note here:
number_of_shards can only be specified when the index is created and cannot be changed later
number_of_replicas can be changed later
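Why can the shard count not be changed afterwards? Roughly speaking, ES picks the shard for a document from a hash of its routing value (the document ID by default) modulo the number of primary shards. The Python sketch below illustrates that idea under a loud assumption: `zlib.crc32` is only a stand-in hash chosen for this example, and the real formula in ES uses a different hash and has more detail.

```python
import zlib


def pick_shard(routing, number_of_shards):
    """Simplified routing: hash(routing) % number_of_shards (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(100)]

# With a fixed shard count, the same ID always lands on the same shard.
before = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# If the shard count changed, many IDs would map to a different shard,
# so documents written earlier could no longer be found.
after = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}
moved = [doc_id for doc_id in doc_ids if before[doc_id] != after[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Replicas are just extra copies of a shard and do not take part in this calculation, which fits with number_of_replicas being adjustable at any time.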
B. View index settings
View the settings of a single index
GET /tt_index/_settings
View the settings of multiple indexes
GET /tt_index,tt_index2/_settings
View the settings of all indexes
GET /_all/_settings
C. Add data to an index
Without specifying an ID, ES generates a random ID
POST /tt_index2/_doc/
{
  "name": "dragon",
  "age": "18",
  "like": {
    "food": "apale",
    "color": "white"
  }
}
Specify an ID; if the ID already exists, the document is overwritten
PUT /tt_index2/_doc/2
{
  "name": "dragon",
  "age": "18",
  "like": {
    "food": "apale",
    "color": "white"
  }
}
Specify an ID with _create; if the ID already exists, an error is returned
PUT /tt_index2/_create/2
{
  "name": "dragon",
  "age": "18",
  "like": {
    "food": "apale",
    "color": "white"
  }
}
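The difference between writing with _doc and writing with _create can be modelled with a small in-memory toy. The `ToyIndex` class and `ConflictError` below are hypothetical names for this article; they only imitate the behavior described above (overwrite versus conflict) and do not talk to ES.

```python
class ConflictError(Exception):
    """Stands in for the conflict error ES returns when the ID already exists."""


class ToyIndex:
    def __init__(self):
        self.docs = {}

    def put_doc(self, doc_id, source):
        """Like PUT /index/_doc/<id>: create the document or overwrite it."""
        result = "updated" if doc_id in self.docs else "created"
        self.docs[doc_id] = source
        return result

    def create(self, doc_id, source):
        """Like PUT /index/_create/<id>: refuse to touch an existing document."""
        if doc_id in self.docs:
            raise ConflictError(f"document {doc_id} already exists")
        self.docs[doc_id] = source
        return "created"


index = ToyIndex()
print(index.put_doc("2", {"name": "dragon"}))
print(index.put_doc("2", {"name": "dragon23333"}))
try:
    index.create("2", {"name": "dragon"})
except ConflictError as err:
    print("conflict:", err)
```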
D. Query index data
Query all the data in an index
GET /tt_index2/_search
Count the documents in an index
GET /tt_index2/_count
Get a document by its ID
GET /tt_index2/_doc/2
Filter the index data by conditions
POST /tt_index/_search
{
  "query": {
    "range": {
      "age": {
        "lt": 19
      }
    }
  }
}
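If you build request bodies like this in application code, a tiny helper keeps the nesting in one place. The `range_query` function below is a hypothetical helper written for this article; it only produces the JSON body, using the four range operators gt, gte, lt, and lte, and leaves sending the request to whatever client you use.

```python
import json

ALLOWED_OPERATORS = {"gt", "gte", "lt", "lte"}


def range_query(field, **bounds):
    """Build a search body with a range query on one field."""
    unknown = set(bounds) - ALLOWED_OPERATORS
    if unknown:
        raise ValueError(f"unsupported operators: {sorted(unknown)}")
    if not bounds:
        raise ValueError("at least one bound is required")
    return {"query": {"range": {field: bounds}}}


body = range_query("age", lt=19)
print(json.dumps(body, indent=2))
```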
View how the shards of an index are distributed
GET /_cat/shards/tt_index
E. Update index data
PUT /tt_index2/_doc/2
{
  "name": "dragon23333",
  "age": "18",
  "like": {
    "food": "apale",
    "color": "white"
  }
}
F. Delete index data
Delete the document with the specified ID
DELETE /tt_index2/_doc/2
Delete the documents that meet the conditions
POST /tt_index/_delete_by_query
{
  "query": {
    "range": {
      "age": {
        "gt": 19
      }
    }
  }
}
G. Delete the index
DELETE /tt_index2
5. Mapping
5.1. What is Mapping
Mapping is the process of defining how a document and the fields it contains are stored and indexed.
5.2 Types of Mapping
A. Dynamic mapping
ES is really smart: you can create an index without setting a mapping, and when you insert data, ES infers the mapping automatically
Right after the index is created, the mapping is empty
After data is inserted, the mapping is resolved automatically
Note that once a field has been mapped, its type cannot be changed.
The ES dynamic resolution rules are as follows
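As a rough illustration of the default rules, the Python sketch below guesses the field type that dynamic mapping would pick for a JSON value. It is a simplification written for this article: the `infer_type` helper is hypothetical, and the date check only recognizes a basic ISO-8601 shape, whereas ES checks strings against its configured date formats.

```python
import re

# Simplified stand-in for date detection: 2024-01-15 or 2024-01-15T10:30:00Z
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?")


def infer_type(value):
    """Guess the dynamically mapped type for one JSON value."""
    if isinstance(value, bool):  # check bool first: in Python, bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Other strings become text (ES also adds a keyword sub-field).
        return "date" if ISO_DATE.fullmatch(value) else "text"
    raise TypeError(f"no rule for {type(value).__name__}")


doc = {"name": "dragon", "age": "18", "height": 1.8, "visits": 3, "joined": "2024-01-15"}
for field, value in doc.items():
    print(field, "->", infer_type(value))
```

Note that "age": "18" is a string, so it is mapped as text rather than as a number. Sending 18 without quotes would give long instead.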
B. Explicit mapping
Explicit mapping lets you define exactly how fields are mapped, instead of relying on what ES infers automatically, for example:
- Which string fields should be treated as full-text fields
- Which fields should be numeric, date, or geolocation information
- What is the format of the date type field
- Whether all fields of the document need to be indexed to the _all field (the _all field was removed in 7.0)
A simple example of the mapping is shown below:
5.3. Field Type
Common field types include text, keyword, date, long, double, boolean, and ip. As you can see, there is no string type. Strings are mapped as text, and all fields of type text are full-text indexed. The most commonly used numeric types are long and double.
More information about the types can be found in the official documentation
5.4 Mapping related APIs
Here are some common ones
A. Set the mapping when creating an index
PUT /mapping_index2
{
  "mappings": {
    "properties": {
      "age": { "type": "integer" },
      "email": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}
B. View the mapping of an index
GET /mapping_index2/_mapping
C. Add a field to an existing mapping
PUT /mapping_index2/_mapping
{
  "properties": {
    "add-field": {
      "type": "keyword",
      "index": false
    }
  }
}
6. Aliases
6.1. Aliases for indexes
An alias is a secondary name for a group of data streams or indexes. Most Elasticsearch APIs accept an alias in place of a data stream or index name.
The data stream or index of the alias can be changed at any time. If you use an alias in your application’s Elasticsearch request, you can reindex the data without downtime or changes to your application’s code. The diagram below:
A. Create an alias
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    }
  ]
}
B. Delete the alias
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    }
  ]
}
C. Point the alias to a different index in a single operation
POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "logs-nginx",
        "alias": "logs"
      }
    },
    {
      "add": {
        "index": "logs-nginx2",
        "alias": "logs"
      }
    }
  ]
}
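When a reindex job needs to switch an alias from the old index to the new one, the request body can be generated rather than written by hand. The `swap_alias` helper below is a hypothetical function for this article; it only builds the body for POST _aliases, with the remove and add actions in one request so the switch happens in a single operation.

```python
import json


def swap_alias(alias, old_index, new_index):
    """Build the POST _aliases body that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = swap_alias("logs", "logs-nginx", "logs-nginx2")
print(json.dumps(body, indent=2))
```

Applications that only ever query the logs alias never need to know which physical index is behind it.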
D. Add an alias when creating an index
PUT index-aliases-1
{
  "aliases": {
    "index-aliases": {}
  }
}
E. View the alias
View the aliases of all indexes
GET _alias
View the aliases of the specified index
GET index-aliases-1/_alias
View the index corresponding to the specified alias
GET _alias/index-aliases
6.2. Aliases for fields
alias is also one of the field types.
An alias mapping defines an alternative name for a field in an index. The alias can be used in place of the target field in search requests.
The target of an alias has some limitations:
1. The target must be a concrete field, not an object or another field alias.
2. The target field must already exist when the alias is created.
3. If nested objects are defined, the field alias must have the same nested scope as its target.
In addition, a field alias can have only one target. This means that a field alias cannot be used to query multiple target fields in a single clause.
Define an alias field for a field in the mapping
PUT trips
{
  "mappings": {
    "properties": {
      "distance": {
        "type": "long"
      },
      "route_length_miles": {
        "type": "alias",
        "path": "distance"
      },
      "transit_mode": {
        "type": "keyword"
      }
    }
  }
}
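To see how the limitations on alias targets play out, here is a small Python sketch that resolves a field alias against the properties of a mapping. The `resolve_field` helper is hypothetical and only checks two of the rules listed above on a flat mapping: the target must exist, and it must be a concrete field rather than an object or another alias.

```python
def resolve_field(properties, name):
    """Return the concrete field that a field name or alias points to."""
    field = properties[name]
    if field.get("type") != "alias":
        return name  # already a concrete field
    target_name = field["path"]
    target = properties.get(target_name)
    if target is None:
        raise ValueError(f"alias target {target_name!r} does not exist")
    if target.get("type") in ("alias", "object") or "properties" in target:
        raise ValueError("an alias must point to a concrete field")
    return target_name


properties = {
    "distance": {"type": "long"},
    "route_length_miles": {"type": "alias", "path": "distance"},
    "transit_mode": {"type": "keyword"},
}
print(resolve_field(properties, "route_length_miles"))
```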
Of course, alias fields are not supported in some APIs
End of this chapter
Continuous love, continuous curiosity — Uncle Long