ElasticSearch is an example of how to use the REST API. You will learn about ElasticSearch in the following sections: document, index, cluster, node, and sharding

ElasticSearch term

Indexes and documents are more logical concepts, while nodes and sharding are more physical concepts.

First, what is a document?

Document

ElasticSearch (ES for short) is document-oriented, and a document is the smallest unit of all searchable data.

Here are a few examples to help you visualize what a document is:

  • Log entries in log files
  • Details about a movie, details about a record
  • A song on an MP3 player, a specific content in a PDF document
  • One customer data, one commodity classification data, one order data

You can think of a document as a record in a relational database.

In ES, documents are serialized to JSON format and stored in ES. JSON objects consist of fields, each of which has a corresponding field type (string/array/Boolean/date/binary/range type).

In ES, each document has a Unique ID, which can be specified or generated automatically through ES.

In the last article, how to build ELK real-time log analysis platform by hand, we talked about importing data into ES through Logstash. Part of the test data set and the corresponding converted format are shown as follows:

movieId,title,genres
193585,Flint (2017),Drama
193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
193609,Andrew Dice Clay: Dice Rules (1991),ComedyCopy the code

We read the movie data of RowData one by one from the CSV file of the test data set, and then transform it into ES through the Logstash transformation, which is the JSON format.

JSON each field has its own data type, ES can help you automatically calculate a data type, and ES also supports arrays and nesting of data.

Each document has a corresponding metadata, which is used to annotate the relevant information of the document. Let’s see what the metadata contains:

{" _index ":" movies ", "_type" : "_doc", "_id" : "2035", "_score" : 1.0, "_source" : {" title ": "Blackbeard's Ghost", "genre" : [ "Children", "Comedy" ], "id" : "2035", "@version" : "1", "year" : 1968 } }Copy the code

_index indicates the index name to which the document belongs. _type indicates the name of the type to which the document belongs. _ID is the unique document ID. _source is the original JSON data of the document. The _source field is returned by default when searching the document. @version indicates the version information of the document, which can be used to solve the problem of version conflict. _score is the correlation score, which is the score of the document in this query.

Having introduced the document, let’s take a look at the index:

Index

Index is simply a collection of similar structure document, such as there can be an index of the customer, commodity classification index, order index, the index has a name, an index that can contain a lot of documents, an index is a class of similar or the same document, such as establishing a commodity index, the goods may be stored inside the all data, That is, all product documentation. Each index is its own Mapping definition file, which is used to describe and contain the types of document fields. Shard represents the concept of physical space, and the data in the index is scattered on the Shard.

In an index, you can set Mapping and Setting, Mapping defines the type structure of all document fields in the index, Setting mainly specifies how many shards to use and how the data is distributed.

An index can mean different things in different contexts. For example, in ES, an index is a collection of documents, which is a noun. The process of saving a document to ES is also called indexing. Dropping ES, indexing, b-tree indexing, or inverted indexing is an important data structure in ES, covered in a future article.

Next, we’ll talk about types:

Type (Type)

Prior to 7.0, it was possible to set multiple Types for each index, and each Type would have the same structure of documents, but since 6.0, Type has been abolished, and since 7.0, only one Type could be created for each index, _doc.

Each index can have one or more types. Type is a logical data classification in the index, and documents under a Type all have the same Field. For example, blog system has an index that can define user data Type, blog data Type, comment data Type, etc.

So far, we’ve looked at documents, indexes, and types. What is clustering? What is a node? What is sharding?

Let’s start with the concept of clusters.

Cluster

ES cluster is actually a distributed system, which needs to meet high availability. High availability refers to the normal operation of the whole service when the node service stops responding in the cluster, that is, service availability. In other words, if some nodes in the whole cluster are lost, there will be no data loss, that is, data availability.

When the user requests more and more, the increase of data is more and more, the system needs to disperse the data to other nodes, finally to achieve horizontal expansion. When a node in the cluster becomes faulty, services in the whole cluster are not affected.

The default name is ElasticSearch. You can change the name in the configuration file or run the -e cluster.name=wupx command to specify the name.

An ES cluster has three colors to indicate health:

  • Green: Master shards and replicas are allocated normally
  • Yellow: Primary fragments are normally allocated, but duplicate fragments are not normally allocated
  • Red: Primary sharding failed to be allocated (for example, a new index was created when the server’s disk capacity exceeded 85%)

Now that we know about clusters, let’s take a look at what nodes are.

Node (Node)

A node is essentially an ES instance, which is essentially a Java process. Multiple ES processes can run on a machine, but production environments generally recommend running only one ES instance on a machine.

Each node has its own name, which is very important (for o&M management operations) and can be specified in the configuration file or during startup. -e node.name=node1 After each node is started, it is assigned a UID and stored in the data directory.

The default node will be added to a cluster named ElasticSearch. If you start many nodes, they will automatically form a elasticSearch cluster. A node can form a elasticSearch cluster.

Eligible Master Nodes

Master: false Is displayed in the configuration file. Master-eligible nodes are eligible to join the main selection process and become Master nodes. When the first node starts, it elects itself as the Master node.

The cluster status is stored on each node. Only the Master node can modify the cluster status. If any node can modify the cluster status, data inconsistency will occur.

Cluster State: Maintains necessary information about a Cluster, including the following information:

  • All node information
  • All indexes and their associated Mapping and Setting information
  • Fragmented routing information

Now what are Data nodes and Coordinating nodes?

Data Node & Coordinating Node

As the name implies, the Node that can store Data is called Data Node, which is responsible for storing all Data stored in the fragment. When the cluster cannot store the existing Data, Data nodes can be added to solve the storage problem, which plays a crucial role in Data expansion.

A Coordinating Node is responsible for receiving Client requests, distributing them to the appropriate nodes, and finally aggregating the results back to the Client. By default, each Node functions as a Coordinating Node.

There are other node types that you can look at:

Other Node types

  • Hot & Warm nodes: Hot nodes are high-configuration nodes with better disk throughput and CPU, while Warm nodes store older nodes with lower machine configurations. Data nodes with different hardware configurations are used to implement the Hot & Warm architecture and reduce the cost of cluster deployment.
  • Machine Learning Node (Machine Learning Node) : Responsible for running Machine Learning work, used to do anomaly detection.
  • Tribe Nodes: Connect to different ES clusters and allow them to be treated as a single cluster.
  • Ingest Node: Preprocessing operations allow some conversion and enrichment of data through a predefined series of processors and pipelines before indexing the document, that is, before writing the data.

Each node will decide what role to play at startup by reading the elasticSearch.yml configuration file, so let’s configure the node type.

Configuring the Node Type

A node can play multiple roles in a development environment.

A production environment should have dedicated nodes for a single role.

So with nodes out of the way, what is sharding?

Shard

Since a single machine cannot store a large amount of data, ES can split the data in an index into multiple shards and store them on multiple servers. Sharding allows you to scale horizontally, store more data, distribute search and analysis across multiple servers, and improve throughput and performance.

The relationship between index and sharding is shown in the figure above. An ES index contains many sharding, and a sharding is a Lucene index, which itself is a complete search engine and can independently perform index building and search tasks. The Lucene index in turn consists of a number of segments, each of which is an inverted index. Each ES Refresh generates a new fragment containing data for several documents. Within each section, the different fields of the document are individually indexed. The value of each field consists of a number of words (Term), which is the end result of the original text content after lexical segmentation and language processing (for example, punctuation removal and conversion to a root).

Shards are divided into two types: Primary Shard and Replica Shard.

Primary fragmentation is mainly used to solve the problems of the scale, through the main subdivision, data can be distributed to all the nodes on the cluster, a primary shard is a running instances of Lucene, when we are in the create index ES, you can specify subdivision number, but the main subdivision number specified in the index creation time, later not allowed to change, Unless modified using Reindex.

Copy sharding is used to solve the problem of high availability of data, that is, when a node in the cluster fails hardware, data will not be lost through the way of copy, because the copy sharding is the copy of the master fragment. The number of copy sharding can be dynamically adjusted in the index. By increasing the number of copies, The performance of service queries (read throughput) can be improved to some extent.

Here’s an example of how master and replica sharding can spread data across different nodes in a cluster:

PUT /blogs
{
    "settings" :{
        "number_of_shards" : 3,
        "number_of_repicas" : 1
    }
}Copy the code

The blogs index is defined above, where number_of_shards in Settings indicates that the number of primary shards is 3, and number_of_repicas indicates that there is only one copy.

The figure above is a cluster of WUPX, which has a total of 3 nodes. According to the configuration of blogs index above, when data comes in, ES will disperse the master shard to three nodes and the copy of each shard to other nodes. When some nodes in the cluster fail, In ES, there will be a mechanism of failover, which will be explained in future articles. In the figure above, we can see that three master shards are scattered to three nodes. If a node is added to the cluster at this time, can the availability of the system be increased?

With that in mind, let’s take a look at the sharding setup:

Sharding setup

Shard setup in a production environment is very important, need a lot of capacity planning in advance, because the Lord shard in index creation need to preset, and cannot be modified after the event, in the example above, an index was divided into 3 main fragmentation, even more nodes are added, the cluster index can only scattered on the three nodes.

Too many sharding Settings will also bring side effects. On the one hand, it will affect the scoring of search results and the accuracy of statistical results. On the other hand, too many sharding Settings on a single node will lead to resource waste and affect performance. From version 7.0, the default master shard number for ES was changed from 5 to 1, which also solved the over-sharding problem.

After understanding the terminology of ES, let’s make an analogy with the relational database we are familiar with, so that we can understand.

RDBMS & ES

I’m sure you’re familiar with relational databases (RDBMS), so let’s make an analogy between relational databases and ES easier for you to understand:

From the table, it is easy to see that the relational database and ES have the following correspondence:

  • A Table in a relational database compares an Index in ES
  • Each Row in a relational database corresponds to a Document in ES
  • Field (Column) in a relational database corresponds to field (Filed) in ES
  • Table definitions in relational databases correspond to mappings in ES
  • Relational databases can be queried through SQL, and DSL is also provided in ES

ES is suitable for full-text retrieval or scoring of search results, but relational databases and ES are used in combination when the requirements for data transactionality are high.

In order to facilitate the integration of other languages, ES provides REST API to call other programs. When our program needs to integrate with ES, we only need to send HTTP request to get corresponding results. Next, the basic API will be introduced:

REST API

To open Kibana, first open the Management menu of Kibana, which provides index Management. You can see indexes in index Management, which are movies indexes imported from the previous article. Click indexes, and you can see the Setting and Mapping information of indexes. How to set this up will be covered in a later article.

Without further ado, let me show you the REST API:

Then open Kibana Dev Tools, movies as the index. Now enter GET Movies and click Execute to view the information related to the index, including the Mapping and Setting of the index.

Enter GET movies/_count and click Execute to see the total number of documents indexed as follows:

{
  "count" : 9743,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}Copy the code

Enter the following code

POST movies/_search
{
  
}Copy the code

Click Execute to view the top 10 documents and understand the document format.

GET /_cat/indices/mov*? V&s =index, you can view the matched index.

Using the GET / _cat/indices? V&s =docs.count:desc, which can be sorted by the number of documents.

Using the GET / _cat/indices? V&health =green, you can view the index whose status is green.

Using the GET / _cat/indices? V&h = I, TM&S = TM :desc, you can view the memory occupied by each index.

ES also provides an API to check the health status of a cluster. Run the GET _cluster/health command to check the health status of a cluster.

{ "cluster_name" : "wupx", "status" : "green", "timed_out" : false, "number_of_nodes" : 2, "number_of_data_nodes" : 2, "active_primary_shards" : 10, "active_shards" : 10, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 0, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_OF_IN_FLight_FETCH" : 0, "task_max_WAITing_IN_queue_millis" : 0, "ACTIVE_SHARDS_PERCENT_AS_number" : 100.0}Copy the code

You can see that the cluster name is Wupx and the cluster status is green. There are two nodes in total, both of which are Data nodes. There are also 10 master shards.

That’s the REST API, and you can explore the REST for yourself.

If you are careful, you will notice how Kibana has become a Chinese interface. In fact, After version 7.0, Official cabin localization resource file (located in node_modules Kibana directory/x – pack/plugins/translations/translations), you can change in the config directory Kibana. Yml file, Add the configuration item i18N. locale: “zh-cn” to the file, and then restart Kibana.

conclusion

, indexes, this paper mainly studied the document cluster, node, such as concept, understanding to each node in each cluster can assume different roles, also know what is the main copy of fragmentation and lamination and they played a role in a distributed system, also by analogy, and relational databases do let it be easier to understand, it also introduced how to use REST API, Finally, I give you a mind map of ES terminology summarized by myself. The source file of mind map can be obtained at the public account “Wu Peixuan reply” ES.

reference

Elasticsearch technical Tutorial

Source Code optimization for Elasticsearch

Elasticsearch core technology and actual combat

Elasticsearch Top Master series

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/index.html