Basic concepts

Elasticsearch has a few core concepts, so take a few minutes to learn about them.

NRT

Near real time (NRT) means that there is a short delay, about one second by default, between the moment a document is written to Elasticsearch and the moment it can be searched.
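To make the delay concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch code: writes land in a buffer and only become searchable after a refresh, which Elasticsearch performs roughly once per second by default. The class name `TinyIndex` and its methods are hypothetical.

```python
class TinyIndex:
    """Toy model of near-real-time search (hypothetical, not the Elasticsearch API)."""

    def __init__(self):
        self._buffer = []       # written but not yet searchable
        self._searchable = []   # visible to search

    def write(self, doc):
        # A write is accepted immediately but is not yet visible to search.
        self._buffer.append(doc)

    def refresh(self):
        # Elasticsearch does this periodically; here it is called by hand.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [d for d in self._searchable if word in d]


index = TinyIndex()
index.write("hello elasticsearch")
before = index.search("hello")   # no refresh has happened yet
index.refresh()
after = index.search("hello")    # the document is now searchable
print(before, after)
```

The gap between `write` and `refresh` is the "near" in near real time.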

Cluster

A cluster contains one or more nodes. A node decides which cluster to join by the cluster name (the default is elasticsearch), so a wrong cluster name has serious consequences: the node may join the wrong cluster. Use a different name for each environment, for example ES-Dev for development, ES-test for testing, ES-STG for staging, and ES-Pro for production. For small and medium-sized applications, a cluster can consist of a single node.

Node

A single running instance of the Elasticsearch server is called a node, and it is part of a cluster. Each node has its own name, which the cluster uses for management and for communication between nodes. A node can belong to only one Elasticsearch cluster. Together, the nodes of a cluster provide full data storage, indexing, and search functions.

Shard

A shard is a single Lucene index. The storage capacity of one machine is limited (for example, 1 TB), while the data in an Elasticsearch index can be very large (petabyte scale, or 30 GB written per day), so one machine cannot hold it all and the data must be distributed across multiple servers. Sharding is how Elasticsearch scales out: it stores more data, spreads search and analysis operations across multiple servers, and improves the overall throughput and performance of the cluster. You only specify the number of shards when you create the index; Elasticsearch handles the rest. Once the index is created, the number of shards cannot be changed.
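Why the shard count is fixed can be sketched in a few lines of Python. Elasticsearch chooses the shard for a document by hashing a routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The sketch below uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration only; Elasticsearch uses its own hash function, and `pick_shard` is a hypothetical helper.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number (stand-in hash, for illustration)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"order-{i}" for i in range(1000)]

# Placement with 5 primary shards versus 6.
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Documents that would be looked up on a different shard after the change.
moved = [d for d in doc_ids if with_5[d] != with_6[d]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

With a different shard count, many IDs map to a different shard, so existing documents would no longer be found where they were written.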

Replica

The relationship between a shard and its replicas is one-to-many: a shard can have one or more replicas, and all replicas of the same shard hold identical data. As copies of the shard, replicas serve three purposes:

  1. If a shard fails or its node goes down, one of its replicas can be promoted to take its place.
  2. Replicas provide redundancy, so data is not lost and the cluster stays highly available.
  3. Replicas also serve search requests, which improves the throughput and performance of the cluster.

"Shard" is short for primary shard, and "replica" is short for replica shard. The number of primary shards is specified when the index is created and cannot be changed afterwards; the number of replica shards can be changed later. By default (in the 6.x releases and earlier; from 7.0 the default is 1 primary shard), each index has 5 primary shards and 1 replica per primary, that is, 5 primary shards and 5 replica shards, 10 shards in total. Because a replica is never placed on the same node as its primary, the minimum high-availability configuration for Elasticsearch is 2 servers.
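The arithmetic above can be checked with a short Python sketch. `number_of_shards` and `number_of_replicas` are the index settings that hold these two values; the helper `total_shards` and the two-node layout are hypothetical and only illustrate the idea.

```python
import json


def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


# Settings body of the kind sent when an index is created.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
print(json.dumps(settings))

s = settings["settings"]
print(total_shards(s["number_of_shards"], s["number_of_replicas"]))  # 5 + 5 = 10

# Illustrative two-node layout: each replica sits on the node that does
# not hold its primary, so losing either node leaves a full copy of the data.
layout = {
    shard: {"primary": f"node-{shard % 2}", "replica": f"node-{(shard + 1) % 2}"}
    for shard in range(s["number_of_shards"])
}
assert all(c["primary"] != c["replica"] for c in layout.values())
```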

Index

An index is a collection of documents with the same structure. It was originally comparable to a database instance in a relational database; since types were deprecated in version 6.0.0, an index is closer to a database table. Multiple indexes can be defined in a cluster, such as a customer information index, a commodity index, an order index, and a review index, each with its own data structure. Index names must be lowercase. The index name is used when creating the index and when searching, updating, and deleting its data.

Type

A type was originally a logical subdivision within an index. In practice, teams that set standards for readability and maintainability rarely split one index into several types (for example, one index holding both order data and review data), so the concept was deprecated in version 6.0.0.

Document

A document is the smallest unit of data storage in Elasticsearch. It is stored as JSON and is comparable to a record (one row of data) in a relational database table. Its structure is flexible.
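As an illustration, two documents in the same index might look like the following. The field names are invented for this example; the point is only that each document is a JSON object and that documents need not share an identical set of fields.

```python
import json

# Two hypothetical documents; field names are made up for illustration.
order_a = {"order_id": 1001, "customer": "alice", "total": 59.9}
order_b = {"order_id": 1002, "customer": "bob", "total": 12.5, "coupon": "SPRING"}

for doc in (order_a, order_b):
    # Each document is serialized as one JSON object, comparable to one table row.
    print(json.dumps(doc))

# Fields present in the second document but not in the first.
extra_fields = set(order_b) - set(order_a)
print(extra_fields)
```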

How it works

Take a quick look at how Elasticsearch works.

Startup process

When an Elasticsearch node starts, it runs a discovery process to find and connect to the other nodes in the cluster. If the cluster already exists, one of its nodes holds a special role, the master node, which is responsible for managing the cluster state; when the new node joins, the cluster topology information is updated. If no cluster exists yet, the node that started becomes the master node itself.

The application communicates with the cluster

One node, the master node, manages the cluster state, but this is transparent to the client. The client can send a request to any node it knows. The node that receives the request acts as the coordinating node for that request: it forwards the request to the nodes that hold the relevant data, collects the results those nodes return, merges them, and sends the final response to the client.
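The request flow can be sketched as a scatter-gather step. This is a simplified Python model, not Elasticsearch code: the node names, the shard contents, and the functions `search_shard` and `coordinate` are all hypothetical.

```python
import heapq

# Hypothetical cluster: each node holds one shard of (score, document) pairs.
SHARDS = {
    "node-1": [(0.9, "doc-a"), (0.2, "doc-b")],
    "node-2": [(0.7, "doc-c"), (0.5, "doc-d")],
    "node-3": [(0.8, "doc-e")],
}


def search_shard(node: str, size: int):
    """Each node returns its own best `size` hits."""
    return heapq.nlargest(size, SHARDS[node])


def coordinate(size: int):
    """The node that received the request fans it out and merges the results."""
    partial = []
    for node in SHARDS:                      # scatter: ask every node
        partial.extend(search_shard(node, size))
    return heapq.nlargest(size, partial)     # gather: keep the global top hits


top = coordinate(size=3)
print(top)
```

The client sees a single response and does not need to know which nodes held the matching documents.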

Check the validity of nodes in the cluster

In normal operation, the master node, which manages the cluster state, communicates regularly with the other nodes in the topology to check that each one is working. If a node does not respond within the specified time, the cluster considers it to be down and rebalances itself:

  1. Reassign the primary shards of the failed node. Replicas of those shards exist on other nodes, so one replica of each is selected and promoted to primary.
  2. Allocate new shard copies on the remaining nodes.
  3. Update the topology, so that requests that would have gone to the failed node are routed to healthy nodes.
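The three steps can be sketched in Python. This is a toy model under stated assumptions: the cluster state is a plain dictionary, step 2 is interpreted as placing a new replica copy on a healthy node, and `handle_node_failure` is a hypothetical function, not part of Elasticsearch.

```python
def handle_node_failure(shards: dict, nodes: set, failed: str):
    """Toy rebalance: promote replicas, re-create lost copies, update topology."""
    healthy = sorted(nodes - {failed})        # step 3: topology without the failed node
    new_state = {}
    for shard_id, copy in shards.items():
        primary, replica = copy["primary"], copy["replica"]
        if primary == failed:
            primary, replica = replica, None  # step 1: promote the replica
        elif replica == failed:
            replica = None
        if replica is None:
            # step 2: place a new replica on a healthy node other than the primary
            candidates = [n for n in healthy if n != primary]
            replica = candidates[0] if candidates else None
        new_state[shard_id] = {"primary": primary, "replica": replica}
    return new_state, set(healthy)


shards = {
    0: {"primary": "node-1", "replica": "node-2"},
    1: {"primary": "node-2", "replica": "node-3"},
    2: {"primary": "node-3", "replica": "node-1"},
}
state, nodes = handle_node_failure(shards, {"node-1", "node-2", "node-3"}, "node-1")
for shard_id, copy in state.items():
    print(shard_id, copy)
```

After the call, no shard copy refers to the failed node, and every primary still has a replica on a different node.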

Summary

This section introduced the basic concepts and working principles of Elasticsearch. In the following sections, you will learn how to use Elasticsearch.
