There are some important concepts we need to understand when we start using Elasticsearch. Understanding these concepts will be important when we use Elastic stacks in the future. In today’s post, we’ll cover some of the most important concepts in Elastic stacks.

First, let’s take a look at the following figure:

Cluster

Cluster means Cluster. The Elasticsearch cluster consists of one or more nodes and can be identified by their cluster name. The name of the Cluster can be set in the Elasticsearch configuration file. By default, if our Elasticsearch is already running, it will automatically generate a cluster called “Elasticsearch”. We can customize the name of our cluster in config/ elasticSearch.yml:

An Elasticsearch cluster looks like the following layout:

We can do this by:

GET _cluster/state
Copy the code

To get the status of the entire cluster. This state can only be changed by the master node. The above interface returns the result:

{
  "cluster_name": "elasticsearch"."compressed_size_in_bytes": 1920,
  "version": 10,
  "state_uuid": "rPiUZXbURICvkPl8GxQXUA"."master_node": "O4cNlHDuTyWdDhq7vhJE7g"."blocks": {},
  "nodes": {... },"metadata": {... },"routing_table": {... },"routing_nodes": {... },"snapshots": {... },"restore": {... },"snapshot_deletions": {...}
}
Copy the code

node

Single instance of Elasticsearch. In most environments, each node runs in a separate box or virtual machine. If each elasticSearch.yml configuration file has the same cluster.name and is on the same network, then they automatically form a cluster. A cluster consists of one or more nodes. In my test environment, I was able to run multiple nodes on a single server. In real deployments, most of the time you still need a node running on a server.

According to the functions of Node, it can be divided into the following types:

  • Master-eligible: Eligible to serve as a master node. Once a master node, it can manage Settings and changes for the entire cluster: create, update, and delete indexes; Add or remove nodes; Assign shards to nodes

  • Data: data node

  • Ingest: Data access (e.g. Pipepline)

  • machine learning (Gold/Platinum License

In general, a node can have one or more of these functions. Elasticsearch can be defined on the command line or in the Elasticsearch configuration file (elasticSearch.yml) :

The Node type Configuration parameters The default value
master-eligible node.master true
data node.data true
ingest node.ingest true
machine learning node.ml True (except for OSS distributions)

You can also have a Node do its own functions and roles. If the node configuration parameter above is not configured, then the node can be considered to be a Coordinating node. In this case, it can accept external requests and forward them to the appropriate nodes for processing.

In practice, we can send requests to the data node, but not to the master node.

In the Elastic architecture, the relationship between Data nodes and clusters is described as follows:

Document

Elasticsearch is document-oriented, which means that the smallest unit of data you can index or search for is a document. Documents in Elasticsearch have some important properties:

  • It’s independent. The document contains fields (names) and their values.

  • It can be layered. Think of it as a document within a document. The value of a field can be simple, just as the value of a location field can be a string. It can also contain other fields and values. For example, the location field might contain city and street addresses.

  • Flexible structure. Your document does not depend on a predefined schema. For example, not all events require a description value, so you can omit this field entirely. But it might require new fields, such as the latitude and longitude of the location.

A document is usually a JSON representation of data. JSON over HTTP is the most widely used way to communicate with Elasticsearch, and it is the method we use in this book. For example, events in your party site can be represented in the following documents:

    {
         "name": "Elasticsearch Denver"."organizer": "Lee"."location": "Denver, Colorado, USA"
    }
Copy the code

Many people think of a document as being compared to a relational database, which corresponds to each record in it.

type

A type is a logical container for a document, similar to a table is a container for rows. You put documents with different structures (schemas) in different types. For example, you can use one type to define aggregation groups and another type for events when people gather.

Each type of field definition is called a map. For example, name will map to a string, but the Geolocation field under Location will map to the special geo_point type. (We discuss how to use the geospatial data in Appendix A.) Each field is handled differently. For example, you search for words in the name field and then search for groups by location to find groups located near where you live.

Many people think that Elasticsearch is schemaless. Everyone even thinks that the database in Elasticsearch doesn’t need mapping. In fact, this is a wrong concept. You don’t need to define a table in a type relational database to use the database. In Elasticsearch, we don’t have to start defining a mapping and just write to the index we specify. The index mapping is dynamically generated. Each data type of the data item is dynamically identified. Such as time, string, etc., although some data types still need to be manually adjusted, such as geo_point and other geolocation data. In addition, it also has another meaning. For the same type, we may add new data items in the future data input to produce a new mapping. This is also dynamically adjusted.

For some reason, after Elasticsearch 6.0, an Index can only contain one Type. The reason for this is that fields with the same name in different mapping types of the same index are the same; In the Elasticsearch index, fields with the same name in different mapping types are supported by the same field in Lucene. The default is _doc. In future 8.0 releases, type will be completely removed.

index

In Elasticsearch, an index is a collection of documents.

Each Index consists of one or more documents, and these documents can be distributed among different shards.

Many people think of index as similar to a database in a relational database. There is some truth in this, but it is not quite the same. One important reason is that documents in Elasticsearch can have object and nested structures. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards.

For a multi-node cloud storage like index, it looks more like this:

Every time a document comes in, it is automatically hash based on the id of the document and stored in the calculated shard instance. This result ensures that all shards have balanced storage and some shards are not too busy.

shard_num = hash(_routing) % num_primary_shards
Copy the code

By default, the _routing above is both the _id of the document. With routing involved, these documents may only be stored in a particular shard. The advantage is that in some cases we can quickly synthesize the results we need without having to get requests across nodes. We can also see from the above formula, our shard number can not be dynamically modified, otherwise the corresponding shard number will not be found later. It is important to note that the replica number can be modified dynamically.

shard

Because Elasticsearch is a distributed search engine, indexes are typically split into elements called shards distributed across multiple nodes. Elasticsearch automatically manages the sorting of these shards. It also rebalances sharding as needed, so users don’t have to worry about details.

An index can store a large amount of data beyond the hardware limit of a single node. For example, an index with 1 billion documents takes up 1 TERabyte of disk space, while no node has that much disk space. Or a single node processes a search request and responds too slowly.

To solve this problem, Elasticsearch provides the ability to split an index into multiple pieces, called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent “index” that can be placed on any node in the cluster.

Sharding is important for two main reasons:

  • Allows you to split/expand your content horizontally

  • Allows you to perform distributed, parallel operations on shards (potentially, on multiple nodes), thereby improving performance/throughput

There are two types of shards: primary shard and Replica Shard.

  • Primary shard: Each document is stored in a Primary SHAr. When indexing a document, it first indexes on the Primary SHAr and then on all copies of the shard. An index can contain one or more primary shards (default is 5). This number determines the scalability of the index relative to the size of the index data. After an index is created, the number of primary shards in the index cannot be changed.

  • Replica Shard: Each master shard can have zero or more copies. A replica is a copy of the master shard and serves two purposes: increase failover: If the master fails, the replica shard can be promoted to master

Improved performance: GET and search requests can be handled by the master or replica shard.

By default, there is one copy for each master shard, but the number of copies can be dynamically changed on existing indexes. Replica shards are never started on the same node as their master shard.

The figure below shows 5 shards and 1 replica in an index

We can set the corresponding shard value for each index:

curl -XPUT http://localhost:9200/another_user? pretty -H'Content-Type: application/json' -d ' { "settings" : { "index.number_of_shards" : 2, "index.number_of_replicas" : 1 } }'
Copy the code

For example, in the REST interface above, we set 2 shards and one replica for the index another_user. Once the number of shards is set, we cannot change it. This is because Elasticsearch will allocate the document to the shard based on the document ID and the number of shards. If this number is changed later, the corresponding shard may not be found each time a search is performed.

We can view the Settings in our index through the following interface:

curl -XGET http://localhost:9200/twitter/_settings? prettyCopy the code

Above we can get the setting information of The Twitter index:

{
  "twitter" : {
    "settings" : {
      "index" : {
        "creation_date" : "1565618906830"."number_of_shards" : "1"."number_of_replicas" : "1"."uuid" : "rwgT8ppWR3aiXKsMHaSx-w"."version" : {
          "created" : "7030099"
        },
        "provided_name" : "twitter"}}}}Copy the code

replica

By default, Elasticsearch creates five master shards and one copy for each index. This means that each index will contain five master shards, and each shard will have one copy.

Allocating multiple shards and copies is the essence of the distributed search functionality design, providing high availability and fast access to documents in the index. The main difference between a master copy and a replica shard is that only the master shard can accept index requests. Both replicas and master shards can provide query requests.

In the figure above, we have an Elasticsearch cluster consisting of two nodes in the default sharding configuration. Elasticsearch automatically arranges and splits five master shards on two nodes. There is one replica shard for each master shard, but the arrangement of the replica shards is completely different from the arrangement of the master shard. Again, think about distribution.

Let’s be clear: Remember that the number_OF_SHards value is related to the index, not the entire cluster. This value specifies the number of shards per index (not the total number of primary shards in the cluster).

We can obtain the health status of an index by using the following interface:

http://localhost:9200/_cat/indices/twitter
Copy the code

The above interface can return the following information:

Further inquiry, we can see that:

If an index is red, it indicates that some nodes in the index are broken and some shards and replicas are inaccessible. If the value is green, it indicates that each SHard in the INDEX is backed up and its backup is successfully replicated in the corresponding Replica shard. If one node is faulty, the replica or primary shard of the corresponding node takes effect so that data is not lost.

Next step: If you want to know more about Lucene’s data stores in Shard, read Elasticsearch: Inverted Index, DOC_values and Source.