
ElasticSearch is a distributed document retrieval engine written in Java. It also works as a document store: instead of keeping data in rows and columns the way a SQL database does, it stores complex data structures serialized as JSON documents. ElasticSearch supports cluster deployment; when a cluster contains multiple nodes, the stored documents are distributed across the cluster and can be accessed from any node.

ElasticSearch is a Lucene-based search server that provides a standard RESTful web interface and is released as open source under the Apache license.

Related terminology conventions

  • Shard
  • Primary shard
  • Replica shard
  • Shard copy: any copy of a shard's data in the index, whether primary or replica
  • Shard allocation
  • Cluster state
  • Allocation decision
  • Allocation awareness
  • Allocation IDs
  • Tracking
  • Translog (transaction log)
  • In-sync set

Understanding ES meta-fields and concepts

In ES, the _index, _type, and _id meta-fields are used to identify a document.

  • _index: a logical namespace that points to one or more physical shards
  • _type: a meta-field that distinguishes different kinds of data stored in the same index
  • _id: a unique identifier for a document within an index

Beginners are often given a Relational Database Management System (RDBMS) analogy to understand these meta-fields: _index corresponds to a database and _type to a table. In ES, however, deleting a _type does not free any storage space, so the field adds little value. For this reason, an index may contain only one _type starting with version 6.x, and the field is removed entirely in 7.x. When releasing data, _index should therefore be used as the unit of deletion.
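As a minimal sketch of what these meta-fields look like in practice (assuming a local single-node cluster at http://localhost:9200 and a hypothetical index name my-index), indexing and fetching a document returns an envelope that carries them:

    import requests

    ES = "http://localhost:9200"  # assumed local test node

    # Index a document under an illustrative index name.
    doc = {"title": "hello elasticsearch", "views": 1}
    print(requests.put(f"{ES}/my-index/_doc/1", json=doc).json())

    # The GET response carries _index and _id (and, on older clusters, a _type)
    # alongside the original document in _source.
    resp = requests.get(f"{ES}/my-index/_doc/1").json()
    print(resp["_index"], resp["_id"], resp["_source"])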

Cluster Status Description

  • Green: all primary and replica shards are allocated and running normally
  • Yellow: all primary shards are running normally, but one or more replica shards are not, so the data has no redundancy (a single point of failure)
  • Red: one or more primary shards are not running normally

Health status also exists at the index level. If a replica shard under one index is lost or unassigned, both that index and the cluster as a whole turn Yellow, while the other indexes keep their original status.
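As a quick way to see this in practice (a minimal sketch assuming a node reachable at http://localhost:9200), the cluster health API reports the overall status and, with level=indices, the per-index status:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Cluster-wide status: green / yellow / red.
    print(requests.get(f"{ES}/_cluster/health").json()["status"])

    # Per-index status shows which index is pulling the cluster to yellow or red.
    indices = requests.get(f"{ES}/_cluster/health", params={"level": "indices"}).json()["indices"]
    for name, info in indices.items():
        print(name, info["status"], "unassigned:", info["unassigned_shards"])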

Shard relationship

To deal with concurrent updates, ES adopts a primary-replica model. The data on the primary shard is treated as authoritative: a write goes to the primary shard first and, once it succeeds there, is forwarded to the replica shards. During data recovery the primary shard's data is likewise used as the baseline.

The relationship between data shards and shard copies is as follows:

In ES, the shard is the basic read and write unit at the lowest level. Its purpose is to split huge index data so that read and write operations can run in parallel. The containment relationship between an index and its shards is therefore: an ES index contains many shards; a shard is a Lucene index; a Lucene index contains many segments; a segment is made up of inverted index data; an inverted index covers a number of documents; and a document is composed of terms, a term being the final result of tokenization and language processing.
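A minimal sketch of how this maps to configuration (assuming a local node and an illustrative index name): the shard and replica counts are index settings, and the _cat/shards API lists the resulting primary (p) and replica (r) copies:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Create an index with 2 primary shards and 1 replica per primary
    # (the index name and the numbers are only illustrative).
    settings = {"settings": {"number_of_shards": 2, "number_of_replicas": 1}}
    requests.put(f"{ES}/my-index", json=settings)

    # List the shard copies: the "prirep" column marks primaries (p) and replicas (r).
    print(requests.get(f"{ES}/_cat/shards/my-index", params={"v": "true"}).text)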

Near real time in ES

The "real time" of ES is actually near real time, at roughly second-level granularity. Newly written data is first held in a temporary in-memory buffer and is only made searchable when that buffer is refreshed into ES memory, which happens at roughly one-second intervals. If you query immediately after writing (within less than a second), the new data may not be returned yet, which often creates the illusion that the write failed.
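A minimal sketch of this behavior (assuming a local node and a hypothetical index name demo): a search right after a write may miss the document, while searching after an explicit refresh (or writing with ?refresh=wait_for) sees it:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    requests.put(f"{ES}/demo/_doc/1", json={"msg": "just written"})

    # Immediately after the write, the document may still be invisible to search.
    print(requests.get(f"{ES}/demo/_search", params={"q": "msg:written"}).json()["hits"]["total"])

    # Force a refresh, then search again: the document is now visible.
    requests.post(f"{ES}/demo/_refresh")
    print(requests.get(f"{ES}/demo/_search", params={"q": "msg:written"}).json()["hits"]["total"])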

ES analyzers

An analyzer splits text into separate terms according to a set of rules. It corresponds to the Analyzer class, an abstract base class whose concrete splitting rules are implemented by subclasses, so different languages need different analyzers. Analyzers are used both when building the index and when searching, and the same analyzer must be used in both places; otherwise searches may come back empty.

Built-in analyzers and tokenizers:

Tokenizer / analyzer       Logical name    Description
standard tokenizer         standard        the default; splits text on word boundaries
edge ngram tokenizer       edgeNGram       emits edge n-grams of each token
keyword tokenizer          keyword         does not split; the whole input becomes a single token
letter analyzer            letter          splits on non-letter characters (word by word)
lowercase analyzer         lowercase       letter tokenizer plus a lowercase filter
ngram analyzer             nGram           emits n-grams of each token
whitespace analyzer        whitespace      splits on whitespace
pattern analyzer           pattern         splits using a configurable regular expression
uax email url analyzer     uax_url_email   like standard, but does not split URLs and email addresses
path hierarchy analyzer    path_hierarchy  handles path-like strings such as path/to/something
Besides the built-in analyzers there are third-party ones, for example the IK analyzer. Many users are not satisfied with the Chinese word segmentation of the official analyzers; IK handles Chinese text noticeably better and solves exactly this problem.
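A minimal sketch for trying analyzers out (assuming a local node; the IK analyzer names apply only if the IK plugin is installed): the _analyze API shows how each analyzer tokenizes the same text:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Compare how two built-in analyzers split the same text.
    for analyzer in ("standard", "whitespace"):
        body = {"analyzer": analyzer, "text": "path/to/something Hello-World"}
        tokens = requests.post(f"{ES}/_analyze", json=body).json()["tokens"]
        print(analyzer, [t["token"] for t in tokens])

    # With the IK plugin installed, the same call works with "ik_smart" or "ik_max_word".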

Once a working environment is available, the relevant ES features and related operations will be demonstrated with practical screenshots.