
ElasticSearch is a distributed document retrieval engine written in Java. It also works as a document store: instead of keeping data in rows and columns the way a SQL database does, it stores complex data structures serialized as JSON documents. ElasticSearch supports cluster deployment; when a cluster contains multiple nodes, the stored documents are distributed across the cluster and can be accessed from any node.

ElasticSearch is a Lucene-based search server that provides a standard RESTful web interface and is released as open source under the Apache license.

Related terminology conventions

  • Shard
  • Primary shard
  • Replica shard
  • Shard copy: any copy of a shard's data in the index, whether primary or replica
  • Shard allocation
  • Cluster state
  • Allocation decision
  • Allocation awareness
  • Allocation IDs
  • Tracking
  • Translog (transaction log)
  • In-sync set

Understanding ES meta-fields and concepts

In ES, the _index, _type, and _id meta-fields are used to identify a document.

  • _index: a logical namespace that points to one or more physical shards
  • _type: a meta-field that distinguishes different kinds of data stored in the same index
  • _id: a unique identifier for a document within an index

Beginners are often given a Relational Database Management System (RDBMS) analogy to understand these meta-fields: _index corresponds to a database and _type to a table. In ES, however, deleting a _type does not free any storage space, so the field adds little value. For this reason, an index may contain only one _type starting with version 6.x, and the field is removed entirely in 7.x. When releasing data, _index should therefore be used as the unit of deletion.
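As a minimal sketch of what these meta-fields look like in practice (assuming a local single-node cluster at http://localhost:9200 and a hypothetical index name my-index), indexing and fetching a document returns an envelope that carries them:

    import requests

    ES = "http://localhost:9200"  # assumed local test node

    # Index a document under an illustrative index name.
    doc = {"title": "hello elasticsearch", "views": 1}
    print(requests.put(f"{ES}/my-index/_doc/1", json=doc).json())

    # The GET response carries _index and _id (and, on older clusters, a _type)
    # alongside the original document in _source.
    resp = requests.get(f"{ES}/my-index/_doc/1").json()
    print(resp["_index"], resp["_id"], resp["_source"])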

Cluster Status Description

  • Green: all primary and replica shards are allocated and running normally
  • Yellow: all primary shards are running normally, but one or more replica shards are not, so the data has no redundancy (a single point of failure)
  • Red: one or more primary shards are not running normally

Health status also exists at the index level. If a replica shard under one index is lost or unassigned, both that index and the cluster as a whole turn Yellow, while the other indexes keep their original status.
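As a quick way to see this in practice (a minimal sketch assuming a node reachable at http://localhost:9200), the cluster health API reports the overall status and, with level=indices, the per-index status:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Cluster-wide status: green / yellow / red.
    print(requests.get(f"{ES}/_cluster/health").json()["status"])

    # Per-index status shows which index is pulling the cluster to yellow or red.
    indices = requests.get(f"{ES}/_cluster/health", params={"level": "indices"}).json()["indices"]
    for name, info in indices.items():
        print(name, info["status"], "unassigned:", info["unassigned_shards"])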

Shard relationship

To deal with concurrent updates, ES adopts a primary-replica model. The data on the primary shard is treated as authoritative: a write goes to the primary shard first and, once it succeeds there, is forwarded to the replica shards. During data recovery the primary shard's data is likewise used as the baseline.

The relationship between data shards and shard copies is as follows:

In ES, the shard is the basic read and write unit at the lowest level. Its purpose is to split huge index data so that read and write operations can run in parallel. The containment relationship between an index and its shards is therefore: an ES index contains many shards; a shard is a Lucene index; a Lucene index contains many segments; a segment is made up of inverted index data; an inverted index covers a number of documents; and a document is composed of terms, a term being the final result of tokenization and language processing.
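A minimal sketch of how this maps to configuration (assuming a local node and an illustrative index name): the shard and replica counts are index settings, and the _cat/shards API lists the resulting primary (p) and replica (r) copies:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Create an index with 2 primary shards and 1 replica per primary
    # (the index name and the numbers are only illustrative).
    settings = {"settings": {"number_of_shards": 2, "number_of_replicas": 1}}
    requests.put(f"{ES}/my-index", json=settings)

    # List the shard copies: the "prirep" column marks primaries (p) and replicas (r).
    print(requests.get(f"{ES}/_cat/shards/my-index", params={"v": "true"}).text)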

Near real time in ES

The "real time" of ES is actually near real time, at roughly second-level granularity. Newly written data is first held in a temporary in-memory buffer and is only made searchable when that buffer is refreshed into ES memory, which happens at roughly one-second intervals. If you query immediately after writing (within less than a second), the new data may not be returned yet, which often creates the illusion that the write failed.
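A minimal sketch of this behavior (assuming a local node and a hypothetical index name demo): a search right after a write may miss the document, while searching after an explicit refresh (or writing with ?refresh=wait_for) sees it:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    requests.put(f"{ES}/demo/_doc/1", json={"msg": "just written"})

    # Immediately after the write, the document may still be invisible to search.
    print(requests.get(f"{ES}/demo/_search", params={"q": "msg:written"}).json()["hits"]["total"])

    # Force a refresh, then search again: the document is now visible.
    requests.post(f"{ES}/demo/_refresh")
    print(requests.get(f"{ES}/demo/_search", params={"q": "msg:written"}).json()["hits"]["total"])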

ES analyzers

An analyzer splits text into separate terms according to a set of rules. It corresponds to the Analyzer class, an abstract base class whose concrete splitting rules are implemented by subclasses, so different languages need different analyzers. Analyzers are used both when building the index and when searching, and the same analyzer must be used in both places; otherwise searches may come back empty.

Built-in analyzers and tokenizers:

Tokenizer / analyzer       Logical name    Description
standard tokenizer         standard        the default; splits text on word boundaries
edge ngram tokenizer       edgeNGram       emits edge n-grams of each token
keyword tokenizer          keyword         does not split; the whole input becomes a single token
letter analyzer            letter          splits on non-letter characters (word by word)
lowercase analyzer         lowercase       letter tokenizer plus a lowercase filter
ngram analyzer             nGram           emits n-grams of each token
whitespace analyzer        whitespace      splits on whitespace
pattern analyzer           pattern         splits using a configurable regular expression
uax email url analyzer     uax_url_email   like standard, but does not split URLs and email addresses
path hierarchy analyzer    path_hierarchy  handles path-like strings such as path/to/something
Besides the built-in analyzers there are third-party ones, for example the IK analyzer. Many users are not satisfied with the Chinese word segmentation of the official analyzers; IK handles Chinese text noticeably better and solves exactly this problem.
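A minimal sketch for trying analyzers out (assuming a local node; the IK analyzer names apply only if the IK plugin is installed): the _analyze API shows how each analyzer tokenizes the same text:

    import requests

    ES = "http://localhost:9200"  # assumed local node

    # Compare how two built-in analyzers split the same text.
    for analyzer in ("standard", "whitespace"):
        body = {"analyzer": analyzer, "text": "path/to/something Hello-World"}
        tokens = requests.post(f"{ES}/_analyze", json=body).json()["tokens"]
        print(analyzer, [t["token"] for t in tokens])

    # With the IK plugin installed, the same call works with "ik_smart" or "ik_max_word".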

Once a working environment is available, the relevant ES features and related operations will be demonstrated with practical screenshots.