This is the 10th day of my participation in Gwen Challenge

This article introduces some basic concepts of Elasticsearch and what a DSL is.

1. Basic concepts

1.1 Near real-time

Elasticsearch is a search service that provides near real-time queries, which means there is a slight delay (about one second) between indexing a document and that document becoming searchable. Later articles in this series will cover how data search (reads) and persistence work in ES, which comes down to how in-memory writes are refreshed and flushed. For now it is enough to know that there is a delay of about one second, and that a client can force a refresh.
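The delay described above can be pictured with a small toy model. This is only a sketch, not how Elasticsearch is implemented: the class `ToyNearRealTimeIndex` and its methods are hypothetical names, and the "refresh" is triggered by hand here, whereas Elasticsearch runs it periodically.

```python
class ToyNearRealTimeIndex:
    """Toy model: documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        # New documents land in the buffer first.
        self._buffer.append(doc)

    def refresh(self):
        # Make everything in the buffer searchable, then empty the buffer.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        # Only documents that survived a refresh can be found.
        return [d for d in self._searchable if word in d.split()]


idx = ToyNearRealTimeIndex()
idx.index("elasticsearch is near real-time")
before = idx.search("elasticsearch")  # nothing yet: no refresh has happened
idx.refresh()                         # forced by the caller in this toy model
after = idx.search("elasticsearch")   # now the document is found
print(before, after)
```

The gap between `before` and `after` is the "about one second" window: the document has been accepted but is not yet visible to queries.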

1.2 Word segmentation and inverted index

To explain inverted indexes, we first need to understand word segmentation (tokenization). As the name implies, segmentation breaks text into words, and different rules produce different results. For example, the commonly used Chinese word segmentation plugin IK offers two modes:

ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” (“The national anthem of the People’s Republic of China”) is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”. It exhausts all possible combinations and is suitable for Term Query;

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌” (“People’s Republic of China” and “national anthem”), which is suitable for Phrase queries.

As you can see, the segmentation mode should be chosen to suit the kind of query you want to run.
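To make the difference between fine-grained and coarse-grained segmentation concrete, here is a toy dictionary-based sketch. It is not the IK algorithm: the mini-dictionary, `fine_grained` and `coarse_grained` are all hypothetical, and the sample text is the Chinese phrase for "The national anthem of the People's Republic of China".

```python
# Hypothetical mini-dictionary; a real analyzer ships a far larger one.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
TEXT = "中华人民共和国国歌"


def fine_grained(text, dictionary):
    """Emit every dictionary word found anywhere in the text (overlaps allowed)."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


fine_tokens = fine_grained(TEXT, DICTIONARY)
coarse_tokens = coarse_grained(TEXT, DICTIONARY)
print(fine_tokens)
print(coarse_tokens)
```

The fine-grained mode produces many overlapping tokens, so more individual search terms can hit the text; the coarse-grained mode produces a few long tokens that keep phrases intact.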

Inverted index: suppose we want to retrieve articles by keyword. We first number every article (think of it as assigning page numbers). An inverted index is then like a keyword directory built from the segmentation results, recording which articles contain each word, as follows:

Keyword        Page numbers
movement       1, 2, 3, 5, 7, 8
activity       3, 4
life           1, 2
chenqionghe    8

When we search for articles containing “movement”, we first look it up in the keyword directory, find pages 1, 2, 3, 5, 7 and 8, and then turn directly to those pages to get the content. To search for “movement life”, we first have to split the query into “movement” and “life” and look up each word in the directory separately (which is why segmentation is such an important art in search engines).

The principle of inverted index is actually as simple as that.
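The keyword directory above can be sketched in a few lines of Python. This is a minimal illustration: the page texts are invented so that they reproduce the table, whitespace splitting stands in for a real segmenter, and requiring a page to contain every query word is an assumption of this sketch (a search engine may combine the per-word results differently).

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each keyword to the sorted list of page numbers that contain it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in text.lower().split():  # stand-in for real segmentation
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Split the query into keywords and return pages containing all of them."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Invented page contents, chosen to match the table in the text.
pages = {
    1: "movement life",
    2: "movement life",
    3: "movement activity",
    4: "activity",
    5: "movement",
    7: "movement",
    8: "movement chenqionghe",
}

index = build_inverted_index(pages)
print(index["movement"])                # [1, 2, 3, 5, 7, 8]
print(search(index, "movement life"))   # [1, 2]
```

Looking up a word is a single dictionary access, no matter how many pages exist, which is the whole point of building the directory in advance.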

1.3 Document, Index and Type

Document: Elasticsearch is document-oriented, which means it can store entire objects or documents. It does more than store them, however: it indexes the contents of each document to make it searchable. In Elasticsearch you index, search, sort, and filter documents rather than rows and columns of data. Elasticsearch uses JavaScript Object Notation (JSON) as the document serialization format. In general, we can think of objects and documents as equivalent, but there is a difference. An object is a JSON structure, like a hash, a HashMap, a dictionary, or an associative array, and objects may contain other objects. In Elasticsearch, the term document has a narrower meaning: it refers specifically to the top-level structure, or root object, serialized as JSON, identified by a unique ID, and stored in Elasticsearch.

Indexing: Indexing is the action of storing data in Elasticsearch. In Elasticsearch, documents belong to a type that exists in an index. We can draw some simple comparisons to traditional relational databases:

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

An Elasticsearch cluster can contain multiple indices, each index can contain multiple types, each type can contain multiple documents, and each document can contain multiple fields.

The different meanings of “index”

You may have noticed that the word index has several meanings in Elasticsearch, so it is worth making the distinction here:

  • Index (noun): As mentioned above, an index is like a database in a traditional relational database. It is the place where related documents are stored. The plural of index is indices or indexes.
  • Index (verb): To index a document means to store it in an index (noun) so that it can be retrieved or queried. This is much like the INSERT keyword in SQL, except that if the document already exists, the new document overwrites the old one.
  • Inverted index: Traditional databases add an index, such as a B-tree index, to a specific column to speed up retrieval. Elasticsearch and Lucene use a data structure called an inverted index for the same purpose.
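The "insert or overwrite" behaviour of index (verb) can be sketched with a toy in-memory store. `ToyIndex` and its methods are hypothetical names for illustration only and are not an Elasticsearch client API.

```python
class ToyIndex:
    """Toy stand-in for an index (noun): documents stored by unique ID."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, document):
        """Index (verb): store the document, replacing any existing one with the same ID."""
        created = doc_id not in self._docs
        self._docs[doc_id] = document
        return "created" if created else "updated"

    def get(self, doc_id):
        """Return the stored document, or None if the ID is unknown."""
        return self._docs.get(doc_id)


articles = ToyIndex()
first = articles.index("1", {"title": "Basic concepts"})
second = articles.index("1", {"title": "Basic concepts, revised"})
print(first, second, articles.get("1"))
```

Unlike a plain SQL INSERT, the second call with the same ID does not fail; it replaces the stored document.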

Type: In versions before 6.0, types were used to distinguish different kinds of documents under the same index. Types are being removed, and the include_type_name parameter is the switch for that transition: from version 7.0 it defaults to false, meaning type information is no longer included. In other words, the comparison above should be updated to look like this:

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices   -> Documents -> Fields

1.4 Nodes and Clusters

A node is a running instance of Elasticsearch, which you can think of as a single server. A cluster is a collection of one or more nodes that work together, share data, and provide failover and scaling capabilities, which together give the Elasticsearch service as a whole its high availability.

1.5 Sharding (Shards)

In theory, an index can store an unlimited amount of data, but in practice performance degrades badly at that scale, or the disk capacity of a single machine simply does not allow it. So Elasticsearch provides sharding, similar to that found in MongoDB, which subdivides an index into multiple shards. Each shard is itself a fully functional and independent “index” that can be hosted on any node in the cluster.

Just as sharding deals with rapidly growing data volumes, replication is needed to keep that data safe along the way (and not only then: data safety should always be kept in mind). Elasticsearch allows you to make one or more copies of an index's shards, called replica shards. Replication gives us high availability of data and lets search throughput scale. Note, however, that a replica shard is never allocated to the same node as the primary shard it was copied from.

In summary, each index can be split into multiple shards, and an index can be replicated zero (meaning no replicas) or more times. Once replicated, each index has primary shards (the original shards) and replica shards (copies of the primary shards). Developers can define the number of shards and replicas for each index when the index is created. The number of replicas can be changed dynamically at any time after the index is created, but the number of primary shards cannot simply be changed afterwards.
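One way to see why the primary shard count is hard to change is to look at how a document could be assigned to a shard. The sketch below uses the simplified rule "hash of the routing value modulo the number of primary shards"; `zlib.crc32` is only a stand-in hash chosen for this illustration, and the function names and constants are hypothetical.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 3  # chosen when the index is created
NUMBER_OF_REPLICAS = 1        # copies of each primary; can be changed later


def pick_shard(routing_value, number_of_primary_shards):
    """Choose a primary shard as hash(routing value) % number of primary shards."""
    # zlib.crc32 is a stand-in hash for illustration only.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


def total_shard_copies(primaries, replicas):
    """Each primary shard plus its replica copies."""
    return primaries * (1 + replicas)


shard = pick_shard("doc-1", NUMBER_OF_PRIMARY_SHARDS)
same_shard_again = pick_shard("doc-1", NUMBER_OF_PRIMARY_SHARDS)
copies = total_shard_copies(NUMBER_OF_PRIMARY_SHARDS, NUMBER_OF_REPLICAS)
print(shard, copies)
```

Because the shard number depends on the divisor, changing the number of primary shards would send the same document ID to a different shard, so existing documents could no longer be found where they were stored. Adding replicas does not affect the formula, which fits with replicas being adjustable at any time.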

2. What is DSL?

A domain-specific language (DSL) is a computer language focused on one particular application domain. Here it means the query language of Elasticsearch. The DSL is typically used to access ES directly, rather than manipulating ES data through, say, a Java client that calls its interfaces. The DSL is not the focus of this series, so instead of describing it in detail, I suggest working through the exercises in the Quick Start documentation.
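To give a feel for what a DSL request looks like, here is a minimal sketch that builds a `match` query body as a Python dictionary and prints it as JSON. The helper `match_query` and the field name `title` are hypothetical; the sketch only constructs the request body and does not contact a cluster.

```python
import json


def match_query(field, text):
    """Build the body of a full-text match query for one field."""
    return {"query": {"match": {field: text}}}


# "title" is a made-up field name for this example.
body = match_query("title", "movement life")
serialized = json.dumps(body, indent=2)
print(serialized)
```

A body like this would typically be sent to the `_search` endpoint of an index. The query text is analyzed (segmented) before being looked up, which ties the DSL back to the segmentation and inverted index concepts in section 1.2.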

Conclusion

In this article we have introduced some basic concepts about Elasticsearch and what the DSL language is. In the next article we will start building the project and gradually introduce some of the underlying principles of Elasticsearch.

Links

  • Elasticsearch inverted index principle
  • Elasticsearch Quick Start