Preface
Only a bald head can be strong.
Star: github.com/ZhongFuChen…
I don’t know whether your company uses Elasticsearch, but mine does. Listening to colleagues chat, I couldn’t avoid the parts of the stack I didn’t know, such as "putting data into the engine" and "getting data out of the engine".
If you know nothing about this "engine", you have no idea what they’re talking about. The engine is Elasticsearch, a search engine.
This article is simply an introduction to Elasticsearch, with nothing in-depth about its internals or usage. At the very least, after reading it you’ll know what your colleagues mean when they talk about "the engine".
What is Elasticsearch?
Elasticsearch is a real-time, distributed storage, search, and analytics engine.
There are a few key words there:
- real-time
- distributed
- search
- analysis
We need to understand what "real-time" means here, how Elasticsearch’s distributed architecture works, and how it stores, searches, and analyzes data.
All of these issues will be addressed in this article.
I have written more than 200 original technical articles, and I will write big-data-related articles in the future. If you want to read my other articles, follow my official account: Java3y.
If you think my articles are good and helpful to you, don’t be stingy with your likes!
Why Elasticsearch
Before learning a technology, you should first understand why it is used. So why use Elasticsearch? In our daily development, a database seems able to do the same things (real-time, storage, search, analysis).
The great advantage of Elasticsearch over a database is fuzzy (full-text) search.
Some readers may say: since when can my database not do fuzzy queries? I’ll write you the SQL:
```sql
select * from user where name like '%Java3y%'
```
Doesn’t this search out the content related to the official account Java3y?
Yes, it can. But a `like '%Java3y%'` query cannot use an index, which means a full table scan: once your table is large (say, hundreds of millions of rows), the query will take seconds or longer.
If you are not familiar with database indexing, I suggest you revisit my previous articles. I think I did a pretty good job (hahaha)
GitHub search keyword: “index”
And even if the database does return fuzzy matches, it may give you far more data than you need; maybe the top 50 records are enough.
One more thing: user input is often imprecise. I once typed "ElastcSeach" into Google, and Google still figured out that I meant Elasticsearch.
Elasticsearch is a search engine designed to solve this problem.
- Elasticsearch is very good at fuzzy search (and very fast at it)
- Elasticsearch scores every match, so it can filter out most of the data and return only the highest-scoring records to the user (relevance sorting is supported natively)
- Even imprecise keywords can still find relevant results (it matches related records)
Let’s see how Elasticsearch can do this.
Elasticsearch data structure
As you probably know, if you want queries to take less time, you need an underlying data structure that supports them. For example:
- Searching a tree is usually O(log n)
- Searching a linked list is usually O(n)
- Searching a hash table is usually O(1)
- Different data structures cost different amounts of time, so if you want to find things faster, you need the right underlying data structure to support it.
Fuzzy queries in Elasticsearch are very fast. Let’s see why.
Looking up a record by a "complete condition" is called a forward index; a book’s table of contents is a forward index: it maps a chapter name to a page number.
First we need to know why Elasticsearch’s "fuzzy matching"/"relevance queries" are fast. The key is that when you write data into Elasticsearch, the text is split into terms (word segmentation).
For example, suppose the index at the back of a book shows that the word "algorithm" appears on four pages. Can we find the corresponding pages from the word alone? That is exactly what Elasticsearch does; from such an index we would get something like this:
- algorithm -> 2, 13, 42, 56
This means the word "algorithm" appears on pages 2, 13, 42, and 56. Finding the corresponding records from a single word (an incomplete condition) is called an inverted index.
Take a look at the picture below to get a feel for it:
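The book-index example above can be sketched in a few lines of code. This is a toy illustration of the inverted-index idea, not Elasticsearch’s internal implementation; the function and document names are made up.

```python
# A minimal sketch of an inverted index: map each word (term) to the
# sorted list of "pages" (document ids) it appears on.

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc ids}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy "book": page number -> text on that page.
pages = {
    2: "the algorithm chapter",
    13: "sorting algorithm basics",
    42: "graph algorithm advanced",
    56: "algorithm practice",
    7: "unrelated content",
}

index = build_inverted_index(pages)
print(index["algorithm"])  # → [2, 13, 42, 56]
```

Given only the word "algorithm" (an incomplete condition), one dictionary lookup returns every page it appears on, with no full scan of the book.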
There are so many languages in the world; how does Elasticsearch split their words? Elasticsearch has several built-in analyzers:
- Standard Analyzer: splits text into words and lowercases the tokens
- Simple Analyzer: splits on anything that is not a letter (symbols are dropped) and lowercases
- Whitespace Analyzer: splits on whitespace, no lowercasing
- etc.
Analysis (word segmentation) in Elasticsearch consists of three parts:
- Character Filters: preprocess the raw text (e.g. strip HTML)
- Tokenizer: splits the text into tokens by some rule (e.g. on whitespace)
- Token Filters: process the resulting tokens (e.g. lowercase them)
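The three stages can be mimicked with toy stand-ins. These functions are illustrative only; they are not Elasticsearch’s real character filter, tokenizer, or token filter classes.

```python
import re

def char_filter(text):
    # Character filter: strip HTML tags from the raw text.
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    # Tokenizer: split the text into tokens on whitespace.
    return text.split()

def token_filter(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def analyze(text):
    # The full pipeline: char filter -> tokenizer -> token filter.
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>Hello</b> Elasticsearch World"))
# → ['hello', 'elasticsearch', 'world']
```

The resulting terms are what end up in the inverted index, which is why a search for "hello" can match a document that originally contained `<b>Hello</b>`.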
Elasticsearch was written by English speakers, so the built-in analyzers target English, while our users usually search in Chinese; currently the most widely used Chinese analyzer is IK.
What is the data structure of Elasticsearch? Take a look at the graph below:
We write in a piece of text, and Elasticsearch splits it into terms with the analyzer (Ada/Allen/Sara…). The collection of these terms is called the Term Dictionary, and to get from a term back to its records, the matching document ids are stored in a Posting List.
Because the Term Dictionary holds so many terms, it is kept sorted; a search can then use binary search instead of scanning the whole Term Dictionary.
Even sorted, there are too many terms to keep the whole Term Dictionary in memory, so Elasticsearch adds another layer called the Term Index, which stores only the prefixes of terms. The Term Index is held in memory, so lookups through it are very fast.
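The sorted-dictionary-plus-prefix-index idea can be sketched like this. This is a toy model, assuming a tiny in-memory list; the real structures live inside Lucene and the prefix index there is an FST, not a dict.

```python
import bisect

# Toy "Term Dictionary": a sorted list of terms, searched by binary search.
term_dictionary = sorted(["ada", "allen", "apple", "sara", "sort", "tree"])

# Toy "Term Index": first-letter prefix -> position of the first term
# with that prefix, so binary search can start from a narrower range.
term_index = {}
for pos, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], pos)

def lookup(term):
    # Use the prefix index to pick a starting offset, then binary search.
    lo = term_index.get(term[0], 0)
    pos = bisect.bisect_left(term_dictionary, term, lo=lo)
    if pos < len(term_dictionary) and term_dictionary[pos] == term:
        return pos
    return -1  # term not in the dictionary

print(lookup("sara"))   # → 3 (its position in the sorted dictionary)
print(lookup("zebra"))  # → -1
```

The prefix layer is small enough to keep in memory, while the full dictionary (and the posting lists it points to) can stay on disk.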
The Term Index is stored in memory as an FST (Finite State Transducer), which is extremely memory-efficient. FST has two advantages:
- Small footprint: it reuses the shared prefixes and suffixes of terms in the dictionary, reducing storage space.
- Fast lookup: O(len(str)) query time complexity.
To recap: the Term Index lives in memory as an FST (saving memory space), the Term Dictionary is sorted, and the Posting List is optimized as well.
The Posting List is compressed with Frame Of Reference (FOR) encoding, saving disk space.
The Posting List stores document ids, and when searching we often need intersections and unions of these id sets (for example, multi-condition queries). For these set operations Elasticsearch uses Roaring Bitmaps.
The benefit of Roaring Bitmaps is that they save space and compute intersections and unions quickly.
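The core trick of FOR encoding can be shown in a few lines: a posting list of sorted doc ids is stored as the deltas between neighbours, which are much smaller numbers and therefore need fewer bits. Real Lucene additionally packs the deltas into fixed-bit-width blocks; this toy version only shows the delta transform.

```python
def delta_encode(doc_ids):
    # Replace each sorted doc id with its gap from the previous one.
    prev, deltas = 0, []
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def delta_decode(deltas):
    # Reverse the transform: running sum of the gaps.
    doc_ids, total = [], 0
    for d in deltas:
        total += d
        doc_ids.append(total)
    return doc_ids

posting_list = [73, 300, 302, 332, 343, 372]
encoded = delta_encode(posting_list)
print(encoded)  # → [73, 227, 2, 30, 11, 29]
assert delta_decode(encoded) == posting_list
```

The original ids need up to 9 bits each; most of the deltas fit in 5 bits or fewer, which is where the disk savings come from.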
To summarize Elasticsearch’s data structures: Term Index (FST, in memory) → Term Dictionary (sorted) → Posting List (FOR-compressed document ids, with Roaring Bitmaps for set operations).
Elasticsearch terminology and architecture
Elasticsearch is a distributed storage system. If you have read my earlier articles, the idea of distributed storage will feel familiar.
If you are not familiar with distributed systems, I suggest revisiting my previous articles. I think I explained them pretty well (hahaha)
GitHub search keywords: "SpringCloud", "Zookeeper", "Kafka", "single sign-on"
Before we get into Elasticsearch’s architecture, let’s look at some common Elasticsearch terms:
- Index: an Elasticsearch Index is roughly equivalent to a database Table
- Type: removed in newer Elasticsearch versions (older versions supported multiple Types under one Index, somewhat like multiple consumer groups under one topic in a message queue)
- Document: a Document is roughly equivalent to a row in a database table
- Field: roughly equivalent to a database Column
- Mapping: roughly equivalent to a database Schema
- DSL: roughly equivalent to a database’s SQL (the API we use to read and write Elasticsearch data)
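To make the "DSL is to Elasticsearch what SQL is to a database" analogy concrete, here is what a query body looks like. The `match` query and `size` parameter are standard Elasticsearch DSL sent to the `_search` endpoint; the index and field names (`user`, `name`) are illustrative only.

```python
import json

# SQL:  select * from user where name like '%Java3y%' limit 50
# DSL:  the rough full-text equivalent, POSTed to /user/_search
query_body = {
    "query": {
        "match": {          # full-text match on an analyzed field
            "name": "Java3y"
        }
    },
    "size": 50,             # only return the top 50 hits
}

print(json.dumps(query_body, indent=2))
```

Unlike the SQL `like`, the `match` query goes through the inverted index and returns hits ranked by relevance score.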
With the terminology out of the way, what does Elasticsearch’s architecture look like? Here’s an overview:
An Elasticsearch cluster can have multiple Elasticsearch nodes; a node is simply a machine running the Elasticsearch process.
Among the nodes there is a master node, which maintains index metadata and switches the identities of primary and replica shards; if the master node fails, a new master node is elected.
The outermost unit of Elasticsearch is the Index; the data of one Index can be distributed across different nodes for storage, and this splitting is called sharding.
For example, if the cluster has 4 nodes and I want an Index stored on all 4, I can configure the Index with 4 shards; together, the 4 shards hold all of the Index’s data.
Why shard? The reasons are simple:
- If all of an Index’s data lived on one node, a growing Index could eventually exceed what a single node can store.
- Multiple shards can be written to and queried in parallel (reading and writing on every node improves throughput).
Now the question is: if a node fails, is the data on it lost? Obviously Elasticsearch thought of this too, so shards come in two kinds, primary shards and replica shards (for high availability).
Data is written to the primary shard, the replica shard copies the data from the primary, and reads can be served by either the primary or the replica.
The number of shards and replicas per Index can be set through configuration.
If a node fails, the master node promotes the corresponding replica shards to primary shards, so the data is not lost even when a node goes down.
Elasticsearch’s architecture can be summarized as follows:
Elasticsearch’s write process
We already know that writes go to the primary shard, so let’s look at the details.
The client writes data to the Elasticsearch cluster and the node handles the request:
Every node in the cluster can act as a coordinating node, which means any node can route requests. For example, node 1 receives a request but finds that the data should be handled by node 2 (because the primary shard is on node 2), so it forwards the request to node 2.
- The coordinating node uses a hash to work out which primary shard the document belongs to, then routes to the corresponding node:
shard = hash(document_id) % (num_of_primary_shards)
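The routing formula above can be sketched as follows. Real Elasticsearch hashes the routing value (which defaults to the document id) with murmur3; this toy version substitutes `zlib.crc32` just to show the modulo routing idea.

```python
import zlib

def route(document_id, num_primary_shards):
    # shard = hash(document_id) % num_of_primary_shards
    # (zlib.crc32 stands in for Elasticsearch's murmur3 hash.)
    return zlib.crc32(document_id.encode()) % num_primary_shards

num_primary_shards = 4
shard = route("doc-42", num_primary_shards)
print(f"doc-42 -> primary shard {shard}")
```

Because the shard is derived from the id modulo the primary-shard count, the same document always routes to the same shard; it is also why the number of primary shards cannot be changed after index creation without re-indexing.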
After routing, the node holding the primary shard does roughly the following:
- Write the data into the in-memory buffer
- At the same time, append the operation to the translog buffer
- Every 1s, data is refreshed from the memory buffer into the FileSystemCache, generating segment files; once a segment file exists, the data can be found by search
- After each refresh, the memory buffer is cleared
- Every 5s, the translog buffer is flushed to disk
- Periodically (or when size thresholds are reached), the contents of the FileSystemCache, together with the translog, are flushed to disk; this flush persists the index
To explain:
- Elasticsearch writes data into a memory buffer and refreshes it to the filesystem cache every 1s (data can only be searched after it reaches the filesystem cache). So data written to Elasticsearch takes about 1s before it can be queried.
- Elasticsearch also writes a translog, so that data lost from memory when a node goes down can be recovered. But the translog itself is first written to a buffer that is flushed to disk every 5s, so if an Elasticsearch node dies, up to 5s of data may be lost.
- When the translog file on disk grows large enough, or 30 minutes pass, a commit is triggered: the segment files in the filesystem cache are asynchronously flushed to disk, completing persistence.
Put simply: writes go into a memory buffer (segment files are generated regularly, along with the translog) so that the data becomes searchable and recoverable, and a commit eventually completes the persistence.
After the primary shard has written the data, it sends it in parallel to the nodes holding the replica shards; once all of them report success, an ACK is returned to the coordinating node, which returns an ACK to the client, completing the write.
Elasticsearch update and delete
Both update and delete operations in Elasticsearch work by putting a `.del` mark on the corresponding `doc` record: a delete operation marks the record as `deleted`; an update marks the original `doc` as `deleted` and then writes a new record.
As mentioned above, a batch of segment files is generated every second, so segment files keep accumulating. Elasticsearch has a merge task that combines multiple segment files into one.
During a merge, docs marked as deleted are physically removed.
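The mark-and-merge behaviour above can be sketched with a toy model. This is not Lucene’s on-disk format; `segment`, `write`, `update`, and `merge` are invented names that only illustrate the idea.

```python
segment = []  # toy "segment": an append-only list of doc records

def write(doc_id, body):
    # A plain write just appends a live record.
    segment.append({"id": doc_id, "body": body, "deleted": False})

def update(doc_id, body):
    # An update only MARKS the old version as deleted...
    for doc in segment:
        if doc["id"] == doc_id and not doc["deleted"]:
            doc["deleted"] = True
    write(doc_id, body)  # ...and appends the new version.

def merge():
    # Merging is when marked docs are physically dropped.
    return [doc for doc in segment if not doc["deleted"]]

write("1", "v1")
update("1", "v2")
print(len(segment))       # → 2 (old record still present, just marked)
merged = merge()
print(merged[0]["body"])  # → v2
```

Until a merge runs, both versions physically exist and the old one is simply filtered out of search results by its mark.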
Elasticsearch query
At the simplest level, queries come in two kinds:
- Get: query a doc by ID
- Query: find matching docs by a search term
```java
public TopDocs search(Query query, int n);

public Document doc(int docID);
```
The procedure for querying a doc by ID is:
- Check the translog in memory
- Check the translog on disk
- Check the segment files on disk
The procedure for matching docs against a query is:
- Search the segment files in memory and on disk at the same time
Get (doc by ID) is real-time, while Query (doc by search term) is near real-time:
- because segment files are only generated once per second
Elasticsearch queries can be divided into three kinds:
- QUERY_AND_FETCH (return the whole Doc content right after querying)
- QUERY_THEN_FETCH (query for Doc ids first, then fetch the corresponding documents by those ids)
- DFS_QUERY_THEN_FETCH (collect frequencies for scoring first, then query)
- "DFS here refers to Term Frequency and Document Frequency. As we all know, the higher the frequency, the stronger the relevance."
The one we use most often is QUERY_THEN_FETCH; the first kind (returning the whole Doc content, QUERY_AND_FETCH) is only suitable for requests that target a single shard.
The overall flow of QUERY_THEN_FETCH is as follows:
- The client request lands on some node in the cluster (every node in the cluster can be a coordinating node)
- The coordinating node forwards the search request to all shards (primary or replica shards both work)
- Each shard runs the search locally and returns its results (doc ids) to the coordinating node, which merges, sorts, and paginates the data to produce the final result
- The coordinating node then takes those doc ids, goes back to the relevant nodes to pull the actual document data, and finally returns it to the client
In the Query Phase:
- The coordinating node sends the query to the target shards (the request is forwarded to either primary or replica shards)
- Each data node filters and sorts within its own shard, then returns its doc ids to the coordinating node
The Fetch Phase does the following:
- The coordinating node collects the doc ids returned by the data nodes, aggregates them, locates the shards holding the target data, and sends fetch requests (asking for the full Doc records)
- The data nodes receive the doc ids from the coordinating node, pull the data actually needed, and return it to the coordinating node
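The two phases above can be sketched as a toy simulation. The shard contents, scores, and function names here are all made up for illustration; real Elasticsearch distributes this over the network.

```python
# Toy cluster: each shard maps doc_id -> (document body, relevance score).
shards = [
    {"d1": ("elasticsearch intro", 1.2), "d2": ("java basics", 0.3)},
    {"d3": ("elasticsearch deep dive", 2.1), "d4": ("mysql index", 0.1)},
]

def query_phase(top_n):
    # Query phase: every shard contributes lightweight (doc_id, score)
    # pairs; the coordinating node merges and keeps the top_n by score.
    hits = [(doc_id, score)
            for shard in shards
            for doc_id, (_, score) in shard.items()]
    hits.sort(key=lambda h: h[1], reverse=True)
    return [doc_id for doc_id, _ in hits[:top_n]]

def fetch_phase(doc_ids):
    # Fetch phase: pull the full document only for the winning ids.
    docs = {}
    for shard in shards:
        for doc_id in doc_ids:
            if doc_id in shard:
                docs[doc_id] = shard[doc_id][0]
    return [docs[d] for d in doc_ids]

top_ids = query_phase(top_n=2)
print(top_ids)               # → ['d3', 'd1']
print(fetch_phase(top_ids))  # → ['elasticsearch deep dive', 'elasticsearch intro']
```

Splitting the work this way keeps the query phase cheap (ids and scores only) and defers moving full documents until the final ranking is known.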
Because Elasticsearch is distributed, the data has to be pulled from each node and aggregated before being returned to the client.
Elasticsearch simply does all of this work for us, so we never notice it when we use it.
Finally
This article is a brief introduction to Elasticsearch; there are surely plenty of pitfalls in real-world use, but I’ll leave it here for now.
If there are any mistakes in the article, please kindly point them out. After the new year I will keep publishing introductory big-data articles; if you’re interested, feel free to follow my official account. If you think this article is decent, give me a thumbs-up 👍
References:
- Talk about Elasticsearch’s inverted index
- Why Elasticsearch
- Lucene dictionary implementation principle – FST
- Elasticsearch performance optimization
- Delve into the writing process of Elastic Search
- Elasticsearch kernel parsing – Query
If you want to follow my latest articles and shared resources in real time, you can follow my official account "Java3y".
- 🔥 Massive video resources
- 🔥Java beautiful brain map
- 🔥Java Learning path
- 🔥 Develop common tools
- 🔥 beautifully organized PDF ebook
Reply "888" in the official account to get them!!
This has been included in my GitHub featured articles, welcome to Star: github.com/ZhongFuChen…
Your likes 👍, follows 👥, shares, and comments 💬 really do help me a lot!!