This article is adapted from: neway6655.github.io/elasticsear…

Recently, I took part in designing a solution that uses Elasticsearch as the underlying data layer to provide real-time statistical queries over a large amount of data (hundreds of millions of records). I spent some time studying the basic theory behind Elasticsearch and am sharing these notes for anyone who is interested. If you spot anything incorrect or have questions, please point them out so we can discuss, learn, and improve together.

Introduction

Elasticsearch is a distributed and scalable real-time search and analysis engine.

Elasticsearch is a search engine built on Apache Lucene(TM), a full-text search engine library. But Elasticsearch is more than just Lucene: besides full-text search, it can also do the following:

  • Distributed real-time document storage, with every field indexed and searchable.
  • A distributed search engine for real-time analytics.
  • Scalable to hundreds of servers, processing petabytes of structured or unstructured data.

Basic concepts

Elasticsearch is a document-oriented database that uses JSON as its serialization format. For example:

{
    "name":      "John",
    "sex":       "Male",
    "age":       25,
    "birthDate": "1990/05/01",
    "about":     "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

In Elasticsearch, this is a document. This particular document belongs to a type, such as user, and many different types live in the same index. Here is a simple comparison between Elasticsearch and relational database terminology:

Relational database  ⇒  Databases  ⇒  Tables  ⇒  Rows       ⇒  Columns
Elasticsearch        ⇒  Indices    ⇒  Types   ⇒  Documents  ⇒  Fields

An Elasticsearch cluster can contain multiple indexes (databases), which means it contains many types (tables). These types contain many documents (rows), which in turn contain many fields (columns) within each document.

For example, if you want to insert a record, you can simply send an HTTP request to Elasticsearch:

PUT /megacorp/employee/1
{
    "name" :     "John",
    "sex" :      "Male",
    "age" :      25,
    "about" :    "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

For details on updates and other operations, see Elasticsearch: The Definitive Guide.
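As a small illustration of the same HTTP interface, the sketch below reads the document back and runs a simple full-text query against it. It is a minimal example of my own, assuming an Elasticsearch instance listening on localhost:9200 and the megacorp/employee index created above; the match query is standard Elasticsearch query DSL.

import requests

BASE = "http://localhost:9200"   # assumption: a local Elasticsearch instance

# Fetch the document we just indexed by its id
doc = requests.get(f"{BASE}/megacorp/employee/1").json()
print(doc["_source"]["name"])    # "John"

# Full-text search on the "about" field with a match query
query = {"query": {"match": {"about": "rock climbing"}}}
resp = requests.get(f"{BASE}/megacorp/employee/_search", json=query).json()
print(len(resp["hits"]["hits"]))  # number of matching documents returned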


Indexing

The key to Elasticsearch is the very powerful indexing it provides. InfoQ's article "The secret of time series databases (2)" gives a very good account of this index.

The essence of the Elasticsearch index

Everything is designed to improve search performance

To improve search performance, it is inevitable that something else gets sacrificed, such as insert/update performance; otherwise other databases would be out of a job 🙂

As we saw earlier, inserting a record into Elasticsearch is really just a PUT of a JSON object with several fields (name, sex, age, about, interests). While inserting the data, Elasticsearch also silently builds indexes for these fields at the same time.

How does Elasticsearch index so quickly?

The InfoQ article says that the inverted index used by Elasticsearch is faster than a relational database's B-Tree index.

What is a B-tree index?

Back in college, our teacher taught us that searching a binary tree takes O(log N) time, and inserting a new node does not require moving all the existing nodes, so storing an index in a tree structure balances insertion and query performance.

On that basis, and taking disk access characteristics into account (sequential reads vs. random reads), traditional relational databases adopted the B-Tree/B+Tree data structure:

To improve query efficiency and reduce the number of disk seeks, multiple values in a contiguous range are stored together in one node as an array, so a single seek reads several entries at once and the height of the tree is reduced as well.
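To make the "reduced tree height" point concrete, here is a quick back-of-the-envelope calculation of my own (not from the original article): with 100 million keys, a binary tree needs about 27 levels, while a B+Tree-style node holding on the order of 100 keys brings the height, and hence the number of seeks, down to about 4.

def tree_height(num_keys, fanout):
    # Number of levels a balanced search tree needs when each node holds `fanout` keys.
    height, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height

n = 100_000_000
print(tree_height(n, 2))    # ~27 levels for a binary tree
print(tree_height(n, 100))  # ~4 levels when a node packs ~100 keys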

What is an inverted index?

Continuing with the above example, assume that there are several items of data (excluding the about and interests fields for simplicity):

ID Name Age Sex
1 Kate 24 female
2 John 24 Male
3 Bill 29 Male

Elasticsearch builds the following inverted indexes for them:

Name:

Term Posting List
Kate 1
John 2
Bill 3

Age:

Term Posting List
24 [1, 2]
29 3

Sex:

Term Posting List
female 1
Male [2, 3]

Posting List

A posting list is simply an array of ints storing the ids of all documents that match a given term. For example, in the Age index above, [1, 2] is the posting list for the term 24.
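Here is a toy sketch of my own (not Elasticsearch/Lucene code) of what building such an index looks like: for each field, map every term to the sorted list of ids of the documents containing it.

from collections import defaultdict

docs = {
    1: {"name": "Kate", "age": 24, "sex": "female"},
    2: {"name": "John", "age": 24, "sex": "Male"},
    3: {"name": "Bill", "age": 29, "sex": "Male"},
}

# field -> term -> posting list (array of document ids, kept sorted)
inverted = defaultdict(lambda: defaultdict(list))
for doc_id in sorted(docs):
    for field, term in docs[doc_id].items():
        inverted[field][term].append(doc_id)

print(inverted["age"][24])      # [1, 2]  <- the posting list for term 24
print(inverted["sex"]["Male"])  # [2, 3]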

Don't think this is the end; the good part is just beginning…

Searching through a posting list already looks fast. To find the students with age=24, for example, Xiao Ming, who loves answering questions, raises his hand at once: "I know, I know! The students with ids 1 and 2." But what if there were tens of millions of records? And what if you wanted to look them up by name?

Term Dictionary

To find a given term quickly, Elasticsearch sorts all the terms and locates a term by binary search, giving O(log N) lookups, just like looking a word up in a dictionary; this sorted list of terms is the Term Dictionary. Put that way it sounds a lot like the B-Tree approach of a traditional database, so why is it said to be faster than a B-Tree query?
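A minimal sketch of the term dictionary idea, using Python's bisect as the binary search; the data is the toy name index from above:

import bisect

# Sorted terms of the "name" field, with their posting lists in the same order
terms = ["Bill", "John", "Kate"]
postings = [[3], [2], [1]]

def find_postings(term):
    # Binary-search the sorted term dictionary, O(log N) comparisons.
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

print(find_postings("John"))  # [2]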

Term Index

A B-Tree improves query performance by reducing the number of disk seeks, and Elasticsearch follows the same idea: ideally, search the terms directly in memory and avoid reading the disk at all. But if there are too many terms, the term dictionary becomes very large and keeping it entirely in memory is unrealistic. Hence the Term Index. Like the index pages of a dictionary, which tell you on which page the words starting with a given letter begin, the term index can be understood as a tree:

This tree does not contain all the terms; it contains only some prefixes of terms. Through the term index you can quickly locate an offset into the term dictionary and then search onward from there.

So the term index does not need to store every term, only a mapping from some prefixes to blocks of the term dictionary. Combined with FST (Finite State Transducers) compression, the term index is small enough to be cached in memory. After the term index has located the block of the term dictionary that should contain the term, the term is then looked up on disk within that block, which greatly reduces the number of random disk reads.
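The sketch below illustrates the division of labour: a small in-memory term index stores only a few prefixes and the offset of the term dictionary block where terms with that prefix start, and the block itself is then scanned (on disk in reality, a plain Python list here). This is my own simplified illustration, not Lucene's actual on-disk layout.

# The (large, on-disk) term dictionary: all terms, sorted
term_dict = ["abandon", "able", "about", "mop", "moth", "pop", "star", "stop", "top"]

# The (small, in-memory) term index: prefix -> offset of its block in the dictionary
term_index = {"ab": 0, "mo": 3, "po": 5, "st": 6, "to": 8}

def lookup(term):
    # Use the term index to find the block, then scan only that block of the dictionary.
    prefix = term[:2]
    if prefix not in term_index:
        return None
    for i in range(term_index[prefix], len(term_dict)):  # in reality: one bounded disk read
        if term_dict[i] == term:
            return i                                     # where the term's posting list lives
        if not term_dict[i].startswith(prefix):
            break
    return None

print(lookup("moth"))  # 4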

At this point inquisitive Xiao Ming raises his hand again: "So what exactly is this FST thing?"

Clearly Xiao Ming, like me, did not pay much attention in college; the data structures teacher surely covered FSTs. No matter, I had forgotten it too, so here is a quick refresher:

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output.

Suppose we now want to map mop, moth, pop, star, stop and top (the term prefixes in the term index) to the sequence numbers 0, 1, 2, 3, 4, 5 (the block positions in the term dictionary). The simplest way is to define a Map<String, Integer>, where every term finds its own seat. But if we want to use less memory, is there a better way? The answer is FST (here is the theory, though I am sure 99% of you will not read it).

⭕️ represents a state

--> represents a state transition; the letter/number above the arrow is the input letter and its output weight

Each word is broken into individual letters and expressed through ⭕️ and -->; a weight of 0 is not shown. Where a state branches, the weights are marked on the edges, and the weights along a word's entire path add up to that word's sequence number.


FST stores all the terms as bytes. This kind of compression effectively reduces the storage space so that the term index fits in memory, but it also costs more CPU during lookups.
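To make the weight-summing idea concrete, here is a hand-built toy transducer of my own for the mop/moth/pop/star/stop/top example above. A real FST is built automatically and also shares common suffixes between words; this sketch is only a prefix tree with output weights on its edges, but it shows the key property: the weights along a term's path add up to that term's number.

# Each edge: letter -> (output_weight, next_node); an empty node marks the end of a term
FST = {
    "m": (0, {"o": (0, {"p": (0, {}),
                        "t": (1, {"h": (0, {})})})}),
    "p": (2, {"o": (0, {"p": (0, {})})}),
    "s": (3, {"t": (0, {"a": (0, {"r": (0, {})}),
                        "o": (1, {"p": (0, {})})})}),
    "t": (5, {"o": (0, {"p": (0, {})})}),
}

def lookup(term):
    # Walk the transducer letter by letter, summing the weights on the edges taken.
    node, total = FST, 0
    for letter in term:
        if letter not in node:
            return None
        weight, node = node[letter]
        total += weight
    return total if not node else None  # only a complete path counts as a term

for word in ["mop", "moth", "pop", "star", "stop", "top"]:
    print(word, lookup(word))  # 0, 1, 2, 3, 4, 5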

It gets even more exciting from here; students who are tired of reading can grab a cup of coffee first……


Compression techniques

Besides compressing the term index with FST as described above, Elasticsearch also has compression tricks for the posting list. Having finished his coffee, Xiao Ming raises his hand again: "Doesn't the posting list already store nothing but document ids? Does it still need compression?"

Well, let's go back to the original example. What if Elasticsearch needs to index the students' gender (a point at which traditional relational databases quietly give up…)? If there are millions of students and the world has only two genders, each posting list will contain at least a million document ids. How does Elasticsearch compress these document ids effectively?

Frame Of Reference

Incremental (delta) encoding: turn large numbers into small ones and store them in as few bytes as possible.

First of all, Elasticsearch requires the posting list to be sorted (a requirement that must be met, however demanding, since it improves search performance).

As long as your math was not taught by the PE teacher, this compression trick is fairly easy to see.

The idea is delta encoding: replace the original large numbers with the small increments between them and store only those deltas, then carefully line them up bit by bit and store them in bytes, instead of carelessly spending a full int (4 bytes) even on a value as small as 2.
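Here is a small sketch of the delta idea with made-up doc ids of my own (not Lucene's actual Frame Of Reference encoder): store only the gaps between sorted ids and pack each block with just enough bits for its largest gap.

def deltas(sorted_ids):
    # Replace sorted doc ids with the gaps between consecutive ids.
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

ids = [73, 300, 302, 332, 343, 372]        # a sorted posting list (made-up example)
gaps = deltas(ids)                         # [73, 227, 2, 30, 11, 29]

bits_per_value = max(gaps).bit_length()    # the largest gap, 227, needs 8 bits
print(len(gaps) * bits_per_value)          # 48 bits, i.e. 6 bytes for the whole block
print(len(ids) * 4 * 8)                    # vs. 192 bits if every id were a 4-byte int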

Roaring bitmaps

To explain Roaring Bitmaps, we have to start with bitmaps. A bitmap is a data structure. Suppose we have the following posting list:

[1, 3, 4, 7, 10]

The corresponding bitmap is:

[1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

It is very intuitive: 0/1 marks whether the corresponding value exists; the value 10, for example, maps to the 10th bit, which is set to 1. One byte can thus represent eight document ids. Older versions of Lucene (before 5.0) compressed posting lists this way, but it is still not efficient enough: with 100 million documents you need 12.5MB of storage for just one index field (and we usually index many fields). So someone came up with Roaring Bitmaps, a more efficient data structure.
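A quick sketch of my own of the plain bitmap representation and the space estimate above:

def to_bitmap(posting_list, size):
    # Set bit i for every document id i in the posting list.
    bits = bytearray((size + 7) // 8)
    for doc_id in posting_list:
        bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bits

print(len(to_bitmap([1, 3, 4, 7, 10], size=11)))  # 2 bytes cover ids 0..10

# One bit per document: 100 million documents for a single bitmap
print(100_000_000 / 8)                            # 12,500,000 bytes = 12.5 MB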

The drawback of a bitmap is that its storage space grows linearly with the number of documents. To break this spell, Roaring Bitmaps have to make use of certain exponential characteristics:

Split the posting list into blocks with 65535 as the boundary: the first block holds document ids in the range 0 to 65535, the second 65536 to 131071, and so on. Then represent each id as a <quotient, remainder> pair, so that the ids within every block again fall in the range 0 to 65535. The rest is easy: since the ids within a block can never grow unbounded, we can store them in the most space-efficient way available.

At this point careful Xiao Ming raises his hand again: "Why is 65535 the boundary?"

Besides 1024, 65535 is another classic value in the programmer's world, because it equals 2^16 - 1, exactly the largest number that fits in two bytes, the storage unit of a short. Note the rule: "If a block has more than 4096 values, encode it as a bit set, and otherwise as a simple array using 2 bytes per value." In other words, a dense block is stored frugally as a bitmap, while a sparse block is stored as a plain short[] array, happily spending 2 bytes per value.

So why is 4096 the threshold for choosing between an array and a bitmap?

Within a block, the space a bitmap needs is constant: 65536/8 = 8192 bytes. A short[] needs 2*N bytes (N being the number of elements), so the two cost the same at N = 4096; below that the array is smaller, above it the bitmap wins.
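A minimal sketch of my own of the per-block container decision, based on the rule quoted above (the real roaring format has further refinements such as run containers):

def roaring_containers(posting_list):
    # Group ids by their high 16 bits, keep only the low 16 bits, then pick a container.
    blocks = {}
    for doc_id in posting_list:
        high, low = doc_id >> 16, doc_id & 0xFFFF   # the <quotient, remainder> pair
        blocks.setdefault(high, []).append(low)

    containers = {}
    for high, lows in blocks.items():
        if len(lows) > 4096:
            kind, size = "bitmap", 65536 // 8       # fixed 8192 bytes
        else:
            kind, size = "array", 2 * len(lows)     # 2 bytes (a short) per value
        containers[high] = (kind, size)
    return containers

print(roaring_containers([3, 70000, 70001, 200000]))
# {0: ('array', 2), 1: ('array', 4), 3: ('array', 2)}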


Joint index

Everything above concerns a single-field index. When a query combines indexes over several fields, how does the inverted index still deliver fast queries?

  • Use the skip list data structure to compute the intersection ("AND") quickly, or
  • Use the bitsets mentioned above and AND them bit by bit

Let's look at the skip list data structure first:

Take an ordered linked list as level 0 and promote a few of its elements to level 1 and level 2; the higher the level, the fewer elements are selected as pointers. A lookup proceeds from the highest level downward: to find 55, for example, first locate 31 at level 2, then 47 at level 1, and finally 55 at level 0, three lookups in total. The search efficiency is comparable to that of a binary tree, but it is bought with a certain amount of space redundancy.
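Since the original figure is not reproduced here, the sketch below builds a small skip list of my own with made-up values that follow the description (find 31 at level 2, then 47 at level 1, then 55 at level 0):

class Node:
    def __init__(self, value, right=None, down=None):
        self.value, self.right, self.down = value, right, down

def build(levels_bottom_up):
    # Link each level left to right; every node also points down to its copy one level below.
    below, head = {}, None
    for values in levels_bottom_up:
        prev = head = Node(float("-inf"), down=head)   # sentinel head per level
        current = {}
        for v in values:
            prev.right = Node(v, down=below.get(v))
            prev = current[v] = prev.right
        below = current
    return head                                        # head of the top level

def search(head, target):
    # Walk right while we can, drop down a level when we overshoot; count the lookups.
    node, lookups = head, 0
    while node:
        while node.right and node.right.value <= target:
            node, lookups = node.right, lookups + 1
            if node.value == target:
                return True, lookups
        node = node.down
    return False, lookups

head = build([
    [3, 14, 21, 31, 39, 44, 47, 55, 62, 74],  # level 0: the full ordered list
    [14, 31, 47, 74],                          # level 1
    [31],                                      # level 2
])
print(search(head, 55))  # (True, 3): 31 at level 2, 47 at level 1, 55 at level 0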

Suppose you have the following three Posting Lists that need to be indexed jointly:

With skip lists, take the shortest posting list and, for each of its ids, check whether the id exists in the other posting lists; the ids present in all of them form the intersection. With bitsets it is very straightforward: just AND them bitwise, and the result is the final intersection.
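The sketch below shows both strategies on three made-up posting lists: probing the other lists for each id of the shortest list (bisect stands in for following skip pointers), and ANDing bitsets, here modelled as arbitrarily long Python integers. Again this is my own illustration, not Lucene's implementation.

import bisect

def skip_to(sorted_ids, target):
    # Stand-in for following skip pointers: jump straight to the first id >= target.
    i = bisect.bisect_left(sorted_ids, target)
    return sorted_ids[i] if i < len(sorted_ids) else None

def intersect_with_skips(posting_lists):
    # For each id of the shortest list, skip ahead in the other lists to test membership.
    shortest, *rest = sorted(posting_lists, key=len)
    return [doc_id for doc_id in shortest
            if all(skip_to(p, doc_id) == doc_id for p in rest)]

def to_bitset(posting_list):
    # Pack a posting list into one big integer, one bit per document id.
    bits = 0
    for doc_id in posting_list:
        bits |= 1 << doc_id
    return bits

a, b, c = [2, 3, 5, 7, 11, 13], [1, 3, 5, 7, 9, 13], [3, 7, 13, 21]
print(intersect_with_skips([a, b, c]))                   # [3, 7, 13]

result = to_bitset(a) & to_bitset(b) & to_bitset(c)      # a single bitwise AND per list
print([i for i in range(result.bit_length()) if result >> i & 1])  # [3, 7, 13]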


Summary and reflection

To sum up, the indexing philosophy of Elasticsearch is:

Move as much disk-resident content into memory as possible, reduce the number of random disk reads (while also exploiting the disk's sequential read performance), and treat memory with extreme frugality by applying a variety of clever compression algorithms.

When indexing with Elasticsearch, note:

  • Fields that do not need to be indexed must be declared explicitly, because all fields are indexed automatically by default
  • Likewise, string fields that do not need to be analyzed must be declared explicitly, because strings are analyzed by default
  • It is important to choose regular document ids; highly random ids (such as Java UUIDs) are bad for queries

On the last point, I personally think there are several factors:

One factor (perhaps not the most important one): many of the compression algorithms for the posting list that we have seen are about compressing large numbers of ids, so sequential ids, or regular ids that share a common prefix, compress much better.

Another factor: the final step of an Elasticsearch query is to fetch the document data via the document ids in the posting list, and the efficiency of locating the segment that holds a given id directly affects the performance of the whole query. If the ids are regular, segments that cannot contain the id can be skipped quickly, cutting down unnecessary disk reads. For details, see the article on how to choose an efficient global id scheme.

In the future I will share more content based on real development and tuning work, so stay tuned!

Personal WeChat official account:

Personal GitHub:

github.com/jiankunking

Personal Blog:

jiankunking.com