Preface
Lucene is a Java-based full-text information retrieval toolkit. Today, the mainstream search systems Elasticsearch and Solr are both built on Lucene's indexing and search capabilities. To understand how a search system works, you need to dig into Lucene to see how it stores the data to be retrieved and how it performs efficient retrieval over that data.
Thanks to indexes, databases can also support many efficient query operations. Compared with Lucene, however, a database's query capabilities are still much weaker. This article explores which queries Lucene supports and picks several types of queries to analyze how Lucene implements them internally. To make things easier to follow, we first briefly introduce some basic concepts in Lucene and then walk through several of Lucene's data storage structures. Once we understand how they store data, it becomes clear how efficient search can be built on top of them. This article focuses on queries that Lucene can handle but that are difficult for a traditional database; word segmentation (analysis), scoring, and other features are not covered.
This article is divided into the following parts:
- An overview of Lucene's data model (see the Lucene data model section below).
- How Lucene stores the terms to be searched.
- How Lucene stores inverted lists and implements fast docid lookup.
- How Lucene merges inverted lists.
- How Lucene performs range queries and prefix matching.
- How Lucene optimizes range queries over numeric types.
Lucene data model
Lucene contains four basic data types:
- Index: an index, made up of many documents.
- Document: made up of many fields; the smallest unit of indexing and search.
- Field: made up of many terms; a field has a field name and a field value.
- Term: made up of bytes. Typically, each smallest unit produced by segmenting a field value of type Text is called a term.
In Lucene, the read and write paths are separated. An IndexWriter is created for writing and an IndexSearcher for reading. Here is a simple code example of how to use Lucene's IndexWriter to build an index and how to use IndexSearcher to run search queries.
```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Analyzer analyzer = new StandardAnalyzer();
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
//assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    System.out.println(hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
```
As you can see from this example, Lucene uses separate classes for reading and writing. This article focuses on the read logic. Using the IndexSearcher class requires a DirectoryReader and a QueryParser, where the DirectoryReader needs the same Directory implementation that was used at write time. The QueryParser is mainly used to parse your query; for example, if you query "A and B", Lucene has a mechanism to parse it into the intersection of terms A and B. When executing a search, you specify a maximum number of documents to return; since there may be too many hits, we can cap the number of returned documents and implement pagination.
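For example, a query string like "a AND b" is parsed into the intersection of the two terms' inverted lists. Here is a minimal sketch of the equivalent query built by hand (BooleanQuery.Builder exists since Lucene 5.3; the field name "fieldname" is just the one from the example above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Two MUST clauses: a document matches only if it contains both terms,
// which is exactly the inverted-list intersection discussed below.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("fieldname", "a")), BooleanClause.Occur.MUST);
builder.add(new TermQuery(new Term("fieldname", "b")), BooleanClause.Occur.MUST);
Query andQuery = builder.build();
```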
The following sections describe in detail the steps an index query goes through and the optimizations Lucene applies at each step.
Lucene query process
Queries in Lucene are executed per segment. Each segment can be regarded as a separate sub-index. During indexing, Lucene continuously flushes in-memory data to create new segments. Multiple segments are also merged into one large segment; old segments are not deleted while queries are still reading them, and segments that have been merged and are no longer being read are deleted. This process is similar to compaction in an LSM database. Below we focus on how efficient queries are implemented within a single segment.
To make this easier to follow, let's take a table of names, ages, and student IDs as an example and look at how to search for the list of people with a given name (several people may share a name).
| docid | name  | age | id  |
|-------|-------|-----|-----|
| 1     | Alice | 18  | 101 |
| 2     | Alice | 20  | 102 |
| 3     | Alice | 21  | 103 |
| 4     | Alan  | 21  | 104 |
| 5     | Alan  | 18  | 105 |
In Lucene, to serve a condition such as name=XXX, an inverted list keyed on name is built. For our example, the name inverted lists look like this:

| name  | docids    |
|-------|-----------|
| Alice | [1, 2, 3] |
| Alan  | [4, 5]    |

If we also want to query by age, for example to fetch the list of people with age=18, we can build another set of inverted lists:

| age | docids |
|-----|--------|
| 18  | [1, 5] |
| 20  | [2]    |
| 21  | [3, 4] |
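As a minimal illustration of this structure (a toy model in plain Java, not Lucene's actual on-disk format), the inverted lists above can be represented as a sorted map from term to docid list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy inverted index: term -> sorted list of docids.
// A TreeMap keeps the terms themselves sorted, which is what later makes
// dictionary lookups and term-range scans possible.
Map<String, List<Integer>> nameIndex = new TreeMap<>();
String[] names = {"Alice", "Alice", "Alice", "Alan", "Alan"}; // docids 1..5
for (int docid = 1; docid <= names.length; docid++) {
    nameIndex.computeIfAbsent(names[docid - 1], k -> new ArrayList<>()).add(docid);
}
System.out.println(nameIndex.get("Alan")); // [4, 5]
```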
Here Alice, Alan, and 18 are all terms. An inverted index is therefore essentially a list of docids keyed on term, which makes attribute lookups convenient. A natural question arises: if there are very many terms, how do we find a term's inverted list quickly? Lucene introduces the concept of a term dictionary, i.e. a dictionary of terms. If the term dictionary is sorted by term, we can use binary search to locate a term, with O(log N) complexity. When there are too many terms, efficiency still needs to improve further: we could use a hash map, so that when a term arrives, its hash leads directly to its inverted list. This hash map acts as an index over the term dictionary. Since Lucene 4, in order to support range queries, prefix and suffix matching, and other complex queries, Lucene has used the FST data structure to store the term dictionary. The FST storage structure is described below.
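To make the binary-search idea concrete, here is a small sketch (our own simplification; Lucene's real term dictionary is a block-encoded on-disk structure): the dictionary is a sorted array of terms, and each slot points to that term's inverted list.

```java
import java.util.Arrays;

// A toy term dictionary: terms sorted lexicographically, with the
// inverted list for terms[i] stored at postings[i].
String[] terms = {"Alan", "Alice"};
int[][] postings = {{4, 5}, {1, 2, 3}};

// Binary search over the sorted dictionary: O(log N) term lookup.
int slot = Arrays.binarySearch(terms, "Alice");
if (slot >= 0) {
    System.out.println(Arrays.toString(postings[slot])); // [1, 2, 3]
}
```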
FST
Let's take the two words Alice and Alan as an example and see how an FST is constructed. First, sort all the words, giving "Alan", "Alice", then insert them in order:

- Insert "Alan"
- Insert "Alice"
This yields a directed acyclic graph, a data structure that lets you quickly check whether a given name exists. For single-term lookups, an FST may not be significantly better than a HashMap, and may even be slower; but it has clear advantages in range queries, prefix search, and compression ratio.
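For reference, here is roughly what building and querying an FST looks like with Lucene's own API (a sketch based on the org.apache.lucene.util.fst package from the Lucene 4-7 era; newer versions renamed Builder to FSTCompiler, so treat the exact signatures as version-dependent):

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Map each term to a long output, e.g. a hypothetical address of its
// inverted list. Inputs MUST be added in sorted order.
String[] inputs = {"Alan", "Alice"};
long[] outputs = {4L, 7L}; // placeholder posting-list addresses
PositiveIntOutputs outs = PositiveIntOutputs.getSingleton();
Builder<Long> fstBuilder = new Builder<>(FST.INPUT_TYPE.BYTE1, outs);
BytesRefBuilder scratchBytes = new BytesRefBuilder();
IntsRefBuilder scratchInts = new IntsRefBuilder();
for (int i = 0; i < inputs.length; i++) {
    scratchBytes.copyChars(inputs[i]);
    fstBuilder.add(Util.toIntsRef(scratchBytes.get(), scratchInts), outputs[i]);
}
FST<Long> fst = fstBuilder.finish();
System.out.println(Util.get(fst, new BytesRef("Alan"))); // 4
```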
After locating the inverted list via the FST, one more thing needs to be done: merging inverted lists. A query may contain multiple conditions; for example, above we wanted the list matching name="Alan" AND age="18". How does Lucene merge inverted lists? To answer that, we first need to look at the data structure used to store an inverted list.
SkipList
To find a given docid quickly inside an inverted list, Lucene uses the SkipList data structure. A SkipList has the following properties:
- Elements are sorted; in our inverted lists, Lucene sorts by docid in ascending order.
- The skip interval is fixed and is specified when the SkipList is created (in the figure, the interval is 3).
- The number of levels is how many layers the SkipList has.
Suppose we want to find docid=12 in the list 1, 2, 3, 5, 7, 8, 10, 12, 15. Without a SkipList, we might have to scan the original list element by element. With a SkipList, we first check the upper layer and find that its next entry is already greater than 12, so we drop to layer 0; there we pass 3 and 8, find that the next entry (15) is greater than 12, fall back to 8 in the original list, and walk forward through 10 to reach 12. A sketch of this search follows below.
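Here is a minimal sketch of this search in plain Java, with a single skip layer instead of Lucene's real multi-level, block-based skip data:

```java
// A toy skip structure over a sorted posting list.
public class SkipListDemo {
    static final int[] DOCS = {1, 2, 3, 5, 7, 8, 10, 12, 15};
    static final int[] SKIP_DOCS = {3, 8, 15}; // every 3rd docid (interval = 3)
    static final int[] SKIP_POS = {2, 5, 8};   // index of each skip entry in DOCS

    // Return the first docid >= target, like DocIdSetIterator.advance().
    static int advance(int target) {
        int start = 0;
        // Use the skip layer to jump over whole blocks of the base list.
        for (int i = 0; i < SKIP_DOCS.length && SKIP_DOCS[i] < target; i++) {
            start = SKIP_POS[i];
        }
        // Finish with a short linear scan instead of scanning from the head.
        for (int i = start; i < DOCS.length; i++) {
            if (DOCS[i] >= target) return DOCS[i];
        }
        return Integer.MAX_VALUE; // exhausted, like NO_MORE_DOCS
    }

    public static void main(String[] args) {
        System.out.println(advance(12)); // skips to 8, then scans 10 -> 12
    }
}
```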
With FST and SkipList introduced, we can roughly draw the following diagram of how Lucene implements the whole inverted structure:
From this picture, we can see why Lucene can perform both term-to-inverted-list lookups and docid lookups within a list quickly. Next, let's look at how inverted lists are merged to return the final result.
Inverted list merging
If our query condition is name = "Alice", then as described earlier, we first check the term dictionary for whether the term exists; if it does, we read that term's inverted list and return results, paginated according to the query parameters. This kind of query can also be satisfied by a secondary index in a database, so where is Lucene's advantage? Suppose we have multiple conditions: we need to query by name or by age separately, and also by the combination name = "Alice" AND age = "18". With a traditional secondary-index scheme, you might create two index tables, query each separately, and merge the results; if there are many results with age = 18, that merge can be time-consuming. So how are these inverted lists merged in Lucene? Suppose we have the following three inverted lists, termA, termB, and termC, that need to be intersected.
In Lucene, the merge proceeds in the following order (a code sketch follows the steps):

1. Take the first element of termA's list: docid=1.
2. Set currentDocId=1.
3. Call advance(currentDocId) on termB, which returns the first docid greater than or equal to currentDocId; here it returns 1.
   - If the returned docid equals currentDocId, move on to the next list.
   - If it does not, set currentDocId to the returned docid and repeat step 3 against the other lists.
4. termC also matches, so currentDocId is emitted as a result.
5. Set currentDocId to the next item of termC's list.
6. Continue from step 3 until one of the inverted lists is exhausted.
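Here is a minimal sketch of this intersection over plain sorted arrays (our own simplification; in Lucene each list is a DocIdSetIterator whose advance() is made fast by the skip structure described above):

```java
import java.util.Arrays;
import java.util.List;

public class ConjunctionDemo {
    // First value >= target in a sorted array: a stand-in for advance().
    static int advance(int[] docs, int target) {
        int i = Arrays.binarySearch(docs, target);
        if (i < 0) i = -i - 1; // insertion point of a missing value
        return i < docs.length ? docs[i] : Integer.MAX_VALUE;
    }

    // Intersect posting lists by leapfrogging currentDocId across them.
    static void intersect(List<int[]> lists) {
        int current = lists.get(0)[0];
        while (current != Integer.MAX_VALUE) {
            boolean agreed = true;
            for (int[] docs : lists) {
                int d = advance(docs, current);
                if (d != current) { // mismatch: restart from the larger candidate
                    current = d;
                    agreed = false;
                    break;
                }
            }
            if (agreed) {
                System.out.println("match: " + current);
                current++; // look for the next candidate past this match
            }
        }
    }

    public static void main(String[] args) {
        intersect(List.of(
                new int[]{1, 3, 5, 7, 9},  // termA
                new int[]{1, 2, 3, 5, 9},  // termB
                new int[]{1, 5, 9, 12}));  // termC -> prints 1, 5, 9
    }
}
```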
Looking at the whole merge process, we can see that if one of the lists is very short, the number of comparisons drops dramatically. In addition, thanks to the SkipList structure, locating a given docid within an inverted list is fast and does not require element-by-element traversal, so the final result can be returned quickly. Considering the whole query process, from locating the inverted list to reading and merging it, Lucene's optimization of the merge step reduces the I/O needed to read data, and the flexibility of inverted-list merging solves the problem that traditional indexes struggle to support multi-condition queries.
BKDTree
In Lucene, if you want to run a range search, the FST model above implies that you would have to traverse the FST to find every term falling inside the range, follow each term into its inverted list, and then union the results. If the field is numeric, however, such as a floating-point number, the number of potential terms can be enormous, making this kind of query inefficient. To support efficient queries over numeric and multi-dimensional data, Lucene introduced the BKDTree. BKDTree is based on the KDTree, a binary tree that partitions the data dimension by dimension so that the number of nodes on both sides of each split stays balanced. In the one-dimensional case, the KDTree degenerates into a binary search tree, where finding an interval reaches the leaf nodes, and thus the corresponding inverted lists, in O(log N). As shown below:
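At the API level, numeric range queries over this point structure look as follows since Lucene 6 (a minimal sketch; the "age" field is just our running example):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

// Index side: adding the value as a point stores it in the BKD structure.
Document doc = new Document();
doc.add(new LongPoint("age", 18));

// Query side: 18 <= age <= 21, answered by descending the tree
// instead of enumerating every numeric term in the range.
Query rangeQuery = LongPoint.newRangeQuery("age", 18, 21);
```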
If the data is multi-dimensional, the KDTree build process changes a bit. Taking two dimensions as an example, the tree is built as follows:
- Determine the split dimension: the dimension in which the data has the largest variance is split first. Intuitively, the more widely scattered the data is along a dimension, the earlier we split on it.
- Choose the median point of that dimension as the split point.
- Recurse on steps 1 and 2. We can set a threshold so that a node holding fewer than a given number of points is not split further, and stop once all points have been placed.
Here is an example:
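A toy sketch of this build process (our own illustration; it uses the range spread of each dimension as a cheap stand-in for variance, and is not Lucene's actual implementation):

```java
import java.util.Arrays;
import java.util.Comparator;

public class KdBuildDemo {
    static final int LEAF_THRESHOLD = 2; // stop splitting below this many points

    static void build(double[][] points, int depth) {
        if (points.length <= LEAF_THRESHOLD) {
            System.out.println("leaf at depth " + depth + ": " + Arrays.deepToString(points));
            return;
        }
        // Step 1: split on the most widely scattered dimension.
        final int dim = widestDimension(points);
        // Step 2: sort on that dimension and split at the median point,
        // keeping the two halves balanced.
        Arrays.sort(points, Comparator.comparingDouble(p -> p[dim]));
        int mid = points.length / 2;
        // Step 3: recurse until every node is at or below the threshold.
        build(Arrays.copyOfRange(points, 0, mid), depth + 1);
        build(Arrays.copyOfRange(points, mid, points.length), depth + 1);
    }

    static int widestDimension(double[][] points) {
        int best = 0;
        double bestSpread = -1;
        for (int d = 0; d < points[0].length; d++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] p : points) {
                min = Math.min(min, p[d]);
                max = Math.max(max, p[d]);
            }
            if (max - min > bestSpread) {
                bestSpread = max - min;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        build(new double[][]{{1, 9}, {2, 3}, {4, 1}, {6, 8}, {7, 2}, {9, 6}}, 0);
    }
}
```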
BKDTree is a variant of KDTree: as you can see, inserting new nodes into a KDTree, or modifying existing ones, is relatively expensive. Similar to the LSM merge idea, BKD maintains multiple KDTrees and continuously merges them into one. Note, however, that intersecting a BKD-indexed numeric field with an ordinary term's inverted list is not as efficient as merging two ordinary inverted lists, so there is a trade-off here. One idea is to add the term field as an extra dimension of the BKDTree index.
How sorting and aggregation of returned results are implemented
As the previous sections showed, Lucene implements term search through the inverted storage model. But sometimes we need to fetch the value of another attribute to aggregate on, or return the results sorted by another attribute. Before Lucene 4, you had to fetch all matching documents and then read their stored fields for sorting, which is inefficient and memory-hungry. To speed this up, Lucene implemented FieldCache, which pulls the needed fields into memory. This reduces repeated I/O, but it introduces a new problem: high memory usage. Newer versions of Lucene introduce DocValues, a docid-keyed column store. When we get a set of docids, this column store, combined with a heap sort, lets us sort them efficiently. Of course, the extra column store takes extra space, so when building the index Lucene lets you choose whether DocValues storage is needed and which fields to store.
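A minimal sketch of this opt-in column storage at the API level (reusing the "age" field from our running example; the search call assumes the isearcher from the first code listing):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Index side: opt in to column (DocValues) storage for the "age" field.
Document doc = new Document();
doc.add(new NumericDocValuesField("age", 18));

// Search side: sort hits by the column-stored value instead of loading
// every matching document's stored fields.
Sort sortByAge = new Sort(new SortField("age", SortField.Type.LONG));
TopDocs hits = isearcher.search(new MatchAllDocsQuery(), 10, sortByAge);
```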
Lucene code directory structure
After introducing Lucene's main data structures and search principles, let's look at Lucene's code layout, so we can then dive into the code for details. Lucene has the following directories:
- The analysis module is mainly responsible for lexical analysis and language processing that turn text into Terms.
- The codecs module implements some of the data structures mentioned earlier, as well as encoding and compression algorithms, including SkipList, DocValues, and so on.
- The document module mainly contains Lucene's definitions of its various data types.
- The index module is responsible for index creation and contains IndexWriter.
- The store module is responsible for reading and writing the index.
- The search module is mainly responsible for searching the index.
- The geo module mainly contains implementations of classes related to geo queries.
- The util module contains implementations of data structures such as BKD and FST.
Final thoughts
This article introduced some of Lucene's main data structures and how they enable efficient lookups. We hope this introduction helps explain the difference between inverted indexes and traditional database indexes, and why it sometimes pays to use a search engine for richer query semantics. As for how Lucene, as a search library, scores documents and parses query statements, we have not expanded on those topics here; interested readers can dig into the Lucene source code for further understanding.