I am participating in the Mid-Autumn Festival Creative Submission contest, please see: Mid-Autumn Festival Creative Submission Contest for details.
Solr and ElasticSearch are implemented using Lucene.
I. Introduction to Lucene
Lucene was originally developed by Doug Cutting and is available for download on SourceForge’s website. Joined the Apache Software Foundation’s Jakarta family as a high-quality open source Java product in September 2001.
It is a full text search engine architecture, provides a complete create index and query index, and some of the text analysis engine, software developers Lucene’s purpose is to provide a simple and easy to use toolkit, to facilitate in the target system can realize the function of full text retrieval, or is on this basis, establish a complete full text search engine, Lucene is a classic ancestor in the field of full-text search. Now many search engines are created on its basis, and their ideas are the same.
Second, index principle
2.1 Data query method
Generally speaking, we query data by sequential scanning, we need to go through each document from beginning to end until we find the data we need. However, this efficiency is extremely low and requires the scanning of all documents.
But sequential scanning also has its advantages: accurate data; The downside is equally obvious: as the volume of data increases, it becomes less efficient.
Over time, faster indexing methods have emerged: mysql’s B+ Tree, HashMap’s zip-up (also known as open hash), redis’s skip list for set types (also used by Lucene), and full-text search’s inverted index. This article focuses on inverted indexes.
2.1.2 Inverted index
Lucene uses inverted indexes to create documents and query data.
When data is stored, keywords are broken down and extracted to create an index (like a dictionary’s table of contents); When a query is made, it will search the directory based on the query content until it finds the document where the query content is located.
Advantages: 1) High accuracy 2) not because of the increase in the amount of data will lead to a significant decline in the query speed. (For both Chinese and English dictionaries, the number of lists is fixed and basically unchanged, even if the amount of data increases, the number of lists remains the same.)
Disadvantages: Index files take up extra disk space.
Usage scenario: Suitable for massive data query.
2.2 Index Process
2.3 Documents and index documents
2.3.1 document
The logical diagram of the document collection is shown below:
Represents two document records with ids 1 and 2, and each document has its own field attribute.
2.3.2 Index documents
As indicated in the above document, we will partition the content of the above document during storage to build the index document.
1, 2, Zhang SAN, Li Si, Heilongjiang, province, Heilongjiang, Shuangyashan, Harbin, city, Shuangyashan, Harbin.
The index document structure is as follows:
Keywords are on the left, and ids for storing document data are on the right.
2.4 Underlying Storage Structure
Its storage structure is shown as follows:
In the figure above, the Segment composition data file is real, with indexes and documents as its internal logical structure. The real path is shown in the following figure:
There is also a lock in the figure above, which is a measure to ensure data correctness when controlling concurrent writes.
In the above storage structure, there are several key points:
Index
In Lucene, an index is stored in a folder.
Segment (segment)
An index consists of multiple segments. Multiple segments can be merged to reduce disk I/O.
Lucene writes data to the buffer first. When a certain amount of data is written into a segment, the segment is flushed to disk. Each seinterfaces has its own independent index, which can be queried separately.
The segment will not be modified, and data is written in batches to avoid random write and improve throughput. Seinterfaces can be deleted, but, instead of modifying the seinterfaces files, file IDS will be composed of other files that record the interfaces that need to be deleted.
The index query is the query of several seinterfaces files, including the processing of the deleted files, and the integration of the query results. For query optimization, Lucene has policies to optimize multiple segments.
Document
The document is our basic unit of indexing. Different documents are stored in different sections. Multiple documents can be contained in a single section.
Newly added documents are stored in a new segment and are eventually merged into the same segment as the segment merge strategy is implemented.
Domain (feild)
A document can contain many different types of information, which can be indexed in different ways and indexed separately. Such as text, time, value, etc. Different fields can be indexed differently.
2.5 Dictionary data structure
The dictionary data structure refers to the inverted index mentioned earlier, where we split the keywords of the document to form a dictionary-like index structure. Here are some common dictionary structures:
The name of the | The characteristics of |
---|---|
Jump table | Small memory, adjustable, not friendly to fuzzy query; For the principle of skip list, please refer to my redis type article. The storage structure of Zset in the case of large data volume is skip list:www.jianshu.com/p/47f431c64… |
Sort lists (arrays, lists) | Use binary search, unbalanced |
The dictionary tree | The query efficiency depends on the length of the string and is only suitable for English dictionaries.Baike.baidu.com/item/%E5%AD… |
Hash table | High efficiency, large memory consumption |
Binary dictionary tree | Small memory footprint, suitable for Chinese dictionaryBaike.baidu.com/item/%E5%8F… |
FST (Finite State Transducers) | Finite state transfer machine, lucene uses this method |
B tree | Disk index, used mostly for relational databases |
2.5.1 Introduction to FST
Advantages: Low memory usage, high compression ratio (2 to 30 times), fast query, and good fuzzy query support. Disadvantages: complex structure, orderly input requirements, update is not easy.
For example, there are abD, ABC, ACF and ACE, then the insertion method is roughly as follows:
2.6 Efficiency Optimization
You can optimize the disk in either of the following ways: 1) Set the cache size to reduce flush times and adjust the disk I/O problem. 2) Index storage location: Currently, there are three methods as follows: MMap is recommended
The name of the | The characteristics of |
---|---|
SimpleFSDirectory | Simple implementation, poor concurrency |
NIOFSDirectory | Concurrency is strong, Windows platform has bugs |
MmapFSDirectory | Read and write operations are memory based |