I am participating in the Mid-Autumn Festival Creative Submission contest, please see: Mid-Autumn Festival Creative Submission Contest for details.

Solr and ElasticSearch are implemented using Lucene.

I. Introduction to Lucene

Lucene was originally developed by Doug Cutting and is available for download on SourceForge’s website. Joined the Apache Software Foundation’s Jakarta family as a high-quality open source Java product in September 2001.

It is a full text search engine architecture, provides a complete create index and query index, and some of the text analysis engine, software developers Lucene’s purpose is to provide a simple and easy to use toolkit, to facilitate in the target system can realize the function of full text retrieval, or is on this basis, establish a complete full text search engine, Lucene is a classic ancestor in the field of full-text search. Now many search engines are created on its basis, and their ideas are the same.

Second, index principle

2.1 Data query method

Generally speaking, we query data by sequential scanning, we need to go through each document from beginning to end until we find the data we need. However, this efficiency is extremely low and requires the scanning of all documents.

But sequential scanning also has its advantages: accurate data; The downside is equally obvious: as the volume of data increases, it becomes less efficient.

Over time, faster indexing methods have emerged: mysql’s B+ Tree, HashMap’s zip-up (also known as open hash), redis’s skip list for set types (also used by Lucene), and full-text search’s inverted index. This article focuses on inverted indexes.

2.1.2 Inverted index

Lucene uses inverted indexes to create documents and query data.

When data is stored, keywords are broken down and extracted to create an index (like a dictionary’s table of contents); When a query is made, it will search the directory based on the query content until it finds the document where the query content is located.

Advantages: 1) High accuracy 2) not because of the increase in the amount of data will lead to a significant decline in the query speed. (For both Chinese and English dictionaries, the number of lists is fixed and basically unchanged, even if the amount of data increases, the number of lists remains the same.)

Disadvantages: Index files take up extra disk space.

Usage scenario: Suitable for massive data query.

2.2 Index Process

2.3 Documents and index documents

2.3.1 document

The logical diagram of the document collection is shown below:

Represents two document records with ids 1 and 2, and each document has its own field attribute.

2.3.2 Index documents

As indicated in the above document, we will partition the content of the above document during storage to build the index document.

1, 2, Zhang SAN, Li Si, Heilongjiang, province, Heilongjiang, Shuangyashan, Harbin, city, Shuangyashan, Harbin.

The index document structure is as follows:

Keywords are on the left, and ids for storing document data are on the right.

2.4 Underlying Storage Structure

Its storage structure is shown as follows:

In the figure above, the Segment composition data file is real, with indexes and documents as its internal logical structure. The real path is shown in the following figure:

There is also a lock in the figure above, which is a measure to ensure data correctness when controlling concurrent writes.

In the above storage structure, there are several key points:

Index

In Lucene, an index is stored in a folder.

Segment (segment)

An index consists of multiple segments. Multiple segments can be merged to reduce disk I/O.

Lucene writes data to the buffer first. When a certain amount of data is written into a segment, the segment is flushed to disk. Each seinterfaces has its own independent index, which can be queried separately.

The segment will not be modified, and data is written in batches to avoid random write and improve throughput. Seinterfaces can be deleted, but, instead of modifying the seinterfaces files, file IDS will be composed of other files that record the interfaces that need to be deleted.

The index query is the query of several seinterfaces files, including the processing of the deleted files, and the integration of the query results. For query optimization, Lucene has policies to optimize multiple segments.

Document

The document is our basic unit of indexing. Different documents are stored in different sections. Multiple documents can be contained in a single section.

Newly added documents are stored in a new segment and are eventually merged into the same segment as the segment merge strategy is implemented.

Domain (feild)

A document can contain many different types of information, which can be indexed in different ways and indexed separately. Such as text, time, value, etc. Different fields can be indexed differently.

2.5 Dictionary data structure

The dictionary data structure refers to the inverted index mentioned earlier, where we split the keywords of the document to form a dictionary-like index structure. Here are some common dictionary structures:

The name of the The characteristics of
Jump table Small memory, adjustable, not friendly to fuzzy query; For the principle of skip list, please refer to my redis type article. The storage structure of Zset in the case of large data volume is skip list:www.jianshu.com/p/47f431c64…
Sort lists (arrays, lists) Use binary search, unbalanced
The dictionary tree The query efficiency depends on the length of the string and is only suitable for English dictionaries.Baike.baidu.com/item/%E5%AD…
Hash table High efficiency, large memory consumption
Binary dictionary tree Small memory footprint, suitable for Chinese dictionaryBaike.baidu.com/item/%E5%8F…
FST (Finite State Transducers) Finite state transfer machine, lucene uses this method
B tree Disk index, used mostly for relational databases

2.5.1 Introduction to FST

Advantages: Low memory usage, high compression ratio (2 to 30 times), fast query, and good fuzzy query support. Disadvantages: complex structure, orderly input requirements, update is not easy.

For example, there are abD, ABC, ACF and ACE, then the insertion method is roughly as follows:

2.6 Efficiency Optimization

You can optimize the disk in either of the following ways: 1) Set the cache size to reduce flush times and adjust the disk I/O problem. 2) Index storage location: Currently, there are three methods as follows: MMap is recommended

The name of the The characteristics of
SimpleFSDirectory Simple implementation, poor concurrency
NIOFSDirectory Concurrency is strong, Windows platform has bugs
MmapFSDirectory Read and write operations are memory based