This article introduces what Lucene full-text retrieval is, how it works, and how its index is structured. It aims to give readers who have never encountered Lucene a basic understanding in a short time. Apart from a few brief code sketches, it does not dive into implementation details; after reading it you should know what Lucene is, where it is applied, and what the often-mentioned "index" actually is.


Introduction and application of Lucene

Apache Lucene is currently the most popular open source full-text search toolkit, written in Java.

Several open source search engines are built on this toolkit; the most mature and well-known are Solr and Elasticsearch. Since 2010, Lucene and Solr have been developed by the same Apache Software Foundation team, which is why their releases are usually kept in sync. The difference is that Lucene is a toolkit, while Solr is an enterprise search application built on top of Lucene. In addition, the help-system search in the widely used Eclipse IDE is also implemented with Lucene.

Lucene’s two jobs

A familiar everyday analogue of a full-text index is a Chinese dictionary. Take the pinyin lookup as an example: first we use the pinyin to find the page number of the character we want to look up, then we turn to that page and read its detailed explanation.

The example above involves two elements: the dictionary itself, and the process of looking up a word. These correspond to Lucene's two jobs: building the dictionary, a process called indexing, and querying the index with a search term, called searching.

Indexing

1. Document Preparation

The document is the original text that we’re searching for.

2. Tokenizer

Split the document from the previous step into words, removing punctuation and stop words (words that carry little meaning on their own, such as "yes" and "of"). Commonly used open source Chinese tokenizers include MMSEG4J and IKAnalyzer. The resulting pieces are called tokens.

3. Linguistic Processor

Process the tokens obtained in the previous step: for English this means converting uppercase to lowercase, reducing plurals to the singular, reducing past-tense forms to the root, and so on. The result is called a Term.
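
Steps 2 and 3 are both performed by an Analyzer. Below is a minimal sketch using EnglishAnalyzer from Lucene's analyzers-common module, which tokenizes, lowercases, removes English stop words, and stems; the input string is illustrative:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeExample {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new EnglishAnalyzer()) {
            // Tokenize, lowercase, drop stop words, and stem the input text.
            TokenStream ts = analyzer.tokenStream("content", "The dogs were running");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints: dog, run
            }
            ts.end();
            ts.close();
        }
    }
}
```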

4. Index components

The index component takes the Terms from the previous step, builds a dictionary and an index, and stores them on disk. It first collects the Terms into a dictionary, sorts it, and then merges identical Terms to form an inverted (posting) list. For each Term, the list records the IDs of the documents containing it (the number of such documents is the Document Frequency) and how many times the Term appears in each document (the Term Frequency).
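
A minimal indexing sketch using Lucene's core API is shown below; the directory path and field name are illustrative:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // Open (or create) an index in a local folder; the path is illustrative.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            // A Document is the basic unit of indexing; TextField content
            // is tokenized and processed by the analyzer.
            Document doc = new Document();
            doc.add(new TextField("content",
                    "Lucene is a full-text search toolkit", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit(); // flush the new segment to disk
        }
    }
}
```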

Searching

1. Enter a query word

2. Lexical analysis and language processing

Split the input query and recognize keywords such as AND and NOT. The split words then go through the same linguistic processing as during indexing. A syntax tree is generated from the keywords and the processed words.
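
As an illustration, Lucene's classic QueryParser (from the queryparser module) performs exactly this analysis, turning a query string into a query tree; the field name and analyzer here are assumptions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        // Recognizes keywords (AND, NOT) and analyzes the remaining words
        // with the same analyzer used at indexing time.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("A AND B NOT C");
        System.out.println(query); // +content:a +content:b -content:c
    }
}
```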

3. Search the index to obtain documents that conform to the syntax tree

For example, given the syntax tree formed by "A AND B NOT C", Lucene retrieves the lists of documents containing A, B, and C, intersects the lists for A and B, and then takes the difference of that result with the list for C. The final result is the list of documents that satisfy the query.
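
A minimal sketch of executing such a boolean query against the index built earlier (the field name, terms, and index path are assumptions):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Build "A AND B NOT C" over the "content" field: MUST clauses
            // are intersected, the MUST_NOT clause is subtracted.
            BooleanQuery query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("content", "a")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("content", "b")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("content", "c")), BooleanClause.Occur.MUST_NOT)
                    .build();
            // Top 10 hits, ranked by relevance score.
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println("docId=" + hit.doc + " score=" + hit.score);
            }
        }
    }
}
```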

4. Sort search results by relevance

Relevance is computed with a vector space model. A simplified description: Document Frequency and Term Frequency are recorded when the index is built. The higher a term's Term Frequency in a document, the more relevant that document is; the higher a term's Document Frequency across the collection, the less that term contributes to relevance. Such a scoring algorithm can also be implemented by hand.
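
As a rough, hand-rolled illustration of that idea, not Lucene's exact scoring formula, a classic TF-IDF weight looks like this (the dampening and smoothing choices are assumptions):

```java
public class TfIdfSketch {
    // Weight of one term in one document: a higher term frequency raises
    // the score, a higher document frequency lowers it.
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);                               // dampened term frequency
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1)); // inverse document frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 4 times in a document, found in 10 of 1000 documents.
        System.out.println(tfIdf(4, 10, 1000));
    }
}
```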

5. Return the documents in the order given by the ranking above.

Index structure

Lucene’s index structure is hierarchical. The example below walks through the levels, from the index folder down to individual terms.


Index

In the database analogy, an index is like a table in a database.

In Lucene, an index is stored in a folder, so you can simply think of the index as the entire contents of that folder.

Segment

Using a database analogy, segments are similar to partitions of tables.

An index introduces the concept of segments: one index can contain multiple segments, and segment files are generated on each flush or commit. In the example folder there are two segments, 0 and 1. The files segments.gen and segments_N (here, segments_5) are segment metadata files that record the segments' attributes. The remaining files each belong to a specific segment; their purposes are explained later.

Index files are written sequentially: they can only be appended to, never modified in place. When a document is deleted, its docId is written to the .del file, and that docId is filtered out at query time. A document update is likewise implemented as an append of the new version followed by a deletion of the old one. This design ensures high write throughput.
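
A small sketch of how deletes and updates look through the IndexWriter API (it assumes documents carry a unique "id" StringField, as in the field example below):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteExample {
    static void deleteAndUpdate(IndexWriter writer) throws Exception {
        // Deletion only records the docId in the .del file; the data stays
        // on disk until a merge rewrites the segment without it.
        writer.deleteDocuments(new Term("id", "doc-42"));

        // An update is an append of the new version plus a delete of the old.
        Document newVersion = new Document();
        newVersion.add(new StringField("id", "doc-43", Field.Store.YES));
        newVersion.add(new TextField("content", "updated text", Field.Store.NO));
        writer.updateDocument(new Term("id", "doc-43"), newVersion);

        writer.commit();
    }
}
```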

The segment design also ensures efficient queries. If a segment is too large, the I/O cost of reading it is high; if segments are too small, too many segments must be queried. Lucene therefore merges segments, and deleted data is filtered out during the merge. Before 4.0, the default merge policy was LogMergePolicy, which merges adjacent segments smaller than a specified size; if two adjacent segments of 1 GB and 1 KB are merged this way, a file of roughly 1 GB is rewritten, consuming a lot of resources. From 4.0 on, the default policy is TieredMergePolicy, which first sorts segments by size, computes each segment's deletion ratio, and merges the smaller segments first; large segments are merged when the system is idle.
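
The merge policy can be tuned through IndexWriterConfig. The following sketch uses real TieredMergePolicy setters, but the parameter values are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyExample {
    static IndexWriterConfig configWithMergePolicy() {
        TieredMergePolicy policy = new TieredMergePolicy();
        policy.setMaxMergedSegmentMB(1024); // cap merged segments at ~1 GB
        policy.setSegmentsPerTier(10);      // allowed segments per size tier
        return new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(policy);
    }
}
```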

Document

In the database analogy, a document is like a row of data.

A document is the basic unit of indexing. A segment can contain multiple documents.

Field

In a database analogy, a field is the equivalent of a column in a table.

A document can have multiple fields. Lucene provides many field types, such as StringField, TextField, LongField, and NumericDocValuesField.
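
A brief sketch of how the choice of field type changes what gets indexed (field names and values are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldExample {
    static Document buildDocument() {
        Document doc = new Document();
        // StringField: indexed as a single token, not analyzed (good for IDs).
        doc.add(new StringField("id", "doc-42", Field.Store.YES));
        // TextField: analyzed into terms for full-text search.
        doc.add(new TextField("body", "full text to be tokenized", Field.Store.NO));
        // NumericDocValuesField: per-document value for sorting and scoring.
        doc.add(new NumericDocValuesField("timestamp", 1690000000L));
        return doc;
    }
}
```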

Term

A Term is the smallest unit of the index. Terms are produced from a Field's content by the Analyzer (tokenizer).

Segment file description

The Segment section above described segment design and merge strategies in detail; the contents of the main segment files are described below.

segments_N stores how many segments this index contains and how many documents each segment contains.

*.fnm

Saves how many fields this segment contains, the name of each field, and how each field is indexed.

*.fdx, *.fdt

Save all the documents contained in this segment, how many fields each document contains, and what information each field holds.

*.tvx, *.tvd, *.tvf

Save information about how many documents this segment contains, how many fields each document contains, how many terms each field contains, and the string and position of each term.

*.tis, *.tii

Store the Term Dictionary, i.e. all the terms contained in this segment, sorted in dictionary order.

*.frq

Stores the inverted lists, i.e. for each term, the list of IDs of the documents containing it.

*.prx

Saves, for each term in the inverted lists, the positions at which the term appears within each document containing it.

*.del

As mentioned earlier, this file stores the IDs of deleted documents.

Source: Creditease Technology Institute Author: Tian Liang