preface

I haven’t finished my last series yet, but here I am starting a new one.

I have been doing search/recommendation work for about two years. That work touches Lucene a lot, but I have never studied it closely. Work has not been so busy lately, so I want to read the source code and study Lucene systematically.

In addition, I have heard that the Lucene source code is a model of object-oriented design, and I would like to absorb some code design and development knowledge from it. Lately I always feel there is something wrong with the code I write, but it is hard to pin down and optimize; the improvement after each pass is very limited.

Introduction of lucene

The following is from Wikipedia:

Lucene is an open source library for full-text retrieval and search, supported and provided by the Apache Software Foundation. Lucene provides a simple yet powerful application interface that enables full-text indexing and searching. Lucene is now the most popular free Java information retrieval library.

Full-text retrieval is an information retrieval technique that treats all text content as the retrieval target. The best-known full-text search engines are Google and Baidu, which analyze and index all web content on the Internet to give us a sub-second search experience. Many of today’s mobile apps also have a built-in search function; that is search in a vertical domain. The difference from Google/Baidu is that they only search information within the current app, rather than all pages on the Internet.

Suppose you have 10 articles, each with a title and a body. What should we do when we want to find every article whose body contains “atomic energy”?

The crudest way is to read each article sequentially and compare character by character. Wherever a run of consecutive characters matches “atomic energy”, we write down that article’s title; after scanning all ten articles we have our search result.

This method is straightforward and effective. On a reasonably powerful computer it even works on gigabyte-scale files (grep on Linux does exactly this, and even a simple search over a gigabyte file usually finishes in acceptable time).
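The brute-force approach above can be sketched in a few lines. This is my own illustrative code (class and method names are hypothetical, not from any library): it simply scans every string for the query, character by character under the hood.

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force full scan: check every article body for the query string.
public class BruteScan {
    // Returns the indices of all bodies that contain the query.
    public static List<Integer> scan(String[] bodies, String query) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < bodies.length; i++) {
            if (bodies[i].contains(query)) { // sequential comparison, like grep
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] bodies = {"atomic energy research", "solar power plant"};
        System.out.println(scan(bodies, "atomic")); // prints [0]
    }
}
```

Every query re-reads all the data, which is exactly why this stops scaling.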

However, when the data is much larger than a gigabyte, or the query is more complex than simple string matching (say, a combination of multiple conditions), this approach breaks down. That is where full-text retrieval is needed.

A search engine like Google can return about 1.23 billion results for “full-text search engine” in 0.5 seconds; it is clearly not doing a sequential character-by-character comparison.

Lucene can query large amounts of data in sub-second time by relying on a structure called an index. Indexes are everywhere: at the back of many books there is a mapping from keywords to page numbers. That is a kind of index; it lets us find the parts of the book we care about without reading the whole book.

In The Beauty of Mathematics, the author argues that the essence of full-text retrieval is Boolean algebra. The deeper my understanding of full-text retrieval grows, the more accurate that statement feels: in both the indexing and search stages, the underlying principle is the simplest Boolean algebra; all the rest is engineering complexity.
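To make that claim concrete, here is a minimal sketch of my own (the class name and document numbering are hypothetical): if each term is represented as a bit set over documents, where bit i means “this term appears in document i”, then a conjunctive query is nothing but bitwise AND.

```java
import java.util.BitSet;

// Boolean retrieval in miniature: terms are bit sets over documents,
// and an AND query is bitwise conjunction of those sets.
public class BooleanQueryDemo {
    public static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // keep inputs untouched
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet atomic = new BitSet();  // "atomic" appears in docs 0 and 2
        atomic.set(0);
        atomic.set(2);
        BitSet energy = new BitSet();  // "energy" appears in docs 0 and 1
        energy.set(0);
        energy.set(1);
        // Docs matching "atomic AND energy":
        System.out.println(and(atomic, energy)); // prints {0}
    }
}
```

OR and NOT queries fall out the same way from `BitSet.or` and `BitSet.andNot`; everything beyond this is engineering.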

lucene-beta

Lucene is currently at version 9.0 and is a complex project split into multiple modules.

Before reading the Lucene source, I kept wondering which route to take. You cannot just pick a random class and start reading, or you will drown in a vast sea of details.

My original idea was to start from building an index and follow the route: build index -> write to disk -> search request -> query parsing -> relevance scoring -> return results.

Then I had a bolder idea: try to abstract the essence of full-text retrieval into the simplest possible tool (ideally one or two classes), and then discuss how each part of that tool evolves into its Lucene counterpart, and how Lucene fixes each of its defects. That is how I will learn Lucene.

That is where the title of this section, lucene-beta, comes from.

In my expectation, this approach has two advantages:

  1. It stays closer to the essence and does not get lost in local details.
  2. Moving from a problem to its solution makes more sense, and lets you feel more deeply why each design is necessary. What is the point of showing off techniques that are not needed?
```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LuceneBeta {
    private static final Logger logger = LoggerFactory.getLogger(LuceneBeta.class);

    public static void main(String[] args) {
        LuceneBeta beta = new LuceneBeta();
        String[] arr = new String[]{"Institute of Atomic Energy", "The atomic bomb was powerful."};
        Map<Character, int[]> index = beta.build(arr);
        int[] searchRet = beta.search('b', index);
        System.out.println(Arrays.toString(searchRet)); // [1]
    }

    /** Builds a character-level index over the given array of strings. */
    public Map<Character, int[]> build(String[] arr) {
        // Collect every character that appears in any string.
        Set<Character> all = new HashSet<>();
        for (String s : arr) {
            for (char c : s.toCharArray()) {
                all.add(c);
            }
        }

        // For each character, record per string whether it appears (1) or not (0).
        Map<Character, int[]> index = new HashMap<>();
        for (Character c : all) {
            int[] perContains = new int[arr.length];
            for (int w = 0; w < arr.length; w++) {
                perContains[w] = arr[w].indexOf(c) >= 0 ? 1 : 0;
            }
            index.put(c, perContains);
        }

        logger.info("built index over {} strings:", arr.length);
        for (Map.Entry<Character, int[]> e : index.entrySet()) {
            logger.info("{} ==> {}", e.getKey(), Arrays.toString(e.getValue()));
        }
        return index;
    }

    /** Returns the numbers of the strings in which the target character appears. */
    public int[] search(char target, Map<Character, int[]> index) {
        if (!index.containsKey(target)) {
            return null;
        }
        int[] ints = index.get(target);
        int[] tmp = new int[ints.length]; // at most one hit per string
        int j = 0;
        for (int i = 0; i < ints.length; i++) {
            if (ints[i] == 1) {
                tmp[j] = i;
                j++;
            }
        }
        int[] ret = new int[j];
        System.arraycopy(tmp, 0, ret, 0, j);
        return ret;
    }
}
```

Simple to the core: about 70 lines of code. What does it do?

Given a set of strings, it finds the numbers of all strings in which a given character appears.

Google finds pages matching the keywords you give it; the code above finds strings matching the character you give it. The source code is done; all that remains is funding and an IPO. I am the next Google…

Although the code above is extremely simple, I will summarize each step carefully for the sake of the later Lucene analysis.

The program is divided into two parts, namely the two methods build and search.

The first is the build process:

  1. Iterate over the input strings to collect every character that appears.
  2. For each character, build a bit array in which position i records whether the character is present in string i: 1 means present, 0 means absent. For example, if ‘a’ appears in both input strings, its array is [1, 1].
  3. Return all characters together with their bit arrays; that map is the “index”.

The search process:

  1. If the queried character is not in the index, return null.
  2. Fetch the character’s bit array and collect the index of every position whose value is 1.
  3. Return those positions: the numbers of all strings in which the character appears.
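Looking ahead, this representation extends naturally to conjunctive queries: a query for two characters is just an element-wise AND of their bit arrays. This is a hypothetical extension of my own, not part of the code above:

```java
import java.util.Arrays;

// AND-combine the per-string bit arrays of two characters:
// a string matches only if both characters appear in it.
public class AndSearch {
    public static int[] and(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] & b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] hasA = {1, 1, 0}; // 'a' appears in strings 0 and 1
        int[] hasB = {0, 1, 1}; // 'b' appears in strings 1 and 2
        // Only string 1 contains both characters:
        System.out.println(Arrays.toString(and(hasA, hasB))); // prints [0, 1, 0]
    }
}
```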

Lucene source code architecture introduction

As mature open-source software, Lucene includes a number of modules, the core of which is the lucene-core package. It is divided into the following directories:

  • org.apache.lucene.analysis is mainly responsible for lexical analysis and language processing.
  • org.apache.lucene.codecs is responsible for encoding and decoding index content such as the inverted index.
  • org.apache.lucene.document provides the Document abstraction, composed of Fields, that clients work with.
  • org.apache.lucene.index is responsible for reading and writing indexes.
  • org.apache.lucene.search is responsible for the search process.
  • org.apache.lucene.store is responsible for index persistence and related concerns.
  • org.apache.lucene.util is a utility package.

conclusion

This article implemented a minimalist lucene-beta. It is, of course, not meant to actually replace Lucene; it is only a simple abstraction of full-text retrieval, mapping each piece of a trivial implementation to Lucene’s excellent one, to be studied one by one.

The last section briefly introduced the directories under the lucene-core package. The source-code study that follows will be guided by the problems in lucene-beta and proceed module by module, step by step.

The Lucene source code study has officially begun~


The end.

Contact me

Finally, you are welcome to follow my personal official account [Huyanshi], where I post backend engineers’ study notes from time to time. You can also reach me directly by private message on the account or by email; I will share everything I know without reserve.





All of the above are my personal thoughts; if there is any mistake, comments are welcome.

Reprints are welcome; please credit the author and keep the original link.

Email address:[email protected]

For more study notes, see my personal blog or follow the WeChat official account <huyan10>.