This article will examine the principles of full text search (based on Lucene), how it creates indexes, how it queries indexes, and why it is so fast! Is it faster than you?
Full text retrieval principle description
There are two main methods to search unstructured data, namely full-text data:
One is Serial Scanning: The so-called sequential scanning, for example, to find a file containing a certain string, is a document by document, for each document, from the beginning to the end, if the document contains the string, the document is the file we are looking for, and then the next file, until all the files have been scanned.
If you use Windows search, you can also search for file content, but rather slowly. If you have an 80GB hard drive and you want to find a file with a certain string in it, you can’t do it for hours.
The grep command in Linux is also used in this way. We may feel that this method is primitive, but for small data files, this method is the most direct, the most convenient. But for a large number of files, this approach is slow.
Some people may say that the sequential scanning of unstructured data is slow, but the search of structured data is relatively fast (because the structured data has a certain structure, certain search algorithms can be adopted to speed up the speed), so it is ok to try to make our unstructured data have a certain structure.
This idea is very natural, but constitutes the basic idea of full-text retrieval, which is to extract some information from unstructured data, reorganize it, make it become a certain structure, and then search the data with a certain structure, so as to achieve the purpose of relatively fast search.
This piece of information extracted from unstructured data and then reorganized is called an index.
For example, the pinyin table and radical check list of dictionaries are equivalent to the index of dictionaries. The interpretation of each word is unstructured. If the dictionary does not have syllabary and radical check list, it can only scan a word in order in the vast sea of words. However, some information of the character can be extracted for structural processing, such as pronunciation, it is more structured, divided into initials and finals, respectively, only several can be enumerated, so the pronunciation is taken out in a certain order, each pronunciation points to the detailed explanation of the page number of the word. We search for sounds in structured pinyin, and then by the number of pages they lead to, we find our unstructured data — the interpretation of the word.
This process of first building an index and then searching the index is called full-text Search.
Full-text retrieval is generally divided into two processes, Indexing and Search.
Index creation: The process of extracting information from all structured and unstructured data in the real world to create an index.
Search index: is to get the user’s query request, search the index created, and then return the results of the process.
Therefore, there are three important problems in full-text retrieval:
-
What exactly is in the index? (Index)
-
How do I create an index? (Indexing)
-
How do I search an index? (Search)
Let’s examine each question in order.
What’s in the index?
What exactly do you need to store in an index?
Let’s start by looking at why sequential scanning is slow: it’s actually caused by a discrepancy between the information we want to search for and the information stored in unstructured data. The information stored in unstructured data is what strings each file contains, that is, known files, which are relatively easy to extract, that is, mappings from files to strings. And the information that we want to search for is what files contain this string, which is a known string, desired file, which is a mapping from string to file. It’s the opposite. So if the index can always hold a mapping from a string to a file, it will greatly speed up the search. Because the mapping from string to file is the reverse of the file-to-string mapping, an index that holds this information is called a reverse index.
The information stored in a reverse index is generally as follows: Suppose I have 100 documents in my document collection. For ease of representation, we number the documents from 1 to 100 and get the following structure:
On the left is a series of strings, called dictionaries.
Each string points to a Document List that contains the string, which is called an inversion List. With an index, the information saved is the same as the information to be searched, which can greatly speed up the search.
For example, if we want to find a document that contains both the string “lucene” and the string “solr”, we just need the following steps:
- Retrieves a linked list of documents containing the string “Lucene”.
- Retrieves a linked list of documents containing the string “solr”.
- Find the files that contain both “Lucene” and “solr” by merging the linked lists.
Looking at this, one might say that full-text search does speed up the search, but with the indexing process, the two don’t necessarily add up to much faster than sequential scanning. Indeed, with the indexing process, full-text retrieval is not necessarily faster than sequential scanning, especially when data volumes are small. Indexing a large amount of data is also a slow process.
However, there are differences between the two, sequential scanning is to scan every time, and the process of creating an index only needs one time, later is once and for all, each search, the process of creating an index does not have to go through, only the search created the index can be.
This is one of the advantages of full-text search over sequential scanning: one index, many uses.
How do I create an index?
The index creation process of full-text search generally has the following steps:
Step 1: Some original documents to index.
File 1: Students should be allowed to go out with their friends, but not allowed to drink beer. My friend Jerry went to school to see his students but found them drunk which is not allowed.Copy the code
Step 2: Pass the original document to the Tokenizer.
A Tokenizer does several things (this process is called Tokenize) :
-
Divide the document into individual words.
-
Get rid of punctuation.
-
Get rid of Stop words.
The so-called Stop words are some of the most common words in a language. Because they have no special meaning, they cannot be searched in most cases. Therefore, when creating an index, such words will be removed to reduce the size of the index.
In English, Stop word such as “the”, “a”, “this” and so on.
For each language Tokenizer, there is a set of stop words.
The result of a Tokenizer is called a Token.
In our example, we get the following tokens: "Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "Students", "found", "them", "drunk", "allowed".Copy the code
Step 3: Send the obtained Token to the language Processor.
The linguistic processor is responsible for linguistic processing of the obtained tokens.
For English, the Linguistic Processor will do the following:
-
Lowercase (Lowercase).
-
Reduce words to roots, such as cars to car. This operation is called stemming.
-
Change words into root forms, such as “drove” to “drive.” This operation is called leMMTRANSCEND.
Similarities and differences between Stemming and lemmidiom: Similarities: Both Stemming and lemmidiom enable words to become root form. The two methods are different: Stemming uses the "reduced" method: "cars" to "car", "driving" to "drive". Lemmuncanny uses a "transformation" approach: "drove" to "drove", "driving" to "drive". The two algorithms are different: Stemming mainly adopts some fixed algorithm to do this reduction, such as removing "S", removing "ing" and "e", changing "ational" to "ate" and "tional" to "tion". Lemmtranscend this transformation primarily through the way in which a dictionary is preserved. For example, if the dictionary had a mapping from "driving" to "drive," "drove" to "drive," or "am, is, are" to "be," you could just look it up in the dictionary. Stemming and Lemmatization are not mutually exclusive, but overlap. Some words can achieve the same transformation using both methods.Copy the code
The result of the language processing component (linguistic Processor) is called the Term.
In our example, after language processing, we get the following Term: "Student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "Jerry", "go", "school", "see", "his", "Student", "find", "them", "drink", "allow".Copy the code
And it was because of the language processing that drove could be searched, and drive could be searched.
Step 4: Pass the resulting Term to the Indexer component.
An Indexer component does several things:
-
Create a dictionary with the resulting Term.
In our example the dictionary looks like this:Copy the code
-
Sort the dictionary in alphabetical order.
3. Merge the same Term into Posting List.
In this table, there are several definitions:
Document Frequency refers to the number of documents that contain this Term.
Frequency is the Term Frequency, indicating that the file contains several terms.
Therefore, for the word “allow”, there are two documents that contain the word, so the linked list of documents after the word “Term” has two items in total. The first item represents the first document that contains the word “allow”, namely document No. 1. In this document, “allow” appears twice. The second item indicates the second document that contains “allow”, document No. 2, in which “allow” appears once.
At this point, the index has been created and we can use it to quickly find the document we want.
And we were surprised to find that searches for “drive,” “driving,” “drove,” and “driven” could also be searched. Because in our index, “driving”, “drove” and “driven” could all be changed into “drive” through language processing. In the search, if you input “driving”, the input query statement would also go through one or three steps here, thus becoming the query “drive”. So you can search for the document you want.
How do I search an index?
At this point it seems we can declare “we found the document we were looking for”.
However, the matter is not over, found is only one aspect of the full text search. Isn’t it?
If only one or ten documents contain the string we are looking for, we do find it. But what if it turns out to be a thousand, or even thousands? Which file do you want most?
Step 1: The user enters a query statement.
Query statements, like our ordinary language, also have a certain syntax.
For example, the user enters the statement lucene AND learned NOT Hadoop. The user wants to find a document that contains Lucene and learned, but not Hadoop.Copy the code
The second step: lexical analysis, syntax analysis and language processing of query statements.
Since the query statement has syntax, it also needs to be analyzed, analyzed and processed.
Step 3: Search the index for documents that conform to the syntax tree.
There are several small steps to this:
1. Firstly, in the reverse index table, find the document linked list containing Lucene, Learn and Hadoop respectively.
2. Secondly, merge the linked list containing Lucene and Learn to obtain the document linked list containing both Lucene and Learn.
3. Then, the document linked list is separated from the document linked list of Hadoop to remove the documents containing Hadoop, so as to obtain the document linked list containing both Lucene and Learn and not containing Hadoop.
4. This document linked list is the document we are looking for.
Step 4: Sort the results based on the relevance of the resulting documents to the query statement.
Although we got the document we wanted in the previous step, the query results should be sorted by their relevance to the query statement, and the more relevant, the more advanced.
conclusion
1. Indexing process
1) There are a series of indexed files.
2) The indexed file is formed into a series of terms through grammatical analysis and language processing.
3) Form dictionary and reverse index table through index creation.
4) Write indexes to hard disks through index storage.
2. Search process:
A) User input query statement.
B) A series of terms are obtained from the query statements through syntax and language analysis.
C) Obtain a query tree through parsing.
D) Read indexes into memory through index storage.
E) Use the query tree to search the index, so as to get the document linked list of each Term, cross and cross the document linked list, and get the result document.
F) Rank the relevance of the query by the search result documents.
G) Return the query result to the user.
Reference: Lucene: The fundamentals of full-text retrieval
The next article will introduce solr and ElasticSearch selection.