1. The Basic Concept of Search Engines
The search engines we use daily grew out of modern information retrieval systems. The usual academic definition of modern information retrieval is: the process of finding information that satisfies a user's needs within a large-scale collection of unstructured data.
The term “unstructured” is defined in contrast to a classic database, where every record follows a strict field definition (a schema); such records are a typical example of “structured” data. For example, every dish has a name, so when you want to eat fish, querying for “boiled fish” by name is very efficient.
“Unstructured” data, by contrast, has no such strict definition. The large volumes of text stored on computers are a typical example: without special processing, we are in the dark about what an article describes (the name of a dish, the ingredients it requires, and so on), so we cannot map its content onto well-defined, easily matched database fields. This is one of the reasons databases are so weak at handling unstructured data.
At this point, the technology of information retrieval can greatly help us. The characteristic of unstructured data is that there is no strict definition of data format, which determines the following two core elements of information retrieval:
- Relevance: there is no strict definition of how relevant a returned result is to a user's query. Natural language can express very rich meanings, so the key is getting computers to better “understand” human information needs.
- Timeliness: solving relevance requires designing a reasonable model, and as model complexity grows, so does the amount of computation, so an efficient system implementation is needed.
2. Relevance
2.1 Boolean Model
Let's say I'm a fan of Sichuan boiled fish, whose spiciness is unforgettable. If I want to read an article about Sichuan cuisine, the easiest approach is to look for the keyword “boiled fish”. If the keyword is present (returns true), I consider the article relevant; if it isn't (returns false), I consider it irrelevant.
If you can convert this query into a Boolean expression, it’s very simple, with one condition:
Boiled fish
Of course, you might object: Chinese food has a long history, so how could boiled fish be the only classic Sichuan dish? Indeed. I'm also a big fan of steamed pork ribs, which were a staple of Chinese New Year celebrations when I was a kid. So, change the expression:
Boiled fish OR steamed pork ribs
Haha, now not only boiled fish but also “steamed pork ribs” will be identified as Sichuan cuisine. What if you want to see where you can eat these dishes in Shanghai? Modify it again:
(Boiled fish OR steamed pork ribs) AND Shanghai
Here, “Shanghai” is a necessary condition, while either boiled fish or steamed pork ribs will do. Finally, if we want only the articles that mention Shanghai restaurants serving both Sichuan dishes, we can write: boiled fish AND steamed pork ribs AND Shanghai.
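The queries above can be sketched in a few lines of Python. This is a minimal illustration of the Boolean model, not a real engine; the toy documents and the `contains` helper are assumptions made for the example.

```python
# Minimal sketch of the Boolean model: a document is relevant
# if and only if the Boolean expression evaluates to true.

def contains(doc: str, term: str) -> bool:
    """Return True if the keyword occurs in the document text."""
    return term in doc

# Toy documents (illustrative only).
docs = {
    1: "This Shanghai restaurant serves excellent boiled fish",
    2: "A steamed pork ribs recipe for Chinese New Year",
    3: "Where to find boiled fish in Chengdu",
}

# (boiled fish OR steamed pork ribs) AND Shanghai
def query(d: str) -> bool:
    return (contains(d, "boiled fish") or contains(d, "steamed pork ribs")) \
        and contains(d, "Shanghai")

matches = [doc_id for doc_id, text in docs.items() if query(text)]
print(matches)  # [1] -- only document 1 mentions Shanghai plus one of the dishes
```

Each document gets a hard true/false verdict; there is no notion of one match being better than another, which is exactly the weakness discussed below.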
Another common extension is the Proximity operator, which requires keywords to appear within a certain distance of each other. The assumption is that search terms are more relevant to each other when they appear closer together.
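A proximity check can be sketched as follows, assuming (purely for illustration) that term positions are token indices within the document:

```python
# Sketch of a Proximity check: two terms satisfy NEAR/k if some pair
# of their occurrences is at most k tokens apart.

def near(tokens: list, a: str, b: str, k: int) -> bool:
    """True if an occurrence of a and an occurrence of b
    are within k token positions of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

tokens = "best boiled fish restaurant in shanghai".split()
print(near(tokens, "boiled", "shanghai", 4))  # True: positions 1 and 5
print(near(tokens, "boiled", "shanghai", 3))  # False: distance is 4
```

Real systems answer this from stored word positions in the index rather than rescanning the text, a point revisited in the inverted-index section.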
In general, the advantages of the Boolean model are that it is simple, easy to understand, and cheap to implement. Its weakness is that it cannot characterize degrees of relevance. Relevance is a fuzzy concept: some articles match the query closely and fit the user's information needs well, while others do not. The two values “true” and “false” alone are too absolute and cannot reflect these differences. So, is there a better solution?
2.2 Ranked Boolean Model
To enhance the Boolean model, we need a way to score matched documents: the more relevant the document, the higher the score. The first and most intuitive idea is to give each word a different weight in each document and use those weights to compute the score. The most commonly used weighting scheme is TF-IDF.
Suppose we have a collection of documents. Let C denote the entire collection, d one of its articles, and t a word. TF stands for Term Frequency: the number of times the word t appears in an article (or in one field of the article). The general assumption is that the higher the TF of a word in an article, the more important that word is to the article. Of course, longer documents naturally tend to have higher TF values; we will discuss how to make TF more reasonable later in the introduction to Lucene.
Meanwhile, another commonly used quantity is IDF, which stands for Inverse Document Frequency. First, DF stands for Document Frequency: the number of documents in collection C in which the word t appears. The general assumption is that the more documents of C a word t appears in, the lower its importance, and vice versa.
It may feel a little confusing at first, but it's not hard to understand once you think about it. Words like “of, you, me, yes” appear constantly in documents but carry little meaning. For another example, in a collection of documents discussing food, the word “food” may appear in tens of thousands of articles, yet it doesn't make any one document special. On the other hand, if only three articles discuss boiled fish, those three articles are far more relevant to “boiled fish” than the others, so the term should carry more weight in this document collection. To capture this, we usually use IDF, a quantity that varies inversely with DF, with the basic formula:
IDF(t) = log(N / DF(t))

Where N is the number of articles in the whole document set, and the log dampens the value so that the IDF score does not far outweigh TF; its base defaults to 10. Therefore, the lower the DF of the word t, the higher its IDF, and the greater the importance of t. Putting the two together, the basic formula of TF-IDF is:

TF-IDF(t, d) = TF(t, d) × IDF(t)
That is, the higher the frequency TF of a word t in document d, and the higher its IDF in the collection as a whole, the more important t is to d.
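These definitions can be sketched directly in Python. The toy corpus and whitespace tokenization below are simplifying assumptions for illustration; real Chinese text would first go through word segmentation.

```python
import math

# Minimal TF-IDF sketch following the formulas above:
#   TF(t, d) = number of times t occurs in d
#   IDF(t)   = log10(N / DF(t))
#   TF-IDF(t, d) = TF(t, d) * IDF(t)

corpus = [
    "boiled fish is a classic sichuan dish",
    "steamed pork ribs recipe",
    "sichuan food guide and sichuan restaurants",
]
tokenized = [doc.split() for doc in corpus]

def tf(term: str, doc_tokens: list) -> int:
    return doc_tokens.count(term)

def idf(term: str, docs: list) -> float:
    n = len(docs)
    df = sum(1 for tokens in docs if term in tokens)
    return math.log10(n / df) if df else 0.0

def tf_idf(term: str, doc_tokens: list, docs: list) -> float:
    return tf(term, doc_tokens) * idf(term, docs)

print(tf("sichuan", tokenized[2]))              # 2 occurrences
print(round(idf("sichuan", tokenized), 4))      # log10(3/2) = 0.1761
print(round(tf_idf("sichuan", tokenized[2], tokenized), 4))  # 0.3522
```

Note that a word appearing in every document gets IDF = log10(N/N) = 0, which is exactly the desired behavior for stopwords like “of”.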
2.3 Vector Space Model
The ranked Boolean model is a scoring mechanism based on weighted summation. Next, we introduce the vector space model, which is slightly more complex. The focus of this model is converting each document into a vector. Take the article above as an example.
If the article contains 50 unique words, then its vector has 50 dimensions, where each dimension holds the TF-IDF value of the corresponding word. It looks like this:
(Sichuan = 1.84, poached fish = 6.30, tip of the tongue = 6.80…)
Of course, when the system is implemented, each dimension is identified not by the word itself but by a word ID. In this way, a collection of documents is transformed into a set of vectors, each representing one document. This representation ignores the order in which words appear in the text, which greatly simplifies computation in many models while preserving considerable accuracy. We usually call this processing method Bag of Words, which was mentioned in the discussion of text classification and clustering in the first article.
Similarly, a user's query can be converted into a vector, although the query vector is very sparse compared with a document vector. The relevance problem is then reduced to computing the similarity between the query vector and each document vector. In practice, cosine similarity is the most commonly used measure. For non-negative weights it is exactly a number between 0 and 1: it equals 1 when the two vectors point in the same direction and 0 when they are orthogonal, which matches the idea of similarity as a percentage. The formula is:

cos(q, d) = (q · d) / (|q| × |d|)
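Because both vectors are sparse, it is convenient to represent them as dicts mapping terms to weights. A minimal sketch, with made-up illustrative weights:

```python
import math

# Cosine similarity between sparse term-weight vectors, represented as
# dicts mapping term -> TF-IDF weight.

def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

document = {"Sichuan": 1.84, "poached fish": 6.30, "recipe": 0.50}
query = {"poached fish": 1.0}

# The query points almost entirely along the document's dominant term,
# so the similarity is close to 1.
print(round(cosine(document, query), 2))  # 0.96
```

Only terms shared by both vectors contribute to the dot product, which is what makes the sparse representation efficient in practice.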
The graphical explanation is shown in the figure below:
Compared with the standard Boolean model, the vector space model has the following advantages:

- A simple model based on linear algebra, very easy to understand
- Term weights can be non-binary, such as TF-IDF
- Continuous similarity between documents and queries can be computed and used for ranking, rather than being limited to the Boolean model's “true” and “false”
- Partial matching of keywords is allowed
Of course, the basic vector space model also has many shortcomings: similarity scores are not very good for very long documents; it does not capture the semantic meaning of words and is limited to exact term matching; it ignores the order in which words appear in a document; and so on.
3. Timeliness
From the discussion so far, we found that solving relevance requires designing a reasonable model for the requirements, and the more sophisticated the model, the higher the computational complexity. At the same time, we want very high query efficiency. In the Internet age, users are ordinary surfers, and a smooth experience is crucial, so search results must come back within seconds, usually no more than three. It would be unthinkable and unacceptable to sit in front of a computer and wait several minutes just to find out what the weather will be like in Guangzhou tomorrow. The complexity of relevance computation and the demand for speed seem to be an irreconcilable contradiction. How do we resolve it?
Here we must mention the most classic data structure in search engine design: the inverted index (also called a reverse index). Imagine for a moment that you are an avid reader. When you enter a library or bookstore, how do you quickly find the books you like? Right, by the labels on the bookshelves. If you see a shelf labeled “Cooking, Local Cuisine”, you are well on your way to a book on Sichuan cuisine. An inverted index does exactly this tagging job. Take a look at the example below, which shows raw data before inverted indexing (real articles would of course not be this short).
For each article's content, Chinese word segmentation is carried out first, and the resulting words are used as the article's tags. For example, the title of article 1, “The Most Addictive Sichuan Cuisine”, can be segmented into the following five words:
The Most, Makes People, Addictive, Of, Sichuan Cuisine
Then article 1 will have five keyword tags.
Next, analyze article 2, “Chef's Must-Read Series: Classic Sichuan Cuisine”, which can be segmented into the following five words:
Chef, Must-read, Series, Classics, Sichuan Cuisine
Combining this with the tagging result for article 1 gives:
Please note that the keyword “Sichuan cuisine”, with keyword ID 5, appears in both articles 1 and 2, so we write “1, 2” in the corresponding document ID column.
After this analysis of all five articles, we get the following table, which is the prototype of an inverted index.
There are 18 distinct words in total, which we call the dictionary or vocabulary of the document set. As the structure above shows, building an inverted index transforms the document-to-keywords relationship into a keyword-to-documents relationship, while gradually building up the dictionary. With this data structure, searching for a keyword becomes as quick and easy as finding books by shelf labels in a library, and efficiency improves dramatically. For example, suppose we search for:
Sichuan cuisine AND tip of the tongue
By looking up “Sichuan cuisine”, the system returns article IDs 1, 2, 3; by looking up “tip of the tongue”, it returns article IDs 3, 4. Taking the intersection, we find article 3 and return it. Intersection (list-merging) operations are very mature in computing and extremely fast.
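The whole process can be sketched in a few lines of Python. The mini-corpus below is a hypothetical abbreviation of the five-article example, just enough to reproduce the posting lists described above:

```python
from collections import defaultdict

# Toy inverted index: map each keyword to the set of document IDs
# containing it, then answer an AND query by intersecting posting lists.

docs = {
    1: ["most", "addictive", "sichuan cuisine"],
    2: ["chef", "must-read", "series", "classics", "sichuan cuisine"],
    3: ["sichuan cuisine", "tip of the tongue"],
    4: ["tip of the tongue", "documentary"],
}

index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

def and_query(*terms):
    """Intersect the posting lists of all query terms."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(and_query("sichuan cuisine", "tip of the tongue"))  # [3]
```

Production systems keep posting lists sorted and merge them with a linear two-pointer scan (or skip lists) instead of Python sets, but the principle is the same: the query touches only the lists for its own terms, never the full corpus.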
To support Proximity operations in the Boolean model, or other scoring models, we can also add word positions and TF-IDF information to this data structure; we won't go into details here. The key conclusion is that the inverted index lets us maintain extremely high query efficiency.