This is the 5th day of my participation in the August More Text Challenge
We've already talked about why we chose ES, so today let's see how it works. When you're dealing with a lot of data, you need good algorithms and good storage structures to manage it. Like a database, ES has an index; in ES it is called an inverted index, or you could call it a reverse index.
Inverted index
To talk about the inverted index, it helps to separate "inverted" from "index". The index part we already know well: clustered indexes, secondary indexes and so on have all come up in the database posts, so I won't go into detail. There is a constant law in computing: trade time for space, or space for time. An index is the classic case of spending space to save time (an inverted index spends even more space, but it is worth it). It records where data is stored, in a specific format.

A traditional database index retrieves data by the indexed value or the primary key ID, which is no problem for an average volume of data. But what about the massive amount of data on the web? The database would slow to a crawl or crash, and if users want fuzzy queries or related queries, I don't even know how you would write the SQL.

So an inverted index uses the keyword as the key: given a keyword, you find the file (or files) that contain it. Why is it still an index? Because it is impossible to scan all the data in real time on every search. Instead, the records are written into the index ahead of time, and the next search simply reads them back. Simple and efficient. (The easiest way to see that this is not real time: information you have just published on Baidu or another platform may not be searchable right away, but after a while it usually is.) An inverted index is a mapping structure of the form "keyword – document".
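The "keyword – document" mapping can be sketched in a few lines of Python. This is only a toy illustration, not how Elasticsearch stores its index: the `build_inverted_index` function and the sample documents are made up for this example, and tokenization is just lowercasing plus splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "elasticsearch uses an inverted index",
    2: "a database uses a clustered index",
    3: "elasticsearch is a search engine",
}

index = build_inverted_index(docs)
print(index["elasticsearch"])  # [1, 3]
print(index["index"])          # [1, 2]
```

The extra space goes into storing every keyword with its document list; in return, a lookup is a single dictionary access instead of a scan over all documents.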
How does an inverted index work
First, the user enters some keywords. The word segmentation system analyzes the request and produces a set of keywords, and each of those keywords is matched against its list of documents in the index. If the algorithm is good, it can also find documents related to the ones the keywords matched (for example, when you mistype a word, a search engine can often still give you the result you meant). The documents that come out are then sorted by weight. This ranking step is where SEO comes in: it is the stage SEO tries to influence.
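The query flow described here (segment the query, look up each keyword, sort by weight) might look roughly like the sketch below. The weight is simply how often the query terms occur in a document, which is a deliberate simplification chosen for this example; real search engines, Elasticsearch included, use far more elaborate scoring. The names `tokenize`, `build_index` and `search` are hypothetical.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Stand-in for a real word segmentation step.
    return text.lower().split()

def build_index(docs):
    """keyword -> {doc_id: number of occurrences in that document}"""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Return (doc_id, score) pairs, highest score first."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, term_frequency in index.get(term, {}).items():
            scores[doc_id] += term_frequency
    # Sort by descending score, then by doc_id to keep the order stable.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

docs = {1: "quick brown fox", 2: "quick dog", 3: "lazy cat"}
index = build_index(docs)
print(search(index, "Quick fox"))  # [(1, 2), (2, 1)]
```

Document 1 matches both keywords and ranks first; document 3 matches none and does not appear at all.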
The table above only briefly illustrates what word segmentation means; the actual algorithms are very complex. Each entry can also record the position of the word in the document.
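Recording positions could be sketched as follows. This is an illustrative structure rather than Elasticsearch's actual on-disk format, and `build_positional_index` and `adjacent` are names invented for the example. Keeping positions makes it possible to check whether two words appear next to each other, which is the basis of phrase matching.

```python
from collections import defaultdict

def build_positional_index(docs):
    """keyword -> {doc_id: [positions of the keyword in that document]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def adjacent(index, first, second):
    """IDs of documents where `second` appears directly after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(index["to"])                    # {1: [0, 4], 2: [0, 3]}
print(adjacent(index, "not", "to"))   # [1]
```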
Elasticsearch's built-in analyzers
- Standard Analyzer – the default analyzer; splits text into words and lowercases them
- Simple Analyzer – splits on non-letter characters (symbols are dropped) and lowercases
- Stop Analyzer – lowercases and filters out stop words (the, a, is)
- Whitespace Analyzer – splits on whitespace; does not lowercase
- Keyword Analyzer – does no segmentation; the input is emitted as-is as a single term
- Pattern Analyzer – splits by regular expression; the default is \W+ (split on non-word characters)
- Language Analyzers – analyzers for specific languages
- Custom Analyzer – an analyzer you define yourself
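To get a feel for how these analyzers differ, here is a rough Python imitation of several of them. These functions are approximations written for this post, not Elasticsearch code: the real Standard Analyzer follows Unicode text segmentation rules, the letter matching here is ASCII-only, and the stop word set is just the three words mentioned above.

```python
import re

STOP_WORDS = {"the", "a", "is"}  # tiny illustrative subset

def standard_like(text):
    # Approximation: runs of word characters, lowercased.
    return re.findall(r"\w+", text.lower())

def simple_like(text):
    # Split on anything that is not an (ASCII) letter, lowercased.
    return re.findall(r"[a-z]+", text.lower())

def stop_like(text):
    # Same as simple_like, then drop stop words.
    return [t for t in simple_like(text) if t not in STOP_WORDS]

def whitespace_like(text):
    # Split on whitespace only; case is preserved.
    return text.split()

def keyword_like(text):
    # The whole input becomes a single term.
    return [text]

def pattern_like(text, pattern=r"\W+"):
    # Split on the regular expression, lowercase, drop empty strings.
    return [t for t in re.split(pattern, text.lower()) if t]

sample = "The 2 QUICK Brown-Foxes jumped!"
print(standard_like(sample))    # ['the', '2', 'quick', 'brown', 'foxes', 'jumped']
print(simple_like(sample))      # ['the', 'quick', 'brown', 'foxes', 'jumped']
print(stop_like(sample))        # ['quick', 'brown', 'foxes', 'jumped']
print(whitespace_like(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped!']
print(keyword_like(sample))     # ['The 2 QUICK Brown-Foxes jumped!']
```

To see what the real analyzers produce for a given text, run it through Elasticsearch itself rather than relying on these imitations.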
The main components of an analyzer are:
- Character filters: receive the original character stream and change it by adding, deleting, or replacing characters, for example removing HTML tags from the text. An analyzer can have zero or more character filters.
- Tokenizer: breaks a whole piece of text into words. In English, a sentence can be split into words on spaces, but that does not work for Chinese, which needs a good segmentation algorithm (and a good algorithm engineer). An analyzer has exactly one tokenizer.
- Token filters: add, remove, or change the tokens produced by the tokenizer, for example lowercasing all English words, removing stop words such as a, and, the, or, or adding some necessary words. Token filters cannot change the position or offset of a token. An analyzer can have zero or more token filters.
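The three components chain together in a fixed order: character filters first, then the single tokenizer, then the token filters. A minimal Python sketch of that pipeline, with every function name invented for illustration and no claim to match Elasticsearch's internals:

```python
import re

def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    """Tokenizer: split the text into tokens on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"a", "and", "the", "or"})):
    """Token filter: drop stop words."""
    return [t for t in tokens if t not in stop_words]

def analyze(text, char_filters, tokenizer, token_filters):
    # Zero or more character filters run on the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # Exactly one tokenizer turns the text into tokens.
    tokens = tokenizer(text)
    # Zero or more token filters run on the token list, in order.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

result = analyze(
    "<p>The Cat and the Hat</p>",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(result)  # ['cat', 'hat']
```

The order of the token filters matters: if `stop_filter` ran before `lowercase_filter`, the capitalized "The" would slip through.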
Those are the general concepts. In the next chapter we'll look at how different analyzers produce different results. It will probably focus on English text and simple Chinese text, because for complex text it feels very hard to get ideal results from the built-in analyzers alone.
Take your time, get the concepts straight, and everything will fall into place.