Inverted index
- Forward index: maps a document ID to the words it contains
- Inverted index: maps a word to the IDs of the documents that contain it
Example: construct an inverted index for three documents after removing stop words (a hypothetical sketch follows the query walkthrough below)
Inverted index – query process
Query for documents that contain “search engine”:
- Look up “search engine” in the inverted index to obtain the list of matching document IDs, here 1 and 3
- Fetch the complete contents of documents 1 and 3 through the forward index
- Return the final result
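A minimal sketch of both indexes, using three hypothetical documents (the original example documents are not reproduced here); documents 1 and 3 both contain “search engine”, matching the walkthrough above:

# Hypothetical documents (stop words removed):
#   1: "elasticsearch is the best search engine"
#   2: "elasticsearch is a distributed document store"
#   3: "lucene powers the search engine"

# Forward index: document ID -> words
{
  "1": ["elasticsearch", "best", "search", "engine"],
  "2": ["elasticsearch", "distributed", "document", "store"],
  "3": ["lucene", "powers", "search", "engine"]
}

# Inverted index: word -> document IDs (posting list)
{
  "elasticsearch": [1, 2],
  "best":          [1],
  "search":        [1, 3],
  "engine":        [1, 3],
  "distributed":   [2],
  "document":      [2],
  "store":         [2],
  "lucene":        [3],
  "powers":        [3]
}

# Query "search engine": intersect the lists for "search" and "engine" -> [1, 3],
# then fetch documents 1 and 3 from the forward index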
Inverted index – composition
- Term Dictionary
- Posting List
Term Dictionary
The term dictionary is generally implemented as a B+ tree; the online B+ Tree Visualization tool shows step by step how a B+ tree is built.
B trees and B+ trees
- Wikipedia: B-tree
- Wikipedia: B+ tree
- Illustrated explanation of B-tree and B+ tree insertion and deletion
Posting List
- The posting list records the set of documents that contain a given word; it consists of posting entries
- A posting entry mainly contains the following information:
  - Document ID, used to retrieve the original document
  - Term frequency (TF), the number of times the word occurs in the document, used for relevance scoring
  - Position, the word's position(s) within the document, used for phrase search
  - Offset, the start and end character positions of the word in the document, used for highlighting
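As an illustration, a single posting entry for the term “search” in hypothetical document 1 (“elasticsearch is the best search engine”) could be pictured like this; the field names and numbers are made up for illustration and are not Lucene's actual on-disk format:

{
  "doc_id": 1,
  "tf": 1,
  "position": [4],
  "offset": { "start": 26, "end": 32 }
}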
The internal nodes of a B+ tree store the index and the leaf nodes store the data; here the term dictionary is the B+ tree index and the posting lists are the data.
Note: does the B+ tree index compare terms by Unicode code point or by pinyin?
ES stores documents in JSON format; a document contains multiple fields, and each field has its own inverted index.
Analysis (word segmentation)
Word segmentation is the process of converting text into a series of words (tokens). It is also called text analysis, and in ES it is referred to as Analysis.
Analyzer
An Analyzer is the ES component dedicated to word segmentation. It consists of the following parts:
- Character Filters: Process raw text, such as removing HTML tags
- Tokenizer: Splits the original text into words according to certain rules
- Token Filters: further process the tokens produced by the Tokenizer, e.g. lowercasing, deleting tokens, or adding new ones
Analyzer call order: Character Filters → Tokenizer → Token Filters
Analyze API
ES provides an API for testing analysis so that you can verify how text is tokenized. The endpoint is _analyze
- You can specify an analyzer directly for testing
- You can specify a field of an index for testing, for example:
POST test_index/doc
{
  "username": "whirly",
  "age": 22
}

POST test_index/_analyze
{
  "field": "username",
  "text": ["hello world"]
}
- You can also define a custom analyzer inline for testing:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["Hello World"]
}
Built-in analyzers
ES ships with the following built-in analyzers:
- Standard Analyzer
  - The default analyzer
  - Splits text by words; supports multiple languages
  - Lowercases tokens
- Simple Analyzer
  - Splits on non-letter characters
  - Lowercases tokens
- Whitespace Analyzer
  - Splits on whitespace
- Stop Analyzer
  - Like Simple Analyzer, but also removes stop words
  - Stop words are function words such as the, an, this, etc.
- Keyword Analyzer
  - Outputs the input unchanged as a single token, without any segmentation
- Pattern Analyzer
  - Splits with a custom regular expression
  - The default pattern is \W+, i.e. non-word characters act as delimiters
- Language Analyzers
  - Provide analyzers for more than 30 common languages
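For example, testing the default standard analyzer; the output should be lowercased word tokens, with stop words such as “the” kept:

POST _analyze
{
  "analyzer": "standard",
  "text": ["The 2 QUICK Brown Foxes."]
}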
Example: the Stop Analyzer

POST _analyze
{
  "analyzer": "stop",
  "text": ["The 2 QUICK Brown Foxes jumped over the lazy dog's bone."]
}
# Result
{
  "tokens": [
    { "token": "quick",  "start_offset": 6,  "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown",  "start_offset": 12, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "foxes",  "start_offset": 18, "end_offset": 23, "type": "word", "position": 3 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "word", "position": 4 },
    { "token": "over",   "start_offset": 31, "end_offset": 35, "type": "word", "position": 5 },
    { "token": "lazy",   "start_offset": 40, "end_offset": 44, "type": "word", "position": 7 },
    { "token": "dog",    "start_offset": 45, "end_offset": 48, "type": "word", "position": 8 },
    { "token": "s",      "start_offset": 49, "end_offset": 50, "type": "word", "position": 9 },
    { "token": "bone",   "start_offset": 51, "end_offset": 55, "type": "word", "position": 10 }
  ]
}
Chinese word segmentation
- Difficulties
  - Chinese word segmentation splits a sequence of Chinese characters into separate words. In English, words are naturally delimited by spaces, whereas Chinese has no formal delimiter
  - The correct segmentation depends on context; crossing ambiguities are a typical example
- Common word segmentation systems
  - IK: Chinese and English segmentation, custom dictionaries, hot dictionary updates
  - jieba: segmentation and part-of-speech tagging, traditional Chinese, custom dictionaries, parallel segmentation, etc.
  - HanLP: a Java toolkit of models and algorithms that aims to bring natural language processing into production environments
  - THULAC: Chinese word segmentation and part-of-speech tagging
Installing the IK Chinese analysis plugin
Run the following command in the Elasticsearch installation directory, then restart ES:
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
If the installation fails because the network is slow, download the zip package first, change the path in the following command to its actual location, run it, and then restart ES:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-ik-6.3.0.zip
- IK test: ik_smart (the input text means “Ministry of Public Security: school buses around the country will enjoy the highest right of way”)
POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["公安部:各地校车将享最高路权"]
}

# Result
{
  "tokens": [
    { "token": "公安部", "start_offset": 0,  "end_offset": 3,  "type": "CN_WORD", "position": 0 },
    { "token": "各地",   "start_offset": 4,  "end_offset": 6,  "type": "CN_WORD", "position": 1 },
    { "token": "校车",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD", "position": 2 },
    { "token": "将",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR", "position": 3 },
    { "token": "享",     "start_offset": 9,  "end_offset": 10, "type": "CN_CHAR", "position": 4 },
    { "token": "最高",   "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 5 },
    { "token": "路",     "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 6 },
    { "token": "权",     "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 7 }
  ]
}
- IK test: ik_max_word (same input text)
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["公安部:各地校车将享最高路权"]
}

# Result
{
  "tokens": [
    { "token": "公安部", "start_offset": 0,  "end_offset": 3,  "type": "CN_WORD", "position": 0 },
    { "token": "公安",   "start_offset": 0,  "end_offset": 2,  "type": "CN_WORD", "position": 1 },
    { "token": "部",     "start_offset": 2,  "end_offset": 3,  "type": "CN_CHAR", "position": 2 },
    { "token": "各地",   "start_offset": 4,  "end_offset": 6,  "type": "CN_WORD", "position": 3 },
    { "token": "校车",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD", "position": 4 },
    { "token": "将",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR", "position": 5 },
    { "token": "享",     "start_offset": 9,  "end_offset": 10, "type": "CN_CHAR", "position": 6 },
    { "token": "最高",   "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "路",     "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 8 },
    { "token": "权",     "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 9 }
  ]
}
- What is the difference between ik_max_word and ik_smart?
  - ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” (“the national anthem of the People's Republic of China”) is split into every possible combination of sub-words, such as 中华人民共和国, 中华人民, 中华, 人民共和国, 人民, 共和国, 国歌, and so on
  - ik_smart: splits at the coarsest granularity. The same text is split into just 中华人民共和国 and 国歌
Custom analysis
When the built-in analyzers do not meet your needs, you can build a custom analyzer by combining Character Filters, a Tokenizer, and Token Filters
Character Filters
- Process the raw text before it reaches the Tokenizer, e.g. adding, deleting, or replacing characters
- Built-in character filters:
  - HTML Strip Character Filter: strips HTML tags and decodes HTML entities
  - Mapping Character Filter: performs character replacement
  - Pattern Replace Character Filter: performs regular-expression match and replace
- Character filters affect the position and offset information produced by the subsequent Tokenizer
Character Filters test
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# Result
{
  "tokens": [
    { "token": """ I'm so happy! """, "start_offset": 0, "end_offset": 32, "type": "word", "position": 0 }
  ]
}
Tokenizers
- Splits the raw text into words (terms/tokens) according to certain rules
- Built-in tokenizers:
  - Standard: splits by word
  - Letter: splits on non-letter characters
  - Whitespace: splits on whitespace
  - UAX URL Email: like Standard, but keeps email addresses and URLs as single tokens
  - NGram and Edge NGram: split into (edge) n-grams
  - Path Hierarchy: splits on file-path separators
Tokenizers test
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/path/to/file"]
}

# Result
{
  "tokens": [
    { "token": "/path",         "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "/path/to",      "start_offset": 0, "end_offset": 8,  "type": "word", "position": 0 },
    { "token": "/path/to/file", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 }
  ]
}
Token Filters
- Add, delete, or modify the tokens produced by the Tokenizer
- Built-in token filters:
  - Lowercase: converts all tokens to lowercase
  - Stop: removes stop words
  - NGram and Edge NGram: split tokens into (edge) n-grams
  - Synonym: adds synonyms for a token (a sketch follows the test below)
Token Filters test
POST _analyze
{
  "text": ["a Hello World!"],
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}

# Result
{
  "tokens": [
    { "token": "hell", "start_offset": 2, "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "ello", "start_offset": 2, "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "worl", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "orld", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 }
  ]
}
Custom analyzers
A custom analyzer is defined in the index settings by configuring char_filter, tokenizer, filter, and analyzer
Example of a custom analyzer:
- Analyzer name: my_custom_analyzer
- The uppercase filter converts tokens to uppercase
PUT test_index_1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "uppercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
Custom analyzer test
POST test_index_1/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# Result
{
  "tokens": [
    { "token": "I'M",   "start_offset": 3,  "end_offset": 11, "type": "<ALPHANUM>", "position": 0 },
    { "token": "SO",    "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 1 },
    { "token": "HAPPY", "start_offset": 18, "end_offset": 27, "type": "<ALPHANUM>", "position": 2 }
  ]
}
Notes on using analysis
Analysis is applied in the following two situations:
- Index time: when a document is created or updated, the document's fields are analyzed
- Search time: the query string is analyzed
  - The analyzer parameter can specify the analyzer at query time
  - Or search_analyzer can be set in the index mapping (a sketch follows this list)
  - In general, do not specify a separate search-time analyzer; use the index-time analyzer, otherwise queries may fail to match
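A sketch of setting search_analyzer through the index mapping; the index and field names are hypothetical, the mapping uses the 6.x single-type style seen earlier, and ik_max_word/ik_smart assume the IK plugin is installed:

PUT test_index_2
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}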
Suggestions for using analysis
- Decide explicitly whether each field needs analysis; set type to keyword for fields that do not, which saves space and improves write performance (see the mapping sketch after this list)
- Use the _analyze API to inspect how text is tokenized
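A sketch of the keyword suggestion above, with hypothetical index and field names in the same 6.x mapping style:

PUT test_index_3
{
  "mappings": {
    "doc": {
      "properties": {
        "status":  { "type": "keyword" },
        "content": { "type": "text", "analyzer": "ik_max_word" }
      }
    }
  }
}

# status is stored as a single un-analyzed term; content is analyzed at index time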
For more information please visit my personal website: laijianfeng.org
- Elasticsearch official documentation
- imooc course: Elastic Stack from getting started to practice
Welcome to follow my wechat official account