Elasticsearch Analysis and Analyzers
- Analysis (text analysis) is the process of converting full text into a series of terms (tokens), also known as word segmentation
- Analysis is carried out by an analyzer
- You can use Elasticsearch's built-in analyzers or define custom ones as needed
- Terms are produced by the analyzer when data is written; at search time the query string must be analyzed with the same analyzer so that it matches the indexed terms
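To see why the query has to go through the same analysis as the indexed text, here is a minimal sketch in plain Python. It is a toy inverted index, not Elasticsearch's implementation; the function names and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): split on non-letters and lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyze_query=True):
    """Return ids of documents that contain every query term."""
    terms = analyze(query) if analyze_query else query.split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {1: "Quick brown-foxes leap", 2: "Lazy dogs in the summer evening"}
index = build_index(docs)

print(search(index, "QUICK Foxes"))                       # query analyzed like the documents
print(search(index, "QUICK Foxes", analyze_query=False))  # raw query terms, no analysis
```

The first search finds document 1 because the query terms are lowercased just like the indexed terms; the second finds nothing because "QUICK" and "Foxes" never appear in the index in that form.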
Analyzer composition
An analyzer is the component dedicated to word segmentation and consists of three parts
- Character Filters (process the raw text, for example removing HTML)
- Tokenizer (splits the text into tokens according to rules)
- Token Filters (process the resulting tokens: lowercasing, removing stop words, adding synonyms)
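The three stages can be sketched as a small pipeline in plain Python. This is only an illustration of the order of operations, assuming a regex-based HTML stripper, a `\W+` tokenizer, and a tiny hypothetical stop word list; Elasticsearch's real components are more elaborate.

```python
import re

STOP_WORDS = {"the", "a", "is", "in"}  # hypothetical, deliberately small list


def strip_html(text):
    """Character filter: remove HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split the text on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text) if t]


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens that are in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Run character filter, then tokenizer, then token filters in order."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The Quick</b> brown-foxes leap in the summer"))
# ['quick', 'brown', 'foxes', 'leap', 'summer']
```

The order matters: stop word removal runs after lowercasing, so "The" is caught by a lowercase stop word list.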
Using analyzers for word segmentation
Built-in analyzers:
- Standard Analyzer – The default analyzer; splits on word boundaries and lowercases
- Simple Analyzer – Splits on non-letter characters (symbols are dropped) and lowercases
- Stop Analyzer – Lowercases and filters stop words (the, a, is)
- Whitespace Analyzer – Splits on whitespace and does not lowercase
- Keyword Analyzer – Does not split; outputs the input as a single term
- Pattern Analyzer – Splits by regular expression, default \W+ (non-word characters)
- Language Analyzers – Provide word segmentation for more than 30 common languages
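The differences between these analyzers are easier to see side by side. The sketch below approximates four of them in plain Python for ASCII text only; the function names are made up, and the real analyzers handle Unicode and many edge cases that these one-liners ignore.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def simple(text):
    """Approximates the simple analyzer: split on non-letters, lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def whitespace(text):
    """Approximates the whitespace analyzer: split on whitespace only."""
    return text.split()


def keyword(text):
    """Approximates the keyword analyzer: the whole input is one term."""
    return [text]


def pattern(text, regex=r"\W+"):
    """Approximates the pattern analyzer: split on a regex, lowercase."""
    return [t.lower() for t in re.split(regex, text) if t]


for analyzer in (simple, whitespace, keyword, pattern):
    print(analyzer.__name__, analyzer(TEXT))
```

Note how `simple` drops the digit "2", `whitespace` keeps "Quick", "brown-foxes" and "evening." untouched, and `pattern` keeps "2" but splits "brown-foxes" into two lowercase terms.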
See the effects of different analyzers
Standard analyzer (default)
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    ...
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]
}
Stop analyzer (lowercasing plus stop word filtering)
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens": [
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    ...
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "word",
      "position": 11
    }
  ]
}
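In the stop analyzer result, "evening" ends up at position 11 even though "in" and "the" were removed: a removed stop word still consumes its position. The following plain-Python sketch reproduces the offsets and positions for this sentence. It assumes ASCII letters only and a small hypothetical stop word list, so treat it as an approximation rather than the analyzer's actual logic.

```python
import re

STOP_WORDS = {"the", "a", "is", "in"}  # hypothetical subset of the real list

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def stop_analyze(text):
    """Split on non-letters, lowercase, drop stop words, keep position gaps."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # the token is dropped, but its position is still used up
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


for token in stop_analyze(TEXT):
    print(token)
```

The digit "2" never becomes a token here because the splitting step only keeps letters, which is why "running" sits at position 0.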
More word segmentation examples
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

POST _analyze
{
  "analyzer": …
}

POST _analyze
{
  "analyzer": "standard",
  "text": …
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "this apple is not delicious"
}
Note that icu_analyzer (like the ik analyzer) is not bundled with Elasticsearch 7.8.0. Run ./bin/elasticsearch-plugin install analysis-icu to install the plugin, then restart Elasticsearch.
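The console requests above can also be sent from a script. The sketch below uses only Python's standard library and assumes an unsecured node listening on http://localhost:9200; the URL and the helper names are assumptions, so adjust them for your cluster. Building the request is kept separate from sending it so the payload can be inspected without a running node.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return just the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


# Inspect the request without contacting any server.
request = build_analyze_request("whitespace", "Quick brown-foxes")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a node running, calling `analyze("icu_analyzer", ...)` before the analysis-icu plugin is installed should fail with an HTTP error, which is a quick way to check whether the plugin is available.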
Chinese word segmentation
ik
Supports custom dictionaries and hot updates of the segmentation dictionary: gitee.com/mirrors/ela…
THULAC
A Chinese word segmentation toolkit from the Natural Language Processing and Computational Social Science Lab at Tsinghua University: gitee.com/puremilk/TH…
Further reading
- www.elastic.co/guide/en/el…
- www.elastic.co/guide/en/el…