1. Introduction
When using Elasticsearch (ES) for Chinese search, the quality of word segmentation directly affects search results. In scenarios where there is no capacity to research segmentation in depth, or for general-purpose use, the IK analysis plug-in is the usual choice. The basic usage of the IK analyzer can be found in the Elasticsearch documentation. The main logic of IK word segmentation consists of three parts:
1) Dictionary: the quality of the dictionary directly determines the quality of the segmentation results. This article introduces the construction and storage structure of the dictionary.
2) Word matching: with a dictionary available, the input string can be scanned character by character and matched against the dictionary.
3) Disambiguation: dictionary matching produces multiple candidate segmentations; disambiguation is the process of finding the most reasonable one.
2. Preparation
Before studying the principles of IK, clone the IK source code from github.com/medcl/elast… , check out version 6.8.1 locally, and write a method that calls the IK tokenizer so it can be debugged:
import java.io.IOException;
import java.io.StringReader;

import org.elasticsearch.common.settings.Settings;
import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class TestIK {

    public static void main(String[] args) throws IOException {
        testIkSegment();
    }

    public static void testIkSegment() throws IOException {
        String t = "Get out of here, get out of here.";
        Settings settings = Settings.builder()
                .put("use_smart", false)
                .put("enable_lowercase", false)
                .put("enable_remote_dict", false)
                .build();
        Configuration configuration = new Configuration(null, settings).setUseSmart(false);
        IKSegmenter segmenter = new IKSegmenter(new StringReader(t), configuration);
        Lexeme next;
        while ((next = segmenter.next()) != null) {
            System.out.println(next.getLexemeText());
        }
    }
}
An error occurs at startup: the cause is that the path of the dictionary has not been specified. Because constructing an Environment object is relatively complex, we point directly at the config directory on disk instead. Modify the constructor of org.wltea.analyzer.dic.Dictionary as follows, and the code will run:
// this.conf_dir = cfg.getEnvironment().configFile().resolve(AnalysisIkPlugin.PLUGIN_NAME);
this.conf_dir = Paths.get("/Users/zjb/code/mine/commit/ik/elasticsearch-analysis-ik/config");
3. The dictionary
As the preparation above shows, the dictionary configuration is very important. IK provides three built-in dictionaries:
1) main.dic: the main dictionary of common words
2) quantifier.dic: commonly used quantifiers
3) stopword.dic: stop words
Users can extend the dictionaries themselves by configuring IKAnalyzer.cfg.xml.
How is the dictionary loaded? IK stores words in a trie (dictionary tree): each word hangs off the root node, with the first character of the word as a child of the root and every subsequent character as a child of the previous one; a red node marks the end of a word. In the middle of the figure, "front door" is a word, ending at "door"; "front door giant tiger" is also a word, ending at "tiger". Since main.dic contains a large number of words and users can extend it further, the tree can become very large. Storing every node's children in a Map would waste space, so IK adopts a compromise. From DictSegment.java:
class DictSegment implements Comparable<DictSegment>{

    // Shared character table storing Chinese characters
    private static final Map<Character , Character> charMap = new ConcurrentHashMap<Character , Character>(16 , 0.95f);
    // Upper limit for array storage
    private static final int ARRAY_LENGTH_LIMIT = 3;

    // Map storage structure for child nodes
    private Map<Character , DictSegment> childrenMap;
    // Array storage structure for child nodes
    private DictSegment[] childrenArray;

    // The character stored at the current node
    private Character nodeChar;
    // The number of child segments stored at the current node
    // storeSize <= ARRAY_LENGTH_LIMIT: array storage is used; storeSize > ARRAY_LENGTH_LIMIT: Map storage is used
    private int storeSize = 0;
    // State of the current DictSegment: default 0; 1 means the path from the root node to this node represents a word
    private int nodeState = 0;
}
IK's code comments are fairly complete (in Chinese in the original source) and easy to follow. The main idea is to adjust the storage structure according to the number of child nodes: if a node has three or fewer children, they are stored in an array; if more than three, in a Map. nodeState marks whether the current node (that is, the current character) is the end of a word.
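As a hedged sketch (simplified, not IK's exact lookforSegment implementation), a child lookup under this compromise might look like the following, reusing the fields declared above:

// Simplified illustration of the array/Map compromise; field names follow
// the DictSegment declaration above, but the logic is abridged.
private DictSegment lookforChild(Character keyChar) {
    if (this.storeSize <= ARRAY_LENGTH_LIMIT) {
        // Few children: scan the small array, no Map overhead
        for (int i = 0; this.childrenArray != null && i < this.storeSize; i++) {
            if (keyChar.equals(this.childrenArray[i].nodeChar)) {
                return this.childrenArray[i];
            }
        }
        return null;
    }
    // Many children: use the Map for constant-time lookup
    return this.childrenMap != null ? this.childrenMap.get(keyChar) : null;
}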
With this structure in place, let's see how words are loaded from file into memory. The dictionary is loaded in the Configuration constructor, which ends up calling the DictSegment#fillSegment method for each word:
private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){
    // Get the character object from the shared character table
    Character beginChar = Character.valueOf(charArray[begin]);
    Character keyChar = charMap.get(beginChar);
    // If the character is not in the table yet, add it
    if(keyChar == null){
        charMap.put(beginChar, beginChar);
        keyChar = beginChar;
    }
    // Search the current node's children for keyChar; if it does not exist, create it
    DictSegment ds = lookforSegment(keyChar , enabled);
    if(ds != null){
        // Process the segment corresponding to keyChar
        if(length > 1){
            // The word has not been fully inserted into the trie yet; recurse on the rest
            ds.fillSegment(charArray, begin + 1, length - 1 , enabled);
        }else if (length == 1){
            // Set the state of the current node:
            // enabled=1 marks a complete word; enabled=0 masks the word from the dictionary
            ds.nodeState = enabled;
        }
    }
}
This is the process of recursively inserting a word into the trie: the lookforSegment method searches the current node's children for the character, creating a new DictSegment if none exists. It then checks whether the word has been fully consumed: if not, it recurses on the remaining characters; if so, it marks the node as the end of a word.
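For intuition, here is a minimal standalone trie sketch (independent of IK's classes; the entries 宝剑 and 宝剑锋 are taken from the running example in section 4) that shows the three outcomes dictionary matching can produce, which IK models with its Hit class (match, prefix, unmatch):

import java.util.HashMap;
import java.util.Map;

// Minimal standalone trie sketch, not IK's code.
public class TrieSketch {

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean wordEnd; // plays the role of nodeState == 1 in DictSegment
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.wordEnd = true;
    }

    String match(String s) {
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) {
                return "UNMATCH";
            }
        }
        if (cur.wordEnd) {
            return cur.children.isEmpty() ? "MATCH" : "MATCH_AND_PREFIX";
        }
        return "PREFIX";
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.insert("宝剑");
        trie.insert("宝剑锋");
        System.out.println(trie.match("宝"));     // PREFIX
        System.out.println(trie.match("宝剑"));   // MATCH_AND_PREFIX
        System.out.println(trie.match("宝剑锋")); // MATCH
        System.out.println(trie.match("磨"));     // UNMATCH
    }
}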
4. Word segmentation
Once you have a dictionary and an input sentence, you can segment words. IK has two segmentation modes: smart mode (ik_smart) and ik_max_word mode. Take 宝剑锋从磨砺出 ("a sword's edge comes from grinding") as the example.
Result of non-smart segmentation: 宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出
As this shows, a sentence can be segmented in many different ways: non-smart mode outputs all possible segmentations, while smart mode looks for the most reasonable segmentation among them.
In terms of processing, smart mode performs one more selection step after segmentation to pick a single result, which is what is commonly called disambiguation.
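Reusing the test harness from section 2, the mode is switched by flipping the use_smart setting and setUseSmart. A minimal sketch (the expected smart-mode output follows from section 6, where only the 0-7 lexeme survives disambiguation):

// Same harness as section 2, but with smart mode enabled.
Settings settings = Settings.builder()
        .put("use_smart", true)
        .put("enable_lowercase", false)
        .put("enable_remote_dict", false)
        .build();
Configuration configuration = new Configuration(null, settings).setUseSmart(true);
IKSegmenter segmenter = new IKSegmenter(new StringReader("宝剑锋从磨砺出"), configuration);
Lexeme lexeme;
while ((lexeme = segmenter.next()) != null) {
    // In smart mode this prints only 宝剑锋从磨砺出 (see section 6);
    // with useSmart=false it prints all candidate tokens.
    System.out.println(lexeme.getLexemeText());
}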
IK implements three segmenters by default: CJKSegmenter (the Chinese/Japanese/Korean segmenter), CN_QuantifierSegmenter (the Chinese quantifier segmenter), and LetterSegmenter (the segmenter for English characters and Arabic numerals).
The main segmentation flow is shown below. It works like lazy loading: segmentation is only performed when segmenter.next() is first called to fetch a result.
// The core of IKSegmenter#next()
Lexeme l = null;
while ((l = context.getNextLexeme()) == null) {
    /*
     * Read data from the reader and fill the buffer
     */
    int available = context.fillBuffer(this.input);
    if (available <= 0) {
        // The reader has been fully consumed
        context.reset();
        return null;
    } else {
        // Initialize the cursor
        context.initCursor();
        do {
            // Let each segmenter process the character at the cursor
            for (ISegmenter segmenter : segmenters) {
                segmenter.analyze(context);
            }
            // If the buffer needs refilling, break out to read more input
            if (context.needRefillBuffer()) {
                break;
            }
            // Move the cursor forward one character
        } while (context.moveCursor());
        // Reset the segmenters so their state does not leak across buffers
        for (ISegmenter segmenter : segmenters) {
            segmenter.reset();
        }
    }
    // Disambiguate
    this.arbitrator.process(context, configuration.isUseSmart());
    // Output the segmentation result to the result set, and handle
    // unsegmented single CJK characters
    context.outputToResult();
    // Record the buffer offset of this round
    context.markBufferOffset();
}
return l;
The core of the segmentation is the do loop, where segmenters holds the three segmenters. The context stores the current input sentence; each iteration moves the cursor forward one character, then lets each of the three segmenters process the character at the cursor.
The first is LetterSegmenter, a simple segmenter that groups consecutive English letters or consecutive digits into tokens.
Next is CN_QuantifierSegmenter, the quantifier segmenter. It determines whether the current character is a Chinese numeral or quantifier, and joins adjacent numerals and quantifiers into a single word, as sketched below.
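An assumed illustration of this behavior (the exact tokens depend on the dictionary and the mode; the input 三只松鼠 is hypothetical):

// Hypothetical example for the input 三只松鼠 ("three squirrels"):
// CN_QuantifierSegmenter recognizes 三 as a Chinese numeral (TYPE_CNUM)
// and 只 as a quantifier (TYPE_COUNT); in smart mode the adjacent pair
// is merged into the single token 三只 (TYPE_CQUAN, see the Lexeme
// constants below).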
The most important one is CJKSegmenter, which segments by dictionary matching. Before introducing its main logic, we need to introduce the class Lexeme, which represents a single segmentation result, i.e. a token:
public class Lexeme implements Comparable<Lexeme>{

    // lexemeType constants
    // Unknown
    public static final int TYPE_UNKNOWN = 0;
    // English
    public static final int TYPE_ENGLISH = 1;
    // Digits
    public static final int TYPE_ARABIC = 2;
    // Mixed English letters and digits
    public static final int TYPE_LETTER = 3;
    // Chinese word
    public static final int TYPE_CNWORD = 4;
    // Single Chinese character
    public static final int TYPE_CNCHAR = 64;
    // Japanese and Korean characters
    public static final int TYPE_OTHER_CJK = 8;
    // Chinese numeral
    public static final int TYPE_CNUM = 16;
    // Chinese quantifier
    public static final int TYPE_COUNT = 32;
    // Chinese numeral + quantifier combination
    public static final int TYPE_CQUAN = 48;

    // The start offset of the lexeme
    private int offset;
    // The relative begin position of the lexeme
    private int begin;
    // The length of the lexeme
    private int length;
    // The lexeme text
    private String lexemeText;
    // The lexeme type
    private int lexemeType;
}
The key fields of this class are begin and length: begin is the start position of the word within the input sentence, and length is the word's length.
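For the running example 宝剑锋从磨砺出 (seven characters, indexed 0 to 6), a hypothetical sketch of the lexemes produced by dictionary matching, using the Lexeme(offset, begin, length, type) constructor exactly as CJKSegmenter does below:

// Illustrative values only, matching the 0-7 / 0-3 / 0-2 / 4-6 lexemes
// discussed later in this section.
Lexeme whole = new Lexeme(0, 0, 7, Lexeme.TYPE_CNWORD); // 宝剑锋从磨砺出
Lexeme head  = new Lexeme(0, 0, 3, Lexeme.TYPE_CNWORD); // 宝剑锋
Lexeme sword = new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD); // 宝剑
Lexeme grind = new Lexeme(0, 4, 2, Lexeme.TYPE_CNWORD); // 磨砺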
The main segmentation logic is in the CJKSegmenter#analyze method:
public void analyze(AnalyzeContext context) {
    if (CharacterUtil.CHAR_USELESS != context.getCurrentCharType()) {

        // First process the prefix hits already in tmpHits
        if (!this.tmpHits.isEmpty()) {
            // Process the hit queue
            Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
            for (Hit hit : tmpArray) {
                hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                if (hit.isMatch()) {
                    // Output the current word
                    Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                    context.addLexeme(newLexeme);

                    if (!hit.isPrefix()) {
                        // Not a word prefix; this hit needs no further matching, remove it
                        this.tmpHits.remove(hit);
                    }
                } else if (hit.isUnmatch()) {
                    // The hit is not a word, remove it
                    this.tmpHits.remove(hit);
                }
            }
        }
        //********************************* upper half
        //********************************* lower half
        // Match the single character at the current cursor position
        Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
        if (singleCharHit.isMatch()) { // The single character is itself a word
            // Output the current word
            Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
            context.addLexeme(newLexeme);

            // It may also be the prefix of longer words
            if (singleCharHit.isPrefix()) {
                // Put the prefix match into the hit list
                this.tmpHits.add(singleCharHit);
            }
        } else if (singleCharHit.isPrefix()) { // The single character is a word prefix
            // Put the prefix match into the hit list
            this.tmpHits.add(singleCharHit);
        }

    } else {
        // A CHAR_USELESS character was encountered
        // Clear the queue
        this.tmpHits.clear();
    }

    // Check whether the buffer has been fully read
    if (context.isBufferConsumed()) {
        // Clear the queue
        this.tmpHits.clear();
    }

    // Decide whether to lock the buffer
    if (this.tmpHits.size() == 0) {
        context.unlockBuffer(SEGMENTER_NAME);
    } else {
        context.lockBuffer(SEGMENTER_NAME);
    }
}
Let's look at the lower half of the code first. It takes the character at the current cursor position and matches it against the dictionary:
1) If it is a match, the character corresponds to the end of a word in the dictionary (the red node described in section 3), so the single character itself forms a word: a new Lexeme is created from the buffer offset and added to the context. 2) If it is a prefix, the current character is not the end of a word but the prefix of one, and the hit is put into tmpHits; tmpHits holds the hits from previous iterations that can still be extended into words. 3) If the character is not in the dictionary at all, the temporary state is cleared.
With tmpHits in mind, look at the upper half of the code: it iterates over all hits in tmpHits. 1) If the current character extends a prefix in tmpHits into a complete word, a new Lexeme is created and stored in the context. 2) If appending the current character no longer yields a valid word or prefix, the hit is removed from tmpHits.
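A hypothetical trace of CJKSegmenter over the first three characters of the running example, assuming 宝剑 and 宝剑锋 (and the full proverb) are dictionary words, consistent with the 0-2, 0-3 and 0-7 lexemes below:

// cursor=0, char=宝: not a word by itself, but a word prefix
//                    -> a Hit starting at 0 goes into tmpHits
// cursor=1, char=剑: extending the hit gives 宝剑: MATCH and still a prefix
//                    -> emit Lexeme(begin=0, length=2), keep the hit
// cursor=2, char=锋: extending the hit gives 宝剑锋: MATCH and still a prefix
//                    (of the full 7-character proverb)
//                    -> emit Lexeme(begin=0, length=3), keep the hit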
After this series of steps, we end up with a QuickSortSet called orgLexemes inside the context. Lexeme implements the Comparable interface: lexemes are ordered by begin from smallest to largest, and when begin is equal, by length from largest to smallest. That is, words that start earlier and are longer come first.
Taking 宝剑锋从磨砺出 as the example, we finally obtain four lexemes: 0-7, 0-3, 0-2 and 4-6, i.e. 宝剑锋从磨砺出, 宝剑锋, 宝剑 and 磨砺. This is only one stage of the segmentation; there is still a gap between these four lexemes and the final output 宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出.
5. Output the result
When useSmart is set to true, the four lexemes above are disambiguated, leaving only the single lexeme 0-7 in the end. That logic is covered later; for now, assume useSmart is false, so we still have four lexemes, and look at how the result is output. The main logic is in the AnalyzeContext#outputToResult method.
void outputToResult(){
    int index = 0;
    for( ; index <= this.cursor ; ){
        // Skip non-CJK characters
        if(CharacterUtil.CHAR_USELESS == this.charTypes[index]){
            index++;
            continue;
        }
        // Find the LexemePath for this index position in pathMap
        LexemePath path = this.pathMap.get(index);
        if(path != null){
            // Output the lexemes in the LexemePath to the results set
            Lexeme l = path.pollFirst();
            while(l != null){
                this.results.add(l);
                // For words that conflict with following words, output the single
                // characters of the current word that precede the intersection
                int innerIndex = index + 1;
                for(; innerIndex < index + l.getLength(); innerIndex++){
                    Lexeme innerL = path.peekFirst();
                    if(innerL != null && innerIndex == innerL.getBegin()){
                        this.outputSingleCJK(innerIndex - 1);
                    }
                }
                // Move index to the end of the lexeme
                index = l.getBegin() + l.getLength();
                l = path.pollFirst();
                if(l != null){
                    // Output single characters for the gaps between lexemes inside the path
                    for(; index < l.getBegin(); index++){
                        this.outputSingleCJK(index);
                    }
                }
            }
        }else{
            // index has no entry in pathMap
            // Output the character at index as a single character
            this.outputSingleCJK(index);
            index++;
        }
    }
    // Clear the pathMap
    this.pathMap.clear();
}
The main logic of this method is to output, as single characters, the positions that were not segmented into any word. Careful readers may have noticed that the final result 宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出 contains 从 twice, while the input sentence contains only one 从. This is in fact a bug specific to certain IK versions, and unfortunately 6.8.1 is one of them. It is caused by the block under the "words that conflict" comment above, which was introduced by an IK pull request; the latest version has commented it out. I will analyze this bug in detail in a later article.
6. Disambiguation
Now let's go back and see what happens when useSmart is set. The main logic is in the IKArbitrator#process method:
void process(AnalyzeContext context , boolean useSmart){
    QuickSortSet orgLexemes = context.getOrgLexemes();
    Lexeme orgLexeme = orgLexemes.pollFirst();

    LexemePath crossPath = new LexemePath();
    while(orgLexeme != null){
        if(!crossPath.addCrossLexeme(orgLexeme)){
            // Found the next crossPath that does not intersect with crossPath
            if(crossPath.size() == 1 || !useSmart){
                // crossPath has no ambiguity, or ambiguity is not handled:
                // output the current crossPath directly
                context.addLexemePath(crossPath);
            }else{
                // Disambiguate the current crossPath
                QuickSortSet.Cell headCell = crossPath.getHead();
                LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
                // Output the disambiguation result judgeResult
                context.addLexemePath(judgeResult);
            }
            // Add orgLexeme to a new crossPath
            crossPath = new LexemePath();
            crossPath.addCrossLexeme(orgLexeme);
        }
        orgLexeme = orgLexemes.pollFirst();
    }

    // Handle the final path
    if(crossPath.size() == 1 || !useSmart){
        // crossPath has no ambiguity, or ambiguity is not handled:
        // output the current crossPath directly
        context.addLexemePath(crossPath);
    }else{
        // Disambiguate the current crossPath
        QuickSortSet.Cell headCell = crossPath.getHead();
        LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
        // Output the disambiguation result judgeResult
        context.addLexemePath(judgeResult);
    }
}
When useSmart is true, disambiguation is performed; when it is false, the paths are not processed and are output directly. LexemePath implements the Comparable interface. The lexemes inside one LexemePath must not cross each other, and candidate paths are sorted according to the following rules:
public int compareTo(LexemePath o) {
    // Compare the effective text length: the longer, the better
    if (this.payloadLength > o.payloadLength) {
        return -1;
    } else if (this.payloadLength < o.payloadLength) {
        return 1;
    } else {
        // Compare the number of lexemes: the fewer, the better
        if (this.size() < o.size()) {
            return -1;
        } else if (this.size() > o.size()) {
            return 1;
        } else {
            // The larger the path span, the better
            if (this.getPathLength() > o.getPathLength()) {
                return -1;
            } else if (this.getPathLength() < o.getPathLength()) {
                return 1;
            } else {
                // Statistically, reverse segmentation is more often correct than
                // forward segmentation, so the later the path ends, the better
                if (this.pathEnd > o.pathEnd) {
                    return -1;
                } else if (pathEnd < o.pathEnd) {
                    return 1;
                } else {
                    // The more even the word lengths, the better
                    if (this.getXWeight() > o.getXWeight()) {
                        return -1;
                    } else if (this.getXWeight() < o.getXWeight()) {
                        return 1;
                    } else {
                        // Compare the positional weight of the lexemes
                        if (this.getPWeight() > o.getPWeight()) {
                            return -1;
                        } else if (this.getPWeight() < o.getPWeight()) {
                            return 1;
                        }
                    }
                }
            }
        }
    }
    return 0;
}
According to these rules, the candidate lexeme paths are sorted, and the path that ranks first is selected: this is the segmentation that remains after disambiguation.
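As a worked illustration (the values follow from the running example, assuming payloadLength is the number of characters covered by the path's lexemes):

// pathA = [0-7]        -> payloadLength = 7, one lexeme
// pathB = [0-3, 4-6]   -> payloadLength = 5 (从 and 出 are uncovered), two lexemes
// pathA.compareTo(pathB) returns -1 already at the first rule (longer
// effective text wins), so 宝剑锋从磨砺出 as a single word is the
// smart-mode result.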
7. Summary
Although using the IK analyzer is relatively simple, understanding the ideas behind its internals helps with analyzing and locating problems.
For more content like this, follow my WeChat official account.