1. Introduction

When using ES for Chinese search, the quality of word segmentation directly affects the quality of the search results. In scenarios where there is no capacity to research segmentation in depth, or for general-purpose use, the IK segmentation plugin is the usual choice. Its basic usage can be found in the Elasticsearch documentation. The main logic of IK segmentation consists of three parts:

1) Dictionary: the quality of the dictionary directly determines the quality of the segmentation result. This article introduces how the dictionary is constructed and stored.

2) Word matching: with a dictionary in hand, the input string can be matched against it character by character.

3) Disambiguation: dictionary matching usually yields several possible segmentations; disambiguation is the process of picking the most reasonable one.

2. Preparation

Before studying the principles of IK, clone the IK source code locally from github.com/medcl/elast… , check out version 6.8.1, and write a method that calls the IK tokenizer so it can be debugged.

import java.io.IOException;
import java.io.StringReader;

import org.elasticsearch.common.settings.Settings;
import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class TestIK {

    public static void main(String[] args) throws IOException {
        testIkSegment();
    }

    public static void testIkSegment() throws IOException {
        String t = "Get out of here, get out of here.";
        Settings settings = Settings.builder()
                .put("use_smart", false)
                .put("enable_lowercase", false)
                .put("enable_remote_dict", false)
                .build();
        Configuration configuration = new Configuration(null, settings).setUseSmart(false);
        IKSegmenter segmenter = new IKSegmenter(new StringReader(t), configuration);
        Lexeme next;
        while ((next = segmenter.next()) != null) {
            System.out.println(next.getLexemeText());
        }
    }
}

An error occurs during startup: the path of the dictionary is not specified. Because constructing an Environment object is relatively involved, we substitute a hardcoded file path instead. Modify the constructor of org.wltea.analyzer.dic.Dictionary as follows, and the code will run:

// this.conf_dir = cfg.getEnvironment().configFile().resolve(AnalysisIkPlugin.PLUGIN_NAME);
this.conf_dir = Paths.get("/Users/zjb/code/mine/commit/ik/elasticsearch-analysis-ik/config");

3. The dictionary

As the preparation above shows, the dictionary configuration is very important. IK provides three common dictionaries:

1) main.dic: the main dictionary, containing common words

2) quantifier.dic: common quantifiers

3) stopword.dic: stop words
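IK's extension mechanism, described below, is driven by IKAnalyzer.cfg.xml. A typical configuration might look like this (the ext_dict and ext_stopwords entry keys are IK's own; the dictionary paths are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user-defined extension dictionaries, relative to the config directory -->
    <entry key="ext_dict">custom/my_words.dic</entry>
    <!-- user-defined extension stop word dictionaries -->
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>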

The dictionaries can be extended by users themselves by configuring IKAnalyzer.cfg.xml. How, then, is the dictionary loaded and stored in memory? IK uses a trie (dictionary tree): starting from the root node, words are hung onto the tree. The first character of a word is a child of the root, each character is a child of the previous one, and the red nodes in the figure mark the ends of words. In the middle of the figure, "front door" is a word ending at "door"; "front door giant tiger" is also a word, ending at "tiger". Since main.dic contains a large number of words and users can add their own extensions, the tree can become very large. Storing every node's children in a Map would waste memory, so IK adopts a compromise. From DictSegment.java:

class DictSegment implements Comparable<DictSegment> {

    // shared character table, storing the Chinese characters seen so far
    private static final Map<Character, Character> charMap = new ConcurrentHashMap<Character, Character>(16, 0.95f);
    // upper limit of the array storage
    private static final int ARRAY_LENGTH_LIMIT = 3;

    // Map storage for child nodes
    private Map<Character, DictSegment> childrenMap;
    // array storage for child nodes
    private DictSegment[] childrenArray;

    // the character stored on the current node
    private Character nodeChar;
    // the number of segments stored by the current node
    // storeSize <= ARRAY_LENGTH_LIMIT: array storage; storeSize > ARRAY_LENGTH_LIMIT: Map storage
    private int storeSize = 0;
    // state of the current DictSegment; default 0, 1 means the path from the root to this node is a word
    private int nodeState = 0;
}

The comments in the IK code are fairly complete (they are in Chinese) and easy to follow. The main idea is to adjust the storage structure according to the number of child nodes: with 3 or fewer children a node stores them in an array; with more than 3 it switches to a Map. nodeState marks whether the path from the root to the current node forms a word (that is, whether the current character ends a word).
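To make the trade-off concrete, here is a minimal, self-contained sketch (not IK's actual code) of a child lookup against this hybrid storage. The field names mirror DictSegment above; note that IK itself uses a binary search over the sorted children array rather than the linear scan shown here.

import java.util.Map;

class TrieNodeSketch {
    static final int ARRAY_LENGTH_LIMIT = 3;

    Character nodeChar;
    TrieNodeSketch[] childrenArray;              // used while storeSize <= ARRAY_LENGTH_LIMIT
    Map<Character, TrieNodeSketch> childrenMap;  // used once storeSize > ARRAY_LENGTH_LIMIT
    int storeSize = 0;
    int nodeState = 0;                           // 1: the path from the root to this node is a word

    // look up the child node holding keyChar, or null if there is none
    TrieNodeSketch findChild(Character keyChar) {
        if (storeSize <= ARRAY_LENGTH_LIMIT) {
            // few children: scan the small array, which is cheap and avoids a Map's memory overhead
            for (int i = 0; i < storeSize; i++) {
                if (childrenArray[i].nodeChar.equals(keyChar)) {
                    return childrenArray[i];
                }
            }
            return null;
        }
        // many children: constant-time Map lookup
        return childrenMap == null ? null : childrenMap.get(keyChar);
    }
}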

With this structure in place, let's see how words are loaded from file into memory. The dictionary is loaded in the Configuration constructor, which ends up calling the DictSegment#fillSegment method:

private synchronized void fillSegment(char[] charArray, int begin, int length, int enabled) {
    // fetch the character object from the shared character table
    Character beginChar = Character.valueOf(charArray[begin]);
    Character keyChar = charMap.get(beginChar);
    // if the character is not in the table yet, add it
    if (keyChar == null) {
        charMap.put(beginChar, beginChar);
        keyChar = beginChar;
    }

    // search the current node's storage for the segment of keyChar; create it if absent
    DictSegment ds = lookforSegment(keyChar, enabled);
    if (ds != null) {
        // process the segment corresponding to keyChar
        if (length > 1) {
            // the word has not been fully added to the trie yet; recurse
            ds.fillSegment(charArray, begin + 1, length - 1, enabled);
        } else if (length == 1) {
            // set the state of the current node:
            // enabled = 1 marks a complete word; enabled = 0 masks the word from the dictionary
            ds.nodeState = enabled;
        }
    }
}

This recursively inserts a word into the trie: the lookforSegment method looks for an existing segment for the current character under the current node and creates a DictSegment if none exists. It then checks whether the word being processed is finished, recursing if not, and otherwise marking the node as the end of a word.
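Once loading has finished, the trie can be probed through the Dictionary singleton, using the same matchInMainDict and Hit APIs that the segmenter itself relies on (see section 4). A quick sketch, meant to be dropped into the testIkSegment harness from section 2, assuming the dictionary has been initialized there and that the word used here is in main.dic:

// probe the loaded trie; Dictionary is initialized during the Configuration setup from section 2
char[] chars = "宝剑".toCharArray();
Hit hit = Dictionary.getSingleton().matchInMainDict(chars, 0, chars.length);
System.out.println(hit.isMatch());  // true if the characters form a complete word
System.out.println(hit.isPrefix()); // true if they are also the prefix of a longer word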

4. Cutting words

Once you have a dictionary and an input sentence, you can cut words. IK has two modes: smart mode and ik_max_word mode. Take the sentence 宝剑锋从磨砺出 ("the sword's edge comes from grinding") as an example.

Result of non-smart segmentation: 宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出

As the non-smart result shows, a sentence can be cut in many ways; non-smart mode emits every possible segmentation. Smart mode picks the most reasonable segmentation among them.

From a processing perspective, smart mode first performs the full segmentation and then selects one result from it, which is what is commonly called disambiguation.
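Using the harness from section 2, switching between the two behaviors is just the useSmart flag on Configuration:

// emit every dictionary match (ik_max_word-like behavior)
Configuration maxWord = new Configuration(null, settings).setUseSmart(false);
// arbitrate overlapping matches and keep a single segmentation (smart mode)
Configuration smart = new Configuration(null, settings).setUseSmart(true);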

IK implements three sub-segmenters by default: CJKSegmenter (for Chinese, Japanese, and Korean), CN_QuantifierSegmenter (for Chinese numerals and quantifiers), and LetterSegmenter (for English characters and Arabic digits).

The main segmentation logic, in IKSegmenter#next, is shown below. It behaves like lazy loading: segmentation is performed only when segmenter.next() is first called to fetch a result.

public synchronized Lexeme next() throws IOException {
    Lexeme l = null;
    while ((l = context.getNextLexeme()) == null) {
        /*
         * Read data from the reader and fill the buffer.
         */
        int available = context.fillBuffer(this.input);
        if (available <= 0) {
            // the reader has been fully read
            context.reset();
            return null;
        } else {
            // initialize the cursor
            context.initCursor();
            do {
                // let each sub-segmenter process the current character
                for (ISegmenter segmenter : segmenters) {
                    segmenter.analyze(context);
                }
                // if the buffer needs refilling, stop and read new characters
                if (context.needRefillBuffer()) {
                    break;
                }
                // move the cursor forward
            } while (context.moveCursor());
            // reset the sub-segmenters for the next round
            for (ISegmenter segmenter : segmenters) {
                segmenter.reset();
            }
        }
        // disambiguate the segmentation
        this.arbitrator.process(context, configuration.isUseSmart());
        // output the results, handling single CJK characters that were not cut
        context.outputToResult();
        // record the buffer offset of this round
        context.markBufferOffset();
    }
    return l;
}

The main work happens in the do loop, where segmenters holds the three sub-segmenters. context stores the current input sentence; the loop moves the cursor forward one character at a time and lets each of the three sub-segmenters process the current character.

The first is LetterSegmenter, a simple sub-segmenter that groups consecutive English characters or consecutive digits.
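A minimal sketch of that idea (not IK's actual code, which also handles connector characters such as '.' and '-' inside tokens): grow a window over consecutive ASCII letters or digits and emit each window as one token.

import java.util.ArrayList;
import java.util.List;

class LetterSketch {
    // collect runs of consecutive ASCII letters/digits as tokens
    static List<String> letterTokens(String input) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= input.length(); i++) {
            boolean inRun = i < input.length()
                    && input.charAt(i) < 128
                    && Character.isLetterOrDigit(input.charAt(i));
            if (inRun && start < 0) {
                start = i;                              // a run begins
            } else if (!inRun && start >= 0) {
                tokens.add(input.substring(start, i));  // a run ends: emit it
                start = -1;
            }
        }
        return tokens;                                  // e.g. "ES 6版本" -> ["ES", "6"]
    }
}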

Next is CN_QuantifierSegmenter, the numeral and quantifier sub-segmenter. It mainly decides whether the current character is a Chinese numeral or quantifier, and joins adjacent numerals and quantifiers into one word.

The most important one is CJKSegmenter, which works by dictionary matching. Before introducing its main logic, we need the class Lexeme, which represents one segmentation result, i.e. one lexeme:

public class Lexeme implements Comparable<Lexeme> {
    // lexemeType constants
    // unknown
    public static final int TYPE_UNKNOWN = 0;
    // English
    public static final int TYPE_ENGLISH = 1;
    // digits
    public static final int TYPE_ARABIC = 2;
    // mixed English letters and digits
    public static final int TYPE_LETTER = 3;
    // Chinese word
    public static final int TYPE_CNWORD = 4;
    // single Chinese character
    public static final int TYPE_CNCHAR = 64;
    // Japanese and Korean
    public static final int TYPE_OTHER_CJK = 8;
    // Chinese numeral
    public static final int TYPE_CNUM = 16;
    // Chinese quantifier
    public static final int TYPE_COUNT = 32;
    // Chinese numeral-quantifier compound
    public static final int TYPE_CQUAN = 48;

    // buffer offset of the lexeme
    private int offset;
    // relative start position of the lexeme
    private int begin;
    // length of the lexeme
    private int length;
    // lexeme text
    private String lexemeText;
    // lexeme type
    private int lexemeType;
}

The key fields of this class are begin and length: begin is the start position of the word within the input sentence, and length is the word's length.
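For example, using the constructor that CJKSegmenter calls below, (buffer offset, begin, length, type), the lexemes from our sample sentence would be built roughly like this (offsets assume the sentence sits at the start of the buffer):

// 宝剑 starts at character 0 and is 2 characters long
Lexeme baojian = new Lexeme(0, 0, 2, Lexeme.TYPE_CNWORD);
// 磨砺 starts at character 4 and is 2 characters long
Lexeme moli = new Lexeme(0, 4, 2, Lexeme.TYPE_CNWORD);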

The main segmentation logic lives in the CJKSegmenter#analyze method:

public void analyze(AnalyzeContext context) {
    if (CharacterUtil.CHAR_USELESS != context.getCurrentCharType()) {

        // process the prefix hits in tmpHits first
        if (!this.tmpHits.isEmpty()) {
            // process the hit queue
            Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);
            for (Hit hit : tmpArray) {
                hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                if (hit.isMatch()) {
                    // output the current word
                    Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                    context.addLexeme(newLexeme);

                    if (!hit.isPrefix()) {
                        // no longer a word prefix; the hit cannot match further, remove it
                        this.tmpHits.remove(hit);
                    }
                } else if (hit.isUnmatch()) {
                    // the hit is not a word, remove it
                    this.tmpHits.remove(hit);
                }
            }
        }
        //********************************* upper half
        //********************************* lower half
        // match the single character at the current cursor position
        Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
        if (singleCharHit.isMatch()) {
            // the single character is itself a word; output it
            Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
            context.addLexeme(newLexeme);

            // it may also be a prefix of longer words
            if (singleCharHit.isPrefix()) {
                // put the prefix match into the hit list
                this.tmpHits.add(singleCharHit);
            }
        } else if (singleCharHit.isPrefix()) {
            // the single character is a word prefix; put it into the hit list
            this.tmpHits.add(singleCharHit);
        }
    } else {
        // a CHAR_USELESS character is encountered; clear the queue
        this.tmpHits.clear();
    }

    // if the buffer has been fully consumed, clear the queue
    if (context.isBufferConsumed()) {
        this.tmpHits.clear();
    }

    // lock the buffer while there are pending hits, otherwise unlock it
    if (this.tmpHits.size() == 0) {
        context.unlockBuffer(SEGMENTER_NAME);
    } else {
        context.lockBuffer(SEGMENTER_NAME);
    }
}

Let's look at the lower half of the code first. It takes the character at the current cursor and checks how it matches against the dictionary:

1) If it is a match, the character matches the end of a word in the dictionary (a red node from section 3), so the text up to here forms a word; combined with the buffer offset, a new Lexeme is created and added to the context.

2) If it is a prefix, the current character is not the end of a word but the start of one, so the hit is put into tmpHits. tmpHits holds the hits from earlier iterations that can still be extended as word prefixes.

3) If the character is of no use for matching at all (CHAR_USELESS), the temporary list is cleared.

With tmpHits in mind, look at the upper half of the code, which iterates over all the hits in tmpHits: 1) if extending a hit with the current character produces a match, a new lexeme is created and stored in the context; 2) if extending a hit leaves it neither a match nor a prefix, the hit is removed from tmpHits.
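To tie the two halves together, here is a hand trace over the first three characters of 宝剑锋从磨砺出 (assuming, as the results in section 4 suggest, that 宝 alone is not a word, while 宝剑, 宝剑锋, and the whole proverb are all in the dictionary):

// cursor = 0, 宝: matchInMainDict -> prefix only; tmpHits = [hit(begin=0)]
// cursor = 1, 剑: hit(0) extends to 宝剑 -> match and still a prefix;
//                 emit Lexeme(begin=0, length=2), keep hit(0) in tmpHits
// cursor = 2, 锋: hit(0) extends to 宝剑锋 -> match and still a prefix
//                 (the whole proverb is in the dictionary);
//                 emit Lexeme(begin=0, length=3), keep hit(0) in tmpHits;
//                 锋 on its own matches no word here, so nothing new is added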

After this whole process, we end up with a QuickSortSet called orgLexemes inside the context. Lexeme itself implements the Comparable interface: lexemes are ordered by begin, smallest first; when begin is equal, they are ordered by length, largest first. In other words, words that start earlier and are longer are placed toward the front.

Taking 宝剑锋从磨砺出 as the example, we end up with four lexemes: 0-7, 0-3, 0-2, and 4-6, i.e. 宝剑锋从磨砺出, 宝剑锋, 宝剑, and 磨砺. These are only the dictionary matches; compared with the final non-smart result (宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出), there is still a gap.

5. Output the result

When useSmart is set to true, the four lexemes above are disambiguated and only the single lexeme 0-7 remains. That logic is covered later; for now assume useSmart was not set to true, so we still have four lexemes, and look at how the result is output. The main logic is in the AnalyzeContext#outputToResult method:

void outputToResult() {
    int index = 0;
    for (; index <= this.cursor;) {
        // skip non-CJK characters
        if (CharacterUtil.CHAR_USELESS == this.charTypes[index]) {
            index++;
            continue;
        }
        // find the LexemePath for this index in pathMap
        LexemePath path = this.pathMap.get(index);
        if (path != null) {
            // output the lexemes in the LexemePath to the results set
            Lexeme l = path.pollFirst();
            while (l != null) {
                this.results.add(l);
                // the chars form no word by themselves, but the lexemes overlap:
                // output the overlapping characters of the previous lexeme as single chars
                int innerIndex = index + 1;
                for (; innerIndex < index + l.getLength(); innerIndex++) {
                    Lexeme innerL = path.peekFirst();
                    if (innerL != null && innerIndex == innerL.getBegin()) {
                        this.outputSingleCJK(innerIndex - 1);
                    }
                }
                // move index past the lexeme
                index = l.getBegin() + l.getLength();
                l = path.pollFirst();
                if (l != null) {
                    // output single characters missed between lexemes inside the path
                    for (; index < l.getBegin(); index++) {
                        this.outputSingleCJK(index);
                    }
                }
            }
        } else {
            // no LexemePath for this index in pathMap: output the character on its own
            this.outputSingleCJK(index);
            index++;
        }
    }
    // clear the pathMap
    this.pathMap.clear();
}

The main job of this method is to output, as single characters, the positions that were not covered by any word. Careful readers will have noticed that the final result, 宝剑锋从磨砺出, 宝剑锋, 宝剑, 从, 锋, 从, 磨砺, 出, contains 从 twice, while the input sentence contains only one 从. This is a bug specific to certain IK versions, and unfortunately 6.8.1 is one of them. It is caused by the code under the comment "the chars form no word by themselves, but the lexemes overlap", which came from an IK PR and has been commented out in the latest version. I will analyze this bug in detail in a later article.

6. Disambiguation

Now let's go back and see what happens when useSmart is set. The main logic is in the IKArbitrator#process method:

void process(AnalyzeContext context, boolean useSmart) {
    QuickSortSet orgLexemes = context.getOrgLexemes();
    Lexeme orgLexeme = orgLexemes.pollFirst();

    LexemePath crossPath = new LexemePath();
    while (orgLexeme != null) {
        if (!crossPath.addCrossLexeme(orgLexeme)) {
            // the next lexeme does not cross crossPath; finish the current crossPath
            if (crossPath.size() == 1 || !useSmart) {
                // crossPath is unambiguous, or ambiguity handling is off: output it directly
                context.addLexemePath(crossPath);
            } else {
                // the crossPath is ambiguous; arbitrate it
                QuickSortSet.Cell headCell = crossPath.getHead();
                LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
                // output the arbitration result
                context.addLexemePath(judgeResult);
            }
            // put orgLexeme into a new crossPath
            crossPath = new LexemePath();
            crossPath.addCrossLexeme(orgLexeme);
        }
        orgLexeme = orgLexemes.pollFirst();
    }

    // process the final path
    if (crossPath.size() == 1 || !useSmart) {
        // crossPath is unambiguous, or ambiguity handling is off: output it directly
        context.addLexemePath(crossPath);
    } else {
        // the crossPath is ambiguous; arbitrate it
        QuickSortSet.Cell headCell = crossPath.getHead();
        LexemePath judgeResult = this.judge(headCell, crossPath.getPathLength());
        // output the arbitration result
        context.addLexemePath(judgeResult);
    }
}

When useSmart is true, ambiguity is handled; when it is false, the paths are output directly without further processing. LexemePath represents a chain of crossing lexemes and also implements the Comparable interface; the candidate paths produced during arbitration contain lexemes that do not cross each other, and they are ranked according to the following rules:

public int compareTo(LexemePath o) {
   // Compare the valid text length
   if(this.payloadLength > o.payloadLength){
      return -1;
   }else if(this.payloadLength < o.payloadLength){
      return 1;
   }else{
      // Compare the number of elements, the less the better
      if(this.size() < o.size()){
         return -1;
      }else if (this.size() > o.size()){
         return 1;
      }else{
         // The larger the path span, the better
         if(this.getPathLength() >  o.getPathLength()){
            return -1;
         }else if(this.getPathLength() <  o.getPathLength()){
            return 1;
         }else {
            // According to the statistical conclusion, the probability of reverse sharding is higher than that of forward sharding, so the lower the position is, the higher the priority is
            if(this.pathEnd > o.pathEnd){
               return -1;
            }else if(pathEnd < o.pathEnd){
               return 1;
            }else{
               // the more even the word lengths, the better
               if(this.getXWeight() > o.getXWeight()){
                  return -1;
               }else if(this.getXWeight() < o.getXWeight()){
                  return 1;
               }else {
                  // compare the position weight of the element
                  if(this.getPWeight() > o.getPWeight()){
                     return -1;
                  }else if(this.getPWeight() < o.getPWeight()){
                     return 1;
                  }
                  
               }
            }
         }
      }
   }
   return 0;
}

The candidate lexeme chains are sorted according to these rules, and the first-ranked chain is selected; that is the segmentation that disambiguation finally outputs.
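Applied to our example, the sketch below shows (as comments, since the candidates are built internally by judge) why smart mode keeps only the whole proverb:

// For the cross path {宝剑锋从磨砺出, 宝剑锋, 宝剑, 磨砺} from section 4,
// the non-crossing candidate paths include:
//   [宝剑锋从磨砺出]   payloadLength = 7, size = 1
//   [宝剑锋, 磨砺]      payloadLength = 5, size = 2
//   [宝剑, 磨砺]        payloadLength = 4, size = 2
// Rule 1 (longer effective text wins) already ranks [宝剑锋从磨砺出] first,
// which is why smart mode outputs only the lexeme 0-7, as noted in section 5.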

7. Summary

Although using the IK analyzer is relatively simple, understanding the ideas behind its internals helps when analyzing and locating problems.

For more content, follow my WeChat official account.