Number has no message function, so I built a wechat communication group to facilitate communication, now need you to join, the group of big guy thief many! At present into the group full 100 red envelopes, give a blow than the opportunity to ah Sir, look forward to working with you mutually touted, common progress. By the way, I’ll see you at 18:20 in the future. WeChat: sowhat0125

1 brute force cracking method

Find the occurrence position of pattern string B in main string A, where n > m if A is length N and B is length M. When we do A violent match, the primary string A starts with positions 0, 1, 2… N -m n-m+1 substring of length m. \

Violent match

The corresponding code is:

#include<stdio.h>
#include<string.h>
int cnt=0;
int index(char s[], char sub[])
{
    int i=0;
    int j=0;
    while(i<strlen(s))
    {
        if(s[i]==sub[j]) 
        {  // if a single character is equal, both I and j are searched backward
            i++;
            j++;
        }
        else 
        {  // If there are character mismatches, I starts at the position following the previous one, and the pattern string j starts at 0
            i=i-j+1;
            j=0;
        }
        if(j==strlen(sub))
        {
            cnt++;
            return i-strlen(sub)+1; }}return - 1;
}
int main()
{
   // s="abcdabefgabefa", sub=" Abe ", return 5
    char s[20],sub[10];
    gets(s);
    gets(sub);
    printf("%d\n",cnt);
    printf("%d\n",index(s,sub));
    return 0;
} 
Copy the code

If the main string is BBB… BBB, the pattern string is BBBBC, so each string has to be compared m times, n minus m plus 1 times m times. Time complexity = O(n*m), generally simple matching can be used this method.

2 Rabin – Karp algorithm

Algorithm idea: the main string of n- M +1 substring hash value, and then compare with the template string hash value, if the same and then compare the string is the same. \

Rabin – Karp algorithm


But the Hash calculation of intermediate data can be optimized. Let’s use a simple string matching example to map a-z to 0 to 25. Then compute the hash value of a string in base 26, such as:

The Hash computation

But you can see that there is overlap between two adjacent substrings, for example, DAB and ABC overlap AB. In this way, the Hash value of the next data can be derived from the value of the previous data:


Optimize the computed hash value

The time complexity of RK algorithm consists of two parts. The first part is to iterate over all substrings to compute the Hash value, and the time complexity is O(n). The second part is comparing hashes, which is also O(n) time.

The core of the algorithm is to minimize the hash values are equal under the condition of different data to compare the, so the hash algorithm to as good as possible, if you feel in 123 corresponding letters ABC easy collision, that with prime Numbers to match is also OK, anyway, the purpose is the same, you can think this is a tricky way to deal with string matching problem.

3. Boyer-moore algorithm.

Boyer Moore algorithm is a very efficient string matching algorithm, its performance is 3 to 4 times that of the famous KMP algorithm. It’s not easy to understand, but it’s the most used string matching algorithm in engineering. In the past, we used to compare strings by moving from front to back. The core idea of BM algorithm is that when a character in the pattern string cannot match with the main string, we slide the pattern string back a few more bits to reduce unnecessary character comparison. \

Move comparison

On the whole, BM algorithm is quite complex compared with the previous two, which mainly includes ** bad character rule ** and good suffix rule.

3.1 Bad Character Rules

The bad character rule means to match from back to front according to the pattern string, when we find that a character can’t match. We call the character in the main string that does not match a bad character. \

Bad character


If c does not match any character in the pattern string, you can move the pattern string back 3 bits. Continue comparing from the end of the pattern string.

2 movement

It is found that the bad character is G, but there is a G in the pattern string, it can not move 3 more, move 2 positions. And then we match. What’s the pattern?

If a bad character exists in the pattern string, mark the first bad character from back to front as Xi. If the bad character does not exist in the pattern string, then Xi = -1. At this point, the mode string moves backwards by bits = si-xi. \

Bad character movement rule


If you hit the extreme main string = CCCDCCCDCCCD, mode string = CCCC, then the time complexity is O(n/m).

The optimal solution


But don’t count your chickens before they hatch! The following situation may cause the pattern string to move forward instead of backward.

Moving forward?

Therefore, BM algorithm also needs to use the good suffix rule.

3.2 Bad character codes

In order to avoid taking characters from the pattern string traversal search every time, at this time the hash table is used to store each character in the pattern string and its subscript, convenient for quick search.

Next, hash table rules. For example, if a character is a byte, we use an array of size 256 to record each character’s occurrence position in the pattern string. The array stores the occurrence position of the pattern string, and the table below the array contains the ASCII value of the character. \

Hash rules

Bad character rules:

// global variable SIZE
private static final int SIZE = 256// b= pattern string array, m is pattern string array length, sl is hash table, default -1
private void generateBC(char[] b, int m, int[] sl) {
  for (int i = 0; i < SIZE; i++) {
    sl[i] = - 1// Initialize sl
  }
  for (int i = 0; i < m; ++i) {
    int ascii = (int)b[i]; sl[ascii] = i; }}Copy the code

Next, the negative case of suffix rules and bad characters is not considered, and the BM algorithm code is written roughly first. \

BM \ bad characters

public int bm(char[] a, int n, char[] b, int m) {
  // Records the last position of each character in the pattern string
  int[] sl = new int[SIZE]; 
  // Build a bad character hash table
  generateBC(b, m, sl); 
  // I represents the first character that the main string aligns with the pattern string
  int i = 0; 
  while (i <= n - m) {
    int j;
    for (j = m - 1; j >= 0; j--) { 
      if(a[i+j] ! = b[j])break; 
    }
    if (j < 0) {
    Return the position of the first matching character in the main string and pattern string
      return i; 
    }
    // Slide the mode string back the j- BC [(int)a[I +j]] bit
    i = i + (j - sl[(int)a[i+j]]); 
  }
  return - 1;
}
Copy the code

\

3.3 Rules for good suffixes

Good suffixes are similar to bad suffixes. They match from back to front until they encounter a mismatched character X, which is called the good suffix before the main string X. \

Good suffix definition

The rules for moving are as follows:

  1. If a good suffix is found in the pattern string, box it with an X, then align the x box with the suffix and continue matching. \

    Found the movement rule

  2. When not found, if the length of the direct move is mode string M bits, it is very likely to be too much! The reason for excessive movement is, for example, if you find a good suffix U, u is not found in the pattern string as a whole, but the substring D of U can be matched with the pattern string. \

    Excessive mobile

Therefore, it is necessary to check whether the suffix substring matches the prefix substring in the pattern string. From the suffix substring of the good suffix string, find the longest suffix substring that can match the prefix substring in the pattern string and slide them together, such as the d above.

Then calculate the backward sliding number of bad characters and backward sliding number of good suffixes respectively. In this case, the larger of the two is taken as the backward sliding number of mode string. In this case, the negative number of bad characters can be avoided.

3.4 Good postfix code

The core of a good suffix actually lies in two points:

  1. In the pattern string, find another substring that matches the suffix.
  2. In the good suffix substring, find the longest suffix substring that matches the pattern string prefix substring.
3.4.1 Pretreatment

The two cores above could be solved by force at the code level, but it would take too long! We can preprocess the pattern string before matching, and calculate the position of each suffix substring of the pattern string and the corresponding position of another matched substring in advance.

First look at how to represent the different postfix substring in the pattern string, because the last character of the postfix substring is subscript M-1, we only need to record the length of the postfix substring, by the length can determine a unique postfix substring. \

Substring said


Let’s introduce the key variablessuffixThe array. Suffix The index of the array represents the length of the suffix substring. The array value of the subscript is stored

The starting subscript value matched by the good suffix in the pattern string:

Suffix array definition

For example, the suffix C matches another pattern string at position 2, the suffix BC matches another pattern string at position 1, the suffix DBC matches another pattern string at position 0, and the suffix CDBC only appears once in the pattern string, so it is -1.


The prefix array


Note here that we need to look not only in the pattern string for another substring that matches the good suffix, but also in the suffix substring of the good suffixThe longestThe suffix substring that can match the pattern string prefix substring. For example:

Longest pattern matching

With suffix, you can only find another substring that matches the suffix. But we also need a Boolean prefix array to record whether the suffix substring of the pattern string matches the prefix substring of the pattern string.

Next, we will focus on how to fill the suffix and prefix array. Take the substring with subscripts from 0 to I and the whole pattern string to find the common suffix substring, where I =[0,m-2]. If the length of the common suffix substring is K, suffix[k]=j, where j represents the starting subscript of the common suffix substring. If j = 0, it indicates that the common suffix substring is also the prefix substring of the pattern string, in which case prefix[k]=true. \

Suffix follows prefix array \

// b= pattern string, m= pattern string length, suffix, prefix array as defined above
private void generateGS(char[] b, int m, int[] suffix, boolean[] prefix) {
  / / initialization
  for (int i = 0; i < m; i++) { 
    suffix[i] = - 1;
    prefix[i] = false;
  }
  // b[0, i]
  for (int i = 0; i < m - 1; i++) { 
    int j = i;
    // Common suffix substring length
    int k = 0; 
    // if b[0, m-1] and b[0, m-1]
    while (j >= 0 && b[j] == b[m- 1-k]) { 
      --j;
      ++k;
      //j+1 indicates the starting subscript of the common suffix substring in b[0, I]
      suffix[k] = j+1; 
    }
    If the common suffix substring is also the prefix substring of the pattern string
    if (j == - 1) prefix[k] = true; }}Copy the code

\

3.4.2 Official code

With the suffix and prefix arrays, look at the movement rules. Suppose the good suffix string length =k, if k! = 0, indicating that there is a good suffix, then determine how to move according to the value of suffix[k].

  1. suffix[k] ! = -1, if not, it indicates that a match is made, and the pattern string moves to j-suffix[k]+1 bit, where J represents the character subscript in the pattern string corresponding to bad characters.

  2. Suffix [k] = -1, if the suffix[k] = -1, the suffix substring length is b[r, m-1], where r = [j+2,m-1], the suffix substring length is K =m-r, if prefix[k] = true, The suffix substring of length K has a matching prefix substring, so we can move the pattern string back by r bits.

  3. If none of them match, the pattern string is moved m bits back. \

    // a and n represent the length of the main string, respectively.
    // b and m represent the length of the pattern string respectively.
    public int bm(char[] a, int n, char[] b, int m) {
      // Record the last position of each character in the pattern string
      int[] sl = new int[SIZE]; 
      // Build a bad character hash table
      generateBC(b, m, sl); 
      int[] suffix = new int[m];
      boolean[] prefix = new boolean[m];
      // Build the suffix and prefix arrays
      generateGS(b, m, suffix, prefix);
      int i = 0// j represents the first character that the main string matches the pattern string
      while (i <= n - m) {
        int j;
        // The pattern string matches from back to front
        for (j = m - 1; j >= 0; --j) { 
          // The subscript in the pattern string for bad characters is j
          if(a[i+j] ! = b[j])break; 
        }
        if (j < 0) {
          // Match OK, return the position of the first matching character of the main string and pattern string
          return i; 
        }
        // Bad characters are calculated to move the length
        int x = j - sl[(int)a[i+j]];
        int y = 0;
        // j < m-1 indicates a good match
        if (j < m- 1) { 
          y = moveByGS(j, m, suffix, prefix);
        }
        i = i + Math.max(x, y);
      }
      return - 1;
    }
    
    // j = the character subscript in the pattern string corresponding to the bad character
    // m = mode string length
    private int moveByGS(int j, int m, int[] suffix, boolean[] prefix) {
      // Good suffix length
      int k = m - 1 - j; 
      if(suffix[k] ! =- 1) {
         // Good suffixes can be matched, return the length to move
         return j - suffix[k] + 1;
      }
      // There are pattern string prefix substrings that match good suffix substrings
      for (int r = j+2; r <= m- 1; ++r) {
        if (prefix[m-r] == true) {
          returnr; }}// Direct movement Max was not found
      return m;
    }
    Copy the code

3.5 Complexity Analysis

The whole BM algorithm uses three additional arrays of SL, suffix and prefix, where the sl array size is related to the character set size, and the suffix array and prefix array size is related to the pattern string length M. If you are dealing with string matching problems with large character sets, BC arrays can be more memory consuming.

Because the rules for good suffixes and bad characters are independent, if we are running in a memory-demanding environment, we can use only the rules for good suffixes and not the rules for bad characters to avoid excessive memory consumption for BC arrays. However, the efficiency of BM algorithm with good suffixed rules will be reduced.

The time complexity of BM algorithm is very complex to analyze, generally between 3N and 5N.

4 KMP algorithm

The KMP algorithm is similar to the BM algorithm. It matches from front to back, and calls the good prefix that can be matched, and the bad character that cannot be matched.

KMP match definition


When a bad character is encountered, it is necessary to compare the prefix substring of the main string with the prefix substring of the pattern string. The question is, for the good prefix that has been compared, can we find a rule to slide the pattern string many bits at a time?

Brute force

The idea is to compare the suffix substring of the good prefix in the main string with the prefix substring of the good prefix in the pattern string to obtain the maximum prefix substring that can be matched in the pattern string. In general, the suffix substring of the longest matched prefix substring is called the longest matched suffix substring. The corresponding prefix substring is called the longest matched prefix substring. If the longest matched suffix substring = u and the longest matched prefix substring = v, the length of u and V is k, then the bad character position in the main string is I and the pattern string is J, then the pattern string is moved j-K bit later, and then the pattern string position to be compared is j = J-K. \

KMP algorithm move

4.1 The existence significance of the Next array

The longest matching prefix and suffix can be solved by using pattern string. Imitate BM algorithm, just calculate it in advance. Construct a next array in advance in the KMP algorithm. The subscript of the next array is used to store the index of the last data of the prefix substring, and the corresponding value stores the intersection of the suffix substring set and prefix substring set of the string.

This might be a little hard to understand, but let’s take abababca. Its Partial Match Table array is as follows: \

Next array

If A=BS exists for strings A and B, where S is any non-empty string, then B is called the prefix of A. For example, the prefixes for “what” include {“w”,”wh”,”wha”}, and we take all the prefixes as a set of strings. We could also define the suffix A=SB, where S is any non-empty string, and call B the suffix A. For example, the suffix “Potter” includes {“otter”, “tter”, “ter”, “er”, “r”}, and then combine all the suffixes into A set of suffixes for the string. Note that the string itself is not its own suffix.

The value in the PMT array is the length of the longest element in the intersection of the prefix and suffix sets of strings. For example, for “ABA “, the prefix set is {“a”, “ab”} and the suffix set is {“ba”, “a”}. The intersection of the two sets is {“a”}, so the longest element is the string “a”, length 1, so the Next array value of “ABA” = 1, similarly for “ababa”, its prefix set is {“a”, “ab”, “aba”, “abab”}, Its suffix set is {“baba”, “aba”, “ba”, “a”}, and the intersection of the two sets is {“a”, “aba”}, where the longest element is “aba” with length 3.

If the characters do not match at j, then in the data string “ababab” of the pattern string [0,j-1], the maximum value of the intersection between the prefix set and the suffix set is “abab” of length 4. \

PMT array usage method


Based on this, PMT can be used to speed up string lookup. We see if it’s injDislocation, then the influencejThe position of the pointer backtracking is actually number oneJ - 1The value of the Next array is actually the longest of the stored prefix. It can match the subscript of the ending character of the prefix substring, and if it cannot match, it can be replaced by -1.

Next array use

4.2 KMP matching code

// a = main string, n= main string length. B = pattern string, m = pattern string length
public static int kmp(char[] a, int n, char[] b, int m) {
  int[] next = getNexts(b, m);
  int j = 0;
  for (int i = 0; i < n; ++i) {
    while (j > 0&& a[i] ! = b[j]) { j = next[j -1] + 1// keep I unchanged and j moves
    }
    if (a[i] == b[j]) {
      ++j;
    }
    if (j == m) { // The pattern string is found
      return i - m + 1; }}return - 1;
}
Copy the code

At this point you’ll find that you just have to figure out why PMT exists, and then follow the Next array along the way. \

4.3 Solving the Next array

When calculating next[I], can the previous calculation of next[0~i-1] be used to quickly deduce next[I]?

If next[i-1] = k-1, then next[I] = k if next[I] = k-1, then next[I] = k if next[I] = k-1, then next[I] = k \



Case 2: If b[0, I] is the longest available suffix substring b[r, I], then B [r,i-1] must be the longest matched suffix substring of b[0,i-1]. For example, if the string b = “dexdecdexdex”, the longest substring matching the suffix is “dex”, b becomes B by removing the last ‘x’. Even though “de” is the substring matching the suffix of B, “dexde” is the longest substring matching the suffix! That is, the next character of the prefix of the pattern string corresponding to the longest matched suffix substring b[0, i-1] is not equal to b[I].




So let’s look at b[0, I minus 1]longIf the next character of the matching prefix b[0,i-1-x] is equal to b[I], then the longest matching suffix substring b[0, I] is b[x, I].

What if we find the substring of b[0, i-1] with a matching suffix? The substring with the longest matched suffix must be included in the longest matched suffix substring, which in turn corresponds to the longest matched prefix substring B [0, y]. Finding the substring of b[0, i-1] becomes the problem of finding the longest matched substring of B [0, y].

Follow this idea to investigate all b[0, i-1] matched suffix substring B [y, i-1], until find a matched suffix substring, its corresponding prefix substring next character is equal to b[I], then this b[y, I] is the longest matched suffix substring b[0, I]. \

\

// b = pattern string, m = pattern string length
private static int[] getNexts(char[] b, int m) {
  int[] next = new int[m];
  next[0] = - 1;
  int k = - 1;
  for (int i = 1; i < m; ++i) { while (k ! =- 1 && b[k + 1] != b[i]) {
      k = next[k];
      // Because the next character of the previous longest string is not equal to the last one, we need to find the second string of the previous one,
      // The problem becomes the longest string from 0 to next(k). If the next character is different from the last,
      // Continue to find the next next(k) until you find it, or none at all
      // This is best seen in conjunction with the previous figure
    }
    if (b[k + 1] == b[i]) {
      ++k; // If the string is equal, go to the next one
    }
    next[i] = k; // Array assignment
  }
  return next;
}
Copy the code

KMP space complexity: This algorithm requires only an additional next array, the size of which is the same as the pattern string. So the space complexity is O(m), where M is the length of the pattern string. KMP time complexity: Next array calculation time complexity is O(m) + matching time complexity is O(n) = O(m+ N)

At this point, the common string matching algorithm formally explained the end, in fact, the previous said is a main string, a pattern string. If you are interested you can also see the AC automaton, the algorithm can achieve traversal a main string at the same time matching multiple pattern string, belongs to the upgraded version of KMP, in the ordinary string matching, widely used. The core idea of this article refers to the data structure and algorithm course of geek time little Brother, the quality of the course is quite high, as a recommended wave of tap water, there are private chat I want to buy, there are preferential oh.

5 reference

  1. Geek time KMP:time.geekbang.org/column/arti…
  2. Zhihu KMP:www.zhihu.com/question/21…
  3. KMP: T.1Yb.co /i95V