If you want to learn about THE KMP algorithm, please settle down and read this article. You will surely live up to your time

Violent matching (BF)

String matching is a common problem in our programming, and detecting another string (pattern string) from one string (main string) is a very classic problem. When it comes to this problem, the algorithm we first think of may be violent matching. The following GIF shows the process of violent matching.

In the figure above, the arrows point to characters that are both blue when they match, black when they don’t match, and red when the pattern string is found in the main string.

The general idea of this algorithm is that whenever there is a character mismatch between the pattern string and the main string, the corresponding position of the pattern string and the main string is moved back one bit, and the comparison starts from the first part of the pattern string again, and the above method is repeated until the pattern string is matched in the main string or the last bit of the main string is matched.

If both main and pattern strings are short, brute force matching is a good choice and coding is relatively easy. However, if the main string and mode string are too long, we just think about it and know that the process is very time-consuming, then will there be a corresponding optimization algorithm?

The following is the introduction of the protagonist of this paper — KMP algorithm. Without pulling useless concepts, it directly talks about the application process of the algorithm and the code to implement the algorithm using Python. Finally, it summarizes why THE KMP algorithm is better than the violent matching algorithm through the analysis of the time complexity of the two algorithms.

KMP algorithm

Building the Prefix table

We first need to determine the main and pattern strings for the citation:

  • Primary string S = “abacaababc”
  • Pattern string P = “ababc”

When the pattern string matches the main string, we will only look at step 4. It is obvious that c in the main string S and B in the pattern string P do not match:

The core of KMP algorithm is to use the information obtained after the failure of matching to reduce the matching times of pattern string and main string to achieve the purpose of fast matching. For example, for this mismatch phenomenon, can we directly move the pattern string in this way?

So where does the information come from? In THE KMP algorithm, the internal matching information of a pattern string can be calculated first, so that the pattern string can be moved most effectively when the matching fails, thus reducing the number of matches. Before we do that, we need to understand prefixes and suffixes.

  • Prefix: The prefix of abcDE can be a, ab, ABC, or abcd
  • Suffixes: The suffixes of abcde can be E, de, cde, bcde

Profix [I] stores the longest common prefix of the first I +1 character, and the longest common prefix must be less than the length of the string.

The prefix list above is obtained by visual comparison. After all, the program is not a person, so it is necessary to build the prefix list in a way that the program can recognize. The process is described according to the following figure.

Using this GIF, you can plan the following five steps to build the prefix table:

  1. We first create two Pointers, j to the first digit of the pattern string (subscript 0) and I to the second digit of the pattern string (subscript 1).
  2. Since the pattern string starts with a single character without a prefix or suffix, the first digit in the corresponding prefix list is always 0.
  3. When j=0, compare the characters pointed to by j and I. If the characters do not match, the position of the prefix list corresponding to I will be filled with 0, and I will be moved backward, and J will remain unchanged.
  4. When the characters pointed to by j and I match, the position of the prefix list corresponding to I is filled with (j+1), and both j and I are moved one bit later.
  5. If j and I don’t match, and at this point, j needs to go back to the position of profix[j-1] and compare it again with the character pointed to by I. Repeat this step until the character pointed to by j and I matches or j=0.

After reading the five steps combined with the GIF, I guess you will not understand the fifth step. If you do, I can only sigh NiuBi. The following example can better highlight the backtracking mechanism of the fifth step.

I wrote the first five characters in the prefix list according to the above steps, but the characters pointed to by j and I do not match and, the subscript of j is 3, so we need to find the value with subscript j-1 in the prefix table, that is, profix[2], and then backtrace j to the corresponding position.

This backtracking is done because we can find a prefix in the pattern string header that matches the string between j and I, which is a in this case. If the character pointed to by j and I matches, then the longest common prefix length is the length of the matched prefix (a) plus 1. Thus, this operation can save a lot of time if the string between j and I is long.

In this case, the characters pointed to by j and I still do not match, so we need to continue backtracking j, the method is the same as above, backtracking position is profix[0].

At this point, the characters pointed to by j and I still do not match, but what needs to be done here is not backtracking, because j=0 has already met the conditions for the end of backtracking, so you only need to fill 0 in the position of the prefix table corresponding to I (Profix [5]), and it will be found that there is no common prefix or suffix at this time by naked eye matching.

Once you understand the steps above, you can think of it as pseudocode, from which you can easily write the build prefix list function.

def PrefixTable(Pattern):
    i = 1; j =0
    prefix = [0]*len(Pattern)
    while(i<len(Pattern)):
        if Pattern[j]==Pattern[i]:
            prefix[i] = j+1
            j+=1; i+=1
        else:
            if j == 0:
                prefix[i] = 0
                i+=1
            else:
                j = prefix[j- 1]
    return prefix
Copy the code

You can enter a pattern string to test whether the code yields the corresponding prefix list.

Optimized prefix list

The basic fact is that the last digit of the prefix list doesn’t work. What’s the reason for this? Because when j and I do not match, the solution here is to backtrace j, which is always prefix[J-1]. J can never exceed I, so the last digit of the prefix list is never used.

Then the last digit can be removed, all elements can be moved one bit later, and -1 can be filled to the first digit of the prefix list, as shown below:

def MoveTable(prefix):
    for i in range(len(prefix)- 1.0.- 1):
        prefix[i] = prefix[i- 1]
    prefix[0] = - 1
    return prefix
Copy the code

KMP matching mechanism

The main string and pattern string use the example above, where we skip some simple matching and go straight to the key points.

It can be seen that the fourth bit of the Pattern string and the main string do not match. What we need to do now is to move the element of Pattern[prefix[4]] corresponding to the main string to be matched, that is, the element with subscript 1 in the Pattern string to the position corresponding to the fourth bit of the main string.

The value of the prefix list corresponding to this position is 0. Therefore, set Pattern[prefix[0]] corresponding to the element to be matched in the main string, that is, the element with subscript 0 of the Pattern string to this position in the main string.

At this time, the corresponding positions of the two strings still do not match, but a is already the first element of the pattern string. If we follow the above method, we need to move the pattern string further, so that the position of the main string matches the element with the subscript -1 of the pattern string, but there is no element with the subscript -1 in the prefix table.

Therefore, if the corresponding position of the pattern string and the main string do not match, and the value of the prefix list corresponding to the element of the pattern string is -1, then the whole pattern string and the pointer to the main string can be directly moved one bit later, which is why -1 is inserted at the first of the prefix list.

The following GIF is the whole process of using KMP algorithm to find pattern strings in the main string.

The code of KMP algorithm is as follows:

def KMP(TheString,Pattern):m = len(TheString); n = len(Pattern) prefix = PrefixTable(Pattern) prefix = MoveTable(prefix) i =0; j =0# I is the main string pointer, j is the mode string pointer
    while(i<m):
        if j==n- 1 and TheString[i]==Pattern[j]:
            print("Pattern string found at main string subscript %d" % (i-j))
            j = prefix[j]
        if TheString[i]==Pattern[j]:
            i+=1; j+=1
        else:
            j = prefix[j]
            if j==- 1:
                i+=1; j+=1
Copy the code

If j points to the last bit of the pattern string, and if the main string matches the pattern string, then the pattern string is found in the main string and the position where the first character appears is printed. whileThe purpose of this statement is to continue matching the remaining main strings after the pattern string is found, because it is possible to have several pattern strings in the main string.

Finally, the whole program running screenshot is as follows:

BF is compared with KMP

The reason why KMP is better than BF is given here by comparing the time complexity of the two. Suppose there are two extreme main string and pattern string:

  • Primary string S = “aaaaaaab”
  • Pattern string P = “aaab”

First, take a look at the process of BF algorithm to solve the matching problem:

Then look at the flow of KMP algorithm to solve the matching problem:

Assume that the main string length is M and the mode string length is N. For BF algorithm, whenever a character is not matched, it needs to be matched again from the beginning of the pattern string, so it corresponds to the time complexity; For THE KMP algorithm, whenever it encounters a mismatched character, according to the obtained information, it will not repeatedly match the known prefix, so the corresponding time complexity is. When the string is long, KMP algorithm is better than BF algorithm in terms of time complexity.

conclusion

Personally think difficulty not low KMP algorithm, this algorithm’s blog and video a lot, but each have differences, although principle are roughly the same, but don’t see table prefix and next array at the same time, because the two are like so confusing, can get through the prefix table first and then the next array related knowledge, understanding of the KMP is so thoroughly.

Follow the public account “toffee cat” to get more wonderful articles