A, introducing

This article is mainly to analyze the KMP algorithm. Firstly, the KMP algorithm solves a problem: to judge whether a string match is a substring of another string STR

It is just that there is a very famous algorithm called KMP that solves this problem with O(N) time complexity. Before we talk about KMP algorithm, we describe the solution of this problem in the form of violent solution. To determine whether match is a substring of STR, If both characters of STR index 0 are equal, then the judging position will start to move to the next index, that is, the index position is 1. If the two characters are not equal until the index position is X, then the match fails. STR [1, str.length-1] matches STR [0] and match[0] matches STR [1, str.length-1]

This is the violent solution, and the fundamental purpose of the violent solution is to match a match from each substring of STR. Imagine the following matching scenario:

  • str: abczkkabczkkabcs
  • match: abczkkabcs

If we match STR with match, we can see that there is abckkkABC in both STR and match. When STR and match are both 8, the match is equal until the next character match, STR is z and match is S. Then the two characters are not equal. Bczkkabczkkabcs = bCZkkabCs = bCZkkabCs = bCZkkabCs = bCZkkabCs = bCZkkabCs = bCZkkabCs = bCZkkabCs = bCZkkabCs That is, the match is matched from index position 6 in STR, so as to achieve a fast forward effect, which is the function of KMP algorithm

2. Introduction of the Next array

Next array is an int[]. Let’s not talk about the real semantics of next array, but let’s first introduce the data of next array. Let’s use the string abcabcx as an example:

  1. Next array: {0, 0, 0, 0, 1, 2, 3}
  2. Let’s start with the value 3 in the next array, because in abcabcx, the string before x is abcabc, and ABC, the longest equal string in prefix and suffix, is 3
  3. Again, this value is 2 because in abcabc, the string abcab before the last character c, the longest equal string ab of prefix and suffix, is 2
  4. Again, this value is 1, because in abCAB, the string abca before the last character b, the longest string with the same prefix and suffix is a, which is 1 in length

After analyzing the examples above, let’s look at the description of the next array: Each index x of the next array stores the length of the substring with the longest prefix and suffix in the string from 0 to index x-1, where the prefix cannot include the last character and the suffix cannot include the first character, such as aaac. When the index position is 3, the substring with the longest prefix and suffix is aa. Not AAA, because the prefix can’t include the last A, and the suffix can’t include the first A, so let’s ignore how the next array is generated, and let’s go back to the implementation of the KMP algorithm, assuming that we’ve already generated the corresponding next array for the match string, This array represents the maximum length of equal prefixes and suffixes in the substring before each index position of the match string. With this data as the basis, the fast forward function of the KMP algorithm for matching positions is guaranteed

3. KMP algorithm

1. Overall process

First, our question is: Check whether the string match is a substring of another string STR, and then we have a known data, which is the next array above. Each index of the next array X holds the maximum length of the same prefix and suffix in the match string from 0 to x-1. The prefix does not include the character whose index position is X-1, and the suffix does not include the character whose index position is 0

Then, we take STR as abcabeabcabf and match as abcabf (figure below). The next array generated from the match string is {0, 0, 0, 0, 1, 2}.

  1. After a series of comparisons, STR [strIndex] and match[matchIndex] are matched at 5 and 5 respectively. Find unequal

  2. The strIndex position of STR does not change, and the strIndex will only increase throughout the search until str.length does not decrease (in this case, compared to the violent solution, Violent solution under the condition of the first appeared different characters will strIndex rollback to 1 to this place, the next is to roll back to the index of 2 this position, but it’s clear rolled back for the first time to this place in 1 character b to match the first character is not equal, or excessive rolled back, and this is the disadvantages of violent solution, roll back too much)

  3. According to next[matchIndex], the maximum length of equal prefix and suffix is 2, that is, both prefix and suffix are AB, then matchIndex will be backward to position 1, that is, matchIndex will be changed to 2

  4. STR [3, 4] = match[0, 1]; STR [3, str.length-1] = match[0, 1]; STR [3, 4] = match[0, 1]; At this time, we continue to compare the characters of strIndex and matchIndex. It can be found that the position of STR comparison is fast forward. In the violent solution, when there is inconsistency in the above situation, it becomes from STR [1, Str.length-1] will look for the match string, and the matchIndex will change to 0 and the strIndex will change to 1. Both strIndex and matchIndex are overfallbacks

2. Deduce how fast forward works

Before proceeding to the next step of the process analysis, let’s analyze the principle of fast forward in its current state

  • Question 1: Why is matchIndex going back to position 1 equivalent to looking for a match string in the substring made up of STR starting at position 3 and ending at position 3

  • Answer: According to the color classification in the above picture, we can divide into red region, blue region and green region. First of all, before the rollback of matchIndex, we compared the two characters E and F, and found that they were not equal, so we started the rollback of matchIndex. First of all, we learned that Match [0, matchindex-1] and STR [0, strindex-1] are equal because:

    • Green area (prefix) == Blue area (suffix)
    • Blue region (suffix) == Red region (suffix)

    The green area == the red area, which tells us that we should then compare strIndex to the character next to the green area, which becomes the process of finding the match string in the substring starting with the STR red area, which is how fast forward works

  • Problem two: STR [0, str.length-1] = STR [3, str.length-1] = STR [0, str.length-1] = STR [0, str.length-1] Str.length-1] match match (Note that the matchIndex is still at the position shown in the figure and has no rollback at this time)

  • Answer: If we can find match from STR [X, str.length-1], then STR [X, strindex-1] is the same as match from [0, matchindex-1]. That is, the [0, X] prefix of the match array is equal to the [X, matchindex-1] substring of STR [0, matchindex-1]. Then it means that there must be a match string in which the maximum length of the equal prefix and suffix is greater than 2. According to the current definition of the next array, the maximum length of the equal prefix and suffix is 2, which is different from the actual situation, so it is assumed to fail

After the above two inferences, we understand that when the matching characters are not equal, matchIndex should be changed to the corresponding value of next[matchIndex] for matching again. If the characters of matchIndex and strIndex are equal, then add 1 to them. Matches the next character in STR and match, and if they are not equal, matchIndex needs to roll back again

3. Continue the analysis of the overall process

As shown below:

  1. After the rollback of matchIndex, matchIndex is now at position 1, that is, matchIndex = 2, and strIndex is still 5
  2. If match[matchIndex] is not equal to STR [strIndex], matchIndex should be backed up again. Next [matchIndex] is 0
  3. STR [strIndex, str.length-1] = STR [strIndex, str.length-1] = STR [strIndex, str.length-1] = STR [strIndex, str.length-1] = STR [strIndex, str.length-1] = STR [strIndex, str.length-1] = STR [strIndex, str.length-1
  4. StrIndex becomes str.length, and matchIndex becomes match.length. The match is complete and successful

Next array generation

The next array is generated by the string match, and the semantics of the next array are as follows: each index X of the next array stores the longest substring length of the prefix and suffix in the substring constructed from 0 to index X-1, where the prefix cannot include the last character and the suffix cannot include the first character

Here’s a look at the next array generation process:

  1. We expect the string abcabf? Generate the next array, assuming that we have generated the data from index 0 to 5, we expect to generate the data from the next array for the character whose index is targetIndex. First of all, we don’t have to worry about what the character whose index is targetIndex is. Because we want to get the longest prefix and suffix from match[0, targetindex-1], targetIndex is 6, targetindex-1 is 5
  2. First, we know that next[5] is 2, which means that the maximum length of the prefix and suffix of equality is 2, which means that the character formed by match[0,1] is indeed equal to the character ab formed by match[3,4]. Next [6] values match[3] and match[5]
  3. If match[3] == match[5], next[6] = next[5] + 1
  4. Because the match [3]. = match[5], so it needs to roll back the matching position, so it needs to obtain the maximum length of the equal prefix and suffix at index position 3, namely the value of next[3], and then judge the relationship between match[next[3]] and match[5], next[3] is 0
  5. Because the match [0]! = match[5], the matching value should be further rolled back, but since it has already been rolled back to 0, it cannot be further rolled back, so next[6] is 0, that is, in the case of index position 6, the longest prefix and suffix in the substring formed by match[0,5] match length is 0

In fact, the generation of the next array is somewhat similar to the rollback process of the match string in KMP algorithm. If you understand the above process of KMP algorithm clearly, the generation of the next array will be very easy to understand. Similarly, if you need to deduce why the next array is backed up like this, Refer to the principle analysis of fast forward in 4.2, which is similar to this one, especially the corollary to question 2

Five, the summary

KMP algorithm: The core of finding whether a substring exists in a string is the next array obtained by the next function and the derivation of the backoff principle. The latter is the core of our understanding of KMP algorithm. Through KMP algorithm, the time complexity of solving this problem has changed from O(N*M) to O(N) level. The solution of the KMP algorithm can spawn many similar problems by turning the original problem into a problem of finding whether a substring exists in a string, such as determining whether a string STR is a loopback string of another string str2

The main purpose of this paper is to analyze the process of KMP algorithm in the form of pictures and texts, and carry out principle analysis, without providing the code of KMP algorithm, we are interested in realizing such code by ourselves, and then compare the code on the Internet