Today I’m going to explain KMP to you.
KMP has been talking about it for a long time. The reason I haven’t written it is because I don’t think I can write it well. Better to miss the boat than to mislead it. After all, it is one thing to understand and another to speak clearly.
So in order to write this article, I went to refer to a lot of other sources. B: well.. I find that there are so many articles on KMP online, but most of them are still confusing after reading them (even though I know how to read them).
I hope my article can achieve the purpose is: let the small white also can learn KMP. If the purpose is achieved, please help me to forward. Otherwise, you just cross it out.
Without further ado, let’s get right to it.
01. Graphic analysis
The KMP algorithm is often referred to as the “porn watching algorithm” and is proposed by a K, M, and P. Is a string matching algorithm improved by violence matching.
I looked at the KMP explanation on the Internet and almost all of it started with the next matching table. But to be honest, if I was looking at this for the first time and you told me about the Next match table, I would have been stunned. So I’m going to put it another way.
As mentioned above, KMP is an improved string matching algorithm from violent matching. So what is a violent match? Suppose our target string and pattern string are shown below. As mentioned earlier in Sunday matching, all string matching algorithms begin with alignment. KMP, Sunday, BM is the same regardless of violent match)
Violence matching is the comparison of target and pattern strings one by one.
When A matches, we continue matching until we encounter an unmatched character.
We then adjust the pattern string to match from the next character of the target string (note that this is the core). Unfortunately, there is still no match (A and B)
Continue this step:
Until we complete the matching process:
Suppose we have a target string length of M and a pattern string length of N. The mode string is compared with the target string at least m times, and its length is N, so the theoretical time complexity is **O(m*n). ** But as we can see, we can stop because we encounter unmatched characters along the way, without needing a full comparison (as in line 2 above). So although the theoretical time complexity is O(m*n), it’s actually much more efficient in most cases.
Violent matching is also known as BF algorithm, storm algorithm. The code is relatively simple:
//GO
func BFSearch(haystack string, needle string) int {
l1 := len(haystack)
l2 := len(needle)
i, j := 0.0
for i < l1 && j < l2 {
if haystack[i] == needle[j] {
i++
j++
} else {
i -= j - 1
j = 0}}if j == l2 {
return i - j
}
return - 1
}
Copy the code
So let’s talk about KMP. Let’s say I have the same string up here. We start out the same way. We compare a-A,B-B, c-C, until we hit the first character a-E that doesn’t match.
Now it’s different, if you match the violence above. At this point, the target string should go back to B, and the mode string should go straight back to the head. But according to KMP, after our first match, BC is matched successfully, so we know that BC is not equal to A (note this logical relation) :
So now that we know BC is not equal to A, we don’t have to match A with BC. So let’s just use A to cross the previous BC that doesn’t need to match:
If we go down, we find that at D minus C, we don’t match.
Then we know that B is not equal to A, so we can skip the previous B:
In other words, we can compare directly from D:
Continue the downward comparison:
By now, you have mastered the first fifty percent of KMP: in KMP, if the pattern string and the target string do not match successfully, the target string does not backtrack. Now we need a new string to master the next fifty percent:
We continue to match from the beginning until we encounter the first mismatched character:
This is the same as the example above, because our BC matched successfully, we know that BC is not equal to A, so we can skip BC (note this logic) :
So we start with A:
Until we fail to match again:
I think now that you know how to do it, come and say it with me. ** Since the previous B match was successful, we know that B does not equal A, so we can skip B. ** Of course, the next match fails directly after skipping (a-d).
Here we go!! And then we go on to match the next one. We find that this time, our match is almost successful, but we are stuck in the last step of the comparison (D-B).
Now what? If we change the two strings (change AB to XY), then obviously you know where to start:
But now the problem is that AB appears repeatedly in the pattern string, so can we directly remove AB in the next comparison?
So we take the AB out, and when we take the AB out, we’re essentially skipping two more characters on the pattern string. (That is, the next pattern string match starts at C)
In fact, at this point KMP is basically finished. We can summarize a little bit:
-
If the pattern string matches the target string, the long and short strings are incremented by one
-
If the pattern string and target string did not match successfully:
-
- The target string does not backtrack (I told you this before the splitter line above)
- The pattern string goes back to the maximum length ** of the same true prefix and true suffix of the substring before the character that failed to match (I’m focusing on this for you after the dividing line above) **
Okay, I know the second case above is a bit of a mouthful. So I brought it out to you again. What does that mean?
Suppose we have a string abbaab:
- A, AB, abb, abba, abbaa, that’s the real prefix.
- B, ab, aab, baab, bbaab, that’s the real suffix.
- The word “truth”, in plain English, does not contain itself.
In our example above, the substring before the unsuccessful character is ABCEAB, which is the same as the longest true prefix and true suffix AB, with a maximum length of 2. So let’s go back to the second position in the pattern string.
I guess someone was going to say, “Didn’t the pattern string go back to the maximum length of the true prefix and the true suffix? So why is the first example above, back to the starting position?”
In fact, it’s not that we don’t have a backtrace pattern string, but that the maximum length at this point is actually 0.
So how do we get the maximum length? It’s a natural segue into the next table. It doesn’t matter if you think of the next table as something that describes the maximum length, or you think of the next table as something that backtracks pattern strings!! This is why you will see many articles on the Internet that have inconsistent understandings of the next table.
ABCEAB does not contain itself. That is ABCEA. The true prefix and true suffix of ABCEA are:
- A,AB,ABC,ABCE
- A,EA,CEA,BCEA
So the maximum length is 1. So what does this one do? We could just pass over this A next time, and we could just start with B.
Notice here that if we modify the pattern string a little bit like this, then the maximum length of F is 0, not 2. Beginners can easily mistake AB for the longest and the same true prefix and suffix.
So far, in fact, THE idea of KMP is almost finished. But the big god spoke, and the big god thought that the match table had to be changed again, otherwise there would be a problem.
Why is this wrong? We said that for KMP, the target string does not backtrack if there is no match. So if the target string doesn’t go back, if the pattern string is always zero, does that mean that the algorithm can’t continue? So god changed the next table to -1 at position 0.
So what does this minus one do? It’s just a code trick! If next[j] is equal to 0, then we are in an infinite loop. In any case, the first character of the pattern string matches (for j, -1++ is zero). But now the pattern string moves forward. Won’t you get stuck in a loop? Imagine an infinite loop stuck on line 11 without j==-1.)
//GO
func KmpSearch(haystack string, needle string, next []int) int {
l1 := len(haystack)
l2 := len(needle)
i, j := 0.0
for i < l1 && j < l2 {
if j == - 1 || haystack[i] == needle[j] {
i++
j++
} else {
j = next[j]
}
}
if j == l2 {
return i - j
}
return - 1
}
Copy the code
So far, in fact, KMP is not too bad, the code is relatively simple. But the trouble is, right? We generally don’t have a ready-made next table to use directly. So how do you generate the next table?
In fact, the generation of the next table can also be regarded as a string matching process: that is, the process of matching the original pattern string with its own prefix.
Let’s use the following string: XXYXXYXXX.
For this string:
- The true prefix is X,XX,XXY,XXYX,XXYXX…..
- True suffixes are X,XX,XXX,YXXX,XYXXX…..
To make it easier for you to understand, I have drawn two kinds of pictures (the left is the real process of filling in the form, and the right is the imaginary process) :
- Index [0] must be 0
- Then fill in index[1]. If they match, we add one to both I and j.
- Then fill in index[2], and if there is no match, trace j back to the index of the previous position j currently pointed to. In this case, it’s 0.
- Note that the table was filled after backtracking was completed, and index[2] was 0 at this time
- And then we move I, and we find that the next digit is a match. Add one to I and j at the same time and fill in the form.
- After filling out the form, we found that the next one still matched. I keep moving I and J.
- If the match is still successful, repeat.
- Notice that at this point the match starts to fail. So, if there’s no match, we’re going to go back to the index of the previous position that j was pointing to. In this case, it’s 2.
- Continue the process of backtracking… (This step is the core of the entire next table construction.)
- Attention! I’m just going to fill in the index value of the matching position that I went back to plus 1.
Careful readers might find a problem here. Let’s take out the completed form:
We find that this table is different from our top table, where the first value of the next table is -1, and the next value at which index position is recorded is to look at the maximum length of the true prefix and the true suffix of all the substrings that precede the element. That’s a bit of a mouthful, but here we go.
For example, when index is 5, the value of next looks at the maximum length of ABCEA (true suffix A, true prefix A, so 1). But in our table below, we find that we are the maximum length of the record at its current index position. And I’m actually going to say, this table down here, we call it a partial match table, or PMT.
What does this table have to do with our next table? We found that if we string this table back one bit, we get our final Next table.
But but but but!! Not all KMP tutorials will give you the idea of a partially-matched table, and some will simply refer to the PMT as equivalent to the next table. ** this belongs to the wrong explanation? It’s not! ** Because AS I said above, it’s ok for the next table to start with -1, or even for the PMT to start with -1 as the next table. Because what matters most is how you use it! After all, the next table is defined by people!
For example, if you didn’t add -1 at the beginning of the next table, we could actually remove -1 from the previous KMP algorithm. A separate if judgment is added to solve the problem of the infinite loop mentioned above.
02,
That’s the end of KMP! The KMP series is going to be divided into two parts. The first part is to explain what KMP is, what the next table is, and what PMT is. The second lecture will tell you about the next table calculation, and next table optimization.
In short, I personally think that the quality of this content is ok in the whole network of KMP articles. If it is small white, learning this article is actually enough.
That’s the end of today’s topic, have you learned? Leave your thoughts in the comments section!
I’ve compiled all the problems I’ve written into an e-book, complete with illustrations for each problem. Click to download