This article has been published on my wechat public account “Mynongxiaoalfei”. Open the wechat and search for “Mynongxiaoalfei”, or scan the QR code at the end of the article, enter the official account and follow, and you will receive my tweet in the first time!
preface
Strings may not be one of the basic types in any programming language, but they are certainly one of the most commonly used data types, and manipulation of strings is the most common behavior in programming.
Of all the operations on strings, string lookup matches seem to be the most common in everyday programming, whether the back-end program matches the search keyword submitted by the user and returns search candidates. Or does the front-end program highlight the matching string based on the keyword entered by the user.
String matching is the matching of substrings with the same characters at each position as the pattern string, which is the string that needs to be checked for occurrence in the main string. The efficient string matching algorithm can give the processor a smaller burden in the process of running, reduce the user’s waiting time in the process of algorithm matching, and give the user a flowing experience. Therefore, it is very important to choose a suitable string matching algorithm in the programming process
Naive pattern matching algorithm
There are already a number of utility classes that can help you do this by simply calling a string. If you had to write a simple string matching algorithm by hand, what would you do?
When it comes to string matching, I think the algorithm that naturally comes to mind first is the naive pattern matching algorithm. The so-called naive pattern matching algorithm is to align the first character of the main string with the first character of the pattern string at the beginning. Starting from the first character of the pattern string, the characters of the main string and the pattern string are compared from left to right. If the characters are the same, compare the next character of both sides. If it appears a certain position, the main characters and patterns of chain corresponding position of the character is not the same, is the pattern string of the location of the overall move to the right one character, to continue the operation, until in the main string matching to substring with exactly the same pattern string, or the main series of surplus has not yet been comparing the number of characters less than model number strings of characters.
/***** * Java code implements naive pattern matching **@paramStringS main stringS star@paramStringT mode stringT */
public boolean match(String stringS, String stringT) {
char[] charsS = stringS.toCharArray();
char[] charsT = stringT.toCharArray();
for (int i = 0, sizeI = charsS.length - charsT.length; i <= sizeI; i ++) {
for (int j = 0, sizeJ = charsT.length; j < sizeJ; j ++) {
if(charsS[i + j] ! = charsT[j]) {break;
}
if (sizeJ - 1 == j) {
return true; }}}return false;
}
Copy the code
The naive pattern matching algorithm is relatively simple and easy to understand and accept. The following is the running process of the naive pattern matching algorithm:
With that said, is there an algorithm that can avoid the inefficient problem of character matching by backtracking through the main string pointer? No doubt there is! That’s the KMP pattern matching algorithm we’re going to talk about today.
KMP pattern matching algorithm
The main string does not match the sixth character of the pattern string until the sixth character is matched. Indicating that the first five characters of the main string are the same as the first five characters of the pattern string is an important piece of information. As can be seen from the pattern string, the first three characters of the pattern string are different. There is no need to move the pointer back to the main string like naive pattern matching, comparing the second and third characters of the main string with the first character of the pattern string, because that must be different.
In this case, move the pattern string to the right, skipping the comparison between the first character of the pattern string and the second and third characters of the main string, so that the first character of the pattern string is aligned with the fourth character of the main string, and the second character of the pattern string is aligned with the fifth character of the main string.
Now, you might be wondering, well, of course we can read pattern strings intuitively by graphing them, but what about computers? For the computer, the only way to solve the problem is through loops and judgments. How do you use code to make the computer read the information of an unknown pattern string?
KMP algorithm -next[] array derivation
KMP algorithm is based on the mode string of characters before and after the location information to generate a long plastic, such as pattern string array next [], used to store when the mode string of characters in each position on the if and the string of the corresponding characters is not the same time, according to the next [] of the integer as a guide, do an optimal sliding, what is the optimal sliding? That is, the pattern string slides as far to the right as possible, so that as much of the left is still aligned as possible, you don’t miss possible substrings, and you make as few comparisons as possible.
Note that the subscript is used here, because Java language and many program languages, array subscript from 0, this and the above mentioned several characters should pay attention to distinguish, many people are confused because of the relationship between the two, and lead to the subsequent understanding of the next[] array of derivation process caused no small impact.
All characters to the left of m before the mode string is moved are identical to the corresponding characters in the main string. The move is to avoid unnecessary comparisons in the backtracking of the main string pointer, so all valid next[m]=n should satisfy the property: If next[m]=n is true, then all characters to the left of n up to the end of the string should be identical to the characters to the left of m, unless there are no characters before n. There could be more than one n that satisfies this condition, but this n is definitely the largest possible one. This is what people often say that the longest public of longest prefix and suffix, is the next [] array is the core idea, and the subsequent derivation next [] array of theoretical basis, it must be clear, because if you do not meet the nature, all of the slide in vain, there is no point, will only cause misjudgment.
However, if the characters at m and n are different, we should not make next[m+1]=n+1, because this would not satisfy the above property. How to do? That we are like a model character string n place compared to failure, for n = next [n] an optimal sliding, m and n can still meet the above relations, the same as above, we continue to compare m and n in the characters are the same, not the same as moving, until the m and n in the character of the same, there is no character or n. We can say: given all next[m], where m
/** * Java code to implement KMP pattern matching algorithm next[] derivation process **@paramStringT Mode string * */
private int[] getNext(String stringT) {
char[] chars = stringT.toCharArray();
int[] next = new int[chars.length];
next[0] = -1; // This is an inevitable result, no matter what pattern string, from which to break the point of derivation
for (int i = 1; i < chars.length; i ++) {
int j = next[i - 1];
while (true) {
if (-1 == j || chars[i - 1] == chars[j]) {
next[i] = j + 1;
break;
} else{ j = next[j]; }}}return next;
}
Copy the code
/***** * Java code to implement KMP pattern matching **@paramStringS main stringS star@paramStringT mode stringT */
public boolean match(String stringS, String stringT) {
char[] charsS = stringS.toCharArray();
char[] charsT = stringT.toCharArray();
int[] next = getNext(stringT);
int j = 0;
for (int i = 0, sizeI = charsS.length; i < sizeI; i ++) {
Next [j]=-1; next[j]=-1
if (0 > j || charsT[j] == charsS[i]) {
j ++;
} else {
j = next[j];
i --; // The implementation continues to compare characters at the same position of the main string
}
if (charsT.length == j) {
return true; }}return false;
}
Copy the code
Let’s see what happens.
KMP algorithm -next[] array derivation optimization
Is the pointer to the main string not moving back? And a lot less unnecessary comparison, right? But I want to tell you that it can be optimized again. What if the pattern string above is changed to “ABCABC”?
Let’s optimize the code for getNext() :
/** * Java code implements KMP pattern matching algorithm next[] inference process **@paramStringT Mode string * */
private int[] getNextPlus(String stringT) {
char[] chars = stringT.toCharArray();
int[] next = new int[chars.length];
next[0] = -1; // This is an inevitable result, no matter what pattern string, from which to break the point of derivation
for (int i = 1; i < chars.length; i ++) {
int j = next[i - 1];
while (true) {
if (-1 == j || chars[i - 1] == chars[j]) {
next[i] = j + 1;
break;
} else{ j = next[j]; }}}// Optimized version adds code
for (int i = 0; i < next.length; i ++) {
if (0<= next[i] && chars[next[i]] == chars[i]) { next[i] = next[next[i]]; }}return next;
}
Copy the code
/***** * Java code to implement KMP pattern matching **@paramStringS main stringS star@paramStringT mode stringT */
public boolean match(String stringS, String stringT) {
char[] charsS = stringS.toCharArray();
char[] charsT = stringT.toCharArray();
int[] next = getNextPlus(stringT);
int j = 0;
for (int i = 0, sizeI = charsS.length; i < sizeI; i ++) {
Next [j]=-1; next[j]=-1
if (0 > j || charsT[j] == charsS[i]) {
j ++;
} else {
j = next[j];
i --; // The implementation continues to compare characters at the same position of the main string
}
if (charsT.length == j) {
return true; }}return false;
}
Copy the code
Now, look at the effect again:
conclusion
In the world of programming, there is often more than one way to solve a problem, and none of them is superior or inferior, just different circumstances at the time. The above example shows that the KMP algorithm is much more efficient than the naive pattern matching algorithm. What if the main string is “abcdeabcde” and the pattern string is “abcDE”? The naive pattern matching algorithm does not need pointer backtracking to match successfully, but the KMP pattern matching algorithm has a next[] array derivation process. The advantage of the KMP algorithm is that the next[] array is derived once, matching everywhere.
Here KMP pattern matching algorithm is also finished! Thanks to the founders of the algorithm, D.E.Knuth, J.H.Morris and V.R.Pratt, KMP is based on their initials.
If you find this article helpful to your understanding of the KMP pattern matching algorithm and the derivation of the next[] array in it, please feel free to do so. Of course, if you have any questions about this article, or if there is something wrong or improper in this article, please leave a comment in the comment section and discuss it together.
Follow me to show you some interesting discoveries in the world of programming.