Writing in the front

A basic operation on strings is substring lookup: given a text string of length N on one end and a pattern string of length M(M<N), find substrings in the text string that are identical to the pattern string. In the Age of the Internet, string look-ups are required in many situations, such as looking up a word in a text editor or browser, intercepting pattern text of interest in communication content, and so on.

The simplest implementation of substring lookup is definitely brute force lookup:

public static int search(String text, String pattern) { int N = text.length(); int M = pattern.length(); for (int i = 0; i < N-M; i++) { int j; for (j = 0; j < M; j++) { if (text.charAt(i+j) ! = pattern.charAt(j)) break; } if (j == M) return i; } return -1; }Copy the code

It can be seen that the worst time complexity of violent search is O(N*M). In practical applications, the text string is often very long (tens of billions of characters), while the pattern string is very short, so the time complexity of violent algorithm is unacceptable.

In order to improve the search time, many string search algorithms have been invented, and today’s protagonist KMP (invented by D.E.Knuth, J.H.Morris and V.R.Pratt, abbreviated as KMP algorithm) is one of them.


The body of the

Brute force lookup is slow because it starts from scratch each time and abandons the results already calculated, which is where KMP starts: After the matching failure, the known content information of a part of the text string can be used to avoid the pointer backing back to all these known characters, that is, to reduce the matching times between the pattern string and the text string to achieve the purpose of fast matching.

For example, look for the pattern string “ababCAB” in the text string “ababaababcabcd” (in Figure 1, the red characters represent the characters being compared) :


Figure 1 string search track of KMP algorithm

In Figure 1, when I =4 and j=4 are matched, the “A” of text is not equal to the “C” of pattern. In the violent algorithm, j will be set to zero to start the comparison again. Instead of setting j to zero, THE KMP algorithm first tries to find the same non-overlapping maximum prefix substring and maximum suffix substring in the previously matched string. For example, “abab”, the same non-overlapping maximum prefix substring and maximum suffix substring is “ab”, remember, “non-overlapping”, the maximum prefix substring and maximum suffix substring for the string “ababa” is “A”, not “aba”.

Since the “abab” of the pattern string already matches the text string, the text string also has the “abab”, so we obviously do not need to do the ab initio calculation, and align the same maximum prefix substring with the maximum suffix substring. See Figure 2.


Figure 2 Pointer rollback of KMP algorithm

The KMP algorithm speeds up the matching efficiency by avoiding the pointer rollback. The number of digits j needs to roll back for each failed match depends on the pattern string, so we definitely don’t count the number of digits j needs to roll back for each failed match (possibly double counting), Before matching, we can calculate the same maximum prefix substring and maximum suffix substring in each substring from 0 to M (m is in the interval [0, m], where M is the length of the pattern string) in the pattern string, so as to directly use them in the matching process and reduce the calculation times.

For example, the relationship between the length k and index j of the same non-overlapping maximum prefix substring and maximum suffix substring in each substring from 0 to m of the string “ababCAB” :

a ab aba abab ababc ababca ababcab
k 0 0 1 2 0 1 2
j – 1 – 1 0 1 – 1 0 1

From the table above, j=k-1, because in the programming language, the index of the string starts at 0, so minus 1, and j=-1, which means that there is no same non-overlapping maximum prefix substring and maximum suffix substring. We’re going to use the value of j in our programming. Ok, we now programmatically find the same non-overlapping maximum prefix substring and maximum suffix substring in each substring from 0 to m in the pattern string, and we store the values in a next array. I came up with an algorithm:

/** * next array of pattern strings * @param pattern string * @return next array, The value of the next array corresponds to the same non-overlapping maximum prefix and maximum suffix substrings */ private static int[] calNext(final String Pattern) {int m = pattern.length(); int[] next = new int[M]; next[0] = -1; for (int i = 1; i < M; i++) { next[i] = calSameSubStr(pattern.substring(0, i+1)); } return next; } private static int calSameSubStr(final String subStr) { int M = subStr.length(); if (M < 1) return -1; int mid = M / 2; int l = 0; Int h = mid; If (M % 2 == 1) h = mid + 1; While (l < mid && h < M) {if (substr.charat (l) == substr.charat (h)) l++; else l = 0; h++; } return l-1; }Copy the code

Since pattern strings are generally short, my algorithm should still work in general scenarios, but god does not like to settle for… KMP used the next array in the calculation of the next array, very cleverly:

/** * next array of pattern strings * @param pattern string * @return next array, The value of the next array corresponds to the same non-overlapping maximum prefix and maximum suffix substrings */ private static int[] calNext(final String Pattern) {int m = pattern.length(); int[] next = new int[M]; next[0] = -1; Int k = -1; // The first substring has only one character. For (int I = 1; for (int I = 1; i < M; While (k > -1 && pattern. CharAt (k+1)! = pattern.charAt(i)) { k = next[k]; } if (pattern.charat (k+1) == pattern.charat (I)) {k++; } next[i] = k; } return next; }Copy the code

One of the more difficult parts of the algorithm that the gods designed is the innermost loop, the while, which I’ll explain here. K >-1 indicates that the same non-overlapping maximum prefix substring and maximum suffix substring exist in the substring pattern[0, k]. If the low character (k+1) does not match the high character (I) at this time, k needs to be backtracked to where? The lowest order zero (I wrote the algorithm to go back to zero)? No, we have clearly used the same idea we discussed above, which is to back the pointer to the same non-overlapping maximum prefix and maximum postfix substrings in the pattern string [0, k], which we calculated earlier, next[k]. This paragraph may be a little difficult to understand, can not understand the students suggest using a short string to follow the program to go through the second to understand.

Here to find the next array has used the above explained KMP core idea, as you can see, in fact, the algorithm matching with the next array algorithm core idea is exactly the same! It’s just that the substring of the pattern string acts as the pattern string, and the entire pattern string acts as the text string.

We get the next array, so we don’t need to roll back the pointer every time a mismatch occurs, reducing duplicate comparisons and improving the efficiency of the lookup:

public static int search(final String text, final String pattern) { int[] next = calNext(pattern); int k = -1; int N = text.length(); int M = pattern.length(); for (int i = 0; i < N; i++) { while(k > -1 && pattern.charAt(k+1) ! = text.charat (I)) {k = next[k]; // text:... abac... // pattern: abad... // if I points to c and k points to the second a, then k+1 points to d and does not equal C, then we need to backtrack. // If I points to C and k points to the second A, then k+1 points to d and does not equal C, then we need to backtrack. // Because pattern[0, k] == text[i-1-k, i-1] Next [k]} if (pattern. CharAt (k+1) == text.charat (I)) k++; if (k == M - 1) return i - M + 1; } return -1; // No matching character is found}Copy the code

The time complexity of KMP algorithm is O(N+M), which is much better than O(NM) of violent algorithm. However, the efficiency of KMP algorithm depends on the character repetition of pattern string. If there is no character repetition, THE EFFICIENCY of KMP algorithm is even lower than that of violent algorithm.

At this point, KMP string search algorithm has been analyzed, in fact, is a relatively simple algorithm, learning up soon can understand ~


Write in the back

KMP algorithm requires the pattern string to have a high degree of repetition of the character to play a powerful power, but in the actual application of this scenario is not much, it seems to learn nothing? In fact, AFTER LEARNING some algorithms, I found that algorithms always have advantages and disadvantages, sacrifice time for space, or sacrifice space for time, always can not be perfect, we need is accumulated, in the practical application of similar scenarios, can think of such a way to solve the problem. Secondly, the most important learning algorithm is to learn ideas rather than implementation, such as fast row, you understand its core ideas on the line, there is no need to memorize, the actual work does not need you to write again, because the system has been integrated. And the idea of divide and conquer is very useful, this is often encountered in the work, when you encounter this situation, you can recall the idea of divide and conquer; “Oh! It seems that this can be solved with a quick divide-and-rule approach!” , and then go to see the specific implementation of fast row, see can learn from, I think, learning fast row this algorithm is not in vain ~

And KMP algorithm for next array scheme and pointer back way are very learning value, worth savoring, should be in the bag, looking forward to the next “oh! I have a solution for this situation!” Epiphany.

reference

  1. Algorithms: fourth edition
  2. KMP algorithm is the most simple to understand – a look to understand