Data structure and algorithm of KMP algorithm
preface
KMP algorithm is a linear time string search algorithm that the blogger contacted in college, but at that time, there was no special understanding of KMP algorithm. Recently, the blogger accidentally saw KMP algorithm, so it was brought, hey, hey, hey… to share with you.
define
The KMP algorithm is a string lookup algorithm used to find the word W in the string S. The following is the Wikipedia definition of KMP.
In computer science, the KNUth-Morris-Pratt string lookup algorithm (KMP algorithm for short) can find the occurrence of a word W in a string S. The algorithm takes advantage of the fact that a mismatched word itself contains enough information to determine where the next match is likely to start, to avoid reexamining previously matched characters.
According to the KMP algorithm, we can simply abstract as the following topic: Given a string S and a word W to look for, find the position of W in S.
Algorithm is the premise
- List all substrings of the word W (except for a single element)
- List the prefix and suffix corresponding to each substring of the word W (prefix: substring of the remaining characters after removing the last character; Suffixes: remove the first character, the substring of the remaining characters, prefix suffixes remove a single element)
- Figure out the common element length of each substring prefix and suffix based on the prefix and suffix
Algorithm ideas
In the matching process, we need to use two Pointers I and j to point to the ith character currently matched to S and the JTH character of W (here, note that j may be set to -1 in order to satisfy the following). There are two outcomes of this match:
- J = -1 or S[I] = W[j]: I ++, j++
- j ! = -1 or S[I]! = W[j] : means the string is in the JTH position of W
mismatch
J = next[J].- 2.1 Next [j] is not -1, so that j moves to the position of the common element length of the prefix and suffix of j-1 characters before W (mismatch value), and I remains unchanged.
- 2.2 next[j] = -1, i++, j = 0
mismatch
In the above algorithm idea, we need to introduce the concept of the partial matching table, also known as the mismatch function, which is the algorithm premise step done in the above, to find the common element length of each substring of the word W. I just want to know if I can find the same two parts of the string from the beginning to the end, and the length of the two parts. For example, a substring of W “abbacdab” has the same field “ab “from the beginning to the end of the substring. In this way, if W and S match at the end of the substring W, and there is a mismatch, W can skip the first two strings of W and then match S. Because S must be “abacdabXXX”, an example will be given in detail below.
The partial match table, also known as the mismatch function, allows the algorithm to match any character in S more than once. The key to being able to implement a linear time search is to examine the initial field of the pattern string in some fields of the main string, and we can know exactly where a potential match is before the current position. In other words, without missing any potential matches, we “pre-search” the pattern string itself and translate it into a list of all possible mismatches corresponding to the most invalid characters that can be bypassed.
KMP algorithm flow
The following is an example given by Wikipedia to illustrate the entire process of KMP algorithm.
Problem description
Given the string S=”ABC ABCDAB ABCDABCDABDE”, W=”ABCDABD”, design an algorithm for linear time matching to position W.
Problem analysis
- List all substrings of W.
W |
---|
A |
AB |
ABC |
ABCD |
ABCDA |
ABCDAB |
ABCDABD |
- List the prefixes and suffixes for each substring
Take the “ABCDAB” substring of W as an example:
“ABCDAB” prefix | “ABCDAB” suffix | The common prefixes and suffixes of the substring |
---|---|---|
A | B | 0 |
AB | AB | 2 |
ABC | DAB | 2 |
ABCD | CDAB | 2 |
ABCDA | BCDAB | 2 |
When the prefix and suffix of a certain length are different, the calculation of the remaining common length of the prefix and suffix can be stopped.
3. For W in this case, calculate the following mismatch array
Based on the calculation of the prefix and suffix of the mismatched array, move one bit to the right, and move the 0 position to -1.
4. Start matching
When I = j = 3 is matched, W is ‘D’ and S is’ ‘, the second result is executed, and the mismatch value of the substring “ABCD” of W is -1, then according to the algorithm idea 2.2, we can know that at this time, we need to match from the position 0 of W and I +1 of S.
When I = 10, j = 6, W is ‘D’ and S is’ ‘, the second result is executed, and the mismatch value of the substring “ABCDABD” of W is 2, then according to the algorithm idea 2.1, we can know that at this time, we need to match from the position 2 of W and I +1 of S. To paraphrase Wikipedia:Notice that "AB" appears at the beginning and end of "ABCDAB". That means the "AB" at the end can be used as the starting point for the next comparison.
Finally, if I ranges from 15 to 21 and j ranges from 0 to 6 are successfully matched, i-j is returned as the first letter of the string successfully matched with the word.
Algorithm code
public class KMP {
public static HashMap<Integer, Integer> next = new HashMap<>();
public static int KMP_caluate(String S, String W) {
int i = 0,j = 0;
// The first two are fixed
next.put(0, -1);
next.put(1.0);
// Calculate the mismatch array
matchTable(W);
System.out.println(next.toString());
// Start matching
while (i < S.length() && j < W.length()) {
char s = S.charAt(i);
char w = W.charAt(j);
if (s == w) {// Match successful, next bit
i++;
j++;
} else {
j = next.get(j);// Fail to get the mismatch position
if (j == -1) {// If the value is -1, it indicates that there is no matching failure and no common prefixes and suffixes. The next bit of S is compared with W again
i++;
j = 0; }}}return i - j;
}
public static void matchTable(String W) {
for (int i = 2; i < W.length(); i++) {
int len = 0;
next.put(i, len);
String subW = W.substring(0,i);
for(int j = 1; j < i; j++) { String prefix = subW.substring(0,j);
String suffix = subW.substring(i - j, i);
if(prefix.equals(suffix)) { len = j; next.put(i,len); }}}}public static void main(String[] args) {
String S = "acabaabaabcaccaabc", W ="abaabcac";
intindex = KMP_caluate(S,W); System.out.println(index); }}Copy the code
conclusion
The preliminary explanation of KMP algorithm ends here. It is relatively easy to solve this problem to clear the three matching ideas of string S and word W. The camera mainly takes you to comb through the process of THE KMP algorithm, as well as its internal operation process understanding.