Written in the book of the former
This time, I will share the basic idea of KMP by implementing the animation effect of KMP algorithm.
Welcome to pay attention to my blog, irregularly updated —
Concept of pre –
String matching
String matching is one of the oldest and most widely studied problems in computer science. A string is a sequence of characters defined over a finite alphabet. For example, ATCTAGAGA is A string on the alphabet ∑ = {A,C,G,T}. The string matching problem is to search for all occurrences of a certain string P in a large string T.
KMP algorithm
KMP algorithm is an improved string matching algorithm, discovered by D.E.K Nuth, J.H.Morris and V.R.Pratt at the same time, so it is called knut – Morris – Pratt operation (abbreviated KMP algorithm). The key of KMP algorithm is to use the information after the failure of matching to reduce the matching times of pattern string and main string to achieve the purpose of fast matching. The implementation is to implement a next() function that itself contains local matching information for the pattern string. Time complexity O(m+n).
In JS string matching we usually use the native API indexOf; The c++ implementation itself is beyond the scope of this discussion. This time mainly through the way of animation demonstration to show the simple algorithm and KMP algorithm comparison process similarities and differences, so as to try to understand the basic idea of KMP.
PS: In the following narration, BBC ABCDAB ABCDABCDABDE is the main string; ABCDABD is a mode string
Results the preview
At the top is the naive algorithm, that is, bitwise comparison, and at the bottom is the string comparison method implemented by KMP algorithm. KMP can match with fewer comparisons.
The basic idea
It can be seen from the effect preview in the figure above that using the naive algorithm to compare pattern strings in sequence requires 13 shifts, while using KMP requires 8 shifts, so it can be said that the idea of KMP is to quickly move to the specified location by avoiding invalid shifts. Let’s take a look at how KMP “hops” around:
Consistent with the naive algorithm, the first character of the pattern string ABCBABD in the previous matching of the main string “BBC” was different and shifted backwards to the position shown in the figure above. By comparing the main string with the characters in the pattern string, we can see that the first six characters of the pattern string are the same as the main string, namely ABCDAB; This is the key to the KMP algorithm.
Calculate the next shift position based on the known information
Let’s first look at the process of the naive algorithm and the next shift in KMP from the following figure:
The naive algorithm has moved back a notch from the rain. KMP skips the BCD character of the main string. Thus a meaningless shift comparison is performed. So how does it know that I’m going to jump three this time and not two or not? The key is the portion of ABCDAB that was matched last time
Mining information from matched portions
We know that both the main string and the mode string have the same part ABCDAB. So how do you get useful information from this common section? Or another way to think about it: What is the basis on which we can skip some places?
The situation when the first match fails is as follows:
BBC ABCDAB ABCDABCDABDE ABCDABD D ! = space failedCopy the code
To extract information from matched portions. Now transform the main string:
ABCDABXXXXXX... X could be any characterCopy the code
We now only know the part that has been matched, because the match has failed and we will not read the following characters, so we use X instead.
So how many places can we skip to get the answer:
//ABCDAB moves back how many bits might match? ABCDABXXXXXX... ABCDABDCopy the code
The answer, of course, is the following:
ABCDABXXXXXX...
ABCDABD
Copy the code
Because we don’t know what X stands for, we can only analyze it from matched strings.
So what is the basis that we can skip some places?
A: Can the first n bits of the matched pattern string equal the last n bits of the matched part of the main string? And n is as big as possible.
Here’s an example:
XXXABCDDDABCFXXX ABCDDDABCE = ABCDDDABCE = ABCDDDABCE = ABCDDDABCE = ABCDDDABCE = ABCDDDABCE Therefore, we can skip the DDD match because it will not match XXXABCDDDABCFXXX ABCDDDABCECopy the code
Now KMP basic train of thought is clear, it is through after failure that has matching fields, to find the main strung together, string together the head tail and pattern of the same maximum matching, if you can cross in the middle of the part, because of the so-called “middle” part, it is possible to enter the main strung together, string together head, tail and patterns not in reason is the relative position of different characters, So eventually the pattern string can be skipped when it is shifted.
Partial matching value
The above is to describe in colloquial terms how we can determine the position of the next pattern string shift based on the matched parts. You should already have a general idea of KMP’s thinking. Now for the official line.
We can use partial matching to describe the result of finding the same part of the main string header as the pattern string tail in the matched part:
- Where “prefix” and “suffix” are defined. The prefix “refers to all header combinations of a string except the last character;” Suffixes refer to all combinations of the ends of a string except the first character.
- The “partial match value” is the length of the longest common element of the prefix and suffix.
For example ABCDAB
- The prefixes are A, AB, ABC, ABCD, and ABCDA
- The suffixes are B, AB, DAB, CDAB and BCDAB
It’s easy to see that the partial match is 2, the length of AB. Therefore, based on the previous idea, we can know that the mode string can be directly shifted to the corresponding place of the main string AB, and the middle part must be mismatched. How many places do I move?
A: Matching string length – partial matching value; In this example, 6-2=4, and the pattern string moves four bits to the right
Code implementation
Calculates the partial match table
function pmtArr(target) { var pmtArr = [] target = target.split('') for(var j = 0; j < target.length; Var PMT = target var pmtNum = 0 for (var k = 0; k < j; K ++) {var head = PMT. Slice (0, k+ 1) var foot = PMT. Slice (j-k, k++) If (head.join('') === food.join ('')) {var num = head.length if (num > pmtNum) pmtNum = num}} pmtArr.push(j + 1 - pmtNum) } return pmtArr }Copy the code
KMP algorithm
function mapKMPStr(base, target) { var isMatch = [] var pmt = pmtArr(target) console.time('kmp') var times = 0 for(var i = 0; i < base.length; i++) { times++ var tempIndex = 0 for(var j = 0; j < target.length; j++) { if(i + target.length <= base.length) { if (target.charAt(j) === base.charAt(i + j)) { isMatch.push(target.charAt(j)) } else { if(! Var skip = PMT [j-1] tempIndex = I + skip-1 break}}} var data = {index: I, matchArr: isMatch } callerKmp.push(data) if(tempIndex) i = tempIndex if(isMatch.length === target.length) { console.timeEnd('kmp') Console. log(' shift times :', times) return I} isMatch = []} console.timeEnd(' KMP ') return-1Copy the code
Have thought, after the overall is not complicated, just need to be part of string of length is calculated by pattern matching values, with the main string after the matching process, after each failure if there is a partial match value, we can check it on the part of matching value should shift to the next position, eliminate unnecessary steps.
So in some extreme cases, for example, if there is no internal repetition of the word to search for, the algorithm will degenerate into traversal, which may not perform as well as the traditional algorithm, and there is the overhead of comparison involved.
Refer to the article
- String matching KMP algorithm
The last
Custom Po writer’s blog, updated irregularly —
If you have any questions, please contact us at issues.