KMP

Violent algorithm is in the case of mismatch, backtrace the substring pointer, match from scratch. KMP mismatches try to move forward more aggressively, skimming over impossible matches. The length of the forward move is determined by a “partial match table”.

(CLRS Introduction to Algorithms, 3rd Ed 1)

Partial match in front of the table is really a mismatch that position (matching) on the inside of the longest suffix length before public, move is to prefix directly against the suffix that position, just skip the middle can’t match (already swept, behind do not match, repeat and still does not match) of that paragraph, began to try a possible match directly.

Partial match table

Refer to Wikipedia’s algorithm 2 (see English page, some paragraphs missing).

By hand:

func GetNext(substr: string) []int {  / / substr: [0... M - 1)
    M := len(substr)
    next := make([]int, M)

    next[0] = - 1
    for j := 1; j <= M; j++ {
        next[j] = len(longest common suffix (substr[0:j])  // before the j character)}return next
}
Copy the code

The longest common prefix and suffix is for example:

(Picture from Ruan Yifeng “STRING Matching KMP Algorithm” 3)

Programming quick calculation:

func GetNext(substr string) []int {
	next := make([]int.len(substr))
	next[0] = - 1

	i, j := 0.- 1
	for i < len(substr)- 1 {
		if j == - 1 || substr[i] == substr[j] {
			i++
			j++
			next[i] = j
		} else {
			j = next[j]
		}
	}
	return next
}
Copy the code

matching

KMP:

func KMP(s string, substr string) int {
	next := GetNext(substr)

	i, j := 0.0
	for i < len(s) && j < len(substr) {
		if j == - 1 || s[i] == substr[j] {
			i++
			j++
		} else { // miss matching
			j = next[j]
		}
	}

	if j >= len(substr) { // matched
		return i - len(substr)
	}
	return - 1 // not matched
}
Copy the code

This program actually works:

package main

// ...

func main(a) {
	s := "BBC ABCDAB ABCDABCDABDE"
	sub := "ABCDABD"

	res := KMP(s, sub)
	println(res)
}
Copy the code

Next array

Add a point: partial matching table, there are several different versions, such as “Introduction to algorithms” 1, and for example, our postgraduate examination book (Tianqin Wang Dao) 4,^, 5.

(The partial matching table is affectionately called “the next array”…)

Here’s an example of the difference:

index 0 1 2 3 4 5 6
char a b a b a c a
CLRS 0 0 1 2 3 0 1
Wikipedia – 1 0 0 1 2 3 0 = CLRS moves one to the right and complements the left- 1
Day frequently king 0 1 1 2 3 4 1 = Wikipedia + 1

If mismatch at original (large) index I, schema (substring) index j:

  • The CLRS moves toT[j-1](Special cases:j=0i+=1)
  • Wikipedia moved toT[j](Special cases:j=-1j+=1).

Better understood code

To be honest, I can’t read the code above. Mainly is the period of the if (j = = 1 | | s [I] = = substr [j]) {… } WTF????

Studied for a long time to understand clearly is two different cases, just happened to be the same processing methods merged together, put here to write assembly. Throw up.

It is easier to understand if the two cases are written separately by referring to answer 6 on Zhihu:

func GetNext(substr string) []int {
	next := make([]int.len(substr))
	next[0] = - 1

	i, j := 0.- 1
	for i < len(substr)- 1 {
        if j == - 1 { // special: end of the line: the smallest, can not be reduced
			i++
			j = 0
			next[i] = j
		} else if substr[i] == substr[j] { // Add one more bit
			i++
			j++
			next[i] = j
		} else { // Mismatch: shrink, re-match
			j = next[j]
		}
	}
	return next
}

func KMP(s string, substr string) int {
	next := GetNext(substr)

	i, j := 0.0 // I of s and j of substr
	for i < len(s) && j < len(substr) {
        if j == - 1 { // special: go back to the head: I move one, j compare from the head
			i++
			j = 0
		} else if s[i] == substr[j] { // add: next character
			i++
			j++
		} else { // mismatch: j moves to the next possible match
			j = next[j]
		}
	}

	if j >= len(substr) { // matched
		return i - len(substr)
	}
	return - 1 // not matched
}
Copy the code

Improve the KMP

Improve 🔨, to sleep.

Nextval nextVal nextVal nextVal nextVal

Nextval construction rules are as follows:

nextval[0] = - 1;
for (int j = 1; j < len(substr); j++) { nextval[j] = (substr[j] ! = substr[next[j]] ? next[j] : nextval[next[j]] ); }Copy the code

If the character of next[j] is the same as that of the current character, it is not necessary to go back step by step (which has already been done before), but directly borrow the previous result and quickly go back to the end.

e.g.

substr A B A B A A B
j 0 1 2 3 4 5 6
next – 1 0 0 1 2 3 1
nextval – 1 0 – 1 0 – 1 3 0

(Remember that the test book starts at 1 and increments next and Nextval.)

Recommended reading

Still have ruan yifeng of bibliothecary below, easy to understand, see this hand calculate enough. Wikipedia is also ok (be sure to see the English page).

We can chew CLRS when we have time.

Dead to memorize the code, the king of heaven.

And leetcode-cn.com/problems/im… This article also writes the cock (this big guy’s public account I followed for several years), uses the moving gauge (actually is the DFA) to explain KMP, recommended reading.

reference


  1. Cormen T H, Leiserson C E, Rivest R L, et al. Introduction to Algorithms, 3rd Ed.↩
  2. Wikipedia. Knuth, Morris, Pratt algorithm. En.wikipedia.org/wiki/Knuth -… ↩
  3. Nguyen. String matching KMP algorithm. www.ruanyifeng.com/blog/2013/0… ↩
  4. High Score Notes on Data Structure (10th edition of Tianqin in 2022). China Machine Press, 2020.↩
  5. Wang Dao Forum. 2022 Data structure Test review guide. Publishing House of Electronics Industry, 2021↩
  6. Ruan Xingzhi. How to better understand and master KMP algorithm? . www.zhihu.com/question/21… ↩