What is a string
In the data structure, the string is stored in a separate storage structure, called the string storage structure. String means string. A string is usually a finite sequence of zero or more characters.
In general, a string consisting of n strings is denoted as S=”a0a1…… An -1″(n≥0), ai (1≤ I ≤n)
- N is a finite number
- The string is generally denoted as S is the name of the string, and the sequence of characters enclosed by double or single quotes is the value of the string (the quotes do not belong to the contents of the string).
- It can be a letter, a number, or any other character, and I is the position of the character in the string. The number of characters in a string, n, is called the string length, and n is a finite number
No matter which programming language you learn, strings are the most manipulated. In the data structure, some special strings are named according to the number and characteristics of the characters stored in the string, as follows:
-
Empty string: a string that stores 0 characters, such as S = “” (double quotes next to each other)
-
Space string: string containing only space characters, such as S = “” (double quotation marks contain 5 Spaces)
-
Main string and substring: Suppose there are two strings, A and B. If A string consisting of several consecutive characters can be found in B that is exactly the same as A, then A is the main string of B and B is the substring of A. For example, if A = “ZIHUCHUAN”, B = “HUA”, since A also contains “HUA”, strings A and B are main and sub-strings
-
Prefix, proper prefix, suffix, proper suffix is a suffix that does not include itself.
Given a string, then:
To determine whether there is a main string and substring relationship between two strings, the main matching algorithms are as follows:
- Naive pattern matching algorithm (Brute Force, BF algorithm), also known as Brute Force algorithm
- Fast Pattern Matching algorithm (KNUth-Morris-Pratt, KMP algorithm)
BF algorithm
Naive pattern matching algorithm, its implementation process without any skills, is a simple and rough to take a string with another string of characters one by one, get the final result.
Code implementation
Public class BruteForce {/** * @param p substring */ public static int BruteForce (String s, String p) {// match the initial position int sl = s.length(); int pl = p.length(); if (sl < pl) return -1; int i = 0, j = 0; while (i < sl && j < pl) { if (s.charAt(i) == p.charAt(j)) { i++; j++; } else {// backtrace I = I - j + 1; j = 0; }} if (j >= pl) {return i-j; } else {// return -1 is not matched; } } public static void main(String[] args) { String s = "ZIHUCHUAN"; String p = "HUA"; System.out.println(bruteForce(s, p)); }}Copy the code
KMP algorithm
KMP algorithm is a very efficient string matching algorithm, its core idea is that the main string does not backtrack, pattern string as much as possible to the right.
Next [k] represents the maximum common substring length of the string consisting of the first k characters.
Maximum common substring length
For a string that has both a prefix and a suffix, true prefixes are those that do not contain themselves.
Here, the maximum common substring length is the longest and equal true prefix and true suffix of the string. Such as:
Abcd True prefix: a,ab, ABC Prefix: a,ab, ABC,abcd True suffix: BCD, CD,d Suffix: abcd, BCD, CD,dCopy the code
The string abcd does not have a common suffix, so it does not have the longest common suffix. For the string abcab, it looks like this:
Prefix: a,ab, ABC,abca,abcab True suffix: BCAB,cab,ab,b Suffix: abcab,bcab,cab,ab,bCopy the code
Where the true prefix ab and the true suffix AB are equal and unique, that is, the maximum common substring length of the string abCAB is ab, and its length is 2.
Obviously, from the above analysis, it can be concluded that the true prefix is included in the prefix, so the following prefix refers to the true prefix.
Building the Next array
What does the next array do?
To hold the maximum length of a common substring, which is where the substring should fall back if the main string and substring do not match.Copy the code
What does the maximum common substring length do?
If S [I]! = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j] = P[j]
Given a main string BBCABCDABABCDABCDABDE and a substring ABCDABD, the brute force solution, after a match, is as follows:
At this point S [9] {A}! = P[6]{D}, the violent algorithm is to set I =i-j+1,j=0, and match again by backtracking, but all the ones before I have been compared, so if I can be kept unchanged,j becomes 2, and substring is shifted 4 bits to the right, that is, s[9]{A} and j[2]{C} are aligned, as follows
(Note: The yellow box in the picture above is actually what the brute force algorithm needs to compare, but it’s actually redundant)
So how do we get this 4 bits? In other words, how do we know that j has to point to 2? This requires the maximum common substring length.
In the first diagram, when S[9] and P[6] are unmatched:
- The pink box is matching, which is
ABCDAB
- The green box matches the blue box, i.e
AB
- Blue box
AB
It’s in the pink boxABCDAB
The suffix - Red box
AB
It’s in the pink boxABCDAB
The prefix - The blue box and the red box are both
AB
For the pink box in the substringABCDAB
They have the same prefix and suffix, so if the green matches the blue, then the red matches the green - So, align the red box with the green box, pointer
j
Point to the last bit of the red box, thus maintainingi
It doesn’t move, andj
Move to position 2, and the maximum common substring isABAnd the length is 2
Therefore, when the match between S[I] and P[j] is out of balance, the longest common presuffix length of the left substring (ABCDAB) excluding P[j] is calculated. Assuming the length is k, then the j pointer needs to be reset to k (j=2 in the figure above), leaving I unchanged and continuing to match.
How do I find the maximum common length of a substring?
Next [j] represents the longest common suffix for P[0~j-1] substring P[0~j-1] of the P string, where for next[0], We usually set it to -1 (because there is no substring to the left of P[0], so next[0] cannot be solved).
For example, the string A,P[0] = A, has no other elements to the left of it, so there is no prefix and suffix lengthCopy the code
Take the substring ABCDABD as an example to build the next array:
Note: The true prefixes in the above table are the true prefixes of the left substring
From the table above, we get the next array:
It can be seen from the above that there exists the longest common suffixk
, the prefixP[0~k-1]
And postfixP[j-k~j-1]
Is equal to (j>k), thennext[j]=k
,P[j]
The previous substring had length zerok
So in the KMP matching process, when the match imbalance, only need toj
Move to thenext[j]
Continues to match at the position of, which corresponds to the substring P being shifted to the right at the original positionnext[j]
position
Code implementation
Because when S[I]! = P[j] = P[j] = P[j] = P[j] = P[j] Next [j] = next[j] = next[j] = next[j] = next[j] = next[j] = next[j
First define a j that traverses the substring P from left to right, where the position of j represents the right-most character of the suffix of the substring P, and then define a k that represents the right-most character of the prefix of the substring P
-
If next[j]=k, then P[0~k-1]=P[j-k~j-1], and the k-1 character before and after the k-1 character overlap. For example, when j=0, P[0] has no longest presuffix, that is, next[0]=-1, j=0,k=-1+1=0, and j and k move right into the next cycle
-
We have figured out next[j]=k, the next step should be next[j+1]. At this time, there are two situations as follows
- if
P[j]==P[k]
, the number of repeated strings will increase, thenP[0~k-1]+P[k] = P[j-k~j-1]+P[j]
, i.e.,P[0~k]=P[j-k~j]
In front of,k
Bit characters and the followingk
Bit characters overlap, that is, one more bit, so you can getnext[j+1]=k+1
, i.e.,next[++j]=++k
, the following figure
-
If P [j]. =P[k], indicating that the number of repeated strings does not increase, that is, the length of the maximum repeated substring cannot be increased. At this time, we need to find the value of next[j+1], obviously this time is to ask for the substring before j+1, that is, the maximum repeated substring length of P[0~j]. Let’s assume that the longest repeating substring length is k1, then
P[j]==P[k],P[0~k-1]=P[j-k~j-1] P[j]! =P[k], then P[0~ K1-1]=P[j+1-k~j]Copy the code
Since the maximum length is not increasing at this time, k1 <= k, that is, the current maximum repeating substring may exist in P[0~k-1], that is, the value of next[k] represents the maximum repeating substring of the first K elements, which is analyzed as follows
- if
Analysis:
-
Given an array like this, suppose we want the value of next[16], given that next[j]=k,next[j+1]=k+1
-
We assume that next[15]=7, that is, j=15,k=7, indicating that the maximum common substring length of P[0~k-1]=P[j-k~j-1] is K =7, then P[0~6]=P[8~14]
-
If P[7]=P[15], then next[16]=next[15]+1=8
next[j+1]=k+1 next[15+1]=7+1 next[16]=8 Copy the code
-
If P [7]! =P[15], set next[7]=3, then it indicates that the maximum common substring length of P[0~6] is 3, that is, P[0~2]=P[4~6], and P[0~6]=P[8~14], so the blue part below is coevent
The value of next[7] represents the maximum common substring length of P[0~k-1], so it is obvious that P[k] and P[next[k]] must not be equal, but A and B are equal, so when P[7]! =P[15], k=next[k], k=next[k], k=3, k=3 And then to determine the p [next [k]] and [j] p are equal, if equal p [j + 1) = k + 1, p = 3 + 1 [16]
Next [1]=0, indicating that there is no common suffix.
Public static int[] getNext(String p) {int len = p.length(); Int [] next = new int[len]; int k = -1; Int j = 0; int j = 0; Next [0] = -1 next[0] = -1; While (j < len - 1) {/ / cycle of p string string before the if (k = = | | p.c harAt (j) = = p.c harAt (k)) {next [+ + j] = + + k; } else { k = next[k]; } } return next; }Copy the code
-
The next array matches a string
- when
i=j=0
If,S[i]==P[j]
, character match, i++, j++ - if
j=-1
, P string must be matched from the beginning, i++, j++ S[i]! =P[j]
, the matching imbalance,j=next[j]
Code implementation
Public class KMP {public static int KMP (String s, String p) {// getNext table int[] next = getNext(p); Int I = 0; int j = 0; while (i < s.length() && j < p.length()) { if (j == -1 || s.charAt(i) == p.charAt(j)) { i++; j++; } else { j = next[j]; } } if (j >= p.length()) { return i - j; } return -1; Public static int[] getNext(String p) {int len = p.length(); Int [] next = new int[len]; int k = -1; Int j = 0; int j = 0; Next [0] = -1 next[0] = -1; While (j < len - 1) {/ / cycle of p string string before the if (k = = | | p.c harAt (j) = = p.c harAt (k)) {next [+ + j] = + + k; } else { k = next[k]; } } return next; } public static void main(String[] args) { int kmp = kmp("BBCABCDABABCDABCDABDE", "ABCDABD"); System.out.println(kmp); }}Copy the code
reference
- String and KMP algorithm
- KMP algorithm for the next array code