Live each day like it could be your last.
Live each day as if it was your last.
The introduction
Avenue to simple, my ideal is to use the simplest code to achieve the most beautiful algorithm. String matching is widely used, such as Java’s indexOf(), JS’s whole-family bucket set (find, indexOf,include…). And so on. This article mainly shares the string matching algorithms that they depend on at the bottom. Two simple, two complicated. Without further comment, all source code has been uploaded to Github: link
Ps: BF algorithm is the most simple string matching algorithm, need to know and know why. The RK algorithm understands it to be BF + HASH. As for the BM and KMP behind, it is ok to understand its principle. The predecessors have already built the wheel, and they can think of using it when needed, but it is quite difficult to write it out.
BF algorithm
Bf algorithm commonly known as naive matching algorithm, why is it called this name, because it is very violent, in the main string, check the starting position are 0, 1, 2… N-m substrings of length m, n-m+1, and see if there’s anything that matches the pattern string.
parsing
Here I loop is the trace main string TXT,j is the trace pattern string pattern, first peripheral first determine the access times, tlen-plen.
J loop for comparison. Maybe the only thing that’s hard to understand here is I + j. Look at the test results.
private int bfSearch(String txt, String pattern) {
int tLen = txt.length();
int pLen = pattern.length();
if (tLen < pLen) return- 1;for (int i = 0; i <= tLen - pLen; i++) {
int j = 0;
for (; j < pLen; j++) {
System.out.println(txt.charAt(i + j) + "--" + pattern.charAt(j));
if(txt.charAt(i + j) ! = pattern.charAt(j))break;
}
if (j == pLen) return i;
}
return- 1; }Copy the code
variant
The function of I and j is the same as that of normal I +j. However, when a mismatch is found, the two Pointers of I and j need to be backdated. J returns to the beginning and I points to the next character.
private int bfSearchT(String txt, String pattern) {
int tLen = txt.length();
int i = 0;
int pLen = pattern.length();
int j = 0;
for (; i < tLen && j < pLen; i++) {
System.out.println(txt.charAt(i) + "--" + pattern.charAt(j));
if (txt.charAt(i) == pattern.charAt(j)) {
++j;
} else{ i -= j; j = 0; }}if (j == pLen) {
System.out.println("end... i = " + i + ",plen = " + pLen);
return i - pLen;
}
return- 1; }Copy the code
The test code
Ps: hello world
public static void main(String[] args) {
BFArithmetic bf = new BFArithmetic();
String txt = "hello world";
String pattern = "world";
int res = bf.bfSearch(txt, pattern);
System.out.println(BF algorithm matching result: + res);
// int resT = bf.bfSearchT(txt, pattern);
// System.out.println("BF algorithm (display rollback) matching results :" + resT);
}Copy the code
The test results
Fairly RK algorithm
Rk algorithm is equivalent to an advanced version of BF algorithm, which mainly introduces the hash algorithm. Reduced time complexity. The n- M +1 substrings in the main string are hashed by the hash algorithm, and then the size is compared with the hash value of the pattern string one by one. If the hash value of a substring is equal to the pattern string, then the corresponding substring matches the pattern string. Because the hash value is a number, comparing numbers for equality is very fast, so the efficiency of pattern string and substring comparisons is improved.
Initialize the
Here we prefab the pattern string, generate the corresponding hash value, and then randomly generate a large prime number for subsequent use.
private RKArithmetic(String pattern) {
this.pattern = pattern;
pLen = pattern.length();
aLen = 256;
slat = longRandomPrime();
System.out.println("Random primes:" + slat);
aps = 1;
// aLen^(pLen - 1) % slat
for (int i = 1; i <= pLen - 1; i++) {
aps = (aLen * aps) % slat;
}
patHash = hash(pattern, pLen);
System.out.println("patHash = " + patHash);
}Copy the code
To prepare
Randomly generate a large prime number
private static long longRandomPrime() {
BigInteger prime = BigInteger.probablePrime(31, new Random());
return prime.longValue();
}Copy the code
The hash algorithm
private long hash(String txt, int i) {
long h = 0;
for (int j = 0; j < i; j++) {
h = (aLen * h + txt.charAt(j)) % slat;
}
return h;
}Copy the code
Verify that the string matches
private boolean check(String txt, int i) {
for (int j = 0; j < pLen; j++)
if(pattern.charAt(j) ! = txt.charAt(i + j))return false;
return true;
}Copy the code
implementation
The implementation is still relatively easy to read, but instead of comparing hash values, the comparison is made.
private int rkSearch(String txt) {
int n = txt.length();
if (n < pLen) return- 1; long txtHash =hash(txt, pLen);
if ((patHash == txtHash) && check(txt, 0))
return 0;
for (int i = pLen; i < n; i++) {
txtHash = (txtHash + slat - aps * txt.charAt(i - pLen) % slat) % slat;
txtHash = (txtHash * aLen + txt.charAt(i)) % slat;
int offset = i - pLen + 1;
System.out.println("The first" + offset + "Sub-txthash =" + txtHash);
if ((patHash == txtHash) && check(txt, offset))
return offset;
}
return- 1; }Copy the code
The test code
public static void main(String[] args) {
String pat = "world";
String txt = "hello world";
RKArithmetic searcher = new RKArithmetic(pat);
int res = searcher.rkSearch(txt);
System.out.println("RK algorithm matching result:" + res);
}Copy the code
The test results
BM algorithm
The BM algorithm wheel has been built. It is said to be the most efficient and commonly used string matching algorithm.
Analysis of the
- The core idea is to greatly reduce the number of comparisons and reduce the time complexity by sliding the pattern string backward along the main string in big strides. The key of the algorithm lies in how to balance the steps of enough and no omission, and at the same time to improve the efficiency of execution. This requires the pattern string to follow both the bad character rule and the good suffix rule as it slides backwards, along with a few tricks.
- Bad character rule: The characters in the pattern string and the main string are compared bit by bit from back to front. When the bad character that does not match is found, the subscript value si of the pattern string is recorded, and the nearest position xi before the subscript si of the bad character in the pattern string is found (if there is no bad character, it is marked as -1). Si-xi is the backward sliding distance. (PS: I think plus xi must be in front of si, that is smaller than si, so we don’t have to worry about the calculated distance being negative). But the bad character rule doesn’t slide back long enough, so it needs the good suffix rule.
- Good suffix rule: Compares the characters in the pattern string to the main string bit by bit, stopping when a bad character occurs. If there is a substring {u} that has been successfully matched, the nearest {u} is found before {u} in the pattern string and is denoted as {u’}. The pattern string is then moved back so that {u’} of the pattern string overlaps {u} of the main string. If {u’} does not exist, the pattern string is moved directly after {u} in the main string. In order not to miss anything, you need to find the longest suffix substring of good suffixes that matches the prefix substring of the pattern string (which is also the suffix substring of the pattern string). Then move the pattern string to the right to its left edge to align the suffix substring of this good suffix with the left edge of the main string.
To prepare
Construct bad character hash table
private void generateBC(char[] patChars, int[] records) {
for (int i = 0; i < aLen; i++) {
records[i] = -1;
}
for(int i = 0; i < patChars.length; Int ASCII = (int) patChars[I]; int patChars[I]; records[ascii] = i; } System.out.println("Bad character hash table:");
print(records);
}Copy the code
Good suffix
private void generateGS(char[] patChars, int[] suffix, boolean[] prefix) {
int pLen = patChars.length;
for(int i = 0; i < pLen; ++ I) {// suffix[I] = -1; prefix[i] =false;
}
for(int i = 0; i < pLen - 1; ++i) { int j = i; Int k = 0;while(j >= 0 && patChars[j] == patChars[pLen - 1 - k]) { --j; ++k; // suffix[k] = j+1; // Suffix [k] = j+1; } // If the common suffix substring is also the prefix substring of the pattern stringif (j == -1) prefix[k] = true; }}Copy the code
mobile
private int moveByGS(int index, int pLen, int[] suffix, boolean[] prefix) { int k = pLen - 1 - index; // Good suffix lengthif (suffix[k] != -1) return index - suffix[k] + 1;
for (int i = index + 2; i <= pLen - 1; i++) {
if (prefix[pLen - i])
return i;
}
return- 1; }Copy the code
implementation
- Suffix In the pattern string, look for another substring that matches the good suffix
- Prefix Records whether the suffix substring of a pattern string matches the prefix substring of a pattern string
Private int bmSearch(String TXT, String pattern) {// record the last position of each character in the String int[] records = new int[aLen]; char[] txtChars = txt.toCharArray(); int tLen = txtChars.length; char[] patChars = pattern.toCharArray(); int pLen = patChars.length; generateBC(patChars, records); int[] suffix = new int[pLen]; boolean[] prefix = new boolean[pLen]; generateGS(patChars, suffix, prefix); Int index = 0; // int index = 0;while(index <= tLen - pLen) { int i = pLen - 1; // The pattern string matches from the back to the frontfor(; i >= 0; -- I) {// The subscript in the pattern string for bad characters is Iif(txtChars[index + i] ! = patChars[i])break;
}
if (i < 0) {
return index;
}
int x = i - records[(int) txtChars[index + i]];
int y = 0;
if (i < pLen - 1) {
y = moveByGS(i, pLen, suffix, prefix);
}
System.out.println("x = " + x + ",y = " + y);
index = index + Math.max(x, y);
}
return- 1; }Copy the code
The test code
public static void main(String[] args) {
BMArithmetic bmArithmetic = new BMArithmetic();
String txt = "hello world";
String pattern = "world";
int res = bmArithmetic.bmSearch(txt, pattern);
System.out.println("BM algorithm matching result:" + res);
}
Copy the code
The test results
BM Heuristic (Lite)
Analysis of the
BM algorithm is worthy of the so-called linear level of calculation, it is said that the efficiency of KMP algorithm is 3~4 times, have time must check.
If you have a problem and it doesn’t work, try to get√ new ideas.
Initialize the
private BoyerMoore(String pattern) { this.pattern = pattern; pLen = pattern.length(); int aLen = 256; records = new int[aLen]; // Initialize the record array. The default is -1for(int i = 0; i < aLen; i++) { records[i] = -1; } // The right-most position in which the character in the pattern string appearsfor(int j = 0; j < pLen; j++) { records[pattern.charAt(j)] = j; }}Copy the code
implementation
According to the name skip, two keywords can also be analyzed in reverse order and skip.
private int bmSearch(String txt) {
int tLen = txt.length();
int skip;
for (int i = 0; i <= tLen - pLen; i += skip) {
skip = 0;
for (int j = pLen - 1; j >= 0; --j) {
System.out.println(txt.charAt(i + j) + "--" + pattern.charAt(j));
if(txt.charAt(i + j) ! = pattern.charAt(j)) { skip = j - records[txt.charAt(i + j)];if (skip < 1) skip = 1;
break; }}if (skip == 0) return i;
}
return- 1; }Copy the code
The test code
public static void main(String[] args) {
String txt = "hello world";
String pattern = "world";
BoyerMoore bm = new BoyerMoore(pattern);
int res = bm.bmSearch(txt);
System.out.println("BM algorithm matching result:" + res);
}Copy the code
The test results
KMP algorithm
Analysis of the
The KMP algorithm introduces an invalidation function — the next array. The key to this algorithm is how the next function is calculated. Amazing? No, it’s pins and needles. It’s hard to understand. Only debug step by step with.
To prepare
K = next[k]. Since the next character in the longest string of the previous one is not equal to the last one, we need to find the second long string of the previous one. The problem becomes finding the longest string from 0 to next(k). If the next character is not the same as the last one, we continue to find the second long string, that is, the next next(k), until we find it, or none at all.
private int[] getNext(char[] patChars, int pLen) {
int[] next = new int[pLen];
next[0] = -1;
int k = -1;
for (int i = 1; i < pLen; i++) {
while(k ! = -1 && patChars[k + 1] ! = patChars[i]) { k = next[k]; }if (patChars[k + 1] == patChars[i])
++k;
next[i] = k;
}
System.out.println("Good prefix:");
print(next);
return next;
}Copy the code
implementation
private int kmpSearch(String txt, String pattern) {
char[] txtChars = txt.toCharArray();
int tLen = txtChars.length;
char[] patChars = pattern.toCharArray();
int pLen = patChars.length;
int[] next = getNext(patChars, pLen);
int index = 0;
for (int i = 0; i < tLen; i++) {
while(index > 0 && txtChars[i] ! = patChars[index]) { index = next[index - 1] + 1; } System.out.println(txtChars[i] +"--" + patChars[index]);
if (txtChars[i] == patChars[index])
++index;
if (index == pLen)
return i - pLen + 1;
}
return- 1; }Copy the code
The test code
public static void main(String[] args) {
KMPArithmetic kmpArithmetic = new KMPArithmetic();
String txt = "hello world";
String pattern = "world";
int res = kmpArithmetic.kmpSearch(txt, pattern);
System.out.println("KMP algorithm matching result:" + res);
}Copy the code
The test results
KMP algorithm (based on dFA of determined Priority State Automata)
Initialize the
private KMPByDFA(String pattern) {
this.pattern = pattern;
this.pLen = pattern.length();
int aLen = 256;
dfa = new int[aLen][pLen];
dfa[pattern.charAt(0)][0] = 1;
int i = 0;
for (int j = 1; j < pLen; j++) {
for(int k = 0; k < aLen; Dfa [k][j] = dfa[k][I]; } dfa[pattern. CharAt (j)][j] = j + 1; I = dfa[pattern.charat (j)][I]; }}Copy the code
implementation
private int kmpSearch(String txt) {
int i = 0;
int j = 0;
int tLen = txt.length();
for(; i < tLen && j < pLen; i++) { j = dfa[txt.charAt(i)][j]; } // Find a match and reach the end of the pattern stringif (j == pLen)
return i - pLen;
return- 1; }Copy the code
The test code
public static void main(String[] args) {
String txt = "hello world";
String pattern = "world";
KMPByDFA kmp = new KMPByDFA(pattern);
int res = kmp.kmpSearch(txt);
System.out.println("BM algorithm matching result:" + res);
}Copy the code
The test results
end
Your likes and attention are the biggest support for me, thank you!