The premise
I haven’t studied algorithms for a long time. After all, I seldom use them in daily life. Even if I memorize them by rote, there is no implementation scenario, which makes it easy to forget. Recently, an Algorithm called Levenshtein Distance Algorithm was used to meet the requirement of “desensitization data and plaintext data matching”. This paper makes a simple analysis of the Algorithm principle and uses this Algorithm to solve several common scenarios.
What is Levenshtein Distance
Levenshtein Distance, commonly known as Edit Distance (Levenshtein Distance is just one type of Edit Distance) or Levenstein Distance, The algorithm concept was proposed by The Russian scientist Levenshtein Vladimir I in 1965. The concept of this algorithm is simple: Levenshtein Distance refers to “the minimum number of editing operations required to convert from one to the other between two strings”. The allowed editing operations include:
- Replace one character with another (
Substitutions
). - Insert a character (
Insertions
). - Delete a character (
Deletions
).
❝
Hereafter, Levenshtein Distance is referred to as LD
❞
Levenshtein Distance formula definition
This mathematical formula is going to give you the value of LD. Here’s an example:
Kitten converted to sitting has an LD value of 3:
- Kitten → sitten (k→s)
- Sitten → sittin (e→ I)
- Sittin → sitting (insert a ‘g’)
Levenshtein Distance dynamic programming method
Dynamic programming can be used to measure LD values, and the steps are roughly as follows:
- Initialize one
LD
matrix(M,N)
.M
andN
Are the lengths of two input strings. - The matrix can be filled from the top left to the bottom right, with each horizontal or vertical jump corresponding to an insert or a delete, respectively.
- By defining the cost of each operation to be 1, if the two strings do not match, then the cost of a diagonal jump is 1, otherwise it is 0, in a nutshell:
- if
[i][j]
If two strings of position are equal, the value of[i][j]
Add 1 to the left, 1 to the top, and 0 to the top left, and fill it with the smallest of the three[i][j]
. - if
[i][j]
If the two strings of position are not equal, the value from[i][j]
Take the minimum value of the left, upper left, and upper three positions, add one to the minimum value (or add one to all three values and then take the minimum value), and then fill to[i][j]
.
- if
- According to the above rules
LD
matrix(M,N)
After filling, finally“The number in the lower right corner of the matrix“It’s two stringsLD
Value.
Instead of trying to prove the results of the above dynamic programming (i.e., by default, the results of this dynamic programming are correct), here are two examples to illustrate the problem:
- Example 1 (two strings of equal length) :
son
andsun
. - Example 2 (two non-equal length strings) :
doge
anddog
.
“Example 1:“
Initialize LD matrix (3,3) :
s |
o |
n |
||
---|---|---|---|---|
0 |
1 |
2 |
3 |
|
s |
1 |
|||
u |
2 |
|||
n |
3 |
Computing [0] [0] the location of the value, because the ‘s’ = ‘s’, so [0] [0] value = min (1 + 1 + 1, 1, 0, + 0) = 0.
s |
o |
n |
||
---|---|---|---|---|
0 |
1 |
2 |
3 |
|
s |
1 |
0 | ||
u |
2 |
|||
n |
3 |
Calculate the values of other positions according to this rule, and the LD matrix ‘after filling is as follows:
s |
o |
n |
||
---|---|---|---|---|
0 |
1 |
2 |
3 |
|
s |
1 |
0 | 1 | 2 |
u |
2 |
1 | 1 | 2 |
n |
3 |
2 | 2 | 1 |
So the LD value of son and Sun is 1.
“Example 2:“
Initialize LD matrix (4,3) :
d |
o |
g |
||
---|---|---|---|---|
0 |
1 |
2 |
3 |
|
d |
1 |
|||
o |
2 |
|||
g |
3 |
|||
e |
4 |
Then fill the matrix:
d |
o |
g |
||
---|---|---|---|---|
0 |
1 |
2 |
3 |
|
d |
1 |
0 |
1 |
2 |
o |
2 |
1 |
0 |
1 |
g |
3 |
2 |
1 |
0 |
e |
4 |
3 |
2 |
1 |
So the LD value of doge and dog is 1.
Levenshtein Distance algorithm implementation
According to the dynamic programming method mentioned above, LD algorithm can be relatively simple to implement. Here, Java language is used to implement:
public enum LevenshteinDistance {
/ / the singleton
X;
/ * ** Calculate Levenshtein Distance* / public int ld(String source, String target) { Optional.ofNullable(source).orElseThrow(() -> new IllegalArgumentException("source")); Optional.ofNullable(target).orElseThrow(() -> new IllegalArgumentException("target")); int sl = source.length(); int tl = target.length(); // To define the matrix, increment the column and column by 1 int[][] matrix = new int[sl + 1][tl + 1]; // First row, first column assigned for (int k = 0; k <= sl; k++) { matrix[k][0] = k; } for (int k = 0; k <= tl; k++) { matrix[0][k] = k; } // Define temporary edit consumption int cost; for (int i = 1; i <= sl; i++) { for (int j = 1; j <= tl; j++) { if (source.charAt(i - 1) == target.charAt(j - 1)) { cost = 0; } else { cost = 1; } matrix[i][j] = min( / / left matrix[i - 1][j - 1] + cost, / / right matrix[i][j - 1] + 1. / / the left matrix[i - 1][j] + 1 ); } } return matrix[sl][tl]; } private int min(int x, int y, int z) { return Math.min(x, Math.min(y, z)); } / * ** Calculates the match rate* / public BigDecimal 先生(String source, String target) { int ld = ld(source, target); // 1 - ld / max(len1,len2) return BigDecimal.ONE.subtract(BigDecimal.valueOf(ld) .divide(BigDecimal.valueOf(Math.max(source.length(), target.length())), 2, BigDecimal.ROUND_HALF_UP)); } } Copy the code
The complexity of the algorithm is O(N * M), where N and M are the lengths of two input strings, respectively. The implementation of the algorithm here completely refers to the inference process of the previous dynamic programming method. In fact, it is not necessary to define two-dimensional array (matrix), only two one-dimensional arrays are used. Please refer to the implementation of Levenshtein algorithm in Java-String-Similarity. Run the previous example:
public static void main(String[] args) throws Exception {
String s = "doge";
String t = "dog";
System.out.println("Levenshtein Distance:" +LevenshteinDistance.X.ld(s, t));
System.out.println("Match Rate:" +LevenshteinDistance.X.mr(s, t));
} / / output Levenshtein Distance:1 Match Rate:0.75 Copy the code
Some usage scenarios of Levenshtein Distance algorithm
The main application scenarios of LD algorithm are as follows:
DNA
Analysis.- Spell check.
- Speech recognition.
- Plagiarism detection.
- And so on…
In fact, the main “string” matching scenario, based on the actual scenario encountered.
Desensitized data matched with clear text data
Recently, there are some scenarios to match desensitized data with plaintext data. Sometimes, exported files from a third party are desensitized files in the following format:
The name | Mobile phone no. | Id card |
---|---|---|
Zhang * dog |
123 * * * * 8910 |
123456 * 8765 * * * * * * * |
Our plaintext data is as follows:
The name | Mobile phone no. | Id card |
---|---|---|
Big dogs |
12345678910 |
123456789987654321 |
To match the two pieces of data, the above two pieces of data correspond to the same person’s data. The principle is as follows: If and only if the LD value of the two pieces of data is 4, the LD value of the ID card is 8, and the LD value of the name is 1, the two pieces of data match completely.
Using the algorithm written earlier:
public static void main(String[] args) throws Exception {
String sourceName = "Zhang * Dog";
String sourcePhone = "123 * * * * 8910";
String sourceIdentityNo = "123456 * 8765 * * * * * * *";
String targetName = "Zhang Big Dog";
String targetPhone = "12345678910"; String targetIdentityNo = "123456789987654321"; boolean match = LevenshteinDistance.X.ld(sourceName, targetName) == 1 && LevenshteinDistance.X.ld(sourcePhone, targetPhone) == 4 && LevenshteinDistance.X.ld(sourceIdentityNo, targetIdentityNo) == 8; System.out.println("Match or not :" + match); targetName = "Big doge"; match = LevenshteinDistance.X.ld(sourceName, targetName) == 1 && LevenshteinDistance.X.ld(sourcePhone, targetPhone) == 4 && LevenshteinDistance.X.ld(sourceIdentityNo, targetIdentityNo) == 8; System.out.println("Match or not :" + match); } // Output the result Matching:true Matching:false Copy the code
Spell check
This scenario seems to be close to life. It is a spelling cue for dictionary applications. For example, if you type throwab, you will hear throwable. Search for words with high matching degree (low LD value) for hints (in fact, it may not be implemented in this way for efficiency). Here’s an example:
public static void main(String[] args) throws Exception {
String target = "throwab";
// simulate a word library
List<String> words = Lists.newArrayList();
words.add("throwable");
words.add("their"); words.add("the"); Map<String, BigDecimal> result = Maps.newHashMap(); words.forEach(x -> result.put(x, LevenshteinDistance.X.mr(x, target))); System.out.println("Input value is :" + target); result.forEach((k, v) -> System.out.println(String.format("Candidate value :%s, compatibility :%s", k, v))); } // Output the result The value is throwabCandidate value :the, matching degree:0.29 Candidate values :throwable, compatibility:0.78 Candidate value :their, matching degree:0.29 Copy the code
This allows you to select the throwable with the best match based on the input throwab.
Plagiarism detection
The essence of plagiarism detection is string matching. It can be simply considered as plagiarism if the matching degree is higher than a certain threshold. For example, the line from “I am a Little Bird” is:
❝
I am a small little bird, want to fly ah fly but also fly not high
❞
Suppose I write a lyric:
❝
I’m a tiny little dog who wants to sleep but can’t get enough
❞
We can try to find a match between two sentences:
System.out.println(LevenshteinDistance.X.mr("I'm a little bird, I want to fly but I can't fly high."."I'm a little little dog who tries to sleep and can't get enough."));
// The output is as follows
0.67
Copy the code
It can be considered that the lyrics created by the author are completely copied. Of course, for plagiarism detection of large text (such as paper duplicate, etc.), the execution efficiency needs to be considered, and the solution should be similar, but we need to consider how to divide words, case and other problems.
summary
This paper only makes a superficial analysis of Levenshtein Distance and lists some simple scenes. In fact, this algorithm is very common in daily life. The author suspects that word spelling check and paper duplicate check (plagiarism discrimination) used in dictionaries may be related to this algorithm. Algorithms have a steep learning curve, but they are a real problem solver. Levenshtein Distance has obvious advantages and disadvantages:
- Obvious disadvantages: The time complexity of the algorithm is
O(M * N)
, there is a double cycle,“Low efficiency“. - The obvious advantage: The degree of match, or “the result is closest to a real-life scenario”, based on this algorithm.
For its disadvantages, some improved editing distance algorithms can be considered, and expansion will not be done here.
References:
- Wikipedia – Levenshtein Distance
- java-string-similarity
- The Levenshtein Algorithm
(The cover image of e-A-20200308 R-A-20200716 is from “Elf Swordsman F”)
This article is formatted using MDNICE