This article is participating in Python Theme Month. See the link for details
introduce
Levenshtein Distance is the most commonly used method for calculating text editing Distance, which is usually used to calculate the minimum number of editing operations required to transform two strings from one to the other. The smaller the editing Distance is, the greater the similarity of two strings is. There are three permissible editing operations:
- Substitute, to replace one character with another, the edit distance is generally defined as 1, but may also be defined as 2
- Add, insert a character, edit distance defined as 1
- Delete, delete a character, edit distance defined as 1
This paper introduces the principle of Levenshtein Distance in detail through examples, and implements Python code according to the principle.
The principle of
Here we use two strings ABC and yab to illustrate the principle. The ” in the figure represents the empty string, and the figure shows the editing distance details of the two strings. The whole 4*4 numeric matrix is represented by D, the index starts at 0, and the y-coordinate is x. In the figure, D[1][1] indicates that the editing distance between Y and A is 1.
‘ ‘ | a | b | d | |
---|---|---|---|---|
‘ ‘ | 0 | 1 | 2 | 3 |
y | 1 | 1 | 2 | 3 |
a | 2 | 1 | ||
b | 3 | |||
Here are a few key points: |
-
When calculating the edit distance between an empty string and a non-empty string, the result is actually the length of the non-empty string
-
D[1][0], D[y][x], D[y-1][x-1], D[y][x], D[y-1][x-1], D[y][x]
-
When we change ya to AB, we have three different operations:
-
Substitute: The editing distance of replacing A with B is 1, and that of replacing Y with A is D[1][1] in the figure. Together, the editing distance of the two is 2, that is, D[y][x] calculated by substitute operation is D[Y-1][X-1] + 1
-
Add: Insert B after A to edit distance 1, change ya into A to edit distance D[2][1], and add the two to edit distance 2, that is, D[y][x] calculated by add operation is D[y][X-1] + 1
-
Delete: The editing distance to delete A is 1, and the editing distance to change y to AB is D[1][2] in the figure. The combined editing distance of the two is 3, that is, D[y][x] calculated through the add operation is D[y-1][x] + 1
-
Take the minimum value of the three operations, the current value of D[y][x]
-
-
After the above operation, we can get the details below, and the final result is the last value in the lower right corner
‘ ‘ | a | b | d | |
---|---|---|---|---|
‘ ‘ | 0 | 1 | 2 | 3 |
y | 1 | 1 | 2 | 3 |
a | 2 | 1 | 2 | 3 |
b | 3 | 2 | 1 | 2 |
implementation
def distance(s1, s2):
d = [[x for x in range(len(s1)+1)] for _ in range(len(s2)+1)]
for y in range(1,len(s2)+1):
d[y][0] = d[y-1][0] + 1
for x in range(1, len(s1)+1):
for y in range(1, len(s2)+1):
if s1[x-1] == s2[y-1]:
d[y][x] = d[y-1][x-1]
else:
substute = d[y-1][x-1] + 1
add = d[y][x-1] + 1
delete = d[y-1][x] + 1
d[y][x] = min(add, substute, delete)
return d[-1][-1]
Copy the code
The results of
Compare ABC and ADB’s Levenshtein Distance
print(distance('abd','yab'))
Copy the code
Results the print
2
Copy the code