This is the second day of my participation in the First Challenge 2022
The basic concept
- Edit distance problem:
- Editing distance problem is difficult, but the solution is very beautiful, and it is a rare practical algorithm
- Edit distance usage scenario:
- For modifying the misplaced content of the article. Limit the article can only modify 20 words, and support to add, delete, replace operations, seek the optimal solution to modify
- For measuring the similarity of DNA, DNA sequence is composed of A, G, C and T, which can be likened to A string. The similarity of two DNA sequences can be measured by the editing distance. The smaller the editing distance, the more similar the two DNA sequences are
Thought analysis
- Edit distance problem:
- Given two strings s1 and s2, only three operations can be used to change s1 to s2 and find the smallest operand
- I want to make sure that whether I change s1 to S2, or s2 to S1, I get the same result
- In the longest common subsequence, the problem of dynamic programming of two strings is solved by using two Pointers I and j respectively to the end of the two strings, and then moving forward step by step to reduce the size of the problem
- Calculate edit distance:
- The algorithmbase case:
- S1 and S2 have the same characters. In order to minimize the editing distance, there is no need to perform the task operation, just move
- If j completes S2, I has not completed S1. In order to minimize the editing distance, s1 can only be shortened to S2 using the delete operation
- If I completes S1 and j has not completed S2, in order to minimize the editing distance, s1 can only be increased to S2 using an insert operation
- The algorithmbase case:
The problem solution
- The base case of the algorithm: if I completes S1 or j completes S2, it can return the remaining length of the other string
- For each pair of S1 [I] and S2 [j] there are four operations:
ifS1 [I] == s2[j]: no operation, skip I directly,j moves forward at the same timeelseOperation:3 选 1: insert -insert delete -delete replace -replaceCopy the code
- For the 1 of 3 operations, you can use a recursive technique to try all three operations once, and then select the result of the operation based on which operation has the smallest edit distance
def minDistance(s1, s2) - >int:
# return the minimum edit distance for s1[0,... I] and s2[0,... j]
def dp(i, j) :
# base case
if i == -1: return j + 1
if j == -1: return i + 1
if s1[i] == s2[j]:
No task operation is required
# s1 and s2 [0,..., I] [0,..., j] the minimum edit distance s1 [0,..., I - 1) and s2] [0,..., j - 1 the minimum edit distance
Dp of I, j is dp of I minus one, j minus one.
return dp(i - 1, j - 1)
else:
return min(
# insert
S2 [j] = s1[I]; s2[j] = s2[j]
# move the characters of S2 forward by 1 bit to continue the comparison
Insert operand increment by 1
dp(i, j - 1) + 1
# remove
Select s1[I] from s1[I]
# move the characters of S2 forward by 1 bit to continue the comparison
Add 1 to delete operation
dp(i - 1, j) + 1
# replace
S1 [I] = s2[j] = s2[j
# At the same time, I and j move forward by 1 bit to continue the comparison
# replace the operand by 1
dp(i - 1, j - 1) + 1
)
return dp(len(s1) - 1.len(s2) - 1)
Copy the code
- This method has overlapping sub-problem and needs to be optimized by dynamic programming algorithm
- How to determine whether there are overlapping subproblems?
- The algorithm framework of the solution is abstracted
- Identify the different paths from the subproblem to the original problem
- Once there are repeated paths, it means there are a large number of repeated paths, that is, there are overlapping sub-problems
Optimization problem
- There are two ways of using dynamic programming optimization for overlapping subproblems:
- The memo
- DP Table
The memo
def minDistance(s1, s2) - >int:
# memo
memo = dict(a)def dp(i, j) :
if (i, j) in memo:
return memo[(i, j)]
...
if s[i] == s[j]:
memo[(i, j)] = ...
else:
memo[(i, j)] = ...
return memo[(i, j)]
return dp(len(s1) - 1.len(s2) - 1)
Copy the code
DP Table
- The first thing is to understandDPThe meaning of array,DPAn array is a two-dimensional array
- base case: dp[…] [0] and dp [0] […].
- [I][J]
- Def dp (I, j) – > int: return to s1 and s2 [0,…, I] [0,…, j] the minimum edit distance
- Dp [I – 1] [1] : store s1 and s2 [0,…, I] [j] 0,…, the minimum edit distance. Because the values of I and j of the dp function are -1, and the minimum index of the array is 0. So the DP array will be offset by one
int minDistance(String s1, String s2) {
int m = s1.length(), n = s2.length();
int[][] dp = new int[m][n];
// base case
for (int i = 1; i <=m; i++) {
dp[i][0] = i;
}
for (int j = 1; j <= n; j++) {
dp[0][j] = j;
}
// Compute the DP Table from bottom up
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (s1.isCharAt(i) == s2.isCharAt(j)) {
dp[i][j] = dp[i - 1][j - 1];
} else {
dp = min(
/ / delete
dp[i - 1][j] + 1;
/ / replace
dp[i - 1][j - 1] + 1;
/ / insert
dp[i][j - 1] + 1;) ; }}}// Store the minimum edit distance for s1 and s2
return dp[m][n];
}
int min(int a, int b, int c) {
return Math.min(a, Math.min(b, c));
}
Copy the code
conclusion
- Handle dynamic programming of two strings:
- According to the algorithm of editing distance to solve
- Create a DP Table. This makes it easier to find the state transition relationship
- Because each dp[I][j] is only related to three nearby states, the spatial complexity can be compressed to O(min(m,n)), where M and n are the lengths of two strings O(min(m,n)), where m and n are the lengths of two strings O(min(m,n)), where m and n are the lengths of two strings O(min(m,n))
- You can add additional information to the DP Table:
Node[][] dp;
class Node {
// The value of the previous dp array
int val;
// Attribute manipulation
int choice;
}
Copy the code
- When making an optimal choice, write down the actions and then you can work backwards from the results
- The final result is dp[m][n], where val has the minimum edit distance and choice has the last operation
- Repeat this process, step by step back to the starting point dp[0][0], forming a path, according to this path for string editing, is the minimum editing distance
- Minimum edit distance Java implementation
- Minimum edit distance C++ implementation:
class DynamicProgramming { public: int minDistance(String s1, String s2) { int m = s1.size(), n = s2.size(); vector<vector<int>> dp(m + 1, vector<int>(n + 1)); // basecase // when s2 is empty, s1 needs to delete all characters to be the same as s2 for (int i = 1; i <= m; i++) { dp[i][0] = i; } // When s1 is empty, s2 needs to delete all characters to be the same as s1 for (int j = 1; j <= n; j++) { dp[0][j] = j; } // Solve from bottom up for (int i = 1; i <= m; i++) { for (int j = 1; j <= n; j++) { if (s1[i - 1] == s2[j - 1]) { // The current characters of the two strings are equal dp[i][j] = dp[i - 1][j - 1]; } else { // The current characters of the two strings are different dp[i][j] = min({ // delete s1[I] dp[i - 1][j] + 1.S1 [j] = s1[I]; s2[j] = s2[j dp[i][j - 1] + 1.// Replace s1[I] with s2[j] dp[i - 1][j - 1] + 1}); }}}// Store the minimum edit distance for s1 and s2 returndp[m][n]; }};Copy the code