In the demo at the beginning of this article, just type two strings of arbitrary content and the difference between the strings will be calculated and rendered automatically on the page.
Address: string-diff-demo.herokuapp.com
Source code address: github.com/lqt0223/str…
Dynamic programming
Dynamic programming is one of the topics that everyone will encounter in algorithm learning. My personal understanding of it is:
- Dynamic programming deals with large scale problems
- But just like recursion, the base case of the problem has a solution
- And a solution to a larger version of the same problem can be derived from an existing solution to a smaller problem by a set of rules
- Dynamic programming differs from recursion in that the latter uses methods defined to invoke itself in its process, and “declaratively” forms the whole solution process. The former requires the establishment of a dynamic programming matrix to record the solutions under each problem scale, and requires the explicit execution of the loop to continuously expand the problem size and solution, which is “imperatively” the form of the solution process
The common problems that can be solved by dynamic programming are:
- Knapsack problem: Given a backpack with limited capacity and a group of items with weight and value, how to select several items to put into the backpack to maximize the value
- Longest common sequence problem (LCS problem for short) : Find the longest common sequence of two strings. A common sequence is a sequence that occurs in both strings and is not necessarily contiguous in the original string
- Subset sum problem: Given a set of integers, whether there are non-empty subsets such that the sum of the numbers in the subset is 0
- .
The relationship between the longest common sequence problem and Diff & Patch algorithm
In the past, WHEN I was doing algorithm problems on codeWar and other websites, I got the “Longest Common Sequence” or similar questions many times. I also learned from some algorithm books that a relatively easy to understand algorithm for such problems is dynamic programming. But I never quite understood the practical application of such questions. Until recently I saw the following paper:
An O(ND) Difference Algorithm and Its Variations – EUGENE W. MYERS
I’ve only read one or two sections of this article, and it’s all about using the graph algorithm to find the LCS between two strings and the shortest edit script (SES for short, the steps required to transform from string A to string B). A step is an operation on a string, such as deleting a character at a position, inserting a character at a position, etc.).
It can be seen from this paper that LCS and SES are dual problems, which are only two aspects of an optimization problem. That is, when we are looking for a common subsequence of two strings, if the optimal solution (the longest common subsequence) has been found, then the editing step between the two strings in this optimal solution case is also the shortest editing step. In general terms, when we solve for LCS, we get SES.
Because SES describes a sequence of operations from one string to another, it is similar to the differential data produced by various data comparison tools. So we know that one of the practical applications of the LCS problem is data comparison, difference calculation, and difference update.
Use matrices to transform LCS and SES problems
Since dynamic programming is used to solve LCS and SES problems, we need to use matrices (two-dimensional arrays) to record some information about the optimal solution.
This section mainly describes the properties of matrices in the process of using matrices to solve the above problems, and what aspects of LCS or SES problems these properties correspond to. These contents are also the summary and simplification of the content in section 2 of the paper mentioned in the previous section.
If you haven’t been exposed to solving LCS problems using dynamic programming before, you can watch the video below to get a basic idea of the process.
Longest Common Subsequence – Tushar Roy – Youtube
In general, using matrix transformation and solving LCS and SES problems requires the following three stages:
- Initialization phase. Assume that A string of length m, string B length is n, then initialize A matrix m + 1 * n + 1, will be the first column of the matrix and the first line are initialized to 0 (matrix in the subsequent need to fill in the length of an LCS, so during initialization, the first row or the first column represents any one of the two string is empty, need to fill in 0)
- Projection stage. Fill in the matrix from left to right and top to bottom according to certain deduction rules. That is, constantly solve for the length of the LCS between the prefixes of strings A or B
- Backtracking stage. When the matrix is filled, the value at the bottom right corner of the matrix is the length of the LCS of strings A and B. If we need to find out what LCS is, we need to start from the lower right corner of the matrix and follow certain rules to find a path to the upper left corner of the matrix, ensuring that the length of the LCS decreases by 0 or 1 each time we go through the path.
After three stages, the matrix will look something like the following.
The figure shows the matrix formed by solving LCS and SES using dynamic programming when string A is “abcabba” and string B is “cbabac”. From this matrix we can get the following answers about LCS and SES between two strings:
- The LCS of both strings is 4 (the value entered in the bottom right corner of the matrix)
- The LCS of the two strings is “caba” (this is obtained by looking at the path formed by the red arrow after the backtrace is completed; It can be seen from the above figure that, in the backtracking stage, every time when it is necessary to move to the upper left corner, A character in string A corresponding to this coordinate is the same as A character in string B, that is, this character can be used as one of the components of LCS.)
- Each move in backtracking can be mapped to a step in SES:
- Moving to the upper-left corner means that one of the strings that make up the LCS has been found, which for SES means that no operation is required
- Moving to the left, in the case of SES, means removing characters at the specified position in string A
- Move up
- Moving up in the first column of the matrix (the leftmost column where all zeros are initialized), for SES, means adding characters to the head of string A
- Moving up in other columns of the matrix, for SES, means adding characters after the specified position in string A
For example, if string A is “abcabba” and string B is “cbabac”, how do you know what steps can be taken to change string A to string B the fastest? We can use the rules above to translate the red path to the SES we need
- Delete the first and second characters of string A (the two left arrows in the upper-left corner)
- Adds the character “b” at position 3 of string A (fourth upward arrow from top left to bottom right)
- Delete the sixth character of string A (third to last left arrow from top left to bottom right)
- Add the character “c” at position 7 of string A (bottom-right upward arrow)
After that we can transform string A to string B
SES simultaneous operation problem
The end of the previous section gave the SES from “abcabba” to “cbabac”. You may have tried to use the SES with scratch paper or some other tool, but it didn’t work. This is because the compilation steps represented by SES need to be performed simultaneously. Here is an example of the correct use of SES using “abcabba” through “cbabac” :
The original string
a b c a b b a Copy the code
Delete the first and second characters of string A (the two left arrows in the upper-left corner) (where * marks the character to be deleted)
* * c a b b a Copy the code
Adds the character “b” at position 3 of string A (fourth upward arrow from top left to bottom right)
* * c a b b a b Copy the code
Delete the sixth character of string A (third to last left arrow from top left to bottom right)
* * c a b * a b Copy the code
Add the character “c” at position 7 of string A (bottom-right upward arrow)
* * c a b * a b c Copy the code
Restore the above hashtable-like structure to a string by ignoring characters that need to be deleted, concatenating vertically extended list into a substring, and finally concatenating all substrings in order to get “cbabac”.
Therefore, the simultaneous operation of SES, meaning any operation step, should not affect the initial character arrangement of the string. We can use this vertical data structure to rearrange string operations and eventually convert them to the target string.
Difference visualization
As shown in the previous section, one application of SES is direct execution, which results in the generation of the target string.
We can also combine the original String with SES to generate a DOM String and render the original String to the target String differentially in the browser. The demo at the beginning of this article is a demonstration of this approach.
Afterword.
Not only character-level DIFF & patch, but also word-level and line-level diFF & patch can be easily realized by dynamic programming without considering the spatial complexity of the algorithm.
My biggest experience from learning and implementing this algorithm is as follows:
- Transforming problems using graphical presentation and solution processes makes seemingly complex problems intuitive and simple (for example, using matrices to record and solve LCS)
- Some algorithms and algorithm ideas have been mastered, after re-thinking, sometimes can get unexpected greater harvest