Myers’Diff
preface
I have been writing this article for a long time, because I have been preparing the follow-up Myers’ Diff’s Linear Space refinement. It is not known when the DiffUtil comparison list item data was found to be refreshed locally. Git file comparison uses this algorithm. I saw it again last month and wanted to take a closer look. However, it is found that domestic blogs and posts have little content about this algorithm. Each article tells the content that the author thinks is important, so it is difficult to understand comprehensively if one point is not understood. At the beginning, I did not think clearly about a point for several days. I think programmers should not be afraid of algorithms. Reading books for hundreds of times shows its meaning, and so does reading algorithm codes.
Myer difference algorithm
As a common example, when you commit with Git, you’ll see what changes have been made to the commit. Here’s a quick definition of what diff is: Diff is the difference between the target text and the source text, and the action required to turn the source text into the target text. Myers algorithm was published in The journal Algorithmica by Eugene W.Myers in 1986. Is an algorithm that can produce the shortest intuitive DIff in most cases.
Look for the shortest intuitive diffIs a very vague problem, first of all, we need to abstract this problem into a specific mathematical problem, and then look for an algorithm to solve it.
define
- File A and File B
The diff algorithm takes two files as input. The first, usually older, one is file A, And the second one is file b. The algorithm generates instructions to turn file A into file B. The diff algorithm generates the two files as input. The first usually older file is file A, and the second is file B. The algorithm generates instructions to convert file A to file B.
- Shortest Edit Script ( SES )
The algorithm finds the Shortest Edit Script that converts file A into file B. The SES contains only two types of Commands: deletions from file A and insertions in file b. the algorithm finds the shortest edit script to convert file A to file B. SES contains only two types of commands: delete from file A and insert into file B.
- Longest Common Subsequence ( LCS )
Finding the SES is equivalent to finding the Longest Common Subsequence of the two files. The LCS is the longest list of characters that can be generated from both files by removing some characters. The LCS is distinct from the Longest Common Substring, which has to be contiguous.(Searching SES is equivalent to finding the longest Common subsequence of two files. The LCS is the longest list of characters that can be generated from two files by deleting certain characters. The LCS differs from the longest common substring, which must be continuous.
The sample
This article uses the same example as in this article. File A contains ABCABBA and file B contains CBABAC. These are represented as two character arrays: A [] and B []. The length of A [] is N and the length of B [] is M.
We can solve the problem of going from an array of A to an array of B, and turn it into A problem of going from A string to A string of B.
An array of a.
Along theThe x axis
Put it on top.The array B
Along they
Put it down.
PS: The graphs in this article were created by DiffTutorial, an application that is a learning aid. It displays a graphical representation of the various stages of the algorithm.
Solution: shortest path from top left (0,0) to bottom right (7,6). You can always move a character horizontally or vertically. Horizontal (right) movement indicates deleting from file A and vertical (down) movement indicates inserting into file B. If there are matching characters, it can also move diagonally to end the match. The solution is the trace that contains the most diagonals. LCS is the diagonal in the trajectory, SES is the horizontal and vertical movement in the trajectory. For example, LCS has a length of 4 characters and SES has a length of 5 differences.
Snake: A snake represents a step. For example from (0, 0) – > (0, 1)/(0, 0) – > (1, 0)/(0, 1) – > (0, 2) – > (2, 4) the three snake respectively, and the walk diagonally do not count towards the steps.
K lines: K lines represent long diagonal lines where each K = x-y. Suppose a point m(m,n), then its k line value is m – n.
D contour: the number of steps of each effective path (the path from (0,0) to (m,n) is called effective path). Image point, a path consists of multiple snakes, where D contours represents the number of snakes.
Greedy algorithm
The algorithm is iterative. It calculates the farthest arrival path for each k line of continuous D. When the path reaches the lower-right corner, the solution is found.
There are important points here:
- The path must end at
k
Online. The iteration goes on, sok
The next step on the line isk+1
Move down ork-1
Move to the right;- Computationally continuous
d
Each of thek
Farthest reach path on line (evend
The endpoints of phi are evenk
Line, odd number similar);- The path ends at the lower right corner;
Both 1 and 2 are proved in this paper.
k line
: The brown lines are k lines with odd values of k. The yellow line is the k line of even value of k.snake
: The dark blue lines are snakes. Red snake shows traces of solution.d contours
: The light blue line is the difference outline. For example, all three endpoints on a line labeled “2” have two horizontal or vertical movements.
Number of external cycles
From the top left corner of the rectangle of (x, y) to the bottom right corner. The longest path is one that doesn’t go through all the diagonals. That’s just the length of X and Y that’s the maximum length is equal to N plus M. for ( int d = 0 ; d <= N + M ; d++ )
The number of internal cycles
Within this loop, we must find the farthest path to each K line. For a given d, the reachable k line is in the range [-d.. + d]. K = -d is possible when all moves are down; K = + d is possible when all the moves are on the right. for ( int k = -d ; k <= d ; K plus is equal to 2. If you look at this, you might wonder why k plus is equal to 2.
There’s an optimization here, and we said earlier that the endpoints of even d are on even k, and odd is similar. Explanation: If you move an odd number of steps forward or backward, you must end up on an odd number of k lines. If you move an even number of steps forward, you must end up on an even number of K lines. PS: I have been struggling with this for a long time. The last few thoughts make me more clear:
- from
zero
Start one step at a timeK line
Must be moving fromzero
Start.- The calculation here is not even plus even to get an even number, odd plus odd to get an odd or even number
+ 1 or - 1
).- Even or odd
+ 1 or - 1
And then they change their parity, sod
The parity after this operation is determined byd
Even and odd to make decisions. Because the starting point is evenzero
So saidAn even number
dThe endpoints of phi are even
kLine, odd number similar
.
Example (d=3)
fromd = 3
The example of conducting research meansk
The value range of is[-3, -1, 1, 3]
. To help you, I will use the examples insnake
Is transcribed into the following table:
K = -3: In this case, move down one step only if k = -2, d = 2 (k = -4, d = 2 does not exist). So, we could go this way, starting at 2,4 and going down to 2,5, and since there’s a diagonal between 2,5 and 3,6, we could go to 3,6. So snake is :(2,4) -> (2,5) -> (3,6).
K = -1: when k = -1, there are two cases to discuss: k = 0, d = 2 downward; K is negative 2, d is 2 go to the right.
- When k = 0 and d = 2, it’s the point (2,2). So when you take a step down from (2,2) to (2,3), since (2,3) is not connected diagonally, the whole snake is :(2,2) -> (2,3).
- When k = -2 and d = 2, it’s the point 2,4. So when the right step, from (2, 4) to (3, 4), because (3, 4) and (4, 5) diagonal, so the snake is: (2, 4) – > (3, 4) – > (4, 5).
In the whole process, there are two snakes, and we choose the snake that goes furthest along the K line, so we choose the second snake.
K = 1: when k = 1, there are two possibilities, one is to go to the right from k = 0, or one is to go down from k = 2. Let’s discuss them separately.
- When k = 0 and d = 2, it’s the point (2,2). So when from (2, 2) the right step, arrive at (3, 2), (3, 2) and (5, 4) diagonal, so the snake is: (2, 2) – > (3, 2) – > (5, 4).
- When k = 2, d = 2, it’s the point 3,1. So when you go one step down from 3,1, you get to 3,2. So the snake is :(3,1) ->(3,2) ->(5,4).
In the whole process, there are two snakes, and we choose the snake with the larger starting x value, so it is :(3,1) ->(3,2) ->(5,4).
K = 3: In this case, (k = 4, d = 2) is not possible, so we have to move one step to the right at (k = 2, d = 2). When k = 2, d = 2, it’s the point 3,1. So if we move one step to the right from 3,1, we get 4,1. So the entire snake is :(3,1) -> (4,1) -> (5,2).
Algorithm implementation
We have two loops, and we need a data structure. Note that the solution of d (n) depends only on the solution of d (n-1). Also remember that for even values of D, we find endpoints on even k rows, and these endpoints only depend on previous endpoints all on odd k rows. The same is true for odd values of d. We use an array called V, where k is the index and the x position at the end is the value. We don’t need to store the y position because we can calculate it in terms of x and k: y = x-k. Again, for a given d, k is in the range [-d.. d].
bool down = ( k == -d || ( k ! = d && V[ k - 1 ] < V[ k + 1 ] ) );
If k is equal to -d, we must have moved down; If k is equal to d, we must have moved to the right. For the normal intermediate case, we choose to start with any adjacent row with a large x value. This ensures that we get as far as possible on the K line.
V[ 1 ] = 0;
for ( int d = 0 ; d <= N + M ; d++ )// Loop around, how many times is the number of snake and V arrays
{
for ( int k = -d ; k <= d ; k += 2 )/ / inner loop
{
// down or right?
booldown = ( k == -d || ( k ! = d && V[ k -1 ] < V[ k + 1]));int kPrev = down ? k + 1 : k - 1;// If it goes down, then the previous step should be k+1
// start point
int xStart = V[ kPrev ];//v[k]=x
int yStart = xStart - kPrev;//y=x-k
// mid point, mid point
int xMid = down ? xStart : xStart + 1;
int yMid = xMid - k;
// end point
int xEnd = xMid;
int yEnd = yMid;
// follow diagonal
int snake = 0;// Whether there is a diagonal to continue to the main point
while ( xEnd < N && yEnd < M && A[ xEnd ] == B[ yEnd ] )
{
xEnd++; yEnd++; snake++;
}
// save end point
V[ k ] = xEnd;
// check for solution
if ( xEnd >= N && yEnd >= M ) /* solution has been found */}}Copy the code
The code above looks for a Snake that has reached its destination. Since V array stores the coordinates of the latest endpoint of K line, in order to find all snakes, we iterate from D (Solution) to 0 after each loop of D. As follows:
IList<V> Vs; // saved V's indexed on d
IList<Snake> snakes; // list to hold solution
// From the back, the last snake is the route you must take to reach the end.
POINT p = new POINT( N, M ); // start at the end
for ( int d = vs.Count - 1 ; p.X > 0 || p.Y > 0 ; d-- )
{
var V = Vs[ d ];// Data generated by the inner loop
int k = p.X - p.Y;// This loop starts with line K
// end point is in V
int xEnd = V[ k ];
int yEnd = x - k;
// down or right?
booldown = ( k == -d || ( k ! = d && V[ k -1 ] < V[ k + 1]));int kPrev = down ? k + 1 : k - 1;// Last snake's k line
// start point
int xStart = V[ kPrev ];
int yStart = xStart - kPrev;
// mid point
int xMid = down ? xStart : xStart + 1;/ / intermediate point
int yMid = xMid - k;
snakes.Insert( 0.new Snake( /* start, mid and end points */)); p.X = xStart; p.Y = yStart; }Copy the code
Time complexity: Expected O(M+N+D^2), worst case O((M+N)D).
For those interested, read Myers’ Diff’s article on Linear Spatial refinement.
Algorithm practice: DiffUtil and its difference algorithm
Myers Diff Alogrithm: Part 1 Myers Diff Alogrithm: Part 2
The article here is all about the end, if there are other need to exchange can leave a message oh ~! ~!
To read more articles by the author, check out my personal blog and public account: