As is known to all, many communities are content audit mechanism, in addition to the published for the first time, subsequent modifications also needs to review, the most brutal way, of course, is from the beginning again, but the editor must want to kill you, obviously this efficiency is lower, such as a change to a wrong character, then look at a few times may also look not to come out, so if we can know what each time change, Just like git’s diff, it is much more convenient, and this article will simply implement one.

An algorithm problem

To see the difference between two pieces of text, we can figure out their common content first, and the rest is deleted or added. Longest common subsequence 1143. Longest common subsequence

Dynamic programming is more like reasoning, you can solve it from the top down using recursion, or you can do it from the bottom up using a for loop, which usually uses a memo called DP to store information, depending on the problem, We define dp[I][j] as the longest common length of a subsequence from 0 to I of text1 and from 0 to j of text2. Then we need to consider the boundary case. First, when I is 0, text1’s substring is empty. So the length of the longest common subsequence is 0 no matter what j is, and the same is true if j is 0, so we can initialize a dp array with all initial values of 0:

let longestCommonSubsequence = function (text1, text2) {
    let m = text1.length
    let n = text2.length
    let dp = new Array(m + 1)
    dp.forEach((item, index) = > {
        dp[index] = new Array(n + 1).fill(0)})}Copy the code

When neither I nor j is 0, there are several cases:

Dp [I][j] = 1 + dp[I][j-1]; dp[I] = 1 + dp[I][j-1]; dp[I] = 1 + dp[I][j-1]; dp[I] = 1 + dp[I][j-1];

2. When Text1 [i-1]! == text2[j-1], it is clear that dp[I][j] completely depends on the previous situation, there are three kinds: Dp [i-1][j-1], dp[I][j-1], dp[i-1][j], dp[I][j-1], dp[i-1][j], but the first case can be excluded, because it is obviously not longer than the latter two cases, because the latter two cases have one more character than the first one, so the length may be 1 more, so we can take the best value of the latter two cases.

Then we just need a double loop to iterate through all the cases of the two-dimensional array:

let longestCommonSubsequence = function (text1, text2) {
    let m = text1.length
    let n = text2.length
    // Initializes a two-dimensional array
    let dp = new Array(m + 1).fill(0)
    dp.forEach((item, index) = > {
        dp[index] = new Array(n + 1).fill(0)})for(let i = 1; i <= m; i++) {
        // Since I and j both start at 1, we have to subtract 1
        let t1 = text1[i - 1]
        for(let j = 1; j <= n; j++) {
            let t2 = text2[j - 1]
            / / 1
            if (t1 === t2) {
                dp[i][j] = 1 + dp[i - 1][j - 1]}else {/ / 2
                dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1])}}}}Copy the code

The value of dp[m][n] is the length of the longest common subsequence, but knowing the length doesn’t help. We need to know exactly which positions, so we need to recurse again. Why can’t we collect positions in the t1 === T2 branch of the above loop, because all positions of the two strings are compared in pairs? There is duplication when there are multiple identical characters, as in the following:

We define a collect function to recursively judge whether the positions I and j are in the sequence of the eldest eldest. For example, for the positions I and J, if text1[i-1] === Text2 [J-1], then it is obvious that the two positions are in the sequence of the eldest eldest, then we only need to judge the positions i-1 and J-1. Similarly, if the current position is different, we can use the dp array to determine, because we already know the value of the entire DP array:

For example, dp[i-1] > dp[j], then the next thing to judge is i-1 and j. Otherwise, to judge the positions of I and j-1, the recursive end condition is that one of the positions of I and j has reached 0:

let arr1 = []
let arr2 = []
let collect = function (dp, text1, text2, i, j) {
    if (i <= 0 || j <= 0) {
        return
    }
    if (text1[i - 1] === text2[j - 1]) {
        // Collect the index of the same character in two strings
        arr1.push(i - 1)
        arr2.push(j - 1)
        return collect(dp, text1, text2, i - 1, j - 1)}else {
        if (dp[i][j - 1] > dp[i - 1][j]) {
            return collect(dp, text1, text2, i, j - 1)}else {
            return collect(dp, text1, text2, i - 1, j)
        }
    }
}
Copy the code

The results are as follows:

If you don’t like it, you can order it in reverse order:

arr1.sort((a, b) = > {
    return a - b
});
arr2.sort((a, b) = > {
    return a - b
});
Copy the code

We still need to calculate the deleted and added positions based on the sequence of the largest number of characters. This is a simple way of iterating through the two strings. Characters that are not in arr1 and ARR2 are deleted or added:

let getDiffList = (text1, text2, arr1, arr2) = > {
    let delList = []
    let addList = []
    // Iterate over the old string
    for (let i = 0; i < text1.length; i++) {
        // Characters representing positions in the old string that are not in the common subsequence are deleted
        if(! arr1.includes(i)) { delList.push(i) } }// Iterate over the new string
    for (let i = 0; i < text2.length; i++) {
        // Characters that are not in the common subsequence of the new string represent new characters
        if(! arr2.includes(i)) { addList.push(i) } }return {
        delList,
        addList
    }
}
Copy the code

Note deletions and additions

The common subsequence and the newly deleted index are known, so it can be marked out, for example, with a red background for the deleted index and a green background for the new index, so that we can see at a glance what has changed.

To keep things simple, let’s display both additions and deletions in the same paragraph, like this:

Suppose there is need to compare two pieces of text, every piece of text internal \ n space to wrap, we first put them into an array, and then in turn two comparison, if the old and the new text equal added directly into the array display, otherwise, we in the new text on the basis of the operation, if the characters are added at a certain position to give it a new label, The deleted character also finds its place in the new text and wraps a label around it before inserting it. The template part looks like this:

<template>
  <div class="content">
    <div class="row" v-for="(item, index) in showTextArr" :key="index">
      <span class="rowIndex">{{ index + 1 }}</span>
      <span class="rowText" v-html="item"></span>
    </div>
  </div>
</template>
Copy the code

Then make a pairwise comparison:

export default {
    data () {
        return {
            oldTextArr: [].newTextArr: [].showTextArr: []
        }
    },
    mounted () {
        this.diff()
    },
    methods: {
        diff () {
            // New and old text is split into arrays
            this.oldTextArr = oldText.split(/\n+/g)
            this.newTextArr = newText.split(/\n+/g)
            let len = this.newTextArr.length
            for (let row = 0; row < len; row++) {
                // If the old and new text are exactly the same, there is no need to compare
                if (this.oldTextArr[row] === this.newTextArr[row]) {
                    this.showTextArr.push(this.newTextArr[row])
                    continue
                }
                // Otherwise compute the position of the longest common subsequence of old and new text
                let [arr1, arr2] = longestCommonSubsequence(
                    this.oldTextArr[row],
                    this.newTextArr[row]
                )
                // Perform annotation operations
                this.mark(row, arr1, arr2)
            }
        }
    }
}
Copy the code

The mark method is used to generate the final string with the difference information. The above getDiffList method is first used to obtain the deleted and newly added index information. Since we are based on the new text, the operation of newly added index is relatively simple. Concatenate the characters of the tag element before and after:

/* oldArr: the longest common subsequence index array of old text newArr: the longest common subsequence index array of new text */
mark (row, oldArr, newArr) {
    let oldText = this.oldTextArr[row];
    let newText = this.newTextArr[row];
    // Get deleted and added location indexes
    let { delList, addList } = getDiffList(
        oldText,
        newText,
        oldArr,
        newArr
    );
    // Because the span tag will also occupy space, our new index will be offset, we need to subtract the length of the tag to correct
    let addTagLength = 0;
    // Iterate over the new position array
    addList.forEach((index) = > {
        let pos = index + addTagLength;
        // Intercepts the string before the current position
        let pre = newText.slice(0, pos);
        // Cut the following string
        let post = newText.slice(pos + 1);
        newText = pre + `<span class="add">${newText[pos]}</span>` + post;
        addTagLength += 25;//  length is 25
    });
    this.showTextArr.push(newText);
}
Copy the code

The effect is as follows:

Deleting is a little bit more difficult, because obviously the deleted character doesn’t exist in the new text, so we need to figure out where it would have been if it hadn’t been deleted, and then insert it back in here. Let’s draw a picture:

We can find the index of the character before it in the new list by using the longest common subsequence, so it is obvious that the index is the position of the deleted character in the new string:

Write a function to get the index of the deleted character in the new text:

getDelIndexInNewTextIndex (index, oldArr, newArr) {
    for (let i = oldArr.length - 1; i >= 0; i--) {
        if (index > oldArr[i]) {
            return newArr[i] + 1; }}return 0; }}Copy the code

The next step is to calculate the exact position in the string. For the flash, its position is computed as follows:

mark (row, oldArr, newArr) {
    // ...

    // Iterate over the deleted index array
    delList.forEach((index) = > {
        let newIndex = this.getDelIndexInNewTextIndex(index, oldArr, newArr);
        // The number of characters added previously
        let addLength = addList.filter((item) = > {
            return item < newIndex;
        }).length;
        // The number of characters before it does not change
        let noChangeLength = newArr.filter((item) = > {
            return item < newIndex;
        }).length;
        let pos = addLength * 26 + noChangeLength;
        let pre = newText.slice(0, pos);
        let post = newText.slice(pos);
        newText = pre + `<span class="del">${oldText[index]}</span>` + post;
    });

    this.showTextArr.push(newText);
}
Copy the code

Go to the position of the flash here to know, look at the effect:

As you can see, there is a mess behind it, for the simple reason that for the crystal, the position of the newly inserted flash is not added:

// The position of the inserted character
let insertStrLength = 0;
delList.forEach((index) = > {
    let newIndex = this.getDelIndexInNewTextIndex(index, oldArr, newArr);
    let addLength = addList.filter((item) = > {
        return item < newIndex;
    }).length;
    let noChangeLength = newArr.filter((item) = > {
        return item < newIndex;
    }).length;
    // Add the total length of the newly inserted characters
    let pos = insertStrLength + addLength * 26 + noChangeLength;
    let pre = newText.slice(0, pos);
    let post = newText.slice(pos);
    newText = pre + `<span class="del">${oldText[index]}</span>` + post;
    // x
    insertStrLength += 26;
});
Copy the code

This is where our sloppy diff tool completes:

Existing problems

If I delete a whole line, or add a whole new line, the number of old and new lines is not the same. Fix the diff function:

diff () {
    this.oldTextArr = oldText.split(/\n+/g);
    this.newTextArr = newText.split(/\n+/g);
    // If the number of new and old lines is different, use an empty string to complete them
    let oldTextArrLen = this.oldTextArr.length;
    let newTextArrLen = this.newTextArr.length;
    let diffRow = Math.abs(oldTextArrLen - newTextArrLen);
    if (diffRow > 0) {
        let fixArr = oldTextArrLen > newTextArrLen ? this.newTextArr : this.oldTextArr;
        for (let i = 0; i < diffRow; i++) {
            fixArr.push(' '); }}// ...
}
Copy the code

If we add or delete the last line, then it doesn’t matter:

However, if a middle line is added or deleted, all diff after that line is meaningless:

The reason is very simple, to delete a row, lead to the two contrast just behind the stagger, to do this, a kind of idea is to be removed through found a line or a line is added, and then modified compared to the number of rows, another method is no longer each line separate diff, but diff directly the whole text, so that you can’t delete the new line.

Diff the whole text by deleting the logic of line breaks:

diff () {
    this.oldTextArr = [oldText];// .split(/\n+/g);
    this.newTextArr = [newText];// .split(/\n+/g);
    // ...
}
Copy the code

That looks like it, so let’s increase the number of words in the text:

Sure enough, our simple algorithm for the longest common subsequence can’t handle too much text, either because the DP array takes up too much space, or because the recursive algorithm has too many layers and runs out of memory.

For algorithmic crappy writers, it’s hard to figure this out, so what to do is to use the power of open source, and that’s it: diff-match-patch.

The diff – match – patch library

Diff-match-patch is a high-performance library for manipulating text. It supports a variety of programming languages. In addition to calculating the difference between two texts, it can also be used for fuzzy matching and patching, as the name indicates.

Use is very simple, we first introduce it, import way to introduce the need to modify the source file, source code is the default class to hang on the global environment, we have to manually export the class, and then new an instance, call diff method can:

import diff_match_patch from './diff_match_patch_uncompressed';

const dmp = new diff_match_patch();

diffAll () {
    let diffList = dmp.diff_main(oldText, newText);
    console.log(diffList);
}
Copy the code

The returned result looks like this:

0 means no difference, 1 means new, and -1 means deleted. We simply iterate through the array to concatenate the strings. It’s very simple:

diffAll () {
    let diffList = dmp.diff_main(oldText, newText);
    let htmlStr = ' ';
    diffList.forEach((item) = > {
        switch (item[0]) {
            case 0:
                htmlStr += item[1];
                break;
            case 1:
                htmlStr += `<span class="add">${item[1]}</span>`;
                break;
            case -1:
                htmlStr += `<span class="del">${item[1]}</span>`;
                break;
            default:
                break; }});this.showTextArr = htmlStr.split(/\n+/);
}
Copy the code

The measured time of 21,432 character diff is about 4ms, which is still very fast.

Well, after the editor can be happy to touch fish ~

conclusion

In this paper, a simple algorithm [to find the longest common subsequence] is done, and an analysis of its practical use in the text diff, but our simple algorithm can not support the actual project, so if you have relevant requirements can use an open source library introduced in the paper.

Full example code: github.com/wanglin2/te…