@[TOC] recently, due to the project needs to implement a text comparison function, the natural thought of Git text comparison function, so I looked up some information on the Internet, saw a keyword (the longest common subsequence), feel back to the time of the university swiping.

Longest common subsequence

Quote from LeetCode problem 1143

Given two strings text1 and text2, return the length of the longest common subsequence of the two strings.

A subsequence of a string is a new string made up of the original string without changing the relative order of the characters by deleting some characters (or none). For example, “ace” is a subsequence of “ABCDE”, but “AEC” is not a subsequence of “ABCDE”. A “common subsequence” of two strings is a subsequence that is shared by both strings.

Returns 0 if the two strings have no common subsequence.

Example 1:

Input: text1 = “abcde”, text2 = “ace” Output: 3 Explanation: The longest common subsequence is “ace”, which has length 3. Example 2:

Input: text1 = “ABC “, text2 =” ABC “Output: 3 Explanation: The longest common subsequence is” ABC “, which has length 3. Example 3:

Input: text1 = “ABC “, text2 = “def” Output: 0 Explanation: The two strings have no common subsequence, return 0.

Tip:

1 <= text1.length <= 1000 1 <= text2.length <= 1000 The entered character string contains only lowercase English characters.

The longest common subsequence is the longest common character in two strings that are contiguous but not contiguous

Solve for the longest common subsequence

The problem of the longest common subsequence can be solved using dp (dynamic programming), but using dynamic programming requires the determination of the state transition equation. First we assume that the strings we need to solve are sda3b225CWSadbs and ass3BCDSSD2CPSeKLD998L, where their common substring is S3b2CSd

Determine the state transition equation

We can use a simpler example to derive the state transition equation for example 1a2b3CDEF, asDCdrF substring is acDF first let’s suppose that there is now a way to compute the longest common subsequence, let’s call it LCS (m,n) then LCS (1,””) = 0, The LCS (” “, a) = 0, LCS (1,a) =0, LCS (1a,a)=1, LCS (1,as) =0, LCS (1a,as) =1 from the derivation above we know that if we need to find the longest common subsequence of strings 1A and AS then we need to know string 1,as and string 1a, The longest common subsequence of A. If LCS (x,y) =0 and LCS (“”,y) =0 then LCS (x,y) can be solved: If x = y LCS (x, y) = Max (LCS (x, “”), the LCS (” “, y)) + 1 else LCS (x, y) = Max (LCS (x,” “), the LCS (” “, y)) according to the state transition equation above we can easily write the corresponding code

 public LcsResult lcs(List<String> source, List<String> target){
        int[][] dp = new int[source.size()+1][target.size()+1];
        int max = 0;
        for (int i = 1; i < (source.size() + 1); i++) {
            for (int j = 1; j < (target.size() + 1); j++) {
                String str1 = source.get(i-1);
                String str2 = target.get(j-1);
                if(str1.equals(str2)){
                    int temp =dp[i][j] =dp[i-1][j-1] +1;
                    max = Math.max(max,temp);
                }else{
                    dp[i][j] = Math.max(dp[i-1][j],dp[i][j-1]); }}}return findStr(dp,target);
    }
Copy the code

How do I find the longest common subsequence

In the above dynamic programming code we have obtained a state matrix as shown below:Through dynamic planning matrix to find the longest subsequence, we only need to pass back dp array can identify subsequence We prior to starting from the last position, respectively, whether on the left and top with the current position is equal, if equal or move up to the left one, if not equal then this is the character, we need to record down, Then a move to the left and up at the same time But we in through the back of the longest common subsequence will encounter many problem, if the left and above all with the current position is equal to do, either to the left at this time can also be upward, but the result could be different The image below as a result of using the default upward, The resulting subsequence is:acdf The code is as follows:

 private LcsResult  findStr(int[][] dp, List<String> target) {
        LcsResult lcsResult = new LcsResult();
        lcsResult.setCommonString(new ArrayList<>());
        lcsResult.setSourceIndex(new ArrayList<>());
        lcsResult.setTargetIndex(new ArrayList<>());
        int i=dp.length-1;

        for (int j = dp[i].length-1; j > 0 && i>0;) {
            int k = dp[i][j];
            int up = dp[i-1][j];
            int left = dp[i][j-1];
            if(up==k ){
                i--;
            } else if(left==k) {
                j--;
            }else{
                int targetIndex= j-1;
                int sourceIndex = i-1;
                lcsResult.getSourceIndex().add(sourceIndex);
                lcsResult.getTargetIndex().add(targetIndex);
                lcsResult.getCommonString().add(target.get(targetIndex));
                i--;
                j--;
            }
        }
        Collections.reverse(lcsResult.getCommonString());
        Collections.reverse(lcsResult.getSourceIndex());
        Collections.reverse(lcsResult.getTargetIndex());
        return lcsResult;
    }

Copy the code

How to implement text alignment

So once we have the longest common subsequence so how do we compare these two strings now we have str1=1a2b3cdef

str2= asdcdrfThe common subsequence isacdfIn the figure below, we compare the common substring. As we can see, we can easily distinguish that a new string a has been added to str1, while 2B3 has been changed to SD and E has been changed to FFrom the above comparison, we just need to replace the element with a string to perform text comparison, as shown below

 /** * compare text, if there are multiple strings between two equal strings, and the number is not equal and neither is 0, then it is considered that the comparison is not accurate * then in the following markup process does not need to compare word for word * use the header first comparison *@paramSource Source data *@paramTarget Compares data *@returnText alignment result */
    @Override
    public List<CompareResult> compare(List<String> source, List<String> target) {
        LCS lcs = new LCS();
        LcsResult lcsResult = lcs.lcs(source, target);
        return compare(lcsResult,source,target);
    }

    /** * compare text, if there are multiple strings between two equal strings, and the number is not equal and neither is 0, then it is considered that the comparison is not accurate * then in the following markup process does not need to compare word for word * use the header first comparison *@paramLcsResult The longest common subsequence result *@paramSource Source data *@paramTarget Compares data */
    public List<CompareResult> compare(LcsResult lcsResult, List<String> source, List<String> target) {
        List<CompareResult> res = new ArrayList<>();
        int lastSourceIndex = 0;
        int lastTargetIndex = 0;

        for (int i = 0; i < lcsResult.getCommonString().size(); i++) {
            int sourceIndex = lcsResult.getSourceIndex().get(i);
            int targetIndex = lcsResult.getTargetIndex().get(i);
            List<String> sourceTemp = source.subList(lastSourceIndex,sourceIndex);
            List<String> targetTemp = target.subList(lastTargetIndex,targetIndex);
            compareLeftAndRight(sourceTemp,targetTemp,res);
            lastSourceIndex = sourceIndex+1;
            lastTargetIndex = targetIndex+1;
            res.add(new CompareResult(CompareResult.RESULT_EQUAL,source.get(sourceIndex),target.get(targetIndex),true));
        }
        List<String> sourceTemp = lastSourceIndex>=source.size()?new ArrayList<>():source.subList(lastSourceIndex,source.size());
        List<String> targetTemp = lastTargetIndex>=target.size()?new ArrayList<>():target.subList(lastTargetIndex,target.size());
        compareLeftAndRight(sourceTemp,targetTemp,res);
        return res;
    }

    private void compareLeftAndRight(List<String> sourceTemp, List<String> targetTemp, List<CompareResult> res) {

        if(CollectionUtils.isEmpty(sourceTemp)){
            targetTemp.forEach(item-> res.add(new CompareResult(CompareResult.RESULT_INSERT,"",item,true)));
        }else if(CollectionUtils.isEmpty(targetTemp)){
            sourceTemp.forEach(item-> res.add(new CompareResult(CompareResult.RESULT_DELETE,item,"".true)));
        }else if(targetTemp.size()==sourceTemp.size()){
            for (int k = 0; k < targetTemp.size(); k++) {
                res.add(new CompareResult(CompareResult.RESULT_CHANGE,sourceTemp.get(k),targetTemp.get(k),true)); }}else if(sourceTemp.size()>targetTemp.size()){
            for (int k = 0; k < sourceTemp.size(); k++) {
                res.add(newCompareResult(k>=targetTemp.size()? CompareResult.RESULT_DELETE:CompareResult.RESULT_CHANGE, sourceTemp.get(k),k>=targetTemp.size()?"":targetTemp.get(k),false)); }}else{
            for (int k = 0; k < targetTemp.size(); k++) {
                res.add(newCompareResult(k>=sourceTemp.size()? CompareResult.RESULT_INSERT:CompareResult.RESULT_CHANGE, k>=sourceTemp.size()?"":sourceTemp.get(k),targetTemp.get(k),false)); }}}Copy the code

CompareResult

@Data
@AllArgsConstructor
public class CompareResult implements Serializable {
    public static final String RESULT_EQUAL = "EQUAL";
    public static final String RESULT_INSERT = "INSERT";
    public static final String RESULT_DELETE = "DELETE";
    public static final String RESULT_CHANGE = "CHANGE"; /** * private String tag; /** * oldText */ private String oldText; /** * newText */ private String newText; */ private Boolean isNeedCheckDetail; }Copy the code

Compare the effect drawing

Of course, the effect map is the text comparison of Word, which has a large part of the code on Word, will not be posted, the core idea is the text comparison above

Reference article:

How is git/ Linux-like file comparison (DIff) implemented?

Longest common substring