Why isn't there a perfect JSONDIFF tool?

The background is that the department is developing a traffic playback tool recently. Traffic playback simply means recording the original request information and response information online and playing it back in the environment to check whether the same response results are obtained for the two same requests. If the response results are different, So this tool needs to be able to tell us where the two results are different. Because most services within the department provide interfaces externally through HTTP + JsonResponse, the comparison of response results can be interpreted as comparison of two JSON structured strings.

So thriving today in the open source community, of course, most of the students must be a lot harder if received this requirement first reaction, all open source libraries always have a meet your mind, of course, I also think so at the time of receiving the demand, but the reality is often cruel, you never know how much you don’t know your target users demand.

This article aims to share the learning principle of JsonDiff and how to meet the different needs of various users on JsonDiff.

String matching

So how do you prove that two JSON files are the same?

If this is an OJ problem, then there are many ways to ask the question, and the easiest way to ask it is

Given two jSON-formatted strings of type string, return true if they are equal and false if they are not. (You can assume that the values of type Number in the test cases are all INTEGERS.)

The easiest way to do this, of course, is to compare strings, and perhaps you can take a line of code as the answer to this question, but of course you can’t say that the answer is completely wrong, because objectively it might actually pass some scoring test cases. You submit the following code and decide to test the test case.

func Compare(json0,json1 string)bool{   return json0 == json1}
Copy the code

Of course, your OJ experience will tell you that test cases are not as benign as you think. For example, it must not pass the following test cases.

param0:{"a":1}

param1:{ "a":1}

The difference is that the key a field is preceded by an unknown space, which is not nice, but you have to admit that this is syntactic JSON.

After painfully and politely greeting the test case author, you decide to wash the test case over and over again, open the ASCII code table, delete all the interference from the first thirty-two control characters that you think the author might add to the test case, and press Submit with a vicious click. You think you’re going to see it pass this time, but three seconds later there’s still a solution error on the screen, and you sigh and open the test case, and the guys who screwed you up look like this

param0:{"a\t":1}

param1:{"a":1}

It turns out that you deleted control characters that affect json semantics. You’re wondering why someone would add control characters to a key or string value, and you’re scratching the back of your head. You learn smarter this time, you match all the contents of the double quotes, and witty survived “, to ensure the control of the operator to delete not friendly forces, after you silently think this click on the Submit will see through, staring at the a-share market during the day you wouldn’t think of himself so hope someday green to appear on the screen. But you’re still wrong. The test case looks like this.

param0:{"a":1,"b":2}

param1:{"b":2,"a":1}

A sentence comes to your mind that a JSON object is an unordered set of key and value pairs. You realize that a JSON object is not different depending on the order of the key. This time you decide to go ahead and parse the JSON string. You know the Reflect.deepequal thing is in the Golang bag. Click Submit, this time you finally green, you stand on the shoulders of giants defeated the funny question maker.

You know that you stand on the shoulders of giants, and you believe that you have the ability to draw inferences. So you open a similar topic and confidently click on the related topic marked Hard.

2. Json tree analysis

Given two JSON strings of type string (you can assume that the Value of type Number in the test case is both Integer), this time if the two Jsons are different, you need a data structure to describe how they are different, and if they are the same, you can return NULL.

The giant throws you over his shoulder, and you know DeepEqual can’t help you, so this time you have to go alone.

You take a long drag on your cigarette and decide to calm down and analyze the familiar and unfamiliar data structure of JSON. You discover that JSON is a special tree-like data structure, especially because the nodes of this tree have the concept of “types” : Object, Array, Number, String, Boolean, In addition, not every type of node has child nodes, only Array and Object nodes have child nodes, and the clue of the parent node to the child node is different from the left and right subtrees of the binary tree. The index of an Object parent node to its child node is a string key, and the index of an Array parent node to its child node is an int index. The following diagram illustrates your understanding of JSON.

You notice that this problem has some of the same trees as the one you have swiped before, but of course they are more similar if the problem wants the answers to give detailed diff. You summed up the approach to the problem in one sentence.

The values of the two nodes are first compared, and then the left and right child nodes of the two nodes are respectively compared recursively in the same way. Of course, if you need to give a detailed DIFF, you need to pass in an additional tree type data structure that describes the DIFF.

So you see that they also need to recursively compare individual nodes. The difference is that the value type of the node value needs to be compared before the comparison of the two nodes. If the value type is different, of course, there is no significance to compare the values. You also find that if you want to compare two current nodes with values of Object or array, you need to recursively compare their children. Different from “same tree”, the index is no longer Left and Right, but needs to be discussed according to the Value type of the current node. If Value is Object, then you match children with the same key, recursively, for the next comparison. If Value is array, then you match children of the same Index and recurse again for the next comparison.

Of course, in order to give a detailed diff, you also define the data types of different diff, respectively

Added: Records redundant object/array nodes in the new JSON that cannot be matched.
Deleted: Indicates the number of object/array nodes that are not matched due to a lack of deletion.
Modified: When one of the two node types is the basic type, the two nodes are recorded because the value type is different or the value type is the same, but the specific value is not the same.
Object: Records the current Object node of diff in the child node. Since it needs to describe the differences between the child nodes in more detail, it needs the child diff type field to describe the diff in more detail. This diff type includes all the types in this list.
Array: Records the current Array node of the diff that exists in the child node. Since he needs to describe the differences between the child nodes in more detail, he needs the child diff type field to describe the diff in more detail. This diff type includes all the types in this list.

You draw a graph that shows you an understanding of this diff data structure

You finally solved the problem, looking at the full keyboard drop hair, do you think you stronger, what do you think of your understanding of jsondiff very in place, in the face of all jsondiff you can solve the business scenario, you decided to formally began to face the flow of real users demand, the second day you confident to find leaders say, “this requirement, I just need two days.”

Two days later, the feature goes live, and you’re confident that the code you’ve written will be all right, and if it is, it’s the world’s problem.

Three. Introduce noise

Two hours later, The first user of your new feature, Xiao Zhao, comes to you and describes his confusion.

“Although the trace_id field in my recorded response is different from that in the replay response, you don’t need to tell me, because I don’t need to know.”

Well, trace_id is a field that doesn’t need to be compared to a response request. Watching Zhao replay 100,000 requests, it’s hard to convince him to ignore every record with a diff count of 1 by eye. Also because of the damn trace_id field, Zhao has no way to properly use your carefully designed look at different functions. You design this feature so that users can actually see that there are different records that they should care about, but it also makes fun of its designers when the differences between the different records become meaningless.

When you come home that night and eat your fried chicken snack, you think that the fried chicken isn’t as good as it used to be. You suddenly have an Epiphany. You can add a concept called noise to ignore some diff that exists objectively but the user doesn’t need to care about. And thanks to your design of the Diff tree, you naturally think that this “noise” can be described by jsonPath. So after you generate a JsonDiff tree, you find the diff node that the user does not need to care about through jsonPath, delete it, and recursively delete the parent node that does not need to exist because the child node is missing. The picture describes the process as follows.

The next day, with black eyes, I explained to Xiao Zhao what noise is. Although Xiao Zhao didn’t understand, he knew that only a jsonPath was needed to ignore the damn trace_id. So you spent a whole morning teaching him to match the jsonPath with trace_id. Well, that’s settled. At this time, you feel that you are an expert in the field of Jsondiff, and no other strange problems can trouble you. You decide to take a nap to make up for the sleep lost the previous night. In your dream, you see your boss coming towards you kindly, and you think he must give you a raise. At this time, Xiao Zhao suddenly woke you up and told you that the code you wrote was wrong again.

Array matching strategy based on LCS

This time Zhao’s original record and playback record look like this.

Old:,1,2,3 [0]

New: [1, 2, 3]

He saw Diff like this.

(Orange background means different, pink background means delete, green background means new)

This is what he wants to see in Diff.

According to your design, the diff tree of two JSON alignments should look like this.

You think there is nothing wrong with this result, which is completely consistent with your design. In your original design, the value alignment of array types is matched according to the same index during the matching phase, and you think there is nothing wrong with this. However, zhao says that the display result is meaningless to him. He wants the result to tell him that 0 is the extra value in the left array and the other values match exactly.

Do you think of the users demand is reasonable, xiao zhao xiao zhao although care about the order, but he CARES about is not the absolute order, relative order, this world is not the problem, is your design out of the question, so you decided to solve this problem, years of experience in brush problem to tell you, this problem is very similar with famous LCS, And you did brush the longest common subsequence of this problem one day.

Since you have the DP table of the LCS, you can find matching pairs between the old and new arrays based on the growing trend of the DP table. The remaining elements/nodes that do not match are counted as Added diff if they are in the new table. If they are in the old table, they are counted as Added diff if they are in the old table. You record it as a Deleted type diff, as follows:

In this example, you find the matching pairs (0,1),(1,2), and (2,3) in the old and new arrays, so you record as Deleted the elements in the old array that do not appear in the matching pairs, and as Added the elements in the new array that do not appear in the matching pairs.

You sneaked online, but you didn’t tell Zhao that you had solved the problem because you wanted to sleep well tonight. But you did not expect, Xiao Zhao secretly have to pay attention to your GitLab, you just online this function, Xiao Zhao on the first time to use, and tell you, your function and problems! This time the problem json looks like this:

Old: [1, 2, 3, 4]

New:,3,2,1 [4]

According to your LCS strategy for array nodes, only one element/node can be matched in this example, and the remaining elements are diff of Added/Deleted type. However, in this example, Zhao does not care about the order of the new and old nodes. He hopes that as long as the new and old nodes can be matched, they should be matched. In his expectation, the two json arrays should match exactly, and you know it’s going to be another sleepless night because you’re a man who won’t give up.

Five. Double traverse the matching strategy of the number group

So you think, it’s time to create a new Array alignment strategy.

You want to set up a Map by taking all the elements in one array as the Key and traversing each element in the other array to see if the current element exists in the Map. Based on the characteristics of hash table, the time complexity of single search can be O(1), the total time complexity can be O(n+m), and the space complexity is O(MIN(n,m)).

But when you run the test case, the test case tells you that the elements in an Array don’t have to be simple types. If there are elements of type Object or Array that are parsed in Golang as map[string]interface{} and []interface{}, respectively, Because array types and map types are special types of Unhashable, the operators == and! Are also not supported. =, so it cannot be used as a Map Key. So, unfortunately, you can only use it

A time-complex scheme to iterate over two arrays, picking out elements that match exactly, and recording their positions. Elements in the old and new JSON whose location is not recorded are recorded as Added and Deleted respectively.

And you provide a mechanism that allows the user to choose whether to match by relative order or sequence-independent, depending on the array’s JSONPath. Although the solution to this problem is three o ‘clock in the morning, but this time you feel you really win, you imagine xiao Zhao pick no fault of the helpless expression, you feel lost sleep is worth it.

Array matching strategy based on similarity + LCS

Another day passed, you popularized the array strategy comparison method to Xiao Zhao, and finally solved the problem of the previous afternoon. You think the rest of your life is quiet, but a few minutes later, Xiao Zhao comes again. This time xiao Zhao encountered the problem is this:

old:[{"a":1,"b":2},{"c":1,"d":0}]

new:[{"c":2,"d":0},{"e":3}]
Copy the code

In the two array alignment strategies designed by you, the comparison results of this case show that the new JSON does not match the old JSON completely. Specifically, it seems that two elements are Added and two elements are reduced in the new JSON compared with the old JSON. Corresponding to the Diff structure, there are two Added Diff elements. Two diff elements of type Deleted (figure below). And you think this is very reasonable, but Zhao’s requirement is that the index 1 in the old array should match the index 0 in the new array, you ask him why? He said because of the resemblance, a normal person would compare, at the moment you very much want to hit him, but reason tells you that his demand also has a certain sense, so you decide to accept his demand.

(status quo)

(expect)

You carefully analyze his problem, and you realize that you think the problem is simple. The root cause of the problem is that the elements in the array do not know how to locate the elements in the other array that should match it.

So you look up the meaning of image in xinhua Dictionary, and you decide to quantify image by abstracting the definition of image into similarity. You define the similarity as an 0-1 floating point number, where a similarity of 1 means they are exactly the same, and a similarity of 0 means they are completely different. You define the following similarity calculation method.

Added: 0
Does: 0
Modified: at least in the same position, so you give a base score of 0.3. If they are of the same value type, you give a comfort score of 0.3. Depending on how different the values are, you give the remaining 0.4.
Object: Divide the total number of fields by the total number of fields The sum (fields’ similarity)/count (fields), sum (fields’ similarity)/count (fields) sum (fields’ similarity)/count (fields)
Array: Divide the total number of fields by the total number of fields The sum (fields’ similarity)/count (fields), sum (fields’ similarity)/count (fields) sum (fields’ similarity)/count (fields)

You’re abstracting the matching of elements in the JSON array diff into the pursuit of maximum similarity, and each step of the matching decision in the two arrays is the pursuit of a greater value for the overall similarity of the array, so in order to record the similarity of each element in the old array compared to each element in the new array, It is natural to establish a similarity table scoreTable with a scale of N *m. ScoreTable [I][j] represents the similarity between the elements with subscript I in the old array and the elements with subscript J in the new array. You find that this problem can also be solved by dynamic programming. You define dp[I][j] as the sum of the maximum similarity of the first I elements of the old array and the first J elements of the new array. The state transition equation is dp[I][j] = MAX(scoreTable[I][j] + DP [i-1][j-1], DP [I][J-1], DP [i-1][j]). You find the matching pair with the best strategy and perform deep recursive matching on the elements in the matching pair. The old array elements that do not appear in the matching pair are marked as Deleted, and the new array elements that do not appear in the matching pair are marked as Added. Therefore, Xiao Zhao sees the matching result that meets the expectation and reluctantly returns to his job.

Until three days later, Zhao found you again and told you that your old strategy based on similarity + LCS could not adapt to him! He came up against a new problem.

Array matching strategy based on similarity + eight queens

This time xiao Zhao encountered the problem is the following such

(status quo)

(expect)

When you run the test cases of xiao zhao, step through step by step, see if you have a problem with my code, the results you found that your code is not the problem, the end result is in line with expectations, xiao zhao test cases + LCS array is according to the similarity matching strategy carried out correctly, then you have to ask xiao zhao modestly reasoning.

Zhao said that he did not care about the order in the array, and it was obvious in this test case that the elements with subscript 1 in the old array looked very similar to the elements with subscript 0 in the new array, so they should be matched and further compared without caring about the order.

He explained that the old array subscript 0 matches the new array subscript 1, the old array subscript 1 matches the new array subscript 0, and since the array matches are out of order, the diff in this case should be expressed in a variety of colors so that people can locate it at a glance.

You think Xiao Zhao is a little unreasonable this time, but you are a man who does not know how to say no, so you decide to support Xiao Zhao’s needs again.

You analyze two days this problem, what do you think can’t abandon the concept of similarity, because although there is no order request xiao zhao, but he still has the similarity requirements, so you decided to in decision-making phase matching algorithm for a change, you know this time is likely to be like before walking down group match (section 5) exhaustion every possibility. The previous requirement was to match only the elements that could be completely equal and discard all the elements that could not be completely equal. But this time, because of the concept of similarity, you need to enumerate the total score of each match pair, and record the maximum value and the array of matches that match it, so you draw the following table.

You see that this problem is very similar to eight Queens, so you name this matching method similarity + eight Queens. The difference is that after working backwards to figure out the probability of each queen position, you need to record their total score and eventually return the total score and the highest queen position probability. Of course, in the combination array with the total score and the largest selection, it is necessary to record the combination with the score of 0, that is, the combination that cannot match at all, as Added/Deleted. However, the disadvantage of this method is obviously too high time complexity. However, you have not come up with a better method for the time being, but it finally solved zhao’s problem once again. You think Xiao Zhao will never come to you again.

8. Open similarity algorithm

…

That’s right, he’s back again, he’s back with his test case, and this time his test case output looks like this.

(status quo)

(expect)

You look at zhao’s test case confused, you give the array matching decision to meet zhao’s requirements for array disordered matching, but also to meet zhao’s greatest demand for overall similarity, you don’t know where the problem is.

Xiaozhao watching you show a look of satisfaction, full of wonder, said leisurely, although the decision to some extent you provide is right, but in this case, he hopes the id field value of the corresponding element in the same match, because in this case the primary key of the object element in an array of ids, only id the same objects have a matching the needs of the alignment.

After analyzing this seemingly difficult problem, you think that the reason for this result is that the similarity calculation algorithm is too general. It may be able to meet most business scenarios, but a few business scenarios cannot be universal. So you decide to give the keyboard to Zhao, let Zhao to achieve their own custom similarity calculation algorithm. Of course, you don’t trust zhao to modify your Git repository, and you also worry that zhao’s modification of the code for your own needs will affect other people’s needs. Therefore, you decide to ask Zhao to write the similarity calculation algorithm elsewhere, and ask him to define what JSON path array needs to use this special similarity calculation algorithm. When your matching logic is performed to the array diff of the path, the similarity calculation is replaced by a custom method. You may wish you hadn’t studied compiler principles in college, but there are plenty of abstract syntax tree packages out there that you can call on. So Zhao had the ability to define his own similarity algorithm. The code looks like this.

(a,b)=>a.id==b.id? 0.999:0

9. Predefined similarity algorithm

Although the above approach may be able to solve Xiao Zhao’s problem, this solution seems to be too low level. Users need to know the implementation principle of your code to know how to write the code they want, which obviously has a high learning threshold for users. Therefore, on the basis of it, we can also abstract out some relatively high level methods and provide users with choices through configuration.

summary

This paper artistically processed the thinking of solving all kinds of user problems encountered in the development of jsonDiff function for flow playback in the department, and analyzed the problems. The story is completely fabricated.

To answer the question of the title, why isn’t there a perfect jsondiff tool? Since almost all jsondiff libraries in the open source community define jsondiff rules in favor of generalization, if you need to compare a small amount of JSON, even if the generalization tool does not meet all of your diff needs, you can still use a visual comparison method, which is certainly acceptable. However, the business scenarios we encounter are almost always thousands of JSON alignments, and at this level of alignment, visual alignment is obviously unacceptable. Therefore, our expectation of JsonDiff functionality is to meet as many user customization requirements as possible.

The solutions described in this paper have not yet been fully realized, part of the ideas refer to github.com/yudai/gojso… The author of GojsonDiff also found some bugs while redeveloping the gojsondiff library, issue: github.com/yudai/gojso…

Finally, if you have better solutions or problems, you are welcome to discuss with us

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Why isn’t there a perfect JSONDIFF tool?

String matching

2. Json tree analysis

Three. Introduce noise

Array matching strategy based on LCS

Five. Double traverse the matching strategy of the number group

Array matching strategy based on similarity + LCS

Array matching strategy based on similarity + eight queens

8. Open similarity algorithm

9. Predefined similarity algorithm

summary

Why isn’t there a perfect JSONDIFF tool?

String matching

2. Json tree analysis

Three. Introduce noise

Array matching strategy based on LCS

Five. Double traverse the matching strategy of the number group

Array matching strategy based on similarity + LCS

Array matching strategy based on similarity + eight queens

8. Open similarity algorithm

9. Predefined similarity algorithm

summary

Related Posts

How to Build Rails APIs Following the json:api Spec

Linux gets the UUID and serial number WWID of the disk

Python Image Processing OpenCV (13) : Scharr operator and LOG operator edge detection techniques