The simplest way to do it

In most cases, a large amount of duplicate text is not a good thing: copied news, mass text spam, advertising copy and the like homogenize online content, increase the storage burden of databases and, worse, lower the overall quality of the text. An accurate and efficient text deduplication algorithm is therefore needed. The simplest approach is to compare every text with every other text pairwise. It is easy to understand, matches human intuition, and is easy to implement for a small amount of text, but it clearly does not work for massive collections, because its time complexity is O(N^2); for a corpus at the billion scale, the deduplication could take years, so this road is blocked.
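As a concrete illustration, here is a minimal sketch of that brute-force approach (the `similar` function stands in for whatever pairwise similarity test is chosen; it is an assumption, not something defined by the original article). The nested loop over all kept candidates is exactly what makes it O(N^2) and impractical at the billion scale.

```python
def brute_force_dedup(texts, similar):
    """Keep one representative from every group of mutually similar texts.

    `similar(a, b)` stands in for whatever pairwise similarity test is chosen;
    the nested loop over all kept candidates is what makes this O(N^2).
    """
    kept = []
    for t in texts:
        if not any(similar(t, k) for k in kept):
            kept.append(t)
    return kept
```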

In addition, when we talk about deduplication, we actually imply two questions. The first is how to make the comparison more efficient, and the second is what criterion to compare against, which in the text domain means how to measure the similarity between two texts: edit distance, Jaccard distance, cosine distance, semantic distance and so on, with different measures chosen for different fields and scenarios. This is not the focus of this article, so the measures are not enumerated one by one here. The following focuses on how to make the comparison efficient.

core idea

The key to reducing the time complexity:

> Try to group potentially similar texts together, thus greatly narrowing the scope of the comparisons.

SimHash algorithm

SimHash is an algorithm proposed by Google and applied to its actual web page deduplication. Its main feature is that it maps a text to a string of 0s and 1s, and the strings generated for similar texts are themselves similar, differing only at a few positions. To express the similarity of the original texts, one can count the number of positions at which two 01 strings differ; this is exactly the Hamming distance, which under simHash is used to represent the similarity between two texts. Generally speaking, the more similar two texts are, the smaller the Hamming distance between their simHash values.
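As a rough sketch of this mapping (not Google's production implementation), a 64-bit simHash can be computed by hashing every feature of the text to 64 bits, accumulating a +weight or -weight per bit position, and taking the sign of each accumulator as the final bit. The whitespace tokenization, unit weights and truncated md5 hash below are simplifying assumptions for illustration only.

```python
import hashlib

def simhash64(text):
    """A minimal 64-bit simHash sketch: whitespace tokens, unit weights."""
    vector = [0] * 64
    for token in text.split():
        # 64-bit hash of the token (truncated md5, purely for illustration)
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << 64) - 1)
        for i in range(64):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(64):
        if vector[i] > 0:          # positive accumulator -> bit 1
            fingerprint |= 1 << i
    return fingerprint
```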

To make the process clearer, here’s a simple example.

T1 = "Mama called you home for dinner"
T2 = "Mama asked you home for dinner"

As you can see, although the two strings above differ by only one word, the hash values obtained from an ordinary hash algorithm could be completely different, so those hash values cannot be used to represent the similarity of the original texts. After mapping with the simHash algorithm, however, the resulting simHash values look like this:

SH1 = "1000010010101101[1]1111110000010101101000[0]00111110000100101[1]001011"
SH2 = "1000010010101101[0]1111110000010101101000[1]00111110000100101[0]001011"

If you look closely, the two simHash values above differ in only three places (the differences are marked by "[]"), so the Hamming distance between them is 3. The Hamming distance criterion generally used for similar-text detection is 3: when the Hamming distance between the simHash values of two texts is less than or equal to 3, the two texts are considered similar, and only one of them is kept when removing duplicates.
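Given two such 64-bit fingerprints, the Hamming-distance test is just an XOR plus a popcount. A minimal check against the threshold of 3 mentioned above could look like this:

```python
def hamming_distance(h1, h2):
    """Number of differing bit positions between two 64-bit fingerprints."""
    return bin(h1 ^ h2).count("1")

def is_similar(h1, h2, threshold=3):
    # Similar when the fingerprints differ in at most `threshold` bits
    return hamming_distance(h1, h2) <= threshold
```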

The deduplication process of the simHash algorithm is very simple, and starts from one key point:

> If the similarity criterion is a Hamming distance of 3, then for any two similar texts in the corpus to be deduplicated, their corresponding 64-bit simHash values differ in at most 3 positions. If the 64-bit hash value is divided, from high to low, into four consecutive 16-bit segments, the three differing positions can fall into at most three of the four segments (conversely, if all four segments contained a difference, the Hamming distance would be at least 4). In other words, two similar texts must be completely identical on at least one of the consecutive 16-bit segments.

With this key point understood, every text to be deduplicated can be mapped to a simHash value (64 bits in this article), and each 01 string is then divided from high to low into four segments. According to the discussion above, two similar texts must be identical on at least one of these segments. Continuing with the example above, the division into four segments looks like this:

```
T1 = "Mama called you home for dinner"
SH1 = "1000010010101101[1]1111110000010101101000[0]00111110000100101[1]001011"
SH1_1 = "1000010010101101"    # first segment
SH1_2 = "[1]111111000001010"  # second segment
SH1_3 = "1101000[0]00111110"  # third segment
SH1_4 = "000100101[1]001011"  # fourth segment

T2 = "Mama asked you home for dinner"
SH2 = "1000010010101101[0]1111110000010101101000[1]00111110000100101[0]001011"
SH2_1 = "1000010010101101"    # first segment
SH2_2 = "[0]111111000001010"  # second segment
SH2_3 = "1101000[1]00111110"  # third segment
SH2_4 = "000100101[0]001011"  # fourth segment
```

With this done, the next step is to build the index. As discussed above, each simHash value is divided from high to low into four segments of 16 bits each. When building the inverted index, each of these 16-bit fragments is used as an index key, and every text whose simHash contains that fragment at the corresponding position is added to the value list of that key. Intuitively, there are four groups of buckets numbered 1, 2, 3 and 4 (corresponding to the first, second, third and fourth segments of the 64-bit hash value), and within each group the buckets are numbered from 0000000000000000 to 1111111111111111 (2^16 of them). When building the index, after computing the simHash value of each text, examine each of its segments (determining which of groups 1, 2, 3 and 4 it belongs to) and place the text into the bucket identified by that segment's 16-bit value. Once the index is built, since two similar texts must share at least one 16-bit segment and therefore land in the same bucket, deduplicating within every bucket (which can be done in parallel) removes all similar texts from the text set.
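A minimal sketch of this index construction, reusing the hypothetical `simhash64` helper sketched earlier and keying each bucket by a (segment number, 16-bit value) pair:

```python
from collections import defaultdict

def build_simhash_index(texts):
    """Bucket texts by each of the four 16-bit segments of their 64-bit simHash."""
    index = defaultdict(list)                        # (segment number, 16-bit value) -> texts
    for text in texts:
        h = simhash64(text)                          # hypothetical helper sketched above
        for seg_no in range(4):
            seg = (h >> ((3 - seg_no) * 16)) & 0xFFFF   # segment 0 = highest 16 bits
            index[(seg_no, seg)].append(text)
    return index
```

Deduplication then only needs to compare texts that share a bucket, and the buckets can be processed in parallel.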

The whole process of using simHash for deduplication is shown in the following figure:


To sum up, simHash deduplication takes three steps:

1. Compute the simHash value of every text to be deduplicated;
2. Build an inverted index by splitting the simHash values into segments;
3. Deduplicate in parallel within the buckets of each segment.

There are two key points in deduplicating with simHash:

- the simHash mapping still preserves the similarity of the original texts;
- divide and conquer greatly reduces the number of unnecessary comparisons.

Therefore, with these two guarantees, the simHash algorithm, applied to long texts with Hamming distance as the similarity measure, can greatly reduce the time complexity and achieve a good deduplication effect. In the short-text scenario, however, this measure performs very poorly. The Hamming distance threshold normally used for long texts is 3, but for short texts the Hamming distance between similar texts is usually greater than 3; and for an algorithm whose similarity is based on Hamming distance, the higher the threshold, the higher the time complexity. As a result, Hamming distance can no longer serve as a similarity measure for short texts.

Deduplication algorithm based on local text information

The basic idea of this approach is similar to simHash, except that instead of a hash value, a substring of the text is used directly as the key, and every text containing that substring is put into the bucket corresponding to it. There is an implied premise:

> Any two texts that can be judged similar must be exactly identical on one or more substrings.

In addition, the substrings can be generated as n-grams (shingles, if at the word or character level) cut directly from the original text with a sliding window, cut from the ordered word sequence that remains after removing stop words, or cut from a generated summary of the text. In short, as long as they are based on the raw text, or on a lossy version of it within an acceptable range, similar ideas can be used to generate these substrings as index keys.
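A couple of these generation strategies, sketched minimally (the function names are illustrative, not from the original article): character-level m-grams cut from the raw text with a sliding window, and word-level shingles over an already segmented token list.

```python
def char_ngrams(text, m):
    """Character-level m-grams cut from the raw text with a sliding window."""
    if len(text) < m:
        return [text]                       # short text: keep the whole string as one key
    return [text[i:i + m] for i in range(len(text) - m + 1)]

def word_shingles(tokens, m):
    """Word-level shingles over a segmented (and optionally stop-word filtered) token list."""
    if len(tokens) < m:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + m]) for i in range(len(tokens) - m + 1)]
```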

The whole deduplication algorithm consists of five parts: text preprocessing, construction of the inverted index, parallel divide and conquer, the deduplication procedure itself, and text merging.

Text preprocessing

Text preprocessing varies with the specific way substrings are extracted. If the substrings are combinations of words, the text must be segmented; if stop words need to be removed, that is also part of preprocessing. To simplify the analysis, this article mainly takes direct truncation of the original text as the example, so the preprocessing work is relatively light.

Establishment of inverted index

Assume two potentially similar texts t1 and t2 (one of which should be removed by deduplication), and that they have k maximal common continuous substrings, forming the set S = {s1, s2, ..., sk} with corresponding lengths L = {l1, l2, ..., lk}. For this particular pair of texts, the chosen truncation length of the substrings must not exceed a certain threshold: if it did, the two texts would no longer share any identical substring as an index key, the algorithm would never compare them, and the duplicate could not be removed. This threshold is defined as the maximum deduplication length of the two texts:

M(t1, t2) = max{l1, l2, ..., lk}
When the whole text collection is deduplicated, there is correspondingly a global deduplication length m, which indicates what truncation length should be chosen for every text if the similar texts in the collection are to be removed. Generally speaking, the choice of the global deduplication length is a trade-off between the deduplication rate and the time complexity of the algorithm. The smaller the global deduplication length, the better the deduplication effect (the deduplication rate rises), but the higher the time complexity; the larger the global deduplication length, the worse the effect on similar texts (some similar pairs are never compared), but the lower the time complexity. The reason is this: if the global deduplication length is too large, it exceeds the maximum deduplication length of many similar pairs, so those pairs are never judged similar and the deduplication rate drops, while the reduced number of comparisons lowers the time complexity. Conversely, as the global deduplication length decreases, more similar texts are grouped under the same index key and, after the similarity calculation, removed, so the global deduplication rate increases, but the larger number of comparisons raises the time complexity.
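As an aside, the maximum deduplication length of a pair of texts defined above is simply the length of their longest common contiguous substring. A minimal sketch of computing it with the standard dynamic program (for illustration only, not part of the algorithm's hot path):

```python
def max_dedup_length(t1, t2):
    """Length of the longest common contiguous substring of t1 and t2."""
    best = 0
    dp = [0] * (len(t2) + 1)   # dp[j]: common suffix length ending at t1[i-1] and t2[j-1]
    for i in range(1, len(t1) + 1):
        prev = 0               # value of dp[j-1] from the previous row
        for j in range(1, len(t2) + 1):
            cur = dp[j]        # save previous-row dp[j] before overwriting
            dp[j] = prev + 1 if t1[i - 1] == t2[j - 1] else 0
            best = max(best, dp[j])
            prev = cur
    return best
```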

Assume a sample of similar text pairs C drawn from the actual text set; the global deduplication length m can then be determined from this sample. Practice shows that when m >= 4 (roughly the length of two Chinese words), the time complexity of the algorithm, computed in parallel, is already within an acceptable range, so we can take:

4 <= m <= min{ M(t1, t2) : (t1, t2) ∈ C }
Suppose a text t to be deduplicated has length n, and define S as the set of its m-gram substrings. Depending on the relationship between m and n there are two cases: (1) when n >= m, m-gram substrings can be cut from the text, and the set has size n - m + 1, that is, S = {s1, s2, ..., s(n-m+1)}; (2) when n < m, no substring of length m can be cut, so the whole text is added to the set as a single element, S = {t}. After the m-gram substring set of every text has been generated, the set of each text t is traversed, and each substring in it is used as a key with the original text t as the value to form a key-value pair. After traversing the m-gram sets of all texts, we obtain an inverted index from each of a text's n - m + 1 m-gram substrings to the text itself. All texts under the same index key can then be aggregated, and the deduplication logic applied within each group.
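Putting these pieces together, a minimal sketch of the index construction (reusing the hypothetical `char_ngrams` helper above, which already handles the n < m case by emitting the whole text as a single key):

```python
from collections import defaultdict

def build_mgram_index(texts, m):
    """Inverted index: each m-gram substring -> all texts containing it."""
    index = defaultdict(list)
    for t in texts:
        for gram in set(char_ngrams(t, m)):   # set() avoids adding a text twice under one key
            index[gram].append(t)
    return index
```

All texts that share at least one m-gram end up under at least one common key, so the pairwise comparison only has to happen within each key's bucket.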


Parallel framework for algorithms

The parallel framework here mainly relies on Spark. The original text set is stored on the nodes of the cluster in HDFS. After each text has been assigned to its index keys as described above, each key is hashed, and according to the hash value every text to be deduplicated is routed to the corresponding machine node (the Servers in the figure below); each worker node in the distributed cluster is only responsible for the deduplication work on its own machine. The Spark-based distributed framework is shown below: every Server is a worker node, and the Driver is responsible for distribution and scheduling. The Driver distributes the text set stored in HDFS to these nodes, each Server performs the deduplication work for its own partition, and in the end each Server keeps the texts that survive this first round of deduplication.
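A minimal PySpark sketch of this partitioning step, under a few assumptions not stated in the original: each record is one text per line in HDFS (the path is a placeholder), `char_ngrams` is the hypothetical helper above, and `dedup_bucket` stands for the in-bucket procedure described in the next section.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("text-dedup")
sc = SparkContext(conf=conf)

M = 4  # hypothetical global deduplication length

texts = sc.textFile("hdfs:///path/to/corpus")          # assumed layout: one text per line

buckets = (texts
           .flatMap(lambda t: [(gram, t) for gram in set(char_ngrams(t, M))])
           .groupByKey())                               # gram -> all texts containing it

# Each bucket is deduplicated independently on its worker node;
# dedup_bucket is the in-bucket procedure of the next section.
candidates = buckets.flatMap(lambda kv: dedup_bucket(list(kv[1])))
```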


Deduplication implementation

Once the parallelization framework is in place, the texts grouped under each index key can be compared pairwise (as shown in the figure above, one Server may process the texts of several index keys), achieving text deduplication. According to the analysis in section 1, any two texts t1 and t2 that can be judged similar must be completely identical on one or more substrings; following the setup in section 3.1.1, these maximal common substrings form the set S = {s1, s2, ..., sk}. When t1 and t2 are cut into m-grams, producing the m-gram sets S1 and S2 respectively, consider a substring si in S whose length |si| is not less than the global deduplication length m: si can itself be cut into |si| - m + 1 m-gram substrings, and each of these must appear in both S1 and S2. Consequently, t1 and t2 will both appear in the inverted index under each of these |si| - m + 1 m-gram keys.

When deduplicating, the pairwise similarity can be computed among all the texts under each index key. Concretely, a result set is maintained dynamically: in the initial state one text is picked from the index bucket as the seed text, and the remaining texts are then traversed, each one being tentatively added to the result set. During this addition, the similarity between the traversed text and each text already in the result set is computed (using the threshold of the chosen similarity measure). If the similarity with some text in the result set reaches the threshold, the traversal of the result set stops; if the whole result set is traversed without triggering the similarity condition, the text to be deduplicated does not duplicate anything already kept, so it is added to the result set, and the traversal moves on to the next text. The similarity measure between two texts is very important in this process and directly affects the deduplication quality; usable measures include edit distance, Jaccard similarity and so on. In practice, computing the Jaccard similarity generally requires segmenting the two texts into words. If the word sets of the two texts are A and B respectively, their similarity follows directly from the definition of the Jaccard similarity, |A ∩ B| / |A ∪ B|. Obviously, two completely different texts have a Jaccard similarity of 0, while two identical texts have a Jaccard similarity of 1, so the Jaccard similarity is a number between 0 and 1. During deduplication, an appropriate threshold can be chosen according to actual needs, and any text whose similarity to a kept text exceeds the threshold is judged similar and removed.
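For reference, a minimal Jaccard similarity over two already segmented texts (the segmentation itself is assumed; for Chinese text a word segmenter would be used beforehand):

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the word sets of two segmented texts."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0                 # treat two empty texts as identical
    return len(a & b) / len(a | b)
```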

The entire de-duplication implementation pseudocode is as follows:

```
Initial state:  text set T = {t_1, t_2, ..., t_n}
                deduplication result R = {}
                similarity threshold sim_th
Output:         deduplication result R

Algorithm:
for i in T:
    flag = true
    for j in R:
        if similarity(i, j) >= sim_th:   # i duplicates a text already in R
            flag = false
            break                        # -> next i
    if flag:
        R.append(i)   # i does not duplicate any text in the current result set, so add it
```

Text merging and deduplication

The main purpose of this step is to take the texts left on the different machine nodes by the previous step and perform an ordinary hash-based deduplication on the ids assigned to them in advance. Because of the way the previous step works, the same text may survive in the buckets of several different substrings; after this hash deduplication those repeated ids are removed. The end result is that, of every group of duplicated texts in the whole text set, exactly one copy is retained, fulfilling the purpose of deduplication. The whole deduplication process is shown in the figure below.
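A minimal sketch of this final merge, assuming (as the paragraph above describes) that every text was given a globally unique id in advance, so that each surviving record is a (doc_id, text) pair and duplicates left in different buckets can be collapsed by id:

```python
# Assumed: `candidates` is an RDD of (doc_id, text) pairs surviving the per-bucket
# deduplication; the same document may survive in several buckets.
deduped = (candidates
           .reduceByKey(lambda a, b: a)      # keep a single copy of each id
           .values())
deduped.saveAsTextFile("hdfs:///path/to/deduped")   # hypothetical output path
```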


Comparison with simHash

Compared with simHash, the differences lie mainly in time complexity and accuracy.

First, the time complexity is greatly reduced:

- The number of buckets changes dynamically with the size of the text collection and is roughly twice the number of texts, so on average there is less than one text per bucket and the computation inside a bucket is greatly reduced. In the simHash algorithm, by contrast, the number of buckets is fixed at 4 * 2^16 ≈ 260,000.
- Generally only similar texts share a particular word combination, so within a given bucket the similar texts are the majority, and the in-bucket deduplication complexity tends towards O(N). In simHash, a bucket still contains a lot of dissimilar texts, and the in-bucket deduplication complexity tends towards O(N^2).

Second, the similarity measure is more accurate:

- More accurate similarity measures can be used, whereas simHash's Hamming distance does not work for short texts: the recall is too low, and many similar texts fail to meet the condition of a Hamming distance of at most 3.

conclusion

In fact, this algorithm can also be applied to long texts. In theory its time complexity can be much lower than simHash's, with a comparable deduplication effect. Its only drawback is the larger storage footprint, because the algorithm needs to store many copies of each text; but since each stored copy can be replaced by a globally unique id, the storage pressure does not grow much, and compared with the large reduction in time complexity, this storage cost is entirely acceptable.