Background

One day last week, I scoured the web, piecing together incomplete code snippets, dozens of SimHash projects on GitHub, and dozens of articles, and finally arrived at a reasonably accurate Java version of the SimHash algorithm.

Writing things up is a simple way to test how well you have mastered a topic. This article introduces the principle and implementation of similar-text retrieval based on the SimHash algorithm in detail.

An application of text similarity

Recently, I have been working on a vulnerability-database crawler project that needs to analyze and merge vulnerability information from several vulnerability websites. Write-ups of the same vulnerability are basically similar across sites, but each site may word them slightly differently. How do we merge records of the same vulnerability published by different websites?

This comes down to computing text similarity. The basic idea: when parsing a vulnerability from the web, first search the system's vulnerability database using the vulnerability's description or title to see whether a similar vulnerability already exists. If it does, treat it as the same vulnerability and simply record the additional source.

The classic solution to this problem is SimHash, a locality-sensitive hashing (LSH) algorithm that Google uses for large-scale text deduplication. Note that the algorithm's full name, locality-sensitive hashing, differs slightly from the abbreviated name SimHash.

The basic process of judging text similarity using SimHash is as follows:

  1. Compute the SimHash value of each of the two texts, obtaining two fixed-length binary sequences.
  2. XOR the two binary numbers to obtain the Hamming distance between the texts.
  3. If the Hamming distance is no greater than 3, the texts are considered similar.

Introduction to the SimHash algorithm

Java's HashMap uses a hash function to compute a value that identifies where an entry is stored. To reduce hash collisions, the chosen function should scatter its outputs as uniformly as possible.

SimHash differs from the hashes we are familiar with in that it must produce identical or similar hash values for identical or similar texts: the similarity of the fingerprints should reflect the similarity of the input texts.

The basic process is as follows:


  1. Segmentation: segment the text into N words word_1, word_2, word_3, …, word_n;
  2. Weighting: assign each word word_i a corresponding weight weight_1, weight_2, weight_3, …, weight_n;
  3. Hash: compute the hash value hash_i of each word to obtain a fixed-length binary sequence, usually 64 bits (128 bits is also possible);
  4. Signed weighting: transform each word's hash_i using its weight_i: replace every 1 bit with +weight_i and every 0 bit with -weight_i, obtaining a new sequence weightHash_i;
  5. Accumulation: add up the weightHash_i sequences bit by bit to get a single sequence lastHash, in which each bit is the sum of the signed weights of all the words;
  6. Dimension reduction: convert lastHash into a 0/1 sequence, the SimHash: set each position whose value is greater than 0 to 1 and each position less than 0 to 0. The result is the locality-sensitive hash of the whole text, i.e. its fingerprint.

The classic worked example of this process takes a piece of text through each step all the way to its SimHash fingerprint.
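The six steps above can be sketched in Java. This is a minimal illustration under stated assumptions, not the implementation the article builds later: whitespace splitting stands in for a real segmenter such as HanLP, term frequency stands in for a proper weighting scheme, and a tiny homemade 64-bit string hash stands in for Murmur.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal SimHash sketch; tokenization, weighting, and hashing are all placeholders.
public class SimHashSketch {

    // Placeholder 64-bit hash for a single token (stand-in for Murmur etc.).
    static long tokenHash(String token) {
        long h = 1125899906842597L; // arbitrary prime seed
        for (int i = 0; i < token.length(); i++) {
            h = 31 * h + token.charAt(i);
        }
        return h;
    }

    public static long simHash64(String text) {
        // 1. Segmentation: naive whitespace split.
        String[] words = text.toLowerCase().split("\\s+");

        // 2. Weighting: term frequency.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum);

        // 3-5. Hash each word, apply signed weights per bit, accumulate.
        int[] acc = new int[64];
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            long h = tokenHash(e.getKey());
            int weight = e.getValue();
            for (int bit = 0; bit < 64; bit++) {
                if (((h >>> bit) & 1L) == 1L) acc[bit] += weight;
                else acc[bit] -= weight;
            }
        }

        // 6. Dimension reduction: positive sums become 1 bits.
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (acc[bit] > 0) fingerprint |= (1L << bit);
        }
        return fingerprint;
    }
}
```

Identical texts always produce identical fingerprints; texts sharing most of their high-weight words end up with fingerprints differing in only a few bits.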

Three technical points need to be addressed in the implementation:

  1. Word segmentation: tools such as HanLP or IKAnalyzer can be used;
  2. Weighting: assign a reasonable weight to each word, either with a custom scheme or using word-frequency statistics;
  3. The hash algorithm: write your own, or use an existing one such as MurmurHash or the JDK's built-in hashes.
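As one example of the third point, the public-domain FNV-1a 64-bit hash is small enough to inline; MurmurHash (via a library such as Guava) or the JDK's hashes are equally valid choices. This is only one option, shown here because it fits in a few lines:

```java
import java.nio.charset.StandardCharsets;

// FNV-1a 64-bit hash: one possible per-word hash function for SimHash.
public class Fnv1a64 {
    private static final long OFFSET = 0xcbf29ce484222325L; // FNV offset basis
    private static final long PRIME  = 0x100000001b3L;      // FNV prime

    public static long hash(String s) {
        long h = OFFSET;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL); // fold in one byte
            h *= PRIME;       // mix
        }
        return h;
    }
}
```

Whichever hash you choose, it must be deterministic and spread its 64 bits uniformly, since every bit position contributes independently to the fingerprint.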

Hamming distance

Take a look at this explanation from Baidu Baike:

The Hamming distance is named after Richard Wesley Hamming. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. In other words, it is the number of characters that must be replaced to transform one string into the other.

It follows from the definition that every differing bit of two binary sequences yields a 1 in their XOR, so the total number of 1s in the XOR result is the Hamming distance of the two strings. Computing the Hamming distance is therefore very simple.
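In Java the calculation really is just two steps: XOR the fingerprints and count the 1 bits with Long.bitCount. The class and method names below are my own, as is wiring in the threshold of 3 mentioned earlier:

```java
// Hamming distance between two 64-bit SimHash fingerprints.
public class Hamming {

    // XOR leaves a 1 at every bit where the fingerprints differ;
    // Long.bitCount then tallies those 1s.
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Threshold from the process described above: distance <= 3 means "similar".
    public static boolean similar(long a, long b) {
        return distance(a, b) <= 3;
    }
}
```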

Takeaways

This article has introduced the basic concepts; the next article will cover the concrete implementation. The part of the SimHash flow that is hard to understand is why simply replacing the 0s and 1s in each hash value with signed weights and then summing over all the words works at all.

Although I did not find a Java implementation of SimHash that could be used directly, I did find two detailed articles that greatly helped my understanding of locality-sensitive hashing:

Reference link 1
Reference link 2