background

The previous two articles introduced SimHash algorithm flow, SimHash fingerprint segmentation based similar text retrieval process, this paper introduces the specific code implementation.

IT professionals know that coding is the least difficult job, just translating the previous process description into code. This article will translate SimHash algorithms and retrieval tools.

Process review

SimHash algorithm flow:

  1. Word segmentation: Word segmentation of text, get N words word1word_1word1, word2word_2word2, Word3word_3word3…… Wordnword_nwordn;
  2. Weight1weight_1weight1, weight2weight_2weight2, weight3weight_3weight3… Weightnweight_nweightn;
  3. Hash: Calculate the hash value hashihash_ihashi of each participle to obtain a fixed length binary sequence, usually 64 bits, but can also be 128 bits;
  4. Weighted weight: Weightiweight_iweighti Transforms hashihash_ihashi of each participle wordiword_iwordi, changing 1 to positive weightiweight_iweighti, 0 to −weighti-weight_i−weighti, Get a new sequence weightHashiweightHash_iweightHashi;
  5. Weight stacking: Add the weight of each weightHashiweightHash_iweightHashi bit to get a sequence, lastHashlastHashlastHash, where each bit is the sum of the weight of all the participles.
  6. Dimension reduction: Then transform lastHashlastHashlastHash into simHash of 01 sequence. The method is as follows: the position of weight value greater than zero is set to 1, and the position less than 0 is set to 0, which is the local hash value of the whole text, namely fingerprint.

Similar text retrieval tool flow based on SimHash:

  1. Computing SimHash: Computes the SimHash value of the target text. The length is 64 bits.
  2. The disassembled SimHash value is 4 segments, each 16-bit segment is stored in the array hashs;
  3. RedisUtil. Get (key); RedisUtil. Get (key);
  4. If matchedSimHash is empty, the SimHash value of the target text is cached, and the retrieval is complete. Otherwise, continue.
  5. Iterate over the matchedSimHash list to calculate the Hamming distance between the current SimHash and each SimHash value in the list. If a value smaller than 3 is found, the similar text is found and the process ends. Otherwise, continue.
  6. If no similar record is found after the traversal of each segment in the outer layer and the traversal of each matching list in the memory, the SimHash value of the target text is cached and the retrieval is complete.

The class diagram design

The main difficulty in the implementation of SimHash algorithm lies in word segmentation and weight assignment. This paper uses Word, a word segmentation tool on GitHub, to implement it by inheriting its TextSimilarity. The functions involved are as follows:

Depend on the preparation

To create a project, pom.xml adds two dependencies:

<dependency> <groupId>org.apdplat</groupId> <artifactId>word</artifactId> <version>1.3.1</version> </dependency> Alibaba </groupId> <artifactId>fastjson</artifactId> <version>1.2.70</version> </dependency>Copy the code

Note that the word dependency is not easy to download successfully, you can directly put word.1.3.1.jar into the project directory, manually add the dependency.

The implementation code

According to the SimHash algorithm flow, the code for realizing SimHashBaseOnWord is as follows:

public class SimHashBaseOnWord extends TextSimilarity { private static final Logger LOGGER = LoggerFactory.getLogger(SimHashBaseOnWord.class); // Generate 64-bit SimHash private int hashBitCount = 64; public SimHashBaseOnWord(){ } public SimHashBaseOnWord(int hashBitCount) { this.hashBitCount = hashBitCount; } @Override protected double scoreImpl(List<Word> words1, List > < Word words2) {/ / with the weight of Word to describe Word taggingWeightWithWordFrequency (words1 words2); // Calculate SimHash String simHash1 = SimHash (words1); String simHash2 = simHash(words2); Int hammingDistance = hammingDistance(simHash1, simHash2); If (hammingDistance == -1){logger. error(" SimHash value of text 1 and text 2 is not equal, cannot calculate hammingDistance "); The return of 0.0; } int maxDistance = simHash1.length(); double score = (1 - hammingDistance / (double)maxDistance); return score; } /** * Calculate the SimHash of the text * @param text * @return */ public String SimHash (String text) {List<Word> words = seg(text); return this.simHash(words); } /** * Computes the Hamming distance of the SimHash value of equal length * If the distance cannot be compared (the two text segments are not of equal length), Return 64 * @param simHash1 SimHash value 1 * @param simHash2 SimHash value 2 * @return hammingDistance */ public int hammingDistance(String) simHash1, String simHash2) { if (simHash1.length() ! = simHash2.length()) { return this.hashBitCount; } int distance = 0; int len = simHash1.length(); for (int i = 0; i < len; i++) { if (simHash1.charAt(i) ! = simHash2.charAt(i)) { distance++; } } return distance; } /** * computes the SimHash value of the word list, * @param Words Word List * @return SimHash value */ private String SimHash (List<Word> words) {float[] hashBit = new float[hashBitCount]; words.forEach(word -> { float weight = word.getWeight()==null? 1:word.getWeight(); BigInteger hash = hash(word.getText()); for (int i = 0; i < hashBitCount; i++) { BigInteger bitMask = new BigInteger("1").shiftLeft(i); if (hash.and(bitMask).signum() ! = 0) { hashBit[i] += weight; } else { hashBit[i] -= weight; }}}); StringBuffer fingerprint = new StringBuffer(); for (int i = 0; i < hashBitCount; i++) { if (hashBit[i] >= 0) { fingerprint.append("1"); }else{ fingerprint.append("0"); } } return fingerprint.toString(); } /** * calculates the hash value of the text, A Hash algorithm of common * * @ param word word @ * Hash value is the return/private BigInteger Hash (String word) {if (word = = null | | word. The length ()  == 0) { return new BigInteger("0"); } char[] charArray = word.toCharArray(); BigInteger x = BigInteger.valueOf(((long) charArray[0]) << 7); BigInteger m = new BigInteger("1000003"); BigInteger mask = new BigInteger("2").pow(hashBitCount).subtract(new BigInteger("1")); long sum = 0; for (char c : charArray) { sum += c; } x = x.multiply(m).xor(BigInteger.valueOf(sum)).and(mask); x = x.xor(new BigInteger(String.valueOf(word.length()))); if (x.equals(new BigInteger("-1"))) { x = new BigInteger("-2"); } return x; } @param text @return */ private List<Word> seg(String text) {if(text == null){return Collections.emptyList(); } Segmentation segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.MaxNgramScore); List<Word> words = segmentation.seg(text); If (filterStopWord) {// stop words filterStopWord. FilterStopWords (words); } return words; } public static void main(String[] args) throws Exception{String text1 = "I like reading "; String text2 = "Reading is my hobby "; String text3 = "I like reading "; SimHashBaseOnWord textSimilarity = new SimHashBaseOnWord(); double score1pk2 = textSimilarity.similarScore(text1, text2); double score1pk3 = textSimilarity.similarScore(text1, text3); double score2pk3 = textSimilarity.similarScore(text2, text3); String sim1 = textSimilarity. SimHash (" My hobby is reading "); String sim2 = textSimilarity. SimHash (" Reading is my hobby "); LOGGER. The info (" my hobby is reading a book "+" and reading are my hobbies hamming distance is: "+ textSimilarity. HammingDistance (sim1 and sim2)); Println (text1+" and "+text2+" similarity score: "+score1pk2); Println (text1+" and "+text3+" similarity score: "+score1pk3); Println (text2+" and "+text3+" similarity score: "+score2pk3); }}Copy the code

As a result, the Hamming distance between “my hobby is reading” and “reading is my hobby” is 0:

Retrieval tool

The complete code of SearchBaseOnSimHash based on SimHashBaseOnWord class:

Public class SearchBaseOnSimHash {// Constant definition: id of similar record in database public static final String idKey = "ID "; Public static Final String fingerKey = "finger"; / / logging class private static final Logger Logger. = LoggerFactory getLogger (SearchBaseOnSimHash. Class); Private static int similarityThreshold = 3; /** * Retrieve similar record numbers based on SimHash values. Algorithm flow: * 1. Divide SimHash into four segments and search the global table for the complete fingerprint of the current segment using each segment as the Key * 2. If the current SimHash matches the information of a segment in the buffer, calculate the distance between the fingerprint of the matched segment and the current fingerprint * 3. If the distance is less than the similarity threshold, it is considered found. Returns an object; * 4, otherwise, @param simHash * @return */ public static JSONObject search(String [] segments, String simHash) { if(segments == null || segments.length ! = 4) { return null; } // Create a SimHash object to calculate the Hamming distance SimHashBaseOnWord simHashByWord = new SimHashBaseOnWord(); Long start = system.currentTimemillis (); for (int i= 0; i<segments.length ; Map<String, String> hashSets = RedisUtil. Get (segments[I]); if(hashSets == null) { continue; String finger = hashsets.get (fingerKey); int hammingDistance = simHashByWord.hammingDistance(finger, simHash); if (hammingDistance <= similarityThreshold){ long end = System.currentTimeMillis(); JSONObject responseObj = new JSONObject(); responseObj.put(idKey, hashSets.get(idKey)); return responseObj; } } return null; * @param segments * @param segments * @param segments * @param segments * @param segments * @param segments * @param simHash * @param data */ public static void push(String[] segments,String simHash, JSONObject data) { if(segments == null || segments.length ! = 4) { return; } Map<String,String> redisData = new HashMap<>(); redisData.put(fingerKey, simHash); redisData.put(idKey, data.getString(idKey)); for (int i= 0; i<segments.length ; i++){ RedisUtil.put(segments[i],redisData); }} public static void main(String[] args) {String text1 = ""; String text2 = "Reading is my hobby "; String text3 = "I like reading "; SimHashBaseOnWord textSimilarity = new SimHashBaseOnWord(); String simhash = textSimilarity. Simhash (" My hobby is reading "); String[] hashs=new String[4]; hashs[0]= simhash.substring(0, 16); hashs[1] = simhash.substring(16, 32); hashs[2] = simhash.substring(32, 48); hashs[3] = simhash.substring(48, 64); JSONObject target = search(hashs,simhash); If (target == null) {target = new JSONObject(); target.put(idKey,"123"); push(hashs,simhash,target); } else { System.out.println(target.get(idKey)); } String sim2 = textSimilarity. SimHash (" Reading is my hobby "); hashs[0]= sim2.substring(0, 16); hashs[1] = sim2.substring(16, 32); hashs[2] = sim2.substring(32, 48); hashs[3] = sim2.substring(48, 64); target = search(hashs,sim2); If (target == null) {target = new JSONObject(); target.put(idKey,"456"); push(hashs,sim2,target); } else { System.out.println(target.get(idKey)); }}}Copy the code

The revelation of

Theoretically, the longer the text similarity based on SimHash is, the more accurate the calculation result will be. However, the algorithm based on word segmentation is more accurate for both long text and short text. Why?

Analyze its startup log:

Here’s a thing or two:

  1. Large number of resource words;
  2. In the process of word segmentation, word frequency statistics and weight calculation are carried out automatically.
  3. The “heavyweight” tools take longer to calculate than other word segmentation tools.