• preface
  • Jaccard similarity
  • Sorensen Dice similarity coefficient
  • Levenshtein
  • Hamming distance
  • Cosine similarity
  • conclusion
  • Refer to the article

preface

It has been two months since I wrote my last article on September 11th. During this time, I have been busy with some work. Today I am finally free, so I am writing an article for relaxation.

In the usual coding, we often need to judge the similarity of two texts, whether it is used for text correction or repetition, etc., so what dimension should we judge the similarity? How are these algorithms implemented? This article makes a record of common calculation methods.

Jaccard similarity

The first is the Jaccard similarity coefficient, which is defined and calculated by wikipedia.

The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the French name Coefficient de communaute by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

In fact, the summary is a sentence: the proportion of the intersection of sets and the union of sets.

Java code implementation is as follows:

    public static float jaccard(String a, String b) {
        if (a == null && b == null) {
            return 1f;
        }
        // all are null with similarity of 1
        if (a == null || b == null) {
            return 0f;
        }
        Set<Integer> aChar = a.chars().boxed().collect(Collectors.toSet());
        Set<Integer> bChar = b.chars().boxed().collect(Collectors.toSet());
        // The number of overlaps
        int intersection = SetUtils.intersection(aChar, bChar).size();
        if (intersection == 0) return 0;
        // Number of combinations
        int union = SetUtils.union(aChar, bChar).size();
        return ((float) intersection) / (float)union;
    }
Copy the code

Sorensen Dice similarity coefficient

Similar to Jaccard, Dice coefficient is also a way of calculating the similarity between simple sets. Unlike Jaccard, the calculation is slightly different. Here’s the definition.

The Sørensen — Dice coefficient (see below for other names) is a statistic used to gauge The similarity of two samples. It Was independently developed by the botanists Thorvald Sørensen[1] and Lee Raymond Dice,[2] who Published in 1948 and 1945 respectively.

Note that this is: 2 times the intersection of sets divided by the addition of two sets. It’s not a union.

Java code implementation is as follows:

    public static float SorensenDice(String a, String b) {
        if (a == null && b == null) {
            return 1f;
        }
        if (a == null || b == null) {
            return 0F;
        }
        Set<Integer> aChars = a.chars().boxed().collect(Collectors.toSet());
        Set<Integer> bChars = b.chars().boxed().collect(Collectors.toSet());
        // Find the number of intersections
        int intersect = SetUtils.intersection(aChars, bChars).size();
        if (intersect == 0) {
            return 0F;
        }
        // Complete set
        int aSize = aChars.size();
        int bSize = bChars.size();
        return (2 * (float) intersect) / ((float) (aSize + bSize));
    }
Copy the code

Levenshtein

Levenstein distance, also known as Levenshtein distance, is a type of editing distance. The minimum number of editing operations required to convert one string to the other.

To put it simply, string similarity is represented by edit distance. The smaller the edit distance is, the more similar the strings are.

Java implementation code is as follows:

    public static float Levenshtein(String a, String b) {
        if (a == null && b == null) {
            return 1f;
        }
        if (a == null || b == null) {
            return 0F;
        }
        int editDistance = editDis(a, b);
        return 1 - ((float) editDistance / Math.max(a.length(), b.length()));
    }

    private static int editDis(String a, String b) {

        int aLen = a.length();
        int bLen = b.length();

        if (aLen == 0) return aLen;
        if (bLen == 0) return bLen;

        int[][] v = new int[aLen + 1][bLen + 1];
        for (int i = 0; i <= aLen; ++i) {
            for (int j = 0; j <= bLen; ++j) {
                if (i == 0) {
                    v[i][j] = j;
                } else if (j == 0) {
                    v[i][j] = i;
                } else if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    v[i][j] = v[i - 1][j - 1];
                } else {
                    v[i][j] = 1 + Math.min(v[i - 1][j - 1], Math.min(v[i][j - 1], v[i - 1][j])); }}}return v[aLen][bLen];
    }
Copy the code

The classical dynamic programming method is used to solve the edit distance in the code.

We use ** 1 – (edit distance/maximum length of two strings) ** to indicate similarity, so that we can get similarity that matches our semantics.

Hamming distance

Hamming distance is a special case of editing distance and is used only to calculate the number of inconsistent characters in two equal length strings.

Therefore, hamming distance does not need to be added or deleted, but only needs to be compared, so the implementation is relatively simple.

A similarity of two strings can be expressed by similarity= Hamming distance/length.

The Java code is as follows:

    public static float hamming(String a, String b) {
        if (a == null || b == null) {
            return 0f;
        }
        if(a.length() ! = b.length()) {return 0f;
        }

        int disCount = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                disCount++;
            }
        }
        return (float) disCount / (float) a.length();
    }

Copy the code

Here are the test cases:

        Assert.assertEquals(0.0 f, StringSimilarity.hamming("Java development"."What are you doing for New Year?"), 0f);
        Assert.assertEquals(0.6666667 f, StringSimilarity.hamming("Eat meat for Chinese New Year."."What are you doing for New Year?"), 0f);
Copy the code

Cosine similarity

The first is the definition of cosine similarity:

Cosine similarity measures the similarity between two vectors by measuring the cosine of the Angle between them. The cosine of an Angle of 0 degrees is 1, and the cosine of any other Angle is not greater than 1; And its minimum value is minus 1. The cosine of the Angle between the two vectors determines whether they point roughly in the same direction. When two vectors have the same direction, the cosine similarity value is 1. When the Angle between the two vectors is 90°, the cosine similarity value is 0. Cosine similarity is -1 when two vectors are pointing in opposite directions. It doesn’t depend on the length of the vector, it just depends on the direction the vector is pointing in. Cosine similarity is usually used in positive Spaces, so the value given is between 0 and 1.

The calculation formula is as follows:

Cosines are familiar, so how do we use them to calculate the similarity between two strings?

First we vectorize the strings, and then we can figure out the cosine of the Angle between them in a plane space.

How do I do vectorization of strings? Let me give you a simple example:

A: Huyan twelve B: Huyan twenty-three Their union vector is the frequency at which each character in the union appears in its own part. Vector A: [1,1,1,1,0] vector B: [1,1,1,1,1] then call the above formula to calculate.Copy the code

Java code implementation is as follows:

    public static float cos(String a, String b) {
        if (a == null || b == null) {
            return 0F;
        }
        Set<Integer> aChar = a.chars().boxed().collect(Collectors.toSet());
        Set<Integer> bChar = b.chars().boxed().collect(Collectors.toSet());

        // Count the word frequency
        Map<Integer, Integer> aMap = new HashMap<>();
        Map<Integer, Integer> bMap = new HashMap<>();
        for (Integer a1 : aChar) {
            aMap.put(a1, aMap.getOrDefault(a1, 0) + 1);
        }
        for (Integer b1 : bChar) {
            bMap.put(b1, bMap.getOrDefault(b1, 0) + 1);
        }

        / / to quantify
        Set<Integer> union = SetUtils.union(aChar, bChar);
        int[] aVec = new int[union.size()];
        int[] bVec = new int[union.size()];
        List<Integer> collect = new ArrayList<>(union);
        for (int i = 0; i < collect.size(); i++) {
            aVec[i] = aMap.getOrDefault(collect.get(i), 0);
            bVec[i] = bMap.getOrDefault(collect.get(i), 0);
        }

        // Calculate three parameters separately
        int p1 = 0;
        for (int i = 0; i < aVec.length; i++) {
            p1 += (aVec[i] * bVec[i]);
        }

        float p2 = 0f;
        for (int i : aVec) {
            p2 += (i * i);
        }
        p2 = (float) Math.sqrt(p2);

        float p3 = 0f;
        for (int i : bVec) {
            p3 += (i * i);
        }
        p3 = (float) Math.sqrt(p3);

        return ((float) p1) / (p2 * p3);
    }

Copy the code

Running the test case against the above code, you can see that it basically meets our expectations.

        Assert.assertEquals(0.70710677 f, StringSimilarity.cos("apple"."app"), 0f);
        Assert.assertEquals(0.8944272 f, StringSimilarity.cos("Huyan twelve"."Huyan 23"), 0f);
        Assert.assertEquals(0.0 f, StringSimilarity.cos("Data Engineering"."Travel to Japan"), 0f);
Copy the code

conclusion

This paper briefly introduces several different ways to calculate the similarity between plain text. They are all effective to a certain extent, but they also have their own meanings, such as editing distance to describe some, and vector Angle to describe some. Therefore, when using the method in this paper, you should know more about its principles and choose one or several of them according to your own business practice.

Refer to the article

wikipedia


To the end.

ChangeLog


All the above are personal thoughts, if there is any mistake welcome to comment.

Welcome to reprint, please sign and keep the original link.

Contact email: [email protected]

For more study notes, see personal blog or follow wechat official account < huyan10 >——>HuYan ten