This article originated from personal public account: TechFlow, original is not easy, for attention


Today, the 30th article in LeetCode, looks at a string grouping problem.

The question

Given an array of strings, you are asked to group all the strings according to their composition.

For example, a given array is [eat, ATE, tea, tan, NAT, bat].

The words eat, ate and tea all use e, T and A. [tan, ate, tea], [tan, NAT] and [bat].

violence

Again, let’s start with the simplest way of thinking about how we group each character in a string. So we can take all the elements of each string and put them into a dict, and then we can use this dict as a grouping criterion, putting dict strings together.

For example, eat we change it to {‘e’: 1, ‘a’: 1, ‘t’: 1}. Since more than one letter can occur, we also record the number of occurrences. One problem is that dict is dynamic data, and in Python we can’t use it as a key for another dict. An easier way to solve this problem is to write a method to concatenate the dict into a string, such as’ e1a1T1 ‘. Let’s use this as our key. But here’s the problem: Dict keys aren’t always ordered, so we need to sort dict, as shown in the following figure.


That is, we need to implement a function whose input is a string and whose output is an element of that string.

def splitStr(s):

    d = defaultdict(int)

    for c in s:

        d[c] += 1

    ret = ' '

    Order the contents of dict

    for k,v in sorted(d.items()):

        ret += (str(k) + str(v))

    return ret

Copy the code

From there, we can create a dict in the outer layer to store the results of the grouping. We can easily write code:

from collections import defaultdict



class Solution:

    def splitStr(self, s):

        d = defaultdict(int)

        # Split elements in a string

        for c in s:

            d[c] += 1

        ret = ' '

        Order the contents of dict

        for k,v in sorted(d.items()):

            ret += (str(k) + str(v))

        return ret

    

    def groupAnagrams(self, strs: List[str]) -> List[List[str]]:

        groups = defaultdict(list)

        for s in strs:

            Group the split string as a key

            key = self.splitStr(s)

            groups[key].append(s)

            

        ret = []

        for _, v in groups.items():

            ret.append(v)

        return ret

        

Copy the code

Some of you may be aware of a problem: since we convert to dict and then concatenate strings, why don’t we just sort the strings and use the result as the key? If the elements are the same, the result of sorting must be the same.

For example, apple and Pplae are both aELPP after sorting, is this feasible?

The idea is OK, but the commit doesn’t pass. The reason is simple, in three words: complexity. This is very complicated, because the length of the string is not fixed, and sorting them one by one requires a lot of overhead. In addition, we use the sorted result as the key, which also consumes storage resources. So that’s not a good way to do it.

hash

And now it’s time for our leader, as you can see from the title, to do something with the hash algorithm.

To be fair, the hash algorithm name is familiar, we hear it all the time, but many of us don’t know what a hash algorithm is for or where we actually need it. The one you hear a lot about is hashMap.

In fact, the content of the hash algorithm is very simple, which can be easily understood as a mapping. Our input can be anything, it can be a number, it can be an array or an object, but our output is a fixed number of bytes of information. For example, modulo 4 in the figure below is a hash function, and we can classify the numbers into different buckets based on the result of modulo 4.


We can put some data in the same bucket as we want, or we can make the result of each data hash as different as possible, which achieves information compression. Let’s say we hash a 2MB instance to get a 32-bit string. This means that we can use 32 bits of string to represent the original 2MB content, so that we can do efficient queries and other operations. For example, the current facial recognition module can be easily understood as a hash function. The camera takes a photo, the algorithm hashes the photo into an ID, and then finds the personal information corresponding to the ID in the database to complete the identification process.

In this case, we want to design a hash function that reads a string and hashes the contents of the string to ensure that the same string hash yields the same result. We can divide buckets by the result of this hash. Essentially, the above method can also be regarded as a hash method. However, since sorting is involved, it is a little more complicated, and the final return is a string. From the perspective of time complexity and space complexity, there is still room for optimization. Let’s look at a commonly used hash algorithm.

In this algorithm, we choose a prime number as the hash factor, so that the probability of hash collisions is low. To construct a hash result from a power of a prime number, let’s look at the code:

def hash(dict):

    ret = 0

    prime = 23

    # k represents characters and v represents occurrences

    for k, v in dict:

        ret += v * pow(23, ord(k) - ord('a'))

        # modulo 2^32, control the return value in 32 bits

        ret %= (1 << 32)

    return ret

Copy the code

In this case, ORD is the ASCII operation, that is, the English letter into a number. We use different powers of a certain prime number to represent different letters, multiplied by the number of times the letter appears as a coefficient. Finally, the hash value of each letter is summed up to obtain the hash value of the entire string. If the coefficients and powers of constructing the same string are the same, the sum will obviously be the same. This conclusion is not a problem. But conversely, are strings with equal hash values really the same?

In fact, if we think about it, we can think about counterexamples, for example, if we have a single character a, the hash value of that isThe hash value of a single b, which is also easy to compute, is 23. What is the hash value of 23 A’s? If you do the math, it’s also 23. Because even though we’re using different powers, they have the same baseYou can make up for the difference in the exponent by using the preceding coefficients. The case where different objects hash the same result is calledHash collisionThis is not what we expected. But to be sure, hash collisions are almost inevitable in most cases. Because the purpose of our hash is to replace the original complex and large data with a simple number or string, just like taking a photo, two different people may look very similar.

We have a lot of information compression or information loss in this process. For example, we use 10 bits to represent the original 2000 bits of data. No matter what strategy we use, there’s only so much data that can be expressed in 10 bits. According to the pigeonhole principle, as long as we hash enough data, there will always be two different data hash results collision.


However, the probability of collision can be reduced or controlled by designing appropriate factors and algorithms. For example, in this problem, the larger the number of primes we choose, the less likely the collision will occur, because the larger the number of primes we choose, the more characters we need to repeat to form the collision, and the lower the possibility.

Finally, let’s look at the complete code:

from collections import defaultdict

import math





class Solution:

    def splitStr(self, s):

        hashtable = [0 for _ in range(30)]

        ret = 0

        for c in s:

            hashtable[ord(c) - ord('a'+ =)]1

            

        # the hash algorithm

        for i in range(26) :

            ret = ret * 23 + hashtable[i]

            Control the return value within 32 bits

            ret %= (1 << 32)

        return ret

    

    def groupAnagrams(self, strs: List[str]) -> List[List[str]]:

        groups = defaultdict(list)

        for s in strs:

            key = self.splitStr(s)

            groups[key].append(s)

        

            

        ret = []

        for _, v in groups.items():

            ret.append(v)

        return ret

        

Copy the code

One detail in the code, we just wrote v * pow(23, k), why did we write it differently here?

Since pow functions in Python return floating-point numbers, the accuracy will be lost, which will greatly increase the probability of hash collisions, so we changed the method that does not apply to POw functions.

conclusion

Today’s problem is not so difficult in terms of difficulty, even people who don’t know hash algorithms can figure it out. However, our goal is not to cut to the chase, but to grow from the chase. Besides, hash algorithm is a very important and basic algorithm, which can be used in many applications and scenarios. So let’s understand the principle of hash algorithm, which is very helpful for our future growth.

Today’s article is all of these, if you feel that there is a harvest, please feel free to point attention or forward, your help is very important to me.