
Data structure – hash table

The hashing idea

There are three common data structures in programming languages: scalars, sequences, and maps

Maps such as HashMap and Hashtable are backed by the hash table data structure

A hash table is essentially an extension of an array: it builds on the array's O(1) random-access property

Examples of hashing ideas:

10 students participated in the contest, numbered from 0 to 9. How to store student information into a suitable data structure?

Use an array of length 10, stored with player numbers as subscripts, so that the complexity of retrieving stored information is O(1).

If each student’s number is grade number + class number + contestant number, how should it be stored?

Extract the contestant-number part of the full number and use it as the array subscript

In the above example the hashing idea is already at work: the contestant number is the key, the extraction rule is the hash function, and the value produced by the hash function is the hash value, which determines the array index
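A minimal Java sketch of this idea (the class name, the six-digit key layout, and the String info values are illustrative assumptions):

    // Sketch of the hashing idea above, assuming six-digit keys of the form
    // grade + class + contestant number, e.g. "010203" -> contestant number 03.
    class ContestTable {
        private final String[] table = new String[100]; // indexed by contestant number 00..99

        // Hash function: take the last two digits (the contestant number) as the array index
        private int hash(String key) {
            return Integer.parseInt(key.substring(key.length() - 2));
        }

        void put(String key, String info) {
            table[hash(key)] = info;   // O(1) store
        }

        String get(String key) {
            return table[hash(key)];   // O(1) lookup
        }
    }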

The hash function

A hash function is a function that converts key values into hash values. The hash function plays a key role in a hash table

The value obtained by the hash function is used as the subscript of the underlying array of the hash table, so the basic design requirements of the hash function are

  1. The hash value computed by the hash function must be a non-negative integer
  2. If x1 == x2, then f(x1) == f(x2)
  3. If x1 != x2, then f(x1) != f(x2)

The third point describes the ideal case: in practice it is almost impossible to find a hash function that maps every distinct key to a distinct hash value. Even well-known hashing algorithms such as MD5, SHA and CRC cannot avoid this situation, which is called a hash conflict (collision)

Moreover, arrays have limited storage space, which increases the probability of hash collisions

Hash conflict

There are two common approaches to resolving hash conflicts: open addressing and the zipper method (separate chaining)

Open addressing method

The idea behind open addressing is that when a hash conflict occurs, i.e. two different keys map to the same subscript, we probe for another free slot to store the new entry

A relatively simple probing method is linear probing

When a conflict occurs, we start at the computed position and probe forward, slot by slot, until we find a free one

For example, suppose hashing a key yields subscript 2, but that slot is already occupied by an entry whose key is different. We then search forward slot by slot; if we reach the end of the array without finding a free slot, we wrap around to the beginning, and finally find an empty slot at subscript 1

To look up an element, we first compute its hash value and compare the key stored at that position. If the keys are not equal, we keep probing forward until we find it; if we reach a free slot before finding it, the element does not exist

Existing problems:

If we simply clear a slot when deleting an element, a later lookup that probes past that slot will stop at the now-free position and fail to find elements stored beyond it, so the algorithm breaks

Solutions:

Instead of clearing the slot, mark it with a special "deleted" flag; during linear probing, a slot carrying this flag does not stop the search. A sketch of this is shown below
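A minimal sketch of linear probing with such a deletion marker (the class name, the fixed capacity, and the String keys are assumptions for illustration, not any particular library's implementation):

    // Open addressing with linear probing and a "deleted" marker (tombstone).
    // A real table would resize based on its load factor so that a free slot always exists.
    class LinearProbingTable {
        private static final String DELETED = new String("deleted"); // unique tombstone sentinel
        private final String[] keys = new String[8];
        private final String[] values = new String[8];

        private int hash(String key) {
            return (key.hashCode() & 0x7fffffff) % keys.length; // non-negative index
        }

        void put(String key, String value) {
            int i = hash(key);
            int firstFree = -1;
            // Probe forward (wrapping around) until the key or a truly empty slot is found,
            // remembering the first tombstone so its slot can be reused
            while (keys[i] != null) {
                if (keys[i] == DELETED) {
                    if (firstFree == -1) firstFree = i;
                } else if (keys[i].equals(key)) {
                    values[i] = value;               // key already present: update in place
                    return;
                }
                i = (i + 1) % keys.length;
            }
            int slot = (firstFree != -1) ? firstFree : i;
            keys[slot] = key;
            values[slot] = value;
        }

        String get(String key) {
            int i = hash(key);
            // A tombstone does NOT stop the search; only a truly empty slot does
            while (keys[i] != null) {
                if (keys[i] != DELETED && keys[i].equals(key)) {
                    return values[i];
                }
                i = (i + 1) % keys.length;
            }
            return null; // reached a free slot: the key is not in the table
        }

        void remove(String key) {
            int i = hash(key);
            while (keys[i] != null) {
                if (keys[i] != DELETED && keys[i].equals(key)) {
                    keys[i] = DELETED;               // mark instead of clearing the slot
                    values[i] = null;
                    return;
                }
                i = (i + 1) % keys.length;
            }
        }
    }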

Besides linear probing, there are two other classical probing methods: quadratic probing and double hashing

Quadratic probing is similar to linear probing, except the probe step sizes are 1², 2², 3², and so on

Double hashing uses multiple hash functions: when the first one produces a conflict, the next function is used to compute a new position, and so on

Regardless of the probing method, the probability of conflicts rises sharply once few free slots remain, so a certain proportion of the hash table must be kept free. The degree of fullness is expressed by the load factor

Load factor of the hash table = number of elements filled in / length of the hash table

Hash table performance degrades as the load factor increases and conflicts increase

When the data volume is small and the load factor is low, the open addressing method works well. This is why ThreadLocalMap in Java uses open addressing to resolve hash conflicts

Zipper method

A more common way to resolve hash conflicts is the zipper method: each array slot stores a linked list, and when a conflict occurs, elements with the same hash value are simply added to the corresponding list

Inserting into a linked list takes O(1), so insertion with the zipper method is also O(1).

To search for or delete an element, first compute the array subscript with the hash function, then search or delete within that slot's linked list. The time complexity of both operations is proportional to the list length k, i.e. O(k). For a hash function that distributes keys evenly, k = number of elements / length of the hash table array
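A minimal zipper-method sketch along these lines (the Node type, the fixed bucket count, and the method names are illustrative assumptions):

    // Zipper method (separate chaining): each array slot holds a linked list of entries.
    class ChainedTable<K, V> {
        private static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
        }

        @SuppressWarnings("unchecked")
        private final Node<K, V>[] buckets = new Node[16];

        private int index(K key) {
            return (key.hashCode() & 0x7fffffff) % buckets.length;
        }

        // O(1): update an existing key, otherwise prepend to the target bucket's list
        void put(K key, V value) {
            int i = index(key);
            for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
                if (n.key.equals(key)) { n.value = value; return; }
            }
            buckets[i] = new Node<>(key, value, buckets[i]);
        }

        // O(k), where k is the length of the bucket's list
        V get(K key) {
            for (Node<K, V> n = buckets[index(key)]; n != null; n = n.next) {
                if (n.key.equals(key)) return n.value;
            }
            return null;
        }
    }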

The linked-list based approach to handling hash conflicts is better suited to storing large objects and large amounts of data. Compared with open addressing, it is also more flexible and supports more optimization strategies, such as replacing the linked list with a red-black tree

In Java, HashMap uses the zipper method with red-black tree to resolve hash conflicts

Usage scenarios

How to design an enterprise-level hash table?

  1. Design a proper hash function, not too complex, and the hash values need to be evenly distributed

  2. Select an appropriate hash-conflict resolution method

  3. Define appropriate load factor thresholds and dynamically expand or shrink as needed

    For the data-relocation problem during expansion, the entries need not all be moved at once: each time new data is inserted, a small portion of the old data is moved along with it, so the relocation cost is amortized to O(1) per insert (see the sketch below)
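    One possible sketch of this incremental migration (the chained-node layout, the 0.75 threshold, and the two-buckets-per-insert batch size are all illustrative assumptions, not a specific library's implementation):

    // Amortized expansion sketch: after a resize is triggered, old entries are moved
    // over in small batches on each subsequent put instead of all at once.
    // Insert-only for brevity (no duplicate-key handling).
    class IncrementalResizeTable<K, V> {
        private static class Node<K, V> {
            final K key; V value; Node<K, V> next;
            Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
        }

        private Node<K, V>[] oldTable;             // non-null only while a migration is in progress
        private Node<K, V>[] table = newArray(8);
        private int size;
        private int migrateIndex;                  // next old bucket to move

        @SuppressWarnings("unchecked")
        private static <K, V> Node<K, V>[] newArray(int n) { return new Node[n]; }

        void put(K key, V value) {
            if (oldTable == null && size >= table.length * 3 / 4) {  // load factor threshold 0.75
                oldTable = table;                  // start a migration instead of moving everything now
                table = newArray(table.length * 2);
                migrateIndex = 0;
            }
            migrateSomeBuckets(2);                 // move a couple of old buckets per insert
            int i = index(key, table);
            table[i] = new Node<>(key, value, table[i]);
            size++;
        }

        V get(K key) {
            V v = find(key, table);
            return (v != null || oldTable == null) ? v : find(key, oldTable); // check both during migration
        }

        private void migrateSomeBuckets(int batch) {
            while (oldTable != null && batch-- > 0) {
                for (Node<K, V> n = oldTable[migrateIndex]; n != null; n = n.next) {
                    int i = index(n.key, table);
                    table[i] = new Node<>(n.key, n.value, table[i]);
                }
                if (++migrateIndex == oldTable.length) oldTable = null; // migration finished
            }
        }

        private int index(K key, Node<K, V>[] t) { return (key.hashCode() & 0x7fffffff) % t.length; }

        private V find(K key, Node<K, V>[] t) {
            for (Node<K, V> n = t[index(key, t)]; n != null; n = n.next) {
                if (n.key.equals(key)) return n.value;
            }
            return null;
        }
    }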

  1. Java HashMap

    The default initial capacity of HashMap is 16. You can specify a larger initial capacity to reduce the number of dynamic expansions

    The default maximum load factor is 0.75. When the number of elements in the HashMap exceeds 0.75 * capacity, an expansion is triggered, and each expansion doubles the capacity

    The underlying HashMap uses the zipper method to resolve conflicts. When the lists grow too long, HashMap performance degrades, so JDK 1.8 introduced red-black trees as a further optimization: when a list becomes too long (more than 8 entries by default), it is converted to a red-black tree to improve the performance of insertion, deletion, update, and lookup

    The HashMap hash function (capacity is the size of the hash table's underlying array):

    int hash(Object key) {
        int h = key.hashCode();
        return (h ^ (h >>> 16)) & (capacity - 1);
    }
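    The XOR folds the high 16 bits of the hash code into the low 16 bits so that they also influence the bucket choice, and the final & (capacity - 1) maps the result to an array index; this masking trick works because HashMap always keeps its capacity a power of two.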

    The hashCode() method returns the object's hash code. For example, the hashCode method of the String class:

    public int hashCode() {
      int var1 = this.hash;
      if (var1 == 0 && this.value.length > 0) {
        char[] var2 = this.value;
        // h = 31 * h + c for each character, cached in this.hash after the first call
        for (int var3 = 0; var3 < this.value.length; ++var3) {
          var1 = 31 * var1 + var2[var3];
        }
        this.hash = var1;
      }
      return var1;
    }
  2. Java LinkedHashMap

    The data in a hash table is scattered by the hash function and stored in no particular order, so a plain HashMap is unordered

    LinkedHashMap combines a hash table with a linked list so that elements can be traversed in insertion order or in access order

    // 10 is the initial capacity, 0.75 is the load factor, true means order by access time
    HashMap<Integer, Integer> m = new LinkedHashMap<>(10, 0.75f, true);
    m.put(3, 33);
    m.put(1, 11);
    m.put(2, 22);

    m.put(3, 26);
    m.get(2);

    for (Map.Entry e : m.entrySet()) {
      System.out.println(e.getKey());
    }
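    With access ordering enabled, the loop above prints the keys in the order 1, 3, 2: put(3, 26) and get(2) both count as accesses and move their entries to the tail of the list, while 1 is never touched again and stays at the front.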

    LinkedHashMap uses the same hash algorithm as HashMap but extends Entry (Node in JDK8)

    Each node gains before and after pointers, which link the entries into a doubly linked list used to maintain ordering

    If sorting by access order is set, the order of the linked list elements will be adjusted when the elements are accessed

  3. LRU cache

    The principle of the linked-list based LRU cache eviction algorithm:

    Maintain a linked list sorted by access time: the closer a node is to the head, the longer ago it was used. When the cache runs out of space, the head node is removed

    When a piece of data needs to be cached, first check whether it is already in the list. If it is not, append it to the tail; if it is, move it to the tail

    With a linked list alone, LRU cache operations are O(n); combining it with a hash table reduces the time complexity to O(1).

    Data is stored in a doubly linked list: each node has a predecessor pointer prev and a successor pointer next to link the data, plus an hnext pointer used for zipper-method chaining inside the hash table

    Thanks to the hash table's fast lookup, finding an element takes O(1); deleting a node from a doubly linked list is also O(1), so deleting an element is O(1) overall. A sketch along these lines is shown below.
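    A minimal sketch of such an LRU cache in Java. It follows the structure described above but, for brevity, uses the built-in HashMap for the O(1) lookups instead of a hand-rolled hash table with an hnext pointer; class and method names are illustrative:

    import java.util.HashMap;

    // LRU cache: HashMap for O(1) lookup + doubly linked list ordered by access time.
    // Head = least recently used, tail = most recently used, matching the description above.
    class LruCache<K, V> {
        private static class Node<K, V> {
            K key; V value; Node<K, V> prev, next;
        }

        private final int capacity;
        private final HashMap<K, Node<K, V>> map = new HashMap<>();
        private final Node<K, V> head = new Node<>();  // sentinel: LRU end
        private final Node<K, V> tail = new Node<>();  // sentinel: MRU end

        LruCache(int capacity) {
            this.capacity = capacity;
            head.next = tail;
            tail.prev = head;
        }

        V get(K key) {
            Node<K, V> node = map.get(key);            // O(1) lookup via the hash table
            if (node == null) return null;
            moveToTail(node);                          // accessed: move to the MRU end
            return node.value;
        }

        void put(K key, V value) {
            Node<K, V> node = map.get(key);
            if (node != null) {                        // already cached: update and move to tail
                node.value = value;
                moveToTail(node);
                return;
            }
            if (map.size() == capacity) {              // full: evict the head (least recently used)
                Node<K, V> lru = head.next;
                unlink(lru);
                map.remove(lru.key);
            }
            node = new Node<>();
            node.key = key;
            node.value = value;
            map.put(key, node);
            linkAtTail(node);
        }

        private void moveToTail(Node<K, V> node) { unlink(node); linkAtTail(node); }

        private void unlink(Node<K, V> node) {         // O(1) removal from the doubly linked list
            node.prev.next = node.next;
            node.next.prev = node.prev;
        }

        private void linkAtTail(Node<K, V> node) {
            node.prev = tail.prev;
            node.next = tail;
            tail.prev.next = node;
            tail.prev = node;
        }
    }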