Original address: www.iteye.com/topic/53946…

/**
 * @author annegu
 * @date 2009-12-02
 */

 

HashMap is a very common and widely used data type, and I recently spent some time reviewing how it works. There are already many articles about HashMap on the Internet, but this one is a summary of my own study, so I am posting it here to share and discuss with you.


1. Data structure of hashMap

To understand what a HashMap is, we must first look at its data structure. In Java, the two most basic structures are the array and the simulated pointer (the reference); all other data structures can be built from these two, and HashMap is no exception. A HashMap is actually a combination of an array and linked lists (commonly called a "linked list hash" in data-structure texts), as shown below (the horizontal row represents the array, and each vertical column represents the linked list hanging off one array element).


[Figure: HashMap structure — a horizontal array of buckets, each holding a vertical linked list of entries]
As you can see from the figure, the backbone of a HashMap is an array. When a HashMap is created, this array is initialized. Let's look at the Java code:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

 

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;
    ...
}

The Entry above is the element type stored in the array; each Entry holds a reference to the next Entry, which is how the linked list is formed.

When we put an element into a HashMap, we first use the key's hashcode to compute its position in the array (the index), and then place the element there. If other elements already occupy the same position, they are stored as a linked list at that position, with the newly added element at the head and the earliest one at the tail. When we get an element from a HashMap, we first compute the key's hashcode to find the corresponding position in the array, and then use the key's equals method to pick out the desired element from the linked list at that position.

If every position held only one element, a HashMap's get would be as efficient as possible. But the ideal is always beautiful, and reality is hard to bend: we have to find the array position from the key's hash value, and how that position is computed is the hash algorithm. Since the data structure of a HashMap combines an array with linked lists, we naturally want the elements to be distributed across positions as evenly as possible, ideally one element per position. Then, when the hash algorithm gives us a position, we immediately know the element there is the one we want, without traversing a list.
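The put/get behaviour described above can be sketched in a few lines. This is a simplified illustration of the chaining idea, not the real JDK source; the names `MiniMap` and `Node` are made up, and resizing and capacity rounding are omitted:

```java
import java.util.Objects;

// A minimal sketch of the bucket-array-plus-linked-list structure.
// Assumes the capacity passed in is a power of two.
class MiniMap<K, V> {
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Node<K, V>[] table;

    @SuppressWarnings("unchecked")
    MiniMap(int capacity) {
        table = (Node<K, V>[]) new Node[capacity];
    }

    private int indexFor(K key) {
        // Same trick as the JDK: mask the hashcode with (length - 1).
        return (key == null ? 0 : key.hashCode()) & (table.length - 1);
    }

    void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) { n.value = value; return; } // overwrite existing key
        }
        table[i] = new Node<>(key, value, table[i]); // new entry goes to the head of the list
    }

    V get(K key) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) return n.value; // equals picks the element out of the chain
        }
        return null;
    }

    public static void main(String[] args) {
        MiniMap<String, Integer> m = new MiniMap<>(16);
        m.put("a", 1);
        m.put("b", 2);
        System.out.println(m.get("a")); // 1
    }
}
```

Note how `get` mirrors `put`: the hashcode narrows the search to one bucket, and `equals` finishes the job inside the chain.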

So the first thing we might think of is taking the hashcode modulo the array length, which would distribute elements fairly evenly. However, the modulo operation is relatively expensive. Is there a faster, cheaper way? Java does it like this:

static int indexFor(int h, int length) {  
     return h & (length-1);  
}  

First compute the key's hashcode, then AND (&) it with (array length - 1). It looks simple, but there is more to it. For example, if the array length is 2 to the fourth power (16), the hashcode is ANDed with 16 - 1 = 15 (binary 1111). A HashMap is most efficient when the array size is a power of 2; let me use 2 to the fourth power as an example to explain why.
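A quick way to convince yourself: for a power-of-two length, `h & (length - 1)` keeps exactly the low bits of `h`, which is the same as the non-negative remainder `h % length`. A small check (the class name `IndexDemo` is illustrative):

```java
public class IndexDemo {
    // The JDK's indexFor, as shown above.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // For power-of-two lengths, the cheap AND mask agrees with the
        // modulo for every non-negative hash value.
        boolean same = true;
        for (int h = 0; h < 100_000; h++) {
            if (indexFor(h, 16) != h % 16) same = false;
            if (indexFor(h, 64) != h % 64) same = false;
        }
        System.out.println(same); // true
    }
}
```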



In the figure below, the two columns on the left use an array of length 16 (2 to the 4th) and the two on the right use an array of length 15. The two hashcodes are 8 and 9. With length 16 the mask is 1111, and 8 and 9 map to different slots. With length 15 the mask is 14 (1110), and ANDing 8 and 9 with 1110 produces the same result, which means they are located at the same array position: a collision. 8 and 9 end up on the same linked list, and a query for either must traverse that list, which reduces query efficiency. We can also see that when the array length is 15, the hashcode is ANDed with 14 (1110), so the last bit of the index is always 0, and positions 0001, 0011, 0101, 0111, 1001, 1011, and 1101 can never hold an element. Worse, this means the number of usable positions is much smaller than the array length, which further increases the chance of collisions and slows down queries!


[Figure: hashcodes 8 and 9 ANDed with mask 1111 (length 16) versus mask 1110 (length 15)]

Therefore, when the array length is 2 to the power of n, different keys are less likely to compute the same index, so the data is distributed evenly across the array; in other words, there is less chance of collision. Queries then seldom need to traverse the linked list at a position, so they are more efficient.
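The "unusable slots" observation can be verified directly. The sketch below (the names `MaskDemo` and `usableSlots` are illustrative) counts how many distinct indices the mask can produce for lengths 15 and 16:

```java
import java.util.BitSet;

public class MaskDemo {
    // Counts how many distinct bucket indices h & (length - 1) can
    // produce as h ranges over many hash values.
    static int usableSlots(int length) {
        BitSet used = new BitSet(length);
        for (int h = 0; h < 10_000; h++) {
            used.set(h & (length - 1));
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        // With length 15 the mask is 14 (binary 1110): the low bit of every
        // index is forced to 0, so odd slots can never hold an element.
        System.out.println(usableSlots(15)); // 8  (only the even indices 0..14)
        System.out.println(usableSlots(16)); // 16 (every slot reachable)
    }
}
```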

At this point, let's go back to the default array size in HashMap. The source code shows it is 16: why not 15, and why not 20? From the explanation above it should be clear: 16 is a power of 2, so with small amounts of data it produces fewer collisions between keys than 15 or 20 would, and therefore faster queries.



Therefore, when storing large amounts of data, it is a good idea to specify the HashMap's size as a power of 2 in advance. If the specified value is not a power of 2, it is rounded up to the smallest power of 2 greater than or equal to it, as the HashMap constructor shows:

// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
    capacity <<= 1;
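Extracted as a standalone method (the names `CapacityDemo` and `roundUp` are illustrative, not from the JDK), the loop behaves like this:

```java
public class CapacityDemo {
    // The same rounding loop as in the constructor excerpt above.
    static int roundUp(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(1000)); // 1024 (rounded up to the next power of 2)
        System.out.println(roundUp(16));   // 16   (already a power of 2, unchanged)
        System.out.println(roundUp(17));   // 32
    }
}
```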

 

As the number of elements in a HashMap increases, the probability of collision increases (because the array size is fixed). To keep queries efficient, the HashMap's array must therefore be enlarged. Array enlargement also occurs in ArrayList, so this is a common operation; many people have doubts about its performance, but think of the principle of amortization. The most expensive part of enlarging the array is that every element in the old array must have its position recalculated and be placed into the new array: this is the resize.

So when does a HashMap resize? The array is expanded when the number of elements exceeds array size * loadFactor. The default loadFactor is 0.75 and the default array size is 16, so by default the array is doubled to 2*16=32 once the map holds more than 16*0.75=12 elements, and every element's position is recalculated. Since this is an expensive operation, if we already know roughly how many elements the HashMap will hold, presetting its capacity can effectively improve its performance. For example, with 1000 elements, new HashMap(1000) looks reasonable, but as noted above HashMap rounds 1000 up to 1024 anyway. Even new HashMap(1024) is not enough, because 1024*0.75 = 768 < 1000, so a resize would still occur. We need capacity * 0.75 > 1000, so new HashMap(2048) is the right choice: it avoids the resize entirely.

Having covered put, let's look at get. To get an element, we first compute the key's hashcode to find the corresponding position in the array, and then use the key's equals method to find the desired element in the linked list at that position. So hashCode and equals are the two key methods for locating an element. The key of a HashMap can be any type of object, such as a User. To ensure that two Users with the same attributes have the same hashcode, we need to override the hashCode method.
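The arithmetic behind choosing 2048 rather than 1024 for 1000 entries can be checked directly (the names `ThresholdDemo` and `threshold` are illustrative):

```java
public class ThresholdDemo {
    // Resize triggers once size exceeds capacity * loadFactor, so to hold
    // n entries without resizing we need capacity * loadFactor >= n.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(1024, 0.75f)); // 768  -> too small for 1000 entries, resize would occur
        System.out.println(threshold(2048, 0.75f)); // 1536 -> holds 1000 entries with no resize
    }
}
```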
For example, we can tie the hashCode computation to the User object's id. Then, as long as two User objects have the same id, their hashcodes agree, and they map to the same position in the HashMap's array. If there are multiple elements at that position, the key's equals method is used to find the right one in the linked list, so overriding hashCode alone is not enough: equals must be overridden as well, for example to consider two Users equal when their ids are equal. The equals method must satisfy:

(1) Reflexivity: a.equals(a) must be true.

(2) Symmetry: if a.equals(b) is true, then b.equals(a) must also be true.

(3) Transitivity: if a.equals(b) is true and b.equals(c) is true, then a.equals(c) must also be true.

By overriding the equals and hashCode methods of the key class, we can use any business object as a map key (if we really need to).

Summary: this article mainly described the structure of HashMap, the implementation of its hash function, and the characteristics of that implementation. It also described the root cause of the performance cost of resize in HashMap, and the basic requirements for using an ordinary domain object as a key. The implementation of the hash function in particular can be called the essence of the whole HashMap; only by truly understanding it can we say we have some understanding of HashMap. This is the first HashMap article, mainly covering the data structure and the algorithm for computing the hash. Annegu will write a second post on LinkedHashMap and LRUHashMap. Consider this a preview, haha ~
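Putting these rules together, here is a hypothetical User key whose hashCode and equals are both based on its id (all class and field names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// A hypothetical business object used as a map key: hashCode and equals
// are both derived from id, so two instances with the same id behave as
// the same key.
class User {
    private final long id;
    private final String name;

    User(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public int hashCode() {
        // Consistent with equals: same id, same hashcode, same bucket.
        return Long.hashCode(id);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;            // reflexivity
        if (!(o instanceof User)) return false;
        return id == ((User) o).id;            // symmetric and transitive: both sides compare the same ids
    }
}

public class UserKeyDemo {
    public static void main(String[] args) {
        Map<User, String> roles = new HashMap<>();
        roles.put(new User(42, "annegu"), "admin");
        // A different instance with the same id finds the same entry.
        System.out.println(roles.get(new User(42, "anyone"))); // admin
    }
}
```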