Unlike the List implementation classes, HashMap is a common focus of study. The following analysis is based on JDK 1.8.0_102.

Internal storage

Internally, a HashMap stores its data in an array (the bucket table). Each element of the array is a Node, which implements the Map.Entry interface and holds four fields: hash, key, value, and next. When next is not null, it points to another entry that maps to the same bucket, as shown in the figure:

(Figure: the bucket array, with colliding Node entries chained through their next references)
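For reference, here is a simplified sketch of that entry class, modeled on the JDK 8 source. The real Node is a static nested class inside HashMap and has additional methods (equals, hashCode, toString); this sketch keeps only the fields and the basic accessors:

import java.util.Map;

// Simplified sketch of HashMap's internal entry, modeled on the JDK 8 source.
class Node<K, V> implements Map.Entry<K, V> {
    final int hash;      // spread hash of the key, cached for lookups and resize
    final K key;
    V value;
    Node<K, V> next;     // next node chained in the same bucket, or null

    Node(int hash, K key, V value, Node<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }

    public K getKey() { return key; }
    public V getValue() { return value; }
    public V setValue(V newValue) {
        V old = value;
        value = newValue;
        return old;
    }
}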

Capacity and loadFactor

Simply put, capacity is the size of the bucket array, and loadFactor is the maximum proportion of it that may be filled. If the number of entries in the map (not the number of occupied slots in the array) exceeds capacity * loadFactor, the table must be expanded, doubling the current size of the bucket array. In addition, the initial capacity is always rounded up to a power of 2 (the smallest power of 2 greater than or equal to the requested capacity), so the bucket array size is a power of 2 both before and after every expansion. This is important, because it makes resize very convenient.

Tips: The default capacity is 16 and the default loadFactor is 0.75. If you need to tune them, consider your specific usage scenario:

  • If iteration performance matters, do not set capacity too large or loadFactor too small; otherwise the many empty slots in the bucket array waste time during iteration
  • If random-access (get) performance matters, do not set loadFactor too high; otherwise lookups collide frequently and the time complexity degrades toward O(n)
  • If the data grows rapidly, or its size is predictable, set the initial capacity proactively when creating the HashMap (see the sketch below)
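As an illustration of the last point, here is a minimal sketch of pre-sizing a HashMap for a known number of entries. The helper name presized and the numbers are made up for this example; only the arithmetic (capacity * loadFactor as the resize threshold) comes from the discussion above.

import java.util.HashMap;
import java.util.Map;

public final class PreSizingDemo {
    // Hypothetical helper: create a HashMap able to hold expectedEntries
    // without resizing, assuming the default loadFactor of 0.75.
    static <K, V> Map<K, V> presized(int expectedEntries) {
        // A resize is triggered once size exceeds capacity * loadFactor, so ask
        // for expectedEntries / 0.75 (+1 to be safe); HashMap rounds the request
        // up to the next power of two internally.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        // With the defaults (capacity 16, loadFactor 0.75) the 13th insertion
        // already triggers a resize; pre-sizing for 1000 entries avoids all of
        // the intermediate resizes.
        Map<String, Integer> m = presized(1000);
        for (int i = 0; i < 1000; i++) {
            m.put("key-" + i, i);
        }
        System.out.println(m.size()); // 1000
    }
}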

Hashing and positioning

As the API's designers, we cannot assume that users implement a good hashCode method, so HashMap re-hashes the value returned by hashCode:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

The hashCode method

Note that key.hashCode() is invoked polymorphically: each key class supplies its own implementation. The focus here, however, is on the hash method.

Implementation of the hash method and index calculation

As mentioned earlier, when get and put compute the bucket index, the key's hashCode is first passed through hash, and the index is then computed from the resulting hash value, as shown in the following figure:

(Figure: hashCode() -> hash() -> index calculation flow)

A look at the hash method's source shows what it essentially does: the high 16 bits are left unchanged, and the low 16 bits are XORed with the high 16 bits.

Javadoc says this:

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don't benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.

Because the table length n is always a power of 2, the index can be computed with a bitwise AND (&) instead of the modulo operator (%):

(n - 1) & hash

The designers considered this masking alone to be prone to collisions. Why? Consider n - 1 = 15, i.e. binary 1111: only the low 4 bits of the hash actually take part in the index calculation, so any two hashes that differ only in their higher bits end up in the same bucket.
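A small demonstration of this point (the hash values and table size below are made up for illustration; spread mirrors the XOR step of the hash method shown earlier):

public final class MaskCollisionDemo {
    // Same spreading step as HashMap's hash(): XOR the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;            // table length, a power of two
        int mask = n - 1;      // binary 1111: only the low 4 bits select the bucket

        // Two hash codes that differ only in their high bits...
        int h1 = 0x00010005;
        int h2 = 0x00030005;

        // ...collide when the raw hashCode is masked directly:
        System.out.println((h1 & mask) + " vs " + (h2 & mask));                  // 5 vs 5

        // ...but land in different buckets once the high bits are spread downward:
        System.out.println((spread(h1) & mask) + " vs " + (spread(h2) & mask));  // 4 vs 6
    }
}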

Therefore, the designers settled on a compromise (weighing speed, utility, and quality): XOR the high 16 bits into the low 16 bits. They also explain that, since most hashCode implementations are already reasonably well distributed, and large sets of collisions are handled by O(log n) trees anyway, a single XOR is the cheapest way to reduce systematic loss while letting the high bits, which would otherwise never participate in the index calculation (the table length being relatively small), have some effect.

I am not entirely convinced, though, that plain masking is all that collision-prone. Without the XOR, the hash distribution is uniform and extremely simple; after XORing the high 16 bits into the low 16 bits it becomes somewhat more complex, closer to random, yet still uniform. My guess is that the designers argue from practical usage: key distributions typically follow the principle of locality, so the low bits are more likely to coincide than the XORed values are, and the transform therefore lowers the probability of collision.

Collisions

Even though we try to avoid collisions to keep HashMap fast, they do occur when put is called. The collision rate is said to be quite high, with collisions reportedly starting at an average load of around 10%. HashMap handles colliding nodes with separate chaining (open hashing).

The old entry's reference is assigned to the new entry's next field, and the new entry takes its place in the bucket. In other words, the bucket holds a linked list, and conflicting nodes are inserted at its head. Inserting a new entry therefore does not require traversing the list, so insertion is O(1); but if the list grows too long, query performance degrades to O(n). Java 8 adds a threshold on the list length: once it is exceeded, the list is converted into a red-black tree, bringing query time down to O(log n) and improving performance when the list is long.

Erratum: a reader contacted me online to point out an error here. In Java 8, a colliding insert traverses to the end of the list.

This passage was based on my earlier analysis of the Java 7 source. I assumed that, apart from the red-black tree, the logic was basically the same, and copied that analysis over, which led to the error above. In hindsight, the Java 7 analysis was not quite right either: even with head insertion, the list still has to be traversed before inserting (to check for an existing equal key), which takes O(n) time. The other parts of this article that quote source code were written directly against the source and can be read with confidence. A reminder to myself: stay thorough, stay skeptical. The corrected analysis follows.

Start with the entry at that bucket and iterate: if an entry with the same key is found, its value is replaced; if the end of the list is reached without finding one, the new entry is appended to the last node's next. Lookups work the same way. Without any optimization, both insertion and lookup are O(n). Java 8 adds a threshold on the list length: once it is exceeded, the list is converted into a red-black tree, reducing insertion and lookup to O(log n) and improving performance when the list is long.
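Here is a simplified sketch of that insertion logic for the list case, reusing the Node sketch from the first section. It is not the actual putVal implementation: treeification, resize, and the TreeNode branch are omitted.

// Sketch only: insert (or replace) a key in one bucket's linked list.
static <K, V> void putIntoBucket(Node<K, V>[] table, int hash, K key, V value) {
    int i = (table.length - 1) & hash;
    Node<K, V> head = table[i];
    if (head == null) {
        table[i] = new Node<>(hash, key, value, null);    // empty bucket: O(1)
        return;
    }
    Node<K, V> e = head;
    while (true) {
        if (e.hash == hash && (e.key == key || (key != null && key.equals(e.key)))) {
            e.value = value;                              // same key found: replace the value
            return;
        }
        if (e.next == null) {
            e.next = new Node<>(hash, key, value, null);  // reached the tail: append
            return;
        }
        e = e.next;                                       // overall traversal is O(n)
    }
}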

When get is called, it locates the bucket and then walks the linked list or red-black tree, comparing keys until it finds the desired element:

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        if ((e = first.next) != null) {
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

The classic pattern for deciding element equality takes advantage of the short-circuit behaviour of boolean expressions: compare the hash values first; if they are equal, compare with ==; if == says they are not the same reference, fall back to equals. The hash is computed in advance, so comparing it is cheapest; == is a direct reference comparison and is also cheap; equals is the step most likely to cost performance, e.g. String.equals takes O(n) time, where n is the length of the string. Keep this order of checks in mind; it is useful for understanding the collision-handling source code.

When using a HashMap and overriding the hashCode and equals methods, two points are important:

  • After overriding, make sure that whenever equals judges two objects equal, hashCode returns the same value for both.
  • For a class used as a key, make sure hashCode returns the same value at put time and at get time, and that equals behaves consistently as well (in practice, prefer immutable keys; see the example below).
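A minimal example of a key class that satisfies both points. The class Point is made up for illustration: it is immutable, and hashCode and equals are overridden consistently.

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Immutable key class: equal objects always return the same hashCode,
// and neither method's result can change between put and get.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

class PointDemo {
    public static void main(String[] args) {
        Map<Point, String> m = new HashMap<>();
        m.put(new Point(1, 2), "found");
        // A different instance with the same field values finds the entry,
        // because hashCode and equals agree and the fields never change.
        System.out.println(m.get(new Point(1, 2)));   // found
    }
}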

resize

Resize is the hardest part of a HashMap to understand.

Resize happens during put, when the number of entries exceeds capacity * loadFactor (with the defaults, once more than 16 * 0.75 = 12 entries are stored). In simple terms, the bucket array is doubled, the indices are recomputed, and the nodes are placed into the new bucket array.

Javadoc says this:

Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.

That is, resize happens when the threshold is exceeded, and because the table grows by powers of two, each element either stays at its original index or moves by a power-of-two offset.

How should we understand this? Take expanding from 16 to 32 as an example; the index calculation changes as follows:

(Figure: index calculation before and after expanding the table from 16 to 32)

If the old bucket size is n = 2^k, the index of an element in the new table is (2^(k+1) - 1) & hash. Since 2^(k+1) - 1 = 2^k + (2^k - 1), the new mask is simply the old mask 2^k - 1 with one extra high bit (marked red in the figure) added (recall that the old length n is also a power of 2), and that extra bit of the hash is either 0 or 1. For example, with oldCap = 16, a node whose hash ends in 00101 stays at index 5, while one whose hash ends in 10101 moves to index 5 + 16 = 21. As shown in the figure:

(Figure: the extra mask bit determines whether a node keeps its index or moves by oldCap)

So during a resize we do not need to recompute each node's position from scratch: we only check whether the newly significant bit of the original hash is 0 or 1. If it is 0, the node keeps its index; if it is 1, it moves to "old index + oldCap". The resize code is fairly long, so it is not reproduced here. Below is a schematic of a resize from 16 to 32:

(Figure: schematic of a resize from 16 to 32, splitting each bucket into a "low" and a "high" list)

This design is quite clever. The newly significant bit can be regarded as random, so the resize process spreads the previously colliding nodes evenly across the new buckets.
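Here is a sketch of that split for a single bucket, again reusing the Node sketch from the first section. The real resize() relinks the existing nodes and also handles trees and thresholds; this version copies nodes purely to keep the illustration simple.

// Sketch only: split one bucket's list when the table grows from oldCap to 2 * oldCap.
static <K, V> void splitBucket(Node<K, V>[] oldTab, Node<K, V>[] newTab, int index, int oldCap) {
    Node<K, V> loHead = null, loTail = null;   // nodes that stay at `index`
    Node<K, V> hiHead = null, hiTail = null;   // nodes that move to `index + oldCap`
    for (Node<K, V> e = oldTab[index]; e != null; e = e.next) {
        Node<K, V> copy = new Node<>(e.hash, e.key, e.value, null);
        if ((e.hash & oldCap) == 0) {          // the newly significant bit is 0
            if (loTail == null) loHead = copy; else loTail.next = copy;
            loTail = copy;
        } else {                               // the newly significant bit is 1
            if (hiTail == null) hiHead = copy; else hiTail.next = copy;
            hiTail = copy;
        }
    }
    newTab[index] = loHead;
    newTab[index + oldCap] = hiHead;
}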


Reference:

  • Java HashMap working principle and implementation | Yikun

This article is published under the Creative Commons Attribution-ShareAlike 4.0 International License. You are welcome to reproduce, adapt, or use it for commercial purposes, provided that attribution and a link to this article are preserved.