Preface

In this article we read through the HashMap source code to understand how it works. The following questions give us something concrete to look for:

  1. Why is threshold 12 when the HashMap no-argument constructor is used?
  2. Why does the load factor default to 0.75 instead of something else?
  3. Why is the bucket array size always a power of 2?
  4. How is the bucket array index determined?
  5. How is the put method implemented, including capacity expansion?
  6. How does the hash algorithm distribute results evenly? How is the balance between time and space achieved?
  7. What does the red-black tree structure look like?
  8. When does a linked list need to be converted to a red-black tree?
  9. Why is the threshold for converting a list to a red-black tree 8?
  10. When does a red-black tree convert back to a linked list?

The source code below is from JDK 1.8.

Understanding the basic class structure of HashMap

The structure of HashMap is as follows: array + list + red-black tree

Fields

public class HashMap<K,V> extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable {

    /** Default initial capacity - must be a power of 2. */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * Maximum capacity, used if a higher value is implicitly specified by either
     * of the constructors with arguments. Must be a power of 2 <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** The load factor used when none is specified in the constructor. */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * The bin count threshold for using a tree rather than a list. When an element
     * is added to a bin that already has at least this many nodes, the bin is
     * converted to a tree. The value must be greater than 2 and should be at least
     * 8 to mesh with the assumptions in tree removal about shrinking back to a
     * plain list.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a resize.
     * Should be less than TREEIFY_THRESHOLD, and at most 6 to mesh with shrinkage
     * detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

    /**
     * The smallest table capacity for which bins may be treeified. (Otherwise the
     * table, i.e. the array, is resized if there are too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts between resizing
     * and treeification thresholds.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;

    /**
     * The table, initialized on first use and resized as necessary. When allocated,
     * its length is always a power of two. (We also tolerate length zero in some
     * operations to allow bootstrapping mechanics that are currently not needed.)
     *
     * This is the array part of the array + linked-list structure; its slots are
     * also called buckets.
     */
    transient Node<K,V>[] table;

    /**
     * Holds the cached entrySet(). Note that AbstractMap fields are used for
     * keySet() and values().
     */
    transient Set<Map.Entry<K,V>> entrySet;

    /** The number of key-value mappings contained in this map. */
    transient int size;

    /**
     * The number of times this HashMap has been structurally modified. Structural
     * modifications are those that change the number of mappings in the HashMap or
     * otherwise modify its internal structure (e.g., rehash). This field is used to
     * make iterators on collection views of the HashMap fail fast.
     * (See ConcurrentModificationException.)
     */
    transient int modCount;

    /**
     * The next size value at which to resize (capacity * load factor).
     *
     * (The javadoc description is true upon serialization. Additionally, if the
     * table array has not been allocated, this field holds the initial array
     * capacity, or zero signifying DEFAULT_INITIAL_CAPACITY.)
     *
     * @serial
     */
    int threshold;

    /**
     * The load factor for the hash table.
     *
     * @serial
     */
    final float loadFactor;
}

Inner classes – Node

The base hash bin node, used for most entries. (See the TreeNode subclass below, as well as the Entry subclass in LinkedHashMap.)

You can see that each node holds the key and value, a reference to the next node, and the node's hash. It is a typical singly linked list node.

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }

    public final K getKey()        { return key; }
    public final V getValue()      { return value; }
    public final String toString() { return key + "=" + value; }

    public final int hashCode() {
        return Objects.hashCode(key) ^ Objects.hashCode(value);
    }

    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    public final boolean equals(Object o) {
        if (o == this)
            return true;
        if (o instanceof Map.Entry) {
            Map.Entry<?,?> e = (Map.Entry<?,?>)o;
            if (Objects.equals(key, e.getKey()) &&
                Objects.equals(value, e.getValue()))
                return true;
        }
        return false;
    }
}

Inner classes – TreeNode

TreeNode extends LinkedHashMap.Entry, which in turn extends Node, so it can be used either as a regular node or as a tree node.

static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    // Previous node in the linked list, needed to unlink next upon deletion.
    // In other words, the list structure is preserved even after treeification,
    // which makes it easy to find the preceding node when an element is removed.
    TreeNode<K,V> prev;
    boolean red;
    TreeNode(int hash, K key, V val, Node<K,V> next) {
        super(hash, key, val, next);
    }
}

The constructor

Construct an empty HashMap with the specified initial capacity and load factor.

As you can see, this constructor is mostly simple parameter validation. The core is computing threshold using the tableSizeFor method, which we'll discuss next.

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    // Returns a power of two sized for the given target capacity.
    this.threshold = tableSizeFor(initialCapacity);
}

An overload of the constructor above:

public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

In the no-argument constructor, all the other fields keep their default values, and it is not obvious from here how threshold gets its value. This is the constructor we use most often; the question posed in the preface is answered later when we look at the resize method.

public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

The tableSizeFor method

Returns a power-of-two size for the given target capacity.

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

According to the method's comment, it returns the smallest power of 2 greater than or equal to cap. So how does HashMap compute it?

Set the implementation aside for a moment and think about what the method has to do: given 9 it should return 16, given 7 it should return 8, and given 8 it should return 8. A naive way to write this ourselves would be to start from 1 and keep doubling until the result reaches or passes the target, as in the sketch below.
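For comparison, a minimal sketch of that naive loop (not the JDK code, and ignoring the overflow and MAXIMUM_CAPACITY handling the real method needs) could look like this:

// Naive "keep doubling" version for comparison only - NOT the JDK implementation.
static int naiveTableSizeFor(int cap) {
    int n = 1;
    while (n < cap) {   // double until the result reaches or passes the target
        n <<= 1;
    }
    return n;           // e.g. 9 -> 16, 7 -> 8, 8 -> 8
}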

Obviously that loop is relatively inefficient, while HashMap's bit manipulation is much faster. In tableSizeFor, n is first decremented by 1, then OR-ed with n unsigned-right-shifted by 1 bit, then again with shifts of 2, 4, 8 and 16 bits, and finally 1 is added.

  • Why the OR operation? It keeps every 1 bit that has already been set.
  • Why an unsigned right shift? So that the sign bit is not propagated and the capacity cannot become negative.
  • Why shift an int by 1, 2, 4, 8 and 16 bits? An int is 4 bytes, i.e. 32 bits, so shifting by at most half the width is enough for the highest 1 bit to be copied into every bit below it.
  • Why subtract 1 at the beginning? Mainly so that a value that is already an exact power of 2 (such as 8) is not doubled by the final +1 carry.

To summarize: subtract 1, propagate the highest 1 bit into all lower bits, then add 1 to carry, which yields the smallest power of 2 greater than or equal to the given number.
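As a concrete trace (taking cap = 9 as an example), the bit manipulation proceeds like this:

// Worked trace of tableSizeFor(9); values shown in binary, low 8 bits only.
// n = cap - 1   = 8  -> 0000 1000
// n |= n >>> 1       -> 0000 1100
// n |= n >>> 2       -> 0000 1111
// n |= n >>> 4       -> 0000 1111   (no further change)
// n |= n >>> 8       -> 0000 1111
// n |= n >>> 16      -> 0000 1111
// return n + 1  = 16 -> 0001 0000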

The author's question

Why is the n < 0 check done at the end instead of at the beginning, where it could skip the bit operations entirely? At first an early check seems more logical, but in practice developers almost never pass a cap of 0, which is the only case in which n ends up negative (negative capacities are already rejected by the constructor), so moving the check would gain very little.

The hash method

Computes key.hashCode() and spreads (XORs) the higher bits of the hash onto the lower bits. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask would always collide. (Known examples include sets of Float keys holding consecutive whole numbers in small tables.) So a transform is applied that spreads the influence of the higher bits downward. There is a trade-off between the speed, utility and quality of this bit-spreading: because many common sets of hashes are already reasonably distributed (and so do not benefit from spreading), and because trees are used to handle large sets of collisions in bins, the implementation just XORs some shifted bits in the cheapest possible way, to reduce systematic loss and to incorporate the influence of the highest bits, which would otherwise never be used in index calculations because of the table bounds.

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

h = key.hashCode() is XOR-ed with h unsigned-right-shifted by 16 bits so that, even when the bucket array is small, the high bits of the hash still influence which bucket index is chosen.

If the key is null, the hash is 0, so the index is always 0.

With the default bucket array size of 16, only the low 4 bits of the hash would otherwise take part in the index calculation.
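A minimal sketch of the spreading step (the hashCode value below is made up purely for illustration):

// Made-up demonstration of how the high bits are mixed into the index calculation.
static int demoIndex() {
    int h = 0x12345678;           // pretend this is key.hashCode()
    int hash = h ^ (h >>> 16);    // XOR the high 16 bits into the low 16 bits
    int n = 16;                   // default table size
    return (n - 1) & hash;        // only the low 4 bits pick the bucket
}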

The get method

To get a value by key, the hash method is called to compute the hash, and then getNode is called to find the node. getNode is described below.

public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

The getNode method

Implements Map.get and related methods.

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        if ((e = first.next) != null) {
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

The bucket index is located with (n - 1) & hash. Because n is always a power of 2, the binary form of n - 1 is all 1s below its highest bit, so AND-ing it with the hash is equivalent to hash % n while using as many bits of the hash as possible, which spreads the elements out more evenly. If n were not a power of 2, say n = 14, the mask would be 13, whose binary value (using one byte here) is 0000 1101; a hash of 0000 1111 would give 0000 1101, and because of the 0 bit in the mask, indexes such as 2, 3, 6 or 7 could never be produced, leaving some buckets permanently unused.
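Here is a small sketch (the hash values are made up for illustration) showing the equivalence when n is a power of 2, and the skipped buckets when it is not:

// Demonstration that (n - 1) & hash equals hash % n only when n is a power of 2.
public class IndexDemo {
    public static void main(String[] args) {
        int hash = 0b0000_1111;              // arbitrary example hash (15)
        int n = 16;                          // power of 2: mask is 0000 1111
        System.out.println((n - 1) & hash);  // 15
        System.out.println(hash % n);        // 15 -> same result

        int m = 14;                          // not a power of 2: mask is 0000 1101
        System.out.println((m - 1) & hash);  // 13, while hash % 14 would be 1
    }
}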

The rest is straightforward: if the bucket holds a linked list, it is traversed node by node; if it holds a tree, the lookup is delegated to getTreeNode. For red-black trees, see "Red-black Tree Understanding and Manual Implementation".

The put method

Associates the value with the key; if the map already contains a mapping for the key, the old value is replaced.

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
/**
 * Implements Map.put and related methods.
 *
 * @param hash hash for the key, obtained from the hash method
 * @param key the key
 * @param value the value to put
 * @param onlyIfAbsent if true, don't change an existing value
 * @param evict if false, the table is in creation mode
 * @return previous value, or null if none
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // If the bucket array is null or its length is 0, it has not been
    // initialized yet, so call resize() to create it.
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // If the slot computed from the hash is empty, simply create a new node there.
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // If the first node has the same hash and the same key, there is no
        // need to traverse the list or the tree.
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // If the first node is a tree node, look for an existing node with the
        // same key in the tree; if found it is returned, otherwise a new node
        // is created and inserted into the tree.
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // Otherwise traverse the linked list.
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // If the list has reached the treeify threshold, convert it
                    // to a red-black tree.
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // If a node with the same key is found, stop traversing.
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // The key already exists.
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            // put passes onlyIfAbsent = false, so the value is replaced.
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    // Record the structural modification.
    ++modCount;
    // If the size exceeds the threshold, expand the table.
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

The general flow of put is: compute the hash of the key; initialize the table if needed; locate the bucket from the hash; if the slot is empty create a new node, otherwise insert into the existing list or tree; if a list reaches the treeify threshold convert it to a red-black tree; finally, if size exceeds threshold, resize.

The more complex parts of put are the expansion (including initialization) and inserting nodes into the red-black tree (including converting a list to a tree). Red-black trees are not covered in this article; for more information see "Red-black Tree Understanding and Manual Implementation".

Now let’s look at the resize method.

The resize method

Initializes the table or doubles its size. If the table is null, it is allocated according to the initial capacity target held in the threshold field (DEFAULT_INITIAL_CAPACITY is 16, so when a HashMap created with the no-argument constructor receives its first put, the array is initialized with size 16 and threshold becomes 16 * 0.75 = 12). Otherwise, because the capacity grows by a power of two, the elements in each bucket must either stay at the same index or move by a power-of-two offset (oldCap) in the new table. That is the benefit of the capacity being a power of 2.

final Node<K,V>[] resize() {
    // Take the current (old) table.
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // If the old capacity has already reached the maximum, stop expanding
        // and just raise the threshold to Integer.MAX_VALUE.
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Otherwise double the capacity; if the old capacity was at least the
        // default (16) and the doubled value is still below the maximum,
        // double the threshold as well.
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    // Initial capacity was placed in threshold by the parameterized constructor.
    else if (oldThr > 0)
        newCap = oldThr;
    else {
        // Zero initial threshold: the no-argument constructor was used, so use
        // the default capacity and compute the initial threshold from it.
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // newThr is still 0 when the doubled capacity reached the maximum, when the
    // old capacity was below the default 16, or when the parameterized
    // constructor was used; compute it from newCap * loadFactor.
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    // Set the new threshold.
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    // Install the new array.
    table = newTab;
    if (oldTab != null) {
        // Move every node of the old table into the new one.
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                // A single node: just recompute its index in the new table.
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                // A tree: the tree may need to be split (and possibly untreeified).
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    // A linked list: split it into a "low" list that stays at
                    // index j and a "high" list that moves to index j + oldCap.
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        // The hash bit at position oldCap decides which list a
                        // node goes to.
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    // Put the low list at the original index j.
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    // Put the high list at index j + oldCap.
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

The author's question

Why is the final value of threshold determined in the resize method rather than in the constructor? At first this seemed to hurt readability, but it actually fits the single-responsibility idea: resize is responsible both for initializing the table and for expanding it, so handling the threshold uniformly there is easier to maintain.

Differences from JDK 1.7

Below is the element-transfer code used during expansion in JDK 1.7.

void transfer(Entry[] newTable) {
    Entry[] src = table;                    // src references the old Entry array
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {  // iterate over the old Entry array
        Entry<K,V> e = src[j];              // take each element of the old array
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity); // recompute the index of every element
                e.next = newTable[i];       // head insertion
                newTable[i] = e;            // place the element into the new array
                e = next;
            } while (e != null);
        }
    }
}

You can see that the index of every element is recomputed with indexFor, and newTable[i] = e performs head insertion, so the original order within a bucket is not preserved and the list ends up reversed.

JDK 1.8, in the resize method shown above, makes the following adjustments:

Instead of recomputing the index for every element, it checks whether (e.hash & oldCap) is 0: if so the element keeps its old index, otherwise it moves to j + oldCap. The performance gain is modest, since all this saves is the full index recomputation per node. At the same time, tail insertion is used instead of head insertion, so the original insertion order within a bucket is preserved as far as possible, although a HashMap is unordered anyway, so this matters little.

The change of index works as follows: because the capacity doubles, the new mask newCap - 1 has exactly one more 1 bit than the old mask, and that extra bit sits at the position with value oldCap. So for each node, either that bit of its hash is 0 and its index is unchanged, or that bit is 1 and its new index is the old index plus oldCap. In JDK 1.7 the index is fully recomputed, but the conclusion is the same: the recomputed index is either unchanged or equal to the old index + oldCap.
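A minimal sketch (with made-up hash values) of the split rule used during resize:

// A node either keeps its index or moves by exactly oldCap, depending on the
// bit of its hash at position oldCap.
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        int hash1 = 0b0_0101;                     // bit with value 16 is 0
        int hash2 = 0b1_0101;                     // bit with value 16 is 1

        System.out.println((oldCap - 1) & hash1); // old index 5
        System.out.println((newCap - 1) & hash1); // new index 5  (unchanged)

        System.out.println((oldCap - 1) & hash2); // old index 5
        System.out.println((newCap - 1) & hash2); // new index 21 (5 + oldCap)
    }
}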

Why is threshold 12 when the HashMap no-argument constructor is used?

The answer is in the resize method called from put. DEFAULT_INITIAL_CAPACITY is 16, so when a HashMap created with the no-argument constructor receives its first put, the table is initialized with size 16 and threshold becomes 16 * 0.75 = 12.

See "The resize method" section above for the source.

Why does the load factor default to 0.75 instead of something else?

Note this comment from the JDK 1.7 source:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

In other words, the default value of 0.75 is a good trade-off between space and time. Higher values reduce space overhead but increase lookup cost (affecting most HashMap operations, including get and put). When setting the initial capacity, the expected number of entries and the load factor should be taken into account so as to minimize rehashing. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash will ever occur.

This means:

  • If the load factor were 0.5 with the default capacity of 16, threshold would be 0.5 * 16 = 8, so the table would expand to 32 at 8 elements, to 64 at 16, to 128 at 32, and so on; the further it grows, the more space is wasted. If the load factor were 1 with the default capacity of 16, space consumption would drop, but the table would only expand once completely full, so collisions would pile up and the efficiency of put and get would suffer. A worked example of the threshold arithmetic is sketched after this list.
  • If the number of elements in the map can be estimated in advance, configure the initial capacity so that the resulting threshold covers it and unnecessary rehashing is avoided.
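A small sketch of the arithmetic (the numbers are only for illustration):

// Threshold arithmetic for different load factors, plus pre-sizing for an
// expected number of entries (the variable names here are hypothetical).
public class LoadFactorDemo {
    public static void main(String[] args) {
        int cap = 16;
        // threshold = capacity * load factor
        System.out.println((int)(cap * 0.5f));   // 8:  would expand after only 8 entries
        System.out.println((int)(cap * 0.75f));  // 12: the default trade-off
        System.out.println((int)(cap * 1.0f));   // 16: would expand only when completely full

        // If about 100 entries are expected, sizing as expected / loadFactor + 1 keeps
        // the threshold above 100, so no rehash ever happens:
        int expected = 100;
        int initialCapacity = (int)(expected / 0.75f) + 1;  // 134, rounded up to 256 internally
        java.util.HashMap<String, Integer> map = new java.util.HashMap<>(initialCapacity);
    }
}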

Why exactly 0.75?

Poisson distribution. The implementation notes in the JDK source model the number of nodes per bin as a Poisson distribution and, assuming the default load factor of 0.75, arrive at a parameter of about 0.5; see the treeify-threshold question below.

Why is the bucket array size always a power of 2?

  • For index calculation, see the getNode method in the get section above.
  • For capacity expansion, see "The resize method" section above.

How is the bucket array index determined?

See the getNode method in the get section above.

How is the put method implemented, including capacity expansion?

See "The put method" section above.

How does the hash algorithm distribute results evenly? How is the balance between time and space achieved?

First, the key's hashCode method should itself be sufficiently well distributed (a good hash algorithm reduces collisions). On top of that, HashMap mixes the high bits into the low bits so that the high bits are not excluded when the index is computed. Even with a good hash, computing the index essentially takes the hash modulo the bucket size (via the AND with size - 1), so the larger the bucket array, the wider the range of possible indexes and the more even the distribution; but a larger array costs more memory. HashMap balances the two with a threshold, expanding only when it is exceeded.

What does the red-black tree structure look like?

See Red-black Tree Understanding and Manual Implementation.

When does a linked list need to be converted to a red-black tree?

A bucket's list is converted when it reaches TREEIFY_THRESHOLD (8) nodes, provided the table has at least MIN_TREEIFY_CAPACITY (64) slots. Traversing a linked list is O(n) while a red-black tree lookup is close to O(log n), so the tree wins once the list gets long. At the same time, a red-black tree is cheaper to maintain than an AVL tree.

Why is the threshold for converting a list to a red-black tree 8?

Because the number of nodes per bucket roughly follows a Poisson distribution, the probability of a list reaching 8 nodes is tiny, so conversion to a red-black tree only happens once a list has at least 8 nodes. Even though a red-black tree is more expensive to maintain than a list, it keeps lookups fast in the rare case where a list really does become that long; below 8 nodes, a list is already fast enough.
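As a sketch of the numbers (assuming, as the implementation notes in the JDK source do, a Poisson distribution with parameter 0.5 for the expected number of nodes per bin), the probability of a bin holding exactly k nodes is e^(-0.5) * 0.5^k / k!, which for k = 8 is roughly 0.00000006:

// Computes the Poisson probabilities the JDK implementation notes rely on,
// with parameter (lambda) 0.5.
public class PoissonDemo {
    public static void main(String[] args) {
        double lambda = 0.5;
        for (int k = 0; k <= 8; k++) {
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial(k);
            System.out.printf("P(bin holds %d nodes) = %.8f%n", k, p);
        }
        // P(8) comes out to about 0.00000006, i.e. roughly six in a hundred million.
    }

    static double factorial(int k) {
        double f = 1;
        for (int i = 2; i <= k; i++) f *= i;
        return f;
    }
}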

Note also that treeifyBin first checks whether (n = tab.length) < MIN_TREEIFY_CAPACITY; if the table is still smaller than 64 it expands the table first instead of treeifying, as shown below.
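From the JDK 8 source, the guard at the start of treeifyBin looks roughly like this (the part that actually builds the tree is omitted):

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // If the table is still small, expand it instead of converting the list to a tree.
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // ... replace the list nodes with TreeNodes and call treeify (omitted)
    }
}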

When does a red-black tree convert back to a linked list?

UNTREEIFY_THRESHOLD is 6, so when the number of nodes in a bin drops to 6 or fewer (checked when the tree is split during resize), the bin is converted back to a linked list. At that size a list is fast enough, so a red-black tree has no clear advantage and its maintenance cost is higher.

Reference

HashMap (JDK1.8)

Red-black Tree Understanding and Manual Implementation

The Java 8 series reimagines HashMap