1. Overview of HashMap
In this article, we’ll take a look at a collection class that is commonly used in everyday development – HashMap. HashMap first appeared in JDK 1.2, and its underlying implementation is based on a hash algorithm. HashMap allows null keys and null values; a null key always hashes to 0. HashMap does not guarantee the order of its key-value pairs, which means the order may change after certain operations are performed. Also, it’s important to note that HashMap is not thread-safe and can be problematic in a multithreaded environment.
In this article, I will analyze the common methods, important attributes, and related helper methods of HashMap. Note that there is far too much HashMap source code to cover every line here, so please bear with me.
2. Principle
As mentioned in the previous section, the underlying implementation of HashMap is based on a hash algorithm. Hash-based implementations generally fall into two categories: open addressing (re-probing) and separate chaining. HashMap uses separate chaining, and JDK 1.8 introduced a red-black tree optimization for excessively long linked lists. The schematic diagram of the data structure is as follows:
With separate chaining, the data structure is composed of an array and linked lists (or tree structures). When performing operations such as insertion or deletion, we first locate the element’s bucket, and then locate the element within that bucket’s linked list. For example, to query whether the structure above contains the element 35, the steps are as follows:

- Locate the bucket of element 35: index = 35 % 16 = 3
- Search the linked list pointed to by bucket 3, and find 35 in that list
That’s how the underlying data structure of HashMap works. The basic operations of HashMap are wrappers around the basic operations of separate chaining. The difference is the introduction of red-black trees in JDK 1.8: the underlying data structure changed from array + linked list to array + linked list + red-black tree, but the essence remains the same. That’s it for the principle; next, let’s talk about the source code implementation.
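To make the principle concrete, here is a minimal, hypothetical sketch of separate chaining (this is not HashMap’s real code; the class, field, and method names are made up for illustration, and duplicate keys are not handled):

```java
// A minimal, hypothetical sketch of separate chaining; not HashMap's real code.
public class SimpleChainedMap<K, V> {
    static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] buckets = new Entry[16];

    public void put(K key, V value) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length; // 1. locate the bucket
        buckets[index] = new Entry<>(key, value, buckets[index]);   //    prepend to the chain
    }

    public V get(K key) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length; // 1. locate the bucket
        for (Entry<K, V> e = buckets[index]; e != null; e = e.next) // 2. walk the chain
            if (e.key.equals(key))
                return e.value;
        return null;
    }

    public static void main(String[] args) {
        SimpleChainedMap<Integer, String> map = new SimpleChainedMap<>();
        map.put(35, "thirty-five");
        System.out.println(map.get(35)); // thirty-five
    }
}
```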
3. Source code analysis
The source code version analyzed in this article is JDK 1.8. Compared with JDK 1.7, JDK 1.8 includes several optimizations to HashMap: for example, it introduces red-black trees to address the inefficiency of overly long linked lists, rewrites the resize method, and removes the alternative-hashing related methods so that key hashes no longer need to be recalculated. This article does not attempt to analyze all of these optimizations, however; it only examines the common methods of HashMap along with some important attributes and related methods. If you are interested in red-black trees, you can read another article of mine – Red-Black Tree Detailed Analysis.
3.1 Construction method
3.1.1 Analysis of construction method
HashMap has only four constructors. They do relatively simple work: initializing important variables such as loadFactor and threshold. The underlying data structure is not created until key-value pairs are first inserted. The relevant HashMap constructors are as follows:
/** Constructor 1 */
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

/** Constructor 2 */
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

/** Constructor 3 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

/** Constructor 4 */
public HashMap(Map<? extends K, ? extends V> m) {
    this.loadFactor = DEFAULT_LOAD_FACTOR;
    putMapEntries(m, false);
}
Of the above four constructors, the first is probably the one you use most often. It is simple and merely sets the loadFactor variable to its default value. Constructor 2 calls constructor 3, and constructor 3 still just sets a few variables. Constructor 4 is less common; it copies the entries of another Map into its own storage structure.
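For reference, a quick usage sketch of the four constructors (the capacity and load-factor values here are arbitrary examples):

```java
import java.util.HashMap;
import java.util.Map;

public class ConstructorDemo {
    public static void main(String[] args) {
        Map<String, Integer> m1 = new HashMap<>();         // constructor 1: all defaults
        Map<String, Integer> m2 = new HashMap<>(32);       // constructor 2: initial capacity only
        Map<String, Integer> m3 = new HashMap<>(32, 0.5f); // constructor 3: capacity + load factor
        Map<String, Integer> m4 = new HashMap<>(m3);       // constructor 4: copy another Map
        System.out.println(m1.size() + " " + m2.size() + " " + m3.size() + " " + m4.size());
    }
}
```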
The above is a brief introduction to constructors. There is not much in constructors themselves, so I will leave it at that. Next, let’s talk about some of the variables initialized by the constructor.
3.1.2 Initial capacity, load factor and threshold
In general, we create a HashMap using the no-argument constructor. However, when we have time or space complexity requirements, the default values may not meet them, and we need to tune the parameters manually. In the HashMap constructor we can adjust two parameters: initialCapacity and loadFactor. By setting these two parameters, we can further influence the threshold. Note, however, that the initial threshold is computed only from initialCapacity, via a shift-based method. The roles of these parameters are as follows:
| Name | Purpose |
|---|---|
| initialCapacity | Initial capacity of the HashMap |
| loadFactor | The load factor |
| threshold | The maximum number of key-value pairs the current HashMap can hold; beyond this, the capacity must be expanded |
The relevant code is as follows:
/** The default initial capacity - MUST be a power of two. */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;
/** The load factor used when none specified in constructor. */
static final float DEFAULT_LOAD_FACTOR = 0.75f;
final float loadFactor;
/** The next size value at which to resize (capacity * load factor). */
int threshold;
If you look at the source code, you’ll see that initialCapacity is not defined as a field in HashMap. This is not hard to understand: as the parameter name suggests, it represents an initial capacity and is used only once, in the constructor, so there is no need for a field to hold it. But if you look closely at the HashMap constructors above, you’ll see that the data structure that stores the key-value pairs is not initialized there. It is called the initial capacity, yet it is not used to initialize the data structure, so what is the point of passing this parameter? I won’t explain that yet – I’ll keep you in suspense, and we’ll come back to it later.
By default, HashMap has an initial capacity of 16 and a load factor of 0.75. There is no default constant for the threshold because it can be calculated as capacity multiplied by loadFactor (as described in the comment), that is, threshold = capacity * loadFactor. But if you look closely at constructor 3, you will see that the threshold is not calculated by that formula but by a method. Does this mean the threshold comment is wrong? Or is this just special handling here, with the formula used elsewhere? I won’t answer this yet either; it will be explained later when we analyze the resize method. Next, let’s look at that initialization method. The source is as follows:
/**
* Returns a power of two size for the given target capacity.
*/
static final int tableSizeFor(int cap) {
int n = cap - 1;
n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;
return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
The code above is a bit cryptic, and I didn’t understand what it was doing when I first looked at it. After working through it on paper, I figured out its purpose. To sum it up in one sentence: it finds the smallest power of 2 that is greater than or equal to cap. We will explain why this is needed later. Let’s first look at a diagram of the tableSizeFor method:
Here cap = 536,870,913 = 2^29 + 1. After the successive shift-and-OR steps, n + 1 = 1,073,741,824 = 2^30. It should be easy to see what this method does from the diagram, so I won’t go into more detail here.
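As a quick check, here is a standalone re-implementation of the same shift logic (copied from the snippet above rather than invoked via HashMap, since tableSizeFor is not public); the expected outputs are shown in comments:

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Same logic as HashMap.tableSizeFor in JDK 1.8
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(16));        // 16 (already a power of 2)
        System.out.println(tableSizeFor(17));        // 32
        System.out.println(tableSizeFor(536870913)); // 1073741824, i.e. 2^30
    }
}
```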
Having covered the initial threshold calculation, let’s talk about loadFactor. The load factor is an important HashMap parameter that reflects how full the bucket array is allowed to get (assuming the key-value pairs are evenly distributed across the buckets). By adjusting the load factor, a HashMap can trade time complexity against space complexity. If we lower the load factor, the number of key-value pairs the HashMap can hold before resizing becomes smaller. When the pairs are re-stored in a new, larger bucket array, collisions between keys decrease and the linked lists become shorter, so operations such as insertion, deletion, and lookup become more efficient – a typical trade of space for time. Conversely, if we increase the load factor (it can even be greater than 1), the HashMap can hold more key-value pairs before resizing, which improves space utilization but also raises the collision rate. The lists get longer and efficiency drops – a trade of time for space. How to tune the load factor depends on the usage scenario; in general, we just use the default.
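The following tiny sketch just evaluates threshold = capacity × loadFactor for a few hypothetical load factors to show how the resize point moves:

```java
public class LoadFactorDemo {
    public static void main(String[] args) {
        int capacity = 16;
        // threshold = capacity * loadFactor
        System.out.println((int) (capacity * 0.50f)); // 8  -> resizes sooner: shorter chains, more empty buckets
        System.out.println((int) (capacity * 0.75f)); // 12 -> the default trade-off
        System.out.println((int) (capacity * 1.00f)); // 16 -> better space utilization, higher collision rate
    }
}
```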
3.2 Lookup
The lookup operation of HashMap is relatively simple. The steps are the same as described in the Principle section: first locate the bucket where the key-value pair resides, then search its linked list or red-black tree. The relevant code is as follows:
public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    // 1. Locate the bucket that may contain the key
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        if ((e = first.next) != null) {
            // 2. If first is of type TreeNode, search the red-black tree
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            // 3. Otherwise, walk the linked list
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}
The core logic of the lookup is encapsulated in the getNode method, to which I have added some comments; it should not be hard to read. Let’s take a look at the first step of the lookup process – locating the bucket – which is implemented as follows:
// index = (n - 1) & hash
first = tab[(n - 1) & hash]
So the bucket position can be computed with just (n - 1) & hash. For those who don’t understand why this works, here’s a quick explanation. The length n of the bucket array in a HashMap is always a power of 2. In that case, (n - 1) & hash is equivalent to hash % n, but the remainder operation is not as efficient as the bit operation, so (n - 1) & hash is also a small optimization. For example, suppose hash = 185 and n = 16. The calculation process is as follows:
The above calculation is not complicated, so I won’t go into it here.
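A quick sanity check, using the numbers from the example above:

```java
public class IndexDemo {
    public static void main(String[] args) {
        int hash = 185, n = 16;             // n is a power of 2
        System.out.println(hash % n);       // 9
        System.out.println(hash & (n - 1)); // 9 -> same result, computed with a cheaper bit operation
    }
}
```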
Besides the lookup logic itself, the source code above also calls a method that computes the hash. The source code for this method is as follows:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
This method appears to recompute the hash with bit operations. Why do that? Why not just use the hash produced by the key’s hashCode method? Think about it for a moment; I’ll write the answer below.
There are two benefits to this, and I’ll explain them briefly. Look again at the index calculation diagram above, where the hash was produced by the key’s hashCode. Because n is relatively small, only the low 4 bits of the hash participate in the calculation; the high bits are effectively ignored. As a result, the computed index depends only on the low-order information, and the high-order data plays no role. To address this defect, we can XOR the high 4 bits with the low 4 bits of the hash in the figure above, i.e., hash ^ (hash >>> 4). In this way the high and low bits are mixed, which increases the randomness of the low bits and lets the high-order data participate in the calculation indirectly. The calculation process is as follows:
In Java, the hash produced by hashCode is of type int and is 32 bits wide. The first 16 bits are the high bits and the last 16 are the low bits, which is why the shift amount is 16.
The second benefit of recomputing the hash is that it increases the complexity of the hash. When we override hashCode, we may write a poorly distributed implementation, resulting in a high collision rate. The shift and XOR operations make the hash more complex, which improves its distribution. This is why HashMap does not use the key object’s original hash directly.
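A small sketch of the same perturbation applied outside HashMap (the two sample hash values are arbitrary and differ only in their high 16 bits):

```java
public class HashSpreadDemo {
    // Same spreading logic as HashMap.hash in JDK 1.8 (null handling omitted)
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h1 = 0x7A3B0000; // differs from h2 only in the high 16 bits
        int h2 = 0x11450000;
        int n = 16;
        System.out.println((h1 & (n - 1)) + " " + (h2 & (n - 1)));                 // 0 0  -> same bucket
        System.out.println((spread(h1) & (n - 1)) + " " + (spread(h2) & (n - 1))); // 11 5 -> high bits now matter
    }
}
```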
3.3 Traversal
Like lookup, traversal is also a frequently used operation. To traverse a HashMap, we typically use one of the following forms:
for(Object key : map.keySet()) {
// do something
}
or
for (Map.Entry entry : map.entrySet()) {
// do something
}
As the code above shows, we typically iterate over a HashMap’s key set or entry set. The for-each loop over keySet() in the snippet above is converted to iterator-based traversal at compile time, which is equivalent to:
Set keys = map.keySet();
Iterator ite = keys.iterator();
while (ite.hasNext()) {
Object key = ite.next();
// do something
}
When you traverse a HashMap, you’ll notice that the results come out in the same order across multiple traversals, but this order is generally inconsistent with the insertion order. What causes this behavior? Let’s think about why. I’ll post the traversal-related code first, as follows:
public Set<K> keySet() {
    Set<K> ks = keySet;
    if (ks == null) {
        ks = new KeySet();
        keySet = ks;
    }
    return ks;
}

final class KeySet extends AbstractSet<K> {
    public final int size()                 { return size; }
    public final void clear()               { HashMap.this.clear(); }
    public final Iterator<K> iterator()     { return new KeyIterator(); }
    public final boolean contains(Object o) { return containsKey(o); }
    public final boolean remove(Object key) {
        return removeNode(hash(key), key, null, false, true) != null;
    }
    // Omit part of the code
}

final class KeyIterator extends HashIterator
    implements Iterator<K> {
    public final K next() { return nextNode().key; }
}

abstract class HashIterator {
    Node<K,V> next;        // next entry to return
    Node<K,V> current;     // current entry
    int expectedModCount;  // for fast-fail
    int index;             // current slot

    HashIterator() {
        expectedModCount = modCount;
        Node<K,V>[] t = table;
        current = next = null;
        index = 0;
        if (t != null && size > 0) { // advance to first entry
            // Find the first bucket that contains a linked-list node reference
            do {} while (index < t.length && (next = t[index++]) == null);
        }
    }

    public final boolean hasNext() {
        return next != null;
    }

    final Node<K,V> nextNode() {
        Node<K,V>[] t;
        Node<K,V> e = next;
        if (modCount != expectedModCount)
            throw new ConcurrentModificationException();
        if (e == null)
            throw new NoSuchElementException();
        if ((next = (current = e).next) == null && (t = table) != null) {
            // The current list is exhausted; find the next non-empty bucket
            do {} while (index < t.length && (next = t[index++]) == null);
        }
        return e;
    }

    // Omit part of the code
}
In the source code above, traversing all the keys means first obtaining the KeySet object and then traversing it through its iterator, KeyIterator. The KeyIterator class inherits from HashIterator, which encapsulates the core logic. That logic is not complicated: when initialized, HashIterator first finds, in the bucket array, the first bucket that contains a linked-list node reference, then iterates over that bucket’s list. When that list is exhausted, it looks for the next bucket that contains a list node reference and traverses it, and so on. If none is found, the traversal ends. For example, suppose we traverse the structure in the following figure:
When initialized, HashIterator first scans the bucket array to find a bucket that contains a linked-list node reference – bucket number 3 in this example. The nextNode method then iterates over the list that bucket points to. After bucket 3 is exhausted, nextNode continues to find the next non-empty bucket, which is bucket 7 in the figure. The process then repeats until the last bucket is traversed. This is the flow of HashIterator’s core logic, as shown below:
The final result of traversing the figure above is 19 -> 3 -> 35 -> 7 -> 11 -> 43 -> 59. To verify its correctness, we can simply write some test code and run it. The test code is as follows:
/**
 * Should be run under JDK 1.8
 */
public class HashMapTest {

    @Test
    public void testTraversal() {
        HashMap<Integer, String> map = new HashMap<>(16);
        map.put(7, "");
        map.put(11, "");
        map.put(43, "");
        map.put(59, "");
        map.put(19, "");
        map.put(3, "");
        map.put(35, "");

        System.out.println("result of traversal: ");
        for (Integer key : map.keySet()) {
            System.out.print(key + " -> ");
        }
    }
}
The traversal results are as follows:
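For reference, running the test on JDK 1.8 should print output along these lines, matching the order derived above:

```
result of traversal: 
19 -> 3 -> 35 -> 7 -> 11 -> 43 -> 59 -> 
```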
At the end of this section, I have two questions for you. In JDK 1.8, red-black tree optimization was introduced to avoid the performance impact of excessively long linked lists. But the source code above contains no logic for traversing a red-black tree – why? And how, then, do we traverse a list that has already been converted to a red-black tree? Think about it first; you can find the answers in the source code or in later sections of this article.
3.4 Insertion
3.4.1 Insertion logic analysis
From the previous two sections, you should have a good grasp of HashMap’s underlying data structure. Even without my telling you, you could probably guess what the insertion flow looks like: first locate the bucket that the key-value pair to insert belongs to, then check whether the bucket is empty. If it is empty, store the key-value pair there. If it is not empty, either link the new node to the end of the list or update an existing pair with the same key. That is the basic insertion flow of a HashMap. But don’t get excited yet – this is only a simplified version. The actual insertion process is more complex. First, HashMap is a variable-length collection, so resizing must be considered. Second, in JDK 1.8 HashMap introduced the red-black tree optimization for overly long lists, which raises the questions of how long a list must be before it is optimized and what that optimization looks like. With these two issues added, the simple operation becomes a little more complicated. In this section I’ll start with the source code for insertion itself, and then cover resizing and treeification (converting a linked list to a red-black tree, same below) and other tree-related operations in two separate subsections. Next, look at the source code for the insert operation:
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // Initialize the bucket array table; table is lazily initialized on first insert
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // If the bucket contains no node reference, create a new node and store it in the bucket
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // If the hash and key equal those of the first node in the list, point e to that node
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // If the reference in the bucket is a TreeNode, insert into the red-black tree
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // Walk the linked list and count its nodes
            for (int binCount = 0; ; ++binCount) {
                // The list contains no node with this key; append the new node at the tail
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // If the list length reaches the treeify threshold, convert it to a red-black tree
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // The list already contains the key to insert; stop traversing
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // The key-value pair to insert already exists in the HashMap
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            // onlyIfAbsent: only overwrite the value when the old value is null
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // If the number of key-value pairs exceeds the threshold, resize
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
The entry point for insertion is put(K, V), but the core logic is in putVal(int, K, V, boolean, boolean). The putVal method does a few things:
- When the bucket array table is empty, initialize the table via resize
- Check whether the key-value pair to insert already exists; if so, decide whether to replace the old value with the new one
- If it does not exist, link the key-value pair into the list, and convert the list into a red-black tree if it has grown long enough
- Check whether the number of key-value pairs exceeds the threshold; if so, resize
That’s the logic of HashMap insertion. It’s not too complicated, so I won’t go into it further here. Next, let’s analyze the resizing mechanism.
3.4.2 Capacity expansion mechanism
In Java, arrays have a fixed length, which means they can only store a fixed amount of data. But during development, there are many times when we can’t know in advance how big an array to create: too small and it won’t be enough, too large and the space is wasted. It would be nice to have a variable-length array that allocates as much space as needed. Fortunately, we don’t have to implement variable-length arrays ourselves; the Java collections framework already provides variable-length data structures such as ArrayList and HashMap. For array-based variable-length data structures like these, resizing is a very important operation. Let’s talk about HashMap resizing.
Before going into detail, here’s some background on capacity expansion:
In a HashMap, the length of the bucket array is always a power of 2, and the threshold is the product of the bucket array length and the load factor. When the number of key-value pairs in the HashMap exceeds the threshold, the map is resized.
HashMap’s resizing mechanism differs from that of other variable-length collections: HashMap doubles the length of the current bucket array, and the threshold also becomes twice its original value (if the threshold overflows during that shift, it is recomputed with the threshold formula). After resizing, the key-value pairs have to be remapped and moved to their proper positions. That’s the overview; now let’s see how it’s implemented:
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    // The table has already been initialized
    if (oldCap > 0) {
        // The old capacity has reached the maximum; stop expanding
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Double the capacity; if the new capacity is within range and the old
        // capacity is at least the default, double the threshold as well
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        /*
         * When HashMap(int) or HashMap(int, float) was called, the initial
         * capacity was temporarily stored in the threshold variable
         */
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        /*
         * When the no-argument constructor was called, use the default capacity,
         * and the threshold is default capacity * default load factor
         */
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // newThr is 0 when the branches above did not compute it,
    // or when the shift overflowed; recompute it with the formula
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    // Create the new bucket array; this is also where table gets initialized
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // The old bucket array is not empty; traverse it and remap the key-value pairs
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    // Split the red-black tree
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    // Split the list into two groups, keeping the original order within each group
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    // Store the two lists in the corresponding buckets of the new array
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
The source code above is a bit long; please bear with me and work through its logic. It does three things in total:
- Compute the new bucket array capacity newCap and the new threshold newThr
- Create a new bucket array based on the computed newCap; this is also where the bucket array table gets initialized
- Remap the key-value pair nodes into the new bucket array. If a node is of type TreeNode, the red-black tree must be split; ordinary nodes are grouped while preserving their original order
Of the three items above, creating the new bucket array is a single line of code and needs no explanation. Let’s look at the first and third points, starting with how newCap and newThr are calculated. This calculation corresponds to the first and second conditional branches of the resize source code, as follows:
// First conditional branch
if (oldCap > 0) {
    // Nested branch
    if (oldCap >= MAXIMUM_CAPACITY) {...}
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY) {...}
}
else if (oldThr > 0) {...}
else {...}

// Second conditional branch
if (newThr == 0) {...}
These branches distinguish the different situations and compute the corresponding capacity values and thresholds. They cover the following cases:
The first conditional branch:
| Condition | Case covered | Note |
|---|---|---|
| oldCap > 0 | The bucket array table has already been initialized | |
| oldThr > 0 | threshold > 0 and the bucket array is uninitialized | This happens when the HashMap(int) or HashMap(int, float) constructor was called; newCap = oldThr, and newThr is computed in the second conditional branch |
| oldCap == 0 && oldThr == 0 | The bucket array is uninitialized and threshold is 0 | This happens when the HashMap() constructor was called |
Let me single out the oldThr > 0 case. Here, oldThr is assigned to newCap, which is equivalent to newCap = threshold = tableSizeFor(initialCapacity). The initialCapacity parameter we passed in at construction time travels through threshold and is finally assigned to newCap. This answers the earlier question: the initialCapacity parameter is not saved in a field of its own, so how does it participate in initializing the bucket array? It is stashed in threshold.
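To make this concrete, here is a small trace; the internal values noted in the comments follow from the source we just read and are not directly observable, since threshold and table are not public:

```java
public class InitialCapacityTrace {
    public static void main(String[] args) {
        // Constructor 3 runs tableSizeFor(17) = 32 and stores it in threshold;
        // the bucket array itself is still null at this point.
        java.util.HashMap<String, Integer> map = new java.util.HashMap<>(17);

        // The first put() triggers resize(): oldCap == 0, oldThr == 32,
        // so newCap = 32 and newThr = (int)(32 * 0.75f) = 24.
        map.put("k", 1);
        System.out.println(map); // {k=1}
    }
}
```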
The nested branch (inside the first branch):
| Condition | Case covered | Note |
|---|---|---|
| oldCap >= 2^30 | The bucket array capacity is greater than or equal to the maximum capacity 2^30 | No further expansion in this case |
| newCap < 2^30 && oldCap >= 16 | The new capacity is below the maximum and the old capacity is at least 16 | In this case the new threshold newThr = oldThr << 1, and the shift may overflow |
Here is a simple note on the overflow caused by the shift: when loadFactor has no fractional part and its integer part is even and at least 8, some resize may cause newThr to overflow to zero during the shift calculation. See below:
The second conditional branch:
| Condition | Case covered | Note |
|---|---|---|
| newThr == 0 | The first conditional branch did not compute newThr, or the nested branch’s shift caused newThr to overflow to zero | |
Having finished the calculation of newCap and newThr, let’s analyze how key-value pair nodes are remapped.
In JDK 1.8, remapping a node requires taking the node type into account. For tree nodes, the red-black tree must be split before mapping. For linked-list nodes, the list must be grouped first and then mapped. It is important to note that after grouping, the relative order of the nodes within each group stays the same. The red-black tree splitting logic will be explained in the next section; here let’s look at how linked lists are grouped and mapped.
We all know that when inserting a node into the underlying data structure, we generally compute the bucket position from the hash via a modulo-like calculation and then put the node into that bucket. In fact, we can think of remapping as a fresh insertion, and in JDK 1.7 that is exactly what happens. In JDK 1.8, however, this process has been optimized and the logic is slightly more complex. Before diving into the details, let’s review how the hash-to-bucket calculation works:
In the figure above, the bucket array size is n = 16 and hash1 is not equal to hash2. But because only the low 4 bits participate in the calculation, the two results are the same. After the bucket array is enlarged from 16 to 32, the same hashes are remapped as follows:
After expansion, the number of bits involved in the index calculation changes from 4 to 5. Since the 5th bits of the two hashes differ, the two results also differ. The calculation above is not hard to follow; let’s continue the analysis.
Assume the bucket array in the figure above is resized, and the capacity after expansion is n = 16. The remapping process is as follows:
Iterate over the linked list and compute hash & oldCap for each node, as shown in the figure below:
If the value is 0, point loHead and loTail at the node; if a subsequent node’s hash & oldCap is also 0, append that node to the list loHead points to and move loTail to it. If the value is non-zero, do the same with hiHead and hiTail. After the traversal there may be two linked lists, and the grouping is complete:
Finally, store the two lists in the corresponding buckets of the new array, completing the resize. As shown below:
As the figure above shows, the relative order of the nodes within each of the two lists does not change after remapping; they keep their pre-expansion order. That’s the JDK 1.8 resize code. As a side note, resizing in the JDK 1.8 HashMap is more efficient than in earlier versions. In JDK 1.7, to guard against denial-of-service attacks caused by hash collisions, random seeds could be introduced into the hash calculation, strengthening the randomness of hashes so that key-value pairs spread more evenly across the bucket array. During a resize, the relevant method decides, based on capacity, whether a new random seed is needed and recomputes the hash of every node. In JDK 1.8 this mechanism was replaced by the introduction of red-black trees, which avoids recomputing hashes repeatedly and improves resize efficiency.
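The following standalone sketch reproduces the grouping rule used during resize: a node either stays at its old index j or moves to j + oldCap, depending on the bit tested by hash & oldCap (the sample hashes are arbitrary):

```java
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        int[] hashes = {35, 19, 3, 43, 27};

        for (int h : hashes) {
            int oldIndex = h & (oldCap - 1);
            int newIndex = h & (newCap - 1);
            // (h & oldCap) == 0 -> the node stays at oldIndex
            // (h & oldCap) != 0 -> the node moves to oldIndex + oldCap
            System.out.println(h + ": " + oldIndex + " -> " + newIndex
                    + " (hash & oldCap = " + (h & oldCap) + ")");
        }
    }
}
```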
That’s all for this section; next, let’s talk about the conversions between linked lists and red-black trees.
3.4.3 List treeification, red-black tree splitting and un-treeification
JDK 1.8 made improvements to the HashMap implementation. The biggest one is the introduction of red-black trees to handle frequent collisions, which also increases the complexity of the code. Previously, for example, only a set of linked-list operations had to be implemented; after the introduction of red-black trees, the related tree operations are needed as well. A red-black tree is a kind of self-balancing binary search tree and is fairly complex in its own right. This article does not expand on red-black trees; it only covers the aspects of list treeification that need attention. For a detailed introduction to red-black trees, if you are interested, you can refer to my other article – Red-Black Tree Detailed Analysis.
Before expanding on this, here is the treeification-related code:
/** The list length threshold at which a bin is treeified. */
static final int TREEIFY_THRESHOLD = 8;

/** The smallest table capacity for which bins may be treeified. */
static final int MIN_TREEIFY_CAPACITY = 64;

static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    TreeNode(int hash, K key, V val, Node<K,V> next) {
        super(hash, key, val, next);
    }
}

/**
 * Replaces all linked nodes in the bin at the index for the given hash,
 * unless the table is too small, in which case it resizes instead.
 */
final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // If the bucket array capacity is less than MIN_TREEIFY_CAPACITY,
    // resize instead of treeifying
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // hd points to the head of the TreeNode list, tl to its tail
        TreeNode<K,V> hd = null, tl = null;
        // Convert the ordinary linked list into a list of TreeNodes
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        if ((tab[index] = hd) != null)
            // Convert the TreeNode list into a red-black tree
            hd.treeify(tab);
    }
}

TreeNode<K,V> replacementTreeNode(Node<K,V> p, Node<K,V> next) {
    return new TreeNode<>(p.hash, p.key, p.value, next);
}
Treeification must satisfy two conditions:
- The length of the linked list is greater than or equal to TREEIFY_THRESHOLD
- The bucket array capacity is greater than or equal to MIN_TREEIFY_CAPACITY
The first condition is easy to understand, so I won’t discuss it. Here is the reasoning behind the second condition, as I see it:
When the bucket array capacity is small, the collision rate of key hashes may be relatively high, leading to long linked lists. In that situation, resizing should take priority over treeification: after all, the high collision rate is caused primarily by the small bucket array. When the capacity is small, resizing first avoids a series of unnecessary treeifications. At the same time, a small capacity means resizes happen more often, and each resize would have to split the red-black tree and remap it. So converting long lists into red-black trees while the bucket capacity is still small is a thankless task.
Back to the source code above and the treeifyBin method. Its main job is to convert an ordinary linked list into a list of TreeNode nodes, and the final call to treeify converts that list into a red-black tree. TreeNode inherits from the Node class, so a TreeNode still has a next reference, and the order of the original list is preserved through those next references. Assume that before treeification the list structure is as follows:
HashMap was not originally designed with red-black trees in mind; they came later as an optimization. As a result, key classes are not required to implement the Comparable interface or to provide a comparator, as TreeMap requires. However, since the treeification process needs to compare two keys, ordering keys becomes a tricky issue when the key class does not implement Comparable. To solve this, HashMap performs three steps to ensure the two keys can be ordered, as follows:
- Compare the hashes of the two keys; if the hashes are equal, continue comparing
- Check whether the key class implements the Comparable interface; if it does, call compareTo to compare
- If the order still cannot be decided, a final arbitration is needed, via tieBreakOrder (have a look at the source yourself)
A tie break is a tennis term; it can be understood as an overtime game – a rather fun name.
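Here is a simplified, hypothetical sketch of that three-step ordering (the method name compareForTree is made up; the real logic is spread across treeify, comparableClassFor, compareComparables and tieBreakOrder):

```java
public class TreeKeyOrderSketch {
    // Simplified illustration only; not the JDK implementation.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareForTree(Object k1, int h1, Object k2, int h2) {
        // 1. Compare the hashes first
        if (h1 != h2)
            return Integer.compare(h1, h2);
        // 2. If the key class implements Comparable, try compareTo
        if (k1 instanceof Comparable && k1.getClass() == k2.getClass()) {
            int d = ((Comparable) k1).compareTo(k2);
            if (d != 0)
                return d;
        }
        // 3. Still tied: arbitrate by class name, then identity hash (the idea behind tieBreakOrder)
        int d = k1.getClass().getName().compareTo(k2.getClass().getName());
        return d != 0 ? d : Integer.compare(System.identityHashCode(k1),
                                            System.identityHashCode(k2));
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have the same hashCode, so compareTo decides: prints a negative number
        System.out.println(compareForTree("Aa", "Aa".hashCode(), "BB", "BB".hashCode()));
    }
}
```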
Through these three comparisons, an order can finally be decided between any two keys. Once the ordering is settled, the red-black tree can be constructed. The resulting red-black tree is as follows:
The orange arrows represent the next references of the TreeNodes; the prev references are not drawn due to limited space. As you can see, after the list is converted into a red-black tree, the original list order is still preserved through these references (the root of the red-black tree is moved to the head of the list), and we can still traverse the red-black tree just as we would traverse a list. This structure paves the way for splitting the red-black tree and for converting it back into a linked list, which we will analyze next.
Splitting the red-black tree
After a resize, ordinary nodes need to be remapped, and red-black tree nodes are no exception. The obvious approach would be to convert the red-black tree back into a linked list and then remap that list. It is easy to think of, but it loses some efficiency. HashMap takes a different route. As mentioned in the previous section, when converting an ordinary list into a red-black tree, HashMap preserves the order of the original list through two extra references, next and prev. Thanks to this, remapping a red-black tree can be done the same way as remapping a linked list, which avoids converting the tree to a list first and quietly improves efficiency.
That is the logic of red-black tree splitting; now let’s look at the implementation:
/** The threshold below which a tree bin is converted back to a list during resize. */
static final int UNTREEIFY_THRESHOLD = 6;

final void split(HashMap<K,V> map, Node<K,V>[] tab, int index, int bit) {
    TreeNode<K,V> b = this;
    // Relink into lo and hi lists, preserving order
    TreeNode<K,V> loHead = null, loTail = null;
    TreeNode<K,V> hiHead = null, hiTail = null;
    int lc = 0, hc = 0;
    /*
     * The red-black tree nodes still keep their next references,
     * so the tree can be traversed and split just like a linked list.
     */
    for (TreeNode<K,V> e = b, next; e != null; e = next) {
        next = (TreeNode<K,V>)e.next;
        e.next = null;
        if ((e.hash & bit) == 0) {
            if ((e.prev = loTail) == null)
                loHead = e;
            else
                loTail.next = e;
            loTail = e;
            ++lc;
        }
        else {
            if ((e.prev = hiTail) == null)
                hiHead = e;
            else
                hiTail.next = e;
            hiTail = e;
            ++hc;
        }
    }

    if (loHead != null) {
        // If the lo list is short enough, convert it back to an ordinary linked list
        if (lc <= UNTREEIFY_THRESHOLD)
            tab[index] = loHead.untreeify(map);
        else {
            tab[index] = loHead;
            // hiHead == null means all nodes stayed together and the tree is still intact
            if (hiHead != null)
                loHead.treeify(tab);
        }
    }
    // Similar handling for the hi list
    if (hiHead != null) {
        if (hc <= UNTREEIFY_THRESHOLD)
            tab[index + bit] = hiHead.untreeify(map);
        else {
            tab[index + bit] = hiHead;
            if (loHead != null)
                hiHead.treeify(tab);
        }
    }
}
As the source shows, the logic of remapping a red-black tree is basically the same as remapping a linked list. The difference is that after remapping, the red-black tree is split into two linked lists made up of TreeNodes. If a list is no longer than UNTREEIFY_THRESHOLD, it is converted back into an ordinary linked list; otherwise the TreeNode list is re-treeified, depending on the conditions. For example, suppose the red-black tree in the figure above is remapped after a resize; the result looks like this:
Converting the red-black tree back into a linked list
As mentioned earlier, the original list order is preserved inside the red-black tree. With that in mind, turning the tree back into a linked list is easy: just convert the TreeNode list into a Node list. The relevant code is as follows:
final Node<K,V> untreeify(HashMap<K,V> map) {
    Node<K,V> hd = null, tl = null;
    // Traverse the TreeNode list and replace each TreeNode with an ordinary Node
    for (Node<K,V> q = this; q != null; q = q.next) {
        Node<K,V> p = map.replacementNode(q, null);
        if (tl == null)
            hd = p;
        else
            tl.next = p;
        tl = p;
    }
    return hd;
}

Node<K,V> replacementNode(Node<K,V> p, Node<K,V> next) {
    return new Node<>(p.hash, p.key, p.value, next);
}
The code above is not complicated and should not be hard to understand, so I won’t go into it. That wraps up the resize-related content; I hope it all made sense.
3.5 Deletion
If you made it through the previous section, you can relax in this one – unless, of course, you look at how nodes are removed from a red-black tree. But red-black trees are not the focus of this article and won’t be covered in this section, so don’t worry.
Removal from a HashMap is not complicated and takes only three steps: first locate the bucket, then traverse the linked list to find the node with an equal key, and finally delete the node. The relevant source code is as follows:
public V remove(Object key) {
    Node<K,V> e;
    return (e = removeNode(hash(key), key, null, false, true)) == null ?
        null : e.value;
}

final Node<K,V> removeNode(int hash, Object key, Object value,
                           boolean matchValue, boolean movable) {
    Node<K,V>[] tab; Node<K,V> p; int n, index;
    // 1. Locate the bucket
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (p = tab[index = (n - 1) & hash]) != null) {
        Node<K,V> node = null, e; K k; V v;
        // If the key equals that of the first node in the list, node points to it
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            node = p;
        else if ((e = p.next) != null) {
            // If the first node is a TreeNode, search the red-black tree
            if (p instanceof TreeNode)
                node = ((TreeNode<K,V>)p).getTreeNode(hash, key);
            else {
                // 2. Traverse the list to find the node to delete
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key ||
                         (key != null && key.equals(k)))) {
                        node = e;
                        break;
                    }
                    p = e;
                } while ((e = e.next) != null);
            }
        }
        // 3. Delete the node and fix up the list or red-black tree
        if (node != null && (!matchValue || (v = node.value) == value ||
                             (value != null && value.equals(v)))) {
            if (node instanceof TreeNode)
                ((TreeNode<K,V>)node).removeTreeNode(this, tab, movable);
            else if (node == p)
                tab[index] = node.next;
            else
                p.next = node.next;
            ++modCount;
            --size;
            afterNodeRemoval(node);
            return node;
        }
    }
    return null;
}
The delete operation itself is not complicated; with the foundation laid earlier, it should not be hard to understand, so I won’t say much more here.
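For completeness, a small usage note: remove returns the value previously associated with the key, or null if the key was absent.

```java
import java.util.HashMap;

public class RemoveDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        System.out.println(map.remove("a")); // 1    -> the old value is returned
        System.out.println(map.remove("a")); // null -> the key is no longer present
        System.out.println(map.size());      // 0
    }
}
```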
3.6 Other details
The previous sections examined HashMap’s common operations and the associated source code; this section adds a few more details.
The table variable is declared transient
If you look closely at the HashMap source code, you’ll see that the bucket array table is declared transient. The word transient roughly means short-lived or non-persistent; in Java, fields marked with this keyword are not serialized by the default serialization mechanism. Back to the point: the bucket array table is the core underlying data structure of HashMap. If it isn’t serialized, how can the map be restored later?
To be clear, HashMap does not rely on the default serialization mechanism. Instead, it implements readObject/writeObject to customize the serialized content. There’s a good reason for this. What does a HashMap store? Key-value pairs, of course. So as long as we serialize the key-value pairs, we can reconstruct the HashMap from them. Some readers may think that serializing the table directly would be simpler – just restore it as-is afterwards. That sounds reasonable, but serializing the table has two problems:
- In most cases the table is not full, so serializing the unused slots wastes space
- The same key may map to different buckets under different JVMs, so deserializing the table as-is on another JVM could cause errors
Of the two problems above, the first is easy to understand; the second needs some explanation. The first step of methods such as get/put/remove in HashMap is to locate the bucket from the key’s hash. If the key class does not override hashCode, the hash calculation eventually calls Object’s hashCode, which is a native method: different JVMs may implement it differently and produce different hashes. That means the same key can produce different hashes on different platforms, and continuing to operate on the old table layout would then go wrong.
With all of this in mind, you should see why HashMap does not serialize the table directly.
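A quick demonstration that the transient table does not prevent a HashMap from being serialized, because its custom writeObject/readObject write the key-value pairs and rebuild the bucket array on the current JVM:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // Serialize: HashMap's custom writeObject writes the key-value pairs, not the table
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(map);
        }

        // Deserialize: readObject rebuilds the bucket array by re-inserting each pair
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            HashMap<String, Integer> copy = (HashMap<String, Integer>) ois.readObject();
            System.out.println(copy); // {a=1, b=2}
        }
    }
}
```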
3.7 Summary
This chapter analyzed the code behind HashMap’s common operations in detail and added a few extra details at the end. The section on insertion took the most space, mainly because insertion involves so many interlinked points – including but not limited to table initialization, resizing, and treeification – which makes it the hardest part to analyze. Fortunately, the analysis got done.
Although this chapter is fairly long, it still does not cover every aspect of HashMap – for example, the insertion and deletion operations of the red-black tree itself. In my personal opinion, though, the analysis above is enough. After all, you are a user of the class library, not its designer, so there is no need to understand every detail. If some detail eludes you, just skip it. For our development purposes, knowing the general principles of HashMap is good enough.
Well, that’s the end of this chapter.
4. Final words
I can finally breathe a sigh of relief. This article took me over a week. Before writing it, my knowledge of HashMap was limited to the principles and did not go deep. At first I didn’t think there would be much to write about, since everyone knows HashMap more or less. But once I started reading the HashMap source code, I realized I was wrong: it’s not that there is nothing to write, but that there are so many points worth writing about that I didn’t know where to start. The JDK 1.8 implementation of HashMap is much more complex than earlier versions, and many details are hard to grasp. Understanding them myself was not enough; writing them down was even harder, and this article was basically completed while reading the source code. Given limited time and ability, and the considerable length of the article, it is hard to guarantee that every piece of analysis and every diagram is correct. If there are any mistakes, I hope you will point them out, and I will correct them promptly. Thank you in advance.
Well, that’s all for this article, thank you for reading!
This article is distributed under a Creative Commons 4.0 license.
Author: coolblog
For a better reading experience, please visit my personal blog: http://www.coolblog.xyz