Why is HashMap always asked in Java interviews?
HashMap has long been a favorite topic for Java interviewers, and no matter what level you are at, sooner or later one of your technical interview rounds will ask about it.
Why should a Java interview delve into HashMap? Because of its structure and the design principles behind it, HashMap can test a beginner's understanding of the Java collection framework while also revealing a candidate's deeper knowledge of data structures.
Questions around HashMap can stay shallow or go very deep, touching on data structures and even on how the underlying machine works.
What is different about HashMap in JDK 1.8
We know there is a big difference between the internal implementation of HashMap in JDK 1.7 and in JDK 1.8 (and later), but since JDK 1.8 is currently the dominant version, we will only cover the JDK 1.8 HashMap here.
Compared with JDK 1.7, JDK 1.8 mainly optimizes two aspects, improving the efficiency of both storage and lookup.
- Storage optimization
The storage structure changed from array + linked list to array + linked list + red-black tree. From basic data-structure knowledge we know a linked list is hard to address (query) but easy to insert into and delete from. As the stored data grows, the list keeps getting longer and query efficiency keeps dropping; converting a long list to a red-black tree improves query efficiency.
- Addressing optimization
In JDK 1.7, a value is located at an array index by taking the hash of the key modulo the array length, then stored in the linked list at that index; lookups locate data the same way. JDK 1.8 adds an optimization that reduces the hash-collision rate.
To understand HashMap we only need to focus on three methods: put, get, and resize. Next, we walk through the source code of each in turn.
What does put do
Let's take a look at what the put method does. It takes two arguments: key and value.
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
As you can see, the real implementation is not here: all the logic lives in the putVal method.
final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    // tab: the table array; n: the array length; i: the bucket index after hashing;
    // p: the first node of the linked list or red-black tree stored at index i
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // If the table array is null or its length is 0, call resize() to initialize it
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // If the bucket is empty, create a new node; the current k-v becomes its first entry
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    // The bucket is not empty
    else {
        Node<K,V> e; K k;
        // If the current k-v matches the first node's hash and key, remember it in e
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // The first node is a red-black tree node: insert as a tree node
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else { // The bucket holds a linked list: traverse it, appending a new element or finding the existing key
            for (int binCount = 0; ; ++binCount) {
                // Reached the tail: wrap the current k-v in a new node and append it
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // TREEIFY_THRESHOLD - 1 = 7; binCount starts from 0
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // Found an existing node with the same hash and key: exit the loop,
                // since this branch only handles new nodes
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                // e is not the tail node; advance p to e and continue the traversal
                p = e;
            }
        }
        // An existing mapping was found: replace the old value with the new one
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            // Replace when onlyIfAbsent is false or the old value is null
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            // An empty hook that subclasses can override as needed
            afterNodeAccess(e);
            // Return the old value
            return oldValue;
        }
    }
    ++modCount;
    // If the number of mappings now exceeds threshold, resize the table
    if (++size > threshold)
        resize();
    // An empty hook that subclasses can override as needed
    afterNodeInsertion(evict);
    return null;
}
After reading putVal's source code, we can take away the following points:
- A new value is appended to the end of the list. If the list length reaches 8, the list is converted to a red-black tree, which fixes the problem of slow list queries. As we will see around resize, when the length drops to 6 the red-black tree is converted back to a linked list.
- If the total number of key-value mappings in the map exceeds threshold, the map is resized. The threshold is computed in resize; its initial value is (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY), namely 16 * 0.75 = 12. You can also see that putVal calls resize() in two places: once for initialization and once for capacity expansion.
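The onlyIfAbsent flag above is exactly what separates put from putIfAbsent, and the old-value return is visible from user code. A quick sketch of the observable behavior:

```java
import java.util.HashMap;

public class PutDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();

        // put calls putVal(..., onlyIfAbsent = false, ...): it always overwrites
        System.out.println(map.put("a", 1));         // null (no previous mapping)
        System.out.println(map.put("a", 2));         // 1 (the old value is returned)

        // putIfAbsent calls putVal(..., onlyIfAbsent = true, ...): it keeps the old value
        System.out.println(map.putIfAbsent("a", 3)); // 2 (existing value, not replaced)
        System.out.println(map.get("a"));            // 2
    }
}
```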
The putVal method takes five arguments, and the first one is produced by calling a hash method with the key. So let's look at that hash method.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
When key is null it returns 0, which is easy to understand. But when key is not null, what is that expression doing? It XORs the key's hashCode with the same hashCode shifted 16 bits to the right. Why do that? At first glance it is confusing.
In fact, this is the JDK 1.8 addressing optimization. What is the benefit of doing this?
First, locating the bucket. The straightforward approach is hash(key) % n, but the modulo operation is relatively slow. When n is a power of two, hash(key) & (n - 1) yields mathematically the same result as hash(key) % n while performing much better, because it is a single bitwise AND. So tab[i = (n - 1) & hash] is effectively taking the hash modulo the array length.
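This equivalence only holds because the table length is always a power of two. A minimal sketch verifying it (the helper names byMask/byMod are mine, not from the JDK):

```java
public class ModVsMask {
    // For n a power of two, hash & (n - 1) keeps exactly the low bits of hash,
    // which equals the non-negative remainder of hash divided by n
    static int byMask(int hash, int n) { return hash & (n - 1); }
    static int byMod(int hash, int n)  { return Math.floorMod(hash, n); }

    public static void main(String[] args) {
        int n = 16; // the default HashMap capacity
        for (int hash : new int[]{0, 1, 15, 16, 17, 123456789, -42}) {
            // floorMod is used instead of % so that negative hashes compare correctly
            System.out.println(hash + " -> " + byMask(hash, n) + " / " + byMod(hash, n));
        }
    }
}
```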
Second, the expression (h = key.hashCode()) ^ (h >>> 16).
The hashCode of an object is a 32-bit int. Shifting h right by 16 bits moves its high 16 bits into the low 16 positions and fills the high 16 bits with zeros. XORing this shifted value with h itself produces the final hash.
Compared with the original hashCode, the high 16 bits of the resulting hash can essentially be ignored, while its low 16 bits are the XOR of the original high and low 16 bits, so they carry the characteristics of both halves.
The hash is then ANDed with (n - 1). Here n is the array length: it starts at 16 and doubles on each resize, so n is never very large; the high 16 bits of (n - 1) are essentially all zero and only the low bits are set. Take n = 16 as an example.
Since the high 16 bits of (n - 1) are all zero, ANDing it with any value cannot set the high bits of the index; only the low bits of the hash influence the result. In other words, the high 16 bits of the hash would be wasted.
What problem does that cause? If two objects have hashCodes with identical low 16 bits but different high 16 bits, the results of the AND are identical, producing a hash collision: both keys land at the same array index.
After the right-shift-by-16 XOR, the high 16 bits are effectively fused into the low 16 bits: the low 16 bits of the result carry the combined characteristics of both halves of h. This reduces hash collisions and, to a certain extent, keeps the data evenly distributed.
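A sketch of the effect, assuming two hashCodes that share their low 16 bits but differ in the high 16 bits (the concrete values are made up for illustration):

```java
public class SpreadDemo {
    // The same spreading step as HashMap's hash(Object)
    static int spread(int h) { return h ^ (h >>> 16); }

    public static void main(String[] args) {
        int n = 16;          // table length, so the index mask is n - 1 = 0b1111
        int h1 = 0x00010000; // low 16 bits are all zero
        int h2 = 0x00020000; // same low 16 bits, different high 16 bits

        // Without spreading, the high bits are masked away and both keys collide
        System.out.println((h1 & (n - 1)) + " vs " + (h2 & (n - 1)));                 // 0 vs 0

        // With spreading, the high bits leak into the low bits and the indexes differ
        System.out.println((spread(h1) & (n - 1)) + " vs " + (spread(h2) & (n - 1))); // 1 vs 2
    }
}
```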
After reading putVal's source code, we understand the storage-structure and hash-addressing optimizations, but some questions remain.
Why convert between a linked list and a red-black tree at all?
The conversion between a linked list and a red-black tree is a time-space tradeoff. A list node only holds a pointer to the next node plus the data, while a tree node needs left and right pointers to its two children: a TreeNode occupies roughly twice the space of an ordinary Node, so a red-black tree requires more storage than a linked list. In exchange, red-black tree queries are more efficient than list queries.
Of course, these advantages only matter at a certain data volume; a bin is treeified only when it holds enough nodes. When the amount of data is small, there is little difference in query efficiency between the two, while the tree still costs more storage. Hence the two conversion thresholds, 8 and 6.
So why are the thresholds 8 and 6?
The HashMap designers left a note about this in the source; many such questions can be answered just by reading the source comments.
/*
 * Because TreeNodes are about twice the size of regular nodes, we
 * use them only when bins contain enough nodes to warrant use
 * (see TREEIFY_THRESHOLD). And when they become too small (due to
 * removal or resizing) they are converted back to plain bins. In
 * usages with well-distributed user hashCodes, tree bins are
 * rarely used. Ideally, under random hashCodes, the frequency of
 * nodes in bins follows a Poisson distribution
 * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 * parameter of about 0.5 on average for the default resizing
 * threshold of 0.75, although with a large variance because of
 * resizing granularity. Ignoring variance, the expected
 * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 * factorial(k)). The first values are:
 *
 * 0:    0.60653066
 * 1:    0.30326533
 * 2:    0.07581633
 * 3:    0.01263606
 * 4:    0.00157952
 * 5:    0.00015795
 * 6:    0.00001316
 * 7:    0.00000094
 * 8:    0.00000006
 * more: less than 1 in ten million
 */
In the ideal case, that is, when hash values are well dispersed and the collision rate is very low, data is spread evenly across the buckets with no hot spots, and a red-black tree is simply unnecessary. In reality, however, the quality of each object's hashCode implementation varies widely, which may lead to uneven distribution.
The value 8 was chosen from a probability standpoint. Ideally, under random hash codes, the number of nodes per bin follows a Poisson distribution, and the probability of a list in the map reaching length 8 is extremely low: as the table shows, 0.00000006. If such a low-probability event actually happens, the list really has become too long. As for why the untreeify threshold is not the same value: the gap between 8 and 6 acts as a buffer that effectively prevents frequent conversion back and forth between list and tree.
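The probabilities in that comment can be reproduced directly. A minimal sketch evaluating exp(-0.5) * pow(0.5, k) / factorial(k) for k = 0..8:

```java
public class PoissonBins {
    public static void main(String[] args) {
        // P(bin size == k) for a Poisson distribution with parameter 0.5;
        // start from the k = 0 term and multiply by 0.5 / (k + 1) at each step
        double p = Math.exp(-0.5);
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, p);
            p *= 0.5 / (k + 1);
        }
    }
}
```

The printed values match the source comment, with k = 8 landing at 0.00000006.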
How does get retrieve a value
The get(Object key) method is much simpler than putVal:
public V get(Object key) {
Node<K,V> e;
return (e = getNode(hash(key), key)) == null ? null : e.value;
}
Again, the concrete implementation is not here: the lookup logic lives in the getNode method, which also receives hash(key) to locate the bucket index of the key.
final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    // The table is not null, its length is greater than 0, and the bucket is not empty
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // Return the first node if it holds the current key
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // Otherwise continue down the bucket
        if ((e = first.next) != null) {
            // Red-black tree bucket: look up as a tree
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            // Linked-list bucket: walk the list
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}
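One usage note worth connecting back to the hash method: because hash(null) returns 0, both put and get route a null key through bucket 0, so a HashMap supports exactly one null key. A quick sketch:

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();

        // hash(null) returns 0, so the null key always lands in bucket 0
        map.put(null, "first");
        System.out.println(map.get(null)); // first

        // A second put with the null key overwrites the same single entry
        map.put(null, "second");
        System.out.println(map.get(null)); // second
        System.out.println(map.size());    // 1
    }
}
```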
How does resize expand the table
Resizing is the most complex part of the entire HashMap: it involves data migration and re-addressing, and there is a lot of code, so it requires some patience.
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // Already at maximum capacity: cannot grow, so pin the threshold and return the old table
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Double the capacity and double the threshold
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1;
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else { // oldCap == 0 and oldThr == 0: use the defaults set at initialization
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    // Allocate the new array with the new capacity
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // The table has been grown; now migrate the data from the old table into the new one
    if (oldTab != null) {
        // Iterate over every bucket of the old array
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            // If the old bucket at index j is not empty, copy its head node to e
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null; // clear the old slot at j so it can be garbage collected
                // If e has no successor, the bucket holds a single node
                if (e.next == null)
                    // Place e directly into its new bucket
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    // Red-black tree bucket: split() migrates the tree, converting it back
                    // to a list if it becomes too small
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else {
                    // Two new lists: "lo" stays at index j, "hi" moves to index j + oldCap
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        // Remember the successor of e
                        next = e.next;
                        // If (e.hash & oldCap) is 0, the node goes into the lower list
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e; // e becomes the head of the lower list
                            else
                                loTail.next = e; // otherwise append e after loTail
                            loTail = e; // e is now the tail of the lower list
                        }
                        // Otherwise the node goes into the higher list
                        else {
                            if (hiTail == null)
                                hiHead = e; // e becomes the head of the higher list
                            else
                                hiTail.next = e; // otherwise append e after hiTail
                            hiTail = e; // e is now the tail of the higher list
                        }
                    } while ((e = next) != null); // loop until e is null, the end of the list
                    if (loTail != null) {
                        loTail.next = null; // terminate the lower list
                        newTab[j] = loHead; // the lower list keeps index j in the new table
                    }
                    if (hiTail != null) {
                        hiTail.next = null; // terminate the higher list
                        newTab[j + oldCap] = hiHead; // the higher list moves to j + oldCap
                    }
                }
            }
        }
    }
    return newTab;
}
The first part decides the new capacity and threshold. If the old capacity has already reached MAXIMUM_CAPACITY (1 << 30), the table cannot grow any further, so the threshold is simply set to Integer.MAX_VALUE and the old array is returned. Otherwise both the capacity and the threshold are doubled.
The last part is the data-migration logic, which moves the data from the old array into the new one by iterating with a for loop. There are three cases:
- If the bucket at an index of the old array holds a single node, that node is re-addressed against the new array length with hash & (newCap - 1) to find its new index.
- If the bucket holds a red-black tree, the split() method migrates the data; if a resulting tree holds no more than 6 nodes, it is converted back to a linked list.
- If the bucket holds a linked list, the data is migrated list by list.
Will the location of data change after migration?
Look again at the references defined in the list branch: Node loHead/loTail and hiHead/hiTail (TreeNode.split() defines the same lo/hi pairs for trees). Here lo is short for lower and hi for higher. The reason for the split, as the source comment puts it, is that "we are using power-of-two expansion."
During migration, the node chain at one index of the old array is split into two parts. Taking the linked list as an example, each node's hash is ANDed with the old array length: (e.hash & oldCap).
Nodes for which the result is 0 go into the lower list and keep their index in the new array, e.g. oldTab[4] becomes newTab[4]. The rest go into the higher list at the old index plus the old array length, i.e. newTab[4 + oldCap].
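This trick can be checked directly: for a power-of-two capacity, the new index of every node is either its old index or its old index plus oldCap, decided by the single bit (hash & oldCap). A sketch with hashes chosen to land in old bucket 4:

```java
public class SplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32; // power-of-two expansion doubles the capacity
        for (int hash : new int[]{4, 20, 36, 52}) {
            int oldIndex = hash & (oldCap - 1);
            int newIndex = hash & (newCap - 1);
            // The deciding bit: 0 keeps the old index ("lo"), 1 shifts it by oldCap ("hi")
            String list = (hash & oldCap) == 0 ? "lo" : "hi";
            System.out.println(hash + ": " + oldIndex + " -> " + newIndex + " (" + list + ")");
        }
    }
}
```

All four hashes share old bucket 4; after resizing, the "lo" ones (4, 36) stay at index 4 while the "hi" ones (20, 52) move to index 20 = 4 + 16.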
Last but not least: do you know why the expansion is needed at all?
The reason is actually simple and comes from the array data structure itself. We often say arrays are fast to query; that is because addressing by index is cheap. The elements of an array are stored contiguously in memory, and once the array's length is defined it cannot change: the machine allocates one contiguous block of memory for it. Because the block is contiguous, knowing the address of a[0] gives the address of a[n] by adding n. Due to this property, when the original array is full, a new, larger array must be allocated in memory and the data migrated from the old one.
The end
We have now walked through the HashMap source code and its key points, and we can see that even a simple collection container embodies the rich ideas and technical skill of its designers. Reading the source helps us understand the class and use it better, and its design ideas are worth borrowing in our own work, which is exactly why reading source code matters.