Table of contents
• Before we begin
• An introduction to HashMap
• The Hash function
• Initial capacity and load factor
• The underlying structure as seen through put and get
• HashMap under high concurrency
• Before we begin
For most people, HashMap is at once familiar and unfamiliar. We reach for it whenever we need key-value pairs, yet if asked to explain the implementation details of HashMap, most of us would hesitate. The source code discussed in this article is from JDK 1.8, so you can follow along by opening that version of the source directly.
• An introduction to HashMap
As we all know, HashMap is a data structure that stores key-value pairs (many people like to call HashMap a collection; strictly speaking, HashMap implements the Map interface, and Map does not extend the Collection interface, so HashMap is not a Collection, even though both belong to the Java Collections Framework). Each stored key-value pair is also called an Entry. Under the hood, HashMap is built from an array plus linked lists, which already hints at how it works: a hash algorithm computes an index into the array from the Key, and the Entry is stored at that slot. HashMap is a hash-table-based implementation of the Map interface. It provides all of the optional map operations and permits null values and a null key. It makes no guarantees about the order of the mapping, and in particular it does not guarantee that the order will stay constant over time. (Hashtable, which is often compared with it, is roughly equivalent except that Hashtable is synchronized and rejects null, while HashMap is unsynchronized and allows it.) A newly created HashMap starts out empty: its internal table is null until the first element is inserted, as shown below.
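For reference, this is the table field declared in the JDK 1.8 HashMap source; its comment also spells out the power-of-two length invariant that the rest of this article relies on:

/**
 * The table, initialized on first use, and resized as
 * necessary. When allocated, length is always a power of two.
 */
transient Node<K,V>[] table;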
• The Hash function
As mentioned in the introduction, we use a Hash algorithm to derive the index from the Key; in the source code this is the hash function. How do we implement a uniformly distributed Hash function? In other words, what operation do we perform on the Key's HashCode value to map it to a storage location? Modulo is the first thing that comes to mind: take the Key's HashCode modulo the capacity to get the storage location. However, that is not what happens, because although modulo is simple, it is relatively inefficient, so the designers use bitwise operations to achieve an efficient Hash algorithm. The implementation from the source code is pasted below.
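For reference, the hash() method in the JDK 1.8 source XORs the high 16 bits of the hashCode into the low bits so that the high bits also influence the bucket index; the index itself is then taken inside putVal() with a bitwise AND:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// inside putVal(...), the bucket index is then computed as:
// i = (n - 1) & hash     // n is the table length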
A simplified form of this index calculation can be written as follows:
index = HashCode(Key) & (Length - 1)
For example, suppose we use "book" as the Key. The first step is to compute the hash value of "book" (you can verify this yourself by printing "book".hashCode()); the result is 3029737 in decimal, which is 1011100011101011101001 in binary. The default initial capacity is 16 (Length in the formula above), so Length - 1 is 15, or 1111 in binary. Then 1011100011101011101001 & 1111 = 1001, which is 9 in decimal, so the index is 9. Notice that the final result of this Hash algorithm depends entirely on the last few bits of the Key's HashCode value.
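If you want to check the arithmetic above yourself, a tiny stand-alone snippet (the class name here is just for illustration) does the trick:

public class IndexDemo {
    public static void main(String[] args) {
        int capacity = 16;                 // default initial capacity
        int hashCode = "book".hashCode();  // 3029737

        // simplified index calculation: hashCode & (capacity - 1)
        int index = hashCode & (capacity - 1);

        System.out.println(Integer.toBinaryString(hashCode)); // 1011100011101011101001
        System.out.println(index);                            // 9
    }
}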
• Initial capacity and load factor
The default initial capacity of a HashMap is 16 and the default load factor is 0.75. Note that if the initial capacity is set too high or the load factor too low, the performance of HashMap iteration deteriorates. The core of a HashMap is a hash table, and the initial capacity and load factor describe that table: the capacity is the number of buckets, the initial capacity is simply the capacity at the time the table is created, and the load factor is a measure of how full the table is allowed to get before its capacity is automatically increased. When the number of stored entries exceeds the product of the load factor and the current capacity, the table is rehashed (that is, the internal data structures are rebuilt, which amounts to an expansion; whether the expansion is automatic or triggered manually, the size after each expansion is always a power of 2). Here I'll post these two things from the source code.
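From java.util.HashMap in JDK 1.8:

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;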
If you think about it, why choose 16 to start with? This choice serves the Hash algorithm that maps a Key to an index. Let's repeat the Hash calculation, but change the capacity from 16 to 10, so Length - 1 is 9, or 1001 in binary: 1011100011101011101001 & 1001 = 1001. That looks fine on its own, so let's change the HashCode. For example, 1011100011101011101011 & 1001 = 1001, still 1001. Not convinced? Try yet another one: 1011100011101011101111 & 1001 is 1001 again. See the pattern? With a length of 10, some index values become much more likely while others (0111 in this example, and any index with the masked-out bits set) can never occur at all, which violates the uniform-distribution principle a hash algorithm should satisfy. If instead we keep the length a power of 2, the mask is all ones and the index is simply the value of the last few bits of the HashCode; as long as the input HashCodes themselves are evenly distributed, the results of our Hash calculation will be evenly distributed too.
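A quick, self-contained experiment (illustrative code, not from the HashMap source; the names are made up) makes the skew visible: with mask 1001 only buckets 0, 1, 8 and 9 are ever hit, while mask 1111 reaches all sixteen.

import java.util.Random;
import java.util.TreeMap;

public class MaskDistribution {
    public static void main(String[] args) {
        Random random = new Random(42);
        TreeMap<Integer, Integer> powerOfTwoHits = new TreeMap<>();
        TreeMap<Integer, Integer> nonPowerOfTwoHits = new TreeMap<>();

        for (int i = 0; i < 100_000; i++) {
            int hash = random.nextInt();
            // length 16 -> mask 15 (1111): every index 0..15 is reachable
            powerOfTwoHits.merge(hash & 15, 1, Integer::sum);
            // length 10 -> mask 9 (1001): only indices 0, 1, 8, 9 can ever occur
            nonPowerOfTwoHits.merge(hash & 9, 1, Integer::sum);
        }

        System.out.println("mask 1111 buckets hit: " + powerOfTwoHits.keySet());
        System.out.println("mask 1001 buckets hit: " + nonPowerOfTwoHits.keySet());
    }
}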
• The underlying structure as seen through put and get
Suppose we call put("HHH", 1), meaning we want to insert an element whose Key is "HHH". The hash function determines the insertion position of this Entry; if the calculated index is 2, the result is as follows:
However, the length of a HashMap is limited. As more and more entries are inserted, even a perfect hash function will inevitably produce index conflicts, like the following.
What do we do then? Expand immediately? What if the HashMap holds only one Entry and a conflict already occurs? Of course we don't simply expand; instead, conflicts are resolved with linked lists. Each element of the HashMap array is not only an Entry object but also the head of a linked list, and each Entry points to the next node through its next reference. When a new Entry maps to a conflicting array slot, we simply insert it into the corresponding linked list, as shown below.
Note that the new Entry node is inserted at the head of the linked list ("head insertion"); the reason will be explained in the get section below. Now that we know the general flow of put, let's talk about get.
When using get, we first take the Key, apply the same hash mapping, and obtain the corresponding index. The overall access flow is similar to put, so I won't repeat it here; instead let's focus on what happens when there is a conflict. In that case we have to walk down the list from the head, one node at a time. Suppose the Key we're looking for is "HHH".
Entry6 has the Key "banana", which does not match; the next node, Entry1, has the Key "HHH", and the lookup succeeds. This also explains why new entries are placed at the head: the designers assumed that recently inserted entries are more likely to be looked up.
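To make the put/get flow concrete, here is a minimal, illustrative sketch of the array-plus-linked-list structure with head insertion. It is deliberately simplified (fixed capacity, no resizing, no null keys, no treeification) and is not the JDK implementation:

public class SimpleChainedMap<K, V> {

    private static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;          // link to the next entry in the same bucket

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = new Entry[16];    // fixed capacity for the sketch

    private int indexFor(K key) {
        return key.hashCode() & (table.length - 1);        // hash & (Length - 1)
    }

    public void put(K key, V value) {
        int index = indexFor(key);
        // walk the bucket's list; overwrite the value if the key already exists
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        // head insertion: the new entry becomes the head of the bucket's list
        table[index] = new Entry<>(key, value, table[index]);
    }

    public V get(K key) {
        int index = indexFor(key);
        // traverse from the head of the list until the key matches
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleChainedMap<String, Integer> map = new SimpleChainedMap<>();
        map.put("HHH", 1);
        map.put("banana", 2);
        System.out.println(map.get("HHH"));    // 1
        System.out.println(map.get("banana")); // 2
    }
}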
• HashMap under high concurrency
To understand this, we first need to understand Rehash, which we mentioned earlier: it is how a HashMap expands. The capacity of a HashMap is finite, and as the map approaches saturation, the probability of index conflicts for newly mapped Keys gradually increases. The expansion is triggered when HashMap.Size >= Capacity * LoadFactor, where LoadFactor defaults to 0.75f. The expansion procedure is: first create a new, empty Entry array twice the length of the original; then traverse the original array, recompute the index of every Entry, and map it into the new array. Let me paste resize() from the source code.
/**
* Initializes or doubles table size. If null, allocates in
* accord with initial capacity target held in field threshold.
* Otherwise, because we are using power-of-two expansion, the
* elements from each bin must either stay at same index, or move
* with a power of two offset in the new table.
*
* @return the table
*/
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
If you read the code carefully, you can see the traversal process just described. However, HashMap is not a perfect design, because everything so far assumes a single thread, and HashMap is not thread-safe.
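One detail worth highlighting in the code above: because the new capacity is exactly double the old one, each Entry either stays at its old index j or moves to j + oldCap, decided by the single bit (e.hash & oldCap). A tiny, illustrative check of that arithmetic:

public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16;
        int newCap = oldCap << 1;   // 32

        int hashLo = 0b0_1001;      // the oldCap bit (value 16) is 0 -> "lo" list
        int hashHi = 0b1_1001;      // the oldCap bit (value 16) is 1 -> "hi" list

        System.out.println(hashLo & (oldCap - 1)); // old index 9
        System.out.println(hashLo & (newCap - 1)); // new index 9  (stays at j)
        System.out.println(hashHi & (oldCap - 1)); // old index 9
        System.out.println(hashHi & (newCap - 1)); // new index 25 (moves to j + oldCap)
    }
}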
Now let's look at what can happen to a HashMap under concurrency. Suppose a HashMap has reached the point where a resize is required, and two threads, A and B, call put on it at the same time, as follows.
At this point, both threads have reached the ReHash step. Let’s review the code for ReHash:
If thread B is traversing the Entry3 object, the thread is suspended as soon as it completes the line in the red box. For thread B:
e = Entry3
next = Entry2
Thread A proceeds with its rehash unimpeded, and when it finishes, the result looks like this (e and next in the figure represent thread B's two references):
Up to this point, everything looks fine. Thread B then resumes and continues its own ReHash, still holding the following state:
e = Entry3
next = Entry2
Repeating the steps from this state eventually leads to the following situation.
At this point the problem has still not surfaced directly. But when get is later called with a nonexistent Key whose hash maps to index 3, the program falls into an infinite loop, because position 3 now holds a circular linked list!
For this reason, high-concurrency scenarios usually use another class, ConcurrentHashMap, which balances thread safety and performance.
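As a closing illustration (the class and variable names here are my own, not from the article), swapping in ConcurrentHashMap makes the concurrent-put scenario above safe:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentPutDemo {
    public static void main(String[] args) throws InterruptedException {
        // thread-safe alternative to HashMap for concurrent access
        Map<String, Integer> map = new ConcurrentHashMap<>();

        Runnable writer = () -> {
            for (int i = 0; i < 10_000; i++) {
                map.merge("counter", 1, Integer::sum);  // atomic update
            }
        };

        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println(map.get("counter")); // 20000
    }
}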