I am Kite, author of the public account "ancient kite" — a programmer aiming for both depth and breadth, a rural code farmer who meant to write poetry but ended up writing code! My articles are collected in JavaNewBee, together with a Java back-end knowledge map covering the path from beginner to expert.
This is the text version of the previous post, the fun comic-style HashMap — a version anyone can follow. Many readers said the comic version was more fun and easier to understand, but a comic can only sketch the big picture; it can't draw every detail.
For the real details, you need this article. In fact, this article was finished first; the comic version grew out of the many diagrams I drew for it. It runs to more than 7,000 words, so take your time with it.
The JDK version for this article is 1.8.
In Java, the most commonly used data types are the 8 primitive types with their wrapper classes and String, followed by ArrayList and HashMap. A HashMap stores key-value pairs, is fast to store into and retrieve from, and is a high-performance data structure that every Java developer has used.
Moreover, HashMap is cleverly designed, and its structure and principles come up constantly in interviews. It is full of algorithms and design ideas worth learning from: the hash algorithm, the zipper method (separate chaining), and the red-black tree.
I thought for a long time about how to explain HashMap simply and clearly. Here is how I understand it: the best way to grasp something is to understand the overall structure first, and only then dive into the details. So let's start with structure.
Start with structure
Let me use myself as an example. I have no sense of direction whatsoever. I have lived in Beijing for years, yet only in Zhongguancun can I tell north from south. Everywhere else — even around the office I commute to every day — I can only recognize the one route home; approach my own neighborhood from the back instead of the front and I am lost, and any new place has me glued to a map. I used to think: if only I could look down on the city's streets from above, I would never fear losing my way again. Isn't that the "dimensionality-reduction strike" from The Three-Body Problem? Low-dimensional things are much easier to understand from a higher-dimensional vantage point.
The same goes for understanding data structures. Much of the time we are stuck at the level of merely being able to use them, our grasp of the principles fragmented, stumbling around the maze of data structures, desperate for a map or a helicopter.
Let's first take a look at the inheritance diagram of the whole Map family. There is a lot in it, but most of the other classes are rarely used; HashMap is the one we know best.
The description that follows may not be rigorously professional; it is only a brief sketch of the structure of a HashMap.
If you create a HashMap with a capacity of 8, you can think of it as an array of 8 buckets. What we store in the array are elements of type Node, a static inner class defined inside HashMap.
When data is inserted (that is, when the put method is called), entries are not simply appended one after another in order. HashMap uses a hashing algorithm to choose the index. But hashing has a failure mode called a hash collision: two different keys produce the same hash value. To cope with this, HashMap uses the zipper method (separate chaining) — like the blue linked list in the diagram — so that different keys with the same hash value can land in the same bucket without overwriting one another.
However, as more and more elements are inserted, the probability of collisions rises, and the list in a bucket can grow longer and longer, to the point where it hurts performance. HashMap therefore sets a threshold of 8: when the number of nodes in a bucket's list exceeds 8 (and the table already has at least 64 buckets — otherwise the table is simply resized instead), the list is converted into a red-black tree.
The description above is the overall structure of the HashMap, and a blueprint for digging into the details. We'll pick out a few key points and explain them one by one — from the whole to the details, a dimensionality-reduction strike on the HashMap.
What follows explains in detail why the structure is designed this way, and how it evolves from a plain array to in-bucket linked lists and finally to red-black trees.
Recognize a few key concepts
Storage containers
The contents of a HashMap are stored in an array, defined as follows:
transient Node<K,V>[] table;
The Node type
table is an array of type Node, a static inner class with hash, key, value, and next fields. When we add a key-value pair with the put method, it is converted into a Node.
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
}
TreeNode
As mentioned earlier, when a bucket's list grows past the treeify threshold of 8, it is converted into a red-black tree built from TreeNode elements — another static inner class defined in HashMap.
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
}
Capacity and default capacity
Capacity is the length of the table array — what we have been calling the number of buckets. HashMap has no dedicated capacity field; before the table is allocated, the initial capacity is stashed in the threshold field:
int threshold;
The default capacity is 16: if we don't specify a size at initialization, the table length will be 16. We can also specify an initial size ourselves, and HashMap requires the final capacity to be a power of two.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
Number of elements
Capacity is the number of buckets specified, and size is how many key-value pairs are actually stored in the HashMap.
transient int size;
The maximum capacity
The length of a table is also limited and cannot be infinite. HashMap specifies a maximum length of 2 to the 30th power.
static final int MAXIMUM_CAPACITY = 1 << 30;
The load factor
The load factor is a coefficient that, multiplied by the capacity, yields the expansion threshold. It defaults to 0.75 and generally should not be changed.
final float loadFactor;
Capacity threshold
Threshold = capacity × load factor. If the current HashMap capacity is 16 and the load factor is the default 0.75, expansion is triggered once size exceeds 16 × 0.75 = 12.
Initialize the HashMap
A HashMap must be created before use, and in many cases it is created with the no-argument constructor.
Map<String,String> map = new HashMap<>();
In this case every attribute takes its default value: capacity 16 and load factor 0.75.
Another recommended way to initialize is to supply an initial capacity, say 32.
Map<String,String> map = new HashMap<>(32);
HashMap requires the capacity to be 2 to the nth power, but it does not force every developer to pass such a value. What if we specify an initial size of 7, or 18?
No matter: HashMap has a method that rounds the argument up to the nearest power of two greater than or equal to it. Specify 7 and the actual capacity is 8; specify 18 and the actual capacity is 32.
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}
The tableSizeFor method performs the conversion. The result is assigned to the threshold variable, which at this point holds the initial capacity — the number of buckets mentioned earlier.
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
For example, if the initial argument is 18, tableSizeFor returns 32.
It's an interesting algorithm: give it an initial size of 63 and you get 64; give it 65 and you get 128 — always the smallest power of two greater than or equal to the argument.
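To see the rounding behavior concretely, here is a minimal standalone sketch that copies the tableSizeFor bit tricks (the class name TableSizeForDemo is mine, not the JDK's):

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // A copy of HashMap.tableSizeFor: the smallest power of two >= cap.
    // Each shift-or step smears the highest 1 bit downward until all
    // lower bits are 1, then +1 carries up to the next power of two.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(7));   // 8
        System.out.println(tableSizeFor(18));  // 32
        System.out.println(tableSizeFor(63));  // 64
        System.out.println(tableSizeFor(64));  // 64: already a power of two
        System.out.println(tableSizeFor(65));  // 128
    }
}
```

The initial `cap - 1` is what makes an exact power of two map to itself rather than the next one up.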
Unpacking the core principles through the put method
The put method, which adds a key-value pair, is the most commonly used method and also the most complex process. Adding a key-value pair touches the core principles of HashMap, including the following:
- When is capacity expanded, and by what rule?
- How is the index determined when inserting a key-value pair? A HashMap does not insert sequentially — that would just be a plain array.
- How is key uniqueness guaranteed?
- What happens on a hash collision?
- What is the zipper method?
- How does the list in a single bucket turn into a red-black tree?
Here is the source code of the put method, with my comments inline.
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; // the table array
    Node<K,V> p;     // the node currently at the target index
    int n, i;
    // table is the field: transient Node<K,V>[] table;
    // If it is empty, call resize() to allocate the array
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // n is always a power of 2; (n - 1) & hash determines the index.
    // If that bucket is empty, create a Node and place it there.
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        // The bucket already has a value — this is where the zipper method comes in
        Node<K,V> e; K k;
        // Check key uniqueness: hashes match, and
        // (existing key == key to insert (note the == sign), or key is non-null and key.equals(k))
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            // Same key: remember the node so its value can be overwritten below
            e = p;
        else if (p instanceof TreeNode)
            // The bucket holds a red-black tree: insert into the tree
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // Neither a single matching node nor a tree: it is a linked list
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    // Reached the tail: append a new node
                    p.next = newNode(hash, key, value, null);
                    // If the list length reaches TREEIFY_THRESHOLD, treeify the bucket
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // A node with the same key already exists in the list: it will be overwritten
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // Resize when the actual number of entries exceeds threshold
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
Initialize arrays and expand capacity for the first time
When the put method executes, the first step is to check whether the table array is null or of length zero. If so, this is the first key-value pair ever inserted, and the table must be initialized.
Also, as more and more key-value pairs are added, the HashMap grows — note that size is the actual number of key-value pairs. Once it exceeds the threshold (capacity × load factor, as described above), expansion kicks in.
These two cases are grouped together because first initialization and expansion are both handled by the same method: resize(). Here is my annotated resize():
final Node<K,V>[] resize() {
    // Keep a reference to the old table; its elements are copied to the new array
    Node<K,V>[] oldTab = table;
    // The current capacity is the table's length, not size
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    // The current resize threshold
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // Capacity greater than 0: this is an expansion, not the first initialization
        if (oldCap >= MAXIMUM_CAPACITY) {
            // Cannot exceed the maximum allowed capacity
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            // Double the capacity and the threshold
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0)
        // First initialization with a given capacity, e.g. new HashMap<>(32)
        newCap = oldThr; // the capacity was stashed in threshold
    else {
        // First initialization with no arguments: new HashMap<>()
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        // Compute the threshold for a given initial capacity
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    // Save the new threshold
    threshold = newThr;
    // Create the new, larger array and copy the old elements over
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) { // the bucket is not empty
                oldTab[j] = null; // clear the old slot so it can be reclaimed
                if (e.next == null)
                    // A single node: recompute its index in the new array
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    // Red-black tree nodes are handled separately
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // a linked list: split into low/high lists, preserving order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
First initialization
The put method checks if the table array is empty and initializes it if it is.
if ((tab = table) == null || (n = tab.length) == 0)
n = (tab = resize()).length;
There are two flavors of first initialization: with and without arguments. As covered in the section on HashMap initialization, without arguments the default capacity is 16, i.e. the table length is 16. With an argument, tableSizeFor() determines the actual capacity and a Node array of that length is created.
HashMap.Node<K,V>[] newTab = (HashMap.Node<K,V>[])new HashMap.Node[newCap];
Here newCap is the capacity — the default 16, or the custom value.
An important step in this process is computing and saving the new expansion threshold.
Expansion
In the put method, expansion is triggered once size (the actual number of key-value pairs) exceeds threshold.
// Resize when the actual length is greater than threshold
if (++size > threshold)
resize();
HashMap follows a doubling rule: the capacity after each expansion is twice the capacity before it. After all, the underlying storage is still an array, and Java has no truly dynamic arrays — an array stays exactly as big as it was when allocated — so growing means allocating a new array and copying everything over.
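The interplay of the doubling rule and the threshold can be sketched with a tiny model — this simulates the rule, it is not the JDK code, and the class and method names are mine:

```java
public class GrowthDemo {
    // A model of HashMap's growth rule: the capacity doubles whenever
    // the entry count exceeds capacity * 0.75 (the default load factor).
    static int capacityAfter(int puts) {
        int capacity = 16;                        // default capacity
        int threshold = (int) (capacity * 0.75f); // 12
        int size = 0;
        for (int i = 0; i < puts; i++) {
            if (++size > threshold) {
                capacity <<= 1;   // double the capacity
                threshold <<= 1;  // the threshold doubles with it
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(capacityAfter(12)); // 16: threshold 12 not yet exceeded
        System.out.println(capacityAfter(13)); // 32: the 13th put triggers a resize
        System.out.println(capacityAfter(25)); // 64: the 25th put crosses threshold 24
    }
}
```

The thresholds form the sequence 12, 24, 48, …, so resizes happen on the 13th, 25th, 49th puts and so on.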
The following cases may occur while copying:
- If a node's next is null, it is an ordinary standalone node — neither a bucket list nor a red-black tree. Such a node has its index recomputed and is placed in the new array.
- If it is a red-black tree, the split method is used: the tree is split into two TreeNode lists, and each list whose length is 6 or less is converted back into an ordinary in-bucket linked list; otherwise it is rebuilt as a red-black tree.
- If it is an in-bucket linked list, the list is copied to the new array with its order preserved.
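The list-copying case relies on a neat shortcut: after doubling, each node either stays at its old index j or moves to j + oldCap, decided by the single extra bit (hash & oldCap). A small sketch (all names mine) shows the shortcut agrees with recomputing the index from scratch:

```java
public class SplitDemo {
    // During a resize from oldCap to 2 * oldCap, a node currently at index
    // (hash & (oldCap - 1)) lands either at the same index or at index + oldCap,
    // depending on one extra bit of the hash.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        // These hashes all share bucket 3 when the capacity is 16
        for (int h : new int[]{3, 19, 35, 51}) {
            System.out.println(h + " -> " + newIndex(h, oldCap)
                    + " (recomputed: " + (h & (newCap - 1)) + ")");
        }
    }
}
```

This is why resize only needs to split each bucket's list into a "low" list (stays at j) and a "high" list (moves to j + oldCap).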
Determine insertion point
When we call the put method, the first step is to hash the key to find the drop point — which bucket of the table array to insert into.
The hash algorithm works like this: take the key's hashCode, shift it right by 16 bits, and XOR the shifted value with the original hashCode. This is known as the "perturbation function". The reason for not using hashCode directly is to mix the high bits into the low bits, increasing randomness and reducing hash collisions.
/**
 * Computes the hash value of the key.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
Once we have the hash value, the index is computed as i = (n - 1) & hash, where i is the final index position.
This index formula is used in two scenarios. The first is when the put method inserts a key-value pair. The second is during resize: after the new array is created, each existing node is moved over, and an ordinary node — neither a list nor a red-black tree — has its index in the new array recomputed with the same formula.
The diagrams make this clearer.
HashMap requires the capacity to be 2 to the nth power, and the binary form of such a number is familiar: 2 to the 6th, for instance, is a 1 followed by six zeros — counting from the right, six 0s, then a 1 in the seventh bit. The figure below shows the binary representation of 2 to the 6th.
Now apply n - 1: subtracting one turns every 0 after the leading 1 into a 1, and the leading 1 into a 0. For example, 64 - 1 = 63, whose binary representation looks like this.
The first four lines of the following figure show that when the map's capacity n is 8, 16, 32, or 64, the binary representation of n - 1 is a run of 1s at the low end.
Plugging such a value into the formula (n - 1) & hash determines the index to insert at. If the current capacity is 64 and a key hashes to 99, the index is (64 - 1) & 99 = 63 & 99 = 35, so the key lands at table[35].
Why must a HashMap's capacity be a power of two? Because then n - 1 is all 1s in its low bits, so (n - 1) & hash simply keeps the low bits of the hash: every index from 0 to n - 1 is reachable, and the distribution of indexes is exactly as uniform as the hash values themselves.
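The worked example can be checked in a few lines. The hash method below is a copy of HashMap's perturbation function; the class name is mine:

```java
public class IndexDemo {
    // Copy of HashMap.hash: fold the high 16 bits into the low 16 bits
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 64; // capacity, always a power of two
        System.out.println((n - 1) & 99); // the worked example: 63 & 99 = 35

        // For a real key the index is (n - 1) & hash(key),
        // which always falls within [0, n - 1]
        int i = (n - 1) & hash("moon");
        System.out.println(i >= 0 && i < n); // true
    }
}
```

Because the mask only keeps low bits, the perturbation step above is what lets the high bits of hashCode influence the chosen bucket at all.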
How do I ensure that keys are unique
The same key is not allowed in a HashMap. How to ensure the uniqueness of keys?
if (p.hash == hash &&
    ((k = p.key) == key || (key != null && key.equals(k))))
First, the hash values must be equal — the hash is an int, so the == sign works for that comparison. But that alone is not enough: as hash collisions show, two different keys can end up with the same hash.
So in addition, either the key to be inserted == the existing key at that index, or key.equals(existing key). Note that the two checks are joined by OR: == means the two are literally the same object, while equals determines that two objects have the same content.
If the key is a wrapper of a primitive type, such as Integer, then equal values are guaranteed to be equal and to produce the same hashCode.
String is the most commonly used key type, and we all know that identical strings produce the same hashCode and compare equal with equals.
But what if we use a custom reference type as a key? Say we define a MoonKey class:
import java.util.Objects;

public class MoonKey {
    private String keyTile;

    public String getKeyTile() {
        return keyTile;
    }

    public void setKeyTile(String keyTile) {
        this.keyTile = keyTile;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        MoonKey moonKey = (MoonKey) o;
        return Objects.equals(keyTile, moonKey.keyTile);
    }
}
Then put it into a map twice with the following code. Will size be 1 or 2?
Map<MoonKey, String> m = new HashMap<>();
MoonKey moonKey = new MoonKey();
moonKey.setKeyTile("1");
MoonKey moonKey1 = new MoonKey();
moonKey1.setKeyTile("1");
m.put(moonKey, "1");
m.put(moonKey1, "2");
// hash() here refers to a copy of HashMap's hash method
System.out.println(hash(moonKey));
System.out.println(hash(moonKey1));
System.out.println(m.size());
The answer is 2: moonKey and moonKey1 will not produce the same hash value, because MoonKey does not override hashCode. Without an override, the default hashCode inherited from Object is identity-based, so in practice each object gets its own distinct value.
This is why the rule says: whenever you override equals, override hashCode too. The correct fix is to add a hashCode override:
@Override
public int hashCode() {
    return Objects.hash(keyTile);
}
If two objects are equal according to equals, then calling hashCode on each must return the same integer. That contract is the foundation that keeps keys unique in a HashMap (and elements unique in a HashSet).
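The contract can be demonstrated side by side. Both key classes below are hypothetical names of mine; one overrides only equals, the other overrides both:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyContractDemo {
    // equals() only: identity hashCodes send "equal" keys to different buckets
    static class BrokenKey {
        final String title;
        BrokenKey(String title) { this.title = title; }
        @Override public boolean equals(Object o) {
            return o instanceof BrokenKey && Objects.equals(title, ((BrokenKey) o).title);
        }
        // hashCode() deliberately NOT overridden
    }

    // equals() and hashCode() together: equal keys collapse to one entry
    static class GoodKey {
        final String title;
        GoodKey(String title) { this.title = title; }
        @Override public boolean equals(Object o) {
            return o instanceof GoodKey && Objects.equals(title, ((GoodKey) o).title);
        }
        @Override public int hashCode() { return Objects.hash(title); }
    }

    public static void main(String[] args) {
        Map<BrokenKey, String> broken = new HashMap<>();
        broken.put(new BrokenKey("1"), "a");
        broken.put(new BrokenKey("1"), "b");
        // Two entries pile up even though the keys compare equal
        System.out.println(broken.size()); // 2 (barring an astronomically unlikely identity-hash collision)

        Map<GoodKey, String> good = new HashMap<>();
        good.put(new GoodKey("1"), "a");
        good.put(new GoodKey("1"), "b"); // overwrites "a"
        System.out.println(good.size()); // 1
    }
}
```

With BrokenKey you also can't retrieve an entry with a freshly constructed equal key — the lookup hashes to a different bucket.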
What happens on a hash collision
Equal objects must produce equal hashCodes, but unequal objects can still come out of the hash method with the same value. This is a hash collision. The perturbation algorithm reduces collisions, but it cannot eliminate them.
After a collision, two keys map to the same table index — the same bucket. How can one bucket hold multiple key-value pairs?
Zipper method
As I mentioned at the start of this article, a HashMap is not just a flat array. When a collision happens, it is simply accepted: the so-called zipper method — nothing to do with the zipper on your clothes — pulls a linked list out of the bucket, which is a fitting image.
The Node type introduced in the key-concepts section has a next field designed exactly for this list, as in the figure below: node1, node2, and node3 all fall into the same bucket, so they are chained — node1.next = node2, node2.next = node3, and node3.next = null marks the tail of the list.
When a new element is inserted into a bucket's list, it is appended at the tail, not the head. JDK 1.7 used head insertion, but under head insertion two threads running resize() concurrently can create a cycle in the list, sending the get method into an infinite loop.
Lists are converted to trees
The linked list is not the final answer to collisions; the red-black tree is. When a list's length reaches 8 and another element arrives, it is time to convert the list into a red-black tree (provided the table already has at least 64 buckets; otherwise the table is resized instead). The treeifyBin method does this work.
Red-black trees are used for performance: lookups in them are faster than in a linked list. So why not start with a red-black tree instead of upgrading only after the list exceeds 8 nodes?
First, hash collisions are rare: most of the time a bucket holds a single Node, and even when collisions occur, a bucket rarely accumulates 8 of them. A bucket list reaching length 8 means the HashMap already holds a great many elements, and at that point paying for a red-black tree to improve lookups is worthwhile. Conversely, when the total number of elements is small, a red-black tree would not improve performance, and tree nodes consume considerably more space than list nodes.
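To watch the collision handling work, force it: a key class whose hashCode is a constant sends every entry into the same bucket, yet lookups still succeed through the chain (or tree, once the map is large enough). The class names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Every instance reports the same hashCode, so all keys collide
    static class ClashKey {
        final int id;
        ClashKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof ClashKey && ((ClashKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<ClashKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 20; i++) map.put(new ClashKey(i), i);
        // All 20 entries share one bucket, yet every lookup still works
        System.out.println(map.size());               // 20
        System.out.println(map.get(new ClashKey(7))); // 7
    }
}
```

This is also exactly the degenerate case the treeify rule guards against: with a constant hashCode, lookups fall from O(1) toward O(n) (or O(log n) once treeified).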
The get method
T value = map.get(key);
Getting a value by key, as in the statement above, is the most commonly used operation.
After get is called, the first step is, once again, locating the bucket: apply the hash perturbation function to get the hash value, then compute the index with (n - 1) & hash. This sounds like a repeat of put — and it is, deliberately: if get did not locate the position the same way put did, it could never find the right bucket.
After locating the bucket, there are three cases:
Single node: only one key-value pair in the bucket. As long as no hash collision occurred, this is the most common case in a HashMap — and the ideal one: if every bucket were like this, everything would be perfect.
Linked list: if the bucket holds a list, it is traversed until a node with an equal key is found.
Red-black tree: if the list grew past 8 and the bucket became a red-black tree, the tree's fast O(log n) lookup is used.
Incidentally, the containsKey method is built on the same lookup as get.
The remove method
remove works much like put and get: compute the key's hash, locate the index with (n - 1) & hash, and then act according to the node type.
Single node: the bucket slot is replaced with the deleted node's next, which is null.
Linked list: the next field of the node preceding the deleted one is pointed at the deleted node's next.
Red-black tree: the tree's node-removal routine is called; when the tree shrinks enough (roughly 2 to 6 nodes), the tree structure is simplified back into a linked list.
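A quick illustration of the visible semantics of remove — it returns the old value, or null when the key was absent:

```java
import java.util.HashMap;
import java.util.Map;

public class RemoveDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        // remove() returns the value previously mapped to the key
        System.out.println(map.remove("a"));       // 1
        // Removing an absent key returns null and changes nothing
        System.out.println(map.remove("missing")); // null
        System.out.println(map.containsKey("a"));  // false
        System.out.println(map.size());            // 1
    }
}
```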
Non-thread-safe
HashMap performs no concurrency control, so in multi-threaded, high-concurrency scenarios use ConcurrentHashMap instead. If multiple threads put at the same time and compute the same bucket index, one key-value pair can silently overwrite another.
For example, threads A and B put concurrently. Both determine that the bucket at index 2 is empty. Thread A writes key1 = 1 first; then thread B writes key2 = 2 into the same slot — and thread A is left crying. The bucket at index 2 ends up holding key2 = 2, and the entry thread A stored is lost.
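With ConcurrentHashMap, concurrent puts don't lose entries. A minimal sketch — two threads insert disjoint key ranges, and every entry survives:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        // Two threads insert disjoint key ranges concurrently
        Thread a = new Thread(() -> { for (int i = 0; i < 1000; i++) map.put(i, i); });
        Thread b = new Thread(() -> { for (int i = 1000; i < 2000; i++) map.put(i, i); });
        a.start(); b.start();
        a.join(); b.join();
        // No entries are lost; a plain HashMap gives no such guarantee
        System.out.println(map.size()); // 2000
    }
}
```

Note the same test against a plain HashMap may pass by luck on a given run; the point is that ConcurrentHashMap guarantees it.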
Conclusion
The main advantage of a HashMap is speed, especially for get: locating the index directly makes lookup O(1).
A HashMap is not a simple array structure. When a hash collision occurs, the zipper method hangs a linked list off the bucket; when a list grows past 8 nodes, it is converted into a red-black tree, which keeps performance high.
The capacity of a HashMap must be 2 to the nth power; this is what makes the index computation (n - 1) & hash distribute keys uniformly.
A HashMap expands when the number of key-value pairs exceeds capacity × load factor, doubling the capacity each time. During expansion, single nodes have their index recomputed; red-black tree nodes go through the split method, which decides whether each half stays a tree or reverts to a linked list.
Hold on, strong reader — leave a like first; always reading for free is hard on the body!
I am Kite, of the public account "ancient kite" — a programmer aiming for both depth and breadth, a rural code farmer who meant to write poetry but ended up writing code. Follow me now, or catch up on the history articles later.