Most JAVA developers use Maps, and HashMaps in particular. HashMap is a simple and powerful way to store and retrieve data. But how many developers know how a HashMap works internally? A few days ago, I read a large part of the source code of java.util.HashMap (in Java 7 and Java 8) to get a deeper understanding of this fundamental data structure. In this article, I’ll explain the implementation of java.util.HashMap, introduce the new features of the Java 8 implementation, and discuss the performance, memory, and known issues when using a HashMap.
Internal storage
The JAVA HashMap class implements the interface Map<K,V>. The main methods of this interface are:
- V put(K key, V value)
- V get(Object key)
- V remove(Object key)
- boolean containsKey(Object key)
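As a quick illustration, here is a minimal usage sketch of these four methods (the class name, keys, and values are purely illustrative):

import java.util.HashMap;
import java.util.Map;

public class QuickStart {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Alice", 30);                     // V put(K key, V value)
        Integer age = ages.get("Alice");           // V get(Object key)
        boolean known = ages.containsKey("Bob");   // boolean containsKey(Object key)
        ages.remove("Alice");                      // V remove(Object key)
        System.out.println(age + " " + known);
    }
}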
HashMap uses an inner class to store data: Entry<K,V>. This entry is a simple key-value pair with two extra pieces of data:
- A reference to another entry, so that a HashMap can store entries like a singly linked list
- The hash value of the key, which is stored to avoid recomputing the hash every time the HashMap needs it
This is part of the Entry implementation in JAVA 7:
static class Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
int hash;
...
}
A HashMap stores data in multiple singly linked lists of entries (also called buckets or bins). All the lists are registered in an array of Entry (an Entry<K,V>[] array), and the default capacity of this internal array is 16.
The figure above shows the internal storage of a HashMap instance with an array of nullable entries. Each Entry can be linked to another Entry to form a linked list.
All keys with the same hash value are placed in the same linked list (bucket). Keys with different hash values can end up in the same bucket.
When the user calls put(K key, V value) or get(Object key), the function computes the index of the bucket in which the Entry should be. The function then iterates through the list to look for an Entry with the same key (using the key’s equals() function).
In the case of get(), the function returns the value associated with the entry (if the entry exists).
In the case of put(K key, V value), if an entry with the same key already exists, the function replaces its value with the new value; otherwise it creates a new entry (from the key and value in the arguments) at the head of the singly linked list.
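To make that lookup concrete, here is a simplified sketch of it, not the actual JDK code: it assumes it lives inside a HashMap-like class that has a table field of type Entry<K,V>[] and the hash() and indexFor() helpers shown further below.

// Simplified, illustrative lookup: iterate the bucket's linked list until the key matches
V getValue(Object key) {
    int h = hash(key);                      // rehashed hash code of the key
    int index = indexFor(h, table.length);  // bucket index in the internal array
    for (Entry<K, V> e = table[index]; e != null; e = e.next) {
        // compare the stored hash first, then fall back to equals() on the key
        if (e.hash == h && (e.key == key || (key != null && key.equals(e.key)))) {
            return e.value;
        }
    }
    return null; // no entry with this key in this bucket
}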
The index for this bucket (linked list) is generated by Map in 3 steps:
- It first retrieves the hash code of the key.
- It rehashes the hash code to prevent a bad hash function of the key from putting all the data in the same index (bucket) of the internal array
- It takes the rehashed hash code and bit-masks it with the length of the array minus 1. This operation ensures that the index cannot be greater than the size of the array. You can think of it as a computationally optimized modulo function.
This is the JAVA 7 and 8 source code for handling indexes:
// the "rehash" function in JAVA 7 that takes the hashcode of the key
static int hash(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
// the "rehash" function in JAVA 8 that directly takes the key
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
// the function that returns the index from the rehashed hash
static int indexFor(int h, int length) {
return h & (length-1);
}
To work efficiently, the size of the internal array needs to be a power of two. Let’s see why.
Assume the array size is 17 and the mask value is 16 (size - 1). The binary representation of 16 is 0…010000, so for any hash value H, the index generated with the bitwise formula “H AND 16” will be either 16 or 0. This means that an array of size 17 would only use two buckets: the one at index 0 and the one at index 16, which is very inefficient…
But if you now take a size that is a power of two, like 16, the bitwise index formula is “H AND 15”. The binary representation of 15 is 0…001111, so the index formula can output values from 0 to 15, and the array of size 16 is fully used. For example:
- If H = 952, its binary representation is 0…0 1110111000, the associated index is 0…0 1000 = 8
- If H = 1576, its binary representation is 0…0 11000101000, the associated index is 0…0 1000 = 8
- If H = 12356146, its binary representation is 0…0 101111001000101000110010, the associated index is 0…0 0010 = 2
- If H = 59843, its binary representation is 0…0 1110100111000011, the associated index is 0…0 0011 = 3
That’s why the array size is a power of two. This mechanism is transparent to the developer: if you ask for a HashMap of size 37, the Map automatically picks the next power of 2 after 37 (64) for the size of its internal array.
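As a quick check of the bit-masking formula, here is a small sketch; the hash values are the ones from the examples above, and indexFor mirrors the JAVA 7 helper shown earlier:

public class IndexForDemo {
    // same bit-masking used by HashMap: only works as a modulo when length is a power of two
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        System.out.println(indexFor(952, length));      // 8
        System.out.println(indexFor(1576, length));     // 8
        System.out.println(indexFor(12356146, length)); // 2
        System.out.println(indexFor(59843, length));    // 3
        // with a non-power-of-two size of 17, the mask is 16 and only indexes 0 and 16 are possible
        System.out.println(952 & 16);  // 16
        System.out.println(1576 & 16); // 0
    }
}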
Automatic resizing
After getting the index, the function (get(), put(), or remove()) visits/iterates the associated linked list to see whether an entry exists for the given key. Without modification, this mechanism could cause performance problems, because the function would need to traverse the whole list to see whether an entry exists. Suppose the internal array keeps its default size (16) and you need to store 2 million values. In the best case, each linked list would hold 125,000 entries (2 million / 16), so each get(), remove(), and put() would cost 125,000 iterations/operations. To avoid this, a HashMap can grow its internal array so that its linked lists stay very short.
When creating a HashMap, you can use the following constructors to specify the initial size and loadFactor:
public HashMap(int initialCapacity, float loadFactor)
If no arguments are specified, the default initialCapacity is 16 and the default loadFactor is 0.75. initialCapacity represents the size of the internal array of linked lists.
Every time a new key/value is added to the Map with put(…), the function checks whether it needs to increase the size of the internal array. To do this, the map stores two pieces of data:
- Map size: the number of entries in the HashMap. This value is updated each time an entry is added or removed.
- Threshold: equal to (size of the internal array) * loadFactor; it is refreshed after each resize of the internal array.
Before adding a new entry, put(…) checks whether the size is greater than the threshold. If so, it creates a new array twice the size of the old one. Because the size of the new array has changed, the indexing function (which returns the bitwise operation “hash(key) AND (sizeOfArray - 1)”) changes. So resizing the array creates twice as many buckets (i.e. linked lists) and redistributes all existing entries into the buckets (the old ones and the newly created ones).
The goal of this resize operation is to decrease the size of the linked lists so that the time cost of put(), remove(), and get() stays low. After a resize, all entries whose keys have the same hash stay in the same bucket, but two entries with different key hashes that were in the same bucket before may not be in the same bucket afterwards.
The figure shows the representation before and after resizing the internal array. Before the resize, getting Entry E requires traversing a list of 5 elements. After the resize, the same get() only iterates over a list of 2 elements; get() is twice as fast after the resize!
Note: HashMap only increases the size of the internal array, and it provides no method to reduce it.
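To illustrate the growth check described above, here is a simplified sketch of the threshold logic. It is not the JDK source; the class, field, and method names are illustrative, and the bucket handling is elided.

// Simplified sketch of HashMap's growth decision (illustrative, not the JDK source)
class SimplifiedMap<K, V> {
    Object[] table = new Object[16];   // internal array of buckets
    float loadFactor = 0.75f;
    int threshold = (int) (table.length * loadFactor); // 12 with the default settings
    int size;                          // number of stored entries

    void putInternal(K key, V value) {
        if (size >= threshold) {
            resize(table.length * 2);  // double the number of buckets
        }
        // ... compute the index, then add/replace the entry in its bucket ...
        size++;
    }

    void resize(int newCapacity) {
        Object[] newTable = new Object[newCapacity];
        // ... re-index every existing entry into newTable using hash & (newCapacity - 1) ...
        table = newTable;
        threshold = (int) (newCapacity * loadFactor);
    }
}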
Thread safety
If you already know HashMap, you know it is not thread-safe, but why? Imagine, for example, that you have a writer thread that only puts new data into the Map and a reader thread that only reads data from the Map: why shouldn’t that work?
Because during automatic resizing, if a thread tries to put or get an object, the map may use the old index values and will not find the new bucket in which the entry now resides.
The worst case is when two threads put data at the same time and the two put() calls resize the Map simultaneously. Since both threads modify the linked lists at the same time, the Map can end up with an inner loop in one of its linked lists. If you then try to get data from a list with an inner loop, get() never ends.
The Hashtable implementation is thread-safe and prevents this situation. But since all the CRUD methods are synchronized, this implementation is very slow. For example, if thread 1 calls get(key1), thread 2 calls get(key2), and thread 3 calls get(key3), only one thread at a time can get its value, even though all three could have accessed the data at the same time.
Since JAVA 5, a smarter thread-safe HashMap implementation exists: ConcurrentHashMap. Only the buckets are synchronized, so multiple threads can get(), remove(), or put() data at the same time, as long as that does not imply accessing the same bucket or resizing the internal array. It is much better to use this implementation in a multithreaded application.
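Here is a minimal usage sketch of ConcurrentHashMap; the key name, counts, and thread setup are only for illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counters = new ConcurrentHashMap<>();

        // several writer threads can safely update the map concurrently
        Runnable writer = () -> {
            for (int i = 0; i < 10_000; i++) {
                counters.merge("hits", 1, Integer::sum); // atomic update, no external locking
            }
        };

        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        System.out.println(counters.get("hits")); // 20000
    }
}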
Key immutability
Why are Strings and Integers a good choice for HashMap keys? Mainly because they are immutable! If you choose to create your own key class and do not make it immutable, you may lose data inside the HashMap.
Look at the following use case:
- You have a key with an internal value of “1”
- You use this key to place an object in the HashMap
- The HashMap generates a hash from the key’s hashCode() (i.e. from “1”)
- The Map stores this hash in the newly created Entry
- You change the internal value of the key to “2”
- The hash value of the key has changed, but the HashMap does not know it (because the old hash value is stored in the entry)
- You try to get the object with the modified key
- The map computes the new hash of your key (so from “2”) to find out which linked list (bucket) the entry is in
- Case 1: since you modified your key, the map looks for the entry in the wrong bucket and does not find it
- Case 2: luckily, the modified key generates the same bucket as the old key. The map then traverses the linked list to find an entry with the same key. But to find the key, the map first compares the hash values and then calls equals(). Since your modified key does not have the same hash as the old one (stored in the entry), the map will not find the entry in the linked list.
Here is a concrete example in Java. I put two key-value pairs in a Map, I modify the first key, and then try to get both values. Only the second value is returned; the first value is “lost” in the HashMap:
public class MutableKeyTest {
public static void main(String[] args) {
class MyKey {
Integer i;
public void setI(Integer i) {
this.i = i;
}
public MyKey(Integer i) {
this.i = i;
}
@Override
public int hashCode() {
return i;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof MyKey) {
return i.equals(((MyKey) obj).i);
} else
return false;
}
}
Map<MyKey, String> myMap = new HashMap<>();
MyKey key1 = new MyKey(1);
MyKey key2 = new MyKey(2);
myMap.put(key1, "test " + 1);
myMap.put(key2, "test " + 2);
// modifying key1
key1.setI(3);
String test1 = myMap.get(key1);
String test2 = myMap.get(key2);
System.out.println("test1= " + test1 + " test2="+ test2); }}Copy the code
The output is “test1= null test2=test 2”. As expected, the Map cannot retrieve the string associated with key 1 once the key has been modified.
JAVA 8 improvements
The internal representation of the HashMap has changed a lot in JAVA 8. Indeed, the JAVA 7 implementation takes 1K lines of code while the JAVA 8 implementation takes 2K lines. Most of what I said earlier is still true, except for the linked lists of entries. In JAVA 8, you still have an array, but it now stores Nodes that contain exactly the same information as Entries and are therefore also linked lists:
Here is part of the Node implementation in JAVA 8:
static class Node<K,V> implements Map.Entry<K,V> {
final int hash;
final K key;
V value;
Node<K,V> next;
}
So what is the big difference from JAVA 7? Nodes can be extended to TreeNodes. A TreeNode is a red-black tree structure that stores more information, so that it can add, remove, or get an element in O(log(n)).
For your information, here is the exhaustive list of the data stored in a TreeNode:
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
final int hash; // inherited from Node<K,V>
final K key; // inherited from Node<K,V>
V value; // inherited from Node<K,V>
Node<K,V> next; // inherited from Node<K,V>
Entry<K,V> before, after; // inherited from LinkedHashMap.Entry<K,V>
TreeNode<K,V> parent;
TreeNode<K,V> left;
TreeNode<K,V> right;
TreeNode<K,V> prev;
boolean red;
}
Red-black trees are self-balancing binary search trees. Their internal mechanics guarantee that their depth is always in log(n), no matter how many nodes are added or removed. The main advantage of using these trees is that when many entries end up in the same index (bucket) of the internal table, a search in the tree costs O(log(n)), whereas with a linked list it costs O(n).
As you can see, the tree takes up more space than the linked list.
By inheritance, the internal table can contain both Nodes (linked lists) and TreeNodes (red-black trees). Oracle decided to use both data structures according to the following rules:
- If there are more than 8 nodes for a given index (bucket) in the internal table, the linked list is transformed into a red-black tree
- If a given index (bucket) in the internal table contains fewer than 6 nodes, the tree is transformed back into a linked list
This figure shows the internal array of a JAVA 8 HashMap that contains both trees (in bucket 0) and linked lists (in buckets 1, 2, and 3). Bucket 0 is a tree because it has more than 8 nodes.
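For reference, these thresholds are defined as constants in the JAVA 8 java.util.HashMap source; the values below match OpenJDK 8 (the inline comments are mine):

// Thresholds defined in java.util.HashMap (values as in OpenJDK 8)
static final int TREEIFY_THRESHOLD = 8;     // list -> tree when a bucket exceeds this size
static final int UNTREEIFY_THRESHOLD = 6;   // tree -> list when a bucket shrinks below this size
static final int MIN_TREEIFY_CAPACITY = 64; // buckets are only treeified once the table has at least 64 slots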
Memory overhead
JAVA 7
There are memory costs associated with using a HashMap. In JAVA 7, a HashMap wraps key-value pairs in Entries. An entry has:
- A reference to the next entry
- A precomputed hash (integer)
- A reference to the key
- A reference to the value
Moreover, a JAVA 7 HashMap uses an internal array of Entry. Assuming a JAVA 7 HashMap contains N elements and its internal array has a capacity CAPACITY, the additional memory cost is approximately:
sizeOf(integer) * N + sizeOf(reference) * (3*N + CAPACITY)
Where:
- The size of an integer is 4 bytes
- The size of a reference depends on the JVM/OS/processor, but is usually 4 bytes
Which means the cost is usually 16 * N + 4 * CAPACITY bytes.
Note: After Map auto-resizing, CAPACITY of the internal array is equal to the next power of 2 after N.
Note: Since JAVA 7, the HashMap class is lazily initialized. This means that even if you allocate a HashMap, the internal array of entries (which costs 4 * CAPACITY bytes) will not be allocated in memory until the first call to put().
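As a worked example under this approximation: a JAVA 7 HashMap holding N = 1,000,000 entries would have CAPACITY = 2^20 = 1,048,576, so the overhead would be roughly 16 * 1,000,000 + 4 * 1,048,576 ≈ 20 MB, on top of the memory used by the keys and values themselves.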
JAVA 8
With the JAVA 8 implementation, it becomes a little more complicated to compute the memory usage, because a Node can contain the same data as an Entry, or the same data plus 6 references and a boolean if it is a TreeNode.
If all Nodes are only Nodes, the memory consumption of a JAVA 8 HashMap is the same as that of a JAVA 7 HashMap.
If all nodes are TreeNodes, the memory consumption of a JAVA 8 HashMap becomes:
N * sizeOf(integer) + N * sizeOf(boolean) + sizeOf(reference)* (9*N+CAPACITY )
In most standard JVMs, this equals 44 * N + 4 * CAPACITY bytes.
Performance issues
Skewed HashMap versus well-balanced HashMap
In the best case, get() and put() have O(1) time complexity. But if you don’t take care of the hash function of your keys, you can end up with very slow put() and get() calls. Good performance of put() and get() depends on how the data is partitioned into the different indexes of the internal array (the buckets). If the hash function of your keys is badly designed, you get a skewed partition (regardless of the size of the internal array). All the put() and get() calls that use the biggest linked lists are slow because they need to iterate over the entire lists. In the worst case (if most of the data ends up in the same bucket), you can end up with O(n) time complexity.
Here is a visual example. The first image shows a skewed HashMap and the second one a well-balanced HashMap.
In the case of the skewed HashMap, get()/put() operations on bucket 0 are costly: getting Entry K takes 6 iterations.
With the well-balanced HashMap, getting Entry K takes only 3 iterations. Both HashMaps store the same amount of data and have the same internal array size. The only difference is the hash(key) function that distributes the entries into the buckets.
Here is an extreme example in JAVA where I create a hash function that puts all the data into the same bucket and then add 2 million elements.
public class Test {
public static void main(String[] args) {
class MyKey {
Integer i;
public MyKey(Integer i) {
this.i = i;
}
@Override
public int hashCode() {
return 1; // every key ends up in the same bucket
}
@Override
public boolean equals(Object obj) { ... }
}
Date begin = new Date();
Map<MyKey, String> myMap = new HashMap<>(2_500_000, 1);
for (int i = 0; i < 2_000_000; i++) {
myMap.put(new MyKey(i), "test " + i);
}
Date end = new Date();
System.out.println("Duration (ms) " + (end.getTime() - begin.getTime()));
}
}
On my Core i5-2500K @ 3.6 GHz, with Java 8u40, this takes more than 45 minutes (I stopped the process after 45 minutes).
Now, if I run the same code but use the following hash function instead:
@Override
public int hashCode() {
int key = 2097152 - 1;
return key + 2097152 * i;
}
it takes 46 seconds, which is much better! This hash function distributes the data better than the previous one, so the put() calls are faster.
And if I run the same code with the following hash function, which provides an even better distribution:
@Override
public int hashCode() {
return i;
}
Now it takes 2 seconds.
I hope you realize how important the hash function is. If the same tests were run on JAVA 7, the results for the first and second cases would be even worse (since the time complexity of put() is O(n) in JAVA 7 versus O(log(n)) in JAVA 8).
When using a HashMap, you need to find a hash function for your keys that spreads the keys into as many buckets as possible. To do that, you need to avoid hash collisions. String is a good key type because it has a good hash function. Integer is also good, because its hash code is its own value.
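For a custom key class, a common approach is to make the key immutable and base hashCode() on the same fields as equals(). Here is a small sketch; the class and field names are hypothetical, and Objects.hash() is just one standard way to combine the field hashes:

import java.util.Objects;

// Hypothetical immutable key: both fields are final and drive hashCode()/equals()
final class OrderKey {
    private final String customerId;
    private final long orderNumber;

    OrderKey(String customerId, long orderNumber) {
        this.customerId = customerId;
        this.orderNumber = orderNumber;
    }

    @Override
    public int hashCode() {
        return Objects.hash(customerId, orderNumber); // combines the field hashes
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof OrderKey)) {
            return false;
        }
        OrderKey other = (OrderKey) obj;
        return orderNumber == other.orderNumber && Objects.equals(customerId, other.customerId);
    }
}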
Resizing overhead
If you need to store a large amount of data, create a HashMap with an initial capacity close to the expected capacity.
If you don’t, the Map takes the default size of 16 with a loadFactor of 0.75. The first 11 put() calls will be very fast, but the 12th (16 * 0.75) will recreate a new internal array (with its associated linked lists/trees) with a new capacity of 32. The 13th to 23rd put() calls will be fast, but the 24th (32 * 0.75) will again recreate a costly new representation that doubles the size of the internal array. The internal resize operation will then happen at the 48th, 96th, 192nd… put() call. At low volume the full rebuild of the internal array is fast, but at high volume it can take seconds to minutes. By initially setting your expected size, you can avoid these costly operations.
But there is a drawback: if you set a very large array size, such as 2^28, while you only use 2^26 buckets in your array, you will waste a lot of memory (about 2^30 bytes, i.e. roughly 1 GB, in this case).
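Here is a small pre-sizing sketch, assuming you know roughly how many entries you will store; the expectedSize value and the capacity formula are illustrative (the requested capacity must exceed expectedSize / loadFactor to avoid any resize):

import java.util.HashMap;
import java.util.Map;

public class PresizedMapDemo {
    public static void main(String[] args) {
        int expectedSize = 2_000_000;

        // Ask for enough capacity so that capacity * loadFactor (0.75)
        // stays above the expected number of entries.
        int initialCapacity = (int) (expectedSize / 0.75f) + 1;
        Map<Integer, String> presized = new HashMap<>(initialCapacity);

        // With this initial capacity, no internal resize should happen while filling the map.
        for (int i = 0; i < expectedSize; i++) {
            presized.put(i, "value " + i);
        }
        System.out.println(presized.size());
    }
}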
Conclusion
For simple use cases, you don’t need to know how HashMap works, because you won’t see the difference between O(1), O(n), and O(log(n)) operations. But it’s always better to understand the underlying mechanics of one of the most commonly used data structures. Besides, it’s a typical interview question for Java developer positions.
At high volume, it becomes important to know how it works and to understand the importance of the keys’ hash function.
Hopefully this article has given you some insight into the implementation of HashMap.