HashMap source code implementation analysis

1. Foreword

A HashMap is a container class that implements the Map interface and is built on the principles of a hash table.

We are all familiar with arrays: data structures that occupy a contiguous block of memory (those who have studied C will know this especially well). Because an array is one contiguous chunk of memory, its character is obvious: once you know the index of the element you want, lookup takes O(1), but insertion is hard. Another data structure you probably know is the linked list: a set of nodes connected by (one-way or two-way) pointers. Its memory is not contiguous, so lookup is hard, but insertion and deletion are easy.

Is there a data structure that makes lookup, insertion, and deletion all easy? Yes: the hash table.

In this article, we will discuss:

  • The data structure that implements HashMap
  • How HashMap provides consistent time complexity for get and put operations
  • When a HashMap goes from a single node to a linked list, and from a linked list to a red-black tree
  • Why a custom initial capacity should be given when a HashMap is initialized
  • How a HashMap guarantees that its capacity is always a power of two
  • Why the capacity of a HashMap should always be a power of two
  • How a HashMap computes the hash value
  • Why HashMap is not thread safe

The best way to understand HashMap is to read the source code. This article is based on the JDK 1.8 HashMap source.

2. Basic elements of HashMap

To understand how HashMap works, it is worth singling out the following members of the HashMap source code:

  • DEFAULT_INITIAL_CAPACITY: the default initial capacity, 1 << 4 = 16.
  • MAXIMUM_CAPACITY: the maximum capacity, 1 << 30.
  • DEFAULT_LOAD_FACTOR: the load factor, 0.75 by default. For example, if you define a HashMap with an initial capacity of 16, then as you keep adding elements, an expansion is triggered once the number of elements exceeds 16 * 0.75 = 12.
  • threshold: the threshold for the next expansion operation; threshold = capacity * loadFactor.
  • TREEIFY_THRESHOLD: the threshold for converting a linked list to a red-black tree, 8 by default. When a bucket's list grows beyond this length, treeification is triggered (subject to a second condition; see section 3.1.2).
  • UNTREEIFY_THRESHOLD: the threshold for converting a red-black tree back to a linked list, 6 by default. As the name implies, when a tree shrinks below 6 nodes it is converted back to a linked list.
  • Node<K,V> implements Map.Entry<K,V>: the basic unit HashMap uses to store data, holding hash, key, value, and next fields (see the abridged excerpt after this list).
  • Node<K,V>[] table: the array that stores the Node entries, the lowest layer of HashMap. Each array element can be a single Node, a multi-node linked list, or a multi-node red-black tree.
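For reference, here is the Node class as it appears in the JDK 1.8 source, abridged to its fields and constructor:

    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;    // cached hash of the key, used to pick the bucket
        final K key;
        V value;
        Node<K,V> next;    // next node in this bucket's linked list

        Node(int hash, K key, V value, Node<K,V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
        // equals, hashCode, toString and the getters/setters are omitted here
    }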

You don't need to memorize each of the above; a general impression is enough. These concepts are, however, crucial to understanding HashMap.

3. Main content

3.1 HashMap data structure

The data structure of a HashMap is simple: it is an array of type Node, and each array element can be a single Node, a multi-node linked list, or a multi-node red-black tree. The key questions are: how does such a simple structure make put and get fast? And when does a bucket go from a single node to a linked list, and from a linked list to a red-black tree?

3.1.1 How do put and get in HashMap achieve time complexity between O(1) and O(n)?

We know that finding an element in an array without knowing its index is costly, but with the index it is O(1). HashMap exploits exactly this. For a get operation, a hash value is computed from the key, and (n - 1) & hash gives the index of the bucket the key lives in. In the best case, that index holds exactly one node whose hash and key match, so the time complexity is O(1). When the bucket holds a linked list or a red-black tree, the complexity rises, but thanks to HashMap's optimizations (lists and trees stay short relative to the capacity, because growth triggers a resize), the worst case is O(n) for a list and O(log n) for a tree, and in practice lookups stay well below that.
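In code, the bucket lookup that both get and put rely on boils down to this sketch (illustrative, not a verbatim JDK excerpt; table is the Node array and hash is the spread hash described in section 3.3.2):

    int n = table.length;            // the capacity, always a power of two
    int index = (n - 1) & hash;      // picks the bucket, like hash % n but cheaper
    Node<K,V> first = table[index];  // then walk the list (or search the tree) from here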

For put operations, we know that inserting into an array is expensive because all subsequent elements must be shifted; HashMap cleverly uses linked lists and red-black trees to avoid that cost. In the best case, the target index is empty and insertion is O(1); with a linked list or red-black tree in the bucket the complexity rises, but even in the worst case it is O(n).

3.1.2 When does a HashMap change from a single node to a linked list and from a linked list to a red-black tree?

Single node to linked list is simple: when a new element hashes to an index that already holds a single node, that node grows into a linked list. For a linked list to become a red-black tree, two conditions must both hold (see the abridged source after this list):

  • The list length exceeds TREEIFY_THRESHOLD, which defaults to 8

  • The table capacity has reached at least 64 (MIN_TREEIFY_CAPACITY); otherwise a resize is performed instead

Only when both conditions are met is the list converted to a red-black tree.
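Both conditions are visible in the JDK 1.8 source; here is treeifyBin, abridged:

    final void treeifyBin(Node<K,V>[] tab, int hash) {
        int n, index; Node<K,V> e;
        // While the table is still small, grow it instead of treeifying the bucket
        if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
            resize();
        else if ((e = tab[index = (n - 1) & hash]) != null) {
            // ... replace the Nodes in this bucket with TreeNodes, then treeify ...
        }
    }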

3.2 Why should a custom initial capacity be given when a HashMap is initialized?

It is recommended to specify a HashMap's initial capacity with the constructor HashMap(int initialCapacity).

Ali’s Java Development Manual explains it this way:

  1. [Recommendation] Specify the initial size of a collection when initializing it.

     HashMap should be initialized with HashMap(int initialCapacity), where initialCapacity = (number of elements to store / load factor) + 1. Note that the load factor defaults to 0.75; if you cannot estimate the initial size, use the default capacity of 16.

     Counterexample: a HashMap needs to hold 1024 elements. Because no initial capacity was set, the map is forced to resize 7 times as the elements are added, rebuilding the hash table each time, which seriously hurts performance.

This problem is evident in the HashMap source: every put checks whether the new size exceeds threshold and, if so, doubles the capacity. So if you store a large amount of data into a HashMap without specifying an initial capacity, expansion is triggered over and over again, which is very costly.
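A minimal sketch of sizing a map up front using the manual's formula (expectedSize is a hypothetical variable for the number of entries you plan to store; assumes java.util.HashMap and java.util.Map are imported):

    int expectedSize = 1024;                                     // entries we plan to store
    int initialCapacity = (int) (expectedSize / 0.75f) + 1;      // = 1366, rounded up to 2048 internally
    Map<String, Integer> map = new HashMap<>(initialCapacity);   // no resize while adding the 1024 entries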

As for how to choose the initial capacity and load factor: in everyday development there is rarely a need to tune the load factor; instead, decide the initial capacity from how many entries the HashMap will actually hold. That said, avoiding expansion does not mean the bigger the better: a larger capacity means more memory, and if you never store that much data, the hash values end up spread too thinly across the table.

3.3 How does a HashMap ensure that its capacity is always a power of two?

HashMap uses the method tableSizeFor() to guarantee that whatever capacity you pass in is rounded up to a power of two:

    static final int tableSizeFor(int cap) {
        // Subtracting 1 first means that when cap is already a power of two,
        // the original value is returned rather than double it (e.g. 8 returns 8, not 16)
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

First, let's look at this sequence of operations:

    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;

Suppose n = 01000000 in binary. After n |= n >>> 1, n = 01100000; after n |= n >>> 2, n = 01111000; after n |= n >>> 4, n = 01111111. Together, the five steps above turn every bit below the highest 1 bit of a 32-bit number into 1, so n + 1 is a power of two; for example, 01111111 + 1 = 10000000.
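A quick sanity check (assuming the tableSizeFor method shown above is visible to call):

    tableSizeFor(8);      // -> 8    (already a power of two, returned unchanged)
    tableSizeFor(9);      // -> 16
    tableSizeFor(1000);   // -> 1024

So why go to all this trouble to make the capacity a power of two?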

3.3.1 Why must the capacity of a HashMap always be a power of two?

To answer that, we have to look at how the put() method locates a position in the table based on the hash value.

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
        ...
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        ...
    }

As you can see, it uses the operation (n - 1) & hash, where n is the current capacity of the HashMap. Because n is always a power of two, the low bits of n - 1 are all 1s; for example, 16 = 0000000000010000 and 15 = 0000000000001111. The benefit is that (n - 1) & hash makes an element's position depend only on the low bits of its hash, not the high bits (a dependence that weakens as the HashMap grows). However large the hash value, the resulting index is locked into the range 0 to n - 1, so indices are never scattered beyond the table, and when the HashMap is expanded the range simply grows with it.
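A minimal demonstration (illustrative, not JDK code) that the mask is just a cheap modulo when n is a power of two and the hash is non-negative:

    int n = 16;                            // a power-of-two capacity
    int hash = 0x7ffacb13;                 // an arbitrary example hash
    System.out.println((n - 1) & hash);    // 3
    System.out.println(hash % n);          // 3 -- the same bucket, without a division

The trade-off is a higher chance of hash collisions, since the high bits of the hash are ignored. To compensate, HashMap changes how the hash value itself is computed, pushing more of its entropy into the low bits.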

3.3.2 How does HashMap compute the hash value?

No nonsense, straight to the source:

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The hash method XORs the key's hashCode with its own high 16 bits. Since the high 16 bits of h >>> 16 are all 0, the high 16 bits of h ^ (h >>> 16) are unchanged, while its low 16 bits now mix in the key's high 16 bits. This increases the complexity of the low bits of the hash and further reduces the probability of two hashes colliding.
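A small worked example (values chosen purely for illustration):

    int h = 0x12345678;            // a key's hashCode()
    int spread = h ^ (h >>> 16);   // 0x12345678 ^ 0x00001234 = 0x1234444C
    // The low 16 bits changed from 0x5678 to 0x444C: the high bits of h now
    // influence the bucket index even when the table is small.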

3.4 Why is HashMap not thread safe?

In JDK 1.7, one cause of HashMap's thread unsafety was the transfer function, which used head insertion when rehashing. Under multiple threads this could easily produce a closed-loop linked list, leading to an infinite loop, as well as lost data. JDK 1.8 has no transfer function; expansion and initialization are both handled in resize, which uses tail insertion and so avoids the closed-loop problem, but it still cannot prevent data from being overwritten. Consider the put operation of a HashMap:

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
        Node<K, V>[] tab; Node<K, V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        ...
    }

When the condition if ((tab = table) == null || (n = tab.length) == 0) evaluates to true, an assignment is made directly. In a multithreaded environment, two threads can both pass such a check at the same time: thread 1 has just finished its assignment when thread 2 assigns again, overwriting thread 1's data. And this is only the simplest symptom. To be thread safe, a container must be built with multithread-safe logic from the start; HashMap has no such logic, so much of its single-thread design can break down under concurrency. It was never designed to support multithreading, and we shouldn't demand that of it. If you need a thread-safe hash-based map, use ConcurrentHashMap, which performs get without locking and put with fine-grained locking (segment locks in JDK 1.7, per-bucket locking in JDK 1.8). Each container implementation has its own strengths and weaknesses; what we need to learn is to analyze the situation at hand and choose the right container for the problem.
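A minimal sketch that usually exposes the lost-update problem (a hypothetical demo; results vary from run to run and across JVMs):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapRaceDemo {
        public static void main(String[] args) throws InterruptedException {
            Map<Integer, Integer> map = new HashMap<>();
            // Two threads insert disjoint key ranges concurrently.
            Thread t1 = new Thread(() -> { for (int i = 0;      i < 10_000; i++) map.put(i, i); });
            Thread t2 = new Thread(() -> { for (int i = 10_000; i < 20_000; i++) map.put(i, i); });
            t1.start(); t2.start();
            t1.join();  t2.join();
            // Expected 20000, but racing puts on the same bucket can overwrite
            // each other, so the printed size is frequently smaller.
            System.out.println(map.size());
        }
    }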

That covers the basic principles and implementation of HashMap. Deeper topics, such as inserting a node into a red-black tree, rebalancing it, and traversing it, are best studied through the principles and implementation of red-black trees themselves.


