In the previous article, we introduced some of the concepts related to the capacity of HashMap and briefly explained its capacity expansion mechanism.

We mentioned that by default a HashMap has a capacity of 16, but if the user specifies a capacity through the constructor, the HashMap selects the smallest power of 2 greater than or equal to that number as the capacity. (3 -> 4, 7 -> 8, 9 -> 16)

In this article, we'll take a closer look at whether the initial capacity of a HashMap should be set at all, and if we do want to set it, what value we should choose.

Why set the initial capacity of the HashMap

As mentioned earlier, the Alibaba Java Development Manual recommends that we set the initial capacity of the HashMap.

So why is that suggested? Have you ever thought about it?

Let's start with JDK 1.7 (jdk1.7.0_79) and compare the performance of a HashMap with and without a specified initial capacity. (JDK 8 results will be different; I will examine them in a later article.)

    public static void main(String[] args) {
        int tenMillion = 10000000;

        Map<Integer, Integer> map = new HashMap<>();
        long s1 = System.currentTimeMillis();
        for (int i = 0; i < tenMillion; i++) {
            map.put(i, i);
        }
        long s2 = System.currentTimeMillis();
        System.out.println("Uninitialized capacity, time: " + (s2 - s1));

        Map<Integer, Integer> map1 = new HashMap<>(tenMillion / 2);
        long s5 = System.currentTimeMillis();
        for (int i = 0; i < tenMillion; i++) {
            map1.put(i, i);
        }
        long s6 = System.currentTimeMillis();
        System.out.println("Initialized capacity 5000000, time: " + (s6 - s5));

        Map<Integer, Integer> map2 = new HashMap<>(tenMillion);
        long s3 = System.currentTimeMillis();
        for (int i = 0; i < tenMillion; i++) {
            map2.put(i, i);
        }
        long s4 = System.currentTimeMillis();
        System.out.println("Initialized capacity 10000000, time: " + (s4 - s3));
    }

The above code is not hard to understand: we create three HashMaps, one with the default capacity (16), one initialized to half the number of elements (5 million), and one initialized to the full number of elements (10 million). Then we put 10 million key-value pairs into each of them.

Output result:

    Uninitialized capacity, time: 14419
    Initialized capacity 5000000, time: 11916
    Initialized capacity 10000000, time: 7984

From the results, we can see that when the number of key-value pairs a HashMap will store is known in advance, setting a reasonable initial capacity can effectively improve performance.

Of course, this conclusion is also supported by theory. As we explained in the last article, HashMap has an expansion mechanism: it expands whenever the expansion condition is met. The condition is that when the number of elements (size) in the HashMap exceeds the threshold, the HashMap automatically expands, where threshold = loadFactor * capacity.
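For example, with the default capacity of 16 and the default load factor of 0.75, the threshold is 16 * 0.75 = 12, so inserting the 13th element triggers an expansion. Below is a minimal sketch that checks this by peeking at HashMap's private threshold field via reflection; note that this relies on JDK internals (the field exists in JDK 7/8, and JDK 9+ may require --add-opens java.base/java.util=ALL-UNNAMED):

    import java.lang.reflect.Field;
    import java.util.HashMap;
    import java.util.Map;

    public class ThresholdDemo {
        public static void main(String[] args) throws Exception {
            Map<Integer, Integer> map = new HashMap<>();
            map.put(1, 1); // force the internal table to be allocated

            // "threshold" is a private field of java.util.HashMap
            Field threshold = HashMap.class.getDeclaredField("threshold");
            threshold.setAccessible(true);

            // With the default capacity 16 and load factor 0.75, prints 12
            System.out.println("threshold: " + threshold.get(map));
        }
    }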

Therefore, if we do not set an initial capacity, the HashMap may expand many times as elements keep being added, and the expansion mechanism of HashMap dictates that every expansion rebuilds the hash table, which is very bad for performance. (I will cover resize in a separate article later.)
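To get a feel for how many rebuilds that means, here is a back-of-the-envelope sketch; it is my own approximation of the doubling behavior, not HashMap source code:

    public class ResizeCount {
        public static void main(String[] args) {
            int elements = 10_000_000;   // the number of puts in the benchmark above
            long capacity = 16;          // default initial capacity
            double loadFactor = 0.75;    // default load factor
            int resizes = 0;
            while (capacity * loadFactor < elements) {
                capacity <<= 1;          // each expansion doubles the capacity
                resizes++;
            }
            // Prints: 20 resizes, final capacity 16777216
            System.out.println(resizes + " resizes, final capacity " + capacity);
        }
    }

In other words, growing from the default capacity to ten million entries rebuilds the hash table about 20 times.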

From the above code example, we also find that different initial capacity values affect performance differently. So when we know the number of key-value pairs to be stored in a HashMap, what is the best capacity to set?

Initialization of capacity in HashMap

In the last article, we showed through code examples that when we set the initial capacity of a HashMap, the HashMap actually takes the smallest power of 2 greater than or equal to that value as the initial capacity.

Here’s an example from the last article

    import java.lang.reflect.Method;
    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> map = new HashMap<String, String>(1);
    map.put("hahaha", "hollischuang");
    Class<?> mapType = map.getClass();
    // capacity() is a non-public helper method of HashMap, so we call it reflectively
    Method capacity = mapType.getDeclaredMethod("capacity");
    capacity.setAccessible(true);
    System.out.println("capacity : " + capacity.invoke(map));

When the initial capacity is set to 1, the output is 2. This is because after map.put("hahaha", "hollischuang") is executed, the expansion condition is met, so the map expands and the capacity grows from 1 to 2.

So, again, when we set initialCapacity using HashMap(int initialCapacity), the HashMap does not necessarily use the value we pass in directly, but computes a new value from it in order to improve hashing efficiency. (1 -> 1, 3 -> 4, 7 -> 8, 9 -> 16)

JDK 1.7 and JDK 1.8 initialize this capacity at different times. In JDK 1.8, the capacity is computed when the constructor is called to create the HashMap. In JDK 1.7, this does not happen until the first put operation.
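We can observe the JDK 8 behavior with the same reflection trick; this sketch again depends on JDK internals. In JDK 8 the constructor temporarily stores the rounded capacity in the private threshold field, so reading it right after construction, before any put, shows that the rounding has already happened:

    import java.lang.reflect.Field;
    import java.util.HashMap;
    import java.util.Map;

    public class ConstructorRounding {
        public static void main(String[] args) throws Exception {
            Map<String, String> map = new HashMap<>(7); // no put yet

            Field threshold = HashMap.class.getDeclaredField("threshold");
            threshold.setAccessible(true);

            // On JDK 8 this prints 8: the constructor has already rounded 7 up
            System.out.println(threshold.get(map));
        }
    }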

Whether in JDK 1.7 or JDK 1.8, the algorithm used to compute the initial capacity is essentially the same; the main code is as follows:

    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;

The above code is quite interesting: behind a simple capacity initialization, the Java engineers put in a lot of thought.

The purpose of the above algorithm is quite simple: based on the capacity value passed in by the user (cap in the code), compute the smallest power of 2 greater than or equal to it and return it.
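Here is the same calculation lifted into a runnable sketch so you can try the inputs mentioned so far; the method body is copied from the JDK, while the wrapper class and sample inputs are mine:

    public class TableSizeFor {
        static final int MAXIMUM_CAPACITY = 1 << 30;

        // Body copied from java.util.HashMap#tableSizeFor (JDK 8)
        static int tableSizeFor(int cap) {
            int n = cap - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        }

        public static void main(String[] args) {
            for (int cap : new int[] {1, 3, 7, 9, 19, 37}) {
                // Prints: 1 -> 1, 3 -> 4, 7 -> 8, 9 -> 16, 19 -> 32, 37 -> 64
                System.out.println(cap + " -> " + tableSizeFor(cap));
            }
        }
    }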

Smart reader, if you were to design this algorithm, how would you calculate it? If you think about binary, it’s pretty easy. Here are a few examples:

Pay attention to how the numbers change and you may notice a pattern: 5 -> 8, 9 -> 16, 19 -> 32, and 37 -> 64 all go through two steps.

Step 1, 5 -> 7

Step 2, 7 -> 8

Step 1, 9 -> 15

Step 2, 15 -> 16

Step 1, 19 -> 31

Step 2, 31 -> 32

Step 1, 37 -> 63

Step 2, 63 -> 64

In the code above, Step 1 corresponds to:

n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;

In the code above, Step 2 corresponds to:

return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;

Step 2 is relatively simple: it checks against the limit value (MAXIMUM_CAPACITY), and then adds 1 to the value obtained in Step 1.

What about Step 1? It repeatedly shifts the binary number to the right and ORs the result with the original value. Its purpose is to set every bit after the first non-zero bit of the binary number to 1.

Take any binary number and run it through the steps above to see what they achieve:

    1100 1100 1100
    1100 1100 1100 >>> 1 = 0110 0110 0110
    1100 1100 1100 | 0110 0110 0110 = 1110 1110 1110
    1110 1110 1110 >>> 2 = 0011 1011 1011
    1110 1110 1110 | 0011 1011 1011 = 1111 1111 1111
    1111 1111 1111 >>> 4 = 0000 1111 1111
    1111 1111 1111 | 0000 1111 1111 = 1111 1111 1111

With a couple of unsigned right shifts and bitwise OR operations, 1100 1100 1100 becomes 1111 1111 1111; adding 1 to 1111 1111 1111 gives 1 0000 0000 0000, which is the smallest power of 2 greater than 1100 1100 1100.
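If you want to watch the bits propagate yourself, here is a minimal sketch; the constant is simply the example number above:

    public class BitPropagation {
        public static void main(String[] args) {
            int n = 0b1100_1100_1100;
            n |= n >>> 1;
            System.out.println(Integer.toBinaryString(n));     // 111011101110
            n |= n >>> 2;
            System.out.println(Integer.toBinaryString(n));     // 111111111111
            n |= n >>> 4; // already saturated; further shifts change nothing
            System.out.println(Integer.toBinaryString(n));     // 111111111111
            System.out.println(Integer.toBinaryString(n + 1)); // 1000000000000
        }
    }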

OK, so we have now explained the code for Step 1 and Step 2: together they convert a number to the smallest power of 2 no less than itself. (You have to admire the Java engineers here: using only unsigned right shifts and bitwise ORs keeps this very efficient.)

But there is a special case where this formula does not work: when the number is already a power of 2 itself. If you run the number 4 through the formula, you get 8:

Step 1:
0100 >>> 1 = 0010
0100 | 0010 = 0110
0110 >>> 2 = 0001
0110 | 0001 = 0111
Step 2:
0111 + 0001 = 1000

To solve this problem, the JDK engineers subtract 1 from the number passed in by the user before doing the calculation; that is the first line of the source code:

int n = cap - 1;
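A quick way to convince yourself that the decrement matters is to compare the real calculation with a hypothetical variant that skips it; the variant below is mine, written purely for comparison (bounds checks omitted):

    public class WhyMinusOne {
        // Hypothetical variant WITHOUT the "cap - 1" fix
        static int roundUpNoFix(int cap) {
            int n = cap;
            n |= n >>> 1; n |= n >>> 2; n |= n >>> 4; n |= n >>> 8; n |= n >>> 16;
            return n + 1;
        }

        // The real calculation, with the fix
        static int roundUpWithFix(int cap) {
            int n = cap - 1;
            n |= n >>> 1; n |= n >>> 2; n |= n >>> 4; n |= n >>> 8; n |= n >>> 16;
            return n + 1;
        }

        public static void main(String[] args) {
            System.out.println(roundUpNoFix(4));   // 8: one power of 2 too big
            System.out.println(roundUpWithFix(4)); // 4: a power of 2 is kept as-is
        }
    }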

At this point, go back and look at the code that computes the initial capacity; its purpose should now be clear:

    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
Copy the code

A reasonable value for the initial capacity of a HashMap

When we use HashMap(int initialCapacity) to initialize the capacity, the JDK computes a relatively reasonable value from initialCapacity for us. So, can we simply pass the known number of elements in the HashMap directly as initialCapacity?

For setting this value, the Alibaba Java Development Manual suggests initializing the capacity to expectedSize / 0.75F + 1.0F, where expectedSize is the number of elements to be stored.

This formula was not invented by Alibaba's engineers; the same value is also used in Guava (version 21.0):

    public static <K, V> HashMap<K, V> newHashMapWithExpectedSize(int expectedSize) {
        return new HashMap<K, V>(capacity(expectedSize));
    }

    /**
     * Returns a capacity that is sufficient to keep the map from being resized as long
     * as it grows no larger than expectedSize and the load factor is ≥ its default (0.75).
     */
    static int capacity(int expectedSize) {
        if (expectedSize < 3) {
            checkNonnegative(expectedSize, "expectedSize");
            return expectedSize + 1;
        }
        if (expectedSize < Ints.MAX_POWER_OF_TWO) {
            // This is the calculation used in JDK8 to resize when a putAll
            // happens; it seems to be the most conservative calculation we
            // can make.  0.75 is the default load factor.
            return (int) ((float) expectedSize / 0.75F + 1.0F);
        }
        return Integer.MAX_VALUE; // any large value
    }

Above the line return (int) ((float) expectedSize / 0.75F + 1.0F); there is a comment pointing out that this formula is not original to Guava: it is the calculation used by the putAll method in JDK 8. Interested readers can look at the implementation of putAll, which uses exactly this formula.
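For reference, here is the pre-sizing logic from JDK 8's HashMap.putMapEntries, the internal method that putAll delegates to, abridged to the part that matters:

    final void putMapEntries(Map<? extends K, ? extends V> m, boolean evict) {
        int s = m.size();
        if (s > 0) {
            if (table == null) { // pre-size before the first insertion
                float ft = ((float) s / loadFactor) + 1.0F;
                int t = ((ft < (float) MAXIMUM_CAPACITY) ? (int) ft : MAXIMUM_CAPACITY);
                if (t > threshold)
                    threshold = tableSizeFor(t);
            } else if (s > threshold) {
                resize();
            }
            // ... each entry is then inserted via putVal ...
        }
    }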

However, when we use HashMap(int initialCapacity) to initialize the capacity, the value the JDK computes for us only rounds up to a power of 2; it does not take the loadFactor into account.

That is, if we pass in 7, it will be rounded up to 8 by the JDK, but this HashMap will be expanded as soon as the number of elements reaches 8 * 0.75 = 6, which is obviously not what we want.

If instead we first compute expectedSize / 0.75F + 1.0F, then 7 / 0.75 + 1 = 10, and 10 will be rounded up to 16 by the JDK, which greatly reduces the chance of expansion.

When the internal hash table maintained by a HashMap reaches 75% of its capacity (by default), a rehash is triggered, and rehashing can be time-consuming. Therefore, setting the initial capacity to expectedSize / 0.75 + 1 can effectively reduce the number of rehashes.
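A small helper capturing this rule of thumb; the wrapper is my own, and the JDK will still round the result up to a power of 2:

    import java.util.HashMap;
    import java.util.Map;

    public class PreSizedMaps {
        // expectedSize / 0.75 + 1, the formula recommended above (same as Guava's)
        static int initialCapacity(int expectedSize) {
            return (int) ((float) expectedSize / 0.75F + 1.0F);
        }

        public static void main(String[] args) {
            System.out.println(initialCapacity(7));  // 10, which HashMap rounds up to 16
            System.out.println(initialCapacity(12)); // 17, which HashMap rounds up to 32

            // Capacity 16, threshold 16 * 0.75 = 12: seven entries fit without a resize
            Map<String, String> map = new HashMap<>(initialCapacity(7));
        }
    }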

So, I think that when we know exactly how many elements the HashMap will hold, setting the initial capacity to expectedSize / 0.75F + 1.0F is a relatively good choice for performance, though at the same time it sacrifices some memory.

Conclusion

When we create a HashMap in our code, if we know how many elements will be stored in the Map, setting the initial capacity of the HashMap can improve efficiency to some extent.

However, instead of taking the number the user passes in as the capacity directly, the JDK does some computation on it to get a power of 2. (The reason it must be a power of 2 is explained in the earlier article analyzing Map's hash() method.) The algorithm used to arrive at this number relies on unsigned right shifts and bitwise OR operations, which makes it very efficient.

However, to avoid the performance cost of expansion as much as possible, we suggest setting the initial capacity to expectedSize / 0.75F + 1.0F. In daily development, we can use

    // Maps comes from Guava: import com.google.common.collect.Maps;
    Map<String, String> map = Maps.newHashMapWithExpectedSize(10);

to create a HashMap; Guava will do the calculation for us.

However, the above approach trades memory for performance; when actually using it, you should weigh the memory cost.

Finally, a question to think about: why does the putAll method in JDK 8 use the formula expectedSize / 0.75F + 1.0F, while put, the constructor, and so on do not use it by default?

About the author: Hollis, a person with a unique pursuit of Coding, is a technical expert at an Internet company and a personal technical blogger whose articles have been read tens of millions of times across the Internet; he is also co-author of Three Courses for Programmers.