Why does the array length have to be 2 to the n?

    /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

The method above is called when a HashMap is initialized, and its comment states that it returns a power of two to use as the array capacity.
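To see the rounding behavior concretely, here is a minimal sketch that copies the same logic into a standalone class (the class name TableSizeDemo is made up for illustration; MAXIMUM_CAPACITY is set to HashMap's value of 1 << 30):

    // Hypothetical demo class, not from the article: shows how capacities are
    // rounded up to the next power of two.
    public class TableSizeDemo {
        static final int MAXIMUM_CAPACITY = 1 << 30;

        static int tableSizeFor(int cap) {
            int n = cap - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        }

        public static void main(String[] args) {
            System.out.println(tableSizeFor(10)); // 16
            System.out.println(tableSizeFor(16)); // 16
            System.out.println(tableSizeFor(17)); // 32
        }
    }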

But why does it have to be 2 to the n?

The reason has to do with how the array index is calculated. The following code is the part of the put method that checks whether the computed array index already holds a Node.

    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);

You can see that the array subscript is computed as i = hash & (n - 1). This bitwise AND is how HashMap obtains array subscripts quickly.

hash & (len - 1)
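As a quick sanity check, the following sketch (the class name IndexDemo is made up for illustration) verifies that for a power-of-two len, this AND gives the same index as hash % len for non-negative hashes, while costing only a single bit operation:

    // Hypothetical demo class, not from the article.
    public class IndexDemo {
        public static void main(String[] args) {
            int len = 16; // a power of two
            for (int hash = 0; hash < 40; hash++) {
                int byAnd = hash & (len - 1);
                int byMod = hash % len;
                if (byAnd != byMod) {
                    throw new AssertionError("mismatch at hash=" + hash);
                }
            }
            System.out.println("AND and modulo agree for len=" + len);
        }
    }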

Let's first look at len = 16: len - 1 = 15, which is 01111 in binary.

hash   expression      result
1      01111 & 00001   1
2      01111 & 00010   2
3      01111 & 00011   3
4      01111 & 00100   4
5      01111 & 00101   5
...
13     01111 & 01101   13
15     01111 & 01111   15
16     01111 & 10000   0

Two patterns can be observed from the table:

  1. All subscripts fall in the range 0 to 15
  2. The subscripts are distributed uniformly, repeating with a period of 16

Now let's look at len = 17, where len - 1 = 16 is 10000 in binary:

hash   expression      result
1      10000 & 00001   0
2      10000 & 00010   0
3      10000 & 00011   0
4      10000 & 00100   0
5      10000 & 00101   0
...
15     10000 & 01111   0
16     10000 & 10000   16
17     10000 & 10001   16

When len = 17, almost every hash maps to index 0 (the only other index ever produced is 16), so nearly all nodes fall into the same bucket, causing very serious collisions while most of the array sits unused.
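The two tables can be reproduced with a small sketch (the class name BucketSpreadDemo is made up) that records which buckets actually get hit for each length:

    // Hypothetical demo class, not from the article.
    import java.util.HashSet;
    import java.util.Set;

    public class BucketSpreadDemo {
        static void spread(int len) {
            Set<Integer> buckets = new HashSet<>();
            for (int hash = 1; hash <= 17; hash++) {
                buckets.add(hash & (len - 1));
            }
            System.out.println("len=" + len + " -> buckets used: " + buckets);
        }

        public static void main(String[] args) {
            spread(16); // every index 0..15 gets used
            spread(17); // only indexes 0 and 16 are ever produced
        }
    }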

Conclusion

From the tables above we can see that constraining the array capacity to a power of 2 serves two purposes:

  • The high bits of len - 1 are all 0, so hash & (len - 1) can never exceed len - 1 and the subscript never goes out of bounds
  • The low bits of len - 1 are all 1, so the low bits of the hash pass through unchanged, which spreads the subscripts evenly and reduces collisions

How does a HashMap compute a hash

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The code above is the hash method in JDK 1.8; earlier versions take a similar but more complicated approach, which is not covered here.

First, if the key is null, hash() returns 0, which is why a null key always lands at array subscript 0.

When the key is not null, the key's hashCode() method (declared on Object and often overridden) is called, and the result is then mixed with bitwise operations. The rest of this section explains why hashCode is not used directly.

Xor and right shift operations

XOR is a binary bitwise operation, written ^ in Java. It compares the operands bit by bit: matching bits produce 0 and differing bits produce 1.

operand 1   operand 2   XOR
0           0           0
0           1           1
1           0           1
1           1           0

An unsigned right shift (>>> in Java) drops the low-order bits on the right and fills the vacated high-order bits on the left with zeros.

For example:

0101011 >>> 2 = 0001010
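Both operators are easy to try out directly. A tiny sketch (the class name BitOpsDemo is made up) that prints the results as binary strings:

    // Hypothetical demo class, not from the article.
    public class BitOpsDemo {
        public static void main(String[] args) {
            int a = 0b0101011; // 43
            System.out.println(Integer.toBinaryString(a >>> 2));        // 1010
            System.out.println(Integer.toBinaryString(0b1100 ^ 0b1010)); // 110 (i.e. 0110)
        }
    }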

key.hashCode()

Java's contract requires that whenever we override equals from the Object class, we must also override hashCode, because two objects that are equal must report equal hash codes.
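A minimal illustration of that contract, using a hypothetical Point class that is not from the article:

    // Hypothetical example class: equal points must report equal hash codes.
    import java.util.Objects;

    public class Point {
        final int x, y;

        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y); // consistent with equals
        }
    }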

Because hashCode can be overridden by users, a HashMap cannot assume the raw value is well distributed, and that is one of the reasons it applies its own hash() on top of it.

Let’s go back to hash & (len-1)

As mentioned earlier, HashMap uses this expression to calculate array subscripts, so the distribution of hash values must be taken into account to reduce collisions.

The hash is a 32-bit value, while the high bits of len - 1 are all 0 and only its low bits are 1. In other words, only part of the 32-bit hash ever participates in the calculation.

For example, when len is 16, only the lowest 4 of the 32 hash bits are used.

With that in mind, let's go back to the hash() method.

  1. Shifting h right by 16 bits splits the 32-bit hash exactly in half, moving the high 16 bits down into the low 16 positions
  2. For the high 16 bits to take part in the AND directly, len would have to be at least 2^17, a capacity that is practically never reached, so without the shift at least half of hashCode's bits would go unused; the sketch below makes this concrete
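Here is a minimal sketch (the class name SpreadDemo and the two sample hash values are made up) of two hash codes that share their low 16 bits: without the mixing they collide in a 16-slot table, and with it they land in different buckets:

    // Hypothetical demo class, not from the article.
    public class SpreadDemo {
        static int spread(int h) {
            return h ^ (h >>> 16); // same mixing hash() applies for non-null keys
        }

        public static void main(String[] args) {
            int len = 16;
            int h1 = 0x00010001, h2 = 0x00020001; // identical low 16 bits

            // Raw hash codes: both map to bucket 1 -> collision.
            System.out.println((h1 & (len - 1)) + " vs " + (h2 & (len - 1)));
            // After spreading: buckets 0 and 3 -> no collision.
            System.out.println((spread(h1) & (len - 1)) + " vs " + (spread(h2) & (len - 1)));
        }
    }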

Conclusion

  • Shifting right by 16 and XORing with the original value folds the high 16 bits into the low 16 bits, so every bit of hashCode contributes to the index; making full use of all 32 bits reduces the likelihood of collisions
  • The user may have overridden hashCode; the extra mixing in hash() scrambles poorly distributed results and limits the additional collisions a weak override could cause

Why is the load factor 0.75

    /* Ideally, under random hashCodes, the frequency of
     * nodes in bins follows a Poisson distribution with a
     * parameter of about 0.5 on average for the default resizing
     * threshold of 0.75, although with a large variance because of
     * resizing granularity. Ignoring variance, the expected
     * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
     * factorial(k)). The first values are:
     *
     * 0:    0.60653066
     * 1:    0.30326533
     * 2:    0.07581633
     * 3:    0.01263606
     * 4:    0.00157952
     * 5:    0.00015795
     * 6:    0.00001316
     * 7:    0.00000094
     * 8:    0.00000006
     * more: less than 1 in ten million
     */

The comment above appears in the HashMap source. It says that under random hashCodes, the number of nodes falling into each bucket follows a Poisson distribution, and it lists the probability of a bucket's linked list reaching each length when the load factor is 0.75.

The probability of a list reaching length 8 is already about 6 in 100 million, which is practically impossible. As for how those numbers are derived... (I didn't work it out either.)
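For the curious, the formula quoted in the comment, exp(-0.5) * pow(0.5, k) / factorial(k), can be evaluated directly; a small sketch (the class name PoissonDemo is made up):

    // Hypothetical demo class: evaluates the formula from the source comment
    // for list sizes k = 0..8.
    public class PoissonDemo {
        public static void main(String[] args) {
            double lambda = 0.5;
            double factorial = 1;
            for (int k = 0; k <= 8; k++) {
                if (k > 0) factorial *= k;
                double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
                System.out.printf("%d: %.8f%n", k, p);
            }
        }
    }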

Conclusion 1: With a load factor of 0.75, a linked list almost never reaches length 8.

Leaving the math aside, let's consider how the size of the load factor affects a HashMap:

  • A load factor greater than 1 pushes the resize threshold beyond the capacity, so on average every bucket already holds more than one node before any expansion happens, leading to serious collisions
  • A load factor close to 1 means expansion only happens when the array is nearly full, which also makes collisions frequent
  • A small load factor triggers expansion frequently; keys stay scattered and collisions are rare, but resizing is expensive and the sparsely filled array wastes space (the sketch after this list shows the threshold arithmetic)
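A minimal sketch of the threshold arithmetic (the class name ThresholdDemo is made up; the constants are HashMap's documented defaults):

    // Hypothetical demo class: threshold = capacity * loadFactor, and a resize
    // is triggered once the number of entries exceeds that threshold.
    public class ThresholdDemo {
        public static void main(String[] args) {
            int capacity = 16;        // default initial capacity
            float loadFactor = 0.75f; // default load factor
            System.out.println("0.75 -> threshold " + (int) (capacity * loadFactor)); // 12

            // With 0.5 the map resizes after only 8 entries (more wasted space);
            // with 1.0 it waits until, on average, every slot is occupied (more collisions).
            System.out.println("0.5  -> threshold " + (int) (capacity * 0.5f));
            System.out.println("1.0  -> threshold " + (int) (capacity * 1.0f));
        }
    }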

Conclusion 2: The load factor should be a middle-ground value, somewhere between roughly 0.5 and 1, that balances being too large against being too small; 0.75 is such a compromise.