Do you know how the hash method in HashMap is implemented? Do you know how the hash methods in HashTable and ConcurrentHashMap are implemented, and why they differ? Do you know why the hash methods in JDK 7 and JDK 8 are implemented differently? If you can’t answer these questions well, then you need to read this article. It involves a lot of code and a fair amount of computer-science fundamentals — absolutely full of dry stuff. I don’t think you’ll find a more thorough analysis of hash() anywhere else on the Internet.
The hash
A hash algorithm maps an input of arbitrary length to an output of fixed length; the output is the hash value. This transformation is a compression mapping: the space of hash values is usually much smaller than the space of inputs, and different inputs may hash to the same output, so the input cannot be uniquely determined from its hash value. Simply put, it is a function that compresses a message of any length into a message digest of a fixed length.
All hash functions have the following basic property: if two hash values computed by the same hash function are different, the inputs must be different; however, if two hash values computed by the same hash function are the same, the inputs are not necessarily the same.
A collision occurs when two different inputs produce the same value under the same hash function.
Common Hash functions are as follows:
Direct addressing: use the keyword k itself, or k plus some constant (k + C), directly as the hash address.
Digit analysis method: pick the digits of the keyword that are most uniformly distributed and use them as the hash address.
Division (modulo) method: divide the keyword k by a number p no larger than the hash table length m, and use the remainder as the hash address.
Folding (piecewise superposition): split the keyword into several parts of equal length according to the number of digits in the hash address (the last part may be shorter), add the parts together, discard the highest carry, and use the result as the hash address.
Mid-square method: if the digits of the keyword are not uniformly distributed, first square the keyword, then take the middle digits, as needed, as the hash address.
Pseudorandom number method: use a pseudorandom number as a hash function.
We mentioned collisions above. An important measure of how good a hash function is is its collision probability and how collisions are resolved. Essentially no hash function can avoid collisions completely. The common ways to resolve collisions are as follows:
- Open addressing: once a collision occurs, probe for the next empty hash address; as long as the hash table is large enough, an empty address can always be found and the record stored there.
- Chaining (separate chaining): each slot of the hash table is the head of a linked list, and all elements whose hash address is i form one list. When a collision occurs, the new keyword is appended to the list for that slot.
- Rehashing: when a collision occurs at a hash address, another hash function is used to compute a new address, repeating until no collision occurs.
- Public overflow area: the hash table is split into a basic table and an overflow table, and the elements that collide are placed in the overflow table.
The data structure of the HashMap
In Java, there are two relatively simple data structures for storing data: arrays and linked lists. **Arrays are easy to address but hard to insert into and delete from; linked lists are hard to address but easy to insert into and delete from.** As mentioned above, one of the common ways to resolve hash collisions is chaining, which combines an array with linked lists to get the advantages of both. You can think of it as an array of linked lists.
As we can see from the figure above, the left-hand side is an array, and each element of the array is the head of a linked list. Every element stored in this structure carries a pointer used to link it to the next element. We assign an element to a list based on its characteristics, and conversely we use those characteristics to find the right list and then the right element within it. The method that computes an element’s array index from its characteristics is the hash algorithm — chiefly the hash() function (together, of course, with the indexFor() function).
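To make the picture concrete, here is a minimal, hypothetical sketch of the “array of linked lists” idea (my own illustration, not the JDK source). The naive indexing line is exactly the part that hash() and indexFor() will refine later.

```java
// A bare-bones "array of linked lists": each slot of the table holds the head
// of a singly linked list of entries that hashed to that slot.
class SimpleBucketMap<K, V> {
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;                      // link to the next entry in the same bucket
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    public void put(K key, V value) {
        int index = (key.hashCode() & 0x7FFFFFFF) % table.length;   // naive indexing, refined below
        for (Node<K, V> n = table[index]; n != null; n = n.next) {
            if (n.key.equals(key)) {          // same key already present: overwrite its value
                n.value = value;
                return;
            }
        }
        table[index] = new Node<>(key, value, table[index]);        // otherwise chain a new head node
    }
}
```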
The hash method
Let’s start with HashMap in JDK 1.7, which defines a final int hash(Object k) method that is referenced by a number of other methods.
Those calling methods are mainly the ones that add and delete elements, which is not hard to understand: when we want to add or delete an element in the linked-list array, we first need to know where it should live in that array, that is, its index. The hash() method does exactly that: it locates the key within the HashMap. The same applies to HashTable and ConcurrentHashMap.
The source code parsing
First, the hash methods in HashMap, HashTable, and ConcurrentHashMap are implemented differently even within the same JDK version, and there are also differences between JDK versions (Java 7 and Java 8). I’ll try to cover them all; read this article and you’ll understand the hash method once and for all.
Before we get to the code, let’s do a quick analysis. We know that the hash method’s job is to locate a key-value pair in the linked-list array based on its key. Its input should be a Key of type Object and its output should be an array index of type int. If you had to design this method, what would you do?
The simple approach is to call the key’s hashCode() method, which returns an integer, and take it modulo the size of the HashMap or HashTable. Hence the two methods int hash(Object k) and int indexFor(int h, int length). However, the real implementation in HashMap is a little more complicated, for reasons such as efficiency.
hash: this method essentially converts an Object into an integer.
indexFor: this method converts the integer produced by hash into an index in the linked-list array.
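Put together, the naive design just described would look roughly like the sketch below (my own illustration, not the JDK code). Note that a plain h % length can even go negative for negative hash codes, which is one of the problems the real implementation has to deal with, as we’ll see.

```java
// The naive two-step design: hash() just reuses hashCode(), indexFor() just takes a modulus.
static int hash(Object k) {
    return k.hashCode();          // convert the key to an int
}

static int indexFor(int h, int length) {
    return h % length;            // careful: this is negative when h is negative
}
```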
HashMap In Java 7
final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length - 1);
}
As I said earlier, the indexFor method converts the integer produced by hash into an index in the linked-list array. So what does return h & (length - 1); mean? It is simply taking a modulus. The main reason Java uses a bitwise AND (&) instead of the modulo operator (%) here is efficiency: a bitwise AND is a single, very cheap CPU instruction, whereas an integer division/modulo is considerably slower.
So why is it possible to use the bit operator (&) to implement the modulo operator (%)? This works as follows:
X % 2^n = X & (2^n - 1)
Here 2^n means 2 raised to the power n; that is, taking X modulo 2^n is the same as taking the bitwise AND of X and 2^n - 1.
For example, if n is 3, then 2^3 = 8, which is 1000 in binary, and 2^3 - 1 = 7, which is 0111 in binary.
So X & (2^3 - 1) is the same as keeping the last three binary digits of X.
In binary, X / 8 is the same as X >> 3: shifting X right by 3 bits gives the quotient of X / 8, and the bits shifted out are X % 8, the remainder.
I don’t know if you understand the above explanation, but it doesn’t matter if you don’t, you just need to remember this technique. Or you could try it out with a couple of examples.
6 % 8 = 6, 6 & 7 = 6
10 % 8 = 2, 10 & 7 = 2
So return h & (length - 1); performs the modulo operation correctly whenever length is a power of 2. And the length of a HashMap is indeed always a power of 2: it starts at 16 and doubles on each expansion.
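A quick way to convince yourself is to run a few values through both forms (a throwaway check, not JDK code):

```java
// Checks the identity X % 2^n == X & (2^n - 1) for a few non-negative values of X.
public class ModVsAnd {
    public static void main(String[] args) {
        int length = 16;                                   // a power of two, as in HashMap
        for (int h : new int[] {6, 10, 27, 1_000_003}) {
            System.out.printf("%d %% %d = %d,  %d & %d = %d%n",
                    h, length, h % length, h, length - 1, h & (length - 1));
        }
    }
}
```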
After analyzing the indexFor method, we are ready to examine the principles and implementation of the hash method. So before we go further, here’s a summary.
The data in a HashMap is stored in a linked-list array. Any insert or delete on a HashMap first needs to locate the array index at which the key-value pair should be stored, based on its key, and this key-to-index step is the hashing operation. The array of a HashMap has a length, and Java requires that length to be a power of 2, starting at 16. The simple way to get the index is to take the key’s hashCode and reduce the resulting int modulo the array length; for performance, Java always implements this modulo with a bitwise AND.
That’s all the groundwork for now. But because HashMap uses a bitwise AND instead of a true modulo, another problem appears: collisions become more likely. For example, 1011 0000 and 0001 0000 both give the same result when ANDed with 0000 1111.
Two different keys can thus produce the same result when ANDed with (length - 1). So how is this conflict resolved? Let’s look at what Java does.
The main code parts are as follows:
h ^= k.hashCode();
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
This code perturbs the key’s hashCode to prevent collisions between hashCodes whose high bits differ but whose low bits are the same. Put simply, it mixes the high-order bits into the low-order bits to reduce the probability of collision; in other words, it tries to make a change in any bit of the hashCode affect the final result.
For example, suppose we put a key-value pair into a HashMap whose key is "hollischuang". Simply taking its hashCode gives the value 1011000110101110011111010011011. If the current table size is 16, then without any perturbation the index would be 11, because length - 1 = 15 written as a 32-bit binary number is 00000000000000000000000000001111, and when a number is ANDed with it, the top 28 bits are irrelevant to the result (0 ANDed with anything is 0). As shown in the figure below.
As you can see, both of the hashCodes below yield the value 11 after the bitwise AND. Although we don’t know which key produces the second hashCode in this example, such a key certainly exists, and it would cause a collision.
So let’s see what happens once the perturbation is applied.
As the figure above shows, the two hashCodes that previously collided end up with different index values after the perturbation, which nicely avoids the conflict.
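Even without the figures, you can reproduce the effect with a small experiment: take two hash codes that share the same low 4 bits and compare the indexes they get with and without the JDK 7-style perturbation (the same shift/XOR lines quoted above; the two input values below are made up for illustration).

```java
// Shows how the JDK 7-style perturbation lets the high bits influence the final index.
public class PerturbDemo {
    // The same shift/XOR mixing as the JDK 7 hash() body quoted above
    // (hashSeed and the String special case omitted).
    static int perturb(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int length = 16;
        int a = 0b1011_0001_1010_1110_0111_1101_0011_1011;   // arbitrary example values that
        int b = 0b0000_0000_0000_0000_0000_0000_0000_1011;   // share the same low 4 bits
        System.out.println("without perturbation: " + (a & (length - 1)) + " vs " + (b & (length - 1)));
        System.out.println("with perturbation:    " + (perturb(a) & (length - 1)) + " vs " + (perturb(b) & (length - 1)));
    }
}
```

Without the perturbation both values land in bucket 11; with it, the high bits of the first value change its bucket, so for this particular pair the two keys no longer collide.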
In fact, using a bitwise operation instead of a modulo has another benefit besides performance: it neatly solves the negative-number problem. As we know, hashCode returns an int, and the range of int is -2^31 to 2^31 - 1, so it includes negative numbers; and in Java the % operator keeps the sign of the dividend, so a negative hashCode would give a negative, and therefore unusable, index. The bitwise form avoids this: whatever the sign of the hashCode, length - 1 is always a positive number, so its highest bit — the sign bit, where "0" represents "+" and "1" represents "-" — is 0, and therefore the highest bit of the AND of the two numbers must also be 0, meaning the result is always non-negative.
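Here is a tiny illustration of the sign issue (a throwaway sketch; the forced-negative hash code is artificial):

```java
// Java's % keeps the sign of the dividend, so a negative hash code can yield a negative
// (and therefore unusable) index, while the bitwise AND always stays in range.
public class NegativeHashDemo {
    public static void main(String[] args) {
        int h = "some key".hashCode() | 0x80000000;   // force the sign bit on, just for illustration
        int length = 16;
        System.out.println("h % length       = " + (h % length));        // may be negative
        System.out.println("h & (length - 1) = " + (h & (length - 1)));  // always in [0, length - 1]
    }
}
```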
HashTable In Java 7
That covers how HashMap implements the hash and indexFor methods in Java 7. Now let’s look at how the thread-safe HashTable implements them, how it differs from HashMap, and try to work out the reasons for the differences. The following is the implementation of the hash method in Java 7’s HashTable.
private int hash(Object k) {
// hashSeed will be zero if alternative hashing is disabled.
return hashSeed ^ k.hashCode();
}
And you can see, it’s pretty simple: the hash is just the key’s hashCode XORed with hashSeed. The index is then computed as int index = (hash & 0x7FFFFFFF) % tab.length;. In other words, HashMap and HashTable compute the array index in two different ways: HashMap uses a bitwise AND, whereas HashTable uses a plain modulo.
Why AND the hash with 0x7FFFFFFF first? Mainly to force the highest bit of the value to 0, so that the result is non-negative: in signed representation the highest bit is the sign bit, 0 for positive and 1 for negative.
As we said earlier, the reason HashMap avoids a true modulo is efficiency. You might think that since HashTable is a thread-safe, and therefore slower, class, Java simply didn’t bother about efficiency here — but that’s not entirely true. There are deliberate design considerations behind it, even though HashTable is certainly slower than HashMap.
In fact, HashTable’s use of a plain modulo is a deliberate choice, which becomes clear from its constructors and its resize logic. To save space I won’t post that code here and will just give the conclusion directly:
HashTable defaults to an initial size of 11 and then expands to 2n+1 each time.
That is, the default length of HashTable’s linked-list array is a prime, odd number, and every capacity after expansion is odd as well.
Because HashTable tries to keep its capacity a prime (or at least odd) number, a simple modulo hash distributes keys more uniformly when the table size is prime. (This can be proved; since it is not the focus here I won’t go into detail — see for example: http://zhaox.github.io/algorithm/2015/06/29/hash).
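The intuition behind that claim can be seen with a toy experiment (my own illustration, not the actual HashTable code): when the hash codes share a common factor with the table size, a plain modulo over a composite size piles them into a few buckets, while a prime size spreads them across all buckets.

```java
import java.util.Map;
import java.util.TreeMap;

// Hash codes that are all multiples of 4, reduced modulo 16 (composite) versus modulo 11 (prime).
public class PrimeVsPowerOfTwo {
    static Map<Integer, Integer> bucketCounts(int tableSize) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int key = 0; key < 100; key++) {
            int h = key * 4;                          // every hash code shares the factor 4
            counts.merge(h % tableSize, 1, Integer::sum);
        }
        return counts;                                // bucket index -> number of keys in it
    }

    public static void main(String[] args) {
        System.out.println("size 16: " + bucketCounts(16));  // only buckets 0, 4, 8, 12 are used
        System.out.println("size 11: " + bucketCounts(11));  // all 11 buckets are used
    }
}
```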
Now that we’ve looked at Java 7 HashMap and HashTable implementations of hash, let’s make a quick summary.
- HashMap defaults to an initial size of 16 and then expands to twice that size each time.
- HashTable defaults to an initial size of 11 and then expands to 2n+1 each time.
- When the HashTable size is prime, the result of a simple modulo hash is more uniform, so from this point of view alone HashTable’s choice of size seems the smarter one: the more evenly the hash results are spread, the better.
- In modulo calculation, if the modulus is a power of two, we can use a bitwise operation to get the result directly, which is much faster than division. So in terms of the efficiency of computing the index, HashMap has the edge.
- However, by replacing the modulo with a bitwise operation for efficiency, HashMap also introduces the problem of an uneven hash distribution, which is why it improves on the raw hashCode with the perturbation calculation described above.
ConcurrentHashMap In Java 7
private int hash(Object k) {
    int h = hashSeed;

    if ((0 != h) && (k instanceof String)) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();

    // Spread bits to regularize both segment and index locations,
    // using variant of single-word Wang/Jenkins hash.
    h += (h << 15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h << 3);
    h ^= (h >>> 6);
    h += (h << 2) + (h << 14);
    return h ^ (h >>> 16);
}

int j = (hash >>> segmentShift) & segmentMask;
The hash implementation of ConcurrentHashMap above follows the same idea as HashMap’s: a bitwise AND replaces the modulo, and the hashCode is perturbed first. The difference is that ConcurrentHashMap uses a variant of the Wang/Jenkins hash, whose purpose is likewise to mix the high and low bits to avoid collisions. As for why it doesn’t reuse HashMap’s perturbation exactly, I suspect it’s simply the author’s preference; at least I have no way of proving which one is better.
HashMap In Java 8
Prior to Java 8, HashMap and the other map classes based on it resolved collisions by chaining, using singly linked lists to store elements with the same index value. In the worst case, this approach degrades the performance of HashMap’s get method from O(1) to O(n). To address this degradation under frequent collisions, Java 8 uses a balanced tree instead of a linked list to store colliding elements, which improves the worst case from O(n) to O(log n). I’ll go into more detail about the HashMap optimizations in Java 8 in a later article.
If a malicious program knew we were using a hash algorithm, in the pure-linked-list case it could send a flood of requests crafted to cause hash collisions, then keep accessing those keys, forcing the HashMap into linear lookups until it eventually collapses — a denial-of-service (DoS) attack.
The hash function in Java 8 works on much the same principle as in Java 7, but the perturbation step was simplified: instead of four rounds of shifting and XORing, Java 8 does a single 16-bit right shift and XOR. The principle is unchanged.
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
In the JDK 1.8 implementation, the high-bit mixing is optimized: the hash is computed as (h = key.hashCode()) ^ (h >>> 16), i.e. the high 16 bits of hashCode() are XORed into its low 16 bits, a trade-off between speed, utility, and hash quality. The index is then obtained as h & (table.length - 1).
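Here is a minimal illustration of why that single XOR helps (the two input values are assumed for illustration and are not from the JDK): with the default table size of 16, only the low 4 bits of the hash pick the bucket, so folding the high 16 bits into the low 16 bits lets differences in the high bits still affect the index.

```java
// Java 8-style spreading: mix the high 16 bits into the low 16 bits before masking with (n - 1).
public class SpreadDemo {
    static int spread(int h) {
        return h ^ (h >>> 16);    // the mixing step of HashMap.hash() in Java 8 (null check omitted)
    }

    public static void main(String[] args) {
        int n = 16;               // default table size: only the low 4 bits select the bucket
        int h1 = 0x00010000;      // two hypothetical hash codes that differ only in their high bits
        int h2 = 0x00020000;
        System.out.println("without spreading: " + (h1 & (n - 1)) + " vs " + (h2 & (n - 1)));                  // 0 vs 0
        System.out.println("with spreading:    " + (spread(h1) & (n - 1)) + " vs " + (spread(h2) & (n - 1)));  // 1 vs 2
    }
}
```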
HashTable In Java 8
In Java 8’s HashTable there is no longer a dedicated hash method, but the hash computation is still there, for example in the put method:
int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % tab.length;
This is virtually the same implementation as in Java 7, so I won’t go into too much detail.
ConcurrentHashMap In Java 8
In Java 8’s ConcurrentHashMap, the hash method is renamed spread. The implementation is as follows:
static final int spread(int h) {
return (h ^ (h >>> 16)) & HASH_BITS;
}
Java 8’s ConcurrentHashMap also determines a key’s index in the array from its hash value and the array length. To guard against keys with poorly designed hashCode methods, it computes the final hash of the key as shown above. Compared with Java 7, the authors of Java 8’s ConcurrentHashMap seem to have decided that with red-black trees introduced, addressing is already efficient enough, so they kept the hash computation simple: just XOR the key’s hashCode with its own high 16 bits and mask with HASH_BITS to force the highest bit to 0 (thereby guaranteeing a non-negative result).
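A short sketch of the same idea, with the HASH_BITS mask spelled out (the constant value 0x7fffffff is what the JDK uses; the input below is an assumed example):

```java
// ConcurrentHashMap-style spread: mix in the high 16 bits, then clear the sign bit.
public class ConcurrentSpreadDemo {
    static final int HASH_BITS = 0x7fffffff;     // usable bits of the hash, i.e. everything but the sign bit

    static int spread(int h) {
        return (h ^ (h >>> 16)) & HASH_BITS;
    }

    public static void main(String[] args) {
        int negativeHash = 0x90ABCDEF;               // a hash code with the sign bit set
        System.out.println(negativeHash);            // prints a negative number
        System.out.println(spread(negativeHash));    // always non-negative thanks to the HASH_BITS mask
    }
}
```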
Conclusion
So far, we have analyzed the hash implementations of HashMap, HashTable, and ConcurrentHashMap in JDK 1.7 and JDK 1.8. As you can see, the JDK puts a great deal of work into this one small method, both to keep the hash results well spread and to keep the computation efficient. My hope is that you not only gain insight into the rationale behind it, but also pick up this attitude of striving for excellent code.
Every line of the Jdk source code is interesting and worth the time to delve into.