This article is a study note written after reading "Preparing for an Alibaba interview: this one article on HashMap is enough", with some of my own understanding added.
Hash table
What is a hash table
Searches over linear lists, trees, and similar structures (sequential search, binary search, binary sort tree search, B-tree search) are comparison-based searches: their efficiency depends entirely on the number of comparisons. Why must we compare at all? Because the positions where elements are stored are essentially random with respect to their keys.

A hash table is a different way of querying: it establishes a mapping between an element and its storage location, so elements can be found without comparisons — much like a real-world dictionary, where the keyword directly gives you its page number.

Hash == a mapping relationship == f(x)

For example, let the keys be place names; the mapping takes the first letter of each name and uses its position in the alphabet as the storage index:
x | BEIJING | TIANJIN | HEBEI | SHANXI | SHANDONG | GUANGZHOU |
---|---|---|---|---|---|---|
f(x) | 02 | 20 | 08 | 19 | 19 | 07 |
Storage structure
- To find BEIJING, we apply the mapping to the key, get index 02, and read the element there
- Different keys can map to the same f(x); for example, SHANXI and SHANDONG both map to 19. This phenomenon is called a hash collision

Hash collisions cannot be avoided entirely; they can only be handled when they occur. A data set organized by a hash function together with a collision-handling strategy is a hash table.
Hash function construction method
There are several construction methods; I will only focus on the one relevant to HashMap, the division-remainder method:
H(key) = key MOD P
For example, given keys 28, 35, 63, 77, 105 and P = 21:

Key | 28 | 35 | 63 | 77 | 105 |
---|---|---|---|---|---|
Hash address | 7 | 14 | 0 | 14 | 0 |
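A quick snippet to verify the addresses above (my own check, plugging the keys into H(key) = key MOD P):

```java
public class DivisionHashDemo {
    public static void main(String[] args) {
        int p = 21;
        for (int key : new int[]{28, 35, 63, 77, 105}) {
            System.out.println(key + " -> " + key % p); // prints 7, 14, 0, 14, 0
        }
    }
}
```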
Ways to handle conflicts
There are several approaches; I will only focus on the one HashMap uses, the chained-address method (separate chaining): when hash collisions occur, elements are organized as an array of linked lists (HashMap adds its own optimizations on top, described later). A minimal sketch of the idea follows.
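To make "array + linked list" concrete, here is a toy separate-chaining table (my own illustration of the general technique, using the division-remainder method; this is not HashMap's actual implementation):

```java
import java.util.LinkedList;

public class ChainedHashTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(K key) {
        // Division-remainder method; floorMod keeps the index non-negative
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        for (Entry<K, V> e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) { e.value = value; return; } // same key: update in place
        }
        buckets[indexFor(key)].add(new Entry<>(key, value));    // collision: join the chain
    }

    public V get(K key) {
        for (Entry<K, V> e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) return e.value;              // walk the chain comparing keys
        }
        return null;
    }
}
```

Colliding keys simply share one bucket's list; lookup within a bucket falls back to comparisons, which is exactly why chain length matters.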
Lookup analysis of hash tables
The efficiency of a hash lookup depends on chain length: if a chain holds too many elements, lookup degenerates into plain list traversal. So we must keep the probability of collisions under control: when the number of stored elements reaches a certain fraction of the table size, the table is expanded to reduce the collision probability. That fraction is the hash table's load factor.

Note: HashMap's default load factor is 0.75; with the default capacity of 16, the resize threshold is 16 × 0.75 = 12.
With the theory above in place, we can start studying HashMap, mainly by reading the java.util.HashMap source and following a few important methods through their life cycle. The code below corresponds to JDK 1.8.
HashMap
To understand HashMap, it helps to study the following in advance:

- Balanced binary trees
- Red-black trees
HashMap storage structure

Before JDK 1.8, HashMap was made up of "array + linked list", as shown below.

Since JDK 1.8, it is made up of "array + linked list + red-black tree", as shown in the figure below.
Question 1: why were red-black trees introduced?

A: Linked-list lookup is O(n), while a red-black tree is a balanced tree with O(log n) lookup. When hash collisions pile too many nodes into one bucket, list lookup becomes slow, so red-black trees were introduced as an optimization.

Question 2: why keep linked lists at all — why not just "array + red-black tree"?

A: A red-black tree is more complex to build, needs rebalancing on insertion and deletion, and requires more storage per node than a list. When the number of colliding nodes is small, a linked list performs better overall.

Question 3: when a bucket has collisions, when is a linked list used and when a red-black tree?
The relevant constants in the source:

```java
// Chain length threshold for converting a bucket's list to a red-black tree (treeifyBin)
static final int TREEIFY_THRESHOLD = 8;
// Minimum table capacity required before treeifying; below this, resize instead
static final int MIN_TREEIFY_CAPACITY = 64;
// Node-count threshold for converting a tree back to a list during resize (untreeify)
static final int UNTREEIFY_THRESHOLD = 6;
```
There are four cases over the life cycle (see the sketch after this list):

- If a bucket's collision chain has fewer than 8 nodes, it stays a linked list
- If the chain reaches 8 nodes but the table capacity is below 64, the table is resized first rather than being treeified immediately
- If the chain reaches 8 nodes and the table capacity is at least 64, the linked list is converted to a red-black tree
- If a bucket is already a red-black tree and, after a resize, its node count drops to 6 or fewer, the tree is converted back to a linked list
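A paraphrased sketch of where these thresholds are checked (condensed from java.util.HashMap#putVal and #treeifyBin in JDK 1.8; not the complete code):

```java
// In putVal, after appending a node to a bucket's chain:
if (binCount >= TREEIFY_THRESHOLD - 1) // chain length has reached 8
    treeifyBin(tab, hash);

// In treeifyBin:
if (tab == null || tab.length < MIN_TREEIFY_CAPACITY)
    resize(); // table still small: expand rather than treeify
else {
    // ... convert the bucket's Nodes to TreeNodes and build the tree
}
```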
Why were 6 and 8 chosen as the thresholds? The source comments explain: with random hashCodes, the number of nodes falling into the same bucket follows a Poisson distribution (reference: en.wikipedia.org/wiki/Poisso…):
Nodes in one bucket | Probability |
---|---|
0 | 0.60653066 |
1 | 0.30326533 |
2 | 0.07581633 |
3 | 0.01263606 |
4 | 0.00157952 |
5 | 0.00015795 |
6 | 0.00001316 |
7 | 0.00000094 |
8 | 0.00000006 |
more than 8 | less than 1 in ten million |
From these probabilities, the chance of a bucket reaching 8 nodes is vanishingly small (about 6 in 100 million); in theory a map would need on the order of tens of millions of entries before the list-to-tree condition is ever met. Using 6 (rather than 7) as the shrink threshold presumably leaves a gap so a bucket does not flip back and forth between list and tree.
Important HashMap attributes
1. The table array that stores the nodes:

```java
transient Node<K,V>[] table;
```

Node is a linked-list node type:

```java
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    // ...
}
```

When a bucket is converted to a red-black tree, its nodes become TreeNodes; TreeNode is (via LinkedHashMap.Entry) a subclass of Node:

```java
public class LinkedHashMap<K,V> extends HashMap<K,V> implements Map<K,V> {
    static class Entry<K,V> extends HashMap.Node<K,V> {
        Entry<K,V> before, after;
        // ...
    }
}

static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    // ...
}
```
The table array storage structure is shown below
2. Other attributes:

```java
// table.length is the capacity
transient Node<K,V>[] table;
// Number of key-value pairs currently stored in the HashMap
transient int size;
// Resize threshold: when size reaches this value, a resize is triggered
int threshold;
// Load factor; resize threshold = capacity × load factor
final float loadFactor;
```
The overall idea matches the theory above, though the details differ; below I walk through some of the important code. Having reviewed hash table principles, the four attributes above should feel natural — which is exactly why mastering the basic theory first is worthwhile.
Default values:

```java
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
static final float DEFAULT_LOAD_FACTOR = 0.75f;
```
The table capacity must be a power of two (2^N). HashMap does support passing in an initial capacity, and if it is not a power of two, the code converts it:

```java
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}
```

Notice that apart from `transient Node<K,V>[] table;` itself there is no dedicated capacity field. The code takes an indirect route through the threshold field: the requested capacity is parked in threshold first, then converted back into the real capacity when the table is initialized:

```java
public HashMap(int initialCapacity, float loadFactor) {
    // (argument validation elided)
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}
```

The threshold is corrected back into the capacity later, inside resize().

So how does tableSizeFor convert a number that is not a power of two into the next power of two?
```java
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}
```
The code and the idea behind it are actually quite simple — not as complicated as people make it sound. Let me explain. Suppose n is a 32-bit value whose highest bit is 1 (the worst case):
Code | n in binary |
---|---|
initial n | 1000 0000 0000 0000 0000 0000 0000 0000 |
`n \|= n >>> 1;` | 1100 0000 0000 0000 0000 0000 0000 0000 |
`n \|= n >>> 2;` | 1111 0000 0000 0000 0000 0000 0000 0000 |
`n \|= n >>> 4;` | 1111 1111 0000 0000 0000 0000 0000 0000 |
`n \|= n >>> 8;` | 1111 1111 1111 1111 0000 0000 0000 0000 |
`n \|= n >>> 16;` | 1111 1111 1111 1111 1111 1111 1111 1111 |
As the table shows, every bit below the highest 1 gets filled with 1. In other words, only the highest set bit matters, and adding 1 at the end always produces a power of two.

Two examples:

- Example 1: the input is 16, binary 0001 0000. After `int n = cap - 1;`, n is 0000 1111. The shift-or steps leave it at 0000 1111 (everything below the highest 1 is already 1), and `n + 1` gives 0001 0000 = 16 as the final capacity. (This is also why `cap - 1` is needed: without it, an input that is already a power of two would be doubled.)
- Example 2: the input is 25, binary 0001 1001. After `int n = cap - 1;`, n is 0001 1000. The shift-or steps fill in the bits below the highest 1, giving 0001 1111, and `n + 1` gives 0010 0000 = 32 as the final capacity.
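To see the rounding in action, here is a small standalone demo (my own test harness around a copy of the method; MAXIMUM_CAPACITY is 1 << 30 in the JDK source):

```java
public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Copied from java.util.HashMap (JDK 1.8) for illustration
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        for (int cap : new int[]{1, 16, 17, 25, 100}) {
            System.out.println(cap + " -> " + tableSizeFor(cap));
        }
        // prints: 1 -> 1, 16 -> 16, 17 -> 32, 25 -> 32, 100 -> 128
    }
}
```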
The HashMap put (insert) process

The flowchart for this section is based on the original author's, redrawn while following the source. Honestly, the figure alone is easy to forget; it is best to trace through the source yourself. Code entry point: java.util.HashMap#putVal

How is the hash address computed?

Looking at the source, every key goes through the hash function on its way into the table:
```java
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
```
Following the code, the key's hashCode is taken and XORed with itself shifted right by 16 bits (XOR: differing bits give 1, matching bits give 0):
```java
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```
In the theory section we said HashMap uses the division-remainder method, but HashMap improves on it. Look at the indexing code (snipped from putVal):

```java
p = tab[i = (n - 1) & hash];
```

Because the capacity n is always a power of two, `(n - 1) & hash` is equivalent to `hash MOD n` but uses a single bitwise AND instead of a division. This also explains the `(h = key.hashCode()) ^ (h >>> 16)` step above: while the table is small, `(n - 1)` only has 1s in its low bits, so without the mixing the high bits of hashCode would never participate in the index calculation; XOR-ing in the high 16 bits spreads their influence into the low bits and reduces collisions. A small sanity check follows.
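A quick demo of both points (my own sketch, not JDK code): for a power-of-two n, `(n - 1) & h` equals the non-negative remainder of h mod n, and without the high-bit mixing, keys that differ only in their high bits collide:

```java
public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // table capacity; must be a power of two
        for (int h : new int[]{5, 21, -7}) {
            // (n - 1) & h matches the mathematical remainder, even for negative h
            System.out.printf("h=%d: (n-1)&h=%d, floorMod=%d%n",
                    h, (n - 1) & h, Math.floorMod(h, n));
        }

        // Two hashCodes that differ only in the high bits:
        int h1 = 0x12340005, h2 = 0x56780005;
        System.out.println(((n - 1) & h1) + " vs " + ((n - 1) & h2)); // 5 vs 5: collision
        // After HashMap-style mixing, the high bits influence the index:
        int m1 = h1 ^ (h1 >>> 16), m2 = h2 ^ (h2 >>> 16);
        System.out.println(((n - 1) & m1) + " vs " + ((n - 1) & m2)); // 1 vs 13: no collision
    }
}
```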
The HashMap resize process

This flowchart is likewise based on the original author's, redrawn while following the source; again, it is best to trace the code yourself. Code entry point: java.util.HashMap#resize

How to understand `(e.hash & oldCap) == 0`?

This expression decides each node's address in the new table. When resizing, the capacity doubles: `newCap = oldCap << 1`, i.e. `newCap = oldCap * 2`.

Looking at the index formula, doubling the capacity means `newCap - 1` has exactly one more 1 bit than `oldCap - 1`. For example, if `oldCap - 1` is 0000 1111, then `newCap - 1` is 0001 1111. So a node's new index can differ from its old one only in that extra bit, and `(e.hash & oldCap) == 0` tests exactly that bit: if it is 0, the node stays at its old index; if it is 1, the node moves to old index + oldCap. Since the new capacity is twice the old one, the two possible positions are exactly oldCap apart — quite clever. A small demo follows.
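A minimal illustration of the split (my own example values, not JDK source):

```java
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = oldCap << 1;
        // Four hashes that all share old index 5 (they differ only in bit 4 and above)
        for (int h : new int[]{5, 21, 37, 53}) {
            int oldIdx = h & (oldCap - 1);
            int newIdx = h & (newCap - 1);
            boolean stays = (h & oldCap) == 0; // test the one new mask bit
            System.out.printf("hash=%d old=%d new=%d -> %s%n",
                    h, oldIdx, newIdx, stays ? "stays" : "moves to old + " + oldCap);
        }
        // hash=5  old=5 new=5  -> stays
        // hash=21 old=5 new=21 -> moves to old + 16
        // hash=37 old=5 new=5  -> stays
        // hash=53 old=5 new=21 -> moves to old + 16
    }
}
```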
HashMap thread safety issues

HashMap is not thread-safe: under concurrency, writes can silently overwrite each other, and modifying the map while iterating throws ConcurrentModificationException, among other problems.

Before JDK 1.8 there was also an infinite-loop problem:

- JDK 1.7 resizing used "head insertion"
- JDK 1.8 and later use "tail insertion"

Here is the JDK 1.7 resize (transfer) code:
```java
void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i]; // head insertion: point at the current bucket head
                newTable[i] = e;      // the moved node becomes the new head
                e = next;
            } while (e != null);
        }
    }
}
```
For example, suppose a bucket holds nodes 1, 2, 3; the transfer proceeds as shown in the figure.

To simplify the problem: head insertion walks the old chain and re-inserts each element at the head of the new chain, so the elements end up in reverse order, as shown below.
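A tiny simulation of that reversal (my own sketch using a standard list rather than HashMap internals):

```java
import java.util.LinkedList;

public class HeadInsertDemo {
    public static void main(String[] args) {
        LinkedList<Integer> dst = new LinkedList<>();
        for (int x : new int[]{1, 2, 3}) {
            dst.addFirst(x); // "head insertion": each element becomes the new head
        }
        System.out.println(dst); // [3, 2, 1] -- reverse order
    }
}
```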
Reproducing the problem with multiple threads

Step 1: thread 1 executes just far enough to record the first element and its successor, then is suspended; thread 2 then runs the entire transfer to completion (impressively fast, doge).

Step 2: now it is thread 1's turn. Look at the code again, with the important steps marked 1/2/3:

1) `Entry<K,V> next = e.next;` (this also drives the loop exit, via `e = next;`)
2) `e.next = newTable[i];` (newTable[i] is the current head)
3) `newTable[i] = e;` (e becomes the new head)

Because thread 2 has already reversed the chain, thread 1's saved `e` and `next` now point at nodes whose links run the other way; replaying steps 1–3 leaves two nodes pointing at each other. The transfer loop itself still exits normally, but the bucket is left containing a cycle.

So where does the infinite loop actually happen?

In the java.util.HashMap#get method: the bucket walk `(e = e.next) != null` never becomes false once the chain contains a cycle, and that tight loop spins the CPU at full speed.
Summary (this part is quoted directly from the original article; I have nothing to add)
The major optimizations in JDK 1.8 include the following:
- The underlying data structure changed from "array + linked list" to "array + linked list + red-black tree", mainly to optimize lookup in long chains under heavy hash collisions: O(n) → O(log n).
- The way the initial table capacity is computed changed. The old approach started from 1 and kept shifting left until reaching a value greater than or equal to the requested capacity; the new approach computes it with the "5 shifts + OR-assignments" trick (tableSizeFor).
- The hash computation was optimized: the old version used a convoluted sequence of shifts and XORs, while the new one simply mixes in the high 16 bits.
- During resizing, insertion changed from head insertion to tail insertion, avoiding the infinite loop in concurrent scenarios.
- During resizing, a node's position in the new table is decided by `hash & oldCap` instead of recomputing `hash & (length - 1)`. The performance gain may be small, but the design is more clever and elegant.
Other