This article is a study note based on the blog post "Interview at Ali: this HashMap article is enough", recorded for later review, together with some of my own understanding

Hash table

What is a hash table

Searches over linear tables, trees and the like (binary search, binary sort tree search, B-tree search) are comparison-based searches; that is, query efficiency depends entirely on the number of comparisons. Why must we compare? Because the positions where the elements are stored are completely unrelated to the elements themselves.

A hash table is a different way of querying: it establishes a correspondence between an element and its storage location, so an element can be found without comparisons, much like a dictionary in real life, where a keyword leads you directly to its page number.

Hash == a mapping relationship == f(x)

For example, take province names as the keywords; the mapping relationship is: take the first letter of the name and, by its position in the alphabet, compute the corresponding storage subscript

x      BEIJING   TIANJIN   HEBEI   SHANXI   SHANDONG   GUANGZHOU
f(x)   02        20        08      19       19         07
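As a minimal sketch of this mapping (my own illustration, not from the original article):

public class ProvinceHash {
    // f(x): alphabetical position of the keyword's first letter (A=1 ... Z=26)
    static int f(String keyword) {
        return keyword.charAt(0) - 'A' + 1;
    }

    public static void main(String[] args) {
        System.out.println(f("BEIJING"));   // 2
        System.out.println(f("TIANJIN"));   // 20
        System.out.println(f("SHANXI"));    // 19
        System.out.println(f("SHANDONG"));  // 19 -- same as SHANXI: a hash conflict
    }
}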

Storage structure

  1. To find BEIJING, run the keyword through the mapping relationship, get the subscript 02, and the element is found directly
  2. Different keys may yield the same f(x). For example, SHANXI and SHANDONG both map to 19; this phenomenon is called a hash conflict. Hash conflicts cannot be avoided; they have to be handled when they occur
  3. A data set organized by a hash algorithm is a hash table

Hash function construction method

There are several construction methods; I'll focus only on the one used by HashMap, the division-remainder method

H(key) = key MOD P

For example, keywords: 28, 35, 63, 77, 105, P =21

Keyword        28   35   63   77   105
Hash address   7    14   0    14   0
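A quick sketch of my own to verify the table:

public class DivisionRemainder {
    public static void main(String[] args) {
        int p = 21;
        int[] keys = {28, 35, 63, 77, 105};
        for (int key : keys) {
            // H(key) = key MOD p
            System.out.println(key + " -> " + (key % p));
        }
        // Output: 28 -> 7, 35 -> 14, 63 -> 0, 77 -> 14, 105 -> 0
    }
}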

Ways to handle conflicts

There are several ways to handle conflicts; I'll focus only on the one HashMap uses, the chaining (separate chaining) method: when a hash conflict occurs, use array + linked list (HashMap of course adds its own optimizations). A minimal sketch follows.
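The sketch below is my own minimal illustration of separate chaining (fixed capacity, no resizing, String keys only), not HashMap's real implementation:

public class ChainedTable {
    static class Node {
        String key;
        String value;
        Node next;
        Node(String k, String v, Node n) { key = k; value = v; next = n; }
    }

    final Node[] buckets = new Node[16];   // the array; each slot heads a chain

    void put(String key, String value) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;
        for (Node e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) { e.value = value; return; }   // key exists: overwrite
        }
        buckets[i] = new Node(key, value, buckets[i]);            // chain new node at the head
    }

    String get(String key) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;
        for (Node e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) return e.value;                // walk the chain on conflict
        }
        return null;
    }
}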

Lookup analysis of hash tables

The efficiency of a hash lookup depends on the length of the chain: if a chain holds too many elements, lookup degenerates into plain list traversal. So we need to control the probability of hash conflicts; that is, once the number of stored elements reaches a certain proportion of the table's capacity, we resize the table to reduce the conflict probability. That proportion is the hash table's load factor

Note: HashMap's default load factor is 0.75; with the default capacity of 16, the resize threshold is 16 x 0.75 = 12

With the theory above in place, we can start on HashMap itself, mainly by reading the source of java.util.HashMap and understanding the life cycle of several important methods. The code shown corresponds to JDK 1.8

HashMap

In order to understand HashMap, it helps to learn the following in advance:

  • Balanced binary trees
  • Red-black trees

HashMap storage structure

Before JDK 1.8, HashMap was composed of "array + linked list", as shown below



JDK 1.8 is composed of "array + linked list + red-black tree", as shown in the figure below

Question 1: why were red-black trees introduced?

A: Linked-list lookup is O(n), while a red-black tree is a balanced tree with O(log n) lookup. When hash conflicts pile up too many nodes at one address, list lookup becomes slow, so the red-black tree was introduced as an optimization

Question 2: why keep linked lists at all, instead of just array + red-black tree? A: Because a red-black tree is more complex to build, needs balancing adjustments on its nodes and other such operations, and requires more storage than a linked list. When the number of conflicting nodes is small, the overall performance of a linked list is better than that of a red-black tree

Question 3: when nodes collide at the same hash address, when is a linked list used, and when a red-black tree?

The relevant constants in the source:

// Threshold for converting a list to a red-black tree (see treeifyBin)
static final int TREEIFY_THRESHOLD = 8;
// Minimum table capacity required before lists are converted to trees
static final int MIN_TREEIFY_CAPACITY = 64;
// Threshold for converting a red-black tree back to a list (untreeify)
static final int UNTREEIFY_THRESHOLD = 6;

The life cycle has four cases (a simplified sketch of the corresponding logic follows the list):

  1. When fewer than 8 nodes collide at the same address, a linked list is used
  2. When more than 8 nodes collide at the same address but the array capacity is smaller than 64, the array is resized first rather than converting to a red-black tree immediately
  3. When more than 8 nodes collide at the same address and the array capacity is larger than 64, the linked list is converted to a red-black tree
  4. When a bucket that is already a red-black tree ends up with 6 or fewer conflicting nodes after an array resize, the red-black tree is converted back to a linked list
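Referring back to the constants above, here is a simplified sketch of these decisions. It is my paraphrase of the control flow with stubbed helpers, not the verbatim JDK source:

class TreeifyDecisionSketch {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;
    static final int UNTREEIFY_THRESHOLD = 6;

    // Cases 1-3: after appending to a bucket's list, putVal checks the node count.
    void afterListInsert(Object[] tab, int binCount) {
        if (binCount >= TREEIFY_THRESHOLD - 1) {               // list has reached 8 nodes
            if (tab == null || tab.length < MIN_TREEIFY_CAPACITY)
                resize();                                      // case 2: grow the array first
            else
                treeify();                                     // case 3: list -> red-black tree
        }                                                      // case 1: otherwise stay a list
    }

    // Case 4: when a resize splits a tree bucket, a half with few enough
    // nodes is converted back to a linked list.
    void afterTreeSplit(int nodeCount) {
        if (nodeCount <= UNTREEIFY_THRESHOLD)
            untreeify();                                       // tree -> list
    }

    void resize() { }
    void treeify() { }
    void untreeify() { }
}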

Why were 6 and 8 chosen as the reference thresholds? The source comments explain: with random hashCodes, the number of nodes at one hash address follows a Poisson distribution; reference: (en.wikipedia.org/wiki/Poisso…)

Nodes at the same hash address   Probability
0                                0.60653066
1                                0.30326533
2                                0.07581633
3                                0.01263606
4                                0.00157952
5                                0.00015795
6                                0.00001316
7                                0.00000094
8                                0.00000006
more than 8                      less than 1 in 10 million

Judging from these probabilities, each additional colliding node is roughly an order of magnitude less likely, so reaching 8 is extremely rare: in theory a map would need over ten million entries before the list-to-tree condition is met

Important HashMap attributes

  1. The table array that stores the nodes

  • The array declaration:

    transient Node<K,V>[] table;

  • Node is a linked-list node type:

    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
        ...
    }

  • A node that has been converted for the red-black tree is a TreeNode; TreeNode is a subclass of Node, via LinkedHashMap.Entry:

    public class LinkedHashMap<K,V> extends HashMap<K,V> implements Map<K,V> {
        static class Entry<K,V> extends HashMap.Node<K,V> {
            Entry<K,V> before, after;
            ...
        }
    }

    static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
        TreeNode<K,V> parent;  // red-black tree links
        TreeNode<K,V> left;
        TreeNode<K,V> right;
        TreeNode<K,V> prev;    // needed to unlink next upon deletion
        boolean red;
        ...
    }

The table array storage structure is shown below



2. Other attributes

// The capacity is table.length
transient Node<K,V>[] table;
// Number of key-value pairs already stored in the HashMap
transient int size;
// Resize threshold: when the number of entries reaches it, a resize is triggered
int threshold;
// Load factor; resize threshold = capacity x load factor
final float loadFactor;

The general idea matches the description above, though the details differ; the important pieces of code are explained below

Having reviewed the principles of hash tables, the important attributes above follow naturally, which is why mastering the necessary theory first is worthwhile.

Default values

    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
  1. The table capacity must be 2 to the power of N. HashMap does allow passing in an initial capacity; if it is not a power of 2, the code converts it

    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }


    There is no separate capacity field: the capacity is simply table.length, and the table is not allocated in the constructor. So the code "saves the nation by a curve" through the threshold variable: the converted value is stashed in threshold first, used as the capacity, and changed back when the table is initialized.

        public HashMap(int initialCapacity, float loadFactor) {
            this.loadFactor = loadFactor;
            this.threshold = tableSizeFor(initialCapacity);
        }

    (The threshold and capacity are corrected back when the table is actually allocated in resize().)

So how does tableSizeFor convert a number that is not 2 to the N into one that is?

    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

The code and the idea behind it are actually quite simple, not as complicated as people make it out to be. Let me explain the idea

Suppose n is a 32-bit value whose highest bit is the only bit set

Code              Corresponding binary
initial n         1000 0000 0000 0000 0000 0000 0000 0000
n |= n >>> 1;     1100 0000 0000 0000 0000 0000 0000 0000
n |= n >>> 2;     1111 0000 0000 0000 0000 0000 0000 0000
n |= n >>> 4;     1111 1111 0000 0000 0000 0000 0000 0000
n |= n >>> 8;     1111 1111 1111 1111 0000 0000 0000 0000
n |= n >>> 16;    1111 1111 1111 1111 1111 1111 1111 1111

From the table we can see that every bit below the highest set bit gets filled with 1; in other words, only the highest bit matters. Adding 1 at the end then carries over, so the result must be 2 to the N

Two examples:

  • Example 1) If the input is 16, binary 0001 0000. Running int n = cap - 1; gives 0000 1111; the shift-or operations leave it unchanged (everything below the highest bit is already 1); finally n + 1; restores 0001 0000 = 16. This is why cap - 1 comes first: an exact power of 2 maps to itself instead of doubling
  • Example 2) If the input is 25, binary 0001 1001. Running int n = cap - 1; gives 0001 1000; the shift-or operations fill in below the highest bit, giving 0001 1111; finally n + 1; yields the final capacity 0010 0000 = 32
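To experiment with it, tableSizeFor can be lifted out into a standalone class (the wrapper class and main are mine; the method body is the same as above):

public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(16)); // 16 (already a power of 2)
        System.out.println(tableSizeFor(25)); // 32 (next power of 2)
        System.out.println(tableSizeFor(1));  // 1
    }
}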

The HashMap put (insertion) process

This flow chart is adapted from the original author's, redrawn while following the source. Honestly, the chart alone is easy to forget; it is best to walk through the source yourself. Code entry: java.util.HashMap#putVal

How the hash address is computed

Looking at the source, every key is passed through the hash function on its way into the table:

    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

Following the code, the key's hashCode() is taken and XORed with itself shifted right by 16 bits (XOR: different bits give 1, identical bits give 0):

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

In the theory section we saw that HashMap uses the division-remainder method, but HashMap improves on it: because the capacity n is a power of 2, key MOD n is equivalent to the cheaper bitwise operation (n - 1) & hash. Look at the code:

p = tab[i = (n - 1) & hash]

Why the extra (h = key.hashCode()) ^ (h >>> 16) step? Because (n - 1) & hash keeps only the low bits while the table is small, so without this step none of the high bits of hashCode would participate in the index calculation
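A small self-contained demonstration (the hash values are made up for illustration): with capacity n = 16 the index keeps only the low 4 bits, so two hashCodes that differ only in their high bits always collide unless the high bits are mixed in:

public class HashSpreadDemo {
    static int spread(int h) {
        return h ^ (h >>> 16); // same transform as HashMap's hash() for non-null keys
    }

    public static void main(String[] args) {
        int n = 16;                 // table capacity
        int h1 = 0x00010000;        // differ only in the high 16 bits
        int h2 = 0x00020000;
        System.out.println((n - 1) & h1);          // 0
        System.out.println((n - 1) & h2);          // 0 -> guaranteed collision
        System.out.println((n - 1) & spread(h1));  // 1
        System.out.println((n - 1) & spread(h2));  // 2 -> high bits now influence the index
    }
}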

The HashMap resize process

This flow chart is adapted from the original author's, redrawn while following the source. Honestly, the chart alone is easy to forget; it is best to walk through the source yourself. Code entry: java.util.HashMap#resize



How to understand the (e.hash & oldCap) == 0 judgment for the new hash address

On each resize the capacity doubles: newCap = oldCap << 1, i.e. newCap = oldCap * 2

Looking at the index formula (n - 1) & hash: doubling the capacity gives cap - 1 one more 1-bit at the top. For example, oldCap - 1 = 0000 1111 becomes newCap - 1 = 0001 1111. So the new index can differ from the old one only in that single new bit, and (e.hash & oldCap) == 0 tests exactly that bit: if it is 0 the node keeps its old index, otherwise it moves to old index + oldCap



Because the new capacity is twice the old, the moved nodes land exactly oldCap away from their old position, which is quite clever
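A self-contained sketch of this rule (the values are mine, chosen so all four hashes share old index 5), assuming oldCap = 16 and newCap = 32:

public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16;
        int[] hashes = {5, 21, 37, 53};   // all land at index 5 in the old table
        for (int h : hashes) {
            int oldIndex = h & (oldCap - 1);
            // (h & oldCap) tests the one new bit exposed by doubling the capacity
            int newIndex = ((h & oldCap) == 0) ? oldIndex : oldIndex + oldCap;
            int check = h & (2 * oldCap - 1);   // full recomputation, for comparison
            System.out.println(h + ": " + oldIndex + " -> " + newIndex + " (check: " + check + ")");
        }
    }
}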

HashMap thread-safety issues

HashMap is not thread-safe: under concurrency there are problems such as lost updates (one put overwriting another), ConcurrentModificationException when modifying during traversal, and so on

There was an infinite loop before JDK 1.8.

  • JDK 1.7 resize used "head insertion"
  • From JDK 1.8 on, "tail insertion" is used

The culprit is the JDK 1.7.0 resize (transfer) code:

void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;               // remember the rest of the chain
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];                   // head insertion: point at the current head
                newTable[i] = e;                        // become the new head
                e = next;
            } while (e != null);
        }
    }
}

For example, suppose a bucket holds the chain 1, 2, 3. The demonstration is as follows



To simplify the problem: head-inserting elements while traversing a linked list puts them in reverse order, as shown


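A minimal sketch of that reversal (the Node class here is a stand-in of my own, not HashMap's Entry):

public class ListReversalDemo {
    static class Node {
        int value;
        Node next;
        Node(int v, Node n) { value = v; next = n; }
    }

    public static void main(String[] args) {
        Node head = new Node(1, new Node(2, new Node(3, null)));  // 1 -> 2 -> 3
        Node newHead = null;
        for (Node e = head; e != null; ) {
            Node next = e.next;   // remember the rest of the chain
            e.next = newHead;     // head insertion: point at the current new head
            newHead = e;          // become the new head
            e = next;
        }
        for (Node e = newHead; e != null; e = e.next)
            System.out.print(e.value + " ");              // prints: 3 2 1
    }
}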

Reproducing the problem with multiple threads

Step 1) Thread 1 runs just far enough to mark the first element (it has executed Entry<K,V> next = e.next;) and is then suspended; execution switches to thread 2, which is "strong" and runs the entire transfer straight through to completion (truly skilled, doge)



Step 2) Now it is thread 1's turn to execute. Look at the code again; the important steps are marked 1/2/3

1) Entry<K,V> next = e.next; (next also feeds the loop's exit via e = next;)

2) e.next = newTable[i]; (newTable[i] is the current head)



3) newTable[i] = e; (e becomes the new head)



At this point the chain contains a cycle. The transfer loop itself still exits, and the resize "finishes" normally

So where does the infinite loop actually strike?

In the java.util.HashMap#get method: on the cyclic chain, (e = e.next) != null never becomes false, so the lookup spins forever, burning CPU
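To see it concretely, here is a self-contained toy of mine (not JDK code) that rebuilds a cyclic chain and runs the same scan shape with an artificial step limit:

public class CyclicChainDemo {
    static class Entry {
        int key;
        Entry next;
        Entry(int k) { key = k; }
    }

    public static void main(String[] args) {
        // Build a two-node chain and wire it into a cycle, as after the broken resize.
        Entry a = new Entry(1), b = new Entry(2);
        a.next = b;
        b.next = a;                        // the cycle

        // Same scan shape as the lookup: e = e.next never reaches null on this
        // chain, so without the step limit the loop would spin forever.
        int steps = 0;
        for (Entry e = a; e != null && steps < 10; e = e.next, steps++) {
            if (e.key == 42) break;        // searching for a key that is not there
        }
        System.out.println("stopped after " + steps + " steps (would be infinite)");
    }
}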

Summary (this part is copied directly from the original; nothing to add)

The major optimizations in JDK 1.8 include the following:

  1. The underlying data structure changed from "array + linked list" to "array + linked list + red-black tree", mainly to optimize lookup in long chains under heavy hash conflicts: O(n) -> O(log n)
  2. The way the initial table capacity is computed changed. The old way started from 1 and kept shifting left until reaching a value greater than or equal to the given capacity; the new way computes it with five "shift + OR-assign" operations
  3. The hash computation was optimized: the old version used a jumble of shifts and XORs, while the new one simply mixes in the high 16 bits
  4. During resize, insertion changed from head insertion to tail insertion, avoiding the infinite loop in concurrent scenarios
  5. During resize, computing a node's index in the new table changed from hash & (length - 1) to the hash & oldCap test. The performance gain may be small, but the design is more clever and elegant

Other