An overview of HashMap

HashMap is a data structure we use constantly in day-to-day development. Introducing it involves a number of terms: time complexity (big O), hash (also called hashing), hash algorithms, and so on. These were all taught in college courses, but thanks to some force majeure most of us have long since handed them back to our teachers. Before studying HashMap in depth, we should first understand its design ideas and a few important concepts, so that the later source-code analysis is clear.

What is a HashMap

Xiao Ming decided to move from city A to city B. He packed his belongings into two boxes and set out.

When he arrived in city B, he wanted to grab his mobile phone and tell his family he was safe. Now the problem: with so much stuff, he had forgotten which box the phone was in. Xiao Ming opened box No. 1 and didn't find it; he opened box No. 2 and there it was. With only two boxes and few items, finding the phone might take him two minutes. With twenty boxes, each stuffed with miscellaneous items, it would take far longer. Thinking over why the search took so long, he realized it was because nothing had been put away with any order. If he had sorted things into categories, say one box for phones, computers and other electronics and another box just for clothes, the time spent looking for things could be cut dramatically.

A HashMap works in much the same way. Like arrays and linked lists, a HashMap is a data structure used for adding, deleting, updating and looking things up. When it stores data (put), it does not drop the entry just anywhere: it first performs a "classification"-like step and then stores it, so that the next retrieval (get) takes far less time. We know that arrays are very efficient for lookups and in-place updates but inefficient for insertions and deletions (except at the tail), while linked lists are the opposite: insertions and deletions are cheap, lookups are slow. A HashMap is a combination of the two.

As the structure above shows, a HashMap is essentially made up of an array plus linked lists (Java 8 converts chains longer than 8 into red-black trees). Combined with the phone-hunting example, let's walk through the mental model of a HashMap access. When storing a key-value pair (say, Galen and his attack value), put first runs the "classification" step on the key, which tells us Galen belongs in pit 14, so we go straight to pit 14. Then there are several scenarios:

  • Pit 14 is empty: nice, store the value directly;
  • Pit 14 is already occupied by Galen himself: the new attack value replaces the original one;
  • Pit 14 is occupied by someone else, Lao Wang! Cut in line in front of Lao Wang (head insertion into a singly linked list: the new element always goes at the head of the chain).

get also takes a key, and the key determines which "category" to search. Say we are looking for the fire man: the "classification" step tells us he belongs in pit No. 2, so we go straight to pit No. 2 and walk the chain: Yasuo? No... there he is, the fire man.

Summary

HashMap is a data structure composed of an array plus linked lists; in Java 8, a linked list longer than 8 is converted into a red-black tree. On every access a hashCode is calculated from the key, and the hashCode is then used to locate the slot in the array and carry out the operation there.
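Before going further, a minimal usage example (the champion names and attack values are made up purely for illustration) shows those put scenarios and a get from the caller's side:

    import java.util.HashMap;

    public class HashMapIntro {
        public static void main(String[] args) {
            HashMap<String, Integer> attack = new HashMap<>();
            attack.put("galen", 66);      // the slot is empty: store directly
            attack.put("laowang", 50);    // a different key (may or may not collide)
            attack.put("galen", 88);      // same key again: the old value 66 is replaced
            System.out.println(attack.get("galen"));  // 88
            System.out.println(attack.get("yasuo"));  // null: no such key
        }
    }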

What is a hashCode

Take a factory with 500 employees, and compare two schemes for storing the employees' mail.

Both the left and the right side have 27 mailboxes. The security guard on the left does no processing when a letter arrives: he drops it into whichever box he feels like. When an employee comes looking for a letter, they have to go through every mailbox, comparing the names on the envelopes one by one; and if the guard happens to be in a bad mood, the letter may be sitting at the bottom of the last box. The time complexity in this case is O(N). The guard on the right sorts the mail into 27 categories in a hashCode-like way, the rule being the first letter of the name (ignoring, for now, anyone who doesn't write a name): each letter goes into the box matching the name on it. An employee no longer has to search every box; they only need to look through a single one, which greatly improves efficiency. The time complexity in this case approaches a constant, O(1).

In the example, the scheme on the right is effectively an implementation of hashCode. Each employee has their own hashCode: Li Si's hashCode is L, Wang Wu's is W (it depends on how your hash algorithm is written), and the mailboxes are organized by those hashCode values, so a letter goes into the box matching its hashCode. In Java you can call the hashCode() method defined on Object to get an object's hashCode, which is returned as an int. Can two different objects have the same hashCode? Yes — just like Galen and Lao Wang landing in the same pit earlier. This situation is called a "hash collision"; the way such "collisions" are handled was introduced above, and the details appear in the source-code analysis later.
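To see a real collision without writing a custom class, the strings "Aa" and "BB" are a classic pair that happen to share a hashCode (a standalone check, unrelated to the mailbox example):

    public class HashCodeDemo {
        public static void main(String[] args) {
            System.out.println("Aa".hashCode()); // 2112
            System.out.println("BB".hashCode()); // 2112
            // Different objects, same hashCode: a hash collision
            System.out.println("Aa".equals("BB")); // false
        }
    }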

Summary

A hashCode is an object's identity card; in Java an object's hashCode is a value of type int. Using the hashCode to determine the array index lets us locate the right slot quickly and then walk the chain to find the corresponding value, ideally in O(1) time. Different objects can have the same hashCode.

The time complexity of a HashMap

With the hashCode we can locate a box directly, so the time is spent mainly on traversing the chain. Ideally (the hash algorithm is written perfectly) each chain has only one node, which is exactly what we want:

In that case the time complexity is O(1). If the situation is not ideal, as in the mailbox example above, suppose the hash algorithm returns the same hashCode for every employee:

Then all the letters end up in one box, and to find a letter we have to go through the letters in mailbox C one by one, so the time complexity degrades to O(N) instead of O(1). The time complexity of a HashMap therefore depends on how well the hash algorithm is implemented. Of course, the internals of a HashMap are not as simple as the mailboxes: a HashMap resizes itself, and Java 8 converts chains longer than 8 into red-black trees, both of which are covered later.

Summary

The time complexity of a HashMap depends on the hash algorithm. A good hash algorithm keeps it close to constant, O(1); a bad one lets it degrade toward O(N).

What is the load factor

We know that the default array length in a HashMap is 16 (why 16? That comes later). Suppose we use the best possible hash algorithm, so that every key-value pair lands in its own slot; then with 16 key-value pairs in the HashMap it still takes only one comparison to find a given one.

If we keep storing values into the HashMap, the so-called "hash collisions" inevitably appear and linked lists form. With 32 key-value pairs in the HashMap, the worst case takes two comparisons to find a given pair; with 128 key-value pairs, the worst case takes eight.
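The arithmetic behind those numbers is just entries divided by buckets, assuming a perfectly even hash; a back-of-the-envelope sketch, not JDK code:

    public class ChainLengthDemo {
        public static void main(String[] args) {
            int buckets = 16;
            for (int entries : new int[] {16, 32, 128}) {
                // With a perfectly even hash, worst-case comparisons = chain length
                System.out.println(entries + " entries -> " + (entries / buckets) + " comparison(s)");
            }
        }
    }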

As more key-value pairs are stored in a HashMap, lookups become less efficient if the number of array slots stays the same. How do we fix that? Simply increase the number of slots: once the number of key-value pairs grows past a certain point, the array has to grow. This is the "resizing" (expansion) everyone talks about.

Resizing means creating a new array and then moving the key-value pairs from the original array into the new one. Unlike ArrayList, the original index is not carried over: each entry's position is recalculated against the new table length, so the same key-value pair generally has a different index in the new array than in the old one. (Fire man: "I used to be in pit 3, now I'm in pit 127." New table: "The Qing dynasty has fallen; you listen to me now.") You can also see that this is a classic case of trading space for time.

Having said all that, what is the load factor? Put simply, the load factor decides when to expand. The default HashMap array size is 16, but a HashMap does not wait until all 16 slots are in use before expanding. There is a threshold, and as soon as the number of key-value pairs in the HashMap reaches that threshold, the capacity is expanded. The threshold is computed as follows:

Threshold = current array length * load factor

The default load factor in HashMap is 0.75, so the first expansion threshold is 16 * 0.75 = 12, meaning the expansion is triggered when the 13th key-value pair is stored. Put another way: with 12 pairs stored, 12/16 = 0.75; the 13th would make it 13/16 = 0.8125, which exceeds 0.75, so the map expands. Some readers will surely ask (channeling the Monkey King): what do I need this iron bar, er, this load factor for? Why not just wait until the array is actually full before expanding, and skip recalculating the threshold after every resize? The catch is that the fuller the array, the more hash collisions and long chains there are, and lookups slow down. The general consensus (and the JDK documentation) is that 0.75 is a good trade-off between time and space costs, but we can change it ourselves:

public HashMap(int initialCapacity, float loadFactor)
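For example, a map expected to hold many entries could be created with a larger initial capacity and a custom load factor; the numbers below are arbitrary, chosen only to show the constructor in use:

    import java.util.HashMap;
    import java.util.Map;

    public class CustomLoadFactor {
        public static void main(String[] args) {
            // 64 slots, expansion once roughly 64 * 0.5 = 32 entries are stored
            Map<String, Integer> map = new HashMap<>(64, 0.5f);
            map.put("galen", 66);
            System.out.println(map.get("galen")); // 66
        }
    }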

My own way of remembering it: the load factor is like how full people eat. Some stop at 70 percent full, some eat until they are completely stuffed, but defaulting to 75 percent full is the safe choice.

Summary

With the same array size, the more key-value pairs are stored, the slower lookups become. Expanding the array solves this, and the load factor decides when that expansion happens: it is the maximum ratio of stored key-value pairs to array length allowed before resizing. The default load factor is 0.75, and we can change it ourselves when constructing the HashMap.

Hash and rehash

The ideas behind hash and rehash have effectively been covered above: every expansion requires a rehash to compute each key-value pair's index in the new table before the pair is transferred from the old table to the new one. So a key-value pair that is stored in the HashMap and later survives an expansion goes through both a hash and a rehash.

The first hash can be read as the "classification" step, so that later operations such as get and update can quickly locate the right slot. But why rehash at all? Why not simply carry each element's old index over to the new array — why does the fire man move from pit 3 to pit 30, say?

My personal understanding: suppose that before the expansion the chain in pit 13 has length 3. To keep lookups close to O(1), the ideal is "one radish per pit". Now that there are more pits, rehashing spreads the entries out, so the old "three radishes in one pit" situation can largely be avoided; keeping the old indexes would leave those chains exactly as long as before.

Source code analysis

Java7 source code analysis

Let's look at how HashMap is implemented in Java 7. With the analysis above in mind, we can now find the corresponding pieces in the source code.

    // The array of buckets
    transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

    // Entry node: stores key, value, hash and the next node in the chain
    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        int hash;
        // constructor and methods omitted
    }

    // Default initial capacity: 1 << 4 = 16
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;

    // Default load factor: 0.75
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // Number of key-value pairs currently stored
    transient int size;

    // Expansion threshold = array length * load factor
    int threshold;

    // The load factor actually in use
    final float loadFactor;

    // Default constructor: array size 16, load factor 0.75
    public HashMap() {
        this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
    }

    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }

    public HashMap(int initialCapacity, float loadFactor) {
        // some validation logic omitted
        this.loadFactor = loadFactor;
        threshold = initialCapacity;
        init(); // empty method
    }

Those are the building blocks of a HashMap. Now let's look at the code for the operations themselves, starting with put, where the three scenarios described earlier show up.

    public V put(K key, V value) {
        // Create the array if it has not been created yet
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        // A null key is handled separately
        if (key == null)
            return putForNullKey(value);
        // (1) Compute the hash from the key
        int hash = hash(key);
        // (2) Compute the index in the array from the hash
        int i = indexFor(hash, table.length);
        // Walk the chain at that index
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            // (3) Case 1: same hash and same key -> replace the old value
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                // Return the replaced value
                return oldValue;
            }
        }

        modCount++;
        // (3) Case 2: the slot is empty, or a hash collision occurred -> addEntry
        addEntry(hash, key, value, i);
        return null;
    }

putForNullKey() stores the entry with a null key at index 0, so a null key is allowed in a HashMap. Now walk through the main flow:

Step 1. Compute the hash from the key -> hash(key)

Step 2. Compute the index in the array from the hash and the current array length -> indexFor(hash, table.length)

    static int indexFor(int h, int length) {
        // Bitwise AND of the hash with (array length - 1).
        // When length is a power of two this is the same as h % length,
        // e.g. h = 17, length = 16 -> result is 1.
        // The & operation is cheaper than %.
        return h & (length - 1);
    }
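A quick standalone check (not JDK code) shows why the bitwise trick only works when the length is a power of two:

    public class IndexForDemo {
        public static void main(String[] args) {
            int h = 17;
            // Power-of-two lengths: & and % agree
            System.out.println((h & (16 - 1)) + " vs " + (h % 16)); // 1 vs 1
            System.out.println((h & (32 - 1)) + " vs " + (h % 32)); // 17 vs 17
            // Non power-of-two length: they no longer agree
            System.out.println((h & (10 - 1)) + " vs " + (h % 10)); // 1 vs 7
        }
    }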

Step 3, case 1. Both the hash and the key match an existing entry: replace the original value and return the replaced value.

Step 3, case 2. The pit is empty, or a hash collision occurs -> addEntry(hash, key, value, i)

    void addEntry(int hash, K key, V value, int bucketIndex) {
        // The number of key-value pairs has reached the threshold and the target slot is occupied
        if ((size >= threshold) && (null != table[bucketIndex])) {
            // Double the array size
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            // Recalculate the index against the new length
            bucketIndex = indexFor(hash, table.length);
        }
        // Create the node
        createEntry(hash, key, value, bucketIndex);
    }

If the put pushes the map past the threshold, resize() is called to double the array size, and the index is recomputed against the new table length (e.g. 17 % 16 = 1 before, 17 % 32 = 17 after).

    void resize(int newCapacity) { // the new capacity is passed in
        // Grab a reference to the old array
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        // Extreme case: the capacity has already reached the maximum
        if (oldCapacity == MAXIMUM_CAPACITY) {
            // Just raise the threshold to the maximum and return
            threshold = Integer.MAX_VALUE;
            return;
        }

        // Step 1: create the new array
        Entry[] newTable = new Entry[newCapacity];
        // Step 2: transfer the key-value pairs from the old array to the new one
        transfer(newTable, initHashSeedAsNeeded(newCapacity));
        // Step 3: point table at the new array
        table = newTable;
        // Step 4: update the threshold
        threshold = (int) Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

The key part above is step 2; let's look at the transfer operation in detail.

    void transfer(Entry[] newTable, boolean rehash) {
        // Length of the new array
        int newCapacity = newTable.length;
        // Iterate over every chain in the old array
        for (Entry<K,V> e : table) {
            while (null != e) {
                Entry<K,V> next = e.next;
                if (rehash) {
                    e.hash = null == e.key ? 0 : hash(e.key);
                }
                // Recompute the index against the new length
                int i = indexFor(e.hash, newCapacity);
                // Head insertion into the new chain
                e.next = newTable[i];
                newTable[i] = e;
                e = next;
            }
        }
    }

Because of the head insertion, this loop reverses the order of the nodes in a chain during the transfer (one of the differences between Java 7 and Java 8). Suppose stone has hash 5 and Galen has hash 37: with both the old length 16 and the new length 32 they map to pit 5. The first pass:

The second pass:
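The same reversal can be reproduced with a tiny standalone sketch; the names come from this example and the Node class is a stand-in, not the JDK's Entry:

    public class TransferOrderDemo {
        static class Node {
            String key;
            Node next;
            Node(String key, Node next) { this.key = key; this.next = next; }
        }

        public static void main(String[] args) {
            // Old chain in pit 5: stone -> galen
            Node galen = new Node("galen", null);
            Node stone = new Node("stone", galen);

            // Java 7 style transfer: head insertion into the new table slot
            Node newHead = null;
            Node e = stone;
            while (e != null) {
                Node next = e.next;
                e.next = newHead;   // the current node becomes the new head
                newHead = e;
                e = next;
            }

            // Prints: galen stone -- the order has been reversed
            for (Node n = newHead; n != null; n = n.next) {
                System.out.print(n.key + " ");
            }
        }
    }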

Finally, take a look at how to create a node

    void createEntry(int hash, K key, V value, int bucketIndex) {
        // The current head of the chain (null if the slot is empty)
        Entry<K,V> e = table[bucketIndex];
        // The new Entry points to the old head, i.e. head insertion
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        size++;
    }

When a node is created, if the pit is empty the value is simply stored and size is incremented. When a collision has occurred:

This is the single-linked-list head insertion at work: the new Entry's next points at the old head (Galen: "I pushed in ahead of Lao Wang!"). To summarize, here is the Java 7 put flowchart:

Compared with put, get has far fewer moving parts: compute the hash from the key, AND it with the array length minus 1 to find the slot, then walk the chain comparing keys (a null key is again handled separately). The source is very short, so we won't go through it line by line.
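For completeness, the lookup path in Java 7 is roughly the sketch below: it condenses the JDK's get/getEntry pair into one method for readability, so treat it as an outline rather than the verbatim source.

    public V get(Object key) {
        // A null key is handled separately (it lives in bucket 0)
        if (key == null)
            return getForNullKey();
        int hash = hash(key);
        // Locate the bucket, then walk the chain comparing hash and key
        for (Entry<K,V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }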

Java8 source code analysis

The basic idea in Java 8 is the same; first, a few of the pieces that differ:

    // Chains longer than this threshold are converted to red-black trees
    static final int TREEIFY_THRESHOLD = 8;

    // Entry has been renamed Node, but it stores the same things -- I still recognize you!
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
    }

Take a look at the Java8 PUT source code

    public V put(K key, V value) {
        // Compute the hash from the key, then delegate to putVal
        return putVal(hash(key), key, value, false, true);
    }

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        // Step 1: if the array is null or empty, create it via resize()
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        // Step 2: compute the index from the hash and the array length;
        // if the "pit" is empty, just create a new node there
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            // Step 3: the "pit" is occupied and both hash and key match;
            // keep a reference and replace the value later
            if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            // Step 4: the bin is a red-black tree
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            // Step 5: the bin is a linked list
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        // Step 5.1: note that unlike Java 7, the new node is appended at the tail!
                        p.next = newNode(hash, key, value, null);
                        // If the chain length exceeds 8, convert it to a red-black tree
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    // Step 5.2: a node in the chain has the same hash and key;
                    // keep a reference and replace the value later
                    if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    // All the "replace" cases above end up here
                    e.value = value;
                afterNodeAccess(e);
                // Return the original value
                return oldValue;
            }
        }
        ++modCount;
        // Step 6: if the number of key-value pairs exceeds the threshold, resize
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

Going by the comments above, the differences from Java 7 are: Java 8 no longer treats a null key as a special case (it goes through the same path as everything else), it adds the chain-length check and the conversion to a red-black tree, and, most importantly, a new node is inserted at the tail of the chain, not the head! The real protagonist, though, is still the resize (expansion) operation.

    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        if (oldCap > 0) {
            // Extreme case: the old array has reached the maximum capacity
            if (oldCap >= MAXIMUM_CAPACITY) {
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // Shift left by 1 bit, i.e. new capacity = old capacity * 2
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            newCap = oldThr;
        else {               // zero initial threshold signifies using defaults
            // Initialization happens here
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        // Update the threshold
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        // Create the new array
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    // The chain has only one node: compute its position from the new length
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    // Red-black tree handling
                    else if (e instanceof TreeNode)
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    // A chain longer than one node (but not yet a tree):
                    // the "high-energy" part, pulled out separately below
                    else { // preserve order
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

As you can see, Java 8 folds both array initialization and expansion into the resize() method, but the idea is the same: grow the array, transfer the entries, and recompute each entry's position in the new table during the transfer. The last "high-energy" piece of code above may not be easy to grasp at first; I was baffled too, until a diagram in a Meituan technical blog made it click. Make sure you understand it before moving on.

Let's take a look at one of the JDK 1.8 optimizations. Because the capacity always grows by a power of two, after an expansion each element is either at its original index or at the original index plus the old capacity. Here n is the length of the table. Figure (a) shows how the index is determined for key1 (hash 5) and key2 (hash 21) before expansion; figure (b) shows how it is determined after expansion, where hash1 is the result of the hash and high-bit operations for key1.

In figure (a), both key1 (5) and key2 (21) compute to index 5. After the expansion, the mask n - 1 has one extra high bit (shown in red), so the new index changes like this:

In figure (b), key1 (5) stays at index 5, while key2 (21) has moved to index 21. So when expanding, JDK 1.8 does not need to recompute the hash the way JDK 1.7 does; it only has to check whether the newly exposed hash bit is 0 or 1: if it is 0 the index stays the same, and if it is 1 the index becomes "old index + oldCap".
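The rule is easy to verify with the example hashes 5 and 21 (a standalone sketch, not JDK code):

    public class ResizeBitDemo {
        public static void main(String[] args) {
            int oldCap = 16, newCap = 32;
            for (int hash : new int[] {5, 21}) {
                int oldIndex = hash & (oldCap - 1);           // index before expansion
                int newIndex = hash & (newCap - 1);           // index after expansion
                boolean highBitSet = (hash & oldCap) != 0;    // the newly exposed bit
                System.out.printf("hash=%d old=%d new=%d highBit=%b%n",
                        hash, oldIndex, newIndex, highBitSet);
            }
            // Output:
            // hash=5  old=5 new=5  highBit=false -> stays at the old index
            // hash=21 old=5 new=21 highBit=true  -> moves to old index + oldCap (5 + 16)
        }
    }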

With that analysis in hand, let's come back to the source code.

    else { // preserve order
        // Two chains are built:
        // the "lo" chain keeps the nodes whose index stays the same,
        // the "hi" chain collects the nodes that move to index + oldCap
        Node<K,V> loHead = null, loTail = null;
        Node<K,V> hiHead = null, hiTail = null;
        Node<K,V> next;
        // Walk the old chain
        do {
            next = e.next;
            if ((e.hash & oldCap) == 0) {
                if (loTail == null)
                    loHead = e;
                else
                    loTail.next = e;
                loTail = e;
            }
            else {
                if (hiTail == null)
                    hiHead = e;
                else
                    hiTail.next = e;
                hiTail = e;
            }
        } while ((e = next) != null);
        // The "lo" chain stays at the same index as before the expansion
        if (loTail != null) {
            loTail.next = null;
            newTab[j] = loHead;
        }
        // The "hi" chain goes to the old index plus the old capacity
        if (hiTail != null) {
            hiTail.next = null;
            newTab[j + oldCap] = hiHead;
        }
    }

For clarity, the following table lists the keys and their hash values (with an array length of 16, they all land in pit 5):

Key         Hash
stone       5
galen       5
more        5
Demon Ji    21
The Fox     21
Female      21

Suppose the hash algorithm happens to produce a layout like this, and the expansion is triggered when the 13th element is stored:

So the flow looks like this (focusing only on the key-value pairs in pit 5). The first pass:

The second pass:

Skipping the passes in between, the sixth pass:

After the two chains have been built, one final transfer moves them into place and we're done.

    // The "lo" chain keeps its old index in the new table
    if (loTail != null) {
        loTail.next = null;
        newTab[j] = loHead;
    }
    // The "hi" chain goes to the old index plus the old capacity
    if (hiTail != null) {
        hiTail.next = null;
        newTab[j + oldCap] = hiHead;
    }
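Putting the pieces together, here is a small standalone simulation of that split for the pit-5 chain above; the Node class and the printing are stand-ins for illustration, not the JDK's types:

    public class LoHiSplitDemo {
        static class Node {
            String key; int hash; Node next;
            Node(String key, int hash) { this.key = key; this.hash = hash; }
        }

        public static void main(String[] args) {
            int oldCap = 16;
            // The chain sitting in pit 5 before the expansion (insertion order, Java 8 tail insertion)
            String[] keys   = {"stone", "galen", "more", "demon ji", "the fox", "female"};
            int[]    hashes = {5, 5, 5, 21, 21, 21};
            Node head = null, tail = null;
            for (int i = 0; i < keys.length; i++) {
                Node n = new Node(keys[i], hashes[i]);
                if (head == null) head = n; else tail.next = n;
                tail = n;
            }

            // Java 8 style split: "lo" stays at index 5, "hi" moves to 5 + oldCap = 21
            Node loHead = null, loTail = null, hiHead = null, hiTail = null;
            Node next;
            for (Node e = head; e != null; e = next) {
                next = e.next;
                if ((e.hash & oldCap) == 0) {
                    if (loTail == null) loHead = e; else loTail.next = e;
                    loTail = e;
                } else {
                    if (hiTail == null) hiHead = e; else hiTail.next = e;
                    hiTail = e;
                }
            }
            if (loTail != null) loTail.next = null;
            if (hiTail != null) hiTail.next = null;

            print("newTab[5]  -> ", loHead);   // stone  galen  more
            print("newTab[21] -> ", hiHead);   // demon ji  the fox  female
        }

        static void print(String label, Node head) {
            StringBuilder sb = new StringBuilder(label);
            for (Node e = head; e != null; e = e.next) sb.append(e.key).append("  ");
            System.out.println(sb);
        }
    }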

To summarize, here is the Java 8 put flowchart:

Comparison

1. When a hash collision occurs, Java 7 inserts the new node at the head of the linked list, while Java 8 inserts it at the tail.

2. When entries are transferred after an expansion, Java 7 reverses the order of each linked list, while Java 8 preserves it.

3. For performance comparisons, see the Meituan technical blog; the red-black trees introduced in Java 8 greatly improve HashMap's worst-case performance.

Acknowledgements

A very detailed write-up by a foreign developer

Meituan technology blog