Principle and implementation of HashMap

Revision:

JDK 1.7: Table array + Entry linked list. JDK 1.8: Table array + Node linked list / red-black tree. (Why red-black trees?)

Common interview questions on how HashMap works:

  • Have you read the HashMap source code? Do you understand the underlying principles?
  • Why array + linked list?
  • Is it OK to use a LinkedList instead of the array?
  • If so, why does the source use an array instead?

Introduction of important variables:

PS: it is best to memorize and understand all of these important variables.

  • DEFAULT_INITIAL_CAPACITY: initial Table array length, 1 << 4 = 2^4 = 16. (Why a power of 2?)
  • MAXIMUM_CAPACITY: maximum size of the Table array, 1 << 30 = 2^30 = 1073741824.
  • DEFAULT_LOAD_FACTOR: load factor, default 0.75. When the total number of elements > current array length * load factor, the array is expanded to double its size. (Why double?)
  • TREEIFY_THRESHOLD: treeify threshold, default 8. When the number of nodes under one Table slot exceeds 8, the linked list is converted to a red-black tree.
  • UNTREEIFY_THRESHOLD: untreeify threshold, default 6. During capacity expansion, if a red-black tree under a single slot shrinks below 6 nodes, it is converted back to a linked list.
  • MIN_TREEIFY_CAPACITY = 64: minimum table capacity for treeification. Buckets are only treeified once the Table length reaches this value (to avoid paying for tree conversion while the early, small table is still expanding frequently).
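To see how the power-of-two constraint on capacity is enforced, here is a small runnable sketch of the rounding HashMap applies when you pass an arbitrary initial capacity (it mirrors the JDK 8 tableSizeFor method; the demo class and main are our own):

```java
public class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Bit-smearing: after the five shifts, every bit below the highest set
    // bit of (cap - 1) is 1, so adding 1 yields the next power of two.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16
        System.out.println(tableSizeFor(17)); // 32
    }
}
```

So new HashMap<>(10) actually allocates a table of length 16 on first use.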

Implementation principle:

As we all know, HashMap stores its data using an array combined with linked lists.

A HashMap uses an Entry array to store key-value pairs. Each key-value pair forms an Entry entity, and the Entry class is in fact a singly linked list node: its next pointer links it to the next Entry. In JDK 1.8, if a list grows longer than 8 nodes, it is converted into a red-black tree.

To understand why linked lists are used, we need to know how hash collisions arise:

A: Because the array's length is limited, after we hash a key to a subscript, it is inevitable that two different keys will sometimes map to the same slot. When that happens, we chain the colliding entries in a linked list at that slot. Could a LinkedList be used instead of the array? In the source code the table is declared like this:

Entry[] table = new Entry[capacity]; // Entry is a linked-list node

If we substituted a LinkedList, it would look like this:

List<Entry> table = new LinkedList<Entry>();

Does it work? The answer, of course, is yes. The third question is: if a list could be substituted, why use an array at all? Because arrays are the most efficient! In a HashMap, a node's bucket position is computed from the hash of its key modulo the array length, so once we have the index we can jump straight to the node. An array's random access is obviously faster than a LinkedList's traversal. Then what about ArrayList, which is also backed by an array and also offers fast lookup: why not use it? Because by using a raw array directly, HashMap can define its own growth mechanism. It always resizes to the next power of 2, which makes the index calculation (a bitwise AND) highly efficient, whereas ArrayList grows by a factor of 1.5. (Why ArrayList grows by 1.5 is not covered in this article.)
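To make the index arithmetic above concrete, here is a small sketch (class name is ours) showing that for a power-of-two length n, the bitwise AND (n - 1) & h selects the same bucket as h % n for non-negative hashes, without the cost of a division:

```java
public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // table length, a power of two
        int[] hashes = {0, 7, 15, 16, 31, 12345, 0x7fffffff};
        for (int h : hashes) {
            int byMask = (n - 1) & h; // what HashMap does
            int byMod  = h % n;       // the naive modulo
            System.out.println(h + " -> " + byMask + " == " + byMod);
        }
    }
}
```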

Hash conflicts: computing the subscript

We now know that HashMap combines an array with linked lists. The problem is that arrays need subscripts, yet we normally use a HashMap like this:

HashMap<Integer, String> hashMap = new HashMap<>();
hashMap.put(2, "dd");

That works because HashMap calls hashCode() on the stored key, which produces a value far too large to use directly as an array subscript. The obvious fix is to take the remainder, for example:

key.hashCode() % table.length;

For any key this yields a value in 0 .. table.length - 1. But the HashMap source does not do that; instead it bitwise-ANDs the hash with (table length - 1):

  if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);



And this is why the length of the Table array must always be 2^n: only then is (length - 1) a value whose binary form is all 1s. As an example, suppose the array length were 15; length - 1 = 14, which in binary is 1110. The last bit is always 0, so after the AND no element can ever land in an odd-numbered slot, and part of the Table array can never be used. That violates the goal of using the Table array fully and evenly: for a HashMap to be efficient there should be as few collisions as possible, which means distributing data uniformly so that every list has roughly the same length. One more thing to note: with a table of length 16, the AND keeps only the last four bits of the hash and ignores all the higher bits, so the resulting subscripts may not be well hashed. Solution: XOR the high 16 bits of the generated hashCode into the low 16 bits, then use the perturbed value to compute the subscript.

static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

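A quick illustrative sketch (our own demo; only the hash expression comes from the JDK) of why the perturbation matters: two hash codes that differ only in their high 16 bits would otherwise always collide in a small table:

```java
public class PerturbDemo {
    static int hash(int h) { return h ^ (h >>> 16); } // same spreading as above

    public static void main(String[] args) {
        int n = 16;
        int h1 = 0x00010000, h2 = 0x00020000; // differ only in the high bits
        // Without perturbation both land in bucket 0, a guaranteed collision:
        System.out.println(((n - 1) & h1) + " " + ((n - 1) & h2)); // 0 0
        // With h ^ (h >>> 16) the high bits reach the index:
        System.out.println(((n - 1) & hash(h1)) + " " + ((n - 1) & hash(h2))); // 1 2
    }
}
```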

HashMap get/put

  • Do you know how a HashMap puts an element?
  • Do you know how the get process works?
  • What other hash algorithms do you know?
  • Talk through the implementation of hashCode in String.

The put method

1. Hash the key's hashCode() to compute the index.
2. If there is no collision, put the entry directly into the bucket.
3. If there is a collision, append the entry to the linked list behind the bucket.
4. If the collision makes the list too long (greater than or equal to TREEIFY_THRESHOLD), convert the list to a red-black tree (new in JDK 1.8).
5. If the key already exists, replace the old value (ensuring the uniqueness of the key).
6. If the map is too full (size exceeds load factor * current capacity), resize.

After we get the subscript value, we can start to put the value into the array + linked list. There are three possible cases:

  1. The array slot is empty.
  2. The array slot is not empty and holds a linked list.
  3. The array slot is not empty and holds a red-black tree.

Overall, putting a key and value goes through these steps:

  • Locate the corresponding Table slot from the key's hash;
  • Iterate over the nodes in that slot, updating or adding;
  • Check whether expansion is needed.
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // Lazy initialization: the table array is only created on the first put.
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // Locate the bucket from the hash; if it is empty, place a new Node there.
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // The first node in the bucket matches the key: remember it so the
        // old value can be replaced below.
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // The bucket holds a red-black tree: delegate to putTreeVal.
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // Otherwise it is a linked list: traverse it looking for the key.
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    // Key not present in this bucket: append a new node.
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash); // list too long: treeify
                    break;
                }
                // Hit: hash, identity and equals together decide equality
                // (different elements may share the same hash).
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key: update the value
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

This is the put operation of HashMap. We will not discuss red-black tree insertion or the conversion of a Node list into a red-black tree here.

The resize method

The expansion mechanism of HashMap is to take every entry out of the old table array and re-hash it into a new table. The Javadoc comment says "Initializes or doubles table size": resize handles both the initialization of the array and its doubling. Let's analyze it step by step.

 /**
     * Initializes or doubles table size.  If null, allocates in
     * accord with initial capacity target held in field threshold.
     * Otherwise, because we are using power-of-two expansion, the
     * elements from each bin must either stay at same index, or move
     * with a power of two offset in the new table.
     *
     * @return the table
     */
    final Node<K,V>[] resize() {
        // Keep a reference to the old table under another name.
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        // The previous array already exists.
        if (oldCap > 0) {
            // Already at the maximum size: raise the threshold, stop resizing.
            if (oldCap >= MAXIMUM_CAPACITY) {
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // Otherwise double both the capacity and the threshold.
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            newCap = oldThr;
        else {
            // Zero initial threshold: use the defaults, so the threshold
            // becomes default array length * load factor.
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        // If the new threshold is still 0, recompute capacity * load factor.
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        // Create the new hash table and move everything over.
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {
            // Iterate over the old table, moving each bucket to the new one.
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    if (e.next == null) // a single node: rehash it directly
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode) // tree bucket: split it
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order: split the list into two halves
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do { // walk the list, assigning each node to a half
                            next = e.next;
                            if ((e.hash & oldCap) == 0) { // keeps the old index
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else { // moves to old index + oldCap
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) { // bucket keeping the old index
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) { // bucket at old index + oldCap
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }
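The "stay at the same index, or move with a power of two offset" behavior from the Javadoc can be checked directly. This small sketch (class name is ours) shows that after doubling, each element's new index is either its old index j or j + oldCap, decided by the single bit (hash & oldCap):

```java
public class SplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        int[] hashes = {5, 21, 37, 53}; // all land in bucket 5 of the old table
        for (int h : hashes) {
            int oldIdx = h & (oldCap - 1);
            int newIdx = h & (newCap - 1);
            String where = (h & oldCap) == 0 ? "stays" : "moves +oldCap";
            System.out.println(h + ": " + oldIdx + " -> " + newIdx + " (" + where + ")");
        }
    }
}
```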

The get method

1. Hash the key's hashCode() to compute the index.
2. If the first node in the bucket is a direct hit, return its value.
3. If there is a conflict, search for the matching entry via key.equals(k):
4. If the bucket is a red-black tree, the lookup is O(log n);
5. If it is a linked list, the lookup is O(n).

When get is called, the key we pass in first goes through the same perturbed hash function used by put, and the result is handed to getNode:

    final Node<K,V> getNode(int hash, Object key) {
        Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
        // Check that the table exists, has a positive length, and that the
        // bucket for this key actually holds a node.
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (first = tab[(n - 1) & hash]) != null) {
            if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
                return first; // hit on the first node: no loop needed
            if ((e = first.next) != null) {
                if (first instanceof TreeNode) // tree bucket: tree lookup
                    return ((TreeNode<K,V>)first).getTreeNode(hash, key);
                do { // linked list: loop until a hit or the end of the list
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        return e;
                } while ((e = e.next) != null);
            }
        }
        return null;
    }

The containsKey method

Based on the result of getNode, it determines whether the map contains the key:

    public boolean containsKey(Object key) {
        return getNode(hash(key), key) != null;
    }

What other hash algorithms do we know?

First, what does a hash algorithm do? A hash function maps a large range of values into a small one, usually in order to save space and make data easy to store and look up.

Examples include MurmurHash, MD4, MD5, etc.
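As an illustration, here is a sketch of 32-bit FNV-1a, one of the simpler general-purpose hash functions (the constants are the standard FNV offset basis and prime; the demo class is ours):

```java
public class Fnv1aDemo {
    static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;      // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xff);     // mix in one byte
            h *= 0x01000193;     // multiply by the FNV prime (overflow is fine)
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(fnv1a("a".getBytes()))); // e40c292c
    }
}
```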

Implementation of HashCode in String

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }


The way the String class computes hashCode is relatively simple: it accumulates the char value of each character with a weight of 31, relying on natural int overflow as an implicit modulus. The formula is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. Why 31, a prime? Because 31 is an odd prime and 31 * i = 32 * i - i = (i << 5) - i, a shift-and-subtract combination that is much faster than a general multiplication.
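The two claims above, the strength-reduced multiplication and the polynomial formula, are easy to verify (the demo class is ours):

```java
public class StringHashDemo {
    public static void main(String[] args) {
        int i = 12345;
        System.out.println(31 * i == (i << 5) - i); // true: shift and subtract

        // The formula s[0]*31^(n-1) + ... + s[n-1], computed by accumulation:
        int h = 0;
        for (char c : "ab".toCharArray())
            h = 31 * h + c;
        System.out.println(h == 'a' * 31 + 'b');  // true
        System.out.println(h == "ab".hashCode()); // true: matches String
    }
}
```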

Why does HashMap convert a linked list to a red-black tree once it exceeds 8 elements?

  • Do you know what changed in HashMap in JDK 1.8?
  • Why is HashMap not thread-safe?
  • Why not use a red-black tree from the start instead of a linked list that later becomes one?
  • Once a linked list becomes a red-black tree, when does it degenerate back into a list?

1. Array + linked list + red-black tree instead of array + linked list.
2. Optimized hash algorithm: h ^ (h >>> 16).
3. After expansion, each element either stays at the same index or moves by a power of 2 (old index + oldCap), and list order is preserved.

Note: the last item matters most, because it is precisely this change that eliminates HashMap's infinite-loop bug in JDK 1.8.

Why HashMap is not thread-safe

HashMap in JDK 1.7 uses an array plus linked lists, and inserts into a list at its head. Note: head insertion was chosen because a new entry can be added without traversing the whole list first. The drawback is that during expansion each list ends up in reverse order in the new array, which is exactly what makes multi-threaded use unsafe. In JDK 1.8, because the treeify threshold requires counting nodes anyway, every insertion already traverses the list, so elements are inserted at the tail instead, and expansion no longer reverses list order.

In multi-threaded code, use the thread-safe ConcurrentHashMap instead. Hashtable is also safe, but its efficiency is too low.

The thread-unsafety in detail (JDK 1.7 transfer)

void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) { // A
        while (null != e) {
            Entry<K,V> next = e.next;
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i]; // head insertion: this reverses the list
            newTable[i] = e;
            e = next;
        }
    }
}

In JDK 1.7, if two threads, say thread 1 and thread 2, both enter transfer at the same time, the following situation arises (the original post illustrates it with diagrams): both threads see the same bucket, a list of node 1 followed by node 2.

Both threads start with their own temporary variables e and next pointing at node 1 and node 2; call thread 1's copies e1/next1 and thread 2's copies e2/next2. Suppose thread 1 runs the whole function first: the array is doubled and the list ends up reversed. Thread 1's e1 and next1 are gone at this point, and the links between the nodes have been changed, but thread 2 still holds its stale e2 pointing at node 1 and next2 pointing at node 2.

Once thread 1 has finished expanding, thread 2 resumes, creates its own new array, and executes the loop body again with its stale pointers:

e.next = newTable[i]; newTable[i] = e; e = next;

e.next = newTable[i]; means: the next of node 2 now points at newTable[i], and this is where the problem appears. After thread 2's first iteration, newTable[i] is node 1, so when node 2 is processed its next is made to point at node 1, which is the opposite direction of the link thread 1 just built. Nothing seems wrong yet when this iteration completes.

e.next = newTable[i]; newTable[i] = e; e = next;

At this point the third iteration begins: next = e.next; is followed by another head insertion, newTable[i] = e;. Now node 1's next points to node 2 while node 2's next still points to node 1: a circular linked list. If left unchecked, any later get that lands on this bucket loops forever and drains the CPU.

The third question: why not just use a red-black tree from the start, isn't it more efficient? Because a red-black tree must perform left rotations, right rotations and re-colorings to stay balanced, while a singly linked list does not. With fewer than 8 elements, a linked list already gives adequate query performance; beyond 8, a red-black tree speeds up lookups, at the cost of slower insertion of new nodes. Starting with a red-black tree when there are only a few elements would therefore waste performance. The fourth question: when does it degenerate back into a list? At 6. The gap of 2 between the thresholds 8 and 6 prevents frequent conversions between list and tree. If both conversions happened at 8, a HashMap that keeps inserting and deleting elements with a bucket hovering around 8 nodes would constantly flip between tree and list, and efficiency would collapse.

Concurrency questions about HashMap

  • What problems does HashMap have in a concurrent environment?
  • What is the usual solution?

The problems that arise

(1) Infinite loop caused by multi-threaded expansion (JDK 1.7).
(2) Elements may be lost during multi-threaded put.
(3) A non-null element put by one thread may be read back as null by another.

Solutions to the thread unsafety

  1. We used to use Hashtable, which adds synchronized in front of every method, but it is inefficient and rarely used any more.
  2. Use ConcurrentHashMap, which locks at a much finer granularity (segment locks in JDK 1.7, per-bucket CAS/synchronized in JDK 1.8) to improve efficiency.
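A minimal sketch of point 2 (the demo class is ours): ConcurrentHashMap's merge performs the read-modify-write atomically, so two threads incrementing the same counter lose no updates, something a plain HashMap cannot guarantee:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++)
                counts.merge("hits", 1, Integer::sum); // atomic increment
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(counts.get("hits")); // 20000, no lost updates
    }
}
```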

What do you usually use as a HashMap key

  • Can the key be null? Can the value be null?
  • What is commonly used as a key?
  • What's wrong with using a mutable class as a HashMap key?
  • How do you implement a custom class as a HashMap key?

Can the key be null? Can the value be null?

Both are allowed, but there can be at most one null key, while any number of entries may have a null value.
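A minimal sketch demonstrating both rules (the class name is ours); HashMap treats a null key specially, giving it hash 0 so it always lives in bucket 0:

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();
        map.put(null, "first");
        map.put(null, "second");  // the single null key is overwritten
        map.put("a", null);
        map.put("b", null);       // any number of null values is allowed
        System.out.println(map.get(null)); // second
        System.out.println(map.size());    // 3
    }
}
```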

What is commonly used as a key?

Immutable classes such as Integer and String are commonly used as HashMap keys, String most of all. (1) Because String is immutable, its hashCode is cached when first computed and never needs to be recalculated. This makes strings ideal map keys, faster to process than most other key objects, which is why keys in a HashMap tend to be strings. (2) Because lookup uses both equals() and hashCode(), it is essential that the key class overrides them correctly, and these classes already do so rigorously.

What's wrong with using a mutable class as a HashMap key?

Its hashCode may change, so a value that was put can no longer be got, as the following code shows:

 HashMap<List<String>,Object> map=new HashMap<>();
        List<String> list=new ArrayList<>();
        list.add("hello");
        Object object=new Object();
        map.put(list,object);
        System.out.println(map.get(list));
        list.add("hello world");
        System.out.println(map.get(list));


The output value is as follows:

java.lang.Object@1b6d3586
null


How to implement a custom class as the key of a Hashmap

This question examines the following two points:

  • What must be observed when overriding hashCode and equals methods?
  • How do you design an immutable class?

For problem 1, remember these rules: (1) if two objects are equal, their hashCodes must be equal; (2) if two objects are not equal, their hashCodes may still be equal; (3) equal hashCodes do not imply that two objects are equal; (4) unequal hashCodes do imply that two objects are not equal.

For problem 2, remember how to write an immutable class: (1) add the final modifier so the class cannot be inherited; if it could, a subclass could override methods and mutate state, and immutability would be broken whenever the subclass is used as its parent type. (2) Make all member variables private and final so they cannot be reassigned; this alone is not enough, because a referenced object could still be mutated externally, which point (4) addresses. (3) Provide no methods that change member variables, including setters, so state cannot be altered through any interface. (4) In getters, do not return a mutable member object directly; clone it and return the copy instead, so callers cannot mutate internal state through the leaked reference.
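Putting the two lists together, here is a sketch of a custom class usable as a HashMap key (the Point class is our own example): final class, private final fields, no setters, and equals/hashCode overridden as a pair:

```java
import java.util.HashMap;
import java.util.Objects;

public final class Point {               // final: cannot be subclassed
    private final int x;                 // private final: cannot be reassigned
    private final int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    public int getX() { return x; }      // primitives need no defensive copy
    public int getY() { return y; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override public int hashCode() { return Objects.hash(x, y); }

    public static void main(String[] args) {
        HashMap<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "a");
        // A logically equal instance retrieves the same entry:
        System.out.println(map.get(new Point(1, 2))); // a
    }
}
```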

Afterword

  1. For a HashMap, expansion is a particularly expensive operation. So when the size of a map can be estimated, programmers should pass a rough capacity at initialization to avoid frequent resizing.
  2. The load factor can be modified and can even be greater than 1, but changing it is not recommended unless the situation is very special.
  3. HashMap is not thread-safe. Do not operate on a HashMap concurrently; ConcurrentHashMap is recommended instead.
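Point 1 in practice (the sizing arithmetic follows the standard expected / loadFactor rule; the demo class is ours): to hold n elements without any resize, the initial capacity must satisfy capacity * 0.75 > n:

```java
import java.util.HashMap;

public class PresizeDemo {
    public static void main(String[] args) {
        int expected = 1000;
        // capacity * 0.75 must exceed expected, so pass expected / 0.75 (+1);
        // HashMap then rounds that up to the next power of two (2048 here).
        HashMap<Integer, Integer> map = new HashMap<>((int) (expected / 0.75f) + 1);
        for (int i = 0; i < expected; i++)
            map.put(i, i);
        System.out.println(map.size()); // 1000
    }
}
```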