Introduction

HashMap<K,V> and TreeMap<K,V> both map keys to values. The difference is that HashMap<K,V> is unordered while TreeMap<K,V> keeps its keys sorted, and accordingly the two differ greatly in their underlying data structures.

HashMap<K,V> stores its entries in an array of buckets; entries whose keys collide in the same bucket are chained in a linked list or, once the chain grows long enough, a red-black tree. The relation between HashSet<E> and HashMap<K,V> is similar to that between TreeSet<E> and TreeMap<K,V>: HashSet<E> is internally backed by a HashMap<K,V> and uses its key set, as the HashSet<E> constructors show. HashMap<K,V> is analyzed in detail below; HashSet<E> will not be covered separately.

public HashSet() {
    map = new HashMap<>();
}

public HashSet(int initialCapacity) {
    map = new HashMap<>(initialCapacity);
}

Class structure

HashMap<K,V> and TreeMap<K,V> have similar inheritance structures: both extend AbstractMap<K,V> and implement the Map<K,V> interface. Their functionality is therefore much the same; what differs is the underlying data structure used to implement it. Since HashMap<K,V> is unordered, it does not implement SortedMap<K,V> and correspondingly lacks the order-based lookup methods.

Hashing

Before we look at the implementation of HashMap<K,V>, let’s look at what a hash is. A hash is the fixed-length value produced by running an object or piece of data through a specified hash algorithm; that fixed-length value then serves as a fingerprint of the object or data. This can be abstract, so here is an example.

Suppose an article contains 10,000 words and we want to know whether the word “hello” appears among them. The most intuitive way is to traverse the array and compare each word with “hello”; in the worst case it may take 10,000 comparisons, and the number of comparisons grows without bound as the array grows. Is there a faster way? The answer is a hash table. Compute the hash of each of the 10,000 words with a given hash algorithm and map the hashes into an array of length 100. If the mapping is uniform enough, each position holds roughly 100 words. Now we only need to compute the index of the hash of “hello” in the array and then scan the roughly 100 words at that position. If the array is made large enough, say 10,000, and the hash algorithm is good enough that the mapping is one to one with every hash value different, then in theory a lookup takes a single comparison; in the worst case it takes as many comparisons as there are words at one array position. Either way this is much faster than traversing the whole array.
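A minimal sketch of that comparison-count argument, assuming a toy table of 100 buckets and Java’s built-in String.hashCode(); the word list and bucket count are illustrative only, not HashMap’s actual implementation:

import java.util.ArrayList;
import java.util.List;

public class BucketLookupDemo {
    public static void main(String[] args) {
        List<String> words = List.of("alpha", "bravo", "hello", "delta"); // imagine ~10,000 entries

        // Pre-bucket the words once: index = hash mapped into 100 slots.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < 100; i++) buckets.add(new ArrayList<>());
        for (String w : words)
            buckets.get(Math.floorMod(w.hashCode(), 100)).add(w);

        // Lookup only scans the one bucket "hello" hashes to (~1/100 of the words),
        // instead of comparing against every word in the list.
        List<String> bucket = buckets.get(Math.floorMod("hello".hashCode(), 100));
        System.out.println(bucket.contains("hello")); // true
    }
}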

Hashing can greatly improve the efficiency of finding a given element, but is limited by the quality of the hashing algorithm. A good hash algorithm can evenly distribute elements in a fixed-length array, and performance can be greatly affected if the algorithm is not good enough.

Is there an algorithm that can produce a unique hash for every possible element? No: with a fixed-length output and an unbounded set of possible inputs, the pigeonhole principle makes this impossible in general. Since elements cannot all be given unique hash values, several elements may map to the same value; this is called a “hash collision” (also translated as “hash conflict”).

Hash collisions

Consider a simple hash algorithm that hashes each word by its first letter: “air” and “airport” then produce the same hash value. Returning to the earlier example, when 10,000 words are stored in an array of length 100, even a good hash algorithm that distributes the words evenly leaves about 100 elements per hash value, that is, about 100 collisions per position. We can reduce the probability of collisions by enlarging the array, say from 100 to 10,000, so that ideally each element gets its own slot. However, if enough words need to be stored, no array is large enough, and memory or disk cannot grow without limit. Since no hash algorithm can guarantee that two different elements never share a hash value, collisions are unavoidable and must be handled.

There are two common ways to resolve hash collisions: separate chaining (the “zipper” method) and open addressing. As the name implies, chaining links all elements that collide at the same position into a linked list; each time a collision occurs, the new element is appended to the end of the list. When looking up an element, the list at the hashed position is retrieved and traversed to find the exactly equal element. Open addressing instead probes for the next empty slot when a collision occurs and stores the value there; as long as the hash array is large enough there will always be an empty slot, and when there is not, the array can be expanded. HashMap<K,V> uses the first approach, chaining.
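A minimal sketch of separate chaining, assuming a toy 26-bucket table indexed by a word’s first letter (so “air” and “airport” collide); the Node class here is illustrative and not HashMap’s actual implementation:

public class ChainedTable {
    // Each bucket holds the head of a singly linked list of colliding entries.
    static class Node {
        final String key;
        Node next;
        Node(String key) { this.key = key; }
    }

    private final Node[] buckets = new Node[26];

    private int indexFor(String key) {
        return key.charAt(0) - 'a'; // toy hash: first letter only
    }

    public void add(String key) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new Node(key);
            return;
        }
        Node p = buckets[i];
        while (p.next != null) p = p.next;   // walk to the tail of the chain
        p.next = new Node(key);              // append the colliding entry
    }

    public boolean contains(String key) {
        for (Node p = buckets[indexFor(key)]; p != null; p = p.next)
            if (p.key.equals(key)) return true;
        return false;
    }
}

Adding “air” and then “airport” with this toy hash places both in bucket 0, so the second insertion walks the chain and appends at the tail, which is exactly the zipper behavior described above.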

Constructors

There are several important fields in HashMap<K,V>. Node<K,V>[] table is the array that stores the entries, also called the hash bucket array. loadFactor is the load factor: when the number of stored elements reaches loadFactor times the length of the hash bucket array, the array is resized. For example, with a bucket array of length 16, the capacity is expanded once the number of stored entries reaches 16 * 0.75 = 12. The default load factor DEFAULT_LOAD_FACTOR is 0.75 and rarely needs to be changed.

public class HashMap<K,V> extends AbstractMap<K,V>
        implements Map<K,V>, Cloneable, Serializable {
    // Default load factor
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    // Hash bucket array
    transient Node<K,V>[] table;

    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
        this.loadFactor = loadFactor;
        // Threshold (before the first resize it temporarily holds the converted capacity)
        this.threshold = tableSizeFor(initialCapacity);
    }

    // Default load factor; threshold temporarily holds the converted capacity
    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }

    // Default load factor; threshold stays 0
    public HashMap() {
        this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
    }
}

In the constructors there is a tableSizeFor method that rounds the requested capacity up to an integer power of two, so whatever value is passed in, the hash bucket array length is always a power of two. For example, 13 returns 16 and 120 returns 128.

static final int tableSizeFor(int cap) {
    // Subtract 1 first so that an exact power of two (e.g. 8) is not doubled to 16
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    // Every bit below the highest set bit is now 1; n + 1 then yields the next power of two
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

Take an input whose binary form is 001x xxxx xxxx xxxx. Shifting it right by 1 bit gives 0001 xxxx xxxx xxxx, and ORing that with the original value gives 0011 xxxx xxxx xxxx: the highest bit and the bit below it are now 1. Shifting by 2 bits gives 0000 11xx xxxx xxxx, and ORing with 0011 xxxx xxxx xxxx gives 0011 11xx xxxx xxxx: the top four bits are now 1. Shifting by 4 bits gives 0000 0011 11xx xxxx, and ORing with 0011 11xx xxxx xxxx gives 0011 1111 11xx xxxx: the top eight bits are now 1. Shifting by 8 bits gives 0000 0000 0011 1111, and ORing with 0011 1111 11xx xxxx gives 0011 1111 1111 1111: every remaining bit below the highest set bit is now 1. The 16-bit shift works the same way for larger 32-bit values. The net effect is that all bits from the highest set bit downwards become 1, and n + 1 then produces 0100 0000 0000 0000, the next power of two. For example, an input of 13 rounds up to 16 and an input of 118 rounds up to 128 (traced in the sketch below). Note the n = cap - 1 at the start: the input is decremented first so that a value that is already a power of two is not doubled. Without the decrement, an input of 8 would produce 16. Readers can verify this themselves.
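A small sketch that traces the bit-smearing steps for the two inputs mentioned above, 13 and 118; the roundUp method simply replays the body of tableSizeFor (with the MAXIMUM_CAPACITY and negative-input checks omitted) so the intermediate values can be reasoned about:

public class TableSizeForDemo {
    // Mirrors HashMap.tableSizeFor for illustration only.
    static int roundUp(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n + 1;
    }

    public static void main(String[] args) {
        // 13 - 1 = 12 = 0b1100 -> smeared to 0b1111 = 15 -> 15 + 1 = 16
        System.out.println(roundUp(13));   // 16
        // 118 - 1 = 117 = 0b111_0101 -> smeared to 0b111_1111 = 127 -> 128
        System.out.println(roundUp(118));  // 128
        // Subtracting 1 first keeps exact powers of two unchanged
        System.out.println(roundUp(8));    // 8, not 16
    }
}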

Hash value

So why must the length of the hash bucket array be an integer power of two? That brings us to the hash calculation. The code HashMap<K,V> uses to compute an element’s hash value is as follows:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This code first obtains the key’s hashCode. If the element’s class overrides hashCode, that implementation is used; if not, Object.hashCode is used, which in many JVM implementations is derived from the object’s memory address. This is not guaranteed, it depends on the JVM, but most implementations have some relation to the memory address even when they do not use it directly.

After obtaining the key’s hashCode, the high 16 bits are XORed into the low 16 bits. The XOR mixes the characteristics of the high and low halves, which greatly increases the randomness of the low bits. When the index is then computed as tab[(n - 1) & hash], where n is the bucket array length, the AND with n - 1 keeps only the low bits of the mixed hash. Because n is a power of two, n - 1 is a mask of all 1s in its low bits, so (n - 1) & hash is equivalent to hash % n but cheaper, and every bucket index is reachable.
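A brief sketch of the index calculation, assuming a toy key and a bucket array length of 16; the spread method mimics HashMap’s hash(Object) purely for illustration:

public class IndexDemo {
    // Mimics HashMap.hash(Object): fold the high 16 bits into the low 16 bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;                       // bucket array length, always a power of two
        int hash = spread("hello");
        int index = (n - 1) & hash;       // same result as Math.floorMod(hash, n), but cheaper
        System.out.println(index == Math.floorMod(hash, n)); // true
    }
}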

Adding elements

HashMap<K,V> uses an array + a linked list + a red-black tree to store data.

When hash collisions occur while adding elements, chaining produces a linked list at the colliding position to store the colliding entries. If the chain at one position reaches 8 entries, the hash bucket array is either expanded or the linked list is converted into a red-black tree. In addition, the total number of entries is checked after every insertion, and the bucket array is expanded when it exceeds the threshold.

For those of you wondering about red-black trees, check out the previous two articles.

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // If the hash bucket array is empty, initialize it via resize()
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // The slot for this hash is empty, i.e. no collision: place the node directly
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        // A hash collision occurred
        Node<K,V> e; K k;
        // The first node has an equal hash and an equal key: remember it so its value can be overwritten
        if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // p is a red-black tree node: insert using red-black tree logic (see TreeMap)
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        // p is the head of a linked list
        else {
            for (int binCount = 0; ; ++binCount) {
                // Reached the end of the list without finding an equal key: append a new node
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // If the list length reaches 8, expand the bucket array or convert the list to a red-black tree
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                // An equal key already exists in the list: stop so its value can be overwritten
                if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // Overwrite the value of an existing mapping
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // Expand the capacity when the number of entries exceeds the threshold
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // If the hash bucket array is smaller than 64 (MIN_TREEIFY_CAPACITY), expand it instead of treeifying
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        TreeNode<K,V> hd = null, tl = null;
        // Convert every Node<K,V> in the bucket to a TreeNode<K,V>, keeping them linked in order
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        // Convert the TreeNode<K,V> list into a red-black tree
        if ((tab[index] = hd) != null)
            hd.treeify(tab);
    }
}

Capacity expansion

During insertion, expansion occurs in two cases. First, when a single bucket’s chain reaches 8 elements, the bucket array itself is checked: if its total capacity is less than 64, it is expanded rather than converting the chain to a tree. Second, the whole map is checked after every insertion, and the bucket array is expanded once the number of entries exceeds the threshold. The expansion source code is analyzed below:

final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    // The hash bucket array is already initialized: double its capacity
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Shift left by 1 bit to double the capacity
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY && oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    // The bucket array is uninitialized but a capacity was requested in the constructor
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    // The bucket array is uninitialized and no capacity was requested: use DEFAULT_INITIAL_CAPACITY
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // Compute the new threshold if it has not been set above
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                    (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    // Allocate the new hash bucket array
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // Move all elements into the new hash bucket array
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                // A single node with no collisions: recompute its slot and copy it
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                // The colliding structure is a red-black tree
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                // The colliding structure is a linked list
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        // The newly significant bit is 0: the node stays at the same index
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        // The newly significant bit is 1: the node moves to index j + oldCap
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    // Place the nodes whose index changed at the new position
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

There is a neat trick in this expansion. Every element has to be redistributed into the new hash bucket array, and because the capacity is always doubled, the new capacity is still an integer power of two and the index calculation uses exactly one more bit of the hash than before.

That extra bit decides everything. When the extra bit of an element’s hash is 0 ((e.hash & oldCap) == 0), its index in the new array is the same as in the old one, so the node does not need to move; when the extra bit is 1, the node simply moves to its old index plus the old capacity (newTab[j + oldCap] = hiHead). There is no need to fully recompute the position of every element, which reduces the work of rehashing. The same rule applies to both linked lists and red-black trees.
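A small sketch of this rule, assuming an old capacity of 16 doubling to 32; the hash values are chosen only to illustrate which bit decides whether a node stays or moves:

public class ResizeBitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;

        int hashA = 0b0_0101;            // bit 4 (the oldCap bit) is 0
        int hashB = 0b1_0101;            // bit 4 (the oldCap bit) is 1

        // Old indexes: both hashes land in bucket 5 of the 16-slot table
        System.out.println(hashA & (oldCap - 1)); // 5
        System.out.println(hashB & (oldCap - 1)); // 5

        // New indexes: A stays at 5, B moves to 5 + 16 = 21
        System.out.println(hashA & (newCap - 1)); // 5
        System.out.println(hashB & (newCap - 1)); // 21

        // The deciding bit is exactly (hash & oldCap)
        System.out.println((hashA & oldCap) == 0); // true  -> stays at j
        System.out.println((hashB & oldCap) == 0); // false -> moves to j + oldCap
    }
}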

Finding elements

Lookup in HashMap<K,V> is easy to understand: compute the bucket from the hash, then find the equal key in the linked list or red-black tree at that bucket.

Whether a node is the one we are looking for is determined first by comparing hash values, and then, if the hashes are equal, by comparing keys with == or equals.

public V get(Object key) {
    Node<K,V> e;
    // Compute the hash value of the key
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // The first node has an equal hash and an equal key: return it
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // There was a hash collision and the first node is not the one we want
        if ((e = first.next) != null) {
            // The colliding structure is a red-black tree
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            // The colliding structure is a linked list
            do {
                if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

Removing elements

When deleting an element, the element is first located and then unlinked according to the data structure it is stored in.

public V remove(Object key) {
    Node<K,V> e;
    // Compute the hash value of the key
    return (e = removeNode(hash(key), key, null, false, true)) == null ?
        null : e.value;
}

final Node<K,V> removeNode(int hash, Object key, Object value,
                            boolean matchValue, boolean movable) {
    Node<K,V>[] tab; Node<K,V> p; int n, index;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (p = tab[index = (n - 1) & hash]) != null) {
        Node<K,V> node = null, e; K k; V v;
        // The first node has an equal hash and an equal key
        if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
            node = p;
        // There was a hash collision and the first node is not the one we want
        else if ((e = p.next) != null) {
            // The colliding structure is a red-black tree
            if (p instanceof TreeNode)
                node = ((TreeNode<K,V>)p).getTreeNode(hash, key);
            // The colliding structure is a linked list
            else {
                do {
                    if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k)))) {
                        node = e;
                        break;
                    }
                    p = e;
                } while ((e = e.next) != null);
            }
        }
        if (node != null && (!matchValue || (v = node.value) == value ||
                             (value != null && value.equals(v)))) {
            // The node is a red-black tree node: delete it with red-black tree logic
            if (node instanceof TreeNode)
                ((TreeNode<K,V>)node).removeTreeNode(this, tab, movable);
            // The node is the first node in the bucket
            else if (node == p)
                tab[index] = node.next;
            // The node is a subsequent node in the list
            else
                p.next = node.next;
            ++modCount;
            --size;
            afterNodeRemoval(node);
            return node;
        }
    }
    return null;
}

The HashMap key

Having walked through the implementation of HashMap, we can see that in most cases the biggest influence on its performance is still the hash algorithm. Although insertion, deletion, and lookup in a HashMap are O(1) in theory, in practice many factors come into play; a HashMap with theoretically O(1) operations can sometimes perform worse than a TreeMap with O(log n) operations, precisely because of the hash algorithm.

If an object relies on the default hash algorithm then, as noted earlier, most JVM implementations derive the hash from something related to the memory address. And in the effort to reduce collisions, a hash function can become so complex that computing it is itself a bottleneck. In practice it is therefore advisable to use a simple, cheap-to-hash type as the HashMap key, such as Integer or String, which greatly shortens the hashing time. If you implement your own hash algorithm, test its efficiency before relying on it, to limit its impact on HashMap performance.
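As an illustration (not from the original article), a minimal sketch of a custom key type whose hashCode and equals are cheap, consistent, and based on immutable fields, which is the usual guidance for HashMap keys; the PointKey class and its fields are hypothetical:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Immutable key: hashCode/equals depend only on final fields, so the
// bucket index can never change after the key has been inserted into the map.
final class PointKey {
    private final int x;
    private final int y;

    PointKey(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointKey)) return false;
        PointKey p = (PointKey) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // cheap to compute, reasonably well distributed
    }

    public static void main(String[] args) {
        Map<PointKey, String> names = new HashMap<>();
        names.put(new PointKey(1, 2), "origin-ish");
        // Consistent equals/hashCode let a structurally equal key find the same bucket and entry
        System.out.println(names.get(new PointKey(1, 2))); // origin-ish
    }
}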

Conclusion

This concludes the Java collections series, which has gone from the collection framework to several commonly used collection classes. Of course, much is left uncovered, such as Queue, Stack, LinkedHashMap, and so on. Although this is a summary of my own Java learning process, I hope the series also helps you understand the Java collections. If anything in the articles is wrong, unclear, or could be improved, please do not hesitate to point it out. Next, I plan to write a systematic series on the contents of the java.util.concurrent package; please leave a comment if you have any suggestions.