This article is participating in “Java Theme Month – Java Development Practice”, for details: juejin.cn/post/696826…
This is the second day of my participation in Gwen Challenge
Continuing from the previous chapter: Basic Principles Tour – HashMap Source Code Analysis (JDK 1.8)
Daily sentence
Work hard for what you can hope to win, and don't dwell on what you can't; win or lose, you'll come out of it looking good.
Review the concept
- HashMap is built from an array plus linked lists. The array is the body of the HashMap; the linked lists exist mainly to resolve hash conflicts.
- If the target array slot holds a linked list, an add is still O(1), because the newest Entry is inserted at the head of the list with a simple change of references. A lookup, however, must traverse the list and compare candidates through the key's equals method.
- So, for performance, the fewer and shorter the linked lists in a HashMap, the better.
HashMap's representation in different JDK versions
This chapter covers the JDK 7 implementation.
JDK7
The data structure of HashMap is array + linked list
JDK8
The data structure of HashMap is array + linked list + red-black tree
Basic introduction
A hash table, also called a hash map, is a very important data structure with a wide range of applications. The core of many caching technologies (such as memcached) is a large hash table maintained in memory, and the implementation of HashMap rests on the same principle.
Analysis of data structures
This section compares common data structures for storage and lookup to show why hash tables are used.
Array
An array stores data in a contiguous block of storage units.
Query Scenarios
- Lookup by index: O(1).
- Lookup by value: the array must be traversed, comparing each element against the target one by one, so O(n).
- For a sorted array, binary search, interpolation search, or Fibonacci search can improve lookup to O(log n).
Insert/Delete scenario
- For ordinary insert and delete operations, which involve moving array elements, the average complexity is also O(n).
The corresponding collection class is ArrayList.
Linked list
Query Scenarios
- The search operation needs to traverse the linked list for comparison one by one, and the complexity is O(n).
Insert/Delete scenario
- The operation of adding and deleting the linked list (after finding the specified operation location) only needs to deal with the reference between nodes, and the time complexity is O(1).
The corresponding collection class is LinkedList.
Binary tree
- For a reasonably balanced, ordered binary search tree, the average complexity of insertion, lookup, and deletion is O(log n).
The corresponding collection classes are TreeSet and TreeMap.
Hash table
Compared with the structures above, a hash table offers higher performance for insertion, deletion, and lookup: ignoring hash conflicts, a single address computation locates the element, so the time complexity is O(1).
The corresponding collection class is HashMap.
The backbone of a hash table is an array. To add or find an element, we map the key of the current element to a location in the array using a function.
That is: Store location = hash(keyword)
Here the function hash is called the hash function; its design directly determines the quality of the hash table, and it is where hash conflicts come in.
- The design of the hash function is very important, and a good hash function ensures that the computation is as simple as possible and the hash address is evenly distributed.
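The "store location = hash(keyword)" idea can be sketched with a toy example. The keys and table length here are illustrative, and plain modulo is used instead of HashMap's bitwise indexFor:

```java
public class SimpleHashDemo {
    // Toy hash function: map a key's hashCode into [0, tableLength)
    // (ignores the Integer.MIN_VALUE edge case of Math.abs)
    static int bucketFor(String key, int tableLength) {
        return Math.abs(key.hashCode()) % tableLength;
    }

    public static void main(String[] args) {
        int tableLength = 16;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> bucket " + bucketFor(key, tableLength));
        }
    }
}
```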
Hash conflict mechanism
When we hash an element, obtain a storage address, and find on insertion that the address is already occupied by another element, that is called a hash conflict, also known as a hash collision.
- An array is a contiguous, fixed-length block of memory, and even a good hash function cannot guarantee that the computed storage addresses never collide. So how are hash conflicts resolved?
- There are several solutions: open addressing (on a conflict, keep probing for the next unoccupied address), rehashing with a second hash function, and separate chaining.
HashMap uses separate chaining, that is, array + linked list.
Source code implementation of HashMap
Storage structure
Hash bucket
- When a HashMap is instantiated, the system creates an Entry array whose length is called the capacity. The slots of this array that can hold elements are called buckets; each bucket has its own index, so the system can quickly locate a bucket's elements by index.
- Each bucket stores one Entry object, but every Entry object can hold a reference to the next one, making it possible to form a chain of Entries inside a bucket.
Entry
The basic building blocks of a HashMap. Each Entry contains a key-value pair. Entry is a static inner class in a HashMap. The code is as follows:
```java
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    // Reference to the next Entry: a singly linked list structure
    Entry<K,V> next;
    // The hash of the key is cached in the Entry to avoid recomputation
    int hash;

    /** Creates a new entry. */
    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }
    // ... getKey(), getValue(), equals(), hashCode() omitted
}
```
After the above analysis, the storage structure diagram of HashMap is as follows:
- Each element of the 16-slot array stores the head of a linked list. So by what rule are elements placed into the array?
- Usually by hash(key) % len: the hash of the element's key modulo the array length. For example, in the hash table above, 12 % 16 = 12, 28 % 16 = 12, 108 % 16 = 12, and 140 % 16 = 12, so 12, 28, 108, and 140 are all stored at index 12.
- When a key/value pair is stored, it is actually wrapped in an Entry object E, and the program computes that Entry's storage location from the key.
- The key-to-value mapping is thus realized through the chain key → Entry → value: wherever the key is stored, its value is stored with it.
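The bucket-12 example above can be verified directly:

```java
public class BucketDemo {
    static int bucket(int h, int len) {
        return h % len;
    }

    public static void main(String[] args) {
        int len = 16;
        // The hash values from the diagram: all land in bucket 12
        for (int h : new int[] {12, 28, 108, 140}) {
            System.out.println(h + " % " + len + " = " + bucket(h, len));
        }
    }
}
```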
Important attributes
Let’s start with a few important attributes in the HashMap:
```java
// Default initial capacity: 16
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;
// Maximum capacity: 2^30
static final int MAXIMUM_CAPACITY = 1 << 30;
// Default load factor
static final float DEFAULT_LOAD_FACTOR = 0.75f;
// Shared empty table: the not-yet-initialized state of the internal array
static final Entry<?,?>[] EMPTY_TABLE = {};
// The underlying storage; starts out pointing at the empty table
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;
// Number of stored key-value pairs
transient int size;
// Resize threshold. While table == EMPTY_TABLE this holds the initial
// capacity (default 16); once the table is allocated it becomes
// capacity * loadFactor, and resizing consults it
int threshold;
// Load factor: how full the table may get. Default is 0.75
final float loadFactor;
// Modification counter for fail-fast behavior: HashMap is not
// thread-safe, so if another thread structurally modifies the map
// (put, remove, etc.) while it is being iterated, a
// ConcurrentModificationException must be thrown
transient int modCount;
// Default alternative-hashing threshold
static final int ALTERNATIVE_HASHING_THRESHOLD_DEFAULT = Integer.MAX_VALUE;
// Seed used when computing hash values
transient int hashSeed = 0;
```
Constructors
HashMap has four constructors. If the caller does not pass initialCapacity or loadFactor, the defaults are used: initialCapacity defaults to 16 and loadFactor defaults to 0.75.
```java
// Construct a HashMap from an initial capacity and load factor
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0) // Validate parameters
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY) // Clamp to the maximum
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor)) // Validate parameters
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);

    this.loadFactor = loadFactor;
    threshold = initialCapacity;
    // init() is a no-op in HashMap itself; subclasses such as
    // LinkedHashMap provide the actual implementation
    init();
}

// Construct a HashMap with the given capacity and the default load factor
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

// Construct a HashMap with capacity 16 and load factor 0.75
public HashMap() {
    this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
}

// Construct a HashMap from another Map. The capacity is computed from
// the other map's size; the load factor is 0.75
public HashMap(Map<? extends K, ? extends V> m) {
    this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1, DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
    inflateTable(threshold); // Initialize the underlying array
    putAllForCreate(m);      // Copy the entries from m
}
```
As the code shows, the ordinary constructors do not allocate the table array immediately (only the constructor that takes a Map as a parameter does); the array is actually built on the first put operation.
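The four constructors map onto the public API like this (a quick sketch):

```java
import java.util.HashMap;
import java.util.Map;

public class CtorDemo {
    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();         // capacity 16, load factor 0.75
        Map<String, Integer> b = new HashMap<>(32);       // capacity 32, load factor 0.75
        Map<String, Integer> c = new HashMap<>(32, 0.5f); // both explicit
        a.put("x", 1);
        Map<String, Integer> d = new HashMap<>(a);        // seeded from another map
        System.out.println(d.get("x")); // the copy holds a's entries
    }
}
```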
The put operation
- If two keys produce the same index from hash % Entry[].length, HashMap resolves the clash with a chained data structure.
- As mentioned above, the Entry class has a next field pointing to the next Entry. Suppose key/value pair A is inserted first and the hash of its key yields index = 0, so Entry[0] = A.
- Some time later, pair B is added and its index is also 0. HashMap then does: B.next = A; Entry[0] = B. If C is added with index 0 as well: C.next = B; Entry[0] = C.
- So index 0 now reaches the key/value pairs C, B, and A, linked together through the next field.
- In other words, each array slot stores the most recently inserted element. With that, the rough implementation of HashMap should be clear.
```java
public V put(K key, V value) {
    // If the table is still the empty array {}, allocate it now, using
    // threshold (initialCapacity, default 1 << 4 = 16) as the size hint
    if (table == EMPTY_TABLE) {
        inflateTable(threshold); // Allocate the array
    }
    // A null key is stored at table[0] (or on table[0]'s conflict chain)
    if (key == null)
        return putForNullKey(value);
    // Further mix the key's hashCode to spread entries more evenly
    int hash = hash(key);
    // Compute the actual position in the table
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        // If the key already exists, overwrite: replace the old value
        // with the new one and return the old value
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this); // Access callback; a no-op in HashMap
            return oldValue;
        }
    }
    // Fail fast if the map is structurally modified during concurrent access
    modCount++;
    addEntry(hash, key, value, i); // Add a new entry
    return null;
}
```
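The head-insertion behavior described above (B.next = A; Entry[0] = B) can be sketched with a minimal stand-in for HashMap's Entry. The class and method names here are made up for illustration:

```java
public class HeadInsertDemo {
    // Hypothetical minimal node: just a key and a next pointer
    static class Node {
        final String key;
        final Node next;
        Node(String key, Node next) { this.key = key; this.next = next; }
    }

    // Inserts A, then B, then C at the head of one bucket's chain and
    // returns the traversal order starting from the head
    static String insertionDemo() {
        Node head = new Node("A", null);  // bucket head is A
        head = new Node("B", head);       // B.next = A; head is now B
        head = new Node("C", head);       // C.next = B; head is now C
        StringBuilder order = new StringBuilder();
        for (Node n = head; n != null; n = n.next) order.append(n.key);
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertionDemo()); // most recent insert comes first
    }
}
```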
The putForNullKey method
- When the key is null, the entry is placed at position 0 of the table (a null key's hash is treated as 0, so the computed index is 0).
- putForNullKey(V value) scans the chain at table[0] for an existing entry whose key is null. If one is found, its value is replaced with the new value and the original value is returned; otherwise a new entry is added at the head of table[0].
```java
/** Offloaded version of put for null keys */
private V putForNullKey(V value) {
    // Scan table[0]'s chain for an existing null-key entry
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0); // hash 0, null key, bucket 0
    return null;
}
```
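The effect is observable through the public API: a HashMap accepts a null key, and repeated puts overwrite the single null-key slot.

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put(null, 1);
        m.put(null, 2); // overwrites: at most one null-key entry can exist
        System.out.println(m.get(null) + ", size = " + m.size());
    }
}
```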
The source of inflateTable is as follows:
```java
private void inflateTable(int toSize) {
    // Capacity must be a power of 2
    int capacity = roundUpToPowerOf2(toSize);
    // threshold is the minimum of capacity * loadFactor and
    // MAXIMUM_CAPACITY + 1; capacity cannot exceed MAXIMUM_CAPACITY
    // unless loadFactor is greater than 1
    threshold = (int) Math.min(capacity * loadFactor,
                               MAXIMUM_CAPACITY + 1);
    // Allocate the array
    table = new Entry[capacity];
    // Choose a suitable hash seed if needed
    initHashSeedAsNeeded(capacity);
}
```
roundUpToPowerOf2(toSize) rounds the capacity up to the nearest power of two greater than or equal to toSize. For example: toSize = 13 gives capacity = 16; toSize = 16 gives capacity = 16; toSize = 17 gives capacity = 32.
It is implemented as follows:
```java
private static int roundUpToPowerOf2(int number) {
    // assert number >= 0 : "number must be non-negative";
    return number >= MAXIMUM_CAPACITY
            ? MAXIMUM_CAPACITY
            : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
}
```
roundUpToPowerOf2 thus makes the array length a power of 2. Integer.highestOneBit returns the value of the argument's highest set bit (with all other bits zeroed).
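The three examples above can be checked directly against the core expression (a sketch that reuses the same Integer.highestOneBit formula for the non-boundary case):

```java
public class RoundUpDemo {
    // Same expression as roundUpToPowerOf2 for 1 < number < MAXIMUM_CAPACITY
    static int roundUp(int number) {
        return (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(13)); // 16
        System.out.println(roundUp(16)); // 16
        System.out.println(roundUp(17)); // 32
    }
}
```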
After the array is allocated, the hash function calculates the hash value as follows:
```java
// Uses xors and unsigned shifts to further mix the key's hashCode,
// adjusting its bits so that the final storage locations are spread
// as evenly as possible
final int hash(Object k) {
    int h = hashSeed;
    // Strings may use an alternative, optimized hash function,
    // depending on the hash seed
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
```
- As the code shows, only the key, never the value, affects where a HashMap element is stored.
- After the hash function produces the hash value, indexFor turns it into the actual storage index.
Its implementation is as follows:
```java
// Return the array index
static int indexFor(int h, int length) {
    return h & (length - 1);
}
```
- h & (length - 1) guarantees the resulting index is within the array bounds. For example, with the default capacity 16, length - 1 = 15; for h = 18 the final index is 2. Some versions use a modulo operation for this calculation, which also keeps the index in range, but bitwise operations are cheaper for the machine (HashMap uses bit operations heavily).
- From the above analysis, obtaining an element's storage location takes the following steps:
- Get the element's key.
- Compute the key's hash with the hash method, which builds on the key's hashCode value.
- Compute the storage index with indexFor.
- Finally, with the index in hand, addEntry puts the element into the HashMap:
```java
void addEntry(int hash, K key, V value, int bucketIndex) {
    // If size has reached the threshold and the target bucket is
    // occupied (a hash conflict is about to occur), grow the table to
    // twice its current length
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        // Recompute the insertion index after resizing
        bucketIndex = indexFor(hash, table.length);
    }
    // Place the element into its bucket
    createEntry(hash, key, value, bucketIndex);
}

// Create the element
void createEntry(int hash, K key, V value, int bucketIndex) {
    // The current head of the target bucket
    Entry<K,V> e = table[bucketIndex];
    // Link the new entry in front of the old head, so the newly
    // inserted element is always at the head of the list
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    // Element count +1
    size++;
}
```
- As the code shows, the array is expanded only when a hash conflict is about to occur: size has reached the threshold and the head of the bucket for the incoming key is not null.
- Expansion creates a new array twice the length of the previous one and transfers every element of the current Entry array into it, which makes it a relatively expensive operation.
Resize operation
Capacity expansion is achieved through the resize operation:
```java
// Grow the hash table to the new capacity
void resize(int newCapacity) {
    Entry[] oldTable = table;          // Old data
    int oldCapacity = oldTable.length; // Old capacity
    // If the old capacity has already reached the maximum, just raise
    // the expansion threshold and return
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }
    // New structure
    Entry[] newTable = new Entry[newCapacity];
    // Copy the data from the old table into the new structure
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    // Swap in the new underlying array
    table = newTable;
    // Update the threshold
    threshold = (int) Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}
```
When the array is expanded its length changes, and since the storage location is index = h & (length - 1), each element's index may change as well and must be recomputed. Let's look at the transfer method first.
```java
// Copy the data from the old table into the new structure
void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length; // New capacity
    for (Entry<K,V> e : table) {       // Iterate over every bucket
        while (null != e) {            // Iterate over the bucket's chain
            Entry<K,V> next = e.next;
            // If rehashing was requested, recompute the hash
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            // Locate the element's bucket in the new table
            int i = indexFor(e.hash, newCapacity);
            // Head-insert into that bucket, as with a singly linked
            // list: newTable[i] always holds the most recently moved element
            e.next = newTable[i];
            newTable[i] = e;
            e = next; // Move on to the next element
        }
    }
}
```
This method walks each old bucket's chain element by element and re-inserts every entry into the expanded array, recomputing each index by hashing the key's hashCode and ANDing it with length - 1.
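One consequence of index = h & (length - 1): when the table doubles, exactly one extra hash bit participates in the mask, so each element either keeps its old index or moves to old index + oldCapacity. A quick check, reusing the hash values from the earlier bucket example:

```java
public class ReindexDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        // Hash values chosen for illustration
        for (int h : new int[] {12, 28, 108, 140}) {
            int oldIdx = h & (oldCap - 1);
            int newIdx = h & (newCap - 1);
            // newIdx is either oldIdx or oldIdx + oldCap
            System.out.println(h + ": " + oldIdx + " -> " + newIdx);
        }
    }
}
```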
- Note: the design of the HashMap array length.
- The array length of a HashMap must be a power of 2. What is the advantage of this?
```java
// Select the appropriate bucket from the hash value and table length
static int indexFor(int h, int length) {
    return h & (length - 1);
}
```
- If length is a power of 2, its binary form is 100...000, so length - 1 is 0111...111; ANDing h against that mask is very fast, and no slots are wasted. If length is not a power of 2, say length = 15, then length - 1 = 14, binary 1110, and h is ANDed with that.
- Because the mask's last bit is 0, the indices 0001, 0011, 0101, 0111, 1001, 1011, and 1101 can never hold an element, which wastes considerable space. Worse, the number of usable slots becomes much smaller than the array length, which further raises the chance of collisions and slows down queries.
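Both points can be demonstrated in a few lines (hash values are illustrative):

```java
public class MaskDemo {
    public static void main(String[] args) {
        int h = 18;
        System.out.println(h & (16 - 1)); // the h = 18 example: index 2

        // With length 15, the mask is 14 (binary 1110): the low bit of
        // every computed index is 0, so odd buckets are never used
        boolean anyOdd = false;
        for (int x = 0; x < 1000; x++) {
            if (((x & (15 - 1)) & 1) == 1) anyOdd = true;
        }
        System.out.println(anyOdd); // never finds an odd index
    }
}
```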
Get operation
```java
// Get the element stored under key
public V get(Object key) {
    // A null key is handled by dedicated logic: HashMap allows null keys
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);               // Look up the entry
    return null == entry ? null : entry.getValue(); // null if absent
}

// Get the value stored under the null key
private V getForNullKey() {
    if (size == 0) { // Empty map: nothing to find
        return null;
    }
    // Null-key entries live in the chain at table[0]
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null)
            return e.value;
    }
    return null;
}
```
The get method returns the corresponding value from the key. If the key is null, it retrieves it directly from table[0]. Let’s look at the getEntry method again:
```java
// Get the entry stored under key
final Entry<K,V> getEntry(Object key) {
    if (size == 0) { // Empty map
        return null;
    }
    // Compute the key's hash (0 for a null key)
    int hash = (key == null) ? 0 : hash(key);
    // Locate the bucket from the hash and table length, then walk its chain
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        // Compare the hash first, then the key itself (by reference or
        // equals); return the entry on a match
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}
```
- To sum up the lookup path: key (hashCode) → hash → indexFor → final index table[i]; if a chain exists at that slot, it is traversed and the matching record is found through the key's equals method.
- Strictly speaking, the e.hash == hash check is not necessary once the bucket has been located; traversing with equals alone would be correct, but comparing hashes first is a cheap way to skip non-matching entries.
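The whole lookup chain can be exercised through the public API:

```java
import java.util.HashMap;
import java.util.Map;

public class GetDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("k1", "v1");
        // Found: hashCode -> hash -> index -> equals along the chain
        System.out.println(map.get("k1"));
        // An absent key simply returns null
        System.out.println(map.get("missing"));
    }
}
```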
How to generate hashcode in HashMap
HashMap calls the key object's hashCode method, then applies unsigned right shifts and xor operations to the result (involving both the high and low bits of the hashCode). This extra mixing makes the hashing stronger and improves the efficiency of HashMap's get method.
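A standalone sketch of the mixing step (JDK 7's shift/xor cascade, with the hashSeed handling and the String special case omitted), showing how two hashCodes that differ only in high bits end up in different buckets after spreading:

```java
public class SpreadDemo {
    // The shift/xor cascade from JDK 7's hash(), minus hashSeed handling
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int a = 0x10000, b = 0x20000; // differ only above bit 15
        // Without spreading, h & 15 sees only the low bits: both collide at 0
        System.out.println((a & 15) + " " + (b & 15));
        // After spreading, the high bits influence the bucket choice
        System.out.println((spread(a) & 15) + " " + (spread(b) & 15));
    }
}
```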
Why HashCode
hashCode exists primarily for lookup speed: it determines an object's storage address in a hash-based structure (it represents the object's position in the hash table). One of the most important reasons hashCode exists is its use in HashMap (a HashSet is in fact backed by a HashMap).
Equals method and HashCode
- If you override equals(Object obj), you must also override hashCode().
- If equals(Object obj) returns true for two objects, their hashCode() calls must return the same int.
- If equals(Object obj) returns false, hashCode() does not necessarily return different ints.
- If two objects' hashCode() calls return the same int, equals(Object obj) does not necessarily return true.
- If two objects' hashCode() calls return different ints, equals(Object obj) must return false.
Once an object is stored in such a collection, you must not modify any state that affects its hashCode value; otherwise the object can no longer be located in the collection, which effectively causes a memory leak.
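The contract above can be illustrated with a small value class that overrides both methods consistently. The class and its fields are hypothetical, made up for this example:

```java
import java.util.Objects;

public class PointDemo {
    // Hypothetical value class: equal objects must report equal hash codes
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    public static void main(String[] args) {
        Point a = new Point(1, 2), b = new Point(1, 2);
        System.out.println(a.equals(b));                  // equal values
        System.out.println(a.hashCode() == b.hashCode()); // contract holds
    }
}
```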
Conclusion
HashMap is fast because it uses a hash table: the array subscript is derived from the hashcode of the key, so an element's bucket can be located in constant time.