1. What is a hash table
A hash table is a data structure that accesses data stored in memory by key: it locates a record by computing a function of the key that maps the desired record to a position in a table, which speeds up lookups. The mapping function is called a hash function, and the array of records is called a hash table.
1.1 Hash tables and arrays
As the definition above suggests, hash tables build on the array's ability to access data randomly by index; a hash table is really an extension of the array, evolved from it. The difference from a plain array lies in how the index is obtained:
- With an array, we usually access elements in order by index: for (i) arr[i]
- With a hash table, the index is computed by the hash function, that is, i = hash(key)
So hash tables actually embody a kind of K-V idea:
- K: the hash value computed internally, not the key we pass in
- V: the value actually stored, which can be a primitive such as an int, a Node, or an Entry that contains the key
1.2 Hash Table and Map
Map is a container, and Entry is its basic storage unit. A Map can be implemented on top of different underlying data structures; that is, the entries can be organized as an array, a linked list, a hash table, a tree, and so on:
- Linked list: not suitable; a K-V lookup is too slow at O(n), because list operations require traversal
- Hash table: suitable; its design is K-V from the start, with O(1) lookup and update ==> HashMap
- Tree: a tree keeps the entries ordered, with O(log n) lookup ==> TreeMap
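For example, in Java the two common choices look like this (a minimal illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDemo {
    public static void main(String[] args) {
        Map<String, Integer> byHash = new HashMap<>(); // hash table: O(1) expected get/put, unordered
        Map<String, Integer> byTree = new TreeMap<>(); // red-black tree: O(log n) get/put, keys kept sorted
        byHash.put("banana", 2);
        byHash.put("apple", 1);
        byTree.putAll(byHash);
        System.out.println(byHash.keySet()); // iteration order not guaranteed
        System.out.println(byTree.keySet()); // [apple, banana]
    }
}
```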
2. Hash functions
2.1 Basic Requirements
The hash function must satisfy the following requirements:
- The hash value it computes must be a non-negative integer. Since array subscripts start at 0, the hash value generated by the hash function must also be a non-negative integer.
- It must yield the same hash value for the same key: if key1 = key2, then hash(key1) == hash(key2).
- It should yield different values for different keys: if key1 ≠ key2, then hash(key1) ≠ hash(key2).
The last requirement may seem reasonable, but in the real world it is almost impossible to find a hash function that produces a different hash value for every distinct key. Even well-known hash algorithms such as MD5, SHA, and CRC cannot completely avoid such hash collisions.
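A quick Java illustration of why collisions are unavoidable: distinct keys can legitimately share a hash value.

```java
// "Aa" and "BB" are different strings but have the same hashCode (2112),
// so any hash function built on hashCode() maps them to the same slot.
System.out.println("Aa".hashCode()); // 2112
System.out.println("BB".hashCode()); // 2112
```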
2.2 How to Design a hash function?
- First, the design of the hash function should not be too complicated. A hash function that is too complex is bound to consume a lot of calculation time, which indirectly affects the performance of the hash table.
- Second, the values generated by the hash function should be as random and evenly distributed as possible, to avoid or minimize hash collisions; and even when collisions occur, the data should be spread fairly evenly across the slots, so that no single slot accumulates too much data.
```java
// Hash function of HashMap (simplified). capacity is the size of the
// hash table and is always a power of two, so (h & (capacity - 1))
// gives the same result as h % capacity.
int hash(Object key) {
    int h = key.hashCode();
    return (h ^ (h >>> 16)) & (capacity - 1);
}
```
3. Hash conflicts
Before we talk about hash conflict resolution, let’s look at what a load factor is.
Load factor of the hash table = number of elements filled in / length of the hash table
Therefore, we generally use the load factor to express how full the table is:
- The larger the load factor, the fewer free positions, the more conflicts, and the worse hash table performance.
- The default load factor of a HashMap is 0.75. When the number of elements in a HashMap exceeds 0.75 × capacity (where capacity is the size of the hash table), the HashMap is resized, and each resize doubles the previous capacity.
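In numbers, using HashMap's defaults (a small sketch):

```java
int capacity = 16;                              // HashMap's default initial capacity
float loadFactor = 0.75f;                       // HashMap's default load factor
int threshold = (int) (capacity * loadFactor);  // 12: inserting the 13th element triggers a resize
// After the resize, capacity doubles to 32 and the new threshold is 24.
```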
Let’s take a look at several solutions to hash collisions…
3.1 Open addressing method
If a hash collision occurs, we probe for another free slot and insert the element there.
- Advantages
  - Unlike the linked-list method, open addressing involves no chains of nodes: all the data in the hash table is stored in one array, which makes effective use of the CPU cache and speeds up queries.
  - A hash table implemented this way is also relatively easy to serialize, whereas linked lists contain pointers and are not easy to serialize.
- Disadvantages
  - Deleting data from a hash table that resolves collisions with open addressing is more troublesome: deleted entries need a special marker.
  - Also, since all the data is stored in one array, collisions are more costly than with linked lists, so the upper limit on the load factor should not be too large. This also means this method wastes more memory than the linked-list method.
Open addressing is suitable when the data volume is small and the load factor is low. This is why ThreadLocalMap in Java uses open addressing to resolve hash collisions.
Scheme 1: linear probing
When we insert data into the hash table and the slot chosen by the hash function is already occupied, we start from that position and search forward for a free slot until we find one, as in the sketch below:
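A minimal linear-probing table in Java (a sketch with hypothetical field names; a real implementation also needs resizing and deletion markers):

```java
class LinearProbingTable {
    private final int capacity = 16;  // power of two, so & (capacity - 1) works as modulo
    private final String[] keys = new String[capacity];
    private final int[] values = new int[capacity];

    private int hash(String key) {
        int h = key.hashCode();
        return (h ^ (h >>> 16)) & (capacity - 1);
    }

    // Probe forward from the home slot until a free (or matching) slot
    // is found. Assumes the table is never completely full.
    void put(String key, int value) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) & (capacity - 1); // next slot, wrapping around
        }
        keys[i] = key;
        values[i] = value;
    }
}
```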
Scheme 2: quadratic probing
Quadratic probing is similar to linear probing. Linear probing uses a step of 1, probing hash(key)+0, hash(key)+1, hash(key)+2, …
Quadratic probing makes the step grow quadratically, probing the subscript sequence hash(key)+0², hash(key)+1², hash(key)+2², …
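Only the probe step changes between the two schemes (a sketch, where i is the probe attempt number):

```java
// Linear probing visits home+0, home+1, home+2, ...;
// quadratic probing visits home+0², home+1², home+2², ...
int quadraticProbe(int home, int i, int capacity) {
    return (home + i * i) % capacity;
}
```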
Scheme 3: double hashing
Double hashing simply means using more than one hash function. We use a group of hash functions hash1(key), hash2(key), hash3(key), … We try the first hash function; if the computed slot is already occupied, we try the second, and so on, until we find a free slot.
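A sketch matching the description above; hashWith is a hypothetical way of deriving a family of hash functions by salting the key (classic double hashing instead probes hash1(key) + i × hash2(key)):

```java
// Hypothetical family of hash functions: the N-th function hashes the key salted with N.
int hashWith(int salt, String key, int capacity) {
    return Math.floorMod((key + "#" + salt).hashCode(), capacity);
}

// Try hash1, hash2, hash3, ... in turn until a free (or matching) slot appears.
int findSlot(String key, String[] keys) {
    for (int salt = 1; salt <= keys.length; salt++) {
        int slot = hashWith(salt, key, keys.length);
        if (keys[slot] == null || keys[slot].equals(key)) {
            return slot;
        }
    }
    throw new IllegalStateException("every hash function collided; the table needs resizing");
}
```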
3.2 Linked-list method (buckets/slots)
The linked-list method is the more common way to resolve hash collisions, and it is much simpler than open addressing.
- Advantages: the linked-list method uses memory more efficiently than open addressing, because list nodes can be created on demand rather than pre-allocated like the array in open addressing. This is the same respect in which linked lists are superior to arrays in general.
- Disadvantages:
  - Because list nodes store pointers, chaining small objects is comparatively memory-hungry and may even double the memory consumption. Of course, if we store large objects, i.e. objects much larger than a pointer (4 or 8 bytes), the pointer overhead becomes negligible.
  - Because the nodes of a linked list are scattered across memory rather than contiguous, chaining is unfriendly to the CPU cache, which also affects execution efficiency to some extent.
Scheme 1: chaining (the "zipper" method)
In this scheme, each bucket (or slot) of the hash table corresponds to a linked list, and all elements with the same hash value are placed in the list for that slot. Complexity analysis:
- On insertion, we only need to compute the target slot with the hash function and insert the element into the corresponding list (with the help of a tail pointer), so insertion takes O(1) time.
- To find or delete an element, we likewise compute the slot with the hash function and then traverse the list to find or delete the element.
In fact, the time complexity of both operations is proportional to the list length k, i.e. O(k). For a uniformly distributing hash function, k = n/m in theory, where n is the number of elements in the hash table and m is the number of slots.
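A minimal chaining table (a sketch with a hypothetical Node type; JDK 8's HashMap works on the same principle, with treeification added):

```java
class ChainedTable {
    private static class Node {
        final String key;
        int value;
        Node next;
        Node(String key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node[] buckets = new Node[16];

    private int slot(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Prepending to the chain is O(1); the scan for an existing key costs O(k).
    void put(String key, int value) {
        int i = slot(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }
        }
        buckets[i] = new Node(key, value, buckets[i]);
    }

    // O(k): walk the chain, where k is the chain length (n/m on average).
    Integer get(String key) {
        for (Node n = buckets[slot(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}
```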
Scheme 2: treeification
In fact, we can tweak the linked-list method slightly to get a more efficient hash table: replace the per-bucket linked list with another efficient dynamic data structure, such as a skip list or a red-black tree. Then, even in the extreme case where hash collisions pile all the data into a single bucket, the lookup time degrades only to O(log n). This effectively defends against hash-collision attacks.
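This is exactly what JDK 8's HashMap does; the thresholds it uses are defined as constants in java.util.HashMap:

```java
// From java.util.HashMap (JDK 8):
static final int TREEIFY_THRESHOLD = 8;     // treeify a bucket once its chain grows past 8 nodes
static final int UNTREEIFY_THRESHOLD = 6;   // turn the tree back into a list when it shrinks to 6
static final int MIN_TREEIFY_CAPACITY = 64; // treeify only if the table itself has at least 64 slots
```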
3.3 Dynamic Capacity Expansion
For hash tables, when the load factor is too large, we can also dynamically expand, reapply for a larger hash table, and move the data to the new hash table.
Let’s say each time we expand we apply for a space twice the size of the original hash table. If the original hash load factor was 0.8, the new hash load factor is reduced by half to 0.4 after expansion.
Resizing an array only requires a relatively simple copy of the data. Resizing a hash table involves much more data movement: because the size of the table changes, the position of each entry changes, so we must recompute the location of every entry with the hash function.
Capacity Expansion in Batches
To take an extreme example, if the current size of the hash table is 1GB, to double the size would require rehashing 1GB of data and moving from the old hash table to the new one, which sounds time-consuming, right?
If our business code serves users directly, then although a single insertion is fast most of the time, the occasional very slow insertion (the one that triggers a resize) can be unacceptable to users. In that case, the all-at-once expansion mechanism is inappropriate. To avoid the one-off cost of expansion, we can interleave the expansion work with insert operations and complete the migration in batches:
- When the load factor reaches the threshold, we only allocate the new space; we do not yet move any old data into the new hash table.
- When new data arrives, we insert it into the new hash table, and additionally move one entry from the old hash table into the new one.
- We repeat this every time data is inserted into the hash table.
- After many insertions, the data in the old hash table has been moved to the new one bit by bit. With no concentrated one-off data movement, every insertion stays fast.
For queries, to stay compatible with both the old and the new hash table, we first look in the new table and fall back to the old one if the key is not found. In this way the cost of a one-off expansion is spread evenly over many insert operations, avoiding a single time-consuming resize, and an insertion always takes O(1) time.
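A sketch of the batched migration, using java.util.HashMap as a stand-in for the two internal tables:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class IncrementalMap {
    private final Map<String, Integer> oldTable = new HashMap<>();
    private final Map<String, Integer> newTable = new HashMap<>();

    // Each insert goes to the new table and drags one old entry along with it.
    void put(String key, Integer value) {
        newTable.put(key, value);
        Iterator<Map.Entry<String, Integer>> it = oldTable.entrySet().iterator();
        if (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            newTable.put(e.getKey(), e.getValue());
            it.remove();
        }
    }

    // Look in the new table first, then fall back to the old one.
    Integer get(String key) {
        Integer v = newTable.get(key);
        return (v != null) ? v : oldTable.get(key);
    }
}
```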
4. Hash table applications
4.1 HashMap and HashSet
Here are two links to source code analysis:
- [Java container source code] the most detailed HashMap source code (JDK8)
- Java container source code HashSet source code analysis
4.2 Hash table + doubly linked list: LinkedHashMap
A hash table supports very efficient insertion, deletion, and lookup, but because the hash function scrambles the data, the entries end up stored in no particular order. That is, a hash table does not support fast traversal of its data in a given order.
If we want to iterate over the hash table in order, we need to copy the hash table into an array, sort it, and iterate over it again. Because a hash table is a dynamic data structure, with data constantly being inserted and deleted, it would be inefficient to have to sort the data whenever we wanted to traverse the hash table in order.
To solve this problem, we combine a hash table with a linked list (or skip list). LinkedHashMap uses a hash table plus a doubly linked list. A doubly linked list is used so that deletion can be completed in O(1), without traversing the list again to find the predecessor node. For details, see [Java container source] LinkedHashMap source code analysis.
4.3 The spell-check feature in Word documents
There are about 200,000 commonly used English words. Assuming an average word length of 10 letters, so that a word occupies about 10 bytes, 200,000 English words take up roughly 2MB of storage; even magnified 10 times, that is only 20MB. For today's computers, this fits comfortably in memory, so we can store the entire English dictionary in a hash table.
When the user types an English word, we look it up in the hash table. If it is found, the spelling is correct; if not, the spelling is probably wrong and we can show a hint. With a hash table, it is easy to determine quickly whether a word is misspelled.
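A minimal sketch of such a checker, using a HashSet as the in-memory dictionary (the word list itself is assumed to be loaded from elsewhere):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SpellChecker {
    private final Set<String> dictionary = new HashSet<>();

    SpellChecker(List<String> words) {
        for (String w : words) dictionary.add(w.toLowerCase()); // load the dictionary once
    }

    // O(1) expected: a single hash lookup decides "known word" vs "possible typo".
    boolean isSpelledCorrectly(String word) {
        return dictionary.contains(word.toLowerCase());
    }
}
```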