Hash table & bitmap & Hash algorithm
Learning hash algorithm process, deeply feel the beauty of the algorithm. Those excellent ideas, emitting magical light, illuminating the direction of computer development.
Application scenarios
-
Word spelling error function in Word software
-
Language level: Java HashMap LinkedHashMap
-
Bloom filter: to achieve efficient web crawler in the address link to repeat the function
-
Improved LRU elimination algorithm
-
Hash algorithm to solve security encryption, unique identification, data verification, hash function, load balancing, data fragmentation, distributed storage
The famous hashing algorithm in the industry
MD5, SHA, CRC
Why are hash tables so popular
This guy’s query, delete, insert time is order one.
Concept,
-
A hash table is an extension of an array, and the underlying dependent array supports quick access to elements by subscript
-
The idea of hash is to convert the number (key) into an array index (hash) using a function (hash function), and then store the corresponding data in the array at the corresponding index location. The arr [hash (key)] = valuearr [hash (key)] = valuearr [hash (key)] = value
-
Hash function properties
-
The hash function evaluates to a hash value that is a non-negative integer
-
If key1 = key2, hash(key1) = hash(key2)
-
If key1! = key2, then hash(key1)! = hash(key2)
Hash conflict
It is difficult to guarantee the third feature of the above hash function, no good hash function can avoid hash conflicts. Then there are difficulties we solve difficulties, come on, Ollie!
Commonly used methods to resolve hash conflict are: open addressing, linked list method
Develop addressing methods
-
Idea: If a hash conflict occurs (the place to insert is found occupied), the conflict is resolved by re-probing the location.
-
Linear detection method:
-
Insert:
If you find a conflict, keep looking down, take it. Look down to the end of the search did not find, start from scratch again, until you find the pit. (If the hash table is full, it will not be inserted)
-
To find the
Calculate the hash value of the current search data through the hash function, go to the hash table to find, find the wrong guy, along the insertion order to find, if the process of finding an empty position, indicating that the data to find is not in the hash table. And why would you say that? If you think about insertion, you’ll get the idea.
-
delete
Delete a little annoying, need to consider the conditions of the end of the search, can not be deleted, empty pit, interfere with the judgment logic of the search. To solve this problem, you need to mark your deletions.
-
Analysis of linear detection method
For the linear detection method, when more and more data are inserted into the hash table, there will be fewer and fewer free positions, and the probability of hash conflict will be higher and higher. The events of linear detection will be longer and longer, resulting in the worst event complexity of O(n).
-
There are two other classical probing methods for developing addressing: secondary probing and double hashing
The linked list method
🔥 is a more common way to resolve hash conflicts and is simpler than open addressing.
-
Idea: Set up two structures, slots (arrays) and linked lists. Slots record the head of a linked list, which records data. Insert the process first through the slot, slot is not occupied, occupied slot, slot to open up a linked list, insert data; If the slot is occupied, the slot is directly inserted into the linked list
-
Time complexity analysis
For hash table based on the linked list method to resolve conflicts, the time complexity of search and delete operations is proportional to the length of the linked list K, that is, O(k). For hash functions with uniform hash, k=n/mk =n/mk =n/m theoretically, where N represents the number of data in the hash table and M represents the number of slots in the hash table. When k is a small constant, we can roughly think of O(1) as the time complexity of searching and deleting data in the hash table.
-
Load factor
We call the top k the load factor. The load factor is expressed as: Load factor = number of elements in the hash table/length of the hash table Load factor = number of elements in the hash table/length of the hash table. The larger the load factor, the longer the linked list length, the lower the performance of the hash table.
Build industrial-grade hash tables
Hash table collision attack
Design hash functions
-
Overly complex functions consume too much computation time
-
The values generated by the hash function should be as random and evenly distributed as possible
-
Some design methods
-
Data analysis: A method of designing hash functions according to the characteristics of data
-
Direct addressing
-
Take the average method
-
Folding method
-
Random number method
Solve the problem of too large load factor
When the load factor is too large, the dynamic capacity is expanded, a larger hash table is applied for, and the data is migrated to a new table
Avoid inefficient Capacity Expansion
-
Expansion would take too much time
-
To solve the problem of time-consuming capacity expansion, the capacity expansion operation can be interspersed with multiple insertion operations and completed in batches using the idea of allocation.
Choose the appropriate conflict resolution method
In Java, LinkedHashMap resolves conflicts by linked list method, while ThreadLocalMap resolves conflicts by open addressing method based on linear probe.
- Advantages and disadvantages of open addressing
- advantages
Storing data in arrays can effectively use THE CPU cache to speed up queries. Easy serialization.
-
disadvantages
Deleting data is complicated; High probability of conflict; The load factor can’t be too big, it has to be less than 1, so you need more storage.
- The linked list method
-
advantages
The memory utilization is higher than open addressing method. High tolerance to large loading factor
-
disadvantages
To store Pointers, consume extra memory space; CPU cache unfriendly;
-
To improve the
Instead of a linked list, I could use a red-black tree
Industrial hash table examples
Java HashMap
-
Initial size
The default size is 16 and can be set by yourself
-
Load factor
The default value is 0.75, which is doubled after triggering expansion
-
Hashing conflict resolution
If we continue with the list method, we switch between the list and the red-black tree, depending on the length of the list
-
The hash function
Not to discuss
Some solutions to practical problems using hash tables
Optimized LRU elimination algorithm
- Several operations to be included in the buffering system
-
Add a data
-
Delete a data
-
Find a piece of data
- Idea: Build a hash table as an index on top of the original ordered list, and each node in the hash table stores an additional pointer to the ordered list node
Java LinkedHashMap
Linked refers to a bidirectional ordered list
- The data in the hash table is stored irregularly after being scrambled by the hash function, but the LinkedHashMap implements output in the order in which the data was inserted
Crawler to achieve the url link to heavy
Bit uses a bitmap, bloom filter
Prevent user information leakage after data is removed from the database
The principle of hashing algorithm: mapping arbitrary length binary value string to fixed length binary string.
A hash algorithm must meet the following requirements
-
The original data cannot be derived from the hash value (hence the hash algorithm is also called one-way hash algorithm)
-
Sensitive to data changes, even if the original data changes only one bit, the corresponding hash value is also different size
-
The probability of hash collision is very small, the probability of different raw data corresponding to the same hash value is very small
-
The hashing algorithm has high efficiency and can also calculate the hashing value quickly for long text.