Hash table & bitmap & Hash algorithm

Learning hash algorithm process, deeply feel the beauty of the algorithm. Those excellent ideas, emitting magical light, illuminating the direction of computer development.

Application scenarios

  • Word spelling error function in Word software

  • Language level: Java HashMap LinkedHashMap

  • Bloom filter: to achieve efficient web crawler in the address link to repeat the function

  • Improved LRU elimination algorithm

  • Hash algorithm to solve security encryption, unique identification, data verification, hash function, load balancing, data fragmentation, distributed storage

The famous hashing algorithm in the industry

MD5, SHA, CRC

Why are hash tables so popular

This guy’s query, delete, insert time is order one.

Concept,

  1. A hash table is an extension of an array, and the underlying dependent array supports quick access to elements by subscript

  2. The idea of hash is to convert the number (key) into an array index (hash) using a function (hash function), and then store the corresponding data in the array at the corresponding index location. The arr [hash (key)] = valuearr [hash (key)] = valuearr [hash (key)] = value

  3. Hash function properties

  • The hash function evaluates to a hash value that is a non-negative integer

  • If key1 = key2, hash(key1) = hash(key2)

  • If key1! = key2, then hash(key1)! = hash(key2)

Hash conflict

It is difficult to guarantee the third feature of the above hash function, no good hash function can avoid hash conflicts. Then there are difficulties we solve difficulties, come on, Ollie!

Commonly used methods to resolve hash conflict are: open addressing, linked list method

Develop addressing methods

  1. Idea: If a hash conflict occurs (the place to insert is found occupied), the conflict is resolved by re-probing the location.

  2. Linear detection method:

  • Insert:

    If you find a conflict, keep looking down, take it. Look down to the end of the search did not find, start from scratch again, until you find the pit. (If the hash table is full, it will not be inserted)

  • To find the

    Calculate the hash value of the current search data through the hash function, go to the hash table to find, find the wrong guy, along the insertion order to find, if the process of finding an empty position, indicating that the data to find is not in the hash table. And why would you say that? If you think about insertion, you’ll get the idea.

  • delete

    Delete a little annoying, need to consider the conditions of the end of the search, can not be deleted, empty pit, interfere with the judgment logic of the search. To solve this problem, you need to mark your deletions.

  1. Analysis of linear detection method

    For the linear detection method, when more and more data are inserted into the hash table, there will be fewer and fewer free positions, and the probability of hash conflict will be higher and higher. The events of linear detection will be longer and longer, resulting in the worst event complexity of O(n).

  2. There are two other classical probing methods for developing addressing: secondary probing and double hashing

The linked list method

🔥 is a more common way to resolve hash conflicts and is simpler than open addressing.

  1. Idea: Set up two structures, slots (arrays) and linked lists. Slots record the head of a linked list, which records data. Insert the process first through the slot, slot is not occupied, occupied slot, slot to open up a linked list, insert data; If the slot is occupied, the slot is directly inserted into the linked list

  2. Time complexity analysis

    For hash table based on the linked list method to resolve conflicts, the time complexity of search and delete operations is proportional to the length of the linked list K, that is, O(k). For hash functions with uniform hash, k=n/mk =n/mk =n/m theoretically, where N represents the number of data in the hash table and M represents the number of slots in the hash table. When k is a small constant, we can roughly think of O(1) as the time complexity of searching and deleting data in the hash table.

  3. Load factor

    We call the top k the load factor. The load factor is expressed as: Load factor = number of elements in the hash table/length of the hash table Load factor = number of elements in the hash table/length of the hash table. The larger the load factor, the longer the linked list length, the lower the performance of the hash table.

Build industrial-grade hash tables

Hash table collision attack

Design hash functions

  1. Overly complex functions consume too much computation time

  2. The values generated by the hash function should be as random and evenly distributed as possible

  3. Some design methods

  • Data analysis: A method of designing hash functions according to the characteristics of data

  • Direct addressing

  • Take the average method

  • Folding method

  • Random number method

Solve the problem of too large load factor

When the load factor is too large, the dynamic capacity is expanded, a larger hash table is applied for, and the data is migrated to a new table

Avoid inefficient Capacity Expansion

  1. Expansion would take too much time

  2. To solve the problem of time-consuming capacity expansion, the capacity expansion operation can be interspersed with multiple insertion operations and completed in batches using the idea of allocation.

Choose the appropriate conflict resolution method

In Java, LinkedHashMap resolves conflicts by linked list method, while ThreadLocalMap resolves conflicts by open addressing method based on linear probe.

  1. Advantages and disadvantages of open addressing
  • advantages

Storing data in arrays can effectively use THE CPU cache to speed up queries. Easy serialization.

  • disadvantages

    Deleting data is complicated; High probability of conflict; The load factor can’t be too big, it has to be less than 1, so you need more storage.

  1. The linked list method
  • advantages

    The memory utilization is higher than open addressing method. High tolerance to large loading factor

  • disadvantages

    To store Pointers, consume extra memory space; CPU cache unfriendly;

  • To improve the

    Instead of a linked list, I could use a red-black tree

Industrial hash table examples

Java HashMap

  1. Initial size

    The default size is 16 and can be set by yourself

  2. Load factor

    The default value is 0.75, which is doubled after triggering expansion

  3. Hashing conflict resolution

    If we continue with the list method, we switch between the list and the red-black tree, depending on the length of the list

  4. The hash function

    Not to discuss

Some solutions to practical problems using hash tables

Optimized LRU elimination algorithm

  1. Several operations to be included in the buffering system
  • Add a data

  • Delete a data

  • Find a piece of data

  1. Idea: Build a hash table as an index on top of the original ordered list, and each node in the hash table stores an additional pointer to the ordered list node

Java LinkedHashMap

Linked refers to a bidirectional ordered list

  1. The data in the hash table is stored irregularly after being scrambled by the hash function, but the LinkedHashMap implements output in the order in which the data was inserted

Crawler to achieve the url link to heavy

Bit uses a bitmap, bloom filter

Prevent user information leakage after data is removed from the database

The principle of hashing algorithm: mapping arbitrary length binary value string to fixed length binary string.

A hash algorithm must meet the following requirements

  1. The original data cannot be derived from the hash value (hence the hash algorithm is also called one-way hash algorithm)

  2. Sensitive to data changes, even if the original data changes only one bit, the corresponding hash value is also different size

  3. The probability of hash collision is very small, the probability of different raw data corresponding to the same hash value is very small

  4. The hashing algorithm has high efficiency and can also calculate the hashing value quickly for long text.