The Concept and Application from Hash table to BitMap (PART 1)

Hash table

When it comes to hashtables, you might think of common collections like HashMap, HashTable, etc.

A Hash table (also called a Hash table) is a data structure that is directly accessed based on Key values. That is, it accesses records by mapping key values to a location in the table to speed up lookups. This mapping function is called a hash function, and the array of records is called a hash table.

A hash table is a data structure that provides fast insert and lookup operations. When first introduced to hash tables, the benefits are mind-boggling. No matter how much data is in the hash table, insertion and deletion only take a near-constant time order O(1). In fact, it only takes a few machine instructions.

For users of hash tables, this is instantaneous. Hash tables are very fast, and computer programs that need to find thousands of records in a second often use hash tables (such as spell checkers) significantly faster than trees, which typically require O(N) time levels. Hash tables are fast and relatively easy to implement programmatically.

Hash tables also have some disadvantages. It is based on arrays, which are difficult to expand once created. Performance degrades dramatically when some hash tables are almost full, so programs must be aware of how much data is going to be stored in the table (or be prepared to periodically move data to a larger hash table, a time-consuming process).

When we hash an element, get a storage address, and then try to insert it, and find that it’s already occupied by another element, this is actually called a collision, also called a hash collision. As mentioned earlier, the design of the hash function is critical, and good hash functions are as simple and evenly distributed as possible. However, it is important to remember that an array is a contiguous chunk of memory of fixed length, and no good hash function can guarantee that the resulting storage addresses will never collide. So how do hash conflicts get resolved? There are many solutions to hash conflict: open addressing (conflict, continue to look for the next piece of storage address is not occupied), hash function method and chain address method, and HashMap is the use of chain address method, that is, array + linked list.

Let’s use HashMap to talk about the application of the dissolution list and conflict resolution.

Implementation principle of HashMap

The backbone of a HashMap in Java is an array of entries. Entry is the basic unit of a HashMap, and each Entry contains a key-value pair.

transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;
Copy the code

	static class Entry<K.V> implements Map.Entry<K.V> {
        final K key;
        V value;
        Entry<K,V> next;// Stores a reference to the next Entry, single linked list structure
        int hash;// The hash value of the key hashcode value is stored in the Entry to avoid double calculation

        Entry(inth, K k, V v, Entry<K,V> n) { value = v; next = n; key = k; hash = h; }}Copy the code

From this, we can see that Entry is actually a one-way linked list. That’s why we say HashMap resolves hash conflicts by zipping.

Simply put, HashMap consists of array + linked list. Array is the body of HashMap, while linked list is mainly used to solve hash conflicts. If the array location does not contain a linked list (the next of the current entry points to NULL), search, add and other operations will be very quick and only need one address. If the array to be located contains a linked list, the time complexity of the add operation is O(n), and the linked list is first traversed. If it exists, it will be overwritten; otherwise, it will be added. For lookups, you still need to traverse the linked list and then compare lookups one by one through the Equals method of the key object. Therefore, from a performance perspective, the fewer linked lists in a HashMap, the better the performance.

The Hash table algorithm

There are multiple methods to construct a Hash table, including direct addressing, residual, median, folding, random number, and mathematical analysis.

Direct addressing

Take a linear function of the key as a hash address, as in 或 A and B are constants.

For example, there is a statistical table of population numbers from 1 to 100 years old, where age is the keyword and the hash function takes the keyword itself. However, this method is inefficient. The time complexity is O(1), and the space complexity is O(n), where n is the number of keywords.

Division remainder method

The hash address is the remainder of the key value divided by a prime smaller than the length of the hash table. Hash(key) = key % p;

Here, the selection of P is very critical. If p is well chosen, conflicts can be minimized to the greatest extent. Generally, p is the largest prime number not greater than m.

Take the average method

The square of the inner code of the key identifier is calculated first, and then the middle bits are taken as the hash address according to the size of the hash table.

For example, if there is a keyword sequence {421,423,436} and the squared result is {177241,178929,190096}, {72,89,00} can be used as the Hash address.

Folding method

Divide the key from left to right into parts with equal number of digits. Each part should have the same number of digits as the hash table address, except that the last part can be shorter. The data of these parts is superimposed to obtain the hash address of the record with the key code. It is divided into shift method and boundary method.

Random number method

Select a random function and take the random function of the keyword as its hash address.

Where random is a random function. This method is usually used when the keyword length is unequal.

Mathematical analysis

Let’s say there are N D digits, each of which may have r different symbols. The r symbols may not appear at the same frequency on bits, but may be more evenly distributed on some bits, and each symbol has an equal chance of appearing; It is not evenly distributed in some bits, and only certain symbols occur frequently. According to the size of the hash table, several bits with evenly distributed symbols can be selected as the hash address.

The date of birth of a group of people is as follows:

Year.month.day 95.10.03 95.11.23 96.07.12 95.04.21 96.02.15...Copy the code

According to the analysis, the first, second and third positions are more likely to repeat, and the chances of conflicts are increased by taking these three positions. Therefore, it is better to take the last three positions rather than the first three.

Conflict resolution

The Hash table constructor was introduced above, and although there are many ways, different key values may map to the same Hash address. This will cause hash collision/hash collision. The following describes how to handle Hash table conflicts.

Closed hash method

Also known as open addressing method, linear detection and two detection.

Linear detection: When different key values are mapped to the same hash address through the hash function, detect whether the next address of the current address can be inserted, if so, the next address of the current location exists, otherwise, continue to look for the next address, address ++.

For example, if you have a set of keywords {12, 13, 25, 23, 38, 34, 6, 84, 91}, Hash table length is 11, Hash function address(key)=key%11, insert 12(Hash (12)=1), 13(Hash (13)=2), 25 (hash (25) = 3) can be directly inserted, when inserted into the 23, address 1 is occupied, thus probe down the address 1 in turn (detection step length can be according to the circumstances, Such as (hash (23) + 1) % = 2, 11 (hash (23) + 2) % = 3, 11 (hash (23) + 3) % 11 = 4), 4 until detected address, found empty, will be 23 plug it.
Secondary detection: it is an improvement of linear detection. The key values inserted after linear detection are too concentrated, so that the key values cannot be correctly mapped to the address after the hash function, and too concentrated will cause low efficiency in searching and deleting. Therefore, by the method of secondary detection, take the current address plusThe new addresses available are slightly spread out.

, until address 5 is detected and found to be empty, then 23 is inserted into it.

Then the hash method

When a conflict occurs, the address is computed using the second, third, and hash functions until there is no conflict. This increases the computation time.

Open chain method (Hash bucket)

When linear and quadratic probes are used, data is always stored in a finite hash table, which is less efficient when there is too much data. Therefore, the zipper method is used to reduce hash conflicts.

When there is too much data on a chain, we can use red-black trees to reduce the height, keep balance and not overload.

BitMap

A BitMap is a BitMap. A Bit is used to mark the Value of an element, and the Key is the element.

Of all the data structures that have been optimized for performance, the most used are Hash tables. As mentioned in the previous section, the Hash table has a time level O(1) on location lookups. But with a large amount of data, there is not enough memory. Since bitmaps store data in bits, they can save a lot of storage space.

BitMap algorithm idea

On 32-bit machines, an integer, such as int A; The BitMap algorithm uses this idea to sort and query a large amount of data.

Advantages:

Operation efficiency is high, do not compare and shift;
Low memory usage, for example, N=10000000; Only N/8=1250000Byte=1.25 MB memory is required.

Disadvantages: All data cannot be duplicated. That is, duplicate data cannot be sorted or searched.

For example: 00000000000000000000000000010100 marked the 2 and 4.

Decimal and binary bits require a map that maps decimal numbers to bits. The map mapping table is described in detail below.

The map mapping table

Assuming that the total number of sorts or searches N=10000000, the size of the memory space we need to apply for is int A [1 + N/32], where 32 of a[0] in the memory corresponds to the decimal number 0-31, and the BitMap table is shown in the following order:

a[0]———>0-31
a[1]———>32-63
a[2]———>64-95
a[3]———>96-127
.

How to convert a decimal number to the corresponding bit. Here is how to convert a decimal number to the corresponding bit using displacement.

In a[0], the remainder of the decimal number N to 32 can be converted to the subscript corresponding to the array A. When n=24, then n/32=0, then 24 is subscript 0 in array A. If n=51, then n/32=1, then 51 corresponds to the subscript 1 in array A. Similarly, we can calculate the subscript 0- n in array A.
0-n corresponds to the number in 0-31: decimal 0-31 corresponds to 0-31, and 32-63 corresponds to 0-31, that is, given a number N, the number in 0-31 can be obtained by modulo 32.
The corresponding 32bit bit is 1 by shifting 0-31.

BitMap application

The sorting

If we want to sort 5 elements (4,7,2,5,3) from 0 to 7 (assuming that the elements are not duplicated), we can use a bitmap to sort. To represent eight numbers, we need only eight bits (1Bytes). First, we create a space of 1Byte and set all the bits in this space to 0.

Iterate over the Bit area and output the number of the bits (2,3,4,5,7) of which the Bit is one, thus achieving the purpose of sorting, time complexity O(n).

Quick go to heavy

Find the number of non-repeating integers in 250 million integers. There is not enough memory space to hold these 250 million integers. There is not enough memory space to hold 250 million integers, which we can quickly think of as bitmaps. The next key question is how to design our bitmap to represent the states of these 250 million numbers. In fact, this problem is very simple, there are only three states of a number, respectively, there is no existence, only one, there is repetition. Therefore, we need only 2bits to store the state of a number. Suppose we set a number that exists once to 00, once to 01, and twice or more to 11. We’ll probably need a few tens of megabytes of storage.

The next task is to iterate over the 250 million digits, changing them to 01 if the corresponding status bit is 00. If the corresponding status bit is 01, change it to 11. If it is 11, the corresponding transition bit remains the same.

Finally, we make statistics of the state bits with 01, and then we get the number of non-repeating digits with time complexity O(n).

A quick query

Bitmaps can also be used for quick queries, in which case only one bit is needed for a number, 0 for nonexistence and 1 for existence. Suppose the above question is changed to how to quickly determine whether a number is sufficient to exist in the above set of 250 million numbers.

As before, we first iterate through all the numbers and then change the corresponding transition bit to 1. After traversal, the query is followed. Since our BitMap is stored consecutively (integer array, 32bits per array element), we are actually using a bucket idea. An array element can store 32 state bits. Divide the number to be queried by 32 to locate the corresponding array element (bucket), and then remainder (%32) to locate the corresponding state bits. If it is 1, it means that the changed number exists. Otherwise, the number does not exist.

Bloom filter

Sometimes using a BitMap alone is not enough. If the amount of data is large enough, such as 64-bit data, then use a BitMap? Storage size required:

1 pb = 1024 TB, 1 TB = 1024 gb. How big is the Exabyte (Exabyte), the unit of statistical data in computer science? One Exabyte =1024PB. At this scale, bitmaps are beyond the capabilities of human hardware. So the advantage of bitmaps is that the spatial complexity does not increase with the number of elements in the original set, and the disadvantage of bitmaps is that the spatial complexity increases linearly with the largest element in the set.

So next, we’ll introduce another well-known industrial implementation, the Bloom Filter. If Bitmap maps every possible integer value by direct addressing, which is equivalent to using a hash function, then The Bloen filter introduces K (k>1) and K (k>1) mutually independent hash functions to ensure the completion of element weight determination in a given space and misjudgment rate. The following figure shows the Bloom filter at k=3.

One application of bloom filters is caching avalanches.

conclusion

This paper first explains the concept and application of hash table. The Hash table actually provides a one-to-one mapping for every possible number, with each element having its own exclusive space, provided by the Hash function. The Hash table can even keep track of the number of occurrences of each element, which allows for more complex functionality.

Our requirement is that each element in the set has its own space and that we can find a way to map to that space. For our problem, a Boolean is enough, or 1 bit is enough, and we just want to know if an element has ever been present. If you allocate 1 bit for each of the possible values of a 32bit int, the memory required for all possible values of a 32bit int is:. This leads to the BitMap algorithm. We introduce the idea of BitMap algorithm and some applications, including sorting, deduplication, query and other applications, BitMap in these large data volume applications are very efficient. Bloom Filter can be seen as an extension of BitMap. An algorithm used to determine if a map is duplicated with a certain amount of error. About the specific application details of Bloom filter, the content is more, will be introduced in the next article.

Finally, welcome to buy the author’s new book “Spring Cloud Advanced Microservices Architecture”.

reference

Hash table (closed hash, zipper method – hash bucket)
BitMap
Big data processing -Bitmap