This is the 20th day of my participation in the August Text Challenge.More challenges in August

One, foreword

We were asked to implement HashMap in an interview. We were asked to implement HashMap based on an array and a linked list. Why do we say that HashMap is not implemented as an array + linked list? In fact, this is caused by your narrow understanding. A real HashMap is a generalized concept. The HashMap we commonly refer to is a HashMap implementation in Java. This is just one of all the ways to implement HashMap.

The generalized HashMap is addressed into Open Addressing HashMap and Closed Addressing HashMap. The Open Addressing is subdivided into Linear Probing, Quadratic Probing and Double Hashing according to detection techniques. In Open Addressing, there is also the concept of Primary Clustering and Secondary Clustering.

See these concepts must be a lot of students have been dizzy, is not found a lot of concepts have not heard. The concept of generalized HashMap and the underlying implementation principle will be explained in detail.

2. Principle of generalized HashMap

1. Addressing mode

First, let’s talk briefly about what addressing is. Addressing should be familiar to those familiar with Java HashMaps. In Java HashMap, the main structure is array + linked list. The process by which we Hash out where a key should be placed in an array is called addressing. Addressing is how to find the location of a Key in an array. Because all hashMaps are quick to find data based on quick location of arrays.

The generalized HashMap divides HashMaps into two main categories based on Addressing methods: Open Addressing and Closed Addressing. We’ll take a closer look at this.

A, Closed Addressing

This approach to addressing is easy to understand. Because Addressing in Java HashMap is Closed Addressing. Any element (the element’s key) that has been hashed to a location in the array must be in that location. When Hash conflicts occur, we can use extra storage to resolve them, as in the case of linked lists in Java HashMap.

B, Open Addressing

Development addressing is in contrast to closed addressing. That is, when we hash an element (its key) to a location in the array, we don’t have to put the element in that location. Instead, you can use other addressing methods (we’ll see later) to put it somewhere else in the array.

The difference between the two addressing methods can be clearly seen in the figure below (where the numbers 0-9 indicate the position of the array and the array in the circle represents the corresponding element).

C, Open Addressing vs Closed Addressing

From the above analysis we can see that there is a clear difference between Open Addressing and Closed Addressing. The following is a basic summary:

Compare the item Open Addressing Closed Addressing instructions
Elements of capacity Maximum array size The maximum is much larger than the array size Expansion scenarios are not considered
Data aggregation Data aggregation problems exist There is no data aggregation problem The instructions are data aggregation refer to the following section
Cache row utilization You can make full use of cached rows Cached rows cannot be utilized Using cached rows can improve read performance
Space occupied Array size only It takes up more space than the size of the array Regardless of the size of the element itself
High load factor performance Load factor=0.9 performs extremely well Load factor=0.75 requires expansion Load Factor: The ratio of array elements to the size of the array

Description:

  • Closed Addressing is large because it stores elements in extra space, such as linked lists. You could say upper limit
  • Data aggregation (described later) causes slow data reading
  • Open Addressing data is placed in arrays, and array structures can take full advantage of CPU power to be loaded into the same cache line to improve reading performance
  • China Closed admb has additional structural space such as linked lists, which leads to more memory usage.
  • Closed Addressing uses linked lists to resolve Hash conflicts, resulting in long lists under high load factors and poor query performance due to not being able to leverage cached rows. Open Addressing, on the other hand, keeps data in an array, makes use of cached rows, and a good Open Addressing implementation balances data aggregation so that it performs well at load Factor =0.9.

From the above comparison we can see that Open Addressing has many advantages. The implementation of Java HashMap is familiar with Closed Addressing and there are many articles on the web that explain Java HashMap. Therefore, this article does not cover Closed Addressing further. Next we will focus on the somewhat unfamiliar Open Addressing HashMap.

2. Detection techniques for Open Addressing

Because Open Addressing doesn’t have extra space to resolve hash conflicts, it needs to find an empty position in the array for this element when a hash conflict exists. This process of finding empty positions when a hash conflict is encountered is called probing. The technique used here is called addressing. At present, the commonly used addressing techniques include Linear Probing, Quadratic Probing and Double Hashing. Next, we’ll explain alignment in detail.

A, Linear Probing

Linear detection is simpler. When a hash conflict occurs while writing an element, we simply check to see if the next one at that location is available and insert the element if it is. Otherwise, continue looking for the next one at that location and keep looping until you find the right location or trigger the Rehash.

The formula for this detection technique to calculate the available position is as follows (I is the initial hash position) :

  • I     + 1
  • I     + 2
  • I     + 3

There is a problem with this approach to detection. It causes a large number of elements to cluster together, forming a continuous chain (contiguous on physical addresses, not a linked list). When the data we’re looking for is in the chain, we keep looking for it one by one. The longer the chain, the less efficient the query. This Clustering of data is called Clustering, or Primary Clustering.

Quadratic detection B. Quadratic detection

Quadratic detection is also simpler, in that instead of directly +1, you add quadratic every time you calculate the available location.

The formula for this detection technique to calculate the available position is as follows (I is the initial hash position) :

  • i  +   1 * 1
  • i    + 2 * 2
  • i  +   3 * 3

This approach, though not likely to lead to Primary Clustering, uses this sort of exponential jumping to find available locations. However, another type of chain can also be created, if the hash conflict is severe (a large number of elements hash to the same location), then these elements will also form a chain (physically discontinuous), which will still lead to slow queries during lookup. This kind of data link is also a kind of data aggregation. Such data Clustering is known as Secondary Clustering.

C, Double Hashing

The second Hash probe, as its name implies, computs the next available position using another Hash when a Hash conflict occurs.

The formula for calculating the available position for this detection technique is as follows (I is the initial hash position and j= another hash(key) value) :

  • I   + 1     x   j
  • I   +     2 x   j
  • I   +     3 x   j

With this detection technology, Primary and Secondary Clustering will not appear relatively. You can think about the specific reasons.

D. Linear Probing vs Quadratic Probing vs Double Hashing

Detection technology Conflicting slot calculation modes advantages disadvantages
Linear Probing – I     + 1
  • I     + 2
  • I   +   3 | good use cache line | existence primary clustering problem secondary | clustering problem

| Quadratic Probing | -i   +   1 * 1

  • i    + 2 * 2
  • i  + 3 * 3   | have clearance can avoid primary clustering | exist secondary clustering problem is unable to use the cache line |

| Double Hashing | j = hash2 (key) – I   + 1     x   j

  • I   +     2 x   j
  • I   +     3 x   j | to avoid the primary clustering problem to avoid the secondary clustering problem | can’t use the cache line performance is poor (more than a hash) |

3. Clustering

Element aggregation is when only the elements form a chain in an array, which can be physically continuous or discontinuous. This chaining can lead to performance degradation in numerical queries. As in Java HashMap, the query efficiency of a HashMap decreases as the list becomes longer.

A, Primary Clustering

That is, the aggregation of elements occurs in physical continuity, that is, the elements in the data are next to each other, with no spacing between them.

B, Secondary Clustering

Represents the elements on the array that form a physically discontinuous, but in the detection algorithm continuous chain according to the detection algorithm. That is, when the available vacancies are detected by the detection technology, it is found that the positions calculated for many times are occupied, which forms a physically discontinuous but logically continuous chain.

(3) Checking the addition and deletion of Open Addressing

The addition and deletion of Open Addressing is quite different from the Closed Addressing we are all familiar with. In the following part, we will explain the addition and removal of the Open Addressing Hash Map based on the Linear Probing.

1. Delete elements

There are three main scenarios for removing elements in Open Addressing, as shown below:

Scenario 1: Elements stand alone (Element 12 in figure)

Since the element exists independently in the array, we simply remove the interface from the array.

Scenario 2: Elements at the end of Clustering (element 9 in the figure)

This scenario is the same as scenario 1, we can simply delete it from the array.

Scenario 3: Elements in Clustering (Element 6 in figure)

This scenario is different from the previous one. If we just delete element 6 from the array. The problem is that the next query will not be able to find 7 and 8. Because 6 is deleted, the query is terminated when it reaches an empty position (see subsequent query logic).

There are two solutions to this problem:

Tombstones: Instead of removing a deleted element from the array, it is marked as unavailable and can still be queried back to elements 7 and 8. The tombstones that are thus marked for removal are called tombstones.

Backward shift deletion: After element 6 is deleted, all the following elements in the chain move forward one space. So elements 4, 5, 7 and 8 are right next to each other. It does not affect subsequent queries.

These two deletion methods have their own advantages and disadvantages.

Delete the way Execution speed Query Performance Impact The performance test
tombstones Direct marking, fast speed Deleting elements If the elements are not deleted, the link length is very long, which affects the query performance Robin Hood gravestone
backward shift deletion You have to move elements slowly Delete After an element is deleted, query performance is not affected Robin Hood backward Shift

PS: The tombstone method has low performance, and the backward Shift deletion method is directly used, which has high performance. The screenshot can be seen in the performance test links linking two different deleted elements of the Robin Hood HashMap.

2. Find elements

Finding an element is simpler, first locating it directly to the location of the array based on the hash, and then comparing the key equal. If the key does not match, use the corresponding detection technology to continue to detect the next position. If tombstones are found in the process of exploration, they will be skipped, i.e. looking for the next one. The loop will either find the corresponding element or end up in an empty position.

3. Insert elements

First hash to the position of the array, and if the position is empty or a tombstone, place the element there to end the insertion. If it is not empty or tombstone, it will continue to search for the next position through the corresponding detection technology, and so on until the empty position or tombstone position is found, then the element is inserted. Or rehash is triggered to rehash the search.

In the next installment we will look at specific implementation examples of Open Addressing HashMaps: ThreadLocal and Robin Hood HashMap.

Four, practice

If you have any questions or comments about this article, please add lifeofCoder.