This article is also published on my personal blog. If you reproduce it, please credit the source.

HashMap is one of the most important topics in Java interviews, and there are countless articles on how it is implemented. However, after reading through most of them, I found that they analyze the HashMap source code without ever mentioning the concept of a hash table. This leads to a strange phenomenon: many people can recite how a HashMap works but have no idea what a hash table is. In many cases, the interviewer will not ask directly how HashMap is implemented, but will instead throw out a trio of questions:

Get one wrong and you start over! What is a hash table? What is a hash conflict? How do you handle hash conflicts?

For those who are not familiar with hash tables, these three questions hit a real blind spot. To avoid that embarrassment, let's learn about hash tables.

What is a hash table?

Before we answer this question, let's consider another one: how do we find a given element in an unordered linear table?

Notice that this is an unordered linear table, which means the position of the element we are looking for is random. In that case, we have to traverse the table and compare each element with the target, so finding the element takes O(n) time. With O(n) time complexity, searching through large amounts of data is a very expensive operation. Is there a data structure in which there is a direct correspondence between an element and its location? If so, we could find an element's location directly from the element itself, reducing the lookup time complexity to O(1) and greatly improving search efficiency. Such a data structure does exist, and it is what we are going to talk about today.
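
For instance, here is a minimal sketch of that O(n) linear search (the class and method names are just for illustration):

```java
public class LinearSearch {
    // O(n) lookup: in the worst case every element must be examined
    static int linearSearch(int[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) {
                return i; // found: return the element's position
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        int[] data = {19, 24, 6, 33, 51, 15};
        System.out.println(linearSearch(data, 15)); // 5: all six elements were visited
    }
}
```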

Let’s first look at the definition of a hash table:

A hash table (also known as a hash map) is a data structure that maps a set of keys into a limited, contiguous range of addresses according to a preset mapping function f(key), and uses the "image" of each key within that address range as the element's storage location in the table. This mapping process is called hashing; the mapping function f(key) is called the hash function; and the storage location obtained through the hash function is called the hash address.

Definitions are always dense and hard to digest. Simply put, a hash table is a data structure that scatters a set of data into an array through a mapping function f(key). In a hash table, there is a mapping relationship between each element's key and its storage position, so we can quickly find an element's position in the table through f(key).

For example, suppose we have the data set [19,24,6,33,51,15], which we hash into an array of length 11. Each key is divided by the array length, and the remainder is used as the element's storage position in the array; that is, f(key) = key % 11. We then get a hash table like the following:

At this point, if we want to find 15 in this table, all we have to do is take 15 modulo 11 to get the position where 15 is stored in the array. As you can see, hash tables are very efficient at finding elements.
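
To make the example concrete, here is a minimal sketch in Java of the table above, assuming (as in this data set) that no two keys map to the same index:

```java
public class SimpleHashTable {
    private final Integer[] slots = new Integer[11];

    // f(key) = key % 11 maps each key to an index in the array
    private int hash(int key) {
        return key % 11;
    }

    public void put(int key) {
        slots[hash(key)] = key; // assumes no conflicts for this data set
    }

    public boolean contains(int key) {
        Integer stored = slots[hash(key)];
        return stored != null && stored == key; // O(1) lookup
    }

    public static void main(String[] args) {
        SimpleHashTable table = new SimpleHashTable();
        for (int key : new int[]{19, 24, 6, 33, 51, 15}) {
            table.put(key);
        }
        System.out.println(table.contains(15)); // true: 15 % 11 = 4
        System.out.println(table.contains(20)); // false: slot 9 is empty
    }
}
```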

What is a hash conflict?

In the last section, we used a very simple example to explain what a hash table is. The data set in that example has only six elements. Suppose we insert a few more elements, so the data becomes [19,24,6,33,51,15,25,72]. The new element 25 modulo 11 gives 3, and there is no problem storing it at position 3. But 72 modulo 11 gives 6, and position 6 is already occupied by another element. So where should 72 go? This is what we call a hash conflict. The formal definition of a hash conflict is:

Different keys may produce the same hash address; that is, key1 ≠ key2 but f(key1) = f(key2). We call this phenomenon a hash conflict, also known as a hash collision.

In general, hash conflicts can only be reduced, not completely avoided. A hash function is a mapping from a key set to an address set. The key set is usually quite large; in theory it includes every possible key, while the address set contains only the addresses within the hash table. This makes hash conflicts inevitable.

1. How to reduce hash conflicts?

Although hash conflicts are inevitable, we should minimize them as much as possible, and a good hash function can effectively reduce their occurrence. So what makes a good hash function? In general, a good hash function maps any key in the key set to any address in the address set with equal probability.

The commonly used methods of constructing hash functions are as follows:

(1) Division-remainder method

We have already touched on this method above. The hash address is the remainder of the key divided by some number p no larger than the table length m; that is, f(key) = key % p, where p ≤ m.

(2) Direct addressing method

Direct addressing takes the value of the key itself, or a linear function of the key, as the hash address: f(key) = key, or f(key) = a * key + b, where a and b are constants.

(3) Digit analysis method

If the keys are numbers in some base (for example, decimal numbers in base 10) and all possible keys in the hash table are known in advance, several digits of each key can be selected to form the hash address.

Of course, there are more ways to construct a hash function than the ones listed above. The point to remember is that choosing an appropriate hash function can effectively reduce hash conflicts.
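
For illustration only, the three construction methods might look like this in Java (the parameter choices, such as taking the last two decimal digits for digit analysis, are assumptions for the example):

```java
public class HashFunctions {
    // (1) Division-remainder: f(key) = key % p, with p <= table length m
    static int divisionHash(int key, int p) {
        return key % p;
    }

    // (2) Direct addressing: f(key) = a * key + b; a and b are constants
    // chosen so that every key lands inside the address range
    static int directHash(int key, int a, int b) {
        return a * key + b;
    }

    // (3) Digit analysis: select certain digits of the key, here the last
    // two decimal digits; only sensible when all possible keys are known
    static int digitHash(int key) {
        return key % 100;
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(72, 11)); // 6
        System.out.println(directHash(5, 2, 3));  // 13
        System.out.println(digitHash(19_724));    // 24
    }
}
```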

2. How to handle hash conflicts?

Although we can reduce hash conflicts by choosing a good hash function, they are still inevitable. So what should we do when we run into one? Let's look at several common ways of handling hash conflicts.

(1) Open addressing method

Open addressing means that when an address conflict occurs, we continue to probe the other storage slots of the hash table in some fixed order until an empty slot is found.

Let's use the example from the beginning of this section to show how open addressing handles conflicts. 72 modulo 11 is 6, but position 6 is already occupied by another element, so we probe position 7 (6 + 1); position 7 is also occupied, so we add 1 again and get 8; position 8 is still occupied, so we add 1 again and get 9. Position 9 is empty, so 72 is placed there, and we get the following hash table:

Probing methods like the one above are called linear probing. Besides linear probing, there is also quadratic probing, in which the increment d added to the original hash address takes the values ±1², ±2², ±3², …, ±m². With quadratic probing, 72 gets the hash address 5 (position 6 + 1² = 7 is occupied, so we try 6 − 1² = 5, which is empty), as shown in the following figure:
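
As a rough illustration of the linear-probing variant described above, here is a minimal sketch in Java (the class name and the fixed table length of 11 are just for this example):

```java
public class LinearProbingTable {
    private final Integer[] slots = new Integer[11];

    public void put(int key) {
        int index = key % 11;
        // probe forward one slot at a time until an empty slot is found
        // (assumes the table never fills up completely)
        while (slots[index] != null) {
            index = (index + 1) % 11; // wrap around at the end of the array
        }
        slots[index] = key;
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable();
        for (int key : new int[]{19, 24, 6, 33, 51, 15, 25, 72}) {
            table.put(key);
        }
        // 72 % 11 = 6; slots 6, 7 and 8 are taken, so 72 ends up at index 9
        System.out.println(java.util.Arrays.asList(table.slots).indexOf(72)); // 9
    }
}
```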

(2) Rehash

The rehash method prepares several different hash functions; when a hash conflict occurs, we compute the next hash function, and so on, until the conflict no longer occurs.

(3) Establish a public overflow area

A dedicated overflow table is maintained, and values that cause a hash conflict are placed in the overflow table.

(4) Chained address method

The chained address method stores conflicting elements in a linked list: all elements whose hash address is i are inserted into the same list. A new element can be inserted at the head of the list (head insertion) or at the tail (tail insertion). Taking the data set [19,24,6,33,51,15,25,72] as an example, handling hash conflicts with the chained address method gives the hash table shown in the figure below:

We can add a few more elements to this data set to get a new one: [19,24,6,33,51,15,25,72,37,17,4,55,83]. Using the chained address method, we get the following hash table:
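
As a rough sketch of the chained address method (the names and the fixed length of 11 are just for illustration), each array slot holds a linked list of all keys that hash to that index:

```java
import java.util.LinkedList;

public class ChainedHashTable {
    // each slot holds a linked list of all keys whose hash address is i
    @SuppressWarnings("unchecked")
    private final LinkedList<Integer>[] buckets = new LinkedList[11];

    public void put(int key) {
        int index = key % 11;
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }
        buckets[index].addLast(key); // tail insertion; addFirst would be head insertion
    }

    public boolean contains(int key) {
        LinkedList<Integer> bucket = buckets[key % 11];
        return bucket != null && bucket.contains(key);
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable();
        for (int key : new int[]{19, 24, 6, 33, 51, 15, 25, 72}) {
            table.put(key);
        }
        System.out.println(table.contains(72)); // true: 72 chains after 6 at index 6
    }
}
```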

3. Disadvantages and optimization of the chained address method

In the last section we looked at several common ways of handling hash conflicts, among which the chained address method is the most commonly used; HashMap, for example, is a hash table structure based on the chained address method. Although it is a good way to deal with hash conflicts, it can cause problems in some extreme cases. For example, suppose we have the data set [48,15,26,4,70,82,59] and hash it into an array of length 11, getting the following result:

You can see that the hash table has degenerated into a linked list. Looking up an element in such a structure takes O(n) time again, which is clearly not what we expected. Therefore, when a linked list in the hash table grows too long, we need to optimize it. As we know, the query efficiency of a binary search tree is much higher than that of a linked list, so when a list in the hash table gets too long we can convert it into a red-black tree. Optimizing the data set above gives the following result:

A red-black tree is a self-balancing binary search tree whose lookups run in O(log n) time. This optimization improves the query efficiency of the hash table.
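
HashMap has used exactly this optimization since JDK 8. The thresholds below are declared in the OpenJDK java.util.HashMap source: a bucket is treeified only when its chain reaches TREEIFY_THRESHOLD and the table already has at least MIN_TREEIFY_CAPACITY slots (otherwise the table is resized instead):

```java
// Constants from java.util.HashMap (JDK 8+)
static final int TREEIFY_THRESHOLD = 8;     // chain length at which a bucket becomes a red-black tree
static final int UNTREEIFY_THRESHOLD = 6;   // a tree shrinks back to a list below this size (during resize)
static final int MIN_TREEIFY_CAPACITY = 64; // minimum table size before any treeification happens
```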

Expansion and Rehash of hash tables

If the length of the hash table stays fixed, then as more and more elements are inserted, the probability of hash conflicts increases and search efficiency drops accordingly. This means that the performance of a hash table depends not only on the hash function and the conflict-handling method, but also on the table's load factor.

We call the ratio of the number of elements in the hash table to the length of the hash table the load factor: α = (number of elements in the hash table) / (length of the hash table).

Obviously, the smaller α is, the lower the probability of hash conflicts and the more efficient lookups become; but decreasing α also means lowering the hash table's space utilization. These two goals pull in opposite directions, so there is no perfect answer. As a compromise, the maximum load factor is generally chosen between 0.65 and 0.9. In HashMap, for example, the load factor is 0.75: once HashMap's load factor exceeds 0.75, the hash table must be expanded to reduce hash conflicts, for example by doubling its length.

Note that expansion does not enlarge the original array; it allocates a new array twice the size of the original. After expansion, the original data must therefore be hashed again from the old array into the new one. This process is called rehash.

Next, we again use the data set [19,24,6,33,51,15,25,72,37,17,4,55,83] to demonstrate expansion and rehash. Assume that the initial length of the hash table is 11 and the maximum load factor is 0.75. Before expansion, the data is inserted as follows:

When we insert the ninth element, the load factor exceeds 0.75, which triggers the expansion operation. (For the sake of the drawing, I expanded the array to 18.) After expansion, hashing the data set [19,24,6,33,51,15,25,72,37,17,4,55,83] gives the result shown in the figure below:

You can see that elements end up in different locations before and after expansion. The rehash operation re-hashes all the data stored before expansion, which involves moving a lot of elements and is very performance-intensive. Therefore, we should try to avoid rehashing in development. For example, we can estimate the number of elements in advance and specify the length of the hash table accordingly, which effectively reduces rehashing.
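
With HashMap this can be done through its HashMap(int initialCapacity) constructor; the expected element count of 1000 below is just an illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeExample {
    public static void main(String[] args) {
        // If we expect about 1000 entries and the load factor is 0.75,
        // an initial capacity of at least 1000 / 0.75 ≈ 1334 avoids any resize.
        Map<Integer, String> map = new HashMap<>(1334);
        for (int i = 0; i < 1000; i++) {
            map.put(i, "value-" + i);
        }
        // No rehash occurs during these inserts, because the capacity is
        // rounded up to the next power of two (2048), and 2048 × 0.75 ≥ 1000.
    }
}
```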

Summary

The hash table is a very important topic in data structures. This article has explained the related concepts in detail so that we have a clear picture of how hash tables work. A hash table makes up for the low search efficiency of linear tables and trees: under ideal circumstances, it can reduce the time complexity of finding an element to O(1). However, because hash conflicts exist, the search efficiency of a hash table rarely reaches that ideal. In addition, expansion and rehash operations have a great impact on storage performance. So storing data in a hash table is not a perfect solution; but wherever high-performance lookup matters, the hash table is still our best choice.

Finally, understanding hash tables is a great help in understanding HashMap. After all, HashMap itself is based on a hash table.