There were many details I did not pay much attention to when I first studied HashMap. Reviewing it, I found plenty of questions worth examining closely, and they ultimately come back to mathematics. For example, why is the load factor of HashMap 0.75?
This article mainly covers the following:
- Why does a HashMap need a load factor?
- What methods are there for resolving conflicts?
- Why is the load factor 0.75 rather than 0.8 or 0.6?
Why does a HashMap need a load factor?
The underlying structure of a HashMap is a hash table, a structure that stores key-value pairs. Some calculation is needed to determine where a given entry is stored in the hash table; a simplified sketch of that calculation follows.
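As an illustration, the sketch below mirrors the way JDK 8's HashMap derives a bucket index: it spreads the high bits of hashCode() and then masks with (table length − 1). The demo class and method names here are my own, not JDK code.

```java
// Simplified sketch of how a hash table maps a key to a bucket index.
// This mirrors the idea in JDK 8's HashMap: spread the high bits of
// hashCode(), then mask with (table length - 1), which works because the
// table length is always a power of two.
public class BucketIndexDemo {

    // Spread the higher bits of the hash down so they influence the index.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Map the spread hash onto a table of length n (n is a power of two).
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16; // default HashMap capacity
        System.out.println(indexFor(hash("load factor"), n)); // prints a slot in [0, 15]
    }
}
```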
A HashMap is a data structure that pays some cost at insertion time (computing the hash and resolving conflicts) in exchange for fast queries.
But this data structure is prone to two problems: (1) if space utilization is high, then after the hash algorithm computes a storage location, that location will often already hold data (a hash conflict); (2) if the array capacity is enlarged to avoid hash conflicts, space utilization becomes low.
The load factor measures how full the hash table is:
Load factor = number of elements stored / length of the hash table
The larger the load factor, the more elements are filled in and the higher the space utilization, but the greater the chance of conflicts.
The smaller the load factor, the fewer elements are filled in and the smaller the chance of conflicts, but more space is wasted, and expansion (rehash) operations become more frequent.
The greater the chance of conflicts, the higher the cost of looking up the data we need. Therefore, a balance and compromise must be found between “chance of conflict” and “space utilization”.
From this we can also see that the main factors affecting search efficiency are:
- Does the hash function distribute the data evenly across the hash table?
- How are conflicts handled?
- How is the load factor of the hash table chosen?
This article mainly discusses the latter two questions.
What are the methods of conflict resolution?
1. Open addressing method
Hi = (H(key) + di) mod m (i = 1, 2, …, k, k ≤ m−1), where H(key) is the hash function, m is the table length, di is the increment sequence, and i is the number of conflicts that have occurred so far. Depending on how the step size di is chosen, open addressing comes in three variants:
Linear probing: di = 1, 2, 3, …, m−1
Put simply, probing starts at the conflicting position and moves forward with a step size of 1 until an empty slot is found; if no empty slot is found after a full cycle through the table, the table is full. It is like looking for a place to eat at meal time: you walk down the street and check restaurant after restaurant until you find one with a free table.
Quadratic probing: di = ±1², ±2², ±3², …, ±k² (k ≤ m/2)
Compared with linear probing, this is equivalent to taking steps of di = i² and iterating until an empty position is found. In the restaurant example, instead of checking door to door, you now call the i²-th restaurant and ask whether a table is free.
Pseudo-random probing: di = a pseudo-random number sequence
This simply uses a random number as the step size. In the restaurant example, this time you pick a restaurant to ask entirely on a whim.
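To make linear probing concrete, here is a minimal toy table. It is only a sketch (java.util.HashMap does not use open addressing), and the class name and fixed capacity are illustrative assumptions.

```java
// Toy open-addressing table using linear probing (step size 1).
// Purely illustrative; java.util.HashMap does NOT use open addressing.
public class LinearProbingTable {
    private final Object[] keys;
    private final Object[] values;
    private final int capacity;

    public LinearProbingTable(int capacity) {
        this.capacity = capacity;
        this.keys = new Object[capacity];
        this.values = new Object[capacity];
    }

    public boolean put(Object key, Object value) {
        int start = (key.hashCode() & 0x7fffffff) % capacity;
        for (int i = 0; i < capacity; i++) {       // probe at most m slots
            int slot = (start + i) % capacity;      // di = 1, 2, 3, ...
            if (keys[slot] == null || keys[slot].equals(key)) {
                keys[slot] = key;
                values[slot] = value;
                return true;
            }
        }
        return false;                               // table is full
    }

    public Object get(Object key) {
        int start = (key.hashCode() & 0x7fffffff) % capacity;
        for (int i = 0; i < capacity; i++) {
            int slot = (start + i) % capacity;
            if (keys[slot] == null) return null;    // empty slot: key is absent
            if (keys[slot].equals(key)) return values[slot];
        }
        return null;
    }
}
```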
But open addressing has these disadvantages:
- When there are many conflicts, data tends to pile up together, which makes the resulting hash table unfriendly to searches.
- When deleting a node, you cannot simply clear its slot, or you would cut off the search path of synonym nodes that were inserted after it. Deleted nodes can therefore only be marked as deleted rather than actually removed.
- If the hash table fills up, an overflow table must be created to store the extra elements.
2. Rehash
Hi = RHi(key) (i = 1, 2, …, k), where each RHi() is a hash function different from H(). When a synonym causes an address conflict, another hash function is applied to compute a new address, and this is repeated until no conflict occurs. This method is less prone to clustering, but it increases the computation time.
So the disadvantage of rehashing is that it increases the computation time.
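A minimal sketch of the idea, assuming a small sequence of made-up hash functions h, rh1, rh2 (none of them taken from the JDK): on a conflict, the next function in the sequence supplies a new address.

```java
// Sketch of the "rehashing" strategy: when H(key) conflicts, compute a new
// address with a different hash function RH1, then RH2, and so on.
// The hash functions below are arbitrary examples, not JDK code.
public class RehashingDemo {
    static final int M = 16;                  // table length
    static final Object[] TABLE = new Object[M];

    static int h(String key)   { return (key.hashCode() & 0x7fffffff) % M; }
    static int rh1(String key) { return ((key.hashCode() >>> 16) & 0x7fffffff) % M; }
    static int rh2(String key) { return (key.length() * 31) % M; }

    static boolean put(String key) {
        int[] slots = { h(key), rh1(key), rh2(key) };
        for (int slot : slots) {
            if (TABLE[slot] == null) {        // no conflict: store and stop
                TABLE[slot] = key;
                return true;
            }                                  // conflict: try the next hash function
        }
        return false;                          // every hash function conflicted
    }

    public static void main(String[] args) {
        System.out.println(put("example"));   // true on an empty table
    }
}
```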
3. Create a public overflow area
Suppose the range of the hash function is [0, m−1]. Let the vector HashTable[0…m−1] be the base table, where each slot stores one record, and let the vector OverTable[0…v] be the overflow table. Records whose keys do not conflict go into the base table; whenever a conflict occurs, the record is placed in the overflow table, no matter what hash address its hash function produces.
The disadvantage of this approach is that when looking for conflicting data, you need to traverse the overflow table to get the data.
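A rough sketch of the public overflow area; the class and method names are my own choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "public overflow area" strategy: a base table plus a single
// shared overflow list holding every record whose slot was already taken.
public class OverflowAreaDemo {
    private final String[] baseTable;                              // HashTable[0..m-1]
    private final List<String> overflowTable = new ArrayList<>();  // OverTable

    public OverflowAreaDemo(int m) {
        this.baseTable = new String[m];
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % baseTable.length;
    }

    public void put(String key) {
        int slot = hash(key);
        if (baseTable[slot] == null) {
            baseTable[slot] = key;          // no conflict: goes into the base table
        } else {
            overflowTable.add(key);         // conflict: goes into the overflow table
        }
    }

    public boolean contains(String key) {
        if (key.equals(baseTable[hash(key)])) return true;
        return overflowTable.contains(key); // conflicting data requires a scan
    }
}
```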
4. Chain address method (zipper method)
Elements whose positions conflict are organized into a linked list. When new data is added and its hash address conflicts with an element already in the table, it is placed on the linked list at that position.
Advantages of the zipper method:
- Conflict handling is simple and there is no clustering; non-synonyms never conflict, so the average search length is shorter.
- Because the linked-list nodes are allocated dynamically, the zipper method is better suited to cases where the table length cannot be determined before the table is built.
- Deleting a node is easy to implement: simply remove the corresponding node from its linked list.

Drawback: the linked-list pointers take extra storage.
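Here is a minimal chaining sketch, assuming a fixed capacity and no treeification or resizing; it is not the java.util.HashMap implementation, only an illustration of the zipper idea.

```java
import java.util.LinkedList;

// Minimal separate-chaining table: an array of linked lists ("zippers").
// Illustrative only; java.util.HashMap additionally converts long chains
// into red-black trees and resizes based on the load factor.
public class ChainedTable<K, V> {

    private static class Node<K, V> {
        final K key;
        V value;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Node<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int indexOf(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        LinkedList<Node<K, V>> chain = buckets[indexOf(key)];
        for (Node<K, V> node : chain) {
            if (node.key.equals(key)) { node.value = value; return; } // overwrite
        }
        chain.add(new Node<>(key, value));   // conflict or empty bucket: append to chain
    }

    public V get(K key) {
        for (Node<K, V> node : buckets[indexOf(key)]) {
            if (node.key.equals(key)) return node.value;
        }
        return null;
    }
}
```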
From the underlying structure of HashMap we can see that it uses an array combined with linked lists/red-black trees, that is, it implements the hash table with the chained address (zipper) method.
Why is the HashMap load factor 0.75 rather than 0.8 or 0.6?
As we saw above, the underlying structure of a HashMap is a hash table, and the conflict-resolution strategy is the chained address method. By default, the initial capacity of a HashMap is 16. To reduce the probability of conflicts, when the number of elements in the HashMap reaches a critical value, an expansion is triggered, and every element is rehashed and placed into the enlarged container, which is a time-consuming operation.
This critical value is determined by the load factor and the current capacity of the container:
Critical value = DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR
That is, with the default values, 16 × 0.75 = 12, so once more than 12 elements are stored, an expansion is triggered.
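A tiny demo of that arithmetic (the constant names mirror the JDK defaults; the demo class itself is hypothetical):

```java
// Illustration of how the resize threshold is derived from capacity and
// load factor. The constant values match the JDK defaults; the class is a demo.
public class ThresholdDemo {
    static final int DEFAULT_INITIAL_CAPACITY = 16;
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    public static void main(String[] args) {
        int threshold = (int) (DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
        System.out.println(threshold); // 12: the 13th insertion triggers a resize
    }
}
```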
So why was 0.75 chosen as the load factor for HashMap? This is related to a well-known distribution in statistics and probability: the Poisson distribution.
The Poisson distribution is a common discrete probability distribution in statistics and probability theory. It describes the probability distribution of the number of random events occurring per unit time. Interested readers can look it up on Wikipedia or read this article by Ruan Yifeng: Poisson and Exponential Distributions: a 10-minute tutorial [1]
In this formula, P(N(t) = n) = ((λt)^n · e^(−λt)) / n!, the left-hand side is the probability that exactly n events have occurred by time t, with N(t) being the number of events up to time t; on the right-hand side, λ is the average rate (frequency) of events.
The HashMap source code contains this comment:
Ideally, under random hash codes, the frequency of nodes in a hash bucket follows a Poisson distribution whose parameter averages about 0.5 for the default resizing threshold (load factor) of 0.75. Ignoring the variance and taking λt = 0.5, the probability that a bucket holds k elements is P(X = k) = (0.5^k · e^(−0.5)) / k!, which gives:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006

When the linked list in a bucket reaches 8 elements, the probability is about 0.00000006, which makes it practically an impossible event.
So, in fact, the constant 0.5 is the parameter substituted into the Poisson distribution for this calculation, while the load factor 0.75 is the condition for expansion: a HashMap expands when the number of elements divided by the table length reaches 0.75. Under that condition, the chain lengths that result from conflicts occur with exactly the probabilities listed above.
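The listed probabilities can be reproduced with a few lines of Java; this sketch simply evaluates the Poisson formula with λt = 0.5 for k = 0 through 8.

```java
// Reproduces the bucket-length probabilities quoted in the HashMap source
// comment: P(X = k) = e^(-0.5) * 0.5^k / k! with lambda * t = 0.5.
public class PoissonDemo {
    public static void main(String[] args) {
        double lambdaT = 0.5;
        double p = Math.exp(-lambdaT);          // P(X = 0)
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, p);
            p = p * lambdaT / (k + 1);          // derive P(X = k+1) from P(X = k)
        }
    }
}
```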
So why not 0.8 or 0.6?
In addition to the hash algorithm, there are two parameters in a HashMap that affect performance: the initial capacity and the load factor. The initial capacity is the capacity of the hash table when it is created, and the load factor is a measure of how full the hash table can be before its capacity is automatically expanded.
Wikipedia describes the load factor as follows:
For open addressing, the load factor is especially important and should be kept strictly below 0.7-0.8. Above 0.8, CPU cache misses increase sharply. Therefore, some hash libraries that use open addressing, such as the Java system library, limit the load factor to 0.75; beyond that value, the hash table is resized.
When setting the initial capacity, the number of entries the map needs to hold and the load factor should both be taken into account, so as to minimize the number of expansion (rehash) operations. Therefore, when using a HashMap it is generally recommended to set the initial capacity based on the estimated number of entries in order to reduce expansions.
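For example, a common rule of thumb (a convention assumed here, not a JDK requirement) is to choose the initial capacity so that the expected number of entries stays below capacity × 0.75:

```java
import java.util.HashMap;
import java.util.Map;

// Pre-sizing a HashMap so the expected entries fit under the resize threshold.
// The divide-by-0.75 rule of thumb is a common convention, not mandated by the JDK.
public class PresizeDemo {
    public static void main(String[] args) {
        int expectedEntries = 1000;

        // Without pre-sizing: starts at capacity 16 and resizes several times.
        Map<String, Integer> resized = new HashMap<>();

        // With pre-sizing: capacity chosen so expectedEntries <= capacity * 0.75.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        Map<String, Integer> presized = new HashMap<>(initialCapacity);

        for (int i = 0; i < expectedEntries; i++) {
            resized.put("key" + i, i);
            presized.put("key" + i, i);  // no resize happens while filling this map
        }
    }
}
```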
The choice of 0.75 as the default load factor is therefore a compromise between time and space costs.