There are many details I glossed over when I first learned these things. Going back over HashMap, I found quite a few questions worth digging into, and in the end they all come back to mathematics. For example: why is the load factor of HashMap 0.75?

This article mainly covers the following questions:

  • Why does a HashMap need a load factor?

  • What are the ways to resolve hash conflicts?

  • Why must the load factor be 0.75 rather than 0.8 or 0.6?

Why does a HashMap need a load factor?

Under the hood, a HashMap is a hash table, a structure that stores key-value pairs. It needs some computation to determine where in the table a given entry is stored:

```java
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// AbstractMap
public int hashCode() {
    int h = 0;
    Iterator<Entry<K,V>> i = entrySet().iterator();
    while (i.hasNext())
        h += i.next().hashCode();
    return h;
}
```
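The spread hash by itself is not the bucket slot; in the JDK source the index is then computed as `(n - 1) & hash`, where `n` is the table length and always a power of two. A rough sketch of the whole mapping (the helper name `bucketIndex` is illustrative, not a JDK method):

```java
// Sketch: map a key to a bucket index the way HashMap does.
// The table length is always a power of two, so (tableLength - 1) acts as a
// bitmask selecting the low-order bits of the spread hash.
static int bucketIndex(Object key, int tableLength) {
    int h = (key == null) ? 0 : key.hashCode();
    h = h ^ (h >>> 16);              // fold the high bits into the low bits
    return (tableLength - 1) & h;    // same as h % tableLength for powers of two
}
```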

Typical data structures are fast either to query or to insert, but not both; a hash table such as HashMap aims to be a structure that is fast both to insert into and to query.

However, this kind of structure is prone to two problems: (1) if space utilization is high, the hash computation will often land on a position that already holds data (a hash conflict); (2) if the array is enlarged to avoid hash conflicts, space utilization becomes low.

The load factor expresses how full the hash table is.

Load factor = number of elements stored in the table / length of the hash table
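For example, a hash table of length 16 that currently holds 12 entries has a load factor of 12 / 16 = 0.75.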

The larger the load factor, the more elements are stored and the higher the space utilization, but the greater the chance of a conflict;

The smaller the load factor, the fewer elements are stored and the smaller the chance of a conflict, but the more space is wasted, and the number of resize (rehash) operations also increases.

The greater the chance of a conflict, the more often a lookup has to fall back on some secondary search, and so the higher the cost of searching. A balance therefore has to be found between the chance of conflict and space utilization.

From this we can also see that the main factors affecting search efficiency are:

  • Whether the hash function spreads the data evenly across the hash table;

  • How conflicts are handled;

  • How the load factor of the hash table is chosen.

This article focuses on the latter two questions.

What are the ways to resolve conflicts?

1. Open addressing

`Hi = (H(key) + di) MOD m, where i = 1, 2, ..., k (k ≤ m-1)`

H(key) is the hash function, m is the length of the hash table, di is an increment sequence, and i is the number of collisions that have already occurred. Depending on how the step size di is chosen, open addressing comes in three variants:

1.1 Linear probing: di = 1, 2, 3, ..., m-1

Simply put, probing starts from the conflicting position with a step size of 1 and loops until an empty position is found; if the loop finishes without finding one, the container is full. It is like walking down the street looking for a restaurant: you go from door to door until you find one with a free table.
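A minimal sketch of linear probing (fixed capacity, no resizing or deletion; the class name `LinearProbingTable` is illustrative):

```java
// Minimal open-addressing table using linear probing (di = 1).
class LinearProbingTable {
    private final Object[] keys;
    private final Object[] values;

    LinearProbingTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    void put(Object key, Object value) {
        int m = keys.length;
        int start = Math.floorMod(key.hashCode(), m);
        for (int i = 0; i < m; i++) {
            int slot = (start + i) % m;                  // step forward by 1 on each collision
            if (keys[slot] == null || keys[slot].equals(key)) {
                keys[slot] = key;
                values[slot] = value;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    Object get(Object key) {
        int m = keys.length;
        int start = Math.floorMod(key.hashCode(), m);
        for (int i = 0; i < m; i++) {
            int slot = (start + i) % m;
            if (keys[slot] == null) return null;         // an empty slot ends the probe path
            if (keys[slot].equals(key)) return values[slot];
        }
        return null;
    }
}
```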

1.2 Quadratic probing: di = ±1², ±2², ±3², ..., ±k² (k ≤ m/2)

Compared with linear probing, this takes a step of di = i² and repeats until an empty position is found. In the restaurant example, instead of checking door to door, you look up the i²-th restaurant on your phone and ask whether it has a free table.

1.3 Pseudo-random probing: di = a pseudo-random number sequence

Here the step size is simply a random number. In the example above, you just pick a restaurant on a whim and ask whether it has a free table.

But open addressing has these disadvantages:

  • With this method, when there are many conflicts the data easily piles up (clusters), which is unfriendly to lookups;

  • A node cannot be deleted simply by clearing its slot, because that would cut off the lookup path of the synonym nodes stored after it in the hash table. A node can therefore only be marked as deleted, not actually removed.

  • If the space in the hash table is full, you need to create an overflow table to store the extra elements.

2. Re-hashing

`Hi = RHi(key), where i = 1, 2, ..., k`

RHi() is a hash function different from H(). When a synonym causes an address conflict, another hash function is applied to compute a new address, and this is repeated until no conflict occurs. This method does not produce clustering easily, but it increases computation time.

So the disadvantage of the rehashing method is that it increases the calculation time.
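One common concrete form of this idea is double hashing, where a second hash function determines the probe step. A rough sketch, assuming a table length m ≥ 2 and an implementer-chosen second hash:

```java
// Double hashing sketch: the i-th probe address Hi is derived from two
// hash functions, so colliding keys follow different probe sequences.
static int probe(Object key, int attempt, int m) {
    int h1 = Math.floorMod(key.hashCode(), m);
    // Second, independent hash; kept in [1, m-1] so the step is never 0.
    int h2 = 1 + Math.floorMod(key.hashCode() * 31, m - 1);
    return Math.floorMod(h1 + attempt * h2, m);   // attempt = number of collisions so far
}
```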

3. Create a public overflow area

Assume the range of the hash function is [0, m-1]. Let the vector HashTable[0..m-1] be the base table, each slot storing one record, and let the vector OverTable[0..v] be the overflow table. Records go into the base table by keyword; whenever a conflict occurs, the record is placed in the overflow table regardless of the hash address produced by its hash function.

The drawback of this approach is that looking up conflicting data requires traversing the overflow table.
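A rough sketch of the idea (class and field names are illustrative; a real implementation would also handle resizing):

```java
import java.util.ArrayList;
import java.util.List;

// Public overflow area: colliding entries all go into one shared overflow table.
class OverflowHashTable {
    private final Object[] baseKeys;
    private final Object[] baseValues;
    private final List<Object[]> overflow = new ArrayList<>(); // {key, value} pairs

    OverflowHashTable(int m) {
        baseKeys = new Object[m];
        baseValues = new Object[m];
    }

    void put(Object key, Object value) {
        int slot = Math.floorMod(key.hashCode(), baseKeys.length);
        if (baseKeys[slot] == null || baseKeys[slot].equals(key)) {
            baseKeys[slot] = key;
            baseValues[slot] = value;
            return;
        }
        for (Object[] pair : overflow) {               // update if already spilled over
            if (key.equals(pair[0])) { pair[1] = value; return; }
        }
        overflow.add(new Object[] { key, value });     // conflict: spill to the overflow area
    }

    Object get(Object key) {
        int slot = Math.floorMod(key.hashCode(), baseKeys.length);
        if (key.equals(baseKeys[slot])) return baseValues[slot];
        for (Object[] pair : overflow) {               // linear scan of the overflow table
            if (key.equals(pair[0])) return pair[1];
        }
        return null;
    }
}
```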

4. Chain address method (zipper method)

The elements at a conflicting position are organized into a linked list. When data is added and its hash address conflicts with an element already in the hash table, it is placed on the linked list at that position.

Advantages of the zipper method:

  • Handling conflicts is simple and there is no clustering: non-synonyms never conflict with one another, so the average search length is shorter.

  • Because the node space of each linked list is allocated dynamically, chaining is better suited to cases where the table length cannot be determined before the table is built.

  • Deleting a node is easy to implement: simply remove the corresponding node from the linked list.

Disadvantages of the zipper method: additional storage space is required.
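A minimal sketch of the zipper method, close in spirit to what HashMap does with its bucket array (resizing and the red-black-tree conversion are omitted; the class name is illustrative):

```java
// Separate chaining: each bucket holds the head of a singly linked list.
class ChainedHashTable<K, V> {
    private static class Node<K, V> {
        final K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int capacity) { buckets = (Node<K, V>[]) new Node[capacity]; }

    void put(K key, V value) {
        int slot = Math.floorMod(key.hashCode(), buckets.length);
        for (Node<K, V> n = buckets[slot]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }   // overwrite existing key
        }
        buckets[slot] = new Node<>(key, value, buckets[slot]);    // prepend new node to the chain
    }

    V get(K key) {
        int slot = Math.floorMod(key.hashCode(), buckets.length);
        for (Node<K, V> n = buckets[slot]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}
```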

As can be seen from the underlying structure of HashMap, it uses a combination of an array plus linked lists (or red-black trees) as its underlying structure; in other words, HashMap resolves conflicts using the chain address method.

Why must the HashMap load factor be 0.75 rather than 0.8 or 0.6?

From the above we know that the underlying structure of HashMap is also a hash table, and that it resolves conflicts with the chain address method. The default initial size of a HashMap is 16. To keep the probability of conflicts low, once the number of elements in the HashMap reaches a critical value, a resize is triggered: all elements are rehashed and placed into the enlarged container, which is a very time-consuming operation.

The critical value is determined by the load factor and the size of the current container:

Critical value = DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR

That is, with the defaults, the resize operation is triggered once 16 × 0.75 = 12 elements have been stored.
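These defaults correspond to the constants `DEFAULT_INITIAL_CAPACITY` (1 << 4, i.e. 16) and `DEFAULT_LOAD_FACTOR` (0.75f) in `java.util.HashMap`. A small sketch of the arithmetic (only the two constant values come from the JDK; the wrapper class is ours):

```java
public class HashMapThreshold {
    // Defaults mirroring java.util.HashMap
    static final int   DEFAULT_INITIAL_CAPACITY = 1 << 4;  // 16
    static final float DEFAULT_LOAD_FACTOR      = 0.75f;

    public static void main(String[] args) {
        // Once size exceeds this threshold, the table doubles and every
        // entry is rehashed into the new, larger array.
        int threshold = (int) (DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
        System.out.println(threshold);  // prints 12
    }
}
```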

So why was 0.75 chosen as the load factor for HashMap? This has to do with an important distribution in statistics: the Poisson distribution.

The Poisson distribution is a common discrete probability distribution in statistics and probability theory, used to describe the probability of a given number of random events occurring per unit of time. Interested readers can check Wikipedia or this article by Ruan Yifeng: Poisson Distribution and Exponential Distribution: A 10-minute tutorial.
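Written out, the Poisson probability mass function is:

$$P(N(t) = n) = \frac{(\lambda t)^n e^{-\lambda t}}{n!}$$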

In this formula, P denotes probability, N(t) is the number of events that have occurred by time t, n is the number of events, and λ is the average rate at which the event occurs.

The HashMap source code contains the following comment:

```
* Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0:    0.60653066
* 1:    0.30326533
* 2:    0.07581633
* 3:    0.01263606
* 4:    0.00157952
* 5:    0.00015795
* 6:    0.00001316
* 7:    0.00000094
* 8:    0.00000006
* more: less than 1 in ten million
```

In the ideal case, with random hash codes and the default resizing threshold (load factor) of 0.75, the frequency with which nodes appear in hash buckets follows a Poisson distribution with a parameter of about 0.5. Ignoring variance, and taking λt = 0.5, the probability P(X = k) that a bucket holds exactly k nodes follows from the formula:
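$$P(X = k) = \frac{0.5^k \, e^{-0.5}}{k!}$$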

Evaluating this for k = 0 through 8 gives the list shown in the comment above. When the linked list in a bucket reaches 8 elements, the probability is about 0.00000006, practically an impossible event.

So, in fact, the constant 0.5 is what gets substituted into the Poisson distribution as its parameter, while the load factor 0.75 acts as the condition: a HashMap resizes when size / capacity ≥ 0.75. Under that condition, the chain lengths produced by conflicts and their probabilities are:

```
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
```
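As a sanity check, these values can be reproduced directly from the formula; a small sketch:

```java
// Reproduces the probabilities above: P(k) = e^-0.5 * 0.5^k / k!
public class PoissonBucketSizes {
    public static void main(String[] args) {
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) factorial *= k;
            double p = Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);
        }
    }
}
```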

So why not 0.8 or 0.6?

Besides the hash algorithm, two parameters in HashMap affect performance: the initial capacity and the load factor. The initial capacity is the capacity of the hash table when it is created, and the load factor is a measure of how full the hash table is allowed to become before its capacity is automatically expanded.

Wikipedia describes the load factor as follows:

For open addressing, the load factor is particularly important and should be strictly limited to around 0.7-0.8. Above 0.8, CPU cache misses increase along an exponential curve. Therefore, some hash libraries, such as Java's standard library, limit the load factor to 0.75, above which the hash table is resized.

When setting the initial capacity, the number of entries the map is expected to hold and the load factor should be taken into account, so as to minimize the number of rehash operations. It is therefore generally recommended, when using a HashMap, to set the initial capacity based on an estimate of the expected number of entries.
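For example, if roughly 1,000 entries are expected, pre-sizing keeps the map below the 0.75 threshold the whole time. The `(expected / 0.75f) + 1` expression is a common rule of thumb, not an official API:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapExample {
    public static void main(String[] args) {
        // Pre-size so that expectedEntries / capacity stays below the 0.75
        // threshold, avoiding intermediate rehashes while the map fills up.
        int expectedEntries = 1000;
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        Map<String, Integer> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expectedEntries; i++) {
            map.put("key-" + i, i);
        }
        System.out.println(map.size());
    }
}
```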

The choice of 0.75 as the default load factor is entirely a trade-off between time and space costs.