The other day I saw someone in a group chat discussing why the default load factor of HashMap is 0.75.
Load factor in HashMap source code
```java
static final float DEFAULT_LOAD_FACTOR = 0.75f;
```
The consensus was that 0.75 is a compromise between hash collisions and space utilization. Just as most data structures trade insertion speed for query speed or vice versa, a HashMap is a structure that is relatively slow to insert into and fast to query.
The load factor measures how full the hash table is. The larger the load factor, the more elements are filled in and the higher the space utilization, but the greater the chance of collisions. Conversely, the smaller the load factor, the fewer elements are filled in and the lower the chance of collisions, but the more space is wasted.
The greater the chance of collision, the higher the cost of a lookup; the smaller the chance, the lower the cost.
Therefore, we must strike a balance between collision probability and space utilization.
But why exactly 0.75, and not 0.8 or 0.6?
In the spirit of getting to the bottom of things, let's dig deeper. Before we do, let's briefly cover some basics needed for this article:
1. Definition of a collision: assuming the address set of the hash table is [0, n), a collision occurs when the hash address j (0 <= j <= n-1) computed for a key is already occupied by another record.
2. Handling collisions: insert the colliding record at another "empty" hash address. That is, when the computed hash address H1 collides, probe the next address H2; if H2 also collides, probe H3, and so on, until some Hk does not collide; Hk then becomes the record's address in the table.
There are several ways to handle collisions:
1. Open addressing
Hi = (H(key) + di) MOD m, i = 1, 2, …, k (k <= m-1), where H(key) is the hash function, m is the length of the hash table, and di is an increment sequence.
Depending on the increment sequence, open addressing can be divided into three variants:
1) Linear probing: di = 1, 2, 3, …, m-1. Starting from the colliding position with a step of 1, the table is scanned until an empty slot is found, and the element is inserted there. It's like walking down a street of restaurants: you ask at the first one, they tell you it's full, so you go next door and ask, and keep going door by door until you find one with a free table. (A minimal sketch of this approach in Java follows the disadvantages list below.)
2) Linear compensation probing: di = Q, so Hi = (H(key) + Q) mod m, i = 1, 2, …, k (k <= m-1); Q and m must be coprime so that every cell of the hash table can be probed. Continuing the example above: instead of going door to door, you take out a calculator, work out which restaurant to try next, and then ask there.
3) Pseudo-random probing: di is a sequence of pseudo-random numbers. In the restaurant example, you simply pick the next place to ask at random.
Disadvantages:
- A hash table built this way tends to cluster records when there are many collisions, which hurts lookups.
- Deleting a node cannot simply free its slot, because that would break the probe path of synonym nodes inserted after it. Therefore, on a hash table that handles collisions with open addressing, a deletion can only mark the node as deleted rather than physically remove it.
- When the table is full, an overflow table has to be created to store the extra elements.
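As a minimal sketch of the linear-probing variant (an illustration only, not the JDK's HashMap; the class and method names are made up for this example):

```java
/** A minimal open-addressing hash table using linear probing (illustrative sketch). */
class LinearProbingTable {
    private final int[] keys;
    private final boolean[] used;

    LinearProbingTable(int capacity) {
        keys = new int[capacity];
        used = new boolean[capacity];
    }

    /** Inserts a key; returns false if the table is full. */
    boolean insert(int key) {
        int m = keys.length;
        int h = Math.floorMod(key, m);       // H(key)
        for (int di = 0; di < m; di++) {     // di = 0, 1, 2, ..., m-1
            int i = (h + di) % m;            // Hi = (H(key) + di) mod m
            if (!used[i]) {                  // found an empty slot: insert here
                keys[i] = key;
                used[i] = true;
                return true;
            }
        }
        return false;                        // every slot was probed: table is full
    }
}
```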
2. Rehashing
Hi = RHi(key), i = 1, 2, …, k, where each RHi is a different hash function. When a synonym causes an address collision, another hash function is applied to compute a new address, until no collision occurs. This method is less prone to clustering, but it increases computation time.
Disadvantages: Increased computation time.
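A hedged sketch of the idea, with two illustrative hash functions (the names rh1/rh2 and the whole class are assumptions for this example, not taken from any library):

```java
/** Rehashing sketch: try a sequence of different hash functions until an empty slot is found. */
class RehashingTable {
    private final Integer[] slots;

    RehashingTable(int capacity) {
        slots = new Integer[capacity];
    }

    // Two illustrative hash functions RH1 and RH2; a real table could define more.
    private int rh1(int key) { return Math.floorMod(key, slots.length); }
    private int rh2(int key) { return Math.floorMod(key * 31 + 7, slots.length); }

    boolean insert(int key) {
        int[] candidates = { rh1(key), rh2(key) };   // Hi = RHi(key), i = 1, 2, ...
        for (int i : candidates) {
            if (slots[i] == null) {                  // first non-colliding address wins
                slots[i] = key;
                return true;
            }
        }
        return false;                                // all hash functions collided
    }
}
```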
3. Public overflow area
Assuming the range of the hash function is [0, m-1], a vector HashTable[0..m-1] is set up as the base table, each slot holding one record, and a vector OverTable[0..v] is set up as the overflow table. Any record whose key is a synonym of a record already in the base table, regardless of the hash address produced by the hash function, is placed in the overflow table when a collision occurs.
Simply put, a separate table is created to hold the colliding elements.
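A bare-bones sketch of that idea (class and field names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

/** Public overflow area sketch: colliding records go into a separate overflow table. */
class OverflowAreaTable {
    private final Integer[] hashTable;                           // base table, one record per slot
    private final List<Integer> overTable = new ArrayList<>();   // shared overflow table

    OverflowAreaTable(int m) {
        hashTable = new Integer[m];
    }

    void insert(int key) {
        int i = Math.floorMod(key, hashTable.length);
        if (hashTable[i] == null) {
            hashTable[i] = key;      // no collision: store in the base table
        } else {
            overTable.add(key);      // collision: store in the overflow table
        }
    }
}
```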
4. Chaining (the zipper method)
All records whose keys are synonyms are stored in the same linked list; in other words, the elements that hash to the same position are linked into a list. (A minimal sketch in Java follows the lists of advantages and disadvantages below.)
Advantages of the zipper method:
- Handling collisions with the zipper method is simple, and there is no clustering: non-synonyms never collide, so the average search length is shorter.
- Because the node space of each linked list is allocated dynamically, the zipper method is better suited to cases where the length of the table cannot be determined before it is built.
- In a hash table built with the zipper method, deleting a node is easy to implement: simply remove the corresponding node from its linked list.
Disadvantages of the zipper method:
- The pointers take extra space, so when nodes are small, open addressing is more space-efficient; and if the space saved on pointers is used to enlarge the hash table, the load factor can be reduced, which in turn reduces collisions in open addressing and improves the average search speed.
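Java's HashMap uses this approach. As a rough, self-contained sketch of separate chaining (again an illustration with made-up names, not the JDK source), it might look like this:

```java
import java.util.LinkedList;

/** Separate chaining sketch: each bucket holds a linked list of entries. */
@SuppressWarnings("unchecked")
class ChainedTable<K, V> {
    private final LinkedList<Object[]>[] buckets;   // each Object[] is a {key, value} pair

    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
    }

    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Object[] entry : buckets[i]) {          // overwrite an existing key
            if (entry[0].equals(key)) { entry[1] = value; return; }
        }
        buckets[i].add(new Object[]{key, value});    // otherwise append to the chain
    }

    V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Object[] entry : buckets[i]) {
            if (entry[0].equals(key)) return (V) entry[1];
        }
        return null;
    }
}
```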
Java HashMap data structure
A HashMap is actually a "linked-list hash" data structure, i.e. a combination of an array and linked lists.
Java's HashMap uses the zipper method to handle collisions. A HashMap has an initial capacity, which defaults to 16:
```java
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
```
To reduce the probability of collisions, when the number of entries in the HashMap reaches a threshold, a resize is triggered and all elements are rehashed into the enlarged container, which is a time-consuming operation.
The threshold is determined by the load factor and the current capacity: capacity × load factor. With the defaults, that is DEFAULT_INITIAL_CAPACITY × DEFAULT_LOAD_FACTOR = 16 × 0.75 = 12, so once the number of entries exceeds 12 the first resize is triggered.
So when using a hash container, try to estimate the amount of data in advance and set the initial capacity accordingly. For the concrete implementation, study the HashMap source code.
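As a hedged example of such presizing (the arithmetic is the point here; the class name and the figure of 1000 entries are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapExample {
    public static void main(String[] args) {
        int expectedEntries = 1000;

        // Choose a capacity large enough that expectedEntries stays below capacity * 0.75,
        // so no resize is triggered while filling the map. HashMap rounds the requested
        // capacity up to a power of two internally.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;

        Map<Integer, String> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println("size = " + map.size());
    }
}
```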
So why should the default load factor be 0.75? I found this passage in the comments of the HashMap source code:
Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (en.wikipedia.org/wiki/Poisso…) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
Note the keyword Poisson_distribution in the wiki link
In the ideal case, with random hash codes, the frequency of nodes in a bucket follows a Poisson distribution; the table above gives the probability of a bucket containing a given number of elements.
From the table we can see that the probability of a bucket holding 8 elements is already tiny, which means that with a load factor of 0.75 it is almost impossible for the list at any collision position to grow beyond length 8.
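To double-check the table, here is a small program of my own (not from the JDK) that evaluates exp(-0.5) * pow(0.5, k) / factorial(k) for k = 0..8:

```java
public class PoissonBucketProbabilities {
    public static void main(String[] args) {
        double lambda = 0.5;                        // average bucket occupancy at the 0.75 threshold
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) factorial *= k;              // k!
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);  // matches the table from the HashMap comment
        }
    }
}
```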
All right, digging any further would take us into statistics, so I'll stop here. Let me just reiterate: when using a hash container, specify the initial capacity whenever you can, and it should be a power of two.
For more on the Poisson distribution, see www.ruanyifeng.com/blog/2015/0…
By Eric Newhelp
Link: www.jianshu.com/p/dff8f4641… Source: Jianshu
Supplement
Why not 1.0 or 0.5?
First, recall HashMap's underlying data structure: before JDK 1.8 it was array + linked list; since JDK 1.8 it has been array + linked list + red-black tree.
With a load factor of 1.0, resizing happens only when every slot of the array is filled (the original illustration used an array of length 8). That is a big problem, because hash collisions are unavoidable: a load factor of 1.0 means many collisions, and the underlying linked lists or red-black trees become deep and complex, which is terrible for query efficiency. In this case you are sacrificing time to keep space efficient.
With a load factor of 0.5, the array is resized as soon as it is half full. With fewer elements filled in, there are fewer hash collisions, so the underlying linked lists stay short and the red-black trees stay low, and query efficiency rises; but more space is left unused.
A load factor of 0.75 strikes the balance: space utilization is fairly high while a large number of hash collisions are still avoided, so the underlying linked lists stay short and the red-black trees stay low, and both time and space efficiency are reasonable.
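As a rough, hedged illustration of this trade-off (a toy simulation with uniformly random bucket assignment, not a benchmark of HashMap itself; the class name and bucket count are made up), the following counts how many buckets end up with more than one element at different fill ratios:

```java
import java.util.Random;

public class LoadFactorCollisionDemo {
    public static void main(String[] args) {
        int capacity = 1 << 16;                        // 65536 buckets
        double[] loadFactors = {0.5, 0.75, 1.0};
        Random random = new Random(42);

        for (double lf : loadFactors) {
            int[] bucketSizes = new int[capacity];
            int elements = (int) (capacity * lf);      // fill the table up to this load factor
            for (int i = 0; i < elements; i++) {
                int bucket = random.nextInt(capacity); // stand-in for a uniformly random hash
                bucketSizes[bucket]++;
            }
            int collidingBuckets = 0;
            for (int size : bucketSizes) {
                if (size > 1) collidingBuckets++;      // a chain longer than 1 means collisions
            }
            System.out.printf("load factor %.2f -> buckets with collisions: %d%n",
                    lf, collidingBuckets);
        }
    }
}
```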
Supplement
1. Book Project:
codeGoogler/ProgramBooks
2. Video tutorials:
Must-read Java books; SpringBoot, Spring, MyBatis, Redis, RabbitMQ, SpringCloud, high concurrency
If you found this helpful, remember to like and follow; more selected technical articles are on the way!