In software development, a hash table can be modeled as randomly placing n keys into b buckets: n pieces of data stored in b units of space.
From this simple model we can derive some interesting conclusions about hash tables:
How keys are distributed among the buckets
When the number of keys equals the number of buckets (n/b = 1):
- 37% of buckets are empty.
- 37% of buckets have only one key in them.
- 26% of buckets have more than one key (a hash conflict).
For example, with n=b=20, sorting the buckets by the number of keys they hold makes the skew easy to see.
This runs counter to intuition: our first impression of a hash table is that if keys are thrown into the buckets at random, the number of keys per bucket should be fairly uniform, with each bucket expected to hold one key.
In fact, when n is small the distribution of keys across buckets is very uneven, even though the average is 1. As n increases, the distribution evens out.
How the number of keys affects the proportions of bucket types
The following table shows how the proportions of the three bucket types change as n/b grows, i.e., with b held constant and n increasing (the conflict column counts buckets that contain more than one key):
n/b (average keys per bucket) | empty buckets | buckets with one key | buckets with more than one key (conflict) |
---|---|---|---|
0.5 | 61% | 30% | 9% |
0.75 | 47% | 35% | 17% |
1.0 | 37% | 37% | 26% |
2.0 | 14% | 27% | 59% |
5.0 | 1% | 3% | 96% |
10.0 | 0% | 0% | 100% |
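As a sanity check, here is a minimal simulation sketch (not from the original article; the function name and parameters are mine) that throws n random keys into b buckets and measures the three proportions. For b on the order of a thousand it lands close to the table above.

```python
import random
from collections import Counter

def bucket_type_ratios(n, b, trials=100):
    """Average fractions of empty, one-key and multi-key buckets."""
    empty = one = multi = 0
    for _ in range(trials):
        # each key independently picks a random bucket
        counts = Counter(random.randrange(b) for _ in range(n))
        empty += b - len(counts)
        one += sum(1 for c in counts.values() if c == 1)
        multi += sum(1 for c in counts.values() if c > 1)
    total = b * trials
    return empty / total, one / total, multi / total

if __name__ == "__main__":
    b = 1000
    for ratio in (0.5, 0.75, 1.0, 2.0, 5.0, 10.0):
        e, o, m = bucket_type_ratios(int(b * ratio), b)
        print(f"n/b={ratio:>5}: empty={e:5.1%}  one key={o:5.1%}  conflict={m:5.1%}")
```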
To make this more intuitive, the graph below shows how the empty-bucket rate and the conflict rate change with n/b.
The effect of the number of keys on the uniformity of buckets
The numbers above are useful references when n/b is small. As n/b increases, however, the number of empty buckets and of one-key buckets drops to almost zero, and most buckets hold multiple keys.
When n/b exceeds 1 (buckets are allowed to hold multiple keys), what we mainly care about becomes the distribution of the number of keys per bucket.
The table below shows how uneven the number of keys per bucket tends to be when n/b is large. To describe this unevenness we use the relative difference between the most-loaded and least-loaded buckets: (most - fewest) / most.
With b fixed at 100, the distribution becomes more and more even as n increases:
n/b (average keys per bucket) | keys in the least-loaded bucket | (most - fewest) / most |
---|---|---|
1 | 0 | 100.0% |
10 | 2 | 88.0% |
100 | 74 | 41.2% |
1000 | 916 | 15.5% |
10000 | 9735 | 5.1% |
100000 | 99161 | 1.6% |
1000000 | 997996 | 0.4% |
As n/b, the average number of keys per bucket, increases, the unevenness across buckets gradually decreases.
Unlike the ratios of empty buckets and one-key buckets (which depend only on n/b), this uniformity depends not only on n/b but also on the value of b itself, as discussed later.
Here we do not use the variance (mean squared deviation) that statistics usually uses to describe unevenness, because in software development we mostly have to plan for the worst case when provisioning memory and other resources.
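To reproduce these numbers yourself, the (most - fewest) / most measure can be simulated directly. This is a rough sketch of mine; the full script at the end of the article does the same thing more carefully and also compares it against a normal-distribution estimate.

```python
import random

def unevenness(n, b):
    """Simulate n random keys in b buckets; return (fewest, (most - fewest) / most)."""
    buckets = [0] * b
    for _ in range(n):
        buckets[random.randrange(b)] += 1
    fewest, most = min(buckets), max(buckets)
    return fewest, (most - fewest) / most

if __name__ == "__main__":
    b = 100
    for avg in (1, 10, 100, 1000, 10000):
        fewest, diff = unevenness(avg * b, b)
        print(f"n/b={avg:>5}: fewest={fewest}  difference={diff:.1%}")
```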
Load Factor: n/b ≤ 0.75
The concept of load factor (n/b) is commonly used to describe this characteristic of a hash table.
In general, an in-memory hash table keeps n/b ≤ 0.75. This does not waste too much space, and it keeps key conflicts relatively rare. A low conflict rate means keys rarely need to be relocated by probing, so inserts into the hash table are faster.
Linear probing is a commonly used algorithm for resolving collisions at insert time. When a key's bucket is already occupied, the algorithm checks the following buckets one by one (step size 1), wrapping around if necessary, until an empty bucket is found. It is therefore very sensitive to hash conflicts.
At n/b = 0.75, about 47% of buckets would be empty if linear probing were not used (for example, if each bucket held multiple keys in a linked list). If linear probing is used, roughly half of those 47% of buckets get filled by probed keys.
In many in-memory hash table implementations, n/b ≤ 0.75 is chosen as the capacity limit. Not only does the collision rate rise as n/b grows, but, more importantly, the efficiency of linear probing drops rapidly as n/b grows. For a detailed analysis, refer to the implementation and analysis of linear probing.
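For concreteness, here is a minimal open-addressing sketch with linear probing (step size 1). It is illustrative only: the class has no resizing or deletion, and the names are mine rather than from any particular library. With n/b kept below 0.75 the probe sequences stay short; as the table fills up they grow quickly.

```python
class LinearProbingTable:
    def __init__(self, nbuckets):
        self.slots = [None] * nbuckets        # each slot holds one (key, value) or None

    def _find_slot(self, key):
        i = hash(key) % len(self.slots)
        # probe the next bucket (step size 1), wrapping around, until we find
        # the key or an empty slot; no resizing here for brevity
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        self.slots[self._find_slot(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._find_slot(key)]
        return None if slot is None else slot[1]

t = LinearProbingTable(8)                     # 5 keys in 8 buckets: n/b < 0.75
for k in ["a", "b", "c", "d", "e"]:
    t.put(k, k.upper())
print(t.get("c"))                             # -> "C"
```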
Tips on hash table characteristics:

- A hash table is an algorithm that trades a certain amount of wasted space for speed. You cannot have all three at once: low time overhead (O(1)), low space waste, and a low conflict rate.
- Hash tables are only suitable for pure in-memory data structures:
  - A hash table trades wasted space for access speed; a small waste of memory is acceptable, while wasting disk space is usually intolerable.
  - Hash tables are only suitable for storage media with fast random access. Data stored on disk is organized with B-trees or other ordered data structures instead.
- Most high-level languages (with built-in hash tables or hash sets) keep n/b ≤ 0.75.
- A hash table does not distribute keys evenly when n/b is small.
Load Factor: n/b > 1
Another kind of hash table implementation, designed to store a large number of keys, gives up on linear probing once n/b exceeds 1.0 (there are no longer enough buckets to give each key its own bucket). Instead of storing a single key per bucket, all keys that hash to the same bucket are usually chained together (separate chaining), typically in a linked list.
Linked lists are practical only when n/b is not too large, because a list lookup costs time linear in the chain length. For very large n/b, trees are sometimes used instead of lists to manage the keys within a bucket.
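Here is a minimal separate-chaining sketch under the same assumption of a random hash; the class and method names are mine. Each bucket is a small Python list, so a lookup walks the chain, which is why very long chains (large n/b) eventually call for trees instead.

```python
class ChainedTable:
    def __init__(self, nbuckets):
        self.buckets = [[] for _ in range(nbuckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # overwrite an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise chain a new entry

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:              # O(chain length) lookup
            if k == key:
                return v
        return None

t = ChainedTable(4)                      # deliberately few buckets: n/b > 1
for i in range(10):
    t.put(f"key-{i}", i)
print(t.get("key-7"))                    # -> 7
```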
One scenario that uses a large n/b is assigning the users of a web site to several web servers at random, where each server handles many users. In most cases we want this assignment of users to servers to be as even as possible, so that every server's resources are used efficiently.
In this case what matters is how uniform the hashing is; assuming the hash function is completely random, the uniformity depends only on n and b, which is what we discuss next.
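As a hypothetical illustration of such an assignment (the server names and user ids below are made up), one can hash the user id and take it modulo the number of servers; how evenly the users then spread across servers is exactly the uniformity question discussed here.

```python
import hashlib

servers = ["web-0", "web-1", "web-2"]

def assign(user_id: str) -> str:
    # use a strong hash of the user id so the assignment is close to random
    digest = hashlib.sha1(user_id.encode()).digest()
    return servers[int.from_bytes(digest, "big") % len(servers)]

print(assign("alice"), assign("bob"))
```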
The larger n/b is, the more evenly the keys are distributed
When n/b is very large, the probability that a bucket is empty tends to zero, and the number of keys in each bucket tends to average out.
Statistically, the expected number of keys in each bucket is n/b. We define this average as 100%: a bucket holding exactly n/b keys is at exactly 100%.
The following three figures simulate the distribution of the number of keys in a bucket when b=20 and n/b is 10, 100, and 1000 respectively.
We see that as n/b increases, the gap between the most-loaded and least-loaded buckets gradually narrows. The table below shows how the key distribution, measured as (most - fewest) / most, changes as b and n/b increase:
b \ n/b | 10² | 10³ | 10⁴ | 10⁵ | 10⁶ |
---|---|---|---|---|---|
100 | 37.4% | 13.6% | 4.5% | 1.4% | 0.5% |
1000 | 47.3% | 17.7% | 6.0% | 1.9% | 0.6% |
10000 | 54.0% | 20.9% | 7.1% | 2.3% | 0.7% |
Conclusion:
scenario | trend |
---|---|
When the number of keys (n) is fixed | The more buckets, the less uniform the distribution. |
When the number of buckets (b) is fixed | The more keys, the more uniform the distribution. |
When the ratio of keys to buckets (n/b) is fixed | The larger n and b are, the less uniform the distribution. |
Calculating these numbers
Most of the results above come from program simulations; now let us see how to calculate these values mathematically.
The number of buckets of each type
Bucket type | Proportion of all buckets |
---|---|
Buckets containing 0 keys | $(1 - \frac{1}{b})^n \approx e^{-n/b}$ |
Buckets containing exactly 1 key | $\frac{n}{b}(1 - \frac{1}{b})^{n-1} \approx \frac{n}{b} e^{-n/b}$ |
Buckets containing more than 1 key | $1 - (1 - \frac{1}{b})^n - \frac{n}{b}(1 - \frac{1}{b})^{n-1}$ |
The number of empty buckets
For a single key, the probability that it does not land in a particular bucket is $1 - \frac{1}{b}$.

The probability that none of the n keys lands in that bucket is $(1 - \frac{1}{b})^n$.

We know that $e^x = \lim_{n \to \infty}(1 + \frac{x}{n})^n$, so for large n the probability that a bucket is empty is:

$(1 - \frac{1}{b})^n \approx e^{-n/b}$

The total number of empty buckets is therefore:

$b \cdot e^{-n/b}$
The number of buckets with one key
For a particular bucket, the probability that it contains exactly one key is: one of the n keys falls into this bucket with probability $\frac{1}{b}$, and each of the other $n-1$ keys misses it with probability $1 - \frac{1}{b}$:

$n \cdot \frac{1}{b} \cdot (1 - \frac{1}{b})^{n-1}$

The number of buckets with exactly one key is therefore:

$b \cdot n \cdot \frac{1}{b} \cdot (1 - \frac{1}{b})^{n-1} = n (1 - \frac{1}{b})^{n-1} \approx n e^{-n/b}$
Buckets with multiple keys
The remaining buckets contain more than one key:

$b - b(1 - \frac{1}{b})^n - n(1 - \frac{1}{b})^{n-1}$
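As a numeric check of these formulas (my own sketch, not part of the original derivation), evaluating them for a largish b reproduces the percentages in the earlier table:

```python
# expected fractions of empty / one-key / multi-key buckets for b buckets and n keys
b = 10_000
for ratio in (0.5, 0.75, 1.0, 2.0, 5.0, 10.0):
    n = int(b * ratio)
    empty = (1 - 1 / b) ** n                  # ≈ e^(-n/b)
    one = (n / b) * (1 - 1 / b) ** (n - 1)    # ≈ (n/b) e^(-n/b)
    multi = 1 - empty - one
    print(f"n/b={ratio:>5}: empty={empty:.0%}  one key={one:.0%}  conflict={multi:.0%}")
```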
The uniformity of the key distribution
Similarly, the probability that a bucket contains exactly i keys is: some i of the n keys fall into this bucket, each with probability $\frac{1}{b}$, and the other $n - i$ keys miss it, each with probability $1 - \frac{1}{b}$:

$\binom{n}{i} \left(\frac{1}{b}\right)^i \left(1 - \frac{1}{b}\right)^{n-i}$

This is the well-known binomial distribution.
We can use the binomial distribution to estimate the number of keys in the most-loaded bucket and in the least-loaded bucket.
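Before switching to the normal approximation below, note that the same estimate can be done with the exact binomial distribution. The sketch below is my own (using a simple pmf recurrence rather than anything from the article); it finds the smallest x for which the expected number of buckets with at most x keys reaches 1.

```python
def min_bucket_binomial(nbucket, nkey):
    """Smallest x with nbucket * P[Binomial(nkey, 1/nbucket) <= x] >= 1."""
    p = 1.0 / nbucket
    pmf = (1 - p) ** nkey                            # P[a given bucket holds exactly 0 keys]
    cdf = pmf
    x = 0
    while nbucket * cdf < 1.0 and x < nkey:
        pmf *= (nkey - x) / (x + 1) * p / (1 - p)    # binomial pmf recurrence
        x += 1
        cdf += pmf
    return x

print(min_bucket_binomial(100, 10_000))   # about 75-77, close to the 74 in the table above
```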
Approximating with the normal distribution
When n and b are both large, the binomial distribution can be approximated by a normal distribution with mean $\mu = \frac{n}{b}$ and standard deviation $\sigma = \sqrt{\frac{n}{b}(1 - \frac{1}{b})}$, and this approximation is used to estimate the uniformity of the key distribution.

The probability that a bucket contains exactly i keys is:

$\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(i - \mu)^2}{2\sigma^2}}$

The probability that a bucket contains no more than x keys is:

$\Phi\left(\frac{x - \mu}{\sigma}\right)$

So the expected number of buckets containing no more than x keys is:

$b \cdot \Phi\left(\frac{x - \mu}{\sigma}\right)$
The number of keys in the least-loaded bucket can then be estimated as follows: if the expected number of buckets containing no more than x keys is 1, that single bucket is the least-loaded one. So we look for the smallest x at which the expected number of buckets containing no more than x keys equals 1; this x is the number of keys in the least-loaded bucket.
Calculating the minimum number of keys x
The probability that a bucket contains no more than x keys is $\Phi\left(\frac{x - \mu}{\sigma}\right)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. When $x - \mu$ is close to 0, it can be approximated by the series

$\Phi(z) \approx \frac{1}{2} + \phi(z)\left(z + \frac{z^3}{3} + \frac{z^5}{3 \cdot 5} + \frac{z^7}{3 \cdot 5 \cdot 7} + \cdots\right), \qquad z = \frac{x - \mu}{\sigma}$

where $\phi$ is the standard normal density (this is the series the script below uses).
This is still not convenient to solve for x directly, but since we only need x, we can scan x downward from $\mu$ over the range $[0, \mu]$ and stop at the x for which the expected number of buckets containing no more than x keys is 1.
This x can be taken as the number of keys in the least-loaded bucket, and the unevenness of the hash can then be described by the gap between the fullest and emptiest buckets. Since the normal distribution is symmetric, the maximum number of keys in a bucket can be estimated as $\mu + (\mu - x) = 2\mu - x$. The unevenness ratio between the fullest and emptiest bucket is then:

$\frac{(2\mu - x) - x}{2\mu - x} = \frac{2(\mu - x)}{2\mu - x}$

where $\mu$ is the mean, n/b.
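As a rough worked example (my numbers, not from the original article): for b = 100 and n/b = 100 we have μ = 100 and σ = sqrt(10⁴ · 0.01 · 0.99) ≈ 9.95. Setting b · Φ((x - μ)/σ) = 1 gives Φ = 0.01, i.e. (x - μ)/σ ≈ -2.33, so x ≈ 77; the fullest bucket is then about 2μ - x ≈ 123, and (most - fewest)/most ≈ (123 - 77)/123 ≈ 37%, in the same range as the simulated 41.2% in the earlier table.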
Program simulation
The following Python script simulates the distribution of keys across buckets and compares the simulation with the normal-distribution estimate, to verify the results above.
```python
import sys
import math
import time
import hashlib


def normal_pdf(x, mu, sigma):
    x = float(x)
    mu = float(mu)

    m = 1.0 / math.sqrt(2 * math.pi) / sigma
    n = math.exp(-(x - mu) ** 2 / (2 * sigma * sigma))

    return m * n


def normal_cdf(x, mu, sigma):
    # integral(-oo, x)
    x = float(x)
    mu = float(mu)
    sigma = float(sigma)

    # convert to the standard normal form
    x = (x - mu) / sigma

    # series: cdf(x) = 1/2 + pdf(x) * (x + x^3/3 + x^5/(3*5) + ...)
    s = x
    v = x
    for i in range(1, 100):
        v = v * x * x / (2 * i + 1)
        s += v

    return 0.5 + s / (2 * math.pi) ** 0.5 * math.e ** (-x * x / 2)


def difference(nbucket, nkey):
    nbucket, nkey = int(nbucket), int(nkey)

    # Approximate the binomial distribution with a normal distribution
    # to find the bucket with the fewest keys.
    #
    # The probability that a bucket holds exactly i keys is:
    #     normal_pdf(i, mu, sigma)     # probability density function
    #
    # The probability that a bucket holds 0 ~ i keys is:
    #     normal_cdf(i, mu, sigma)     # cumulative distribution function
    #
    # Search downward from the mean for the x at which the expected number
    # of buckets holding no more than x keys is 1; that is the least-loaded
    # bucket.
    p = 1.0 / nbucket
    mu = nkey * p
    sigma = math.sqrt(nkey * p * (1 - p))

    target = 1.0 / nbucket
    minimal = mu
    while True:
        xx = normal_cdf(minimal, mu, sigma)
        if abs(xx - target) < target / 10:
            break
        minimal -= 1

    return minimal, (mu - minimal) * 2 / (mu + (mu - minimal))


def difference_simulation(nbucket, nkey):
    t = str(time.time())
    nbucket, nkey = int(nbucket), int(nkey)

    buckets = [0] * nbucket
    for i in range(nkey):
        hsh = hashlib.sha1((t + str(i)).encode()).digest()
        buckets[hash(hsh) % nbucket] += 1

    buckets.sort()
    nmin, nmax = buckets[0], buckets[-1]

    return nmin, float(nmax - nmin) / nmax


if __name__ == "__main__":
    nbucket, nkey = sys.argv[1:]

    minimal, rate = difference(nbucket, nkey)
    print('by normal distribution:')
    print('    min_bucket:', minimal)
    print('    difference:', rate)

    minimal, rate = difference_simulation(nbucket, nkey)
    print('by simulation:')
    print('    min_bucket:', minimal)
    print('    difference:', rate)
```