Today’s article is about HashMap, a data structure that every Java engineer claims to know. Why does every Java engineer claim to know it? Simple: if they didn’t, they couldn’t get a job, because it comes up in almost every interview.
In today’s article we will uncover many of its mysteries. For example: why do get and put operations on a HashMap have O(1) complexity, faster even than a red-black tree? What is the relationship between HashMap and hash algorithms? What parameters does a HashMap take, and what are they for? Is HashMap thread safe? And how does a HashMap keep itself balanced?
With these questions in mind, let’s take a look at the basic structure of a HashMap.
The basic structure
The HashMap data structure is actually not that difficult. It is very clear, and I can describe it in one sentence: it is structured like an adjacency list. The two serve very different purposes, but their structure is identical. In plain English, it is a fixed-length array, each element of which is the head of a linked list. Let’s draw the structure so you can see it at a glance.
headers is a fixed-length array in which each element is the head of a linked list; starting from a head, we can walk through its list. The array is fixed length, but the linked lists are variable length, so adding, deleting, updating, and querying elements is essentially done through the linked lists.
This is the basic structure of a HashMap. If asked in an interview, you can answer directly: it is essentially an array of linked lists.
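In Python terms, the skeleton can be sketched like this. Plain lists stand in for the linked lists here, purely for illustration; the implementation later in the article uses a real LinkedList class.

```python
# A minimal picture of the structure: a fixed-length array ("headers")
# whose slots each hold a variable-length bucket of (key, value) pairs.
capacity = 6
headers = [[] for _ in range(capacity)]

# A node lives in whichever bucket we place it in; here, bucket 2.
headers[2].append(('some_key', 'some_value'))

print(len(headers), headers[2])
```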
The hash function
Now that we understand the basic structure of a HashMap, let’s move on to the next question: What does such a structure have to do with hash?
It’s not hard to guess. Let’s think about a scenario. Let’s say we already have a HashMap, and now we have a new piece of data to store. The length of the array is 6, which means there are 6 lists to choose from. Which list should we place the new element in?
You might say: put it in the shortest one, of course, so that the lists stay balanced. That sounds fine, but there is a problem. Storing is easy, but reading becomes very problematic: when we stored the element we knew which list was shortest, but when we read it back we no longer know which list was shortest at the time, and the whole structure may have changed completely. So we cannot determine a node’s location from such a dynamic quantity; it must be determined from a static quantity.
That static quantity is the hash value. We all know that a hash algorithm is essentially a mapping: it maps a value of any structure to an integer. The ideal situation is that different values map to different results and the same value always maps to the same result; that is, values and integers correspond one-to-one. But since the integers in the output range are finite while possible values are infinite, there must be values that are not equal yet map to the same result. This situation is called a hash collision.
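A quick Python illustration of a collision, assuming a toy table of 6 buckets (note that in CPython the hash of a small integer is the integer itself, which makes the collision easy to see):

```python
capacity = 6  # a toy bucket count, not the implementation's default

# hash(1) == 1 and hash(7) == 7 in CPython, so after taking the
# modulo, keys 1 and 7 both land in bucket index 1: a collision
# at the bucket level, even though the keys are different.
bucket_a = hash(1) % capacity
bucket_b = hash(7) % capacity

print(bucket_a, bucket_b)  # both are 1
```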
In a HashMap we don’t need to worry about hash collisions, because we never required different keys to map to different hash values in the first place: we only use the hash value to decide which list a node belongs in. As long as the hash function is fixed, the computed hash value will not change as long as the key does not change. Therefore we can follow the same logic when querying: compute the key’s hash value and find the corresponding linked list.
In Python, the built-in hash function makes the whole process easy. We only need two lines of code to find the linked list for a key:
hash_key = hash(key) % len(self.headers)
linked_list = self.headers[hash_key]
Implementing get and put
Once you understand what the hash function does, the HashMap problem is largely solved, because all that’s left is adding, deleting, and searching within a linked list. For example, when we want to find a value by key: once we’ve used the hash function to figure out which list it’s in, all that remains is to walk through that list and find the node.
We can implement this in the LinkedList class; it is just a simple traversal:
def get_by_key(self, key):
    cur = self.head.succ
    while cur != self.tail:
        if cur.key == key:
            return cur
        cur = cur.succ
    return None
Once you have the node-query logic for a linked list, you have the query logic for the HashMap, because essentially it only does two things: first, find the list based on the hash value; second, walk through that list and find the node.
The get method is then easy to implement:
def get(self, key):
hash_key = self.get_hash_key(key)
linked_list = self.headers[hash_key]
node = linked_list.get_by_key(key)
return node
Once you’ve implemented the get method, it’s just as easy to write the put method, because put is the mirror image of get: we change the final search into an add-or-modify.
def put(self, key, val):
    hash_key = self.get_hash_key(key)
    linked_list = self.headers[hash_key]
    node = linked_list.get_by_key(key)
    # Modify the node in place if the key exists, otherwise append a new node
    if node is not None:
        node.val = val
    else:
        node = Node(key, val)
        linked_list.append(node)
Guarantee of complexity
Now that get and put are implemented, is the HashMap complete? Obviously not, because there is one very important thing we haven’t done: keeping the HashMap’s complexity under control.
A brief analysis shows that the HashMap implemented this way has a major problem. The array length is fixed at creation, so no matter how long the array is, once we store enough elements every list will be assigned a large number of them. We all know that traversing a linked list is O(n), not O(1), so how can we ensure that queries stay constant-time?
In addition, there is another problem: hash skew. Suppose we have 100 linked lists, but most of our data’s hash values are 0 modulo 100. Then a large amount of data ends up in bucket 0, the other buckets hold almost nothing, and this one bucket is overstuffed. How can we avoid that?
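The skew is easy to reproduce. Assuming 100 buckets and integer keys that happen to all be multiples of 100, every single key lands in bucket 0:

```python
buckets = 100

# In CPython, hash() of a small int is the int itself, so keys that are
# all multiples of 100 every one map to index 0 modulo 100.
keys = [100 * i for i in range(10)]  # 0, 100, 200, ..., 900
indexes = [hash(k) % buckets for k in keys]

print(indexes)  # all zeros: bucket 0 gets everything, the other 99 stay empty
```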
In fact, whether the problem is too much data overall or an uneven distribution, the story is the same: some bucket holds too much data, which hurts efficiency. For this case the HashMap has a check mechanism: when the number of elements in a bucket exceeds a certain threshold, it triggers a reset, which doubles the number of linked lists in the HashMap and reshuffles all the data. The threshold is controlled by a parameter called load_factor: when the number of elements in a bucket exceeds load_factor * capacity, the reset mechanism is triggered.
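As a quick sanity check of the arithmetic, using capacity=16 and load_factor=5 (the defaults the implementation below uses):

```python
capacity = 16   # number of buckets
load_factor = 5  # this toy implementation's per-bucket load factor

# A single bucket may grow to load_factor * capacity nodes
# before a reset doubles the bucket count and rehashes everything.
threshold = load_factor * capacity
print(threshold)  # 80
```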
If we add the reset logic, the put function looks like this:
def put(self, key, val):
    hash_key = self.get_hash_key(key)
    linked_list = self.headers[hash_key]
    # If the bucket exceeds the threshold, reset, then recompute the bucket
    if linked_list.size >= self.load_factor * self.capacity:
        self.reset()
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
    node = linked_list.get_by_key(key)
    if node is not None:
        node.val = val
    else:
        node = Node(key, val)
        linked_list.append(node)
The reset logic is simple: double the size of the array, read the original data one by one, and rehash it into a new bucket.
def reset(self):
    # Double the capacity and rehash every node into the new buckets
    headers = [LinkedList() for _ in range(self.capacity * 2)]
    cap = self.capacity
    self.capacity = self.capacity * 2
    for i in range(cap):
        linked_list = self.headers[i]
        nodes = linked_list.get_list()
        for u in nodes:
            hash_key = self.get_hash_key(u.key)
            head = headers[hash_key]
            head.append(u)
    self.headers = headers
The threshold effectively caps the length of any bucket, and therefore the worst-case query time; we can treat it as a (relatively large) constant, so the efficiency of both put and get is guaranteed. Whether we inserted a lot of data or happened to get an unevenly distributed hash, the reset mechanism handles both cases.
Detail and sublimation
If you read the JDK source code for HashMap, you’ll find that the capacity of a HashMap (the number of linked lists) is always a power of two. Why?
After the hash function has calculated the hash value, we need to take it modulo capacity, i.e. hash(key) % capacity, as shown in the code above.
The small problem here is that modulo is very slow, tens of times slower than addition, subtraction, and multiplication at the hardware level. To optimize this part, we make capacity a power of 2, so that we can use hash(key) & (capacity - 1) instead of hash(key) % capacity; the two calculations are equivalent whenever capacity is a power of 2. Bitwise operations are among the fastest operations a computer performs, so this saves a lot of computation.
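A quick sanity check of the equivalence. In Python it holds even for negative hash values, because % is a floor modulo and & operates on two’s-complement representations:

```python
capacity = 16  # must be a power of 2 for the bitmask trick to work

for key in range(-1000, 1000):
    h = hash(key)
    # Masking with capacity - 1 keeps exactly the low bits of h,
    # which is the same thing the modulo computes.
    assert h % capacity == h & (capacity - 1)

print("equivalent for all tested keys")
```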
Finally, thread safety: is HashMap thread safe? The short answer is no. There are no locks or mutual-exclusion guarantees, so thread A can modify a node while thread B reads the same node, which easily causes problems, especially around a long operation like reset. If a query arrives during a reset, it may fail to find data that actually exists. So HashMap is not thread safe and should not be used in concurrent scenarios.
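If you do need concurrent access, the simplest (if crude) remedy is a coarse lock around every operation. The sketch below is hypothetical and wraps a plain dict just to illustrate the idea; in Java you would reach for ConcurrentHashMap instead:

```python
import threading

class LockedMap:
    # A hypothetical thread-safe wrapper: one coarse lock serializes
    # all access, trading throughput for correctness.
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, val):
        with self._lock:
            self._data[key] = val

    def get(self, key):
        with self._lock:
            return self._data.get(key)

m = LockedMap()
m.put('a', 1)
print(m.get('a'))  # 1
```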
Finally, here is the complete implementation of the HashMap:
class Node:
    def __init__(self, key, val, prev=None, succ=None):
        self.key = key
        self.val = val
        self.prev = prev
        self.succ = succ

    def __repr__(self):
        return str(self.val)


class LinkedList:
    def __init__(self):
        # Sentinel head and tail nodes simplify insertion and deletion
        self.head = Node(None, 'header')
        self.tail = Node(None, 'tail')
        self.head.succ = self.tail
        self.tail.prev = self.head
        self.size = 0

    def append(self, node):
        # Insert the node just before the tail sentinel
        prev = self.tail.prev
        node.prev = prev
        node.succ = prev.succ
        prev.succ = node
        node.succ.prev = node
        self.size += 1

    def delete(self, node):
        prev = node.prev
        succ = node.succ
        prev.succ = succ
        succ.prev = prev
        self.size -= 1

    def get_list(self):
        # Return all nodes in order
        ret = []
        cur = self.head.succ
        while cur != self.tail:
            ret.append(cur)
            cur = cur.succ
        return ret

    def get_by_key(self, key):
        cur = self.head.succ
        while cur != self.tail:
            if cur.key == key:
                return cur
            cur = cur.succ
        return None


class HashMap:
    def __init__(self, capacity=16, load_factor=5):
        self.capacity = capacity
        self.load_factor = load_factor
        self.headers = [LinkedList() for _ in range(capacity)]

    def get_hash_key(self, key):
        # capacity is always a power of 2, so & (capacity - 1) equals % capacity
        return hash(key) & (self.capacity - 1)

    def put(self, key, val):
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        # If the bucket exceeds the threshold, reset, then recompute the bucket
        if linked_list.size >= self.load_factor * self.capacity:
            self.reset()
            hash_key = self.get_hash_key(key)
            linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        if node is not None:
            node.val = val
        else:
            node = Node(key, val)
            linked_list.append(node)

    def get(self, key):
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        return node.val if node is not None else None

    def delete(self, key):
        # Look up the node itself; get() returns the value, not the node
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        if node is None:
            return False
        linked_list.delete(node)
        return True

    def reset(self):
        # Double the capacity and rehash every node into the new buckets
        headers = [LinkedList() for _ in range(self.capacity * 2)]
        cap = self.capacity
        self.capacity = self.capacity * 2
        for i in range(cap):
            linked_list = self.headers[i]
            nodes = linked_list.get_list()
            for u in nodes:
                hash_key = self.get_hash_key(u.key)
                head = headers[hash_key]
                head.append(u)
        self.headers = headers
-end-
- Creating content is not easy; your likes, comments, and follows are the greatest support and encouragement for me!