preface

HashMap HashMap HashMap HashMap HashMap HashMap HashMap To be honest, I have seen the design and implementation of Redis (sincerely recommend), but I can’t describe it accurately, so I write this article.

Note: Since this article focuses on the hash process, the source code section is easy to read

Hash function (hash algorithm)

  • A hash function, also known as a hash algorithm or hash function, is a method of creating small digital “fingerprints” from any kind of data, and this “fingerprint” is the hash value.

  • Hash functions can be used in a variety of fields, such as protecting data, ensuring that real information is delivered, and hash tables. This article focuses on the application of hash tables.

  • Of course, we want hash functions to ensure that each key corresponds to a “fingerprint”, that is, a hash value, the so-called perfect hash, but due to performance, application scenarios and other considerations, can accept not too many hash collisions.

  • Hash collision: There is not A unique relationship between the input and output of the hash function. For example, the hash value of the input A and B of the hash function is C.

  • Common hash functions:

Direct addressing method number analysis method square take center method fold method divide remainder method random number method

  • We all know that HashMap and Redis dictionary scenarios are optimized based on the outleave remainder method as hash functions. I don’t want to belabor the concept of each method here, but I want you to look at it here.

Hash collision (hash collision) resolution

  • Hash? -> hash algorithm selection -> hash conflict how to resolveSurely this is the mindset of most peers. So, what are the main algorithms for resolving hash conflicts?

Open addressing

  • Once a conflict occurs, go to the next empty hash address, the simplest algorithm company is as follows.For example, SettingsIf (26+1) %12 = (37+1) %12, then (37+2) %12 = (37+1) %12, then (37+2) %12 = 4, then (37+2) %12 = 4.

Hashing again

  • Prepare multiple hash functions at the same time so that in case the first hash function conflicts, the alternate hash function can be used to calculate.

Chain address method

  • The hash value is first obtained by the division and remainder method. If the hash value conflicts, the nodes in the conflict are inserted one by one through the linked list to form the structure as shown in the following figure. The following example is,You can see that 48%12 = 12%12 = 0, forming a linked list.

Public overflow zone method

  • Use additional common storage to store elements with conflicting hash values.

HashMap

Hash algorithm of HashMap

episodeLOAD_FACTOR(Load factor)

  • The default LOAD_FACTOR for HashMap is 0.75. What does it do? Follow the source below trace ~

  • New HashMap (capacity = 16, load_factor = 0.75);

/** * Constructs an empty HashMap with the default initial capacity * (16) and the default load factor (0.75). * /
    public HashMap(a) {
        this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
    }
Copy the code
  • As theputElement, must be expanded. Lookresize()Function, refers to post the iconic part.
else {               // zero initial threshold signifies using defaults
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
Copy the code
  • As you can see, the threshold is passedcapacityMultiplied by theload_factorThe result is, namely16 * 0.75 = 12.HashMapIf the number of elements in theIf the threshold is small relative to the capacity, the probability of a hash collision is reduced, because the probability of a hash collision with 16 elements is smaller than the probability of a hash with 12 elements given the same hash function.

Hash function formula and analysis

  • The hash algorithm of HashMap approximates the division and remainder method, but it is not usefulMODI’m going to do it in bits, let’s sayIs the input value and the hash function isThe specific formula is as follows:
f(key) = hash(key) & (table.length - 1) 
hash(key) = (h = key.hashCode()) ^ (h >>> 16)
Copy the code
  • hash(key) & (table.length - 1)Is thetable.lengthThe optimized version of the mod, which actually does the same thing, is based on the division remainder method. Due to theHashMapThe feature is expanded each timetable.lengthIt would be, so bit operations are significantly more efficient.
  • >>>Is the unsigned right shift operator. We know thathashCode()It’s a very wide range of values, so it’s very unlikely to conflict with each other, but it’s uptable.length - 1The odds are going to go up becausetable.lengthIs a smaller value. That’s why it works, right> > > 16The reason,hashCode()The high and the low are both rightf(key)With a certain amount of influence, it makes the distribution more uniform, so there’s less chance of hash collisions.

HashMap hash conflict resolved

  • HashMapThe hashing conflict resolution method is obviousChain address methodIn fact, you can see it in the structure.
  • Available from theHashMaptheresize()Method can also be seen, see part of the source code… Comments on.
if(oldTab ! =null) {
			// Iterate over the nodes on the old array
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if((e = oldTab[j]) ! =null) {
                    oldTab[j] = null;
                    // The current list has only one node (no hash collisions)
                    if (e.next == null)
	                    // The hash algorithm is used to calculate the storage location and place
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
	                    // The red black tree went to...
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
	                    // Low order list, high order list
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
	                        // Iterate over the list with hash conflicts
                            next = e.next;
                            // Hash values less than the old array capacity are put into the low list
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            // Hash values greater than or equal to the size of the old array are put into the high-order list
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                elsehiTail.next = e; hiTail = e; }}while((e = next) ! =null);
                        // Place the list under the original index
                        if(loTail ! =null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        // Place the list in the old index + old array size
                        if(hiTail ! =null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
Copy the code

Redis dictionary

Introduction to Data Structures

withHashMapAbout the same place

First of all, I will briefly explain the data structure of Redis dictionary, which is the C language that produced shadows when I was in college.

typedef struct dictht {
	// Hash table array
	dictEntry **table;

	// Hash table size
	unsigned long size;

	// Hash table size mask, used to calculate index values
	// always equal to size-1
	unsigned long sizemask;
	
	// The number of nodes in the hash table
	unsigned long used;
	
} dictht
Copy the code
  • There’s an array of hash tables, doesn’t that look familiar, andNode<K,V>[]So let’s seedictEntryThis class.
typedef struct dictEntry {
	/ / key
	void *key

	/ / value
	union {
		void *val;
		unit64_tu64;
		int64_ts64;
	} v

	/ / next pointer
	struct dictEntry *next;
}
Copy the code
  • That looks familiar again, key-value pairs! The key property holds the key in the key-value pair, while the V property holds the value, which can be a pointer, unit64_t, or int64_t integer. Next is a pointer to another hash table node that joins the same key-value pairs of multiple hash tables into a linked list.

  • As you can see, the HashMap is basically the same so far, except for a few additional attributes (hash table size, number of nodes in use, mask, and so on).

The difference

  • The bottom layer of the dictionary is implemented by the hash table structure, so what is the real face of the dictionary?
typedef struct dict {
	// Type specific functions
	dictType *type;
	
	// Private data
	void *privdata;

	// hash table
	dictht ht[2];

	/ / rehash index
	// When rehash is not in progress, the value is -1
	int trehashidx;
}
Copy the code
  • At this point, the data structure of Redis dictionary is described. The following figure shows the dictionary in normal state.

Redis dictionary hash algorithm

  • To calculateHash valueThe process of calculating a hash for a dictionary-based hash function:hash = dict -> type->hashFunction(key)
  • Using the sizemask attribute of the hash table and the hash value, calculate the index value, depending on the caseht[0]orht[1]. In fact, andHashMapthehash(key) & (table.length - 1)Same thing, because it says it in the notessizemaskAlways equal to thesize - 1.

index = hash & dict->ht[x].sizemask

  • The hash calculation algorithm used by Redis isMurmurHash2I am not in a position to say,Give the connection.

Redis dictionary hash conflict resolved

  • Redis hash tables are also usedChain address method, each node has onenextPointer, multiple nodes form a one-way linked list, andHashMapThe difference is because there is no table tail pointer to useThe first interpolationAdds a new node to the head position of the linked list.

Redis is different from HashMap

See hash algorithm, hash conflict resolution, there is not much difference, so what is the difference? That’s rehash.

Rehash

  • As mentioned earlier, there are two hash tables in the dictionary data structure (HT [2]), and that’s the secret.

  • The purpose of Rehash is to keep the load factor of a hash table within a reasonable range. A hash table needs to expand or contract when there are too many or too few key-value pairs.

  • The steps are as follows:

    1. For a dictionaryht[1]Hash table allocates space, if executedextensionOperation, thenht[1]The magnitude of theta is greater than or equal to the first oneht[0].used *2the; Student: If you doshrinkageOperation, thenht[1]The first one is greater than or equal toht[0].usedthe.
    2. Will be stored in theht[0]All key-value pairs inrehashtoht[1]Above, the key hash and index values are recalculated, and the key-value pair is placed intoht[1]Hash table at the specified location.
    3. whenht[0]All of the included key-value pairs have been migrated toht[1]After (ht[0]Becomes an empty hash tableht[0]That will beht[1]Set toht[0]And, inht[1]Create a new blank hash table for next timerehashTo prepare for.
  • Note: Progressive Rehash is not covered in this paper, the author is really not motivated to draw, the data structure is complicated, please understand.

conclusion

This article from the hash algorithm, hash collision solution, a brief analysis of the HashMap, Redis dictionary hash, views are not necessarily correct, please comment!

reference

  • zh.wikipedia.org/wiki/ hash function
  • Redis design and implementation
  • www.zhihu.com/question/26…