Interviewer: Tell me about Redis Hash

The little sister of learning Java 0618

preface

Hello, lovely people, hello again. Today’s article is from last year’s interview reading essay interview question, the result is abused. This part does not say, next time special open one, write down my interview was abused famous scene, embarrassing not good, the whole conversation. Ha, ha, ha, ha, ha, ha, ha.

The Hash type is widely used in Redis, where the key-to-value mapping is maintained through dictionary structures. Take notes here.

Use the API

The API is relatively simple to use, so the following is sketched out.

Insert data hset

Use the hset command to insert two key/value pairs (name, Zhangsan) and (age,20) into the MyHash to return the length of the current Myhash.

Get data hGET

Run the hget command to obtain the value of myhash whose key is name.

Get all data hGEtall

Use the hgetall command to getall the keys and values in the MyHash.

Get all keys

Use the hkeys command to get all the keys in myHash.

To obtain the length of the

Use the hlen command to get the length of myhash.

Get all values

Get all values in myhash using the hvals command.

Concrete logic diagram

The text is about to begin. The bottom layer of hash mainly adopts the dict structure and presents layer encapsulation.

DictType is a dictType,dictht a core, rehashidx a marker of progressive hash, and iterators. The most important of these are Dictht and rehashidx.

Next up is dictht, which consists of two arrays, one for the real data store location, and one for the hash procedure, including the real data table and some common variables.

Finally, the data nodes, just like the bidirectional linked list mentioned in the previous part, each node has a next pointer, which is convenient to point to the next node, in order to solve the hash collision. See the picture below for details.

It doesn’t matter if you don’t understand it, we’ll explain it for each module later. (Don’t skip this one.)

Definition of a bidirectional linked list

Dict structure dict

Let’s start with dict, a dictionary structure with four parts, focusing on dictht[2](real data) and Rehashidx (tokens for progressive hash). The specific figure is as follows.

The specific code is as follows:

// typedef struct dict {dictType *type; // Type, including custom functions that enable keys and values to store void * privData; // Private data dictht HT [2]; / / twohashTable long rehashidx; / / progressivehashFlag, if -1, no progress is madehashunsigned long iterators; // Number of iterators being iterated} dict;Copy the code

Array structure dictht

Dictht mainly consists of four parts. 1 is the real array of data dictEntry type, which stores data nodes. 2 is the array length size; 3 is the parameter to hash, sizemask, that’s not important, just remember it’s equal to size-1; 4 indicates the number of data nodes. Used indicates the number of data nodes.

The specific code is as follows:

//hashTypedef struct dictht {dictEntry **table; // Array of real data unsigned long size; // Array size unsigned long sizemask; / / the user will behashThe value of the position index mapped to table is always equal to size-1 unsigned long used; } dictht;Copy the code

Data node dictEntry

DictEntry is the real data node, including the key, value, and next nodes.

// typedef struct dictEntry {void *key; //key union { void *val; uint64_t u64; int64_t s64; double d; } v; //value struct dictEntry *next; // the address of the next data node} dictEntry;Copy the code

Expansion process and progressive Hash diagrams

Let’s start with the first part, why does dictht[2] require two arrays to store real data in one array?

In fact, this is similar to Java HashMap, which is a data plus linked list structure. As the amount of data increases, hash collisions occur more frequently, and the linked list behind each array becomes longer, making the whole linked list very cumbersome. If the business needs a large number of query operations, because it is a linked list, the query can only be started from the head, and the next array can only be started after all the list of an array is queried, so the query time will be extended indefinitely.

This is definitely scaling up, so the first array holds the real data and the second array is used for scaling up. Nodes in the first array are hash mapped to the second array, and so on. Can we provide services to the outside world during the process? And the answer is yes, because it can stop at any time, which brings us to the next variable, rehashidx. (Not a stiff transition, hahaha)

Rehashidx is actually a marker. If it is -1, it indicates that the capacity is not expanded; if it is not -1, it indicates the subscript position to which the capacity is expanded to facilitate the expansion from the subscript position next time.

This is not too abstract, or a face meng force, close to the expansion of the process of all solutions, must point praise comment more praise me oh.

Step 1

First, before capacity expansion, rehashidx is -1, indicating capacity expansion. The dictEntry length of the first array is 4, and there are 5 nodes in total, so used is 5.

Step 2

When expansion occurs, rahashidx is the first subscript of the first array, which is 0. The enlarged size is the minimum that is greater than 2 to the n of used*2, that is, the minimum that can contain multiples of 2 of these nodes *2. Since there are currently 5 data nodes, used*2=10. After expansion, the size of the array is greater than 10 to the power of 2, which is the minimum value of 16. Start at the index 0 of the first array, look for the first element, find the node whose key is name and value is triple, hash it, find the node whose index is 1 in the second array, and move it over, which is actually a pointer move. So I’m just going to make it easy here.

Step 3

After the node whose key is name and value is zhang SAN is moved, the subsequent node with subscript 0 of the first array dictht[0] is moved in the same way as above.

Step 4

Continue to move the subsequent nodes with subscript 0 of the first array dictht[0], and start to move the node with subscript 1, finding that there is no data, so move the node with subscript 2, and change rehashidx to 2 at the same time. The move procedure is the same as above.

The whole process is focused on rehashidx, which is the subscript of the first array being moved, and can be stopped at any time if memory is running low or the operating system is busy.

What would it look like to operate on that object after it’s stopped?

If it’s a new array, you just add the second array, because if you add it to the first array, you still have to move it over, so there’s no need to waste time
If delete, update, query, search first array, if not found, query second array.

Dictionary implementation (source code analysis)

Create and initialize the dictionary

First, allocate memory, then call the initializer _dictInit, mainly assignment, focusing on rehashidx with a value of -1(this verifies the diagram, -1 indicates no hash expansion), and finally return whether the creation was successful.

/* Create and initialize the dictionary */ dict *dictCreate(dictType *type,void *privDataPtr) {      dict *d = zmalloc(sizeof(*d));      _dictInit(d,type,privDataPtr);      return d; }  /* Initialize the hash table */ int _dictInit(dict *d, dictType *type,void *privDataPtr) {      _dictReset(&d->ht[0]);      _dictReset(&d->ht[1]);      d->type = type; d->privdata = privDataPtr; d->rehashidx = -1; // Assign -1 to indicate no progresshash      d->iterators = 0; return DICT_OK; } Copy the code

capacity

Dict has a static method called dictexpandifNeed to determine whether an expansion is required.

DictIsRehashing (d) ((d)->rehashidx! If it is -1, it is not in hash state. If the condition is false, expansion can be performed; if it is not -1, it is in hash state. If the condition is true, expansion cannot be performed.

Then check whether the size of the first array is 0. If it is 0, expand to the default size 4. If it is not 0, execute the following code.

Then determine whether expansion is needed. There are three conditions in IF, and specific analysis is as follows.

At last, the dictExpand method is called, and the parameter is the double size of data node HT [0]. Used *2. This verifies that the array size of the expansion process above is 16.

The capacity expansion method is relatively simple. Obtain the expanded size and set the second size to the new size.

It’s kind of empty. Look at the flow chart.

Capacity Expansion Flowchart

Specific code:

Static int _dictExpandIfNeeded(dict *d) {// Determine whether the state is expanding by calling macro constants#define dictIsRehashing(d) ((d)->rehashidx ! If (dictIsRehashing(d)) return DICT_OK; // Check whether the first array size is 0, if 0, call expansion method, //#define DICT_HT_INITIAL_SIZE 4 if (d->ht[0]. Size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE); // static int dict_can_resize = 1; // static int dict_can_resize = 1; Static unsigned int dict_force_resize_ratio = 5; static unsigned int dict_force_resize_ratio = 5; / / we are to analyze the if condition, if the first array of all the nodes number greater than or equal to the size of the first array (table node data has some more) / / and available capacity (number 1) or all nodes number divided by the array size is greater than 5 / / conditions of the expansion and said that the need is greater than the first is the node is equal to the length of the array, // The second point is that the capacity can be expanded and the data is too much, More than 5 two choose one of the if (d - > ht [0]. 2 > = d - > ht [0]. The size && (dict_can_resize | | d - > ht [0]. 2 / d - > ht [0]. Size > Dict_force_resize_ratio)) {dictExpand(d, d->ht[0]. Used *2); } return DICT_OK; } int dictExpand(dict *d, unsigned long size) { dictht n; Unsigned long realsize = _dictNextPower(size); / / if some judgment conditions (dictIsRehashing (d) | | d - > ht [0]. 2 > size) return DICT_ERR; if (realsize == d->ht[0].size) return DICT_ERR; n.size = realsize; n.sizemask = realsize-1; n.table = zcalloc(realsize*sizeof(dictEntry*)); n.used = 0; If (d->ht[0]. Table == null) {d->ht[0] = n; return DICT_OK; } // Set the length of the second hash to new, d->ht[1] = n; d->rehashidx = 0; // Set the current hash return DICT_OK; } /* Find the minimum value greater than size, */ static unsigned long _dictNextPower(unsigned long size) {unsigned long I = DICT_HT_INITIAL_SIZE; if (size >= LONG_MAX) return LONG_MAX; while(1) { if (i >= size) return i; i *= 2; }}Copy the code

Progressive hash

The progressive hash process has been illustrated above, but let’s focus on how the code is implemented and whether the process is correct.

After the expansion, the dictRehash method is executed, taking the hash table D to be moved and the step number N.

First, check whether the marker rehashidx is equal to -1. If it is equal to -1, it means that the hash is complete. If it is not equal to -1, execute the following code.

We then loop through each subscript on the first array, updating the rehashidx value by 1 each time we move the subscript position.

Then carry on the second loop, traverses the next linked list each node, completes the data migration, mainly is the pointer movement and some parameter modification.

Finally, it returns an int value, 0 indicating that the entire data has been hashed, or 1 indicating that part of the data has been hashed but not all. You can continue the hash using the rehashidx value next time.

The specific code is as follows:

Static int _dictExpandIfNeeded(dict *d) {// Determine whether the state is expanding by calling macro constants#define dictIsRehashing(d) ((d)->rehashidx ! If (dictIsRehashing(d)) return DICT_OK; // Check whether the first array size is 0, if 0, call expansion method, //#define DICT_HT_INITIAL_SIZE 4 if (d->ht[0]. Size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE); // static int dict_can_resize = 1; // static int dict_can_resize = 1; Static unsigned int dict_force_resize_ratio = 5; static unsigned int dict_force_resize_ratio = 5; / / we are to analyze the if condition, if the first array of all the nodes number greater than or equal to the size of the first array (table node data has some more) / / and available capacity (number 1) or all nodes number divided by the array size is greater than 5 / / conditions of the expansion and said that the need is greater than the first is the node is equal to the length of the array, // The second point is that the capacity can be expanded and the data is too much, More than 5 two choose one of the if (d - > ht [0]. 2 > = d - > ht [0]. The size && (dict_can_resize | | d - > ht [0]. 2 / d - > ht [0]. Size > Dict_force_resize_ratio)) {dictExpand(d, d->ht[0]. Used *2); } return DICT_OK; } int dictExpand(dict *d, unsigned long size) { dictht n; Unsigned long realsize = _dictNextPower(size); / / if some judgment conditions (dictIsRehashing (d) | | d - > ht [0]. 2 > size) return DICT_ERR; if (realsize == d->ht[0].size) return DICT_ERR; n.size = realsize; n.sizemask = realsize-1; n.table = zcalloc(realsize*sizeof(dictEntry*)); n.used = 0; If (d->ht[0]. Table == null) {d->ht[0] = n; return DICT_OK; } // Set the length of the second hash to new, d->ht[1] = n; d->rehashidx = 0; // Set the current hash return DICT_OK; } /* Find the minimum value greater than size, */ static unsigned long _dictNextPower(unsigned long size) {unsigned long I = DICT_HT_INITIAL_SIZE; if (size >= LONG_MAX) return LONG_MAX; while(1) { if (i >= size) return i; i *= 2; }}Copy the code

conclusion

This paper mainly describes the Dict structure of Redis Hash data type, which is the bottom implementation of Dict structure. First, using some Hash API, Dict structure is derived, and its three main components are analyzed: Dict structure Dict, array structure Dictht, data node structure DictEntry, Furthermore, the process of expansion and rehash is explained through multiple process diagrams. Finally, the dictionary is described with source code, such as the creation process, expansion process, and progressive hash process, with flow charts in the middle.

Interviewer: Tell me about Redis Hash

Related Posts

How much CPU does a process/thread context switch cost you?

Explain the underlying architecture logic of the data center in detail

How do I know if parentheses are valid?