About hash tables

The concept of a hash table

A Hash table, also known as a Hash table, is a data structure that accesses a storage location in memory based on a key. Hash tables can be found in many high-level languages. For example, in Python, there is a data structure called dict, which is supposed to be based on hash tables. The following to look up the phone number in life as a popular example, explain what is a hash table.

An example of understanding a hash table

Think of a hash table as a phone book arranged alphabetically. When we look up Joe’s phone number, we can easily tell that Joe’s initial is Z. Ideally, John would be at the beginning of the Z in the phone book, but not ideally, he would have to find a few people later. Obviously, it’s much faster to look up phone numbers by their initials than it is to look up phone numbers one by one, which is the nature of hash tables. It’s fast.

Hash table nouns corresponding to the above example:

  1. Hash table: telephone book
  2. Key: Name, such as Zhang SAN
  3. Hash function (F) : calculation of the first letter of the name, the argument is the name, return the corresponding first letter
  4. Hash address: The return value of the hash function. In this example, A-Z can be understood as the hash address

What is conflict

For different keywords, the same hash address may be obtained through the calculation of hash function. For example, ba (key1) and Ba (key2) both start with B, even though they have different names (key1 ≠key2) but hash the same result (F(key1)=F(key2)=B). This situation is called conflict.

Methods of conflict resolution

There are many methods to resolve conflicts, such as open address method and chain address method, which can be selected according to the specific application scenario. Generally speaking, the chain address method is used more in actual projects and development. The basic idea of chained address method is to put the keywords of the same hash address in the same linked list. The hash table resolving conflicts by using the linked address method can be understood as the combination of array and linked list. In the figure above, the first letter is stored in an array of length 26, and each element of the array can be thought of as a single linked list. The linked list’s data field holds the name, and the pointer field points to the next node that holds the name with the same initial.

Dictionary Design

Now that we have an overview of hash tables, we can design and implement a dict that holds key-value pairs and retrieves val based on keys.

  1. Basic idea: using chain address method, using fixed-length Array (Array) + single-linked list to represent the dictionary, assuming that the Array length is SIZE, the initial hash table is an Array with all elements of 0.
  2. Key/value pair: Given a key-value pair (key1, val1), hash function F is used to calculate the hash value (hash_code), i.ehash_code1 = F(key1). And then, throughhash_code1 % SIZEGet the address (denoted by Array[index] since it is a position in an Array), and the modulo operation ensures that the address is in the range of the Array address. Next, create a new single-linked list node (Node 1), pointer fieldnextforNULL, the data field stores key1 and val1. Finally, Array[index] stores a pointer to node 1.
  3. Come into conflict: Given a key-value pair (key2, val2), key2≠ KEY1, if the hash function is used to calculate hash_code2 = hash_code1, then the same address Array[index] will be obtained, and the conflict will occur because the pointer to node 1 has been stored in this position of the Array.
  4. Resolve the conflict: Create a single linked list node (node 2), save key2 and val2 in the data field, and pointer to the fieldnextTo NULL. Let’s make node 1nextPointer to node 2 to resolve the conflict, this is the chain address method, also known as the zipper method, later conflicts, continue to use this method to resolve.
  5. Update operation: We inserted key-value pairs (key1, val1) previously. If a new key-value pair (key1, val0) needs to be inserted on this basis, where val0≠val1, the update operation is required. There are two ways to do this. The first way is to directly treat the node with this key as the first node in the corresponding array position. The second way is to find the node with key=key1 in the corresponding array position and update its val pointer.
  6. The dictionaryGiven a key, look at val. Array[index] = F(key) % SIZE = Array[index] = F(key) % SIZE = Array[index] = F(key) % SIZE = F(key) % SIZE Ideally equal, but may need to be along the nodes because of conflictsnextPointer down, and therefore, the hash algorithm does not have O(1) time. When the data is found, return. If there is no data, Array[index]=0 and NULL is returned.

Dictionary representation

/* Dictionary type */
#define DICT_TYPE_INT 0
#define DICT_TYPE_STR 1

typedef struct dict_entry {
    /* Successor node */
    struct dict_entry *next;
    /* 键 */
    void *key;
    /* 值 */
    void *val;
}dict_entry;

typedef struct dict {
    /* Hash function */
    unsigned int (*hash)(void *key);
    /* the table array holds the dict_entry pointer */
    dict_entry **table;
    /* Table array length */
    int size;
    /* Mask (size-1) */
    int sizemask;
}dict;
Copy the code

Start with the dict_entry structure, which has three members that represent the next pointer and keys and values for the nodes of a single linked list. Next comes the dict structure, which represents the dictionary itself.

  1. * Hash: Different types of keys, such as integers or character arrays (strings), need different hash functions. The member is a pointer to the hash function of the dictionary.
  2. tableNote that the member table is an array, used to storedict_entryType. You can use dict_entry* table[size] to help understand.
  3. Size: indicates the length of the table array.
  4. Sizemask: mask used to calculate array indexes through and operations. Sizemask = size-1. Given a number x, x % size is equivalent to x & sizemask. And operations may be faster than modular operations, so choose the former.

Function listing

Below are the names of the functions used to operate on queues and their effects and complexity

function role Algorithm complexity
hash_integer Computes the hash value of the integer key O(1)
hash_33 Computes the hash value of the character key O(N)
dict_create Creating a new dictionary O(1)
dict_create_entry Create a dict_entry O(1)
dict_put_entry The dictionary inserts an entry O(1)
dict_get_value Get val for key Best O(1), worst O(N)
dict_empty Clears all entries in the dictionary O(N2)
dict_release Free the entire dictionary O(N2)

Hash function selection

/* Hash function (for integers) */
static unsigned int hash_integer(void *key)
{
    return(* (int *)key * 2654435769) > >28;
}

/* Hash function TIME33 algorithm (for strings)*/
static unsigned int hash_33(void *key)
{   
    unsigned int hash = 0;
    while(* (char*)key ! =0)
    {
        /* Moving 5 bits to the left equals *32, and +hash equals *33; * /
        hash = (hash << 5) + hash + *(char *)key++;
    }
    return hash;
}
Copy the code

Hash function is a mapping relationship, the construction of hash function is a mathematical problem, there are many methods, in general, to minimize conflict, address distribution as far as possible. Here we choose a simple function for calculating integer hashes and the TIME33 algorithm for calculating string hashes. In addition, there is an algorithm called MurmurHash that is widely known for its use in Redis. It was invented in 2008 by Austin Appleby, the inventor of which was invited to work at Google.

Hash table creation

/* Create a dict */
dict *dict_create(int type)
{
    dict *dict = (struct dict *)malloc(sizeof(struct dict));
    if(dict == NULL) return NULL;
    if(type == DICT_TYPE_INT)
        dict->hash = &hash_integer;
    else
        dict->hash = &hash_33;
    dict->size = 1024;
    dict->sizemask = dict->size - 1;
    /* Allocates memory for the array */
    dict->table = (dict_entry **)malloc(sizeof(dict_entry *) *(dict->size));
    if (dict->table == NULL) return NULL;
    /* Set all array elements to zero */
    memset(dict->table, 0.sizeof(dict_entry *) * (dict->size));
    return dict;
}
Copy the code

The function takes a type argument to determine the type of the dictionary and hence the hash function. Then set the size of the dictionary and allocate memory for the table array. Then set all elements of the table to 0, indicating that the array is empty. Finally, the new dictionary is returned.

Create dict_entry

/* Create a dict_entry */
dict_entry * dict_create_entry(void *key, void *val)
{
    dict_entry * entry = (dict_entry *)malloc(sizeof(dict_entry));
    if(entry == NULL) return NULL;
    entry->key = key;
    entry->val = val;
    entry->next = NULL;
    return entry;
}
Copy the code

Create a dict_entry, which is the node of a single linked list. Two void Pointers are accepted as arguments to allow the dictionary to hold various types of data.

Dictionary inserts key-value pairs

The first method:

/* Dictionary inserts a key-value pair */
dict *dict_put_entry(dict *dict, void *key, void *val)
{
    unsigned int hash = dict->hash(key);
    int pos = hash & dict->sizemask;
    
    dict_entry *entry;
    entry = dict_create_entry(key, val);
    
    entry->next =  dict->table[pos];
    dict->table[pos] = entry;

    return dict;
}
Copy the code

This method is simple and effective. For new, collision, or update operations, the new node generated by the key-value pair to be inserted is the first node of the corresponding array position. Both additions and collisions are linked list inserts in nature, and updates are not substantial updates when using this method. Since the new node is the first node corresponding to the position of the array, the old data (nodes with the same key) is arranged after the new node, and the search starts from the first node in the linked list corresponding to the position of the array, so the new key-value pair is always found first.

The advantages and disadvantages:

  • Advantages: Simple and elegant operation, high insertion efficiency, no need to traverse the linked list and calculate the hash value of each node key.
  • Disadvantages: The old nodes are still in the linked list, so they take up a bit more memory.

It is worth mentioning that Redis’s DCIT uses this method when inserting key-value pairs.

The second method:

/* Dictionary inserts a key-value pair */
dict *dict_put_entry(dict *dict, void *key, void *val)
{
    unsigned int hash = dict->hash(key);
    int pos = hash & dict->sizemask;
    dict_entry *entry, *curr;
    / * new * /
    if(dict->table[pos]==0) {printf("The new \ n");
        entry = dict_create_entry(key, val);
        dict->table[pos] = entry;
    } else {
        curr = dict->table[pos];
        
        /* Check whether the first node meets the update condition */
        if(dict->hash(curr->key) == dict->hash(key)) {
            printf("Update \ n");
            curr->val = val;
            return dict;
        }

        /* If not, search down until a node with the same hash key is found, then update, * or until next==NULL, at which point the new element is added to the end of the list. * /
        while(curr->next ! =NULL) {    
            printf("Go down to \n");
            if(dict->hash(curr->next->key) == dict->hash(key)) {
                printf("Update \ n");
                curr->next->val = val;
                return dict;
            };
            curr = curr->next;
        }

        printf("Tail insert \n");
        entry = dict_create_entry(key, val);
        curr->next = entry;
    }
    return dict;
}
Copy the code

This method can refer to the dictionary design mentioned above. Its advantage is that it uses less memory, but its disadvantage is that it is not elegant enough and increases the algorithm complexity.

During debugging and testing, you can set dict->size to 1 to observe additions, updates, conflicts, etc.

The dictionary

/* dict gets the value */
void * dict_get_value(dict *dict, void *key) 
{
    unsigned int hash = dict->hash(key);
    int pos = hash & dict->sizemask;
    if(dict->table[pos]==0) return NULL;
    dict_entry *current = dict->table[pos];
    while (current)
    {
        if(dict->hash(current->key) == dict->hash(key))
            return current->val;
        else
            current = current->next;
    }
    return NULL;
}
Copy the code

To look up a dictionary is to look up val for a given key. Refer to the dictionary design mentioned above.

Dictionary cleanup and release

/* Clears all dict entries, not the dict itself */
void dict_empty(dict *dict)
{
    int i;
    for(i=0; i<dict->size; i++){if(dict->table[i] ! =0){
            dict_entry * current, *next;
            current = dict->table[i];
            while (current)
            {   
                next = current->next;
                free(current);
                current = next;
            }
            dict->table[i] = 0; }}}/* Release dict */
void dict_release(dict *dict)
{
    dict_empty(dict);
    free(dict->table);
    free(dict);
}
Copy the code

To clear all dict entries but not the dict itself, you only need to iterate over the table array, and iterate over the corresponding linked list when the non-0 element is found. To release the dict, you only need to release all entries and then the dict itself.

Test in the main function

int main(a)
{   
    /* Create a dictionary with a string key */
    dict * dict = dict_create(1);

    char str[] = "name";
    char str2[] = "Austin";
    char str3[] = "Lookcos";
    char str4[] = "age";
    int age = 18;

    /* key-value pairs :("Austin", "Austin") */
    dict_put_entry(dict, &str2, &str2);
    puts(dict_get_value(dict, &str2));

    /* key-value pair :("name", "Austin") */
    dict_put_entry(dict, &str, &str2);
    puts(dict_get_value(dict, &str));

    /* key-value pairs :("name", "Lookcos") */
    dict_put_entry(dict, &str, &str3);
    puts(dict_get_value(dict, &str));
    
    /* key-value pairs :("age", 18) */
    dict_put_entry(dict, &str4, &age);
    printf("age: %d\n", * (int *)dict_get_value(dict, &str4));

    /* Dictionary release */
    dict_empty(dict);
    dict_release(dict);

    return 0;
}
Copy the code

In the test, I used the second method to insert key pairs, and I also set the size in the dict to 1, so that there was only one place in the table to observe changes in the linked list during inserts, updates, and collisions.

Compiler output

# gcc -fsanitize=address -fno-omit-frame-pointer -g dict.c && ./a.outAdd Austin tail insert Austin look down update Lookcos look down tail insert age: 18Copy the code

The complete code

This article is from my Github project: Data Structures (C language description) github.com/LookCos/lea…