This series is based on the video course "JavaScript data structures and Algorithms". The author has hosted the notes, source code, and test environment on GitHub; you are welcome to Star and Fork.
It is recommended to study the series in directory order, moving from the simple to the deep, step by step, to get a solid grasp of data structures and algorithms.
Get to know hash tables
The hash table is a very important data structure; almost every programming language uses it, directly or indirectly.
Hash tables are usually implemented on top of arrays, but compared to arrays they have many advantages:
- Hash tables provide very fast insert, delete, and find operations.
- No matter how much data there is, insertion and deletion take roughly constant time, O(1); in practice only a few machine instructions.
- A hash table is faster than a tree and can locate the desired element almost instantly.
- Hash tables are much easier to code than trees.
Hash tables also have their drawbacks:
- The data in a hash table is not ordered, so you cannot traverse its elements in a fixed way (from small to large, for example).
- In general, the keys in a hash table may not be duplicated: the same key cannot be used to store different elements.
What is a hash table?
- Hash tables are not easy to understand, unlike arrays, linked lists, and trees, whose structure and principles can be graphically represented.
- The structure of a hash table is an array, but its magic lies in a transformation of the underlying values. We call this transformation the hash function, and what it produces is the HashCode.
Learn about hash tables through the following examples:
- Case 1: A company wants to store information about 1,000 employees, with each employee number corresponding to one employee's information. With an array, adding and deleting data is troublesome; with a linked list, fetching data is cumbersome. Is there a data structure that converts an employee's name into a corresponding employee number, and then looks up that employee's full information by number? Yes: the hash function of a hash table.
- Case 2: Storing contacts and their phone numbers. When looking up a number, an array is awkward because you do not know the subscript where the object is stored, and the same is true for a linked list. A hash table, however, can use a hash function to turn the name Zhang San into its corresponding subscript, and then look it up by subscript, which is very efficient.
In other words, a hash table is still array-based, but it uses a hash function to transform a string into a subscript, establishing a mapping between strings and subscripts.
So how is a string converted into a subscript?
To convert strings into subscripts, you need a coding system. To keep things easy to understand, let's create our own: a is 1, b is 2, c is 3, and so on up to z as 26, with space as 27 (ignoring case).
With a coding system, there are many ways to convert letters into numbers:
- Scheme 1: add the numbers together.
For example, converting cats to a number: 3 + 1 + 20 + 19 = 43, and 43 is used as the subscript of the word cats in the array.
The problem: many words produce 43 this way, such as was. Since each subscript in an array can store only one item, this scheme does not work.
- Scheme 2: multiply by powers.
Numbers greater than 10 already use powers of the base to guarantee uniqueness. Example: 6543 = 6 * 10^3 + 5 * 10^2 + 4 * 10 + 3; likewise, cats = 3 * 27^3 + 1 * 27^2 + 20 * 27 + 19 = 60337.
This guarantees uniqueness, but long strings (such as aaaaaaaaaa) produce huge numbers, which would require a huge array in which most subscripts point to invalid data (non-words such as zxcvvv), wasting array space.
Summary of the two schemes:
- Scheme 1 (summing the numbers) produces too few distinct subscripts.
- Scheme 2 (summing powers of 27) produces too many subscripts.
What is needed now is a way to compress the large range of integers produced by the power scheme into an acceptable array range. This can be done with the mod operation. Although the mod operation can produce duplicate results, such conflicts can be resolved in other ways.
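The two schemes and the mod compression can be sketched as follows, using the toy encoding a = 1 ... z = 26 from above. The function names `charToNumber`, `sumScheme`, and `powerScheme` are illustrative choices, not part of the original text.

```javascript
// Toy encoding: a = 1, b = 2, ..., z = 26 (case-insensitive).
function charToNumber(ch) {
  return ch.toLowerCase().charCodeAt(0) - "a".charCodeAt(0) + 1;
}

// Scheme 1: add the letter codes together (collides easily).
function sumScheme(word) {
  let total = 0;
  for (const ch of word) total += charToNumber(ch);
  return total;
}

// Scheme 2: powers of 27 (unique, but the numbers grow huge).
function powerScheme(word) {
  let total = 0;
  for (const ch of word) total = total * 27 + charToNumber(ch);
  return total;
}

console.log(sumScheme("cats")); // 43
console.log(sumScheme("was")); // 43: a collision
console.log(powerScheme("cats")); // 60337
console.log(powerScheme("cats") % 10); // 7: compressed into the range 0-9 with mod
```

Note how mod squeezes the huge power-scheme value into a small array range, at the cost of possible duplicates.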
Some concepts of hash tables
- Hashing: the process of converting a large number into an index within the array's range.
- Hash function: we usually convert a word into a large number, and put the code that hashes that large number into a function; this function is the hash function.
- Hash table: the structure that results from encapsulating the whole array into which the data is finally inserted, together with these operations.
Address conflict
In practice, the subscript values produced by the hash function may repeat; this is called a conflict (collision). Conflicts are inevitable; we can only resolve them.
There are two common ways to resolve conflicts: the chain address method (the zipper method, also known as separate chaining) and the open address method (open addressing).
Chain address method (zipper method)
As the figure below shows, each number is taken mod 10, so the remainders 0 through 9 serve as the array's subscript values. Instead of storing a single number at each subscript, the array stores at each position an array or linked list of all the numbers that share that remainder.
To look something up, we first get the whole array or list at the subscript, then continue searching within it. Usually a conflict chain does not hold many elements.
Summary: the chain address method resolves conflicts by storing a chain, rather than a single item, in each array cell. The chain is commonly an array or a linked list; the two perform similarly, because the chain usually holds only a few elements.
Open address method
The open address method works by finding a blank cell in which to place the conflicting data item.
Depending on how blank cells are probed, there are three methods:
- Linear probing
- Quadratic probing
- Rehashing
Linear probing
- Inserting 13:
Hashing (mod 10) gives index = 3, but the value 33 already occupies that cell. Linear probing starts from index + 1 and moves forward cell by cell until it finds a suitable position for 13; "suitable" means empty. In the figure above, index = 4 is that position.
- Querying 13:
  - First hash 13 to get index = 3. If the data stored at index = 3 equals the queried value 13, return it directly. If not, probe linearly from index + 1, checking position by position for 13.
  - The query does not need to traverse the whole hash table: it can stop as soon as it reaches an empty cell, because 13 could not have skipped over an empty cell to be inserted further on.
- Deleting 13:
  - Deletion is similar to the two cases above, with one caveat: when deleting an item, do not set its cell to null, or subsequent queries would be affected, since a search stops as soon as it meets a null cell.
  - Instead, mark the deleted cell with a special value (such as -1), so that a search that meets -1 knows to keep looking.
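The insert, query, and delete behavior just described can be sketched for numeric keys as follows. The table size 10, the `DELETED` marker (the text suggests -1), and the function names are illustrative choices, not from the original.

```javascript
// Minimal linear-probing sketch for numeric keys in a table of size 10.
const SIZE = 10;
const DELETED = -1; // deleted-cell marker, per the text's suggestion
const table = new Array(SIZE).fill(null);

function insert(key) {
  let index = key % SIZE;
  // Step forward one cell at a time until an empty (or deleted) cell is found.
  while (table[index] !== null && table[index] !== DELETED) {
    index = (index + 1) % SIZE;
  }
  table[index] = key;
  return index;
}

function search(key) {
  let index = key % SIZE;
  // Stop at the first truly empty cell: the key cannot lie beyond it.
  while (table[index] !== null) {
    if (table[index] === key) return index;
    index = (index + 1) % SIZE; // a DELETED cell is skipped, not a stop
  }
  return -1; // not found
}

function remove(key) {
  const index = search(key);
  if (index !== -1) table[index] = DELETED; // mark, do not null out
  return index;
}

insert(33); // lands at index 3
console.log(insert(13)); // 3 is taken, so 13 lands at index 4
console.log(search(13)); // 4
remove(33);
console.log(search(13)); // still 4: the DELETED marker keeps the probe going
```

If `remove` had written `null` instead of `DELETED`, the final `search(13)` would stop at index 3 and wrongly report 13 as missing.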
Problems with linear probing:
- Linear probing has a serious problem: clustering.
- Suppose 23, 24, 25, 26, 27 are inserted into an empty hash table; then the cells at subscripts 3, 4, 5, 6, and 7 all hold data. Such a run of filled cells is called a cluster.
- Clustering hurts hash table performance for insert, query, and delete alike.
- For example, inserting 13 now requires probing past the occupied cells 3 through 7, and this happens over and over as more data is inserted. Quadratic probing alleviates this problem.
Quadratic probing
The problem with linear probing mentioned above:
- If previous data was inserted consecutively, newly inserted data may have to be probed over a long distance.
Quadratic probing optimizes on top of linear probing:
- Linear probing can be viewed as probing with step 1: starting from index x, it tries x + 1, x + 2, x + 3, and so on.
- Quadratic probing optimizes the step size: starting from index x, it tries x + 1^2, x + 2^2, x + 3^2, and so on. Longer distances are covered in a single step, avoiding the effect of clustering.
- The problem with quadratic probing:
When inserting a set of keys that all hash to the same cell, such as 13, 163, 63, 2133, every one of them follows the same probe sequence, producing clusters of a different kind (though this is less likely than the clustering of linear probing), which also hurts performance.
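The two step-size strategies can be contrasted by generating their probe sequences for a key that hashes to index 3 in a table of size 10. The helper name `probeSequence` and the parameters are illustrative, not from the original text.

```javascript
// Generate the first few probe indices for a given start index and step rule.
function probeSequence(start, size, step, count) {
  const indices = [];
  for (let i = 1; i <= count; i++) {
    indices.push((start + step(i)) % size);
  }
  return indices;
}

const linear = probeSequence(3, 10, (i) => i, 4); // steps 1, 2, 3, 4
const quadratic = probeSequence(3, 10, (i) => i * i, 4); // steps 1, 4, 9, 16

console.log(linear); // [ 4, 5, 6, 7 ]: crawls through the cluster cell by cell
console.log(quadratic); // [ 4, 7, 2, 9 ]: jumps away from the cluster quickly
```

Note that any two keys hashing to index 3 still share the same quadratic sequence, which is exactly the residual clustering described above.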
Rehashing
The best way to find empty cells in the open address method is rehashing (also called double hashing).
- The steps of quadratic probing are fixed: 1, 4, 9, 16, and so on.
- What is needed is a probe sequence that depends on the keyword (the data), rather than every keyword using the same step sizes.
- That way, even keywords that map to the same array subscript use different probe sequences.
- Rehashing does this by applying a second hash function to the keyword and using the result as that keyword's step size.
The second hash function must satisfy two requirements:
- It must differ from the first hash function; otherwise it is just the same hash function again.
- Its output must never be 0; otherwise every probe would be an infinite loop that marks time in place.
A good second hash function:
- stepSize = constant - (key % constant);
- where constant is a prime number smaller than the array's capacity.
- For example, stepSize = 5 - (key % 5) meets both requirements: the result can never be 0.
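The formula from the text can be demonstrated directly. The table size 7, the sample keys, and the helper names `stepSize` and `probePath` are illustrative assumptions (the table size is deliberately a prime not sharing factors with the constant 5).

```javascript
// Second hash from the text: stepSize = 5 - (key % 5); never 0, at most 5.
const SIZE = 7; // illustrative table size, coprime with the constant 5
const stepSize = (key) => 5 - (key % 5);

// Follow the probe path a key would take from its home index.
function probePath(key, count) {
  const step = stepSize(key);
  const path = [];
  let index = key % SIZE;
  for (let i = 0; i < count; i++) {
    index = (index + step) % SIZE;
    path.push(index);
  }
  return path;
}

// 13 and 20 both hash to index 6, but get different step sizes,
// so their probe sequences diverge immediately:
console.log(stepSize(13), probePath(13, 3)); // 2 [ 1, 3, 5 ]
console.log(stepSize(20), probePath(20, 3)); // 5 [ 4, 2, 0 ]
```

This is the key difference from quadratic probing: colliding keys no longer chase each other along one shared path.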
Efficiency of hashing
Hash tables are very efficient for insertion and search.
- If no conflict occurs, efficiency is very high.
- If a conflict occurs, the access time depends on the length of the subsequent probes.
- The average probe length, and hence the average access time, depends on the loading factor: as the loading factor grows, probe lengths grow.
Loading factor
- The loading factor is the ratio of the number of data items in the hash table to the hash table's length.
- Loading factor = total data items / hash table length.
- With the open address method, the maximum loading factor is 1, because elements can only be placed in blank cells.
- With the chain address method, the loading factor can exceed 1, because the chains can, in principle, grow indefinitely.
Comparison of performance of different detection methods
- Linear probing
As the loading factor increases, the average probe length grows exponentially and performance is poor. In practice, the best loading factor depends on the balance between storage efficiency and speed: as the loading factor gets smaller, speed increases while storage efficiency decreases.
- Quadratic probing and rehashing
Quadratic probing and rehashing perform similarly, and slightly better than linear probing. As the figure below shows, as the loading factor increases, the average probe length still grows exponentially, so performance degrades.
- Chain address method
As the loading factor increases, the average probe length grows linearly and gently. In practice, the chain address method is the most widely used; Java's HashMap, for example, uses it.
The hash function
The advantage of a hash table is its speed, so the hash function cannot use a complex, performance-hungry algorithm. One way to stay fast is to minimize multiplication and division in the hash function.
A high-performance hash function should have the following two advantages:
- Quick calculation;
- Evenly distributed;
Fast calculation
Horner's rule (known in China as the Qin Jiushao algorithm) works as follows:
To evaluate a polynomial, first evaluate the innermost first-degree polynomial, then work outward, evaluating one first-degree polynomial per layer. This turns evaluating a degree-n polynomial f(x) into n evaluations of first-degree polynomials.
- Before the transformation:
  - multiplications: n(n+1)/2;
  - additions: n.
- After the transformation:
  - multiplications: n;
  - additions: n.
In big-O terms, the time complexity drops from O(N^2) to O(N).
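Horner's rule applied to the earlier cats example looks like this: 3 * 27^3 + 1 * 27^2 + 20 * 27 + 19 is rewritten as ((3 * 27 + 1) * 27 + 20) * 27 + 19, costing one multiplication and one addition per coefficient.

```javascript
// Evaluate the cats polynomial with Horner's rule: base 27, coefficients c, a, t, s.
const coefficients = [3, 1, 20, 19]; // c, a, t, s in the toy encoding

let value = 0;
for (const c of coefficients) {
  value = value * 27 + c; // one multiplication and one addition per coefficient
}
console.log(value); // 60337, matching the direct power computation
```

This is exactly the loop shape used by the hash function implemented below.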
Uniform distribution
When designing a hash table, we already have ways to handle values that map to the same subscript: the chain address method or the open address method. But for good efficiency, the data should be distributed as evenly as possible across the hash table, so prime numbers should be used wherever constants appear: the hash table length, the base of the N-th power, and so on.
Java's HashMap uses the chain address method and hashes with the formula index = HashCode(key) & (Length - 1); that is, it operates on the binary representation with an AND instead of a mod, which is faster. JavaScript's bitwise operations, however, are limited to 32-bit integers and large hash codes run into precision problems, so our JavaScript implementation hashes with the mod operation instead.
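A quick sketch of why the AND trick works: when the length is a power of two, length - 1 is a mask of all-one low bits, so the AND keeps exactly the remainder. The sample values here are illustrative.

```javascript
// For a power-of-two length, hashCode & (length - 1) === hashCode % length,
// because length - 1 is a mask of all-one low bits.
const length = 16; // a power of two, as Java's HashMap guarantees internally
const hashCode = 48690; // an illustrative hash code

console.log(hashCode & (length - 1)); // 2
console.log(hashCode % length); // 2
```

The equivalence only holds for power-of-two lengths; with a prime length like 7, mod is the operation that applies, which is what the implementation below uses.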
Encapsulate the hash table
Hash table common operations
- put(key, value): inserts or modifies an element.
- get(key): gets the element at the key's position in the hash table.
- remove(key): removes the element at the key's position in the hash table.
- isEmpty(): returns true if the hash table contains no elements, false if its length is greater than 0.
- size(): returns the number of elements in the hash table.
- resize(value): resizes (expands or compresses) the hash table.
A simple implementation of hash functions
The hashCode is first computed using Horner's rule, then hashed by taking it mod the array size; for simplicity, the size (limit) is passed in directly.
```javascript
function hashFn(string, limit = 7) {
  // A prime base (not mandatory, but a prime works well)
  const PRIME = 31;
  // 1. Define a variable to accumulate the hashCode
  let hashCode = 0;
  // 2. Compute the hashCode using Horner's rule
  for (const item of string) {
    hashCode = PRIME * hashCode + item.charCodeAt();
  }
  // 3. Take hashCode mod limit and return it
  return hashCode % limit;
}
```
Hash function test
```javascript
console.log(hashFn("123")); // => 5
console.log(hashFn("abc")); // => 6
```
Hash table implementation
Create the hash table class
Data structure model for encapsulated hash tables:
First create the HashTable class HashTable, and add the necessary attributes and hash functions implemented above, and then implement the other methods.
```javascript
class HashTable {
  constructor() {
    this.storage = []; // the array in which the hash table stores data
    this.count = 0; // the number of elements currently stored
    this.limit = 7; // hash table length (initially set to the prime 7)
  }
}
```
put(key,value)
Insertion and modification share one function in the hash table: when the user passes in a [key, value], it is an insertion if the key does not exist yet, and a modification if it already does.
Implementation idea:
- First, get the index from the key via the hash function, so the data can go to the corresponding position in storage.
- Then, fetch the bucket at that index; if the bucket does not exist, create it and place it at that index.
- Next, decide between adding and modifying: if the key already has a value in the bucket, modify it; otherwise continue.
- Finally, append the new data.
Code implementation
```javascript
// put(key, value): add or update data in the hash table
put(key, value) {
  // 1. Get the index from the key via the hash function
  const index = hashFn(key, this.limit);
  // 2. Fetch the bucket at that index
  let bucket = this.storage[index];
  // 3. Create the bucket if it does not exist yet
  if (bucket === undefined) {
    bucket = [];
    this.storage[index] = bucket;
  }
  // 4. Decide whether to insert new data or modify existing data
  for (let i = 0; i < bucket.length; i++) {
    const tuple = bucket[i]; // tuple format: [key, value]
    if (tuple[0] === key) {
      tuple[1] = value; // equal keys: modify the value
      return; // the tuple is updated, nothing further to do
    }
  }
  // 5. No matching key: append a new [key, value] tuple to the bucket
  bucket.push([key, value]);
  this.count++;
  // 6. Expand the hash table if the loading factor exceeds 0.75
  if (this.count / this.limit > this.loadFactor) {
    this.resize(this.getPrime(this.limit * 2));
  }
}
```
get(key)
Implementation idea:
- First, use the hash function to get the key's corresponding index in storage.
- Then, fetch the corresponding bucket by that index.
- Next, check whether the bucket exists; if not, return null directly.
- Then, traverse the bucket linearly, comparing each tuple's key with the given key; if they are equal, return the corresponding value.
- Finally, if the whole bucket has been traversed without finding the key, return null.
Code implementation
```javascript
// get(key): get the value for a key
get(key) {
  const index = hashFn(key, this.limit);
  const bucket = this.storage[index];
  if (bucket === undefined) {
    return null;
  }
  for (const tuple of bucket) {
    if (tuple[0] === key) {
      return tuple[1];
    }
  }
  return null;
}
```
remove(key)
Implementation idea:
- First, use the hash function to get the key's corresponding index in storage.
- Then, fetch the corresponding bucket by that index.
- Next, check whether the bucket exists; if not, return null directly.
- Then, search the bucket linearly for the matching data and delete it.
- Finally, if nothing was found, return null.
```javascript
// remove(key): delete the data for the given key
remove(key) {
  const index = hashFn(key, this.limit);
  const bucket = this.storage[index];
  if (bucket === undefined) {
    return null;
  }
  // Traverse the bucket, find the tuple, and delete it
  for (let i = 0, len = bucket.length; i < len; i++) {
    const tuple = bucket[i];
    if (tuple[0] === key) {
      bucket.splice(i, 1); // delete the matching array entry
      this.count--;
      // Compress the hash table if the loading factor falls low enough
      if (this.limit > 7 && this.count / this.limit < this.minLoadFactor) {
        this.resize(this.getPrime(Math.floor(this.limit / 2)));
      }
      return tuple;
    }
  }
  return null;
}
```
isEmpty()
```javascript
isEmpty() {
  return this.count === 0;
}
```
size()
```javascript
size() {
  return this.count;
}
```
Expansion and compression of hash table
Why is expansion needed?
- We used an array of length 7 for the hash table. Because the chain address method is used, the loadFactor can exceed 1, so the hash table can keep accepting new data without limit.
- But as the amount of data grows, each bucket (array or linked list) in storage gets longer and longer, which degrades the hash table's efficiency.
When does the capacity need to be expanded?
- Commonly, when loadFactor > 0.75, the capacity is expanded.
How is capacity expanded?
- A simple expansion doubles the capacity (primes are discussed later).
- After expansion, every existing data item must be re-placed into the new array.
Implementation idea:
- First, define a variable, such as oldStorage, pointing to the original storage.
- Then, create a new, larger array and point this.storage at it.
- Finally, take every piece of data out of each bucket in oldStorage and put it, one by one, into the new array that this.storage points to.
Implementation of resize()
Loading factor = data count in the hash table / hash table length, that is, loadFactor = count / limit.
The resize method can not only realize the expansion of the hash table, but also realize the compression of the hash table capacity.
```javascript
// resize(newLimit): resize the hash table (expand or compress)
resize(newLimit) {
  // 1. Save a reference to the old storage array
  const oldStorage = this.storage;
  // 2. Reset all attributes
  this.storage = [];
  this.count = 0;
  this.limit = newLimit;
  // 3. Traverse oldStorage and put every item into this.storage again
  for (const bucket of oldStorage) {
    if (bucket) {
      for (const b of bucket) {
        this.put(b[0], b[1]);
      }
    }
  }
}
```
- Generally, when the loading factor loadFactor > 0.75, the hash table is expanded. Add the following code to the put method of the hash table to decide whether resize needs to be called for expansion:

```javascript
// Expand the hash table if the loading factor exceeds 0.75
if (this.count / this.limit > this.loadFactor) {
  this.resize(this.getPrime(this.limit * 2));
}
```
- When the loading factor loadFactor < 0.25, the hash table's capacity is compressed. Add the following code to the remove method to decide whether resize needs to be called for compression:

```javascript
// Compress the hash table if the loading factor falls below the minimum
if (this.limit > 7 && this.count / this.limit < this.minLoadFactor) {
  this.resize(this.getPrime(Math.floor(this.limit / 2)));
}
```
Select a prime number as the hash table capacity
Primality test
Note that 1 is not a prime number.
- Method 1: a prime is divisible only by 1 and itself, not by any number from 2 to (number - 1), so traverse 2 to (number - 1). This works, but it is not efficient.

```javascript
function isPrime(number) {
  if (number <= 1) return false;
  for (let i = 2; i < number; i++) {
    if (number % i === 0) {
      return false;
    }
  }
  return true;
}
```
- Method 2: it is enough to iterate from 2 up to the square root of number, which performs much better.

```javascript
function isPrime(number) {
  if (number <= 1) return false;
  const temp = Math.floor(Math.sqrt(number));
  for (let i = 2; i <= temp; i++) {
    if (number % i === 0) {
      return false;
    }
  }
  return true;
}
```
The capacity of the expanded or compressed hash table is prime
Implementation idea:
After doubling (or halving) the capacity, check whether the result is prime by calling isPrime in a loop, adding 1 until it is. For example, doubling 7 gives 14: 14 is not prime, 15 is not prime, 16 is not prime, 17 is prime, so the loop stops and 17 is used.
- We add the isPrime and getPrime methods to the HashTable class:

```javascript
// getPrime(number): get the nearest prime >= the number passed in
getPrime(number) {
  while (!isPrime(number)) {
    number++;
  }
  return number;
}
```
- Modify the capacity-related operations in the put method (adding elements) and the remove method (removing elements):
Add the following code to the put method:

```javascript
// Expand the hash table if the loading factor exceeds 0.75
if (this.count / this.limit > this.loadFactor) {
  this.resize(this.getPrime(this.limit * 2));
}
```
Add the following code to the remove method:

```javascript
// Compress the hash table if the loading factor falls below the minimum
if (this.limit > 7 && this.count / this.limit < this.minLoadFactor) {
  this.resize(this.getPrime(Math.floor(this.limit / 2)));
}
```
Hash table complete implementation
```javascript
class HashTable {
  constructor() {
    this.storage = []; // the array in which the hash table stores data
    this.count = 0; // the number of elements currently stored
    this.limit = 7; // hash table length (initially set to the prime 7)
    // Loading factor thresholds (stored count / total length)
    this.loadFactor = 0.75;
    this.minLoadFactor = 0.25;
  }

  // getPrime(number): get the nearest prime >= the number passed in
  getPrime(number) {
    while (!isPrime(number)) {
      number++;
    }
    return number;
  }

  // put(key, value): add or update data in the hash table
  put(key, value) {
    // 1. Get the index from the key via the hash function
    const index = hashFn(key, this.limit);
    // 2. Fetch the bucket at that index
    let bucket = this.storage[index];
    // 3. Create the bucket if it does not exist yet
    if (bucket === undefined) {
      bucket = [];
      this.storage[index] = bucket;
    }
    // 4. Decide whether to insert new data or modify existing data
    for (let i = 0; i < bucket.length; i++) {
      const tuple = bucket[i]; // tuple format: [key, value]
      if (tuple[0] === key) {
        tuple[1] = value; // equal keys: modify the value
        return; // the tuple is updated, nothing further to do
      }
    }
    // 5. No matching key: append a new [key, value] tuple to the bucket
    bucket.push([key, value]);
    this.count++;
    // 6. Expand the hash table if the loading factor exceeds 0.75
    if (this.count / this.limit > this.loadFactor) {
      this.resize(this.getPrime(this.limit * 2));
    }
  }

  // get(key): get the value for a key
  get(key) {
    const index = hashFn(key, this.limit);
    const bucket = this.storage[index];
    if (bucket === undefined) {
      return null;
    }
    for (const tuple of bucket) {
      if (tuple[0] === key) {
        return tuple[1];
      }
    }
    return null;
  }

  // remove(key): delete the data for the given key
  remove(key) {
    const index = hashFn(key, this.limit);
    const bucket = this.storage[index];
    if (bucket === undefined) {
      return null;
    }
    // Traverse the bucket, find the tuple, and delete it
    for (let i = 0, len = bucket.length; i < len; i++) {
      const tuple = bucket[i];
      if (tuple[0] === key) {
        bucket.splice(i, 1); // delete the matching array entry
        this.count--;
        // Compress the hash table if the loading factor falls low enough
        if (this.limit > 7 && this.count / this.limit < this.minLoadFactor) {
          this.resize(this.getPrime(Math.floor(this.limit / 2)));
        }
        return tuple;
      }
    }
    return null;
  }

  isEmpty() {
    return this.count === 0;
  }

  size() {
    return this.count;
  }

  // resize(newLimit): resize the hash table (expand or compress)
  resize(newLimit) {
    // 1. Save a reference to the old storage array
    const oldStorage = this.storage;
    // 2. Reset all attributes
    this.storage = [];
    this.count = 0;
    this.limit = newLimit;
    // 3. Traverse oldStorage and put every item into this.storage again
    for (const bucket of oldStorage) {
      if (bucket) {
        for (const b of bucket) {
          this.put(b[0], b[1]);
        }
      }
    }
  }
}
```
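To check the class end to end, here is a condensed, self-contained copy together with a usage example. The resize/getPrime logic is omitted for brevity, and the sample keys and values are made up for illustration.

```javascript
// Condensed copy of the article's chaining hash table (no resize), plus usage.
function hashFn(string, limit = 7) {
  let hashCode = 0;
  for (const item of string) {
    hashCode = 31 * hashCode + item.charCodeAt();
  }
  return hashCode % limit;
}

class HashTable {
  constructor() {
    this.storage = [];
    this.count = 0;
    this.limit = 7;
  }
  put(key, value) {
    const index = hashFn(key, this.limit);
    let bucket = this.storage[index];
    if (bucket === undefined) {
      bucket = [];
      this.storage[index] = bucket;
    }
    for (const tuple of bucket) {
      if (tuple[0] === key) {
        tuple[1] = value; // same key: modify, not insert
        return;
      }
    }
    bucket.push([key, value]);
    this.count++;
  }
  get(key) {
    const bucket = this.storage[hashFn(key, this.limit)];
    if (bucket === undefined) return null;
    for (const tuple of bucket) {
      if (tuple[0] === key) return tuple[1];
    }
    return null;
  }
  remove(key) {
    const bucket = this.storage[hashFn(key, this.limit)];
    if (bucket === undefined) return null;
    for (let i = 0; i < bucket.length; i++) {
      if (bucket[i][0] === key) {
        this.count--;
        return bucket.splice(i, 1)[0];
      }
    }
    return null;
  }
  size() {
    return this.count;
  }
  isEmpty() {
    return this.count === 0;
  }
}

const table = new HashTable();
table.put("name", "Zhang San");
table.put("age", 18);
table.put("name", "Li Si"); // same key: modifies the value
console.log(table.size()); // 2
console.log(table.get("name")); // Li Si
console.log(table.get("job")); // null
table.remove("age");
console.log(table.get("age")); // null
```

The full class above behaves the same way, except that it also grows and shrinks the underlying array as the loading factor crosses 0.75 and 0.25.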
If this article helped you, please give the author a like ❤️
Album series
- Learn JavaScript data structures and algorithms from 0
- Learn JavaScript data structures and algorithms from 0 (2) arrays
- Learn JavaScript data structures and algorithms from 0 onwards (iii) stack
- Learn JavaScript data structures and algorithms from 0 (iv) queue
- Learning JavaScript data structures and algorithms from 0 (5) Priority queue
- Learning JavaScript data structures and algorithms from 0 (6) one-way linked lists
- Learn JavaScript data structures and algorithms from 0
- Learn JavaScript data structures and algorithms from 0 onwards
- Learn JavaScript data structures and algorithms from 0