In December 2017, researchers at Google and the Massachusetts Institute of Technology published a paper on their work, "The Case for Learned Index Structures." The research is exciting, as the authors describe in their abstract:
"We believe that the idea of replacing core components of data management systems with learned models has profound implications for future system design, and this work provides only a glimpse of what is possible."
In fact, the research presented by the Google and MIT teams includes results that suggest new competition for the workhorses of the indexing world: B-trees and hash maps. With the engineering community abuzz about the future of machine learning, this research paper has been discussed in depth on Hacker News, Reddit, and throughout the engineering world.
New research is a great opportunity to revisit the fundamentals of a field, and it is not often that something as foundational (and as well studied) as indexing sees a breakthrough. This article serves as an introduction to hash tables, a brief look at what makes them fast and slow, and an intuitive introduction to the machine learning concepts being applied to indexing.
(If you are already familiar with hash tables, collision handling strategies, and hash function performance considerations, you may want to skim or skip part 2 of this article, which covers those topics.)
In response to the results of the Google/MIT collaboration, Peter Bailis and a team of Stanford researchers went back to basics and warned us not to throw out our algorithms textbooks just yet. Bailis and his team recreated the learned index strategy and were able to achieve similar results without any machine learning by using a classic hash table strategy called cuckoo hashing.
In response to the Google/MIT research, Thomas Neumann described another way to achieve performance similar to the learned index strategy without abandoning the well-tested and well-understood B-tree. This is also why the Google/MIT team was excited; as they wrote in their paper:
“Importantly, we do not advocate learning index structures to completely replace traditional index structures. Instead, we created a new approach to indexing that complements existing work and arguably opens up a whole new line of research in the field for decades to come.”
So are hash maps and B-trees destined to become aging relics? Are computers about to rewrite the algorithms textbooks? If machine learning strategies really are better than the general-purpose indexes we know and love, what does that mean for the computing world? Under what circumstances do learned indexes outperform the old standbys?
To answer these questions, we need to understand what indexes are and what problems they solve.
What is an index?
The whole point of indexing is to make things easier to find. Humans were indexing things long before computers were invented. When we use a well-organized filing cabinet, we are using an indexing system; a full set of encyclopedias can also be seen as an indexing strategy; the labeled aisles of a grocery store are a kind of index. Any time we have many things and need to find or identify a specific thing in the collection, an index can be used to make the task easier.
Zenodotus, the first librarian of the Great Library of Alexandria, was responsible for organizing the library's vast collection. The system he devised involved grouping books into rooms by genre and shelving them in alphabetical order. His colleague Callimachus went further, introducing a central catalog called the Pinakes, which allowed a librarian to look up an author and determine the location of each of that author's books in the library. (You can read more about ancient libraries here.) Library indexes have seen many innovations since, including the Dewey Decimal System, invented in 1876.
In the Library of Alexandria, indexes were used to map a piece of information (the name of a book or an author) to a physical location within the library. Even though our computers are digital devices, any particular piece of data in a computer also maps to a physical location. Whether it's the text of this article, a recent credit card transaction, or a video of a startled cat, the data exists in some physical place on a computer.
In RAM and on SSD drives, data is stored as voltage across a series of transistors. On older spinning hard disk drives, data is stored in a magnetic format on specific arcs of the platter. When we index the information in a computer, we create algorithms that map portions of that data to physical locations in the computer. In a computer, the things being indexed are always pieces of data, and indexes are used to map those pieces of data to their addresses.
Databases are the typical use case for indexing. Databases are designed to hold a lot of information, and in general we want to retrieve that information efficiently. At the heart of a search engine is a vast index of the information available on the Internet. Hash tables, binary search trees, tries, B-trees, and Bloom filters are all forms of indexes.
It's easy to imagine the challenge of finding something specific in the labyrinthine halls of a large library like that of Alexandria, but the scale of human-generated data is growing exponentially. The amount of data available on the Internet far exceeds that of any library from any era, and Google's goal is to index all of it. Humanity has created many strategies for indexing; here we examine one of the most prolific data structures of all time, which happens to be an indexing structure: the hash table.
What is a hash table?
At first glance, hash tables are simple data structures based on something called a hash function. Hash functions come in many different varieties and are used for many different purposes. For the following section, we describe only the hash functions used in hash tables, not cryptographic hash functions, checksums, or any other type of hash function.
A hash function accepts some input value (such as a number or some text) and returns an integer, which we call the hash code or hash value. For any given input, the hash code is always the same, which just means the hash function must be deterministic.
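To make determinism concrete, here is a minimal djb2-style string hash as an illustrative sketch. The function name and constants are ours, chosen for the example, not taken from any particular hash table implementation:

```javascript
// A djb2-style string hash: deterministic, so the same input
// always produces the same hash code.
function hashString(text) {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    // hash * 33 + next character code, kept in 32-bit unsigned range
    hash = (Math.imul(hash, 33) + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// The same input always yields the same integer.
console.log(hashString("hello") === hashString("hello")); // true
```

Because the function is deterministic, anyone who holds the key can recompute the hash code later and find the same array index again.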
When building a hash table, we first allocate some amount of space (in memory or in storage) for the table; you can imagine creating a new array of some arbitrary size. If we have a lot of data, we might use a larger array; if we have little data, a smaller one. Any time we want to index an individual piece of data, we create a key/value pair, where the key is some identifying information about the data (such as the primary key of a database record) and the value is the data itself (the whole database record).
To insert a value into the hash table, we send the data's key to the hash function. The hash function returns an integer (the hash code), and we use that integer, modulo the size of the array, as our value's storage index in the array. If we want to retrieve the value from the hash table, we simply recompute the hash code from the key and fetch the data from that location in the array. That location is our data's physical address.
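The insert and lookup steps above can be sketched as follows. This is a deliberately minimal version that ignores collisions entirely (they are handled later in the article); the names `put` and `get` are ours, not from any library:

```javascript
const SIZE = 16;
const table = new Array(SIZE).fill(null);

// Toy hash function for integer keys, as in the article's examples.
function hashFunction(key) {
  return (key * 13) % SIZE;
}

// Insert: store the key/value pair at the index given by the hash code.
function put(key, value) {
  table[hashFunction(key)] = { key, value };
}

// Lookup: recompute the hash code from the key and read that slot.
function get(key) {
  const slot = table[hashFunction(key)];
  return slot !== null && slot.key === key ? slot.value : undefined;
}

put(42, "a database record");
console.log(get(42)); // "a database record"
```

Note that `put` silently overwrites on a collision; that is exactly the gap the collision handling strategies below exist to fill.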
In the Dewey Decimal System, the key is the set of subjects a book belongs to, and the value is the book itself. The "hash code" is the numerical value we create using the Dewey Decimal process. For example, books on analytic geometry get a "hash code" of 516.3: science is 500, mathematics is 510, geometry is 516, and analytic geometry is 516.3. In this way, the Dewey Decimal System can be thought of as a hash function for books; the books are then placed on the set of shelves corresponding to their hash values and arranged alphabetically by author within the shelves.
Fundamentally, this simple process is all that a hash table is. However, a great deal of complexity is built on top of this simple idea to ensure that hash-based indexes are correct and efficient.
Performance considerations for hash-based indexes
The major source of complexity and optimization in hash tables is the problem of hash collisions. A collision occurs when two or more keys produce the same hash code. Consider this simple hash function, where keys are assumed to be integers:
function hashFunction(key) {
return (key * 13) % sizeOfArray;
}
A simple hash function
Although any unique integer multiplied by 13 yields a unique result, the resulting hash codes will still repeat because of the pigeonhole principle: if n items are put into m containers, and n > m, then at least one container must contain more than one item.
We'll discuss popular strategies for handling these inevitable collisions in a moment, but first it should be noted that the choice of hash function can increase or decrease the rate of collisions. Imagine we have 16 storage locations in total and must choose between these two hash functions:
function hash_a(key) {
return (13 * key) % 16;
}
function hash_b(key){
return (4 * key) % 16;
}
In this case, if we hash the numbers 0-32, hash_b produces 28 collisions: seven collisions each at hash values 0, 4, 8, and 12 (the first four inserts have no collision, but every insert after that does). hash_a, however, spreads the collisions evenly, one per index, for a total of 16 collisions. This is because in hash_b, the number we multiply by (4) is a factor of the hash table's size (16). We chose a prime number for hash_a, and unless our table size is a multiple of 13, we won't have the clustering problem we saw with hash_b.
To see this, you can run the following script:
function hash_a(key) { return (13 * key) % 16; }
function hash_b(key) { return (4 * key) % 16; }

let table_a = Array(16).fill(0);
let table_b = Array(16).fill(0);

for (let i = 0; i < 32; i++) {
  table_a[hash_a(i)] += 1;
  table_b[hash_b(i)] += 1;
}

console.log(table_a); // [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
console.log(table_b); // [8,0,0,0,8,0,0,0,8,0,0,0,8,0,0,0]
Better hash functions can avoid collisions.
This hashing strategy, multiplying the input key by a prime number, is in fact quite common. Primes reduce the likelihood that the output hash code shares a common factor with the array size, which reduces the likelihood of collisions. Since hash tables have been around for quite a while, there are many other competing hash functions to choose from.
Multiply-shift hashing is similar to the modulo strategy above, but it avoids the relatively expensive modulo operation in favor of a very fast bit shift. MurmurHash and Tabulation Hashing are strong alternatives to the multiply-shift family of hash functions. Benchmarking these hash functions involves checking their computational speed, the distribution of the hash codes they generate, and their flexibility in handling different types of data (such as strings and floating-point numbers in addition to integers). For an example of a benchmark suite for hash functions, see SMHasher.
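As an illustration, a 32-bit multiply-shift hash might be sketched like this. The multiplier here is an arbitrary odd constant chosen for the example (real implementations typically pick it at random), and the scheme assumes the table size is a power of two:

```javascript
// Multiply-shift hashing: multiply by an odd 32-bit constant, then
// keep only the top bits by shifting. The shift replaces the modulo
// operation and requires a power-of-two table size (here 2^4 = 16).
const MULTIPLIER = 0x9e3779b1 | 0; // arbitrary odd constant (example)
const BITS = 4;                    // log2 of the table size

function multiplyShiftHash(key) {
  // Math.imul multiplies as 32-bit integers (i.e., mod 2^32), and
  // >>> keeps the top BITS bits as an unsigned value in [0, 16).
  return Math.imul(key, MULTIPLIER) >>> (32 - BITS);
}

for (let i = 0; i < 5; i++) {
  console.log(multiplyShiftHash(i)); // an index in the range [0, 16)
}
```

The whole function compiles down to one multiply and one shift, which is why it is so much cheaper than a modulo by an arbitrary table size.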
If we choose a good hash function, we can lower our collision rate while still computing hash codes quickly. Unfortunately, no matter which hash function we choose, collisions eventually happen. Deciding how to handle them has a significant impact on the overall performance of our hash table. Two common collision handling strategies are chaining and linear probing.
Chaining is simple. Instead of storing a single item at each index of the hash table, we store a head pointer to a linked list. Any time an item's hash function leads it to an already occupied index, we add it as the last element of that index's linked list. Lookups are no longer strictly "constant time," because we may have to traverse the list to find any particular item. If our hash function causes many collisions, we will have long chains, and the hash table's performance will degrade as lookups grow longer.
Chaining: repeated collisions create longer linked lists but do not occupy any additional indexes of the array.
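A minimal chained hash table might be sketched as follows. For brevity this version uses arrays as buckets rather than true linked lists; the cache behavior discussed below applies to real linked-list nodes:

```javascript
const SIZE = 16;
// Each slot holds a bucket (a list) of all colliding entries.
const buckets = Array.from({ length: SIZE }, () => []);

function hashFunction(key) {
  return (key * 13) % SIZE;
}

function put(key, value) {
  const bucket = buckets[hashFunction(key)];
  const entry = bucket.find((e) => e.key === key);
  if (entry) entry.value = value;   // update an existing key
  else bucket.push({ key, value }); // append on a collision
}

function get(key) {
  // Walk the bucket: no longer strictly constant time.
  const entry = buckets[hashFunction(key)].find((e) => e.key === key);
  return entry ? entry.value : undefined;
}

put(1, "first");
put(17, "collides with 1"); // 1 and 17 both hash to index 13
console.log(get(1), get(17)); // "first" "collides with 1"
```

Both keys survive the collision because they share a bucket; the cost is that `get` now scans the bucket rather than reading a single slot.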
Linear probing is still simple in concept, but it is trickier to implement. In linear probing, each index in the hash table still holds a single element. When index i has a collision, we check whether index i + 1 is empty; if it is, we store the data there. If i + 1 also holds an element, we check i + 2, then i + 3, and so on until we find an empty slot. As soon as we find one, we insert the value. Once again, lookups may no longer be strictly constant time; with multiple collisions around an index, we may end up searching a long series of items before finding the one we're looking for. What's more, every collision increases the chance of subsequent collisions, because (unlike chaining) the incoming item ends up occupying a new index.
Linear probing: given the same data and hash function as the chaining image above, we get a new result. The elements that caused collisions (red) now reside in the same array, occupying the indexes in sequence after the collision index.
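The probing walk described above can be sketched as follows. This is a simplified version with no resizing or deletion; a completely full table would make `put` loop forever, so real implementations resize well before that point:

```javascript
const SIZE = 16;
const table = new Array(SIZE).fill(null);

function hashFunction(key) {
  return (key * 13) % SIZE;
}

function put(key, value) {
  let i = hashFunction(key);
  // On a collision, step forward (wrapping around the end of the
  // array) until we find a free slot or the same key.
  while (table[i] !== null && table[i].key !== key) {
    i = (i + 1) % SIZE;
  }
  table[i] = { key, value };
}

function get(key) {
  let i = hashFunction(key);
  // Probe in the same order as insertion; an empty slot means
  // the key was never inserted.
  while (table[i] !== null) {
    if (table[i].key === key) return table[i].value;
    i = (i + 1) % SIZE;
  }
  return undefined;
}

put(1, "a");
put(17, "b"); // collides with key 1, lands in the next free slot
console.log(get(17)); // "b"
```

Notice that key 17 now occupies index 14 rather than its "home" index 13, which is exactly how one collision raises the odds of the next one.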
This may make chaining sound like the better choice, but linear probing is widely accepted as having better performance characteristics. In most cases this is due to the poor cache utilization of linked lists and the favorable cache utilization of arrays. The short version is that checking every link in a linked list is much slower than checking every index in an array of the same size, because each index in the array is physically adjacent to the next. In a linked list, however, each new node is given a location when it is created, and that node is not necessarily physically adjacent to its neighbors in the list. As a result, nodes that are "next to each other" in list order are rarely physically next to each other inside our RAM chips. Because of how CPU caches work, access to adjacent memory locations is fast, and access to random memory locations is much slower.
This article was translated by the Aliyun community; the original title is "An Introduction to Hashing in the Era of Machine Learning," translated by "The tiger said eight ways." This is an abridged translation; for more details, please refer to the original text.