This article was originally published by AI Frontier. Original link: t.cn/RTVTqN2
Translator | Ma Zhuoqi
Editor | Emily
Database index structures are themselves models: B-tree indexes, hash maps, and bitmaps all map a key to a position. Starting from this observation, the paper argues that today's index structures can be replaced by other, learnable models. Preliminary results show that, compared with a cache-optimized B-tree, a neural network improved speed by 70% while greatly saving memory. Machine-learned models therefore have the potential to provide significant benefits over current state-of-the-art database indexes. If more comes of this research, perhaps we will look back and say that indexing was the first database component to take the hit, with the others (sorting algorithms, query optimization, joins) gradually replaced by neural networks (NN) as well. In any case, it is a promising line of research, and the paper is genuinely thought-provoking.
Research motivation
Databases started out as uniform, one-size-fits-all "black boxes". Over time, this view was refined into "standard sizes": OLAP databases and OLTP databases.
Databases use indexes to access data quickly. B-trees and hash maps are common techniques for implementing indexes. But from the "black box" point of view, the database treats the data as opaque and applies these indexing methods blindly, without making any assumptions about the data. Clearly, though, not knowing the data distribution prevents an algorithm from doing its best. Consider an experiment like this: if the keys are the dense integers 0 to 500M, indexing directly with the key is faster than hashing. If we know the cumulative distribution function (CDF) of the data, this observation extends to other distributions: the approximate position of the record a key points to is predicted_position ≈ CDF(key) × number_of_records.
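As a minimal sketch of this idea (hypothetical data; a real system would learn the CDF rather than assume it):

```python
import numpy as np

N = 100_000
keys = np.sort(np.random.uniform(0, 500_000_000, N))  # hypothetical key set

def predicted_position(key):
    # For uniformly distributed keys, CDF(key) ~= key / 500M, so:
    #   position ~= CDF(key) * number_of_records
    return int(key / 500_000_000 * N)

k = keys[N // 2]
print(predicted_position(k), "vs true position", N // 2)
```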
So by understanding the data distribution, we get a performance boost. But reusability is lost if we treat this as an all-"white box" problem: it is too expensive to analyze the data from scratch and hand-design an index every time.
This paper shows that by using neural networks to learn the data distribution, we get a "gray box" approach to index design: making the index data-aware yields the performance advantages without the per-dataset engineering.
Mainstream index structures fall into three categories, according to the access patterns they serve:
- B-trees, for range queries
- Hash maps, for point lookups
- Bloom filters, for set-membership queries
We focus on replacing parts of the B-tree structure with learned structures. For hash maps, the learned structure is simply a function derived from the cumulative distribution function of the data.
B-tree
A B-tree provides an efficient hierarchical index.
Why can a neural network model stand in for a B-tree? Conceptually, a B-tree is already a model, similar to a regression tree in machine learning: it maps a lookup key to a position within a page, with a minimum error of zero and a maximum error of the page size, and it guarantees that the key can be located if it exists. We can therefore replace this kind of index with any other machine learning model, including deep learning models, as long as it provides the same minimum and maximum error guarantees.
Next, how can other machine learning models provide the same error guarantees? All we need to do is run the model over every key and remember the worst over- and under-prediction of a position. Given a key, the model predicts its position; if the key exists, it is guaranteed to lie within the range defined by the prediction plus the minimum and maximum errors. Since we train the model on the existing data, we can replace the B-tree with any other kind of regression model, including linear regression and neural networks.
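A minimal sketch of this guarantee, using a single linear regression (names and structure hypothetical, not the paper's implementation):

```python
import numpy as np

class LearnedIndex:
    """A linear model over (key -> position), with min/max error
    bounds recorded on the training data."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        pos = np.arange(len(self.keys))
        # Fit position ~= slope * key + intercept by least squares.
        self.slope, self.intercept = np.polyfit(self.keys, pos, deg=1)
        err = pos - (self.slope * self.keys + self.intercept)
        # Worst under- and over-prediction seen on the data.
        self.min_err = int(np.floor(err.min()))
        self.max_err = int(np.ceil(err.max()))

    def lookup(self, key):
        pred = int(self.slope * key + self.intercept)
        lo = max(pred + self.min_err, 0)
        hi = min(pred + self.max_err + 1, len(self.keys))
        # If the key exists, it is guaranteed to lie in keys[lo:hi];
        # finish with a bounded search in that window.
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```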
We can obtain the following benefits by replacing the B-tree with a learned model:
- Smaller index: less memory, possibly fitting in L1 cache
- Faster lookups: because the index is smaller, lookups are faster
- More parallelism: model evaluation is mostly multiply-adds (TPU-friendly), whereas a B-tree must walk its hierarchy branch by branch
The key idea here is to trade memory for more computation, betting on the market trend of ever-cheaper compute (the benefits grow further if the model can run on a TPU or GPU).
The main algorithm
To overcome the challenges above and explore the potential of learned models as a replacement for, or improvement on, index structures, the authors propose the Learning Index Framework (LIF), Recursive Model Indexes (RMI), and standard-error-based search strategies. The authors focus primarily on simple, fully connected neural networks purely for their simplicity, but many other model types could be considered as well.
Learning Index Framework (LIF)
LIF can be thought of as an index synthesis system. Given an index specification, LIF can generate different index configurations, optimize them, and test them automatically. Given a trained TensorFlow model, LIF automatically extracts all the weights from the model and generates an efficient C++ index structure from the model specification. Experiments show that a simple model can be executed in just 30 nanoseconds.
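The paper does not publish LIF's generated code; as a rough illustration of the "extract the weights, emit straight-line arithmetic" idea (weights hypothetical):

```python
import numpy as np

# Weights assumed extracted from a trained two-layer ReLU net
# (hypothetical values); the "generated index" is just this
# straight-line arithmetic, with no TensorFlow graph left at all.
W1, b1 = np.array([[0.8, -0.3]]), np.array([0.1, 0.2])  # 1 input -> 2 hidden
W2, b2 = np.array([[1.5], [0.7]]), np.array([0.05])     # 2 hidden -> 1 output

def predict_position(key):
    h = np.maximum(key * W1[0] + b1, 0.0)  # hidden layer + ReLU
    return float(h @ W2[:, 0] + b2[0])     # predicted position
```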
Recursive Model Indexes (RMI)
Reducing the min/max error with a single model is very difficult, but it is quite easy for a simple model to achieve the accuracy gain of replacing just the first two layers of a B-tree. Similarly, errors are easier to reduce if each model can focus on only a subset of the data.
Based on these observations, the authors propose a recursive regression model.
The authors build a hierarchy of models. At each stage, a model takes the lookup key as input and uses its prediction to select the model at the next stage, until the final stage's model predicts the position. Each model is responsible for a region of the key space and, because that region is smaller, can predict positions within it with lower error.
This model has the following advantages:
(1) It exploits the overall shape of the data distribution, which is easy to learn.
(2) The structure effectively divides the key space into smaller sub-ranges, like a B-tree or decision tree, so the required "last mile" accuracy is reached with fewer operations.
(3) No explicit search is needed between stages.
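A minimal two-stage sketch of the idea, using linear models at both stages (sizing and names hypothetical):

```python
import numpy as np

class TwoStageRMI:
    """Minimal sketch of a two-stage Recursive Model Index: a root
    linear model routes each key to one of `fanout` leaf linear
    models, each of which predicts positions for its own key range."""

    def __init__(self, keys, fanout=16):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        self.fanout = fanout
        n = len(self.keys)
        pos = np.arange(n)
        # Stage 1: a single linear model over the whole key space.
        self.root = np.polyfit(self.keys, pos, deg=1)
        # Route every training key to a leaf via the root's prediction.
        routed = np.polyval(self.root, self.keys) * fanout / n
        leaf_of = np.clip(routed.astype(int), 0, fanout - 1)
        # Stage 2: one small linear model per subset of the data.
        self.leaves = []
        for i in range(fanout):
            subset = leaf_of == i
            if subset.sum() >= 2:
                self.leaves.append(
                    np.polyfit(self.keys[subset], pos[subset], deg=1))
            else:
                self.leaves.append(self.root)  # too few keys: reuse root

    def predict(self, key):
        routed = np.polyval(self.root, key) * self.fanout / len(self.keys)
        leaf = int(np.clip(routed, 0, self.fanout - 1))
        return int(np.polyval(self.leaves[leaf], key))
```

As in the single-model case, the last stage's predictions would be bracketed by per-leaf min/max errors to preserve the lookup guarantee.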
Hybrid index
Another advantage of the recursive model index is that hybrid models can be constructed. For example, at the top of the hierarchy a small neural network with ReLU activations may be the best choice, because it can learn a wide range of complex data distributions, while the bottom-level models can be a large number of simple linear regressions, because they consume very little space and execution time.
Note that hybrid indexes bound the worst-case performance of a learned index structure to that of a B-tree. In other words, when the data distribution is very hard to learn, all models are automatically replaced with B-trees, making the index, in effect, a complete B-tree.
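A sketch of that fallback rule (threshold and names hypothetical; a plain dict stands in for the B-tree):

```python
import numpy as np

MAX_ERR_THRESHOLD = 128  # hypothetical bound, e.g. one page of records

def finalize_leaf(leaf_model, leaf_keys, leaf_positions):
    # If the trained leaf's worst-case error is too large, fall back
    # to a classic structure (a dict stands in for the B-tree here).
    pred = np.polyval(leaf_model, leaf_keys)
    if np.abs(leaf_positions - pred).max() > MAX_ERR_THRESHOLD:
        return dict(zip(leaf_keys.tolist(), leaf_positions.tolist()))
    return leaf_model
```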
Search strategy
The paper proposes several search strategies:
(1) Model-biased binary search: the default strategy. The only difference from traditional binary search is that the first middle point is set to the position predicted by the model (see the sketch after this list).
(2) Biased search: an improvement on the default strategy in which each iteration divides the range unevenly to the left and right of the middle point.
(3) Biased quaternary search: finally, a new strategy in which each iteration tests three new points instead of a single new middle point, splitting the remaining range into four parts.
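A sketch of strategy (1), model-biased binary search (function name hypothetical):

```python
def model_biased_search(keys, target, predicted_pos):
    """Binary search over a sorted array, except the first probe is
    the model's predicted position instead of the middle."""
    lo, hi = 0, len(keys) - 1
    mid = max(lo, min(predicted_pos, hi))  # first probe: the prediction
    while lo <= hi:
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
        mid = (lo + hi) // 2  # later probes: ordinary binary search
    return None
```

The quaternary variant instead probes three points per iteration, splitting the remaining range into four parts.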
String index
The paper focuses on indexing real-valued keys, but many databases rely on string indexes. Fortunately, a great deal of machine learning research focuses on modeling strings.
Converting a string into a feature vector that can be fed to a model requires an encoding step. Since most machine learning models perform better with fixed-size inputs, the authors set a maximum input length of N. They use a relatively small hierarchy of feed-forward neural networks; the difference from the earlier models is that the input is a vector rather than a single real value.
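A minimal sketch of such an encoding (MAX_LEN plays the role of N; mapping characters to code points is an assumption, not the paper's exact scheme):

```python
import numpy as np

MAX_LEN = 12  # plays the role of N, the fixed input length

def encode_key(s: str) -> np.ndarray:
    """Truncate/zero-pad a string to MAX_LEN and map each character
    to its code point, yielding a fixed-size feature vector."""
    codes = [ord(c) for c in s[:MAX_LEN]]
    codes += [0] * (MAX_LEN - len(codes))
    return np.array(codes, dtype=np.float32)

encode_key("example.com")  # -> a length-12 float vector
```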
Designing a general machine learning model of the cumulative distribution function (CDF) of strings remains very difficult.
Experimental results
To compare learned indexes with B-tree indexes, the authors built secondary indexes over four datasets: three real-world datasets, (1) Weblogs, (2) Maps, and (3) Web-documents, plus one synthetic dataset, (4) Lognormal.
Numeric-key datasets
Weblogs dataset: comparison of the learned index structure and the B-tree index:
Maps dataset: comparison of the learned index structure and the B-tree index:
Lognormal dataset: comparison of the learned index structure and the B-tree index:
The authors mainly compare total lookup time, local search time, index structure size, space savings, model error, and error variance. The results show that in almost all configurations the learned index is up to three times faster than the B-tree and an order of magnitude smaller.
String index database
Web-documents, a string-key dataset: comparison of the learned index structure and the B-tree index:
The results show that the advantage of the learned index structure is much less pronounced for string keys. The authors attribute this to the high computational cost of running the model, a problem a GPU or TPU could address.
Comparison of learned string indexes under the different search strategies:
Future research challenges
Inserts and updates
(1) The trade-off between model generalization and "last mile" accuracy: the more accurate the model's final prediction, the more prone it is to overfitting and the harder it is to generalize to new data items.
(2) What if the data distribution changes? Can a learned index detect the change? Can search and insert operations stay O(log n), as in a B-tree?
The authors suggest a simpler way to handle inserts: build a delta index. All inserts go into a buffer and are merged in from time to time by retraining the model. This approach is already widely used and can be hardware-accelerated with GPUs or TPUs. Alternatively, training of each model can be warm-started by using the previous solution as a starting point; models optimized by gradient descent in particular benefit from this.
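A sketch of the delta-index idea (buffer size hypothetical; reuses the LearnedIndex sketch from the B-tree section above):

```python
import numpy as np

class DeltaLearnedIndex:
    """Inserts accumulate in a small buffer; when it fills, the
    buffer is merged into the data and the model is retrained."""

    def __init__(self, index, buffer_limit=4096):
        self.index = index            # e.g. the LearnedIndex sketch above
        self.buffer = []              # pending inserts
        self.buffer_limit = buffer_limit

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_limit:
            merged = np.concatenate([self.index.keys, np.array(self.buffer)])
            self.index = LearnedIndex(merged)  # retrain (could warm-start)
            self.buffer = []

    def lookup(self, key):
        if key in self.buffer:        # linear scan; the buffer stays small
            return ("buffer", key)
        return self.index.lookup(key)
```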
Paging
For indexes over data stored on disk, it is common to split the data into larger pages stored in different regions of the disk. In that case, the learned CDF of the data no longer maps directly to physical positions.
The authors suggest the following alternatives:
(1) Use the RMI structure to minimize the overlap between the regions covered by each model.
(2) Use a translation table (see the sketch after this list).
(3) Use a more complex model that learns the actual pointers to the pages.
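A minimal sketch of alternative (2) (layout and numbers hypothetical):

```python
RECORDS_PER_PAGE = 4  # hypothetical page capacity

# The model predicts a dense logical position; a translation table maps
# logical positions to (page, slot) on disk, absorbing the paging layout.
translation = {i: (i // RECORDS_PER_PAGE, i % RECORDS_PER_PAGE)
               for i in range(16)}

def locate(predicted_logical_pos):
    page_id, slot = translation[predicted_logical_pos]
    return page_id, slot  # a real system would now read page_id from disk

locate(9)  # -> (2, 1)
```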
For more details on the algorithm, please refer to the original paper:
Arxiv.org/abs/1712.01…
Original English article:
Muratbuffalo.blogspot.com/2017/12/pap…