Planning editor | Natalie
Authors | Milad Hashemi et al.
Compiled by | Rays
Edited by | Emily
AI Front introduction: Last month, SysML 2018, a conference on systems and machine learning sponsored by Jeff Dean, Michael I. Jordan, Fei-Fei Li, Yann LeCun, and others, concluded at Stanford. The conference focuses specifically on the intersection of systems and machine learning, aiming to fill a gap in the current AI/ML conversation, which centers mostly on algorithms and applications and rarely discusses the underlying infrastructure.





In his keynote, Dean highlighted Google's research paper "The Case for Learned Index Structures," which made a splash in tech circles last year (see AI Front's earlier report). The key idea is to trade computation for memory: as CPU scaling driven by Moore's Law winds down, databases can benefit from hardware trends by replacing B-Tree-style index structures with neural networks. According to Jeff Dean, this is a very promising and interesting direction.
Taking an ML perspective on traditional systems development reveals many new applications. Beyond databases, where else in the system can ML be applied? According to Jeff Dean, a big opportunity lies in heuristics: they are used throughout computer systems, and wherever a heuristic is used, ML has a chance to make a difference.





Today's AI Front article is a follow-up to "Replacing Database Index Structures with ML Models": Google researchers attempt to use deep learning to attack the memory performance bottleneck of the von Neumann architecture, with encouraging results. Jeff Dean commented on Twitter that similar work will continue across many areas of computer systems.






This paper is a pioneering work in applying neural networks to computer architecture design; how to apply machine learning to computer hardware architecture is still rarely studied. The paper demonstrates the use of deep learning to attack the memory performance bottleneck of von Neumann architecture computers. It focuses on the key problem of learning memory access patterns, aiming to build an accurate and efficient memory prefetcher. The paper treats the prefetching strategy much like an n-gram model in NLP and uses RNN models to completely replace traditional table-based prefetchers. Experiments on benchmark suites show that the neural network models consistently deliver better precision and recall when predicting memory access patterns.






Follow the WeChat public account "AI Front" (ID: AI-front)



The following content is compiled by AI Front based on the paper:

Many problems in computer architecture design involve prediction and heuristic rules, and memory prefetching is a classic example. A memory prefetcher fetches instructions and data into faster memory before the microprocessor actually needs them. Because the microprocessor is orders of magnitude faster than memory access in a von Neumann computer, memory access latency is a key bottleneck, often called the Memory Wall. Statistics show that about 50% of compute cycles are spent waiting for data to arrive from memory. To mitigate the memory wall, microprocessors use a hierarchical memory design with multiple levels of cache, placing smaller but faster caches closer to the processor. To hide memory read latency, the hardware must predict when to prefetch which data into the cache. A key lever for improving performance is therefore making effective predictions about memory access patterns.

Modern processors already use several predictive mechanisms. Many predictors are built on record tables: in-memory array structures that correlate future events with past history. But modern data-center workloads are exponentially larger than traditional single-machine workloads, which poses a serious challenge for memory access pattern prediction. Prediction accuracy drops sharply once the workload outgrows the prediction table, and simply scaling up the table carries a significant hardware cost.

Neural networks have achieved strong results in NLP and text understanding, and they provide a powerful tool for sequence prediction problems. For example, the SPARC T4 processor already deploys a simple perceptron for branch prediction. However, how to effectively introduce sequence-learning algorithms into microprocessor architecture design remains an open problem.

This paper introduces sequence-based neural network models into the microarchitecture, aiming to attack the memory wall by applying sequence learning to the memory prefetching problem. Memory prefetching is essentially a regression problem. However, its output space is both enormous and extremely sparse, so a standard regression model struggles to give good results. The authors instead draw on recent work in image and language generation, such as PixelRNN and WaveNet, which discretizes the output space; this makes the prefetching model closer to a neural language model and serves as the starting point for building a neural-network prefetcher. The models proposed in the paper successfully capture the output space of the problem, yield a neural-network prefetcher, and significantly outperform existing traditional hardware prefetchers. In addition, the RNNs used in the paper can recognize the underlying semantics of an application from its memory access trace, and their results are more interpretable.

Background knowledge
Memory Prefetcher

A memory prefetcher is a hardware structure that uses the history of memory accesses to predict future accesses. Existing prefetchers fall into two broad categories: "stride prefetchers" and "correlation prefetchers." Stride prefetchers are common in modern processors; they lock onto a stable, repeatable address delta (the difference between two successive memory addresses in a sequence). For example, given the access pattern (0, 4, 8, 12), where every access has a delta of 4, the prefetcher predicts the following accesses as (16, 20, 24).
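As a concrete illustration, here is a minimal stride-prefetcher sketch in Python; real hardware tracks strides per load PC in small tables, so this only shows the basic idea:

```python
# Minimal stride-prefetcher sketch (illustrative only).
def stride_prefetch(history, degree=3):
    """Given recent miss addresses, predict the next `degree` addresses
    if the most recent deltas are stable and repeatable."""
    if len(history) < 3:
        return []
    deltas = [b - a for a, b in zip(history, history[1:])]
    stride = deltas[-1]
    # Only predict once the stride has repeated at least once.
    if deltas[-2] != stride:
        return []
    last = history[-1]
    return [last + stride * i for i in range(1, degree + 1)]

print(stride_prefetch([0, 4, 8, 12]))  # -> [16, 20, 24]
```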

Correlation prefetchers can predict repeating patterns whose deltas are not fixed. They keep the memory access history in a large record table. Compared with stride prefetchers, correlation prefetchers are better at predicting irregular patterns. Typical correlation prefetchers include Markov prefetchers, GHB prefetchers, and some recent research prefetchers that use large in-memory structures. Because correlation prefetchers require a large record table, they are expensive to implement and are generally not deployed in multi-core processors.
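For comparison, a toy correlation prefetcher might look like the following sketch, where a record table maps each miss address to the addresses that historically followed it; actual designs such as GHB are considerably more elaborate:

```python
# Toy correlation (Markov-style) prefetcher sketch.
from collections import defaultdict, deque

class CorrelationPrefetcher:
    def __init__(self, max_successors=2):
        # Record table: address -> most recent successor addresses.
        self.table = defaultdict(lambda: deque(maxlen=max_successors))
        self.prev = None

    def access(self, addr):
        # Learn: record that `addr` followed the previous miss address.
        if self.prev is not None:
            self.table[self.prev].appendleft(addr)
        self.prev = addr
        # Predict: prefetch whatever followed `addr` in the past.
        return list(self.table[addr])

pf = CorrelationPrefetcher()
for a in [100, 260, 517, 100, 260]:
    print(a, pf.access(a))   # second visit to 100 predicts 260, and so on
```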

Recurrent Neural Network (RNN)

Deep learning is well suited to many sequence prediction problems, such as speech recognition and natural language processing. In particular, RNNs can model long-range dependencies. The LSTM, a widely used RNN variant, addresses the training problems of standard RNNs by propagating its internal state additively rather than multiplicatively. An LSTM maintains a hidden state, a cell state, and input, forget, and output gates, which control what information is stored and what is propagated to the next time step. Given a time step t and input x_t, the LSTM state is computed as follows:

  1. Compute the input, forget, and output gates

  2. Update the cell state

  3. Compute the LSTM hidden state (output)

Here [x_t; h_{t-1}] denotes the concatenation of the current input and the previous hidden state, ⊙ denotes element-wise multiplication, and σ is the sigmoid non-linearity; W and b are the weights and biases of the layer. The procedure above describes a single LSTM layer. LSTM layers can be stacked, with the output of one layer at time step N serving as the input of the next layer, analogous to the multi-layer structure of a feed-forward network. This makes the model more flexible with relatively few additional parameters.
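For reference, a single LSTM step can be written out in NumPy as below; the notation follows the standard formulation, so the exact symbols may differ slightly from the paper's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*H, D+H), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # gates computed from [x_t; h_{t-1}]
    i = sigmoid(z[0*H:1*H])                     # 1. input gate
    f = sigmoid(z[1*H:2*H])                     #    forget gate
    o = sigmoid(z[2*H:3*H])                     #    output gate
    g = np.tanh(z[3*H:4*H])                     #    candidate cell update
    c_t = f * c_prev + i * g                    # 2. additive cell-state update
    h_t = o * np.tanh(c_t)                      # 3. hidden state (output)
    return h_t, c_t
```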

Problem definition
Treat prefetching as a prediction problem

Prefetching can be viewed as predicting future memory access patterns. If requested data is not in the cache (a cache miss), the prefetcher must fetch it from main memory based on the access history. Each memory address is generated by a memory operation (a load or store). Memory operations are a subset of the instruction set that read or write addressable memory.

Hardware prefetchers typically use two features when making prefetching decisions. One is the sequence of addresses that have missed in the cache so far; the other is the sequence of instruction addresses, known as program counters (PCs), of the instructions that generated each miss.

Each PC uniquely identifies an instruction, which the compiler generates for a particular function in a particular source file. The PC sequence tells the model about the control flow of the application, and the sequence of miss addresses tells the model which address it needs to predict next. In modern computer systems, both features are represented as 64-bit integers.

An initial model could use these two features at each time step: given the cache-miss address and the PC at step N, predict the miss address at step N+1. One problem, however, is that the application's address space is extremely sparse. In the training data for this problem, the distinct cache-miss addresses occupy only a vanishingly small fraction of the 2^64 physical address space. Figure 1 shows the cache misses of omnetpp from the standard SPEC CPU2006 benchmark suite, with red dots marking misses. The space is both wide-ranging and heavily multimodal, which challenges traditional time-series regression models. For example, although neural networks work well with normalized inputs, normalizing such data into a finite-precision floating-point representation loses significant information. This problem affects the modeling of both the input and the output layers, and the authors propose a solution for it in this paper.

Figure 1: Cache misses of the benchmark omnetpp. The sparse access pattern is visible at multiple scales.
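A small illustration of the precision issue mentioned above: once a 64-bit address is normalized into a floating-point value, its low-order bits are lost, because a double has only 53 bits of mantissa (the address below is arbitrary):

```python
addr = 0xDEADBEEFCAFE1234          # an arbitrary 64-bit address
scaled = addr / float(2**64)       # normalize into [0, 1) as a double
recovered = round(scaled * 2**64)
print(hex(addr), hex(recovered), addr == recovered)  # low-order bits are lost
```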

Treat prefetching as a classification problem

In the solution the authors propose, the entire address space is treated as a large, discrete vocabulary, and prediction becomes a classification task over that vocabulary. This is analogous to next-character or next-word prediction in NLP. The space is extremely sparse, and some addresses are accessed far more often than others, so an RNN model can plausibly manage an effective vocabulary of this size. Moreover, compared with unimodal regression techniques, a model that produces multimodal outputs is more flexible. In image processing and speech generation, several studies have successfully treated the output as a classification problem rather than a regression problem.

However, the prefetching problem has an enormous number of possible softmax targets, so some form of quantization is needed. More importantly, a prefetch is only useful if it is exactly right: this holds at the typical 64-byte cache-line granularity, and it holds at the typical 4096-byte page granularity. Even predicting at page granularity still leaves a huge number of possible targets. To avoid applying a softmax over so many values, some studies propose nonlinear quantization schemes that reduce the space to 256 classes. These solutions do not apply to the problem in this paper: they reduce the address resolution, and prefetching then cannot achieve the required precision.

Because of dynamic boundary effects such as address space layout randomization (ASLR), different runs of the same program may access different absolute addresses. However, for a given layout, the same program behaves consistently. A viable strategy is therefore to predict address deltas rather than the addresses themselves. As Table 1 shows, the number of distinct address deltas is often orders of magnitude smaller than the number of distinct addresses.

Table 1: Program trace dataset statistics ("M" denotes millions)
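To make the delta idea concrete, here is a toy example (the trace values are made up, not from the paper's datasets) showing that the set of distinct deltas is typically much smaller than the set of distinct addresses:

```python
# Toy miss-address trace (hypothetical values).
trace = [0x1000, 0x1040, 0x1080, 0x2000, 0x2040, 0x2080, 0x3000, 0x3040]

deltas = [b - a for a, b in zip(trace, trace[1:])]
print(sorted(set(trace)))   # 8 distinct addresses
print(sorted(set(deltas)))  # only 2 distinct deltas: 64 and 3968 (0xf80)
```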

Model description

This paper presents two LSTM-based prefetching models. The first, the "embedding LSTM" model, is analogous to a standard language model. The second exploits the structure of the memory access space to reduce the vocabulary size and the model's memory footprint.

Embedding LSTM model

This model limits the size of the output vocabulary and models only the most frequent address deltas. According to Table 1, the vocabulary size required to achieve an attainable accuracy of roughly 50% is well within the reach of a standard language model. Based on this, the first model limits the output vocabulary to the 50,000 most frequently occurring unique address deltas. For the input vocabulary, the model includes every delta that appears at least 10 times. Growing the vocabulary further is challenging both statistically and computationally; approaches such as hierarchical softmax could address this and merit further study.
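A sketch of how such a vocabulary could be constructed is shown below; the thresholds follow the text, while `miss_deltas` is a hypothetical list of deltas extracted from a trace:

```python
from collections import Counter

def build_vocab(miss_deltas, output_size=50000, min_input_count=10):
    counts = Counter(miss_deltas)
    # Output vocabulary: the most frequent deltas, capped at `output_size`.
    output_vocab = [d for d, _ in counts.most_common(output_size)]
    # Input vocabulary: every delta seen at least `min_input_count` times.
    input_vocab = [d for d, c in counts.items() if c >= min_input_count]
    return input_vocab, output_vocab
```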

The structure of the embedding LSTM is shown in Figure 2. The input and output address deltas are represented categorically (as one-hot indices over the vocabulary). At each time step, the PC and the delta are each embedded, and the two embeddings are concatenated to form the input of a two-layer LSTM. The LSTM then performs classification over the delta vocabulary, and the highest-probability deltas are selected for prefetching. A further advantage over traditional hardware is that the classification approach directly yields prediction probabilities.

Figure 2: Structure of the embedding LSTM model. The PC and delta inputs are passed through separate embedding functions.
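The following PyTorch sketch illustrates this architecture under assumed vocabulary sizes and dimensions; it is not the authors' implementation:

```python
import torch
import torch.nn as nn

class EmbeddingLSTMPrefetcher(nn.Module):
    def __init__(self, pc_vocab, delta_vocab_in, delta_vocab_out,
                 embed_dim=64, hidden_dim=256):
        super().__init__()
        self.pc_embed = nn.Embedding(pc_vocab, embed_dim)
        self.delta_embed = nn.Embedding(delta_vocab_in, embed_dim)
        # Concatenated PC and delta embeddings feed a two-layer LSTM.
        self.lstm = nn.LSTM(2 * embed_dim, hidden_dim, num_layers=2,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, delta_vocab_out)

    def forward(self, pcs, deltas):
        x = torch.cat([self.pc_embed(pcs), self.delta_embed(deltas)], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                    # logits over the delta vocabulary

# Usage with toy sizes: take the 10 highest-probability deltas at the last step.
model = EmbeddingLSTMPrefetcher(pc_vocab=1000, delta_vocab_in=2000,
                                delta_vocab_out=500)
pcs = torch.randint(0, 1000, (1, 8))             # (batch, time) PC token ids
deltas = torch.randint(0, 2000, (1, 8))          # (batch, time) delta token ids
logits = model(pcs, deltas)
top10 = logits[0, -1].topk(10).indices           # candidate deltas to prefetch
```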

In practice a prefetcher returns several predictions, so there is a trade-off: more predictions increase the chance of a cache hit at the next time step, but may evict other useful data from the cache. In this paper, the top 10 predictions of the LSTM head are prefetched at each time step. Other schemes are possible, such as beam search over subsequent time steps, or directly learning to predict several steps ahead in a single forward pass of the LSTM.

The embedding LSTM model has some limitations. First, a large vocabulary increases the model's computational and storage cost. Second, truncating the vocabulary necessarily places an upper bound on model accuracy. Finally, deltas that occur infrequently are hard to predict because they rarely appear in the training set; in NLP this is known as the rare word problem.

Combined clustering and LSTM model

This model assumes that most of the interesting interactions between addresses happen within a local region of the address space; for example, data structures such as structs and arrays are typically stored in contiguous blocks and accessed repeatedly by the same instructions. The model builds this idea into its design and models the local context carefully, whereas the embedding LSTM model looks at both local and global context.

Examining narrower regions of the address space shows that there is indeed rich local context. To illustrate this, a set of addresses was extracted from omnetpp and clustered into 6 regions with k-means; two of these clusters are shown in Figure 3.

Figure 3: Two of the six k-means clusters on the benchmark omnetpp. Memory accesses are colored by the PC that generated them.

To model local address-space regions accurately, the paper first applies k-means clustering to the raw address space, partitions the data into clusters, and computes address deltas within each cluster. Figure 4a visualizes this process. A significant advantage of this approach is that the set of deltas within a cluster is much smaller than the global delta vocabulary, which alleviates some of the problems of the embedding LSTM model.
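A sketch of this pre-processing step using scikit-learn's KMeans; the cluster count and interface are illustrative rather than the paper's actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_delta(addresses, n_clusters=6):
    # A real pipeline would need care: 64-bit addresses exceed double precision.
    addrs = np.asarray(addresses, dtype=np.float64).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(addrs)
    # Compute deltas separately inside each cluster, preserving trace order.
    per_cluster = {}
    for c in range(n_clusters):
        cluster_addrs = [a for a, l in zip(addresses, labels) if l == c]
        per_cluster[c] = [b - a for a, b in zip(cluster_addrs, cluster_addrs[1:])]
    return labels, per_cluster
```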

To reduce the model size further, a multi-task LSTM is used to model all clusters jointly, rather than training an independent LSTM per cluster. The cluster ID is added as an input feature, which effectively gives each cluster its own set of biases.

Partitioning the address space into smaller regions also means that the addresses within each cluster have roughly the same magnitude. The resulting deltas can therefore be normalized into real-valued inputs suitable for the LSTM, removing the need to maintain a large embedding matrix and further shrinking the model. Importantly, predicting the next delta can still be treated as a classification problem, since regression models are usually not precise enough in practice.
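One plausible way to realize this is sketched below: normalize each cluster's deltas into real-valued features and append a one-hot cluster ID. The scaling scheme here is an assumption, not taken from the paper:

```python
import numpy as np

def make_inputs(deltas, cluster_id, n_clusters=6):
    """Turn within-cluster deltas into normalized real-valued features plus
    a one-hot cluster ID, instead of a large embedding lookup."""
    d = np.asarray(deltas, dtype=np.float64)
    scale = np.abs(d).max() or 1.0                 # per-cluster scale (assumed)
    normalized = (d / scale).reshape(-1, 1)        # real-valued delta feature
    one_hot = np.zeros((len(d), n_clusters))
    one_hot[:, cluster_id] = 1.0                   # cluster ID feature
    return np.hstack([normalized, one_hot])        # shape: (T, 1 + n_clusters)
```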

This model resolves some of the issues of the embedding LSTM, at the cost of a pre-processing step (clustering the address space). It also models only local context, even though programs may dynamically access several different regions of the address space.

Figure 4: Structure and data processing of the combined clustering and LSTM model

Experiments

An effective memory prefetcher must predict cache misses accurately, so the experiments focus on measuring the effectiveness of prefetching.

Collection of experimental data

The experimental data consists of dynamic program traces containing the sequence of memory addresses computed by each program. Traces were captured with the dynamic instrumentation tool Pin, which attaches to a process and writes out the (PC, virtual address) tuples of the memory accesses made by the instrumented application. The raw trace mostly contains accesses, such as stack accesses, that hit in the cache; since this study is concerned with predicting cache misses, a simulator of an Intel Broadwell microprocessor was used to obtain the misses.

The test programs are the memory-intensive applications of SPEC CPU2006, a benchmark suite widely used to evaluate computer system performance. However, SPEC CPU2006 still has small working sets compared with modern data-center workloads, so Google's web search workload was added to the experiments.

In the experiments, each memory access dataset is split into a training set and a test set, with 70% used for training and 30% for testing. Each LSTM is trained independently on each dataset: the embedding LSTM is trained with Adam and the clustering LSTM with Adagrad. The hyperparameters are listed in the paper's appendix.

The metrics are precision and recall. Each model makes 10 predictions per time step, and a prediction is counted as correct if the true address delta appears among those 10.
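A sketch of this top-10 criterion; how precision and recall are aggregated over a full trace is not spelled out here, so this only shows the per-step check:

```python
def top10_hit_rate(predictions, true_deltas):
    """`predictions[t]` is the list of up to 10 deltas predicted at step t;
    a step counts as correct if the true delta appears among them."""
    hits = sum(1 for preds, truth in zip(predictions, true_deltas)
               if truth in preds[:10])
    return hits / len(true_deltas)

print(top10_hit_rate([[4, 8, 64], [4, 16]], [64, 32]))  # -> 0.5
```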

Model comparison

The LSTM-based prefetchers are compared against two hardware prefetchers. The first is a standard stream prefetcher; to keep the comparison with the machine-learning prefetchers fair, the experiments simulate a hardware structure supporting up to 10 concurrent streams. The second is the GHB PC/DC prefetcher, a correlation prefetcher that uses two tables: the first table stores PCs, which serve as pointers into the second table, and the second table stores the history of address deltas. On each access, the GHB prefetcher follows the pointer into the second table and prefetches the deltas recorded there. Correlation prefetchers handle complex memory access patterns well, but have lower recall than stream prefetchers.

Figure 5 compares the different prefetchers across the benchmark datasets. Although the stream prefetcher achieves the highest recall thanks to its dynamic vocabulary, the LSTM models perform better overall, especially in precision.

Figure 5: Precision and recall of traditional prefetchers and LSTM prefetchers. "Geomean" denotes the geometric mean.

Comparing the embedding LSTM with the combined clustering and LSTM model, the two achieve similar precision, while the latter tends to achieve better recall. Combining the two models could yield further gains, which is left for future work.

Comparing the input features (deltas vs. PCs)

In this experiment, either the address delta or the PC was removed from the input of the embedding LSTM model. The design is intended to determine how much predictive information each input carries.

The results are shown in Figure 6. Both the PCs and the address deltas carry substantial predictive information: the delta sequence contributes mainly to precision, while the PC sequence mainly improves recall.

Figure 6: Precision and recall of the embedding LSTM model with different input features

Interpreting program semantics

One major advantage of learned models over lookup-table-based predictors is that the model can be inspected to understand the data it has learned. Figure 7 shows a t-SNE visualization of the concatenated embeddings learned on the benchmark mcf.

Figure 7: t-SNE visualization of the concatenated embeddings on mcf, colored by the generating instruction (PC)
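Such an inspection could be reproduced along the following lines with scikit-learn's t-SNE; the embedding vectors and PC labels below are placeholders, not the paper's data:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder: concatenated (PC, delta) embedding vectors and their PC labels.
embeddings = np.random.randn(500, 128)
pc_labels = np.random.randint(0, 10, size=500)

coords = TSNE(n_components=2).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=pc_labels, s=5)
plt.title("t-SNE of concatenated embeddings, colored by PC")
plt.show()
```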

One cluster in the figure consists of a set of repeated instructions with similar code, produced by the compiler unrolling a loop body. Another cluster consists solely of pointer dereferences, caused by the program walking a linked list. In omnetpp, insertions into and removals from a data structure map to the same cluster, while data comparison operations map to a different cluster. The code for these examples is given in the paper's appendix.

Conclusion and further work

Learning and predicting application behavior can play an important role in handling control and data parallelism in computer architecture. The traditional approach uses table-based predictors, but they become too costly to implement when scaled to data-intensive, irregular workloads.

This paper introduced two neural-network-based prefetching models, which deliver significantly higher precision and recall than traditional table-based methods. In the experiments, the models were trained offline and tested online, with precision and recall as the evaluation metrics. The experiments show that the proposed prefetching models can effectively model the distribution of cache misses. There are, of course, other ways to improve predictors without increasing the computational and memory burden of the prefetcher. Future work could consider training on data about both cache hits and misses, although this would significantly change the size and distribution of the dataset.

Timeliness is also an important consideration for prefetching. If an RNN prefetcher issues a prefetch too early, the data may no longer be in the cache when the processor needs it; if it prefetches too late, the benefit is small because the cost of the memory access has largely been paid already. A simple heuristic is to behave like a stream prefetcher and predict several time steps ahead rather than only the next step.

Finally, the effectiveness of an RNN prefetcher could be evaluated by its impact on application performance; ideally, the RNN would directly optimize for it. One possibility is to use reinforcement learning to train the RNN in a dynamic environment.

Memory prefetching is, of course, not the only part of a computer system that relies on speculative prediction. Predictions are needed wherever branching behavior is involved: branch predictors redirect the control flow, and cache-replacement algorithms predict the best line to evict when a replacement decision must be made. If machine-learned systems replace the heuristics in traditional microarchitectures, inspecting those systems can also help us better understand their behavior. The t-SNE experiments in this paper offer only an initial glimpse, showing that memory access patterns are a representation of program behavior. This suggests that modern RNN-based systems open up many opportunities for computer architecture research.

View the original paper:

https://arxiv.org/pdf/1803.02329.pdf

For more content, follow AI Front (ID: AI-front) and reply "AI", "TF", or "big data" to receive the AI Front series of PDF mini-books and skill maps.