Excerpted from an arXiv paper by Keno Fischer and Elliot Saba; compiled by the Hearts of Machines Editorial Board.
The Julia language is evolving rapidly and can be considered to combine the flexibility of Python with the speed of C. However, TensorFlow and PyTorch do not currently offer official Julia support. Some researchers have therefore recently built TPU support for Julia on top of the underlying XLA compiler. They report that this approach can fuse a VGG19 model written in pure Julia into a single TPU executable and invoke the TPU for efficient computation. “Julia + TPUs = fast and easily expressible ML computations!” tweeted Jeff Dean, head of Google.ai.
1. Introduction
One of the fundamental forces driving the steady advance of machine learning over the past few years is the enormous amount of computation available for training and optimizing models. Many of the underlying techniques have existed for years; only recent gains in computing power have made them good enough to solve real-world problems. Much of that computing power comes from GPUs: their vector-oriented compute units were originally designed for graphics, but machine learning models tend to be dominated by large matrix operations, for which GPUs also perform very well.
The real-world success of GPUs, especially in machine learning, has sparked a wave of innovation among hardware designers, who are now building new accelerators dedicated to machine learning workloads. However, while GPUs have long been backed by mature software stacks such as CUDA, those libraries generally do not carry over to new non-GPU accelerators, and developing software for them remains a challenge.
In 2017, Google announced that it would make its proprietary TPU machine learning accelerator available to the public via cloud services. Initially, use of the TPU was limited to applications written with Google’s TensorFlow machine learning framework. Fortunately, in September 2018, Google opened up access to the TPU via the IR of the lower-level XLA (Accelerated Linear Algebra) compiler. This IR is a general-purpose intermediate representation for expressing arbitrary computations built from linear algebra primitives, and it therefore provides a good foundation both for non-TensorFlow users of the TPU and for non-machine-learning workloads.
In this article, we describe our initial work on compiling general Julia code through this interface, which in turn gives it access to TPUs on Google Cloud. This approach contrasts with the one taken by TensorFlow (Abadi et al., 2016), which does not compile Python code itself but instead builds a graph in Python and then compiles that graph. It is aesthetically similar to JAX (Frostig et al., 2018), whose goal is to offload computations written in Python by tracing high-level array operations. Importantly, however, we do not rely on tracing; instead, we use Julia’s static analysis and compilation capabilities to compile the whole program, including all of its control flow, to the device.
It is worth noting that our approach lets users take full advantage of the expressive power of the Julia language when writing models. That expressiveness shows up in high-level features such as multiple dispatch and higher-order functions, and in existing libraries such as differential equation solvers (Rackauckas & Nie, 2017) and general linear algebra routines. Because it operates on pure Julia code, it is also compatible with the Zygote.jl (Innes, 2018) automatic differentiation tool, which performs automatic differentiation as a high-level compiler pass. Overall, we were able to compile a complete machine learning model written in the Flux machine learning framework, fuse its forward pass, backward pass, and training loop into a single executable, and offload it to the TPU.
Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs
Link to the paper: arxiv.org/abs/1810.09…
Abstract: Google Cloud TPUs are a promising new hardware architecture for machine learning workloads and have powered many of Google’s milestone machine learning achievements in recent years. Google has now made TPUs available to the general public on its cloud platform, and has recently opened them up further to allow use by non-TensorFlow front ends. We describe a method and an implementation for offloading suitable portions of a Julia program to TPUs through this new API and the Google XLA compiler. Our approach is able to fuse the forward pass of a VGG19 model written in Julia into a single TPU executable that can be offloaded to the device. It also composes well with existing compiler-based automatic differentiation techniques for Julia code, so we can likewise automatically obtain the backward pass of VGG19 and offload it to the TPU. Using our compiler to access the TPU, we are able to run VGG19 forward propagation on a batch of 100 images in 0.23 seconds, compared with 52.4 seconds for the original model running on the CPU. Our implementation requires fewer than 1000 lines of Julia code and makes no TPU-specific changes to the core Julia compiler or to any other Julia package.
5. Mapping Julia Semantics to XLA
As long as a Julia program is written in terms of XLA primitives, we can compile it to XLA. However, Julia programs are not written against arcane HLO operations; they are written against the functions and abstractions provided by Julia’s base library. Fortunately, Julia’s multiple dispatch makes it easy to express the standard library’s abstractions in terms of HLO operations. A few simple examples follow:
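The original listing is omitted in this excerpt. As a rough illustration of the idea, here is a minimal, self-contained sketch: the XRTArray, HloDot, and HloTranspose names below are stand-ins modeled on the paper’s HloOp naming convention, not its actual definitions, and the host-side implementations merely model the semantics that the real compiler would emit as XLA nodes.

    # Stand-in for the paper's on-device array type.
    struct XRTArray{T,N}
        data::Array{T,N}
    end

    # Stand-ins for HLO operations; calling one would normally emit the
    # corresponding XLA node. Here we simply model the semantics on the host.
    struct HloDot end
    struct HloTranspose
        permutation::NTuple{2,Int}
    end

    (op::HloDot)(A::XRTArray, B::XRTArray) = XRTArray(A.data * B.data)
    (op::HloTranspose)(A::XRTArray) = XRTArray(permutedims(A.data, op.permutation))

    # With the primitives in place, Base abstractions become one-line
    # method definitions thanks to multiple dispatch:
    Base.:*(A::XRTArray{T,2}, B::XRTArray{T,2}) where {T} = HloDot()(A, B)
    Base.transpose(A::XRTArray{T,2}) where {T} = HloTranspose((2, 1))(A)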
Beyond these simple operations, we also provide implementations of the higher-level array abstractions, in particular mapreduce and broadcast. The broadcast implementation in terms of HLO operations is about 20 lines of code and is omitted here to save space; the mapreduce implementation, however, is very simple:
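The listing is again missing from this excerpt. Continuing with the stand-in types from the sketch above, a simplified version might look like the following (the real implementation emits HloMap and HloReduce nodes and handles initial values and reduction dimensions; this sketch only models full reductions on the host):

    # Stand-ins for the HLO Map and Reduce operations.
    struct HloMap{F}
        f::F
    end
    struct HloReduce{F}
        op::F
    end

    (m::HloMap)(A::XRTArray) = XRTArray(map(m.f, A.data))
    (r::HloReduce)(A::XRTArray) = reduce(r.op, A.data)   # full reduction to a scalar

    # mapreduce over an on-device array: map f elementwise, then reduce with op.
    # (Empty arrays and the `dims` keyword are ignored here for brevity.)
    Base.mapreduce(f, op, A::XRTArray) = HloReduce(op)(HloMap(f)(A))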
This shows the effect of being able to use arbitrary Julia functions as static computations. Because Julia relies so heavily on generic abstractions, specifying very few definitions covers a large number of APIs. In particular, from the mapreduce definition we automatically get the reductions defined in Base, such as sum and prod. In fact, obtaining enough API coverage to compile both the forward and backward passes of the VGG19 model required fewer than 200 lines of definitions.
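For example, because Julia’s Base library already defines sum and prod in terms of mapreduce, the single definition above is enough for both to work on the device array type. With the stand-in types from the sketches above:

    x = XRTArray(Float32[1 2; 3 4])

    # Both calls bottom out in the XRTArray mapreduce method defined above:
    # sum(x)  ==> mapreduce(identity, Base.add_sum, x)  ==> 10.0f0
    # prod(x) ==> mapreduce(identity, Base.mul_prod, x) ==> 24.0f0
    sum(x)
    prod(x)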
5.1 Structure Mapping
We make one additional identification: any tuple or immutable structure in the embedded IR is mapped to an XLA tuple. For example, the Julia value 1 + 2im (a complex number made up of two integers) is mapped to the XLA tuple (s64[], s64[]). We preserve the Julia type of the structure in the Julia embedding of the XLA IR, but XLA itself has no notion of Julia types, so these are converted to the appropriate tuples in the final translation step. Similarly, (Julia) tuple constructors and immutable constructors become XLA tuple construction, and tuple references and immutable field references become XLA tuple element references.
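As a concrete illustration of this mapping (the Point struct is our own example; the shapes in the comments use XLA’s textual notation, where s64[] is a scalar 64-bit signed integer and f32[] a scalar 32-bit float):

    x = 1 + 2im        # Complex{Int64}: an immutable struct with two Int64 fields
    # ==> mapped to the XLA tuple (s64[], s64[])

    struct Point       # any immutable Julia struct follows the same rule
        x::Float32
        y::Float32
    end
    p = Point(1f0, 2f0)
    # ==> mapped to the XLA tuple (f32[], f32[])

    # The constructor call Point(1f0, 2f0) becomes XLA tuple construction,
    # and a field reference such as p.x becomes an XLA get-tuple-element.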
5.2 Handling Control Flow
There is one additional complication we have not yet discussed: the semantic mismatch between the imperative control flow that Julia provides and the functional control flow that XLA provides. To handle if/else control flow, we look at the φ nodes in the Julia compiler’s SSA IR and treat them as the result of XLA’s functional conditional (if there are multiple φ nodes at the same merge point, we construct a tuple of them). The condition that caused the computation to diverge becomes the condition of the functional control flow, and any computation between the divergence and the merge is outlined into a function that the corresponding branch calls. Loops are handled similarly to conditionals: we identify the strongly connected regions of the control-flow graph and lower each one as the body of a functional while loop.
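The two examples below (our own, not listings from the paper) illustrate the kinds of Julia control flow this covers and, in the comments, how each piece is lowered:

    function relu_scale(x, s)
        # The if/else below produces a single φ node at the merge point in
        # Julia's SSA IR; the compiler described here turns that φ node into
        # the result of an XLA conditional, with each branch outlined into
        # its own computation.
        if x > 0
            y = x * s
        else
            y = zero(x)
        end
        return y
    end

    function geometric_sum(r, n)
        # The loop forms a strongly connected region of the control-flow
        # graph; it is lowered to an XLA while loop whose carried state is
        # the tuple of φ nodes at the loop header (here: i and acc).
        acc = zero(r)
        i = 0
        while i < n
            acc += r^i
            i += 1
        end
        return acc
    end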
7. Results
7.2 VGG19 Forward Propagation
Our first complex example is the full VGG19 forward pass. We use the VGG19 implementation from the Metalhead package (Mike Innes & contributors, 2018), which builds on the Flux (Innes & contributors, 2017) framework to translate familiar machine learning layers (convolutional layers, fully connected layers) into linear algebra operations. Importantly, every layer in Flux is just an ordinary generic function that in turn calls generic linear algebra operations. Machine learning models expressed in Flux, VGG19 included, are therefore just ordinary Julia functions, and so they can use the method introduced in this paper.
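To make this concrete, the sketch below shows that running VGG19 is just an ordinary function call (Metalhead.VGG19() and the WHCN batch layout follow the Metalhead.jl/Flux.jl APIs of the time; exact names may differ across versions):

    using Flux, Metalhead

    model = Metalhead.VGG19()            # a chain of Flux conv / pooling / dense layers
    x = rand(Float32, 224, 224, 3, 10)   # a batch of 10 images: width x height x channels x batch

    # The forward pass is an ordinary Julia function call; every layer bottoms
    # out in generic linear algebra, which is what the compiler described here
    # infers, offloads, and fuses into a single TPU executable.
    y = model(x)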
Our compiler is able to fully infer, offload, and fuse the entire VGG19 forward pass. After Julia-level optimization, the final IR for the top-level function contains 181 instructions (each an HloOp with properly inferred constant static parameters and properly shape-inferred dynamic parameters). The total number of HLO operands in the entity computation is 183 (the extra two being the parameter instructions, which are implicit in the embedding), with 361 HLO operands in total across 29 computations. See Figure 3 for the breakdown of instruction counts. Because we can offload the entire forward pass, Julia is not involved in any step of the evaluation, so other tasks (such as preparing the data for the next batch) can proceed concurrently. Moreover, the performance of the resulting code is limited only by the quality of the code XLA generates, not by the front end (see Section 7.4 for the performance evaluation). We evaluated the VGG19 model on the ImageNet validation set and verified that the results match the original Metalhead results, confirming the correctness of the generated XLA code.
7.3 VGG19 Backward Propagation
To obtain the backward pass, we use the compiler-based AD framework Zygote.jl (Innes, 2018). Zygote operates on Julia code, and its output is again a Julia function (suitable for feeding back into Zygote for higher-order derivatives, and also for compilation to the TPU). Here is a concrete example:
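A sketch of what such an example might look like, using Zygote’s gradient API (the paper’s own listing may differ in detail; sum stands in for a real loss function):

    using Zygote, Metalhead

    model = Metalhead.VGG19()
    x = rand(Float32, 224, 224, 3, 1)

    # Differentiate a scalar loss of the forward pass with respect to the model
    # weights. Zygote returns an ordinary Julia function for the backward pass,
    # so it can be compiled and offloaded to the TPU just like the forward pass.
    grads = Zygote.gradient(m -> sum(m(x)), model)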
That is, it computes the derivative with respect to the current model weights for a particular training example (or batch of training examples); we use sum as a simple stand-in for the loss function. Somewhat surprisingly, the type-inference modifications described in Section 6 are also sufficient to give precise type inference through all of the VGG19 backward pass. As with the forward pass, the optimized and unoptimized instruction counts are shown in Figure 3. The backward pass generates significantly more XLA instructions than the forward pass; one of the largest contributors is Zygote’s mixed-mode broadcast fusion, which computes both the forward pass and the backward pass inside a single map kernel. Because XLA does not currently support multiple outputs from a single map instruction, the function body is duplicated across several map instructions, which XLA’s DCE then has to clean up. More generally, our compilation process heavily exercises XLA’s handling of map instructions, because calls to Julia’s map and broadcast functions are ubiquitous in generic code.
7.4 Performance Evaluation on TPU
Figure 3: Breakdown of instruction counts for the forward and backward passes of Metalhead.jl VGG19 after compilation to XLA. Both the unoptimized counts (after the Julia front end) and the optimized counts (after an XLA optimization pipeline similar to the one used by the CPU back end, but without HLO fusion) are shown. Each count is further split into instructions in the entity computation (E) and instructions across all computations (T).