Compiled by Heart of the Machine

Heart of the Machine Editorial Department

Facebook’s hardware research has drawn wide interest. At the Open Compute Project Global Summit today, Vijay Rao, Facebook’s head of technology strategy, open-sourced new AI hardware: Zion and Kings Canyon, hardware systems for AI training and inference respectively, and Mount Shasta, for video transcoding. This article explains them in detail.

Facebook’s infrastructure now serves more than 2.7 billion people a month across its family of apps and services. Its engineers have designed and built advanced, efficient systems to scale this infrastructure, but as workloads grow, general-purpose processors alone can no longer keep up. Transistor scaling has slowed dramatically, so specialized accelerators and holistic system-level solutions are needed to improve performance, power, and efficiency.

Creating efficient solutions for this infrastructure requires co-designing hardware around the workloads. To that end, Facebook has been working with partners to develop solutions for AI inference, AI training, and video transcoding, which are among its fastest-growing services. Today, Facebook announced Zion, its next-generation hardware platform for AI training; Kings Canyon, a new custom chip design for AI inference; and Mount Shasta, for video transcoding.

AI hardware

AI workloads are used throughout Facebook’s architecture to make its services more relevant and improve the user experience. By deploying AI models at scale, Facebook delivers more than 200 trillion predictions and over 6 billion language translations per day, and it uses more than 3.5 billion public images to build and train AI models that better identify and flag content. AI powers a wide variety of services that help people interact with each other every day and provide them with unique, personalized experiences.

Most of the AI workloads at Facebook are managed through its AI platform, FBLearner, which contains tools for different parts of the pipeline, such as the feature store, training workflow management, and the inference engine. Combined with hardware designed through and contributed to the Open Compute Project (OCP), this lets Facebook deploy models efficiently at scale. Starting from this stable foundation, Facebook has focused on creating well-integrated, vendor-agnostic hardware designs, and it continues to follow the principle of disaggregated design for maximum flexibility. The result is Facebook’s next generation of hardware for training and inference workloads.

AI training system Zion

Zion is Facebook’s next-generation, large-memory unified training platform, designed to efficiently handle a range of neural networks including CNNs, LSTMs, and SparseNN. The Zion platform provides the high memory capacity and bandwidth, flexible high-speed interconnects, and powerful compute that these heavy workloads demand.

Zion adopts Facebook’s new vendor-agnostic OCP Accelerator Module (OAM). The OAM form factor allows Facebook’s partners, including AMD, Habana, Graphcore, and Nvidia, to develop their own solutions on top of the common OCP specification. Within a single rack connected through a top-of-rack (TOR) network switch, the Zion architecture gives Facebook the freedom to scale to multiple servers per platform. As Facebook’s AI training workloads grow in size and complexity, the Zion platform will scale along with them.

The Zion system is divided into three parts:

  • 8-socket server

  • 8-accelerator platform

  • OCP accelerator module

Figure: Basic building blocks of the AI training solution

Figure: Diagram of the Zion connection modules

Zion decouples the memory-, compute-, and network-intensive components of the system so that each can scale independently. The system provides a large DDR memory pool across eight NUMA CPU sockets for capacity-intensive parts of the workload, such as SparseNN’s embedding tables. For memory-bandwidth-intensive and compute-intensive parts, such as CNNs or the dense portion of SparseNN, each CPU socket is connected to an OCP accelerator module.

The system has two high-speed fabrics: a coherent fabric that connects all CPUs, and a fabric that connects all accelerators. Because accelerators have high memory bandwidth but low memory capacity, the total available memory is used efficiently by partitioning the model so that more frequently accessed data resides in accelerator memory and less frequently accessed data resides in DDR memory alongside the CPUs. Computation and communication among the CPUs and accelerators are balanced and kept efficient through the high- and low-speed interconnects.
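
To make that placement idea concrete, here is a minimal Python sketch of frequency-based partitioning; the table names, sizes, access rates, and the 16 GB accelerator capacity are hypothetical, not figures from Facebook.

```python
# Sketch: place frequently accessed embedding tables in accelerator memory,
# spill the rest to host DDR. All numbers below are illustrative.

def place_tables(tables, accel_capacity_gb):
    """tables: list of (name, size_gb, accesses_per_sec)."""
    placement = {}
    used = 0.0
    # Hottest tables first, so accelerator bandwidth serves most of the traffic.
    for name, size_gb, accesses in sorted(tables, key=lambda t: -t[2]):
        if used + size_gb <= accel_capacity_gb:
            placement[name] = "accelerator_hbm"
            used += size_gb
        else:
            placement[name] = "cpu_ddr"
    return placement

tables = [
    ("user_ids",    12.0, 9_000_000),   # hot, large
    ("page_ids",     4.0, 2_500_000),
    ("rare_feature", 20.0,    40_000),  # cold, very large
]
print(place_tables(tables, accel_capacity_gb=16.0))
# -> user_ids and page_ids land in accelerator memory, rare_feature in CPU DDR
```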

Performing inference with Kings Canyon

Once a model has been trained, it needs to be deployed to production to process data and respond to user requests. This is inference: the process by which a model makes predictions on new data. Inference workloads are growing dramatically, mirroring the growth in training, and standard CPU servers can no longer keep up. Facebook is working with partners including Esperanto, Intel, Marvell, and Qualcomm to develop inference ASICs that can be deployed and scaled across its infrastructure. These chips will provide INT8 operations for the performance the workloads require, while also supporting FP16 half-precision operations where higher accuracy is needed.
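
As a rough illustration of that precision trade-off (not the vendors’ toolchains, which the source does not describe), the sketch below uses stock PyTorch to quantize the linear layers of a toy model to INT8 and compare its output against the full-precision baseline; the model and input are made up.

```python
# Illustration of the INT8-vs-FP16 trade-off with stock PyTorch on CPU;
# the inference ASICs are programmed through vendor toolchains, not this API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1)).eval()
x = torch.randn(8, 256)

# INT8: dynamically quantize the Linear layers for higher throughput.
int8_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

y_fp32 = model(x)       # full-precision reference
y_int8 = int8_model(x)  # quantized path
print("max INT8 error:", (y_int8 - y_fp32).abs().max().item())
# FP16 (half precision) would be used where this error is too large,
# e.g. model.half() on hardware with native FP16 support.
```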

The full inference server solution is divided into four distinct parts, each leveraging existing building blocks already published to OCP. Reusing existing components speeds up development and reduces risk through commonality. The four main components of the design are:

  • Kings Canyon inference M.2 module

  • Twin Lakes Single-socket server

  • Glacier Point V2 Carrier Card

  • Yosemite v2 chassis

Figure: AI inference solution modules

Figure: Diagram of the AI inference solution connectivity

At the system level, each server combines M.2 Kings Canyon accelerators with a Glacier Point V2 carrier card, which connects to a Twin Lakes server. These two sets of components are installed in an updated Yosemite v2 chassis and connected to a TOR switch through a multi-host NIC. The updated Yosemite sled is an iteration of the existing Yosemite v2 sled that routes additional PCIe lanes from the Twin Lakes hosts to the NIC for higher network bandwidth. Each Kings Canyon module contains an ASIC, associated memory, and other supporting components; the CPU host communicates with the accelerator modules over PCIe. Glacier Point V2 includes an integrated PCIe switch that allows the server to access all modules at once.

Deep learning models have high memory requirements. The SparseNN model, for example, has very large embedding tables that occupy several gigabytes of memory and are likely to keep growing. Such large models may not fit into the memory of a single device, whether CPU or accelerator, so the model must be partitioned across the memory of multiple devices. Partitioning incurs significant communication costs whenever data resides in another device’s memory, so a good graph-partitioning algorithm tries to capture locality and thereby reduce communication.
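
The source does not describe Facebook’s partitioner, but the idea can be illustrated with a toy greedy scheme: assign each operator to whichever of two devices already holds most of its neighbors, subject to a capacity budget, and count the edges that cross devices as a proxy for communication cost. The graph and capacities below are invented for illustration.

```python
# Toy greedy partitioner over a computation graph with two devices.
from collections import defaultdict

def partition(edges, capacity):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    assignment, load = {}, {0: 0, 1: 0}
    # Place high-degree nodes first; prefer the device with more placed neighbors.
    for node in sorted(neighbors, key=lambda n: -len(neighbors[n])):
        scores = []
        for dev in (0, 1):
            if load[dev] >= capacity:
                continue
            local = sum(1 for n in neighbors[node] if assignment.get(n) == dev)
            scores.append((local, -load[dev], dev))
        dev = max(scores)[2]
        assignment[node] = dev
        load[dev] += 1
    cut = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return assignment, cut  # cut edges stand in for communication cost

edges = [("emb", "concat"), ("dense", "concat"), ("concat", "mlp"), ("mlp", "loss")]
print(partition(edges, capacity=3))
```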

With proper model partitioning, we can run very large deep learning models. In the SparseNN case, for example, if the memory capacity of a single node cannot hold a given model, we can split the model across two nodes, increasing the amount of memory available to it. The two nodes are connected through multi-host NICs that support high-speed data exchange. This increases the overall communication cost, but we can exploit the fact that access frequency differs across the embedding tables and place the tables accordingly to reduce communication latency.

Neural network hardware accelerator compiler

ASICs do not run general-purpose code; they need a dedicated compiler to translate computation graphs into instructions that execute on the accelerator. The goal of the Glow compiler is to abstract vendor-specific hardware away from the higher-level software stack, keeping the infrastructure vendor-agnostic. It accepts computation graphs from frameworks such as PyTorch 1.0 and generates highly optimized code for these machine learning accelerators.
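
Glow’s own API is not shown in the source, but the hand-off it describes can be sketched with standard PyTorch tooling: capture the model as a computation graph (here via torch.jit.trace, or an ONNX export) that a Glow-style backend could then lower to accelerator-specific code. The model below is a stand-in.

```python
# Sketch: capture a PyTorch model as a computation graph that an ahead-of-time
# compiler such as Glow could lower to accelerator-specific instructions.
# The capture uses standard PyTorch APIs; the vendor-specific lowering itself
# is not shown here.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8)).eval()
example = torch.randn(1, 64)

# Option 1: a traced TorchScript graph of the model's operators.
traced = torch.jit.trace(model, example)
print(traced.graph)

# Option 2: an ONNX export, a common interchange format for such compilers.
torch.onnx.export(model, example, "model.onnx")
```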

Figure: The Glow compiler

Video transcoding with Mount Shasta

The average number of Facebook Live broadcasts has doubled every year since 2016. Since its global launch in August 2018, Facebook Watch has drawn more than 400 million monthly viewers and 75 million daily users. To optimize all of this video for varied network environments, Facebook generates output in multiple qualities and resolutions (or bitrates), a process known as video transcoding. The computation required is highly intensive, and general-purpose CPUs cannot keep up with Facebook’s growing video needs. To stay ahead of demand, Facebook has partnered with Broadcom and Verisilicon to design custom ASICs optimized for transcoding workloads.

The video transcoding process breaks down into many distinct steps, described in more detail below. Today those steps run in software, so to improve efficiency Facebook has worked with vendors to create custom ASICs containing dedicated blocks for each stage of the transcoding workflow. Running these workloads on custom hardware makes the process far more energy efficient and enables new features such as real-time 4K 60fps streaming. Individual video codecs are standardized and change infrequently, so the inherent inflexibility of custom silicon is not a significant disadvantage in this case.

The first stage of video transcoding is decoding, in which the uploaded file is decompressed to obtain the raw video data, represented as a series of images. These uncompressed images can then be resized (scaling) and re-encoded with optimized settings that compress them back into a video stream. The output video is compared against the original to compute quality metrics that capture the change in quality relative to the uploaded video; this is done for every video to ensure the chosen encoding settings produce high-quality output. The standard used for video encoding and decoding is called a video codec; H.264, VP9, and AV1 are the main codecs in use today.
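
For reference, the sketch below drives the same decode-scale-encode-measure steps in software through the ffmpeg command line; the file names and encoder settings are illustrative, and on the ASIC each of these stages maps to a dedicated hardware block.

```python
# Sketch of the software transcoding pipeline described above, driven through
# the ffmpeg CLI via subprocess. File names and settings are illustrative.
import subprocess

SRC = "upload.mp4"      # hypothetical uploaded video
OUT = "out_720p.mp4"    # one output rendition

# Decode the upload, scale it to 720p, and re-encode it with H.264 (CRF 23).
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-vf", "scale=-2:720",            # scaling stage
    "-c:v", "libx264", "-crf", "23",  # encoding stage
    OUT,
], check=True)

# Quality measurement: bring the original to the same resolution and
# compute SSIM between the encoded output and the original.
subprocess.run([
    "ffmpeg", "-i", OUT, "-i", SRC,
    "-lavfi", "[1:v]scale=-2:720[ref];[0:v][ref]ssim",
    "-f", "null", "-",
], check=True)
```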

On the ASICs, the steps are the same, except that each software algorithm is replaced by a dedicated block within the chip. On average, Facebook expects the video accelerators to be many times more efficient than its current servers: the target is to encode at least two parallel 4K input streams at 60fps within a 10W power envelope. The ASICs also need to support multiple resolutions (from 480p up to 4K at 60fps) and multiple encoding formats (from H.264 to AV1).
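
For a sense of scale, that target works out to roughly a gigapixel of encoding throughput per second within the 10W envelope; the arithmetic below uses only the numbers quoted above.

```python
# Back-of-the-envelope throughput implied by the stated target:
# two parallel 4K (3840x2160) input streams at 60 fps within 10 W.
streams, width, height, fps, power_w = 2, 3840, 2160, 60, 10

pixels_per_second = streams * width * height * fps
print(f"{pixels_per_second / 1e9:.2f} Gpixel/s")                 # ~1.0 Gpixel/s
print(f"{pixels_per_second / power_w / 1e6:.0f} Mpixel/s per W") # ~100 Mpixel/s per watt
```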

Video transcoding ASICs usually contain the following main logical blocks:

  • Decoder: receives the uploaded video and outputs a decompressed raw video stream

  • Scaler: Scales uncompressed video

  • Encoder: Output compressed (encoded) video

  • Quality measurement: measures the loss in video quality after the encoding step

  • PHY: the chip’s interface to the outside world, connecting to the server’s PCIe bus and DDR memory

  • Controller: a general-purpose block that runs the firmware and coordinates the transcoding process

Figure: Basic building blocks of the video transcoding solution

As with inference, Facebook deploys these transcoding ASICs in its data centers using existing OCP components. Each ASIC is mounted on an M.2 module with an integrated heat sink, a common form factor that can be reused across different hardware platforms. The modules are installed in Glacier Point V2 (GPv2) carrier cards, each of which holds multiple M.2 modules. The GPv2 carrier card has the same physical form factor as the Twin Lakes server, which means it can be installed in a Yosemite v2 chassis and paired with a Twin Lakes server.

Because the transcoding ASICs consume little power and are physically small, Facebook aims to save cost by attaching as many chips as possible to a single server. The high-density GPv2 achieves this while providing enough cooling to keep the chips within data center operating temperatures.

Once software integration is complete, Facebook will balance video transcoding workloads across a heterogeneous hardware fleet distributed over different data center locations. As it scales in collaboration with vendors across the machine learning and video spaces, it is also working to ensure that software is developed against open formats and to promote and adopt common interfaces and frameworks.

References:

https://venturebeat.com/2019/03/14/facebook-open-sources-hardware-for-ai-model-training-and-inference/

https://code.fb.com/data-center-engineering/accelerating-infrastructure/

This article was compiled by Heart of the Machine. For reprint authorization, please contact this official account.
