4Paradigm is deeply engaged in the field of artificial intelligence and has broad, deep expertise in AI-related algorithms, applications, systems, and underlying architecture design.
With the rapid development of storage technology in recent years, disruptive technologies such as non-volatile memory and SSDs have emerged. Heterogeneous memory architectures built on these technologies are upending traditional patterns of application design and optimization.
4Paradigm has taken an early lead in adopting heterogeneous memory architectures and has carried out a number of innovative research and engineering projects, such as its parameter server [4Paradigm launches the industry's first trillion-dimension online prediction system based on persistent memory, supporting millisecond-level recovery: www.163.com/tech/articl…] [joint research by Intel and 4Paradigm selected for VLDB: a trillion-dimension feature online prediction system optimized with Intel® Optane™ persistent memory: newsroom.intel.cn/news-releas…].
This article introduces the technical background of heterogeneous memory architectures and presents technical practice on an automatic machine learning system.
Heterogeneous memory architecture
Traditionally, what we mean by memory is dynamic random-access memory, or DRAM. In addition, the CPU contains small, fast storage devices commonly referred to as the CPU caches (L1/L2 cache). Slow storage devices with persistence make up external storage, such as disks. External storage, memory, and the CPU caches together form the classic storage pyramid. However, with the commercialization of revolutionary non-volatile memory technology, the memory tier of this pyramid is no longer composed of DRAM alone, but is a heterogeneous memory architecture combining DRAM and non-volatile memory.
In addition, the emergence of non-volatile memory has blurred the functional boundary between memory and external storage, making in-memory data persistence possible. Non-volatile memory technology is now fully mature; Intel® Optane™ Persistent Memory (PMem), released by Intel in 2019, is a representative product of this technology.
Figure 1. Pyramid of storage architecture based on heterogeneous memory
Figure 1 shows the storage pyramid with heterogeneous memory. As the figure shows, persistent memory sits between DRAM and external storage in the pyramid in terms of capacity, performance, and cost. Even functionally it is a hybrid of the two: it can be used either directly as memory (Memory Mode) or as a persistence device (App Direct Mode, or AD Mode).
In Memory Mode, persistent memory is transparent to the operating system and simply adds to the total available memory capacity. AD Mode exposes the storage hierarchy and leaves developers in full control. Because of this dual role, modern memory architecture has become more complex not only in hierarchy but also in functionality, and developers need to think about how to take advantage of it, for example:
- Optimization for multilevel storage. Persistent memory offers performance close to DRAM at a lower cost, which makes it attractive for memory-hungry applications. However, a multilevel storage architecture also raises the bar for performance optimization. As we know, effective caching is central to performance tuning: on the one hand, real data has hot spots, and caching greatly improves access performance for hot data; on the other hand, cache-conscious data structures and algorithms are often carefully designed to squeeze the most out of the hardware. Persistent memory makes the storage hierarchy deeper still, placing higher demands on the design of multi-level caching mechanisms, data structures, and algorithms.
- Utilization of the persistence mechanism. With persistent memory, external storage is no longer the only place to persist data. Persistent memory delivers far higher persistence performance than traditional external storage devices, though its capacity is comparatively small. How to exploit high-performance persistence effectively in a given scenario becomes a new design question. For example, for online services that must maintain round-the-clock quality of service, in-memory data persistence enables rapid recovery after a node goes down. And when disk I/O is the performance bottleneck, persistent memory can serve as the storage medium to lift overall system performance.
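As a concrete illustration of both points, here is a minimal sketch of a two-tier store that serves hot data from a fast tier and demotes older data to a slow tier. This is plain Java with in-memory maps standing in for persistent memory and disk; the class and method names are hypothetical, not from any real system:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a two-tier key-value store. "fast" stands in for
// persistent memory; "slow" stands in for external storage (HDD/SSD).
public class TieredStore {
    private final Map<String, byte[]> fast = new HashMap<>(); // PMem stand-in
    private final Map<String, byte[]> slow = new HashMap<>(); // disk stand-in
    private final int fastCapacity;

    public TieredStore(int fastCapacity) {
        this.fastCapacity = fastCapacity;
    }

    // Writes land in the fast tier first; when it is full, some entry is
    // demoted to the slow tier (a real system would use an eviction policy
    // such as LRU instead of an arbitrary victim).
    public void put(String key, byte[] value) {
        if (fast.size() >= fastCapacity && !fast.containsKey(key)) {
            String victim = fast.keySet().iterator().next();
            slow.put(victim, fast.remove(victim)); // demote
        }
        fast.put(key, value);
    }

    // Reads hit the fast tier when possible, so hot data stays cheap to access.
    public byte[] get(String key) {
        byte[] v = fast.get(key);
        return v != null ? v : slow.get(key);
    }

    public boolean inFastTier(String key) { return fast.containsKey(key); }

    public int fastSize() { return fast.size(); }
}
```

The point of the sketch is the shape of the problem, not the policy: once a second memory tier exists, every write and read path must decide which tier it touches, which is exactly the design burden the bullet points describe.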
To give a better sense of the value heterogeneous memory architectures deliver in real-world scenarios, we share below some practical lessons from 4Paradigm's work on heterogeneous memory.
Optimization of automatic machine learning systems on heterogeneous memory
Figure 2 shows the complete automatic machine learning (AutoML) pipeline of a typical 4Paradigm product. It consists mainly of offline exploration and online inference. Offline exploration produces the feature engineering scripts and models to be deployed, through automated feature engineering and model training. After receiving a user request, the online inference service obtains a prediction through real-time feature extraction and model inference. Throughout the system, the message queue plays a key role in data collection and distribution.
As Table 1 shows, in a heterogeneous memory architecture, persistent memory is used by different components in different ways to achieve different optimization goals. In general, Memory Mode expands memory capacity quickly and cheaply, while AD Mode brings further benefits such as rapid recovery and improved data storage performance.
4Paradigm has decoupled its key heterogeneous-memory optimization technologies and contributed them to the open source community. At present these comprise two projects: the high-performance message queue system Pafka (github.com/4paradigm/p… and PmemStore (github.com/4paradigm/p… . Pafka is introduced below.
Pafka: High-performance message queuing system based on heterogeneous memory optimization
Kafka is an open source distributed event streaming/message queue system designed to process real-time data streams efficiently and reliably, and it is widely used in industry. However, because of its persistence logic, its performance (throughput and latency) is often constrained by external storage devices (HDDs/SSDs). In practice, to increase the overall throughput of a Kafka cluster, an enterprise has to scale the cluster out, increasing its total cost.
Persistent memory offers very high persistence performance, several times or even tens of times that of traditional hard disks and SSDs. Pafka, a version of Kafka optimized for the heterogeneous memory architecture, exploits this to greatly improve single-node throughput and thereby reduce the total cost of the cluster. Overall, Pafka offers the following advantages over a traditional Kafka deployment:
- Compared with the SATA SSD configuration commonly used in data centers, Pafka on heterogeneous memory improves single-node throughput and latency by roughly 20 times.
- Because of the much higher per-node throughput, Pafka can reduce the hardware cost of a cluster of a given total capacity by more than 10 times compared with Kafka.
- Pafka is a direct optimization of Kafka: existing Kafka-based business code requires no modification and can be migrated to Pafka at zero code-change cost.
Our optimization of Kafka focuses on the data persistence path, which is the performance bottleneck. In the original Kafka architecture, data persistence happens only at the external storage (HDD/SSD) level; the optimized Pafka, based on the heterogeneous memory architecture, uses both persistent memory and external storage for data persistence.
Persistent memory, with its high persistence performance, forms the first level of the persistence hierarchy, while external storage, with larger capacity but lower performance, forms the second level; the two are coordinated by a caching mechanism.
Because of the producer/consumer access pattern of message queues, data access in most scenarios is served by the high-performance persistent memory.
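The two-level placement described above can be sketched as follows. The idea: new log segments always land on the PMem tier, and when the PMem budget is exhausted the oldest segment migrates to disk, so consumers that keep up with producers read almost exclusively from PMem. The class and method names here are illustrative, not actual Pafka internals:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of two-level segment placement: the active (newest)
// segments live on the PMem tier; when the PMem budget is exhausted, the
// oldest PMem segment is migrated to the external-storage tier.
public class SegmentTiering {
    public enum Tier { PMEM, DISK }

    public static final class Segment {
        final long baseOffset;
        Tier tier = Tier.PMEM;
        Segment(long baseOffset) { this.baseOffset = baseOffset; }
    }

    private final Deque<Segment> segments = new ArrayDeque<>();
    private final int pmemBudget; // max number of segments kept on PMem
    private int onPmem = 0;

    public SegmentTiering(int pmemBudget) { this.pmemBudget = pmemBudget; }

    // Roll a new segment; producers always append to the PMem tier.
    public Segment roll(long baseOffset) {
        if (onPmem == pmemBudget) {
            // Migrate the oldest PMem segment to disk, oldest-first.
            for (Segment s : segments) {
                if (s.tier == Tier.PMEM) { s.tier = Tier.DISK; onPmem--; break; }
            }
        }
        Segment s = new Segment(baseOffset);
        segments.addLast(s);
        onPmem++;
        return s;
    }

    // Which tier serves a read at this offset? Recent offsets stay on PMem;
    // only lagging consumers fall through to disk.
    public Tier tierForOffset(long offset) {
        Segment hit = null;
        for (Segment s : segments) {
            if (s.baseOffset <= offset) hit = s;
        }
        return hit == null ? null : hit.tier;
    }
}
```

Under a typical message-queue workload, consumers read near the log head, so most reads in this model resolve to the PMem tier, which is the property the paragraph above relies on.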
Figure 3. Pafka cluster architecture
As shown in Figure 3, a Kafka server cluster consists of anywhere from a few to hundreds of thousands of brokers. Data on the brokers is divided into partitions, which are further divided into segments that store the messages. Our modification of Kafka centers on the segment data structure. The original segment could only be stored on HDDs/SSDs and other external storage devices; we use PMDK to perform persistence operations on the heterogeneous memory and introduce the concept of a MixChannel, so that a segment can be stored either on an HDD/SSD external storage device or in persistent memory.
Specifically, MixChannel manages the common file interface and the persistent memory interface in a unified manner, so the underlying storage medium is transparent to the upper-layer components. To support persistent-memory-backed storage, we introduced a data structure called PMemChannel into MixChannel, which wraps a persistent memory MemoryBlock object behind an interface compatible with the FileChannel API. This lets MixChannel choose conveniently between a traditional file-based FileChannel and a persistent-memory-based PMemChannel. Here we use the PersistentMemoryBlock from PMDK's LLPL, which automatically persists data on every write. To support zero-copy, we also implemented a zero-copy ByteBuffer interface for LLPL MemoryBlocks by mapping the persistent memory address directly into the ByteBuffer, avoiding multiple memory copies and improving performance.
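The PMemChannel idea, stripped to its essence, looks something like the sketch below: a memory block exposed behind a FileChannel-style read/write interface so upper layers never see which medium backs the channel. A heap ByteBuffer stands in for LLPL's PersistentMemoryBlock (which would persist each write automatically); the class name mirrors the article, but the method set is illustrative only:

```java
import java.nio.ByteBuffer;

// Sketch of the PMemChannel idea: a memory block wrapped behind a
// FileChannel-like interface. ByteBuffer stands in for an LLPL
// PersistentMemoryBlock, which on real PMem persists each write.
public class PMemChannelSketch {
    private final ByteBuffer block; // stand-in for a persistent MemoryBlock
    private int position = 0;       // append position, like a log channel

    public PMemChannelSketch(int capacity) {
        this.block = ByteBuffer.allocate(capacity);
    }

    // FileChannel-style write: copy the source buffer into the block and
    // advance the append position. With LLPL backing, the data would be
    // durable as soon as this returns.
    public int write(ByteBuffer src) {
        int n = src.remaining();
        block.position(position);
        block.put(src);
        position += n;
        return n;
    }

    // FileChannel-style positional read via a duplicate view (no copy of
    // the backing storage, mirroring the zero-copy goal in the text).
    public int read(ByteBuffer dst, int pos) {
        int n = Math.min(dst.remaining(), position - pos);
        ByteBuffer view = block.duplicate();
        view.position(pos);
        view.limit(pos + n);
        dst.put(view);
        return n;
    }

    public int size() { return position; }

    // Convenience helpers for demonstration.
    public int writeString(String s) { return write(ByteBuffer.wrap(s.getBytes())); }

    public String readString(int pos, int len) {
        ByteBuffer dst = ByteBuffer.allocate(len);
        read(dst, pos);
        return new String(dst.array());
    }
}
```

Because callers only see write/read/size, swapping the backing store between a file and a persistent memory block is a construction-time decision, which is exactly what lets MixChannel route each segment to either medium.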
To maintain the correspondence between segments and their data in persistent memory, a persistent memory MemoryBlock is allocated for each segment, and the mapping is maintained by the ObjectDirectory of PMDK's PCJ (Persistent Collections for Java).
In addition, to avoid the overhead of dynamically allocating MemoryBlocks while Pafka is serving traffic, a fixed proportion of the memory pool is pre-allocated at initialization, so that MemoryBlocks can be handed out quickly when data is written.
Performance comparison
Figure 4 shows that Pafka on heterogeneous memory achieves roughly a 20-fold improvement in both throughput and latency over Kafka on the SATA SSDs commonly used in data centers.
Cost comparison
Assume our goal is an overall throughput of 20 GB/s. We compared Pafka on heterogeneous memory (with persistent memory) against Kafka on SATA SSDs. Figure 5 shows that to reach a total throughput of 20 GB/s, 45 SATA SSD-based servers are required versus only 3 heterogeneous-memory-based servers. In terms of hardware cost, the traditional Kafka (SATA SSD) solution comes to $450,000, while our Pafka solution costs only $40,500. Pafka thus cuts hardware cost to 9% of the traditional Kafka solution.
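The arithmetic behind this comparison can be reproduced with back-of-the-envelope numbers. The per-node throughputs and unit prices below are inferred so as to match the figures quoted in the text (45 vs. 3 servers, $450,000 vs. $40,500); they are assumptions for illustration, not official benchmarks or pricing:

```java
// Back-of-the-envelope reproduction of the cluster-sizing comparison.
// Per-node throughput and per-server price are inferred from the totals
// quoted in the text, not measured or official values.
public class CostComparison {
    // Servers needed = ceiling(target throughput / per-node throughput).
    public static int serversNeeded(double targetGBps, double perNodeGBps) {
        return (int) Math.ceil(targetGBps / perNodeGBps);
    }

    public static void main(String[] args) {
        double target = 20.0; // GB/s cluster-wide goal

        int sataServers = serversNeeded(target, 0.45); // ~0.45 GB/s per SATA node (inferred)
        int pmemServers = serversNeeded(target, 6.7);  // ~6.7 GB/s per PMem node (inferred)

        long sataCost = sataServers * 10_000L; // assumed $10,000 per SATA server
        long pmemCost = pmemServers * 13_500L; // assumed $13,500 per PMem server

        System.out.println(sataServers + " SATA servers: $" + sataCost);
        System.out.println(pmemServers + " PMem servers: $" + pmemCost);
        System.out.println("cost ratio: " + (100 * pmemCost / sataCost) + "%");
    }
}
```

With these assumed inputs the computation yields 45 and 3 servers and a 9% cost ratio, consistent with the totals above: a PMem node can cost more per server yet still win decisively on cluster cost because far fewer nodes are needed.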
Figure 5. Cost comparison of the Pafka and Kafka solutions at 20 GB/s total throughput
For more information
Pafka is an open source project from 4Paradigm. Usage details, technical support, and the full performance report are available at:
- Code repo: github.com/4paradigm/p…
- Slack channel: join.slack.com/t/memarkwor…
- MemArk heterogeneous storage technology forum: discuss.memark.io/