
Zheng Bing, Mao Baolong, Pan Zhizheng

Alluxio is an open-source, distributed, memory-speed data orchestration system for AI and big data applications. As big data and AI services migrate to container management platforms such as Kubernetes, Alluxio has become a preferred intermediate layer for accelerating scenarios such as data query and model training.

The problem Alluxio solves in the game AI offline feature computing service can be abstracted as a data-dependency problem in a distributed computing scenario. The traditional ways of handling data dependencies are as follows:

  1. Packaging the dependencies into the container image. Isolation is good, but image cache capacity is limited; since the data is updated frequently, every update requires repackaging and redeploying the image and stopping services to replace containers.
  2. Fetching data from remote storage on demand. The overall architecture is simple, but the data must be pulled from the remote end on every access, resulting in poor performance.
  3. Local deployment. Data locality is good, but permission problems must be solved and operations are difficult.

Each of these methods has its own advantages and disadvantages. Introducing Alluxio can significantly raise the concurrency limit of AI tasks without adding too much cost. Meanwhile, AI task programs still access Alluxio in the original POSIX mode, so the business is unaware of any change in the storage system.

This article describes the adaptation and optimization of Alluxio on the compute platform and in the game AI feature computing service.

Background

Game AI offline training is divided into supervised learning and reinforcement learning scenarios. Supervised learning is generally broken down into feature computing, model training, and model evaluation, and some reinforcement learning scenarios also require feature computing. In game AI feature computing, the information of a match must first be reconstructed, and the feature data required for model training is then generated through statistics and computation. Reconstructing match information depends on the corresponding game dependencies (the game itself, the game translator, game replay tools, and so on). A game dependency is generally between 100MB and 3GB in size, and the nature of the game means that a given piece of match information can only be computed with a specific version of the game dependency.

Gamecore refers to the game dependency corresponding to the Linux client of a given game version. Placing gamecore in local storage on each server gives better read performance and stability, but it is costly and requires permissions on the local machines.

Another solution is to store gamecore in a distributed storage system such as CephFS, which makes data updates faster and deployment easier. The disadvantage is that the metadata service (MDS) of CephFS may become a bottleneck. A feature computing task usually schedules thousands of containers, and each container starts several business processes. At the beginning of a task, thousands to tens of thousands of processes access the same gamecore in parallel, reading hundreds of GB of data from the storage end every minute. Because the files are small, the MDS bears all of the metadata pressure. In addition, the latency between storage and the business is usually high, especially when they are not in the same region, and the resulting failures can cause the whole task to fail.

After comprehensive consideration, we introduced Alluxio on CephFS to address these business pain points. In this process, we would like to thank the Tencent game AI team and the operations management team for their strong support. The game AI team walked us through the overall business background and coordinated the test and live-network environments, so we were able to fully test the solution before landing it in production; the operations management team provided a lot of help with deployment architecture and resource coordination.

Business support

In the big data ecosystem, Alluxio sits between data-driven frameworks or applications and various persistent storage systems. Alluxio consolidates the data stored in these different storage systems and provides a unified client API and a global namespace for the data-driven applications above it. In our scenario, the underlying storage is CephFS and the application is feature computing, with Alluxio in the middle providing a distributed shared cache service. This fits the feature computing workload well, where writes, reads, and highly concurrent access to small files all need optimization, mainly in the following ways:

  • Alluxio provides good support on the cloud, which makes it easy to deploy and scale Alluxio clusters on the compute platform.
  • Alluxio can be deployed close to the business: business pods and Alluxio workers can be co-located on the same node, improving I/O throughput through the local cache.
  • Alluxio workers use the RAM disk of the compute-platform nodes to provide sufficient cache space, and hot data in the underlying CephFS is loaded into the workers via distributedLoad (see the sketch after this list). Some of the business accesses gamecore directly through Alluxio, relieving the pressure on the underlying CephFS.
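
For example, preloading a hot gamecore version can be done with Alluxio's distributedLoad command. The sketch below is illustrative only: the path /data/gamecore/ver_20220301 and the replication count are assumptions, not our actual namespace layout.

```
# Preload a hot gamecore version from the under storage (CephFS) into worker cache.
# The path below is a placeholder; adjust to the real Alluxio namespace.
./bin/alluxio fs distributedLoad --replication 1 /data/gamecore/ver_20220301

# Inspect how much of the path is now held in Alluxio.
./bin/alluxio fs du /data/gamecore/ver_20220301
```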

The following figure shows the architecture connecting Alluxio to the business. In this launch we aim to support the stable operation of tasks with 4,000 cores of concurrency. Each task pod is configured with four CPU cores, giving 1,000 pods of concurrency on the business side. An alluxio-fuse sidecar container is embedded in each pod as the client, and business read requests go directly, in POSIX form, to the local path mounted by alluxio-fuse.
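
Conceptually, the sidecar does little more than mount the Alluxio namespace to a local path that the business container shares. A minimal sketch, assuming /mnt/alluxio-fuse is the shared mount point (the path and the shared-volume arrangement are assumptions, not our exact deployment):

```
# Inside the alluxio-fuse sidecar container: mount the Alluxio namespace locally.
integration/fuse/bin/alluxio-fuse mount /mnt/alluxio-fuse /

# The business process in the same pod then reads gamecore with ordinary POSIX calls:
ls /mnt/alluxio-fuse/data/gamecore/ver_20220301
```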

The master nodes of the Alluxio cluster are configured in HA mode, and the cluster has 1,000 workers. We try to co-locate business pods and worker pods on the same nodes as much as possible, so that domain sockets can be used wherever possible to further improve read performance. Before the business goes live, the hot gamecore versions in CephFS are preloaded into the Alluxio workers via distributedLoad.
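
Domain socket reads require the worker and the co-located client to share a socket path. A minimal sketch of the relevant alluxio-site.properties entries, assuming /opt/domain is the hostPath shared between worker and business pods (the path is an assumption):

```
# Worker side: expose a domain socket on a directory shared with business pods.
alluxio.worker.data.server.domain.socket.address=/opt/domain
alluxio.worker.data.server.domain.socket.as.uuid=true

# Client side: short-circuit reads are on by default; shown here only for clarity.
alluxio.user.short.circuit.enabled=true
```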

Development and tuning

In AI and machine learning scenarios, the Alluxio cluster serving the feature computing business is a large-scale deployment (1,000+ worker nodes). Such highly concurrent access from the business side is also a challenge to the load capacity of the master nodes. During the rollout, the master went through multiple rounds of tuning and new feature development to achieve the best results.

Development work

  • For the feature computing + CephFS scenario, we designed and implemented two CephFS under-storage integrations: one based on HCFS + cephfs-hadoop + libcephfs, and one built directly on libcephfs.
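
A minimal sketch of mounting CephFS as the root under storage with the libcephfs-based integration. The property keys follow the Alluxio CephFS under-store convention, and the monitor host, auth id, and keyring path are placeholders (exact keys may differ between Alluxio versions):

```
# alluxio-site.properties: CephFS as the root UFS (all values are placeholders).
alluxio.master.mount.table.root.ufs=cephfs://mon1.example.com/gamecore
alluxio.underfs.cephfs.conf.file=/etc/ceph/ceph.conf
alluxio.underfs.cephfs.auth.id=admin
alluxio.underfs.cephfs.auth.keyring=/etc/ceph/ceph.client.admin.keyring
```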

  • Together with the Alluxio community, we designed HA leader switching to a specified master node.

We abstracted this Ratis functionality into a standalone ratis-shell tool. With ratis-shell, we can send a setConfiguration request directly to the Ratis servers to set the priority of each master, and then send a transferLeader request to make the leader switch to the specified node. Ratis-shell can be used with Alluxio, Ozone, and any other application built on Ratis.
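
A sketch of that flow with ratis-shell. The peer addresses are placeholders (19200 is Alluxio's default embedded journal port), and the exact flags may differ between ratis-shell versions:

```
# Raise the priority of the master we want to become leader, then trigger the transfer.
ratis sh peer setPriority -peers m1:19200,m2:19200,m3:19200 -addressPriority "m2:19200|2"
ratis sh election transfer -peers m1:19200,m2:19200,m3:19200 -address m2:19200
```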

  • Added a dynamic configuration update feature, which allows some cluster parameters to be modified online to optimize the configuration without affecting the business.

As shown in the figure above, an updateConf API is added between the client and the master. Through this API, a configuration change request can be sent to the Alluxio master. After the master updates the configuration, its internal config hash changes; other clients, workers, and services connected to the master periodically synchronize the config hash with the master, notice the change, and pull the updated configuration.
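
From the operator's point of view, a change then looks roughly like the line below. This assumes an fsadmin-style updateConf entry point as described above; the command name and the example key are illustrative and may differ by version:

```
# Push a new value to the master; connected clients and workers pick it up on their
# next config-hash sync instead of requiring a restart.
./bin/alluxio fsadmin updateConf alluxio.master.audit.logging.enabled=false
```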

  • On the Alluxio FUSE client, enabled the kernel cache and the Alluxio client-side metadata cache to improve read performance, and optimized the metadata cache: for example, when the underlying metadata changes, the metadata cache can be actively invalidated and re-populated, which adds flexibility. Several bugs that appeared when LocalCache was enabled on Alluxio FUSE were also fixed (see the sketch after this list).
  • Added many valuable metrics, such as master access busyness, Ratis, OS, JVM, GC, and cache hit ratio, enriching Alluxio's metrics system.
  • Developed the stacks feature to inspect key Alluxio processes, making it easy to track cluster status.
  • Fixed an OOM issue in the Alluxio job service when executing a large number of distributedLoad operations.
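
For the FUSE-side caches mentioned above, a minimal sketch of the mount options and client properties involved. The timeout values and cache sizes are illustrative, not our exact production settings:

```
# Mount with kernel caching enabled (attr/entry timeouts are illustrative values).
integration/fuse/bin/alluxio-fuse mount \
  -o kernel_cache -o attr_timeout=600 -o entry_timeout=600 \
  /mnt/alluxio-fuse /

# alluxio-site.properties on the FUSE client: enable client-side metadata caching.
alluxio.user.metadata.cache.enabled=true
alluxio.user.metadata.cache.max.size=100000
alluxio.user.metadata.cache.expiration.time=10min
```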

Configuration optimization

  • Worker block replication tuning: Alluxio enables passive caching by default (alluxio.user.file.passive.cache.enabled=true). If a client finds that a block is not on the local worker, it copies the block from a remote worker to the local one, so every worker ends up holding many replicas. In a 1,000-worker scenario this puts huge metadata pressure on the Alluxio master, while our tests showed that the performance benefit of this locality is actually very small, so we disabled it to reduce the pressure on the master (see the snippet after this list).
  • Tencent's internal KonaJDK11 is used, and the JVM parameters of the master and workers were tuned for this workload. With KonaJDK11 + G1GC, the leader master never switched because of a full GC.
  • With the help of the Tencent JVM team, we found that the audit log had become the bottleneck in our pure-read scenario; after disabling it with alluxio.master.audit.logging.enabled=false, throughput improved seven-fold. Using flame graphs captured with Kona-Profiler, we also found that RocksDB metadata management adds performance overhead, so we plan to switch Alluxio to HEAP metadata management next.
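
The settings described above boil down to a few lines of configuration. The heap sizes and GC flags below are placeholders, not our exact production values:

```
# alluxio-site.properties
alluxio.user.file.passive.cache.enabled=false   # avoid extra block replicas and master metadata pressure
alluxio.master.audit.logging.enabled=false      # audit log was the bottleneck in the pure-read scenario

# conf/alluxio-env.sh: master JVM options (KonaJDK11 + G1GC; sizes are placeholders)
ALLUXIO_MASTER_JAVA_OPTS="-Xms64g -Xmx64g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```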

Comparison test

We compared Alluxio (with CephFS as the UFS) against CephFS alone for the feature computing business of a MOBA game. The Alluxio cluster used in the test is as follows:

  • Alluxio masters: three masters deployed in high-availability mode.
  • Alluxio workers: 1,000 workers, about 4TB of storage space in total.
  • Business pods: 1,000; each pod runs a 4-core parallel task.
  • Test business: the AI feature computing task of a MOBA game (covering 250,000 matches).

According to the test results, both solutions can meet the needs of the business, and the failure rate is within the acceptable range; with Alluxio + CephFS, the business failure rate is even lower.

Looking at the metadata pressure metrics of Alluxio and CephFS (Alluxio RPC count and Ceph MDS QPS), there is a spike early in the task, after which the metadata pressure on the master gradually decreases. When Alluxio carries the business, Ceph MDS QPS is almost zero, indicating that Alluxio has absorbed most of the business pressure.

Breaking the read traffic down into Read Remote, Read UFS, and Read Domain, most of the traffic consists of remote reads between workers and local domain socket reads, while reads from the UFS are very rare.

The graph above shows the heap memory curve during task execution with KonaJDK11. Before switching to KonaJDK11, the master occasionally switched leaders because of long GC pauses; after the switch this no longer happened, and the master ran much more smoothly.

Future work

Increasing the throughput limit

At present, the Alluxio master starts to come under pressure at around 7,000 cores of business concurrency: its call queue becomes backlogged, and stress tests with the StressMasterBench tool show that it has to handle nearly 210,000 RPC requests per second. Eventually, we hope to support 20,000 cores of business concurrency.
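
The stress tests use Alluxio's built-in StressMasterBench. A sketch of an invocation; the operation, thread count, and duration below are illustrative rather than our exact benchmark parameters:

```
# Measure master RPC throughput for a metadata-heavy operation (parameters are illustrative).
./bin/alluxio runClass alluxio.stress.cli.StressMasterBench \
  --operation GetFileStatus --threads 256 --duration 30s
```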

For the Alluxio cluster to support the business's ever-growing concurrent access requirements, tuning a single cluster is not enough; we also need an overall architecture that can support higher levels of concurrent access.

Use Alluxio CSI to decouple the business from Alluxio FUSE

Alluxio FUSE currently sits in the same pod as the business in the form of a sidecar, which means the business container and the alluxio-fuse container in the business pod have to be managed jointly with the business team. With Alluxio CSI decoupling the two, the business side could maintain the business pod and its YAML independently.

Build Alluxio cluster management system on Kubernetes

We maintain an operations solution for the Alluxio cluster based on the Helm chart templates provided by Alluxio, but we want to go a step further and use the Kubernetes API to operate on every pod and container and to interactively execute the commands that need to be run. On top of this, we can mount and unmount the underlying storage, visually manage the job service, and build load-free services.
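
For example, mounting a new under-storage path today means exec-ing into a master pod and running the mount command by hand; the management system would drive the same operation through the Kubernetes API. The pod name, container name, and paths below are placeholders:

```
# Manual flow we want to automate: exec into a master pod and mount an extra CephFS path.
kubectl exec -it alluxio-master-0 -c alluxio-master -- \
  ./bin/alluxio fs mount /gamecore-new cephfs://mon1.example.com/gamecore-new
```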

Conclusion

In landing Alluxio for the game AI feature computing business, we supported stable operation at 4,000 cores of concurrency on the business side. In terms of results, Alluxio absorbs most of the metadata pressure that would otherwise fall on the underlying distributed storage, and the task failure rate has dropped to a range the business is satisfied with. The highly concurrent, large-scale nature of the business also exposed a number of problems in the Alluxio kernel; we have contributed all of the fixes to the Alluxio open-source version, which improves the stability and operability of the Alluxio kernel and lets it adapt to more scenarios in the future.