Ideal Automobile (Li Auto) is a Chinese new energy vehicle manufacturer that designs, develops, manufactures, and sells luxury intelligent electric vehicles. Founded in July 2015, Ideal Automobile is headquartered in Beijing and has its own production base in Changzhou, Jiangsu Province. Through product innovation and technology R&D, Ideal Automobile provides safe and convenient products and services for family users.
In China, Ideal is the pioneer in successfully commercializing extended-range electric vehicles. Ideal ONE, the first and only commercially available extended-range electric vehicle model in China, is a luxury six-seat electric SUV (sport utility vehicle) equipped with a range-extension system and advanced intelligent vehicle solutions. Mass production began in November 2019, and the 2021 Ideal ONE was launched on May 25, 2021. As of December 31, 2021, a total of 124,088 Ideal ONE vehicles had been delivered.
Background
According to the relevant national regulations and standards, signal data from the core components of new energy vehicles must be collected while the vehicle is driving and reported to the new energy vehicle data platform built by the government. These data come from the engine, the battery, and other core components. Regulators also require automakers to store the data themselves for follow-up maintenance, OTA upgrades, vehicle health monitoring, and early warning. To serve its users better, Ideal Automobile began to build its own data platform.
Ideal Automobile is committed to creating a mobile home and becoming a world-leading smart electric vehicle company, so the scale of data it needs to manage is very large. This article discusses only the time-series data generated by the vehicles. In the architecture of the vehicle data platform, the full volume of time-series signal data is stored in HDFS, and the Hadoop technology stack is used to complete various complex computation and analysis tasks according to business requirements.
In December 2021, Ideal delivered 14,087 Ideal ONE units, an increase of 130.0% compared with December 2020. From January to December 2021, a total of 90,491 Ideal ONE units were delivered, up 177.4% from 2020. Since the start of deliveries, cumulative deliveries of the Ideal ONE have reached 124,088 units. It is easy to see that the vehicle data managed by the data platform is growing just as fast, which places very high demands on the agility and flexibility of the data platform.
Big data veterans know that expanding HDFS takes time and effort, and sometimes it cannot even keep up with business growth. Facing rapid business development and the inflexibility of HDFS, maintenance engineers sometimes have to delete invalid and redundant data and rebalance data across DataNodes to ease the conflict between the business's high demand for agility and the rigidity of HDFS. In addition, because Hadoop couples storage with compute, adding storage capacity also means adding compute, and the two rarely match: expanding storage alone brings a lot of redundant computing power and creates unnecessary waste.
In 2020, the data platform team began to tackle the contradiction between fast-changing business needs and the inflexibility of HDFS. The selection criteria at the time were:
- Minimal changes to existing ETL processes and computation logic; in other words, excellent HDFS compatibility
- Good elasticity, so that capacity can be scaled on demand
- Transparent acceleration with no performance bottlenecks
- Stability at least on par with HDFS
At the beginning, the Hadoop SDK integration solutions provided by cloud vendors were tested. However, because only a limited set of Hadoop APIs was implemented and there was no cache, their stability and performance were far inferior to HDFS, and solving the problem was postponed.
At the beginning of 2021, when JuiceFS was open sourced, colleagues on the data platform team learned about the JuiceFS Cloud Service. JuiceFS is fully compatible with the HDFS API and offers both elasticity and caching, so a preliminary assessment suggested it could solve the first three problems in the selection criteria. With a "give it a try" mindset, we tried it right away. Thanks to the great help of the JuiceFS team, JuiceFS Community Edition was successfully launched at Ideal Automobile. It solved the HDFS capacity shortage, achieved the storage-compute separation of the Hadoop architecture and, most importantly, met the business's agility requirements.
JuiceFS introduction
JuiceFS is a high-performance open source distributed file system designed for cloud environments. It is fully compatible with the POSIX, HDFS, and S3 interfaces and is suitable for scenarios such as big data, AI model training, Kubernetes shared storage, DevOps, and massive data archiving. When JuiceFS is used to store data, the data itself is persisted in an object store (such as Amazon S3), while the corresponding file system metadata can be stored in a variety of database engines, such as Redis, MySQL, and TiKV. In addition, the JuiceFS client has caching capabilities that provide intelligent I/O acceleration for upper-layer applications.
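To make this architecture concrete, below is a minimal sketch of the Hadoop-side configuration a JuiceFS client typically needs: where the metadata engine lives and where local cache data is kept (the object store endpoint and credentials are recorded in the file system when it is formatted). The property names come from the JuiceFS Hadoop SDK, but the host names, paths, and sizes are placeholders, not Ideal Automobile's actual settings.

```python
# Hypothetical JuiceFS Hadoop SDK settings; in practice these go into
# core-site.xml (or are passed as "spark.hadoop.*" properties).
# All values below are placeholders for illustration only.
juicefs_hadoop_conf = {
    # Register the JuiceFS implementation for the jfs:// scheme
    "fs.jfs.impl": "io.juicefs.JuiceFileSystem",
    "fs.AbstractFileSystem.jfs.impl": "io.juicefs.JuiceFS",
    # Metadata engine URL (Redis here; MySQL and TiKV are also supported)
    "juicefs.meta": "redis://meta.example.internal:6379/1",
    # Local cache directory and size (in MiB) on each compute node
    "juicefs.cache-dir": "/data/jfs-cache",
    "juicefs.cache-size": "102400",
}
```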
Application scenarios
At present, after six months of use and iteration, JuiceFS has been applied in many business scenarios at Ideal Automobile. Here are a few typical ones that we hope will be useful to the JuiceFS community; your thoughts and questions are welcome.
Using JuiceFS as storage for the core data warehouse
Scenario
Currently, about 2 TB of data is added every day in the vehicle data analysis scenario. The data is read and written on JuiceFS directly by Spark for ETL processing. Because JuiceFS is fully compatible with the HDFS API, the business side only needs to point the table path at a JuiceFS directory; the switch is completely transparent to the business.
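As an illustration, the sketch below shows what such a Spark ETL job can look like once the JuiceFS Hadoop SDK is configured cluster-wide, so that jfs:// paths behave exactly like hdfs:// paths. The volume name, directory layout, and column names are invented for this example and are not the real warehouse schema.

```python
# A simplified PySpark ETL sketch; paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vehicle-signal-etl").getOrCreate()

# Read raw time-series signal data that lands on JuiceFS every day
raw = spark.read.parquet("jfs://warehouse/ods/vehicle_signals/dt=2021-12-01")

# Example transformation: per-vehicle hourly aggregation of a few signals
hourly = (
    raw.withColumn("hour", F.date_trunc("hour", F.col("event_time")))
       .groupBy("vin", "hour")
       .agg(F.avg("battery_temp").alias("avg_battery_temp"),
            F.max("speed").alias("max_speed"))
)

# Write the result back to JuiceFS; a downstream Hive table simply points
# its LOCATION at this directory, so no query logic has to change.
hourly.write.mode("overwrite").parquet(
    "jfs://warehouse/dwd/vehicle_signals_hourly/dt=2021-12-01"
)
```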
Benefits
After switching to JuiceFS, storage moved from HDFS's limited disks to object storage with virtually unlimited capacity, and storage and compute in the Hadoop cluster were separated. Storage can now be scaled elastically with JuiceFS, while the compute cluster can be expanded or shrunk independently according to workload. As a result, the data platform can support business growth and changing requirements with much more agility.
Improvement plan
When JuiceFS was launched in the first half of the year, it used Redis hosted in the public cloud to store metadata. Because JuiceFS relies on Redis's transaction API, Redis Cluster mode cannot be used, so the capacity of a single Redis instance limits the number of files in a single JuiceFS file system; for that reason, not all tables have been migrated to JuiceFS yet. Now that JuiceFS supports TiKV for metadata storage, we are preparing to test it and migrate all data to JuiceFS, using spare local physical disks as cache disks.
Tiered storage for the time-series database MatrixDB with JuiceFS
Scenario
In Ideal Automobile's MatrixDB cluster, there is still nearly 500 GB of incremental data per day even after compression. Such time-series data is highly time-sensitive: the older it gets, the less frequently it needs to be queried. The contradiction is that historical data still has low-frequency query requirements and cannot simply be deleted, while MatrixDB is designed around local storage and does not scale easily. After seeing JuiceFS's practice of data tiering with ClickHouse, we recommended it to the MatrixDB team, and soon MatrixDB supported an automatic storage tiering mechanism that moves warm and cold data from local disks to JuiceFS while still meeting query requirements.
Benefits
Storage costs were reduced by nearly 50% while users remained essentially unaware of the change. JuiceFS is used to implement tiered storage for the time-series database MatrixDB: hot data is written to local SSDs and automatically moved to JuiceFS by a lifecycle policy. The whole process only requires simple configuration and runs automatically and transparently, eliminating frequent manual capacity expansion and greatly reducing storage costs. The SSD capacity freed up can be used as a cache to accelerate occasional queries on warm and cold data, so good performance is maintained.
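MatrixDB's tiering is built into the database itself, so the following is only a conceptual sketch of the general idea behind such an age-based lifecycle policy: keep hot files on the local SSD and demote files that have not been modified for a while onto a POSIX-mounted JuiceFS directory. The paths and the age threshold are made up for illustration.

```python
# Conceptual sketch of an age-based demotion policy (not MatrixDB's actual
# implementation). Paths and threshold are hypothetical.
import os
import shutil
import time

LOCAL_SSD_DIR = "/data/matrixdb/hot"   # hypothetical hot-data directory
JUICEFS_DIR = "/jfs/matrixdb/cold"     # hypothetical JuiceFS mount point
COLD_AFTER_SECONDS = 30 * 24 * 3600    # demote files untouched for 30 days

def demote_cold_files() -> None:
    now = time.time()
    for root, _dirs, files in os.walk(LOCAL_SSD_DIR):
        for name in files:
            src = os.path.join(root, name)
            # Use the last modification time as a simple "hotness" signal
            if now - os.path.getmtime(src) < COLD_AFTER_SECONDS:
                continue
            rel = os.path.relpath(src, LOCAL_SSD_DIR)
            dst = os.path.join(JUICEFS_DIR, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # shutil.move copies across file systems, then removes the source
            shutil.move(src, dst)

if __name__ == "__main__":
    demote_cold_files()
```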
Cross-platform data exchange
Scenario
The data platform is built on the Hadoop technology stack, while the algorithm platform uses Kubernetes to manage resources. In many businesses the two platforms are upstream and downstream of each other: the data platform prepares the data, and the algorithm platform then uses it to train models. The way we exchange data is that the data platform writes data directly into Hive tables whose underlying storage is JuiceFS. When the algorithm platform starts a Pod, it automatically mounts the same JuiceFS file system in POSIX mode, so the application in the Pod can read the feature data as if it were a local directory, and the trained results can be written back to JuiceFS through POSIX as well. Data platform colleagues can then conveniently use the results produced by the algorithm team.
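The sketch below illustrates, under assumed paths, what the algorithm side of this exchange can look like inside a Pod once the same JuiceFS volume is mounted in POSIX mode (for example at /jfs): feature files written by Spark are read as ordinary local files, and the trained artifact is written straight back to JuiceFS where the data platform can pick it up. The directory names and the training logic are placeholders.

```python
# Minimal sketch of a training job running in a Pod with JuiceFS mounted at
# /jfs in POSIX mode. All paths and the "training" are hypothetical.
import glob
import pickle

FEATURE_DIR = "/jfs/warehouse/features/vehicle_signals_hourly"  # hypothetical
MODEL_OUT = "/jfs/models/vehicle_signal_model.pkl"              # hypothetical

def load_features():
    rows = []
    # Files produced by the data platform appear here as plain local files;
    # any ordinary reader (pandas, pyarrow, csv, etc.) can consume them.
    for path in sorted(glob.glob(f"{FEATURE_DIR}/*.csv")):
        with open(path) as f:
            rows.extend(line.rstrip("\n").split(",") for line in f)
    return rows

def train(rows):
    # Placeholder for the real training logic
    return {"n_samples": len(rows)}

if __name__ == "__main__":
    model = train(load_features())
    # Writing back through POSIX makes the result immediately visible to the
    # data platform, with no copy between storage systems.
    with open(MODEL_OUT, "wb") as f:
        pickle.dump(model, f)
```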
Benefits
Data workflows are getting longer and more complex, tasks need to be completed collaboratively across different platforms and teams, and data used to be constantly copied between different storage systems. The time spent copying data and verifying its correctness, and all the waiting and repeated work, were very inefficient. Now JuiceFS serves as a unified data lake that shares all kinds of data across platforms and applications without waiting, and efficiency is greatly improved.
Improvement plan
Currently, the data platform runs its ETL jobs in multi-tenant mode, while the Pods started by the algorithm platform run as the root user by default. After algorithm colleagues write result data back to JuiceFS, only the root user has write permission on it, so when the Hive component of the data platform tries to add partitions, the operation fails for lack of write permission. The planned solution is to add the Hive user to Hadoop's supergroup so that it has the same write permissions as the root user. We will test this solution together with the algorithm platform team after the recent release of the new version.
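One possible way to express that change, assuming the cluster uses Hadoop's static group mapping override in core-site.xml, is sketched below; the user and group names are typical defaults rather than the platform's actual values, and the mechanism finally chosen may differ.

```python
# Hypothetical core-site.xml entry, written here as a Python dict for
# illustration: statically map the "hive" service user into the superuser
# group so it can add partitions on directories written by root in the Pods.
core_site_overrides = {
    "hadoop.user.group.static.mapping.overrides": "hive=supergroup",
}
```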
Shared files across platforms
Scenario
Previously, the entire data platform used HDFS to share files: platform front-end applications uploaded data to HDFS directly through back-end service interfaces. However, downloads of large files from HDFS could fail during the concentrated task execution in the early morning, which affected task stability. The real-time development platform has now been switched to JuiceFS in POSIX mode to support file sharing, and the plan is to move all platforms that need to share files onto JuiceFS for unified management.
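As a rough illustration of what POSIX-based sharing looks like from a back-end service, the sketch below saves an uploaded file onto an assumed JuiceFS mount point at /jfs; the web framework (Flask), route, and paths are illustrative rather than the platform's actual implementation.

```python
# Minimal upload handler writing shared files onto a JuiceFS POSIX mount
# instead of HDFS. Paths and framework choice are hypothetical.
import os
from flask import Flask, request

app = Flask(__name__)
SHARED_DIR = "/jfs/shared/uploads"  # hypothetical shared directory

@app.route("/files", methods=["POST"])
def upload_file():
    f = request.files["file"]
    os.makedirs(SHARED_DIR, exist_ok=True)
    dst = os.path.join(SHARED_DIR, f.filename)
    # Plain POSIX write; any other platform mounting the same JuiceFS volume
    # sees the file immediately, with no HDFS client or download step needed.
    f.save(dst)
    return {"path": dst}, 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```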
Benefits
POSIX access makes application development easier and more efficient, and JuiceFS provides more stable throughput than HDFS does at peak load.
Looking forward
After nearly a year of using JuiceFS, we have been following the iterations of the JuiceFS community and have gained a better understanding of it. Problems encountered while using JuiceFS receive timely feedback from the community and are resolved quickly; many thanks to the community for its strong support. We keep upgrading along with community releases (JuiceFS is easy to upgrade). As for next year's plan, on the one hand we will continue to expand and deepen the usage scenarios, and we plan to verify and promote JuiceFS in the company's autonomous driving image retrieval scenarios, which involve very large numbers of images. On the other hand, we have started to do some development on JuiceFS ourselves; after verification, we will discuss it with the community and contribute it back upstream.
First, the goal for 2022 is to scale down and eventually eliminate HDFS, and at a later stage to use object storage as the underlying storage of the whole data lake. We also hope to open up data sharing between the data lake and the data warehouse. JuiceFS provides local caching to improve performance, and the Ideal Automobile storage team is currently working on features that increase the cache hit ratio, such as local P2P reads.
Second, our entire platform runs in a multi-tenant environment, while JuiceFS is currently designed around a single file system without multi-tenant functionality. We plan to develop Apache Ranger-like management tools that provide centralized management of security policies and monitoring of user access, in order to manage data security in JuiceFS.
Third, when JuiceFS is mounted in POSIX mode, the metadata engine connection information currently has to be passed in directly. We plan to wrap JuiceFS Community Edition to some extent: on the one hand, to make it convenient for users to create and manage their own JuiceFS volumes and to integrate the internal user authentication system, improving the user experience; on the other hand, to isolate cluster deployment details and make maintenance easier for the platform team.
Fourth, TiKV is planned as the metadata store in data lake scenarios, but TiKV is not as fast as Redis in some cases, so Redis will still be considered for scenarios with high metadata performance requirements but a controllable data volume. This creates the need to maintain multiple JuiceFS file systems, which we would like to present to users as a single file system in which each JuiceFS volume appears as a directory, similar to Hadoop's ViewFS over multiple NameNodes.
You are welcome to follow our project Juicedata/JuiceFS! (0 ᴗ 0 ✿)