Abstract: At the cloud-native enterprise data lake session of the 2021 Computing Conference, Jia Yangqing, senior researcher at Ali Cloud Intelligence, presented the talk “Cloud Native: Letting the Data Lake Accelerate into the 3.0 Era”.
This article recaps the talk from three angles: storage as a service, diversified computing, and intelligent management.
Here are the highlights:
Evolution of the data lake
Data Lake 1.0 (before 2019)
- Storage: storage-compute separation and hot/cold data tiering, mostly within the Hadoop ecosystem
- Management: no managed service; users handle capacity expansion and disk operation and maintenance themselves
- Computing: an initial form of cloud-native computing, still lacking elasticity and diversity
The concept of the data lake is familiar by now. Before 2019, the idea largely amounted to simple storage-compute separation: storage could be expanded and shrunk flexibly, and computing resources could be allocated according to computing needs. At that time storage could already be standardized as a service and planned independently of computing, but management of the data above it, and the elasticity of computing, were still relatively weak.
Data Lake 2.0 (2019 to 2021)
- Storage: centered on object storage; unified storage carries production workloads and provides large scale and high performance
- Management: vertical lake-management systems for individual products such as OSS and EMR, with little linkage between them
- Computing: elastic computing, with users scaling compute capacity according to load
Building on the foundation of Data Lake 1.0, many more capabilities were added. Once storage was standardized, Ali Cloud object storage OSS in particular became a de facto standard underlying storage solution for the data lake; its stability, scale, and performance provided a solid base. Single-cluster management also became possible: you could pull up a cluster such as EMR for some data management and control, though this was still fairly preliminary. Any computing cluster could reference the data in the lake and manage its metadata. At the same time, cloud native made more elastic computing feasible. Across the three dimensions of storage, computing, and management, storage matured fastest; computing diversification made good progress; and management was still being built out step by step.
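To make the point that any computing cluster can reference lake data concrete, here is a minimal sketch of a Spark job, as you might run on an EMR-style cluster, reading directly from an OSS path. The bucket name, path, and table are hypothetical, and OSS credentials are assumed to be configured in the cluster's Hadoop settings (via the hadoop-aliyun connector).

```python
# Minimal sketch: a Spark job on an EMR-style cluster reading data that
# lives in the lake (an OSS bucket) rather than on cluster-local HDFS.
# Bucket name and paths are hypothetical placeholders; OSS credentials are
# assumed to be configured in the cluster's Hadoop settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Reference lake data directly via the oss:// scheme.
events = spark.read.parquet("oss://my-lake-bucket/warehouse/events/")

# Register it as a table so downstream queries share the same metadata.
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()
```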
Data Lake 3.0 (2021)
- Storage: centered on object storage, building enterprise-grade data capabilities with full compatibility, multi-protocol access, and unified metadata
- Management: one-stop lake building and management for lake storage plus computing, achieving intelligent “lake building” and “lake governance”
- Computing: computing that is not only cloud native and elastic, but also real-time, AI-capable, and ecosystem-oriented
With Data Lake 3.0, the basic thinking is that storage, computing, and management all advance further. Storage needs broader compatibility, better consistency, and better durability. More importantly, on the management side, a data lake is not just a pile of data thrown somewhere; it must be manageable in an orderly way. Questions long asked in the traditional data warehouse world, such as what data is stored in the lake, how it is used, how often, and how good its quality is, apply to the data lake as well, so the lake needs a management system as complete and mature as the warehouse's. As for computing, it is not only elastic scaling but a diversification of workloads: where we used to run mostly ETL, we now also run real-time computing, AI computing, and many ecosystem engines integrated with the lake. These are the core problems Data Lake 3.0 needs to solve.
Upgrading storage from “cost center” to “value center”
- Smooth migration to the cloud: 100% compatible with HDFS, so stored data migrates to the cloud smoothly
- Lower operation and maintenance burden: a fully managed service mode reduces operational difficulty
- Extreme cost performance: hot/cold tiering and trillions of files per bucket, with costs reduced by 90%
- Accelerated AI innovation: on-demand data flow dramatically reduces computing wait times and enables efficient management
With object storage OSS as the underlying layer, migration to the cloud is very smooth and the burden of operations, maintenance, and management drops. A uniform, standard storage layer also lets technology accumulate on top of it. For example, hot/cold tiering automatically places data in OSS's hot or cold storage classes without the user having to think about it, reducing storage cost. In many cases, including in the AI world, users are unfamiliar with other storage forms and prefer a traditional file system such as CPFS; the integration of CPFS and OSS adds many new storage capabilities and solves those users' migration problems.
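The talk describes tiering as automatic, but the same hot/cold idea can also be expressed explicitly as OSS lifecycle rules. Below is a minimal sketch using the open-source OSS Python SDK (oss2); the bucket name, prefix, day thresholds, and credentials are hypothetical placeholders.

```python
# Minimal sketch: explicit hot/cold tiering on an OSS bucket via lifecycle
# rules (the managed data lake applies similar tiering automatically).
# Bucket name, prefix, and thresholds are hypothetical placeholders.
import oss2
from oss2.models import BucketLifecycle, LifecycleRule, StorageTransition

auth = oss2.Auth('<access-key-id>', '<access-key-secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'my-lake-bucket')

rule = LifecycleRule(
    'tier-cold-events', 'warehouse/events/',  # applies to this prefix only
    status=LifecycleRule.ENABLED,
    storage_transitions=[
        # Rarely-read objects move to Infrequent Access after 30 days...
        StorageTransition(days=30, storage_class=oss2.BUCKET_STORAGE_CLASS_IA),
        # ...and to Archive after 180 days, cutting storage cost further.
        StorageTransition(days=180, storage_class=oss2.BUCKET_STORAGE_CLASS_ARCHIVE),
    ])

bucket.put_bucket_lifecycle(BucketLifecycle([rule]))
```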
Intelligent upgrades to “lake building”, “lake management”, and “lake governance”
- Intelligent data ingestion into the lake
One-click ingestion from multiple data sources, with support for offline and real-time ingestion
- Metadata as a service for data computing
Servitized metadata that can manage a single table with millions of partitions
- Unified data permission management
Integration with multiple engines, supporting fine-grained access control at the database/table/column level
- Integrated lake-warehouse data management
Unified data development and full-link data governance across the data lake and the data warehouse
We spent more than a year building a new product, Data Lake Formation (DLF), to better manage the data lake across lake building, management, and governance. The first focus is getting data into the lake in a standardized, systematic way: not just writing a pile of scripts, but gathering diverse data into the lake in a managed, easier way. The second is metadata services. In a warehouse, metadata is built into the warehouse as a whole; in a data lake, the data sits in OSS. For metadata management, and especially for connecting metadata services to upper-level tools such as BI, DLF provides a service-oriented, standardized metadata layer, on top of which the data permissions, data quality, and related concerns that metadata enables can be governed properly. Opening up DataWorks to the data lake also enables better data governance. In an enterprise, data takes many forms: some lives in the lake, some in the warehouse. You may have heard the industry term LakeHouse, which often means building a warehouse on top of a lake. In reality, enterprises rarely need to build a warehouse on the lake from zero, because many traditional, well-organized data warehouses, much like well-kept Excel sheets, remain genuinely useful. So linking the flexibility of the lake with the structure of the warehouse underpins the tools and methodology we use to build, manage, and govern the lake.
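As a rough illustration of the pattern DLF serves (table data on OSS, metadata in a shared, service-oriented catalog that multiple engines can see), here is a hedged Spark SQL sketch. The exact Spark-to-DLF wiring is product-specific and omitted here; the database, table, and path names are hypothetical.

```python
# Minimal sketch of the lake pattern DLF manages: table files live on OSS,
# while a shared catalog (DLF in Ali Cloud's case) holds the metadata so
# multiple engines see the same databases and tables. The Spark-to-DLF
# configuration is product-specific and omitted; names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dlf-style-catalog")
         .enableHiveSupport()
         .getOrCreate())

# The database's files live in the lake, not inside any one cluster.
spark.sql("CREATE DATABASE IF NOT EXISTS lake_db "
          "LOCATION 'oss://my-lake-bucket/lake_db/'")

# A partitioned table: with millions of partitions, listing them through a
# metadata *service* beats scanning object-store prefixes every time.
spark.sql("""
  CREATE TABLE IF NOT EXISTS lake_db.orders (
    order_id BIGINT, amount DOUBLE
  ) PARTITIONED BY (ds STRING) STORED AS PARQUET
""")
```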
Upgrading from “single computing” to “full-scenario intelligent computing”
- Real-time data lake
Real-time data ingestion into the lake, with minute-level real-time updates
- Integrated lake and warehouse
Connecting the lake and the warehouse so that data flows intelligently and strengthens enterprises' data-driven business capabilities
- Data science
From BI to AI scenarios, with support for deep learning and heterogeneous computing frameworks
- Diverse computing-engine ecosystem
Support for Databricks, Cloudera, and other diversified computing and analytics capabilities
How do we make the data lake real-time? Real-time ingestion and minute-level updates come through open-source components such as Hudi. How do we better meet the needs of data science? In the AI field, for instance, data scientists prefer a Python-centric, programmatic development experience; how does that combine with the lake's underlying storage and management systems? And how do we integrate mature enterprise ecosystem products such as Databricks and Cloudera with our underlying data lake? These are enterprise-grade capabilities we have been building over the past year to make the data lake easier for developers and engineers to use. How to do storage, how to do management, and how to run more diverse computation: these are the core questions in the data lake's development into the 3.0 stage.
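To show what minute-level lake updates look like in practice, here is a minimal sketch of an upsert into an Apache Hudi table on OSS from Spark. The table name, key fields, and path are hypothetical, and it assumes a Spark session launched with the Hudi Spark bundle on the classpath.

```python
# Minimal sketch of real-time-style ingestion with Apache Hudi on Spark:
# upserts land in an OSS path and become queryable at minute-level latency.
# Table name, key fields, and path are hypothetical; assumes the Hudi
# Spark bundle is on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-ingest").getOrCreate()

updates = spark.createDataFrame(
    [(1, "click", "2021-10-20 10:01:00")],
    ["event_id", "event_type", "ts"],
)

(updates.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.partitionpath.field", "event_type")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")  # update-in-place
    .mode("append")
    .save("oss://my-lake-bucket/hudi/events/"))
```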
Thousands of enterprises join Ali Cloud in opening up Data Lake 3.0 best practices
- 6,000+ data lake customers
- EB-level data lake capacity
- Minute-level real-time data ingestion into the lake
- TB-level data lake throughput
Many enterprises use the data lake on Aliyun, with very large storage volumes and very diverse computing workloads, and the product has been polished together with them through that use. Since 2019, the continuous iteration of the data lake has been inseparable from the trust of our partners. Thank you.
This article is the original content of Aliyun and shall not be reproduced without permission.