Introduction: This article is based on a talk given at DataFunCon 2021 by Chen Xinwei, a cloud product developer for Alibaba Cloud Data Lake Formation (DLF). It mainly covers an analysis of the Lakehouse architecture and its practice on the cloud.
Author’s brief introduction
Chen Xinwei (alias Xikang), cloud product developer for Data Lake Formation on the Alibaba Cloud open-source big data team
Content framework
Lakehouse concepts and features
Lakehouse architecture and implementation
Lakehouse architecture and practices on the cloud
Case sharing and future outlook
Lakehouse concepts and features
Data architecture evolution
In the late 1980s, data warehouses, represented by products such as Teradata and Oracle, were mainly used to meet the data collection and computation needs of BI analysis and reporting. The built-in storage system provides data abstraction for the upper layers; after being cleaned and transformed, data is written according to a predefined schema. The emphasis on modeling and data management delivers better performance and data consistency, along with fine-grained data management and governance capabilities, including transaction and version management. Deep data optimization and tight integration with the compute engine improve external SQL query performance. However, as data volumes grow, data warehouses begin to face several challenges. First, storage and compute are coupled, so capacity must be procured for the peak of both, leading to high procurement costs and long procurement cycles. Second, more and more datasets are unstructured, which data warehouses cannot store or query. In addition, the data is not open enough to be used in other advanced analysis scenarios such as machine learning.
With the rise of the Hadoop ecosystem, systems such as HDFS, S3, and OSS store all data in a unified manner and support a wide range of data applications, and data lakes gradually became popular. The original built-in storage is replaced by a unified storage layer based on HDFS or cloud object storage, which is relatively low-cost and highly available. It can store all kinds of raw data without upfront modeling or transformation, offers low storage cost and strong scalability, supports semi-structured and unstructured data, and keeps data open so that it can be accessed by various compute engines and analysis methods, covering rich computing scenarios and making it flexible and easy to get started. However, over the past decade the problems of the data lake have gradually been exposed. Long data pipelines with many components lead to high error rates and poor data reliability. Continuous data migration and synchronization between systems challenge data consistency and timeliness. Data in the lake is disorganized, and querying it directly without optimization causes performance problems. The overall complexity of the system makes it expensive for enterprises to build and maintain.
To solve these problems, Lakehouse, which combines the advantages of the data lake and the data warehouse, came into being. The bottom layer is still low-cost, open storage, while the upper layer relies on data systems such as Delta Lake, Iceberg, and Hudi to provide data management features and efficient access performance and to support various kinds of analysis and computation, forming a new architecture that combines the strengths of both. Its storage-compute-separated architecture scales flexibly; reduced data movement keeps data reliability, consistency, and timeliness guaranteed; rich compute engines and paradigms are supported; and data organization and index optimization provide better query performance. However, because Lakehouse is still in a period of rapid development, mature products and systems are scarce and the key technologies iterate quickly. With few reference cases available, enterprises that want to adopt Lakehouse still need to make a certain technical investment.
Data architecture comparison
The figure above compares the data warehouse, the data lake, and Lakehouse across multiple dimensions. It is clear that Lakehouse integrates the advantages of the data warehouse and the data lake, which is why it is expected to become "the basic paradigm of the new generation of data architecture".
Lakehouse architecture and implementation
Lakehouse architecture diagram
Lakehouse = Object storage on cloud + Lake format + Lake management platform
Access layer
- The metadata layer is used to query and locate data
- Object storage supports high-throughput data access
- Open data formats support direct reading by the engine
- The declarative DataFrame API takes advantage of the SQL engine's optimization and filtering capabilities
Optimization layer
- Caching, auxiliary data structures (indexes and statistics), data layout optimization, governance
Transaction layer
- A metadata layer that supports transaction isolation and records the data objects contained in each table version
Storage layer
- Object storage on the cloud: low cost, high reliability, unlimited scaling, high throughput, O&M-free
- Common open data formats such as Parquet/ORC
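The openness of this stack comes from the data sitting in common formats on object storage, so any engine can read it directly. Below is a minimal, hedged sketch in PySpark: the oss:// path is a placeholder, and reading it requires an OSS-compatible Hadoop filesystem connector (for example JindoFS or the OSS Hadoop connector) on the classpath.

```python
# Minimal sketch: engines read open formats (Parquet/ORC) on object storage directly.
# The bucket and path below are placeholders; an OSS Hadoop connector is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("direct-read-sketch").getOrCreate()

df = spark.read.parquet("oss://my-bucket/lake/events/")  # placeholder path
df.printSchema()
print(df.count())
```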
The core of a Lakehouse implementation – the lake format
This section mainly introduces the characteristics of Lakehouse by focusing on two lake formats, Delta Lake and Iceberg.
Delta Lake
Key implementation: the transaction log as the single source of truth
- The transaction log works at transaction-commit granularity and records all operations on the table in sequence
- Write operations are serialized and atomic; MVCC plus optimistic locking handles concurrency conflicts
- Read operations use snapshot isolation and can read historical versions of the data (time travel)
- File-level data rewriting enables localized updates and deletes
- Incremental data processing based on the transaction log
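As an illustration of how the commit log and versioning surface to users, here is a minimal PySpark sketch of writing a Delta table and reading an earlier version back. It assumes the open-source delta-spark package is available; the table path is a placeholder.

```python
# Minimal sketch of Delta Lake commits and time travel, assuming the open-source
# delta-spark package is on the classpath. The table path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/demo_delta_table"  # placeholder location

# Each write commits a new entry to the transaction log under _delta_log/.
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)
spark.range(100, 200).write.format("delta").mode("append").save(path)

latest = spark.read.format("delta").load(path)                        # current snapshot
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)   # time travel to version 0
print(latest.count(), v0.count())  # 200 vs 100
```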
Delta Lake on EMR
To address shortcomings of the open-source version of Delta Lake, the Alibaba Cloud EMR team developed the following features:
Optimize & Z-Order
- Supports Z-Order data reclustering, which combined with data skipping speeds up queries
- Efficient Z-Order execution: 780 million rows over 2 fields completed in 20 minutes
- Supports Optimize to solve the small-file problem, with automatic compaction (see the sketch below)
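For orientation, the sketch below shows what compaction and Z-Order reclustering look like when issued through Spark SQL. The OPTIMIZE / ZORDER BY statements follow the commonly seen Delta Lake dialect; the exact syntax accepted by the EMR build may differ, and the table and column names are hypothetical.

```python
# Hedged sketch of compaction and Z-Order reclustering via Spark SQL.
# Assumes a Spark session with Delta Lake support (including OPTIMIZE) configured;
# table and column names are hypothetical, and syntax may vary by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("OPTIMIZE demo_delta_table")                          # compact small files
spark.sql("OPTIMIZE demo_delta_table ZORDER BY (user_id, dt)")  # recluster to improve data skipping
```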
SavePoint
- Supports creating, deleting, and querying savepoints to permanently retain the data of a specified version
Rollback
- You can roll back to a historical version to repair data
Automatically synchronize metadata to MetaStore
- Complete table information and partition information are automatically synchronized to MetaStore for Hive/Presto query without additional operations.
Multi-engine query support
- Supports queries from Hive, Trino, Impala, and Alibaba Cloud MaxCompute
Iceberg
Key implementation: a snapshot-based metadata layer
- A snapshot records all the data files of the current table version
- Each write operation commits a new version (snapshot)
- Snapshot-based isolation between reads and writes
- Incremental data consumption based on snapshot differences
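To make the snapshot mechanism concrete, here is a hedged PySpark sketch of reading a specific snapshot and consuming the increment between two snapshots, using options from Iceberg's Spark integration. The catalog, table, and snapshot IDs are placeholders and assume an Iceberg catalog is already configured on the session.

```python
# Sketch of snapshot-based reads and incremental consumption with Iceberg's Spark
# integration. Assumes an Iceberg catalog named "my_catalog" is configured on the
# session; the table name and snapshot IDs below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time travel: read the table as of a specific snapshot.
df_v1 = (
    spark.read.format("iceberg")
    .option("snapshot-id", "4872931822534218391")
    .load("my_catalog.db.events")
)

# Incremental read: only the data appended between two snapshots.
df_inc = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "4872931822534218391")
    .option("end-snapshot-id", "7912873519028471203")
    .load("my_catalog.db.events")
)
df_inc.show()
```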
Iceberg on EMR
Alibaba Cloud's contributions to Iceberg:
Optimization
- Cache acceleration in combination with JindoFS
- Automatic small-file compaction, committed as transactions
Integration with the Alibaba Cloud ecosystem
- Native access to OSS object storage
- Native access to DLF metadata
Open-source community contributions
- One Iceberg PMC member and one Iceberg committer
- Contribute to and maintain the Iceberg and Flink integration modules
- Led and designed the community's MOR streaming upsert functionality
Chinese community operations
- Maintain the largest Apache Iceberg data lake technology community in China (1250+ members)
- Among the most active Apache Iceberg evangelists in China:
- Organized the Apache Iceberg Shenzhen and Shanghai meetups
Selection reference
Lakehouse architecture and practices on the cloud
After introducing the Lakehouse concept, Databricks repositioned its company description as "The Databricks Lakehouse Platform". As the architecture diagram below shows, the bottom layer is cloud infrastructure (Alibaba Cloud has also partnered with Databricks to launch the Databricks DataInsight product). Once data is in the lake, a more focused data management and governance layer, built around the Delta Lake format, sits on top of the open data storage.
Data lake construction, management, and analysis process
- Data ingestion into the lake with cleaning
  Data from various sources is brought into the lake and cleaned: full, incremental, and real-time ingestion, ETL, data migration, metadata registration, etc.
- Data storage and management
  - Metadata management and services
  - Permission control and auditing
  - Data quality control
  - Lake table management and optimization
  - Storage management and optimization
- Data analysis and training
  Covers data modeling and development, offline analysis, real-time computing, interactive analysis, and machine learning and algorithm scenarios
- Data services and applications
  - Synchronizing the results of analysis and training to corresponding products for in-depth analysis or further processing
  - Directly connecting data applications such as business intelligence, real-time monitoring, data science, and data visualization
Alibaba Cloud Lakehouse architecture
- Data Layer (Data Lake Core Capabilities)
- Computing Layer (Elastic Computing Engine)
- Platform Layer (Data Development and Governance)
DLF unified metadata services and governance
- Unified metadata and permissions
  - Fully compatible with the HMS protocol, supporting access from multiple engines
  - Unified permission control across engines
  - Backed by KV storage on the cloud for high scalability and high performance
  - Supports metadata synchronization and a unified view for Delta Lake / Hudi / Iceberg
- Take Delta Lake's multi-engine access as an example
  - **Multi-form Spark:** DLF data exploration, Databricks Spark, MaxCompute serverless Spark
  - **Real-time computing:** Flink (integration in progress)
  - **EMR interactive analysis:** Trino, Impala
  - **EMR Hive:** the Delta connector contributed by Alibaba Cloud to the community, fully open source
- Lake metadata governance
  - Metadata analysis and management based on historical data
  - Cost analysis and optimization; hot/cold data analysis and performance optimization
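Because DLF exposes an HMS-compatible interface, an engine can be pointed at it much like a regular Hive Metastore. The following is an illustrative sketch only: the thrift endpoint is a placeholder, and on Alibaba Cloud EMR the DLF metadata client is normally preconfigured, so manual wiring like this is usually unnecessary.

```python
# Illustrative sketch of wiring an engine to an HMS-compatible metadata service.
# The metastore URI below is a placeholder; on EMR the DLF client is typically
# preconfigured, so this manual configuration is rarely needed in practice.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-compatible-metadata-sketch")
    .config("hive.metastore.uris", "thrift://metastore.example.internal:9083")  # placeholder endpoint
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # databases resolved through the unified metadata service
```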
Productized CDC ingestion into the lake
- Zero-code construction of ingestion pipelines from multiple data sources
- Template + configuration => Spark task
- The ingestion factory automatically generates the Spark SQL / DataFrame code
- CDC ingestion from multiple data sources, with automatic metadata synchronization
Take Hudi as an example
- MySQL, Kafka, Log Service, Tablestore, etc. are written to the Hudi data lake in real time
- Combined with Flink CDC, fully managed or semi-managed Flink writes to the Hudi data lake in real time and synchronizes metadata to DLF for further analysis by other engines
- Contributions to the Hudi community
  - Alibaba Cloud contributed Spark SQL and Flink SQL read/write support for Hudi
  - Implemented MERGE INTO and other commonly used CDC operators (see the sketch below)
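As a rough illustration of applying CDC-style changes with the SQL surface mentioned above, here is a hedged Spark SQL sketch of MERGE INTO against a Hudi table. The database, table, and column names are hypothetical, Hudi's Spark bundle and SQL extensions are assumed to be configured, and the exact syntax supported can vary across Hudi versions.

```python
# Hedged sketch of CDC-style upserts/deletes via Spark SQL MERGE INTO on a Hudi table.
# Assumes Hudi's Spark bundle and SQL extensions are configured on the session;
# database, table, and column names are hypothetical and syntax may vary by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  MERGE INTO lake_db.orders AS t
  USING staging_db.orders_cdc AS s
  ON t.order_id = s.order_id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```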
Data Lake governance
- Provides automated governance services based on the platform
Take Iceberg as an example
- Compact / Merge / Optimize
- Expire snapshots / Vacuum
- Caching (based on JindoFS)
- Automatic data analysis and tiered archiving
- Serverless Spark provides a low-cost managed resource pool service
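For context, the automated tasks listed above correspond roughly to Iceberg's Spark maintenance procedures when run by hand. The sketch below is hedged: catalog and table names are placeholders, it assumes Spark 3.x with an Iceberg catalog configured, and on the managed platform these jobs are scheduled automatically rather than executed manually.

```python
# Manual counterparts of the automated governance tasks, using Iceberg's Spark
# stored procedures (Spark 3.x). Catalog/table names are placeholders; on the
# managed platform these jobs are scheduled for you.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog "my_catalog" is configured

# Compact small files (the manual counterpart of automatic compaction).
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots to reclaim metadata and storage.
spark.sql("""
  CALL my_catalog.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2021-01-01 00:00:00')
""")

# Remove data files no longer referenced by any snapshot.
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.events')")
```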
Case sharing and future outlook
Case Study 1: A fully managed data lake solution
Big data platform architecture:
Before moving to the cloud, the customer's big data platform was deployed in a self-managed IDC, running a self-built Hadoop cluster based on CDH. Because the security and stability of the self-built IDC could not be guaranteed and its disaster recovery capability was poor, the customer decided to move the big data platform to the cloud.
Case details: developer.aliyun.com/article/800…
Fully managed Data Lake solution:
Take the RDS relational database as an example
Data can be ingested into the lake via Spark after ETL, or imported into OSS object storage through DLF's CDC incremental ingestion / batch full ingestion, all under unified metadata management:
- Real-time analysis based on Spark Streaming, with real-time reports and monitoring
- ETL tasks run on Databricks DataInsight on Alibaba Cloud for data modeling and further processing
- An EMR Presto cluster is pulled up for interactive analysis scenarios
- Unified batch and stream processing, with a simple and clear pipeline
- Delta Lake + Spark enables both real-time computing and offline analysis
- Fully managed services, free of operation and maintenance
  - OSS storage hosting
  - DLF metadata and ingestion task hosting
  - DDI Spark engine hosting
- Open data
  - Quick access from Presto for interactive analysis
In this scenario, all components are fully managed except the semi-managed EMR Presto. Users do not need to manage the underlying storage, metadata, or Spark engine, nor maintain or upgrade components, which improves the user experience while greatly reducing operation and maintenance costs.
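A minimal sketch of the real-time leg of this solution is shown below: a Spark Structured Streaming job reading from Kafka and appending to a Delta table on OSS. The broker address, topic, and oss:// paths are placeholders, and the job assumes the Kafka connector, an OSS-compatible filesystem connector, and Delta Lake are all available on the cluster.

```python
# Minimal sketch of the real-time analysis leg: Kafka -> Spark Structured Streaming
# -> Delta table on OSS. Broker, topic, and oss:// paths are placeholders; the Kafka
# connector, an OSS filesystem connector, and Delta Lake are assumed to be available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-to-delta-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "orders")                       # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v", "timestamp")
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "oss://my-bucket/checkpoints/orders")  # placeholder
    .outputMode("append")
    .start("oss://my-bucket/lake/orders_delta")                          # placeholder
)
query.awaitTermination()
```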
Case Study 2: A real-time data lake with storage and compute separated
This is a case of a real-time data lake with storage and computation separated. The user originally ran an architecture with storage and compute integrated: Hive metadata was stored in Hive Metastore and data was stored in HDFS. Disk failures occurred frequently, which complicated O&M and increased costs. On the other hand, Hive Metastore performance comes under serious pressure once the data reaches a certain scale, for example when the number of partitions reaches 100,000. The user therefore decided to migrate from the integrated architecture to a storage-compute-separated one.
- Integrated storage and compute -> separated storage and compute
  - Difficult HDFS O&M -> O&M-free OSS
  - High cost of HDFS on cloud disks -> low cost of OSS
  - Poor HMS scalability -> highly scalable DLF metadata
- Flink writes into the lake in real time, with Spark/Hive for analysis
  - Flink -> Hudi for efficient real-time writes
  - DLF metadata connects naturally to Spark/Hive
- Separation of HBase service and data
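The Flink -> Hudi path in this case can be sketched with PyFlink SQL as below. This is a hedged sketch: connector option names vary across Flink CDC and Hudi versions, all hostnames, credentials, and paths are placeholders, and the required connector jars are assumed to be bundled (as they typically are on managed Flink).

```python
# Hedged sketch of the Flink CDC -> Hudi ingestion path using PyFlink SQL.
# Connector option names vary by Flink CDC / Hudi version; hostnames, credentials,
# and paths are placeholders, and the connector jars are assumed to be available.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# CDC source: capture changes from a MySQL table (placeholder connection details).
t_env.execute_sql("""
  CREATE TABLE orders_src (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql.example.internal',
    'port' = '3306',
    'username' = 'reader',
    'password' = '******',
    'database-name' = 'shop',
    'table-name' = 'orders'
  )
""")

# Sink: a Hudi table on OSS (placeholder path).
t_env.execute_sql("""
  CREATE TABLE orders_hudi (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    updated_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
  ) WITH (
    'connector' = 'hudi',
    'path' = 'oss://my-bucket/lake/orders_hudi',
    'table.type' = 'MERGE_ON_READ'
  )
""")

# Continuously replicate MySQL changes into the Hudi table.
t_env.execute_sql("INSERT INTO orders_hudi SELECT * FROM orders_src")
```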
The future of Lakehouse
- Lake format capabilities will continue to move closer to warehouse/database capabilities
  - e.g. AWS Governed Table
  - Optimize is becoming more and more versatile (caching, clustering, Z-Ordering, ...)
  - The capabilities of the three lake formats continue to converge
- Lake formats will be more deeply integrated with storage and compute for higher performance
  - Custom APIs for lake formats appearing at the storage layer
  - More performance-oriented designs at the compute layer (secondary indexes, column statistics)
- The lake management platform is becoming more and more intelligent
  - Unified metadata services become standard; metadata hosting and services (e.g. Hudi Timeline Server)
  - Governance and optimization become platform infrastructure capabilities that are invisible to users
To sum up, judging both from the characteristics of Lakehouse itself and from the exploration and investment of various vendors in Lakehouse and lake formats, Lakehouse is likely to become the new generation of big data architecture.
This article is original content from Alibaba Cloud and may not be reproduced without permission.