Introduction: This article is based on a talk given at DataFunCon 2021 by Chen Xinwei, developer of Alibaba Cloud’s data lake construction cloud product. It mainly introduces the Lakehouse architecture and its practice on the cloud.

Author’s brief introduction

Chen Xinwei (alias Xikang), cloud product developer for data lake construction, Alibaba Cloud Open Source Big Data

Content framework

Lakehouse concepts and features

Lakehouse architecture and implementation

Lakehouse architecture and practices on the cloud

Case sharing and future outlook

Lakehouse concepts and features

Data architecture evolution

In the late 1980s, data warehouses represented by products such as Teradata and Oracle were mainly used to meet the data collection and computation needs of BI analysis and reporting. A built-in storage system provides data abstraction for the upper layers: data is cleaned and transformed, then written according to a predefined schema. The emphasis on modeling and data management delivers better performance and data consistency, along with fine-grained data management and governance capabilities such as transaction and version management; deep data optimization and tight integration with the computing engine improve SQL query performance. However, as data volumes grow, data warehouses face several challenges. First, storage and compute are coupled, so capacity must be purchased for the peak of both, which makes procurement expensive and slow. Second, more and more data is unstructured, which data warehouses cannot store or query. In addition, the data is not open enough to be used in other advanced analysis scenarios such as machine learning.

With the rise of the Hadoop ecosystem, storage systems such as HDFS, S3, and OSS began to store all data in a unified manner and support a wide range of data applications, and the data lake architecture became popular. The original built-in storage is replaced by a unified storage layer based on HDFS or cloud object storage, which is relatively low-cost and highly available. It can hold all kinds of raw data without upfront modeling or transformation, offers low storage cost and strong scalability, supports semi-structured and unstructured data, and keeps data open: it can be accessed by various computing engines and analysis methods, covering rich computing scenarios and remaining flexible and easy to start with. However, over the past decade the problems of the data lake have gradually been exposed: long data pipelines and numerous components lead to high error rates and poor data reliability; continuous data migration and synchronization between systems challenge data consistency and timeliness; data in the lake is disorganized, so unoptimized direct queries suffer performance problems; and the overall complexity of the system makes construction and maintenance expensive for enterprises.

To solve these problems, the Lakehouse, which combines the advantages of the data lake and the data warehouse, came into being. The bottom layer is still low-cost, open storage, while the upper layer relies on lake formats such as Delta Lake, Iceberg, and Hudi to provide data management features and efficient access performance, supporting various kinds of data analysis and computation and forming a new architecture that combines the strengths of both. The storage-compute separated architecture scales flexibly; reduced data movement keeps data reliable, consistent, and fresh; rich computing engines and paradigms are supported; and data organization and index optimization enable better query performance. However, because the Lakehouse is still in a period of rapid development, mature products and systems are scarce and the key technologies iterate quickly; with few reference cases to draw on, enterprises that want to adopt a Lakehouse still need to make a technical investment.

Data architecture comparison

The figure above compares the data warehouse, the data lake, and the Lakehouse across multiple dimensions. The Lakehouse clearly integrates the advantages of both, which is why it is expected to become “the basic paradigm of the new generation of data architecture”.

Lakehouse architecture and implementation

Lakehouse architecture diagram

Lakehouse = Object storage on cloud + Lake format + Lake management platform

Access layer

  • The metadata layer is queried to locate data
  • Object storage supports high-throughput data access
  • Open data formats support direct reading by the engine
  • The Declarative DataFrame API takes advantage of the optimization and filtering capabilities of the SQL engine

Optimization layer

  • Caching, Auxiliary data structures (indexing and Statistics), data layout optimization, Governance

Transaction layer

  • A metadata layer that supports transaction isolation by specifying the data objects contained in each table version

Storage layer

  • Cloud object storage: low cost, high reliability, unlimited scaling, high throughput, O&M-free
  • Common open file formats such as Parquet/ORC
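To make the four layers described above concrete, here is a minimal PySpark sketch of the access path, assuming the open-source Delta Lake package and an OSS Hadoop connector are available and using a hypothetical bucket path: the declarative DataFrame API issues a filtered read, the transaction layer (Delta’s log) resolves which Parquet files belong to the current table version, and the object store serves them.

```python
from pyspark.sql import SparkSession

# Minimal sketch of the access path; bucket and table paths are hypothetical,
# and the delta-spark package plus an OSS Hadoop connector are assumed.
spark = (
    SparkSession.builder
    .appName("lakehouse-access-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Transaction layer: Delta's _delta_log decides which Parquet files make up
# the current table version; storage layer: plain object storage (OSS).
orders = spark.read.format("delta").load("oss://my-bucket/lakehouse/orders")

# Access layer: the declarative DataFrame API lets the engine push the filter
# down and use file-level statistics (data skipping) from the optimization layer.
orders.where("order_date >= '2021-01-01'").select("order_id", "amount").show()
```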

The core of the Lakehouse implementation: the lake format

This article mainly introduces the characteristics of the Lakehouse through two lake formats, Delta Lake and Iceberg.

Delta Lake

Key implementation: the transaction log as the single source of truth

  • The transaction log is recorded at transaction-commit granularity and records all operations on the table in order
  • Writes use serializable isolation and are atomic; MVCC plus optimistic locking resolves concurrency conflicts
  • Reads use snapshot isolation and can read historical versions of the data, enabling time travel
  • File-level data updates enable partial updates and deletes
  • Data processing based on incremental logs
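As a hedged illustration of the transaction log, snapshot isolation, and incremental reads described above, the snippet below (table path hypothetical, an active SparkSession as in the earlier sketch assumed) inspects a Delta table’s history, reads an old version, and tails the log as a stream.

```python
from delta.tables import DeltaTable

path = "oss://my-bucket/lakehouse/orders"  # hypothetical table path

# Every commit appends an entry to _delta_log; history() surfaces those commits.
DeltaTable.forPath(spark, path).history(10) \
    .select("version", "timestamp", "operation").show(truncate=False)

# Time travel: read the table as of an earlier version. Snapshot isolation
# means this read is unaffected by writers committing new versions.
orders_v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)

# Incremental processing: Structured Streaming tails the transaction log and
# only picks up files added by newly committed versions.
changes = spark.readStream.format("delta").load(path)
```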

Delta Lake on EMR

To address shortcomings of the open-source version of Delta Lake, Alibaba Cloud’s EMR team developed the following features:

Optimize & Z-Order

  • Supports Z-Order to reorganize the data layout, combined with data skipping to speed up queries
  • Efficient Z-Order execution: 780 million rows over 2 fields completed in 20 minutes
  • Supports Optimize to solve the small-file problem, including automatic compaction (sketched below)
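The exact EMR commands are not shown in the talk; as a rough, non-authoritative illustration, later open-source Delta Lake releases expose equivalent SQL, which looks roughly like this (table name hypothetical):

```python
# Illustrative only: syntax follows later open-source Delta Lake releases;
# the EMR-specific commands may differ.
spark.sql("OPTIMIZE lakehouse.orders")  # compact small files
spark.sql("OPTIMIZE lakehouse.orders ZORDER BY (user_id, order_date)")
# Z-Ordering co-locates rows with similar key values so that per-file min/max
# statistics (data skipping) can prune most files for selective queries.
```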

SavePoint

  • Savepoints can be created, deleted, and queried to permanently retain the data of a specified version

Rollback

  • You can roll back to a historical version to repair data

Automatically synchronize metadata to MetaStore

  • Complete table information and partition information are automatically synchronized to MetaStore for Hive/Presto query without additional operations.

Multi-engine query support

  • Supports queries from Hive, Trino, Impala, and Alibaba Cloud MaxCompute

Iceberg

Key implementation: Snapshot-based metadata layer

  • A snapshot records all the files of the current version of the table
  • Each write operation commits a new version
  • Reads and writes are isolated from each other through snapshots

Incremental data consumption based on snapshot differences
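A small PySpark sketch of what snapshot-based reads and incremental consumption look like with Iceberg’s Spark integration (the catalog, table, and snapshot IDs below are hypothetical):

```python
# The metadata layer keeps a lineage of snapshots, exposed as a metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots"
).show(truncate=False)

# Read the table as of a specific snapshot (read/write isolation, time travel).
old = (spark.read.format("iceberg")
       .option("snapshot-id", 1111111111111111111)   # hypothetical snapshot ID
       .load("demo.db.events"))

# Incremental consumption: only rows appended between two snapshots.
increment = (spark.read.format("iceberg")
             .option("start-snapshot-id", 1111111111111111111)
             .option("end-snapshot-id",   2222222222222222222)
             .load("demo.db.events"))
```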

Iceberg on EMR

Alibaba Cloud’s contributions to Iceberg:

Optimize

  • Provides cache acceleration in conjunction with JindoFS
  • Automatic small-file merge transactions

Alibaba Cloud ecosystem integration

  • Native access to OSS object storage
  • Native access to DLF metadata

Open source community investment

  • One Iceberg PMC member and one Iceberg committer
  • Contributes to and maintains the Iceberg–Flink integration module
  • Led the design of the community’s MOR (merge-on-read) streaming upsert functionality

Chinese Community Operation

  • Maintains the largest Apache Iceberg data lake technology community in China (1,250+ members)
  • One of the most active Apache Iceberg evangelists in China
  • Organized the Apache Iceberg meetups in Shenzhen and Shanghai

Selection reference

Lakehouse architecture and practices on the cloud

After introducing the Lakehouse concept, Databricks changed the company’s positioning to “The Databricks Lakehouse Platform”. As the architecture diagram below shows, the bottom layer is the cloud infrastructure (Alibaba Cloud has also partnered with Databricks to launch Databricks DataInsight). On top of the open data storage, combined with the Delta Lake lake format, sits a more focused data management and governance layer that takes over once data has entered the lake.

Data lake construction, management, and analysis process

  • Data ingestion into the lake and cleansing

Data from various sources is ingested into the lake and cleansed: full ingestion, incremental ingestion, real-time ingestion, ETL, data migration, metadata registration, and so on

  • Data storage and management

  • Metadata management and services

  • Permission control and audit

  • Data quality control

  • Lake table management and optimization

  • Storage management and optimization

  • Data analysis and training

Covers data modeling and development, offline analysis, real-time computing, interactive analysis, machine learning, and algorithm scenarios

  • Data services and applications

  • The results of analysis and training are synchronized to the corresponding products for in-depth analysis or further processing

  • Data applications such as business intelligence, real-time monitoring, data science and data visualization are directly connected

Alibaba Cloud Lakehouse architecture

  • Data Layer (Data Lake Core Capabilities)
  • Computing Layer (Elastic Computing Engine)
  • Platform Layer (Data Development and Governance)

DLF unified metadata services and governance

  • Unified metadata and permissions

  • Fully compatible with the HMS protocol, supporting access from multiple engines

  • Unified permission control for multiple engines

  • Cloud KV storage provides a highly scalable, high-performance service

  • Supports Delta Lake/Hudi/Iceberg metadata synchronization and a unified view

  • Taking Delta Lake’s support for multi-engine access as an example:

  • **Multiple forms of Spark:** DLF data exploration, Databricks Spark, MaxCompute Serverless Spark

  • **Real-time computing:** Flink (integration in progress)

  • **EMR interactive analysis:** Trino, Impala

  • **EMR Hive:** the Delta connector contributed by Alibaba Cloud to the community, fully open source

  • Lake metadata governance

  • Metadata analysis and management based on historical data

  • Cost analysis and optimization, hot/cold data analysis, and performance optimization

Productized CDC ingestion into the lake

  • Zero-code construction of data ingestion pipelines from multiple data sources

  • Template + Configuration => Spark task

  • Spark SQL/DataFrame code is generated automatically by the ingestion factory

  • CDC ingestion from multiple data sources into the lake, with automatic metadata synchronization

Take Hudi as an example

  • MySQL, Kafka, Log Service, Tablestore, etc. are written to the Hudi data lake in real time
  • Combined with Flink CDC, fully managed and semi-managed Flink write to Hudi data lakes in real time and synchronize metadata to DLF for further analysis by other engines
  • Hudi community contributions
  • Alibaba Cloud contributed Spark SQL and Flink SQL support for reading and writing Hudi
  • Implemented MERGE INTO and other commonly used CDC operators (see the sketch below)
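As a hedged sketch of the kind of CDC apply step such a pipeline can generate, the following uses the MERGE INTO support mentioned above via Spark SQL (table and column names are hypothetical; `cdc_batch` stands in for one batch of change records carrying an `op` flag):

```python
# hudi_users: a Hudi table; cdc_batch: one batch of change records from a CDC
# source, with an `op` column marking insert/update/delete.
spark.sql("""
    MERGE INTO hudi_users AS t
    USING cdc_batch AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```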

Data Lake governance

  • Provide automated governance services based on the platform

In the case of Iceberg

  • Compact / Merge / Optimize

  • Expire-snapshot / Vacuum

  • Caching (Based on JindoFS)

  • Automatic data analysis and tiered archiving

  • Serverless Spark provides a low-cost managed resource pool service
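For reference, these maintenance actions map onto Iceberg’s built-in Spark procedures, roughly as below (catalog and table names hypothetical); on the platform they are triggered automatically rather than by hand.

```python
# Compact the small files produced by streaming/CDC writes.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots and delete files no longer referenced by any snapshot.
spark.sql("CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 10)")

# Clean up orphan files left behind by failed writes.
spark.sql("CALL demo.system.remove_orphan_files(table => 'db.events')")
```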

Case sharing and future outlook

Case Study 1: A fully managed data lake solution

Big data platform architecture:

Before moving to the cloud, the customer’s big data platform was deployed in an IDC data center as a self-built CDH Hadoop cluster. Because the security and stability of the self-built IDC could not be guaranteed and its disaster recovery capability was poor, the customer decided to move the big data platform to the cloud.

Case details: developer.aliyun.com/article/800…

Fully managed Data Lake solution:

Take the RDS relational database as an example

Under unified metadata management by DLF, data can be ingested into the lake with Spark after ETL, or imported into OSS object storage through CDC incremental ingestion or batch full ingestion:

  • Real-time analysis based on Spark Streaming, with real-time reports and monitoring dashboards
  • ETL tasks run on Databricks DataInsight (DDI) on Alibaba Cloud for data modeling and further processing
  • An EMR Presto cluster is spun up for interactive analysis scenarios

  • Unified batch and stream processing with a simple, clear pipeline (see the sketch at the end of this case)

  • Delta Lake + Spark enables real-time computing and offline analysis

  • Fully managed services, free of operation and maintenance

  • OSS Storage Hosting

  • DLF metadata and ingestion task hosting

  • DDI Spark engine hosting

  • Open data

  • Quick access to Presto for interactive analysis

In this scenario, all components are fully managed except for the semi-managed EMR Presto. Users do not need to manage the underlying storage, metadata, or Spark engine, nor maintain or upgrade components, which improves the user experience while greatly reducing operation and maintenance costs.
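The sketch below gives a minimal picture of the batch-stream-unified link in this case, assuming hypothetical Kafka brokers, topic, schema, and OSS paths: a Structured Streaming job appends into a Delta table on OSS, and the same table is then queried by Presto or DDI Spark for interactive and offline analysis.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

# Hypothetical event schema for the Kafka topic.
schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

# Requires the spark-sql-kafka package; broker and topic are hypothetical.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Continuous append into a Delta table on OSS; the same table serves
# offline analysis (DDI Spark) and interactive analysis (EMR Presto).
query = (events.writeStream.format("delta")
         .option("checkpointLocation", "oss://my-bucket/chk/orders")  # hypothetical
         .outputMode("append")
         .start("oss://my-bucket/lakehouse/orders"))
```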

Case Study 2: A real-time data lake with storage-compute separation

This is an example of a real-time data lake with storage and compute separated. The user originally ran a coupled storage-compute architecture: Hive metadata was stored in Hive Metastore and data in HDFS. Disk failures occurred frequently, complicating O&M and raising costs. In addition, Hive Metastore performance comes under severe pressure once the data reaches a certain scale, for example when the number of partitions reaches 100,000. The user therefore decided to migrate from the coupled architecture to one with storage and compute separated.

  • Storage-compute integration -> storage-compute separation

  • Difficult HDFS O&M -> O&M-free OSS

  • High cost of HDFS on cloud disks -> low cost of OSS

  • Poor HMS scalability -> highly scalable DLF metadata

  • Flink writes to the lake in real time, with Spark/Hive for analysis (sketched at the end of this list)

  • Flink -> Hudi efficient real-time write capabilities

  • DLF metadata naturally connects to Spark/Hive

  • HBase service and data separation
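As a rough PyFlink sketch of the real-time ingestion path in this case (connection parameters, table names, and paths are hypothetical; the Flink CDC and Hudi connector jars are assumed to be available), a MySQL table is captured via CDC and continuously upserted into a Hudi table on OSS, whose metadata DLF then exposes to Spark/Hive:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment; CDC and Hudi connector jars are assumed
# to be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical MySQL source captured with the Flink CDC connector.
t_env.execute_sql("""
    CREATE TABLE users_src (
        id BIGINT,
        name STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql-host', 'port' = '3306',
        'username' = 'user', 'password' = 'pass',
        'database-name' = 'app', 'table-name' = 'users'
    )
""")

# Hypothetical Hudi target on OSS, merge-on-read for low-latency upserts.
t_env.execute_sql("""
    CREATE TABLE users_hudi (
        id BIGINT,
        name STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'oss://my-bucket/lakehouse/users_hudi',
        'table.type' = 'MERGE_ON_READ'
    )
""")

# Continuously replicate changes into the lake; Spark/Hive read the table
# through DLF metadata for analysis.
t_env.execute_sql("INSERT INTO users_hudi SELECT * FROM users_src")
```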

The future of Lakehouse

  • Lake format capabilities will continue to grow closer to warehouse/database capabilities

  • AWS Governed Tables

  • Optimization becomes more and more versatile (caching, clustering, Z-Ordering…)

  • The capabilities of the three lake formats continue to converge

  • Lake formats are increasingly integrated with storage/computing for higher performance

  • A custom API for the lake format appears in the storage tier

  • More performance-oriented solutions for the computing layer (secondary indexes, Column Statistics)

  • The lake management platform is becoming more and more intelligent

  • Unified metadata services become standard, with metadata hosting and serving (e.g., the Hudi Timeline Server)

  • Governance and optimization become platform infrastructure capabilities that are not visible to users

To sum up, judging both from the characteristics of Lakehouse itself and from the exploration and investment of vendors in Lakehouse and lake formats, Lakehouse is likely to become the new generation of big data architecture.
