Delta Lake and Hudi are popular open-format storage layers that provide both streaming and batch operations for data lakes. They allow applications such as BI to run directly on the data lake, so data analysts can query fresh data in real time and generate instant business insights. MaxCompute delivers data warehouse performance on the data lake by supporting Delta Lake and Hudi in a unified lakehouse architecture.

Author: Meng Shuo, Product Expert at Alibaba Cloud Intelligence

I. Best Practice Background

This best practice is built on the MaxCompute lakehouse architecture and simulates an enterprise scenario. For example, Company A uses RDS as its business database and Alibaba Cloud EMR for log data collection. Data is pooled into Object Storage Service (OSS) on the cloud, and Delta Lake and Hudi, the storage formats commonly used by data lakes, provide stream and batch processing capabilities for the lake. MaxCompute can then query the real-time data for immediate insight into business data changes. In the demo architecture, real-time data changes produced by EMR on the cloud and by the online RDS database are ingested into the lake and reflected in the OSS archive in real time, while MaxCompute and other engines analyze the data on OSS.

Lakehouse: integrating heterogeneous data platforms

An enterprise has many business lines and departments, and each adopts a different data architecture and platform based on its own business needs and its employees' technology stacks, ranging from the Hadoop ecosystem to fully managed cloud architectures. Because different departments run on different technical platforms, data becomes fragmented. The lakehouse helps enterprises bridge these heterogeneous platforms: the underlying data can be accessed across platforms and the metadata layer is mutually visible, so data can flow freely. The data lake part supports not only EMR but also ECS-hosted Hadoop and on-premises IDC Hadoop, so all of this data can be aggregated in MaxCompute. For example, if there are tables in RDS and Hive, and MaxCompute holds a large fact table, a federated query joining the three tables can easily be done with this architecture, as the sketch below shows.
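
As a sketch of such a federated query (all project and table names here are hypothetical), once the RDS and Hive tables are mapped into MaxCompute, a single SQL statement can join them with a native MaxCompute fact table:

-- Hypothetical names: ext_rds.orders lives in RDS, ext_hive.user_dim in Hive,
-- and dw.sales_fact is a native MaxCompute fact table.
SELECT u.user_name, SUM(f.amount) AS total_amount
FROM ext_hive.user_dim u
JOIN ext_rds.orders o ON o.uid = u.uid
JOIN dw.sales_fact f ON f.order_id = o.order_id
GROUP BY u.user_name;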

Faster business insights

  • DataWorks self-service lakehouse integration: connect a heterogeneous data platform (Hadoop / DLF+OSS) in 5 minutes

Broader ecosystem integration

  • Supports building Alibaba Cloud's cloud-native data lake (DLF)
  • Supports queries over Delta Lake and Hudi formats
  • Supports connecting to more external federated data sources: Hologres (RDS and HBase coming soon!)

Higher performance

  • Intelligent caching accelerates OSS/HDFS data access
  • Lake data query acceleration

Better integrated data development and governance

  • Cross-engine development and scheduling
  • Unified lake/warehouse data development experience
  • Unified lake/warehouse global asset management

Lakehouse architecture

Look first at the right side, which connects to OSS and DLF. A large amount of semi-structured and structured data is archived in OSS, coming from relational databases and NoSQL databases, and DLF can crawl the metadata out of OSS. Instead of creating OSS external tables in MaxCompute to access OSS data, it is now more convenient to let DLF recognize OSS schemas automatically; data updates and schema changes are also recognized automatically. DLF additionally provides user permission management, which will go online later: for all engines connected to OSS, data access permissions converge in DLF.
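
For contrast, the manual route that DLF replaces looked roughly like this (a sketch only; the columns are hypothetical, and the path reuses this demo's rtcompute/csvfile directory): an OSS external table had to be declared by hand in MaxCompute, and re-declared whenever the schema changed:

-- Hypothetical columns; DLF now discovers this schema automatically.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_csv_demo (
    transactionid STRING,
    uid STRING,
    card STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler' -- MaxCompute's built-in handler for CSV files on OSS
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/rtcompute/csvfile/';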

On the left is access to the Hadoop ecosystem: Alibaba Cloud semi-managed EMR, open-source Hadoop on ECS and in IDCs, and the mainstream CDH distribution can all be connected. Below that are federated data sources: MaxCompute can connect to a variety of cloud data sources, and DataWorks sits on top for unified development, management, and data governance. This gives you a global view of data assets; development work can reach the connected data, and metadata is projected onto DataWorks. That is the overall lakehouse architecture.

II. Introduction to Related Products

Data lake building: Data Lake Formation (DLF)

DLF hosts the metadata for data on OSS and connects to engines such as EMR, Databricks DataInsight, PAI, MaxCompute, and Hologres, which all share one copy of metadata in DLF. Enterprises can then switch engines to suit different departments and business scenarios. Access control and other enterprise capabilities such as security, including data tiering, will be improved later.
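
As a small illustration of this shared metadata (an assumption-laden sketch: it presumes an EMR cluster configured to use DLF as its metastore, and reuses the covid_19 database and sh_rds table from the demo below), the same table is readable from different engines without copying data:

-- From Spark SQL on EMR, with DLF as the cluster's metastore:
SELECT COUNT(*) FROM covid_19.sh_rds;
-- From MaxCompute, through an external project mapped to the same DLF database:
SELECT COUNT(*) FROM ext_dlf_delta.sh_rds;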

Data ingestion into the lake

  • Supports multiple data sources for ingestion, such as MySQL, SLS, OTS, and Kafka
  • Offline and real-time ingestion, supporting Delta Lake, Hudi, and other data lake formats
  • Supports field mapping, transformation, and custom UDF operations

Metadata management

  • Unified metadata management solves metadata consistency across multiple engines
  • Compatible with open-source ecosystem APIs
  • Generates metadata automatically to reduce usage costs

Access control

  • Centralized data access permission control, with unified authorization across engines
  • Data access log auditing and access statistics

III. Best Practice Demo

Overall architecture

Building and using a data lake in an enterprise generally goes through several stages: data ingestion, data storage and management, and data exploration and analysis. This article introduces a one-stop ingestion and analysis practice built on Alibaba Cloud MaxCompute, Data Lake Formation (DLF), and Object Storage Service (OSS). The main data link is as follows:

Step 1: RDS data preparation

Prepare the RDS data: create the database academy_db in RDS, and in the account center create a user account that can read the database. Log in to the database via DMS and run the following statements to create the anti_fraud_result table and insert a small amount of data.

CREATE TABLE `anti_fraud_result` (
  `transactionid` varchar(32) NOT NULL,
  `uid` varchar(32) DEFAULT NULL,
  `card` varchar(32) DEFAULT NULL,
  `longitude` double(12,8) DEFAULT '120.00000000',
  `latitude` double(12,8) DEFAULT '12.00000000',
  PRIMARY KEY (`transactionid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `anti_fraud_result` VALUES ("12489571","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` VALUES ("12489572","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` VALUES ("12489573","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` VALUES ("12489574","82005","123123",3.14592040,101.12315432);
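
As an optional sanity check, a quick count in DMS should reflect the rows just inserted:

SELECT COUNT(*) FROM `anti_fraud_result`; -- expect 4 rows at this point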

Step 2: DLF data ingestion into the lake

  • Open the DLF console: dlf.console.aliyun.com/cn-hangzhou… Choose "Data Ingestion" -> "Data Source Management" -> "New Data Source" and enter the database connection information.

After filling in the RDS instance, login information, and network and security group information, click "Connection Test".

  • Create a bucket and directory in OSS, for example, bucket: rtcompute, directory: oss://rtcompute/csvfile/
  • In the DLF console, choose "Metadata Management" -> "Metadata Database" from the left menu and click "New Metadata Database". Enter the name covid_19, then create and select a directory.

  • Create an ingestion task:

A. In the DLF console, click "Data Ingestion" -> "Ingestion Task Management", then click "New Ingestion Task" -> select "Relational Database Real-time Ingestion", and fill in the data source, target data lake, task configuration, and other information as shown in the figure below. Then save.

To use the Hudi format instead, select Hudi under "Storage Format":

B. Next, enter the task instance name, RAM role, and maximum resource usage, and click Next to confirm the information

C. On the "Ingestion Task Management" page of the DLF console, find the newly created task in the list and click "Run" to start it. The task performs a full import followed by incremental synchronization: after 3 to 5 minutes the full data is imported, and the task automatically enters real-time monitoring. Any later data update is automatically applied to the Delta Lake or Hudi table.
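
To exercise the incremental path, you can, for example, change an existing row in RDS through DMS (an illustrative update; any DML will do) and watch the change land in the Delta Lake or Hudi files on OSS:

-- Illustrative update: the ingestion task reads the change from the RDS
-- binlog and applies it to the lake-format table on OSS.
UPDATE `anti_fraud_result` SET `card` = '456456' WHERE `transactionid` = '12489571';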

Step 3: query data in MaxCompute

Activate DataWorks and MaxCompute (see the documentation for a quick start), and open the DataWorks ad hoc query interface (see the documentation).

  • Create a MaxCompute external project that maps the DLF metadata:

    create externalproject -source dlf -name ext_dlf_delta -ref ms_proj1 -comment "DLF" -region "cn-shanghai" -db covid_19 -endpoint "dlf-share.cn-shanghai.aliyuncs.com" -ossEndpoint "oss-cn-shanghai-internal.aliyuncs.com";

    show tables in ext_dlf_delta;

  • Query data:

    set odps.sql.hive.compatible=true;
    set odps.sql.split.hive.bridge=true;
    select * from ext_dlf_delta.sh_rds;
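
Ordinary MaxCompute SQL also works against the external project; for instance, an illustrative aggregation over the same table:

    set odps.sql.hive.compatible=true;
    set odps.sql.split.hive.bridge=true;
    -- Illustrative only: count ingested transactions per uid
    select uid, count(*) as txn_cnt from ext_dlf_delta.sh_rds group by uid;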

Step 4: Insert new data into RDS

INSERT INTO `anti_fraud_result` values ("12489575","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` values ("12489576","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` values ("12489577","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` values ("12489578","82005","123123",3.14592040,101.12315432);
INSERT INTO `anti_fraud_result` values ("12489579","82005","123123",3.14592040,101.12315432);

Verify data in MaxCompute

set odps.sql.hive.compatible=true;
set odps.sql.split.hive.bridge=true;
select * from ext_dlf_delta.sh_rds;

Within 1 minute, check that the data in MaxCompute has been updated:
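
A row count makes the check concrete; assuming no other writes, the 4 original rows plus the 5 new inserts should be visible:

set odps.sql.hive.compatible=true;
set odps.sql.split.hive.bridge=true;
select count(*) from ext_dlf_delta.sh_rds; -- expect 9 rows (4 initial + 5 new)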


This article is original content from Alibaba Cloud and may not be reproduced without permission.