Introduction: Data lake technology has become one of the hottest topics in big data. As data lakes are deployed and used widely in the cloud, their business value has gradually become an industry consensus, and more and more enterprises are discussing how to build a data lake architecture quickly. This article shares the architecture and practice of rapidly building a cloud-native, enterprise-grade data lake.
Wang Zhen, Technical Expert, Open Source Big Data Platform, Alibaba Cloud Computing Platform Division
This article is based on the Open Source Big Data Technology online Meetup held on August 21, 2021.
Live replay link: developer.aliyun.com/live/247227
Content framework:
- Background
- How to use the DLF data lake
- Hands-on demonstration
I. Background
What is a data lake
Data lake: stores various types of data in a unified way
- Structured data (ORC, Parquet)
- Semi-structured data (JSON, XML)
- Unstructured data (images, videos)
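The three categories above are usually read through different Spark APIs. Below is a minimal PySpark sketch, with hypothetical OSS bucket and paths, showing one read per category; it assumes a cluster (such as EMR) where the OSS connector is already configured.

```python
from pyspark.sql import SparkSession

# Minimal sketch: reading the three data categories with Spark.
# The OSS bucket and paths are hypothetical placeholders; reading oss:// paths
# assumes the OSS Hadoop connector is configured (as on EMR clusters).
spark = SparkSession.builder.appName("lake-read-demo").getOrCreate()

# Structured data: columnar formats such as Parquet or ORC
orders = spark.read.parquet("oss://my-lake-bucket/warehouse/orders/")

# Semi-structured data: JSON (XML needs a separate data source package)
events = spark.read.json("oss://my-lake-bucket/raw/events/")

# Unstructured data: images/videos read as raw binary files (Spark 3.0+)
images = spark.read.format("binaryFile").load("oss://my-lake-bucket/raw/images/")

orders.printSchema()
events.printSchema()
images.select("path", "length").show(5, truncate=False)
```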
Why a data lake
1. Data volumes keep expanding
- Big data storage needs governance
- Data governance requires clear data dependencies (lineage)
- Users need to understand the total cost of ownership (TCO) of their big data
2. Diversified data sources
- Transaction data (MySQL, SQL Server)
- Search data (Solr)
- Batch data (Spark, Hive)
3. Diversified data formats
- Parquet / ORC / Avro / CSV / JSON / Text
4. Diversified data analysis scenarios
- Semantics-based search analysis
- Ad hoc / near-real-time OLAP analysis
5. Diversified data analysis users
- Diverse analyst roles (development / testing / data / BI)
- Compliance control requirements for user data access
What can a data lake do
1. For the continued expansion of data volumes
- The data lake provides a data lineage service
- The data lake provides data governance services
- The data lake helps users understand the overall cost of their big data
2. For diversified data sources
- DLF provides a unified metadata service
• Resolves metadata consistency issues across multiple engines
• Reduces the cost of using and maintaining metadata
3. For diversified data formats
- DLF provides data ingestion (into the lake) and metadata crawling services
• Supports ingesting MySQL/Kafka data into the lake and crawling its metadata
• Supports both offline and real-time ingestion to meet different business latency requirements
• Supports data lake formats such as Delta and Hudi (see the sketch at the end of this list)
4. For diversified data analysis scenarios
- DLF provides a unified metadata service
• Switch freely between the MaxCompute (MC), EMR, and DDI engines
• Data exploration behaves consistently across engines
5. For diversified data analysis users
- The data lake provides an access control service
• Centralized authorization of data access across multiple engines, avoiding repeated authorization
• Resolves compliance issues around multi-user data access
- The data lake provides an access log audit service
• Resolves compliance review of user data access
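As a concrete illustration of the open lake formats mentioned in point 3, here is a minimal PySpark sketch that writes a Delta table on OSS and registers it in the shared catalog. The bucket, paths, database, and table names are hypothetical, and the Delta Lake jars are assumed to be on the classpath (they typically are on EMR/DDI clusters).

```python
from pyspark.sql import SparkSession

# Minimal sketch: landing data in an open lake format (Delta Lake).
# Paths, database, and table names are placeholders; the Delta Lake
# dependencies must be available on the cluster.
spark = (
    SparkSession.builder.appName("delta-write-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.json("oss://my-lake-bucket/raw/events/")

# Write as a Delta table; Hudi would use .format("hudi") with its own options.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("oss://my-lake-bucket/lake/events_delta/"))

# Register the location in the catalog so other engines can find the table.
spark.sql("CREATE DATABASE IF NOT EXISTS lake_db")
spark.sql("""
  CREATE TABLE IF NOT EXISTS lake_db.events_delta
  USING DELTA
  LOCATION 'oss://my-lake-bucket/lake/events_delta/'
""")
```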
II. How to use the DLF data lake
Data ingestion (into the lake)
1. Data ingestion service for large volumes of heterogeneous external data sources
- Full import: batch import into the lake
- Incremental import: real-time incremental import into the lake
2. Metadata crawling service for large volumes of existing Hadoop-ecosystem data
- Imports the data into the data lake, stored on OSS
- Metadata crawling extracts the schema from the raw data (a sketch of both steps follows this list)
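The sketch below shows roughly what these steps amount to from a Spark point of view: a full batch import from MySQL into OSS, a streaming incremental import from Kafka, and a schema read similar to what a metadata crawler extracts. All connection details, topics, paths, and table names are hypothetical, and DLF performs these steps as a managed service rather than as user-written Spark jobs.

```python
from pyspark.sql import SparkSession

# Hypothetical ingestion sketch; DLF runs the equivalent as a managed service.
# Connection details, topics, paths, and table names are placeholders, and the
# MySQL JDBC driver / Kafka connector are assumed to be on the classpath.
spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# 1. Full import: read the source table over JDBC in one batch
src = (spark.read.format("jdbc")
       .option("url", "jdbc:mysql://mysql-host:3306/shop")
       .option("dbtable", "orders")
       .option("user", "reader")
       .option("password", "******")
       .load())
src.write.mode("overwrite").parquet("oss://my-lake-bucket/lake/shop/orders/")

# 2. Incremental import: stream change records from Kafka into the lake
inc = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka-host:9092")
       .option("subscribe", "shop.orders.changes")
       .load())
stream = (inc.selectExpr("CAST(value AS STRING) AS payload")
          .writeStream
          .format("parquet")
          .option("path", "oss://my-lake-bucket/lake/shop/orders_inc/")
          .option("checkpointLocation", "oss://my-lake-bucket/checkpoints/orders_inc/")
          .start())

# 3. "Metadata crawl": re-read the landed files and extract the schema that
#    would be registered in the unified metadata catalog
spark.read.parquet("oss://my-lake-bucket/lake/shop/orders/").printSchema()
```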
Data query
The data lake's unified metadata service supports queries from multiple engines:
- Use Spark to explore the newly ingested data (see the sketch after this list)
- Use MaxCompute for deep, complex data processing
- Explore the data with a dedicated Databricks DDI cluster
- More engine support is on the way…
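For the Spark path, the key point is that the job resolves table definitions from the shared catalog rather than a cluster-local metastore (on EMR, DLF can be selected as the metadata backend when the cluster is created). The following exploratory query is a minimal sketch with hypothetical database and table names.

```python
from pyspark.sql import SparkSession

# Exploratory query against a lake table whose definition lives in the shared
# catalog (e.g. DLF on an EMR cluster configured to use it). Names are placeholders.
spark = (SparkSession.builder
         .appName("lake-explore-demo")
         .enableHiveSupport()   # use the external catalog instead of a local one
         .getOrCreate())

spark.sql("SHOW TABLES IN lake_db").show()

daily = spark.sql("""
  SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
  FROM lake_db.orders
  GROUP BY order_date
  ORDER BY order_date DESC
  LIMIT 30
""")
daily.show()
```

Because the other engines read the same catalog, the same database and table names resolve consistently there; only the way the job is submitted changes.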
Data governance
1. Use the access control service to manage data access
- Access permissions can be set at the database/table/column level
- Built on unified metadata, so permissions only need to be set once
2. Use the data governance service to clarify the total cost of big data (a rough storage-side sketch follows this list)
- Storage usage at the daily/weekly/monthly level, so that large obsolete files can be released promptly
- Compute usage at the daily/weekly/monthly level, so that abnormal computations on the data can be identified promptly
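The storage side of this can be approximated with a short script that lists the objects under each table's OSS prefix and totals their size, which helps spot large, stale datasets. This is only an illustration using the public oss2 Python SDK, not DLF's built-in data governance service; the bucket, endpoint, and prefix layout are assumptions.

```python
import oss2  # Alibaba Cloud OSS Python SDK
from collections import defaultdict

# Rough per-table storage usage report; an illustration of the idea, not the
# DLF governance service itself. Bucket, endpoint, and prefixes are placeholders.
auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-lake-bucket")

usage = defaultdict(lambda: {"bytes": 0, "objects": 0})
for obj in oss2.ObjectIterator(bucket, prefix="lake/"):
    # Treat the first three path segments (e.g. lake/shop/orders) as the table
    table = "/".join(obj.key.split("/")[:3])
    usage[table]["bytes"] += obj.size
    usage[table]["objects"] += 1

for table, stats in sorted(usage.items(), key=lambda kv: -kv[1]["bytes"]):
    print(f"{table}: {stats['bytes'] / 1024 ** 3:.2f} GiB in {stats['objects']} files")
```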
III. Hands-on demonstration
DLF data lake trial link: dlf.console.aliyun.com/
This article is original content from Alibaba Cloud (Aliyun) and may not be reproduced without permission.