0. Introduction to Data Lake Analytics (DLA)

Data Lake is a popular concept in today's big data industry: en.wikipedia.org/wiki/Data_l… . Analytics on a data lake can run big data association analysis across heterogeneous data sources without prior steps such as ETL or data relocation, which greatly reduces cost and improves the user experience.

(Figure: the concept of a data lake)

Alibaba Cloud now has its own data lake analytics product: www.aliyun.com/product/dat… . You can click "Apply for OTS data analysis" (the open beta is still invitation-only) to follow along with this tutorial. Product documentation: help.aliyun.com/product/701…

1. Introduction to ETL

ETL (en.wikipedia.org/wiki/Extrac…), short for Extract, Transform and Load, is an important tool for both traditional data warehousing and big data.

  • Extract: extract the required data from homogeneous or heterogeneous source systems, such as Excel spreadsheets, XML files, and relational databases.
  • Transform: convert the source-system data into the format required by the target system for analysis, including data cleaning and processing.
  • Load: load the transformed data into the target database as the basis for online analysis, data mining, and data presentation.

The whole ETL process is like building a pipeline between the source system and the target system, in which data flows continuously.

2. DLA and ETL

Data placement optimization is a mainstream architectural direction for business systems on cloud platforms. Architects weigh different storage engines against read/write performance, stability, consistency, cost, ease of use, and development efficiency to strike the right balance for the whole service system.

However, data migration across heterogeneous data sources is not an easy task. Many ETL tools operate only at the framework level and require a number of self-developed auxiliary tools; their expressiveness is weak and cannot cover many scenarios; and their abstraction over, and compatibility with, heterogeneous data sources is imperfect.

DLA, on the other hand, fits the ETL scenario well. Below is a simplified architecture diagram of DLA, which was designed from the outset around the principles of "MPP compute engine + separation of storage and compute + elastic high availability + heterogeneous data sources". Reading from and writing to various heterogeneous data sources is one of DLA's core goals.

Extract is realized by connecting to heterogeneous data sources and running SELECT + JOIN + subquery logic; Transform (data transformation and mapping) is realized with Filter + Project + Aggregation + Sort + functions; and Load is realized via INSERT. For example:

insert into target_table (col1, col2, col3, ...)   -- the target columns must be type-compatible with the selected columns
select c1, c2, c3, ...
from ...                                           -- can be any data source you can query
where ...

insert into target_table (id, name, age)
select s1.pk1, s2.name, s1.age
from source_table1 s1
join source_table2 s2 on s1.sid = s2.sid
where s1.xxx = 'yyy'
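To make the mapping concrete, here is a sketch that exercises all three stages in one statement; the table and column names are hypothetical, used only for illustration:

-- Extract: read from two source tables via SELECT + JOIN.
-- Transform: filter (WHERE), project, aggregate (COUNT/SUM), and sort.
-- Load: write the result into the target table with INSERT.
insert into order_summary (cust_id, order_cnt, total_spent)
select o.custkey,
       count(*),            -- number of orders per customer
       sum(o.totalprice)    -- total amount per customer
from orders o
join customer c on o.custkey = c.custkey
where c.mktsegment = 'BUILDING'
group by o.custkey
order by o.custkey

The hands-on test below uses this same statement shape against real OSS and TableStore tables.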

Let’s try importing data from different data sources.

3. Hands-on test (taking TableStore as an example)

  • Prepare a DLA account (an existing test account):
    • Test cluster: Shanghai region;
    • Account: DLA test account;
  • Prepare two source tables (the TPC-H OSS tables customer and nation) for the join and data query;
  • Prepare a TableStore (help.aliyun.com/document_de…) target table;
  • Execute the import SQL, write the data, and verify the result.

A) Two source table definitions:

mysql> show create database tpch_50x_text;

CREATE DATABASE `tpch_50x_text`
WITH DBPROPERTIES (
    catalog = 'hive',
    location = 'oss://${your bucket}/datasets/tpch/50x/text_date/'
)
COMMENT ''

1 row in set (0.02 sec)


mysql> show tables;
+------------+
| Table_Name |
+------------+
| customer   |
| nation     |
+------------+
2 rows in set (0.03 sec)

mysql> show create table customer;

CREATE EXTERNAL TABLE `tpch_50x_text`.`customer` (
    `c_custkey` int,
    `c_name` string,
    `c_address` string,
    `c_nationkey` int,
    `c_phone` string,
    `c_acctbal` double,
    `c_mktsegment` string,
    `c_comment` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION 'oss://${your bucket}/datasets/tpch/50x/text_date/customer_text'

1 row in set (0.90 sec)

mysql> show create table nation;

CREATE EXTERNAL TABLE `tpch_50x_text`.`nation` (
    `n_nationkey` int,
    `n_name` string,
    `n_regionkey` int,
    `n_comment` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION 'oss://${your bucket}/datasets/tpch/50x/text_date/nation_text'

1 row in set (0.73 sec)

B) Prepare the TableStore database and table

## Create the database
mysql> show create database etl_ots_test;

CREATE DATABASE `etl_ots_test`
WITH DBPROPERTIES (
    catalog = 'ots',
    location = 'https://${your instance}.cn-shanghai.ots-internal.aliyuncs.com',
    instance = '${your instance}'
)
COMMENT ''

1 row in set (0.02 sec)

## Use the database
mysql> use etl_ots_test;
Database changed

## Create the table
mysql> show create table test_insert;

CREATE EXTERNAL TABLE `test_insert` (
    `id1_int` int NOT NULL COMMENT 'customer id, primary key',
    `c_address` varchar(20) NULL COMMENT 'customer address',
    `c_acctbal` double NULL COMMENT 'customer account balance',
    PRIMARY KEY (`id1_int`)
)
COMMENT ''

1 row in set (0.03 sec)

Here’s a screenshot of the actual data:

C) Start importing the data, making sure the order and types of the imported columns are compatible:

## Check the target table; it is empty
mysql> select * from etl_ots_test.test_insert;
Empty set (0.31 sec)
mysql> use tpch_50x_text;
Database changed

## Query the nation data; CANADA's nationkey is 3, which we will use below
mysql> select n_nationkey, n_name from nation;
+-------------+----------------+
| n_nationkey | n_name         |
+-------------+----------------+
|           0 | ALGERIA        |
|           1 | ARGENTINA      |
|           2 | BRAZIL         |
|           3 | CANADA         |
|           4 | EGYPT          |
|           5 | ETHIOPIA       |
|           6 | FRANCE         |
|           7 | GERMANY        |
|           8 | INDIA          |
|           9 | INDONESIA      |
|          10 | IRAN           |
|          11 | IRAQ           |
|          12 | JAPAN          |
|          13 | JORDAN         |
|          14 | KENYA          |
|          15 | MOROCCO        |
|          16 | MOZAMBIQUE     |
|          17 | PERU           |
|          18 | CHINA          |
|          19 | ROMANIA        |
|          20 | SAUDI ARABIA   |
|          21 | VIETNAM        |
|          22 | RUSSIA         |
|          23 | UNITED KINGDOM |
|          24 | UNITED STATES  |
+-------------+----------------+
25 rows in set (0.37 sec)

## Count the rows with nationkey = 3 and c_mktsegment = 'BUILDING'
mysql> select count(*) from customer where c_nationkey = 3 and c_mktsegment = 'BUILDING';
+----------+
| count(*) |
+----------+
|    60350 |
+----------+
1 row in set (0.66 sec)

## Look at the first 3 rows with nationkey = 3 and c_mktsegment = 'BUILDING'
mysql> select * from customer where c_nationkey = 3 and c_mktsegment = 'BUILDING' order by c_custkey limit 3;
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+-----------------------------------------------------------------------------------------------------+
| c_custkey | c_name             | c_address               | c_nationkey | c_phone         | c_acctbal | c_mktsegment | c_comment                                                                                             |
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+-----------------------------------------------------------------------------------------------------+
|        13 | Customer#000000013 | nsXQu0oVjD7PM659uC3SRSp |           3 | 13-761-547-5974 |   3857.34 | BUILDING     | ounts sleep carefully after the close frays. carefully bold notornis use ironic requests. blithely   |
|        27 | Customer#000000027 | IS8GIyxpBrLpMT0u7       |           3 | 13-137-193-2709 |   5679.84 | BUILDING     | about the carefully ironic pinto beans. accoun                                                        |
|        40 | Customer#000000040 | gOnGWAyhSV1ofv          |           3 | 13-652-915-8939 |    1335.3 | BUILDING     | rges impress after the slyly ironic to be. foxes are. blithely                                        |
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+-----------------------------------------------------------------------------------------------------+
3 rows in set (0.78 sec)

Before importing, let's restate the requirement: find the customers whose country is 'CANADA' and whose market segment is 'BUILDING', sort them by c_custkey, take the first 10 rows, select their c_custkey, c_address, and c_acctbal columns, and write them into the test_insert table in OTS for later use.

## Preview the data to be imported and see how many rows there are
mysql> select c.c_custkey, c.c_address, c.c_acctbal 
    -> from tpch_50x_text.customer c
    -> join tpch_50x_text.nation n 
    -> on c.c_nationkey = n.n_nationkey
    -> where n.n_name = 'CANADA' 
    -> and c.c_mktsegment = 'BUILDING' 
    -> order by c.c_custkey
    -> limit 10;
+-----------+--------------------------------+-----------+
| c_custkey | c_address                      | c_acctbal |
+-----------+--------------------------------+-----------+
|        13 | nsXQu0oVjD7PM659uC3SRSp        |   3857.34 |
|        27 | IS8GIyxpBrLpMT0u7              |   5679.84 |
|        40 | gOnGWAyhSV1ofv                 |    1335.3 |
|        64 | MbCeGY20kaKK3oalJD,OT          |    646.64 |
|       255 | I8Wz9sJBZTnEFG08lhcbfTZq3S     |   3196.07 |
|       430 | s2yfPEGGOqHfgkVSs5Rs6 qh,SuVmR |   7905.17 |
|       726 | 4w7doltn9hy,xzZMR              |   6253.81 |
|       905 | f iyVEgCU2lZZPCebx5bGp5        |    600.73 |
|      1312 | f5zgMB4MHLMSHaX0tDduHAmVd4     |    9459.5 |
|      1358 | t23gsl4TdVXqTZha DioEHIq5w7y   |   5149.23 |
+-----------+--------------------------------+-----------+
10 rows in set (1.09 sec)


## Start import
mysql> insert into etl_ots_test.test_insert (id1_int ,c_address, c_acctbal)
    -> select c.c_custkey, c.c_address, c.c_acctbal 
    -> from tpch_50x_text.customer c
    -> join tpch_50x_text.nation n 
    -> on c.c_nationkey = n.n_nationkey
    -> where n.n_name = 'CANADA' 
    -> and c.c_mktsegment = 'BUILDING' 
    -> order by c.c_custkey
    -> limit 10;
+------+
| rows |
+------+
|   10 |
+------+
1 row in set (2.14 sec)

## Verify the result; no problems:
mysql> select * from etl_ots_test.test_insert;
+---------+--------------------------------+-----------+
| id1_int | c_address                      | c_acctbal |
+---------+--------------------------------+-----------+
|      13 | nsXQu0oVjD7PM659uC3SRSp        |   3857.34 |
|      27 | IS8GIyxpBrLpMT0u7              |   5679.84 |
|      40 | gOnGWAyhSV1ofv                 |    1335.3 |
|      64 | MbCeGY20kaKK3oalJD,OT          |    646.64 |
|     255 | I8Wz9sJBZTnEFG08lhcbfTZq3S     |   3196.07 |
|     430 | s2yfPEGGOqHfgkVSs5Rs6 qh,SuVmR |   7905.17 |
|     726 | 4w7doltn9hy,xzZMR              |   6253.81 |
|     905 | f iyVEgCU2lZZPCebx5bGp5        |    600.73 |
|    1312 | f5zgMB4MHLMSHaX0tDduHAmVd4     |    9459.5 |
|    1358 | t23gsl4TdVXqTZha DioEHIq5w7y   |   5149.23 |
+---------+--------------------------------+-----------+
10 rows in set (0.27 sec)

D) Points to note:

Although DLA gives us a handy ETL tool for fast import and export, there are still some points to note:

  • If an import task will take a long time, run it in asynchronous mode; otherwise a client disconnection may interrupt the task.
  • TableStore inserts overwrite by primary key: existing rows with the same key are silently replaced, so be careful not to insert into a table that is in normal production use.
  • DLA and TableStore do not yet provide full transaction support; if an import is interrupted, the rows already imported are not rolled back and must be cleaned up manually.
  • The number and types of the columns must be aligned by you, or an error is reported; see the sketch after this list.
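For that last point, here is a minimal sketch of keeping column counts and types aligned; the database, table, and column names are hypothetical and the casts are only illustrative:

-- The target column list and the SELECT list must match in count and order,
-- and each pair must be type-compatible; cast explicitly when they are not.
insert into target_db.t (id_int, name_str, balance_dbl)
select cast(s.id_bigint as int),    -- align with the target's int column
       substr(s.name, 1, 20),       -- trim to the target's varchar(20)
       s.balance                    -- double already matches
from source_db.s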

4. Import other data sources

Wasn't the whole process simple? Want to import data for other scenarios? DLA treats every underlying data source the same way: once the databases and tables for another data source are created in DLA, you can read and write them normally and implement ETL. Give it a try!
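As a sketch of that uniformity (the database and table names below are hypothetical; the schema would be registered with the same CREATE DATABASE ... WITH DBPROPERTIES pattern shown earlier), a cross-source load is still just INSERT ... SELECT:

-- Hypothetical: move the rows we just landed in TableStore onward to another
-- data source that has been registered in DLA.
insert into another_source_db.customer_balance (pk, balance)
select t.id1_int, t.c_acctbal
from etl_ots_test.test_insert t
where t.c_acctbal > 5000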

Other relevant documents:

  • Using Data Lake Analytics to clean data from OSS into AnalyticDB: yq.aliyun.com/articles/62…
  • Related DLA technical articles: yq.aliyun.com/search?q=DL…
  • Data Lake Analytics usage scenarios: help.aliyun.com/document_de…
  • OLAP on TableStore: serverless SQL big data analysis based on Data Lake Analytics: yq.aliyun.com/articles/61…
  • Analyzing OSS data using Data Lake Analytics: help.aliyun.com/document_de…
  • How to connect to the Data Lake Analytics database: help.aliyun.com/document_de…
  • Analyzing RDS data with Data Lake Analytics: yq.aliyun.com/articles/62…
