Description: As an enterprise-level data warehouse service, MaxCompute centrally stores and manages enterprise data assets, processes and analyzes data for data-oriented applications, and turns data into business insights. Combined flexibly with other Alibaba Cloud services and external services, it supports building a rich set of data applications. A fully managed data and analytics solution simplifies platform operations and maintenance, reduces management overhead, improves business-facing service capabilities, and accelerates time to value.

Author: Kong Liang, Product Expert at Alibaba Cloud Intelligence

I. Introduction to MaxCompute

Alibaba Cloud data and analytics product solutions


As can be seen from the following figure, MaxCompute sits at the core. MaxCompute is a data warehouse: an ultra-large-scale cluster product built on a Serverless architecture, with enterprise-level capabilities such as security management. Positioned as an offline data processing platform, it currently provides the BI analysis capabilities of an enterprise data warehouse.

A typical enterprise offline data link starts from data sources, including relational databases, unstructured storage, big data storage, message queues, and so on, and loads data into the data warehouse through offline and batch data integration. The various front-end analysis applications may also require some real-time analysis, with query acceleration provided through MaxCompute-Hologres; this is the offline scenario. The real-time link runs from message-queue data sources through the DataHub data bus to real-time computation with Flink, then to the real-time data warehouse Hologres, and finally to the front end. In the middle, data flows among Hologres, MaxCompute, and Flink.

On top of the data warehouse, MaxCompute extends federated query capabilities to data lakes, MySQL, HBase, and more, along with data integration capabilities, including metadata synchronization and query-time synchronization of data lake processing. It also supports machine learning with PAI based on warehouse data. Together these form a complete big data foundation. Above it, DataWorks provides a one-stop development and governance platform, with task scheduling, metadata management, data quality, lineage management, data development, and other capabilities.

Introduction to MaxCompute, the big data computing service

MaxCompute (Big Data Computing Service) is a versatile, high-performance, easy-to-use data warehouse service.

MaxCompute has built-in enterprise-level security and management functions, supports an open data ecosystem, and is widely used in data operations, profiling and recommendation, intelligent prediction, and other application scenarios, meeting the needs of multiple use cases (data warehouse/BI, data lake analysis, machine learning).

At the bottom, MaxCompute has a unified IaaS layer for storage and compute: storage is Apsara Pangu and scheduling is Fuxi. Storage and compute resources are wrapped in a layer of encapsulation and offered to upper-layer applications as resource pools, with no awareness required; an application only needs to use a resource group and does not need to know which specific resources its tasks run on. MaxCompute provides the storage capabilities of a structured data warehouse as well as data lake capabilities, including the processing of open formats and semi-structured and unstructured data. For users, all usage happens within projects, and projects are isolated by tenant. A project's resources can be dedicated or shared, and projects can exchange data directly through secure sharing. The upper layer provides unified access authentication, management, security, monitoring, metadata, and other capabilities.

The following points can be summarized:

  • Easy to use and open: an end-to-end SQL development mode, support for Spark, distributed Python (Mars), and other open-source technology stacks, plus built-in comprehensive enterprise management functions and open interfaces
  • Unified metadata, unified data storage, and unified enterprise data assets; a cloud-native multi-tenant system enables highly efficient cross-business and cross-organization data connection and sharing
  • Adaptive, on-demand elastic resources precisely match changes in business scale, avoiding resource waste or shortage, isolating workloads, and eliminating resource contention between services
  • Storage and compute scale independently, supporting storage expansion from TB to EB; connects to a wide range of external data sources for federated computation
  • Deeply optimized: automated data warehousing, optimization experience accumulated over years of Double 11, and intelligent tuning plus expert service support

Introduction to MaxCompute features

MaxCompute functionality can be divided into the following categories. The core storage and compute capabilities are not exposed directly: storage mainly serves database tables, and compute resources are consumed through SQL tasks or other computation models. The SQL capabilities cover end-to-end data processing and data model management for the entire data warehouse, including basic data types, partitioning for internal tables, and external tables for handling unstructured data. Streaming writes, streaming UPSERT, data deletion, and other operations are supported. Queries can be very complex, and execution plans can be inspected with EXPLAIN. For UDFs, MaxCompute supports Java UDFs and Python UDFs, as well as content-security UDFs.
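For example, the execution plan of a query can be inspected with EXPLAIN before running it. This is a minimal sketch; the table and column names are illustrative:

```sql
-- Illustrative table: a partitioned internal sales table.
CREATE TABLE IF NOT EXISTS sale_detail
(
    shop_name   STRING,
    total_price BIGINT
)
PARTITIONED BY (sale_date STRING);

-- Show the execution plan (stages and operators such as the
-- aggregation below) without actually running the query.
EXPLAIN
SELECT shop_name, SUM(total_price) AS sum_price
FROM   sale_detail
GROUP  BY shop_name;
```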

Management capabilities, the darker part of the chart below, have separate enhancements in the private cloud to provide additional capabilities to enterprises. On the public cloud, the service is offered directly to users in Serverless mode, charging only for resources. Management also covers metering and billing, including prepaid (subscription) and pay-as-you-go modes. Tasks have task management, queries have query acceleration, and enterprise capabilities include private cloud deployment and large-scale cross-region computing. Beyond the SQL engine, MaxCompute also offers vector retrieval, TensorFlow, Spark, Mars, and Hologres, all of which can share MaxCompute storage and compute resources, with different engines serving their corresponding scenarios. The gray compliance-governance part is provided by DataWorks, which, alongside the warehouse's own security management capabilities, offers metadata management, auditing, data encryption, data masking, data quality, and other functions. Finally, SDKs/APIs and MaxCompute's companion tools support data development, data upload and download, and third-party applications, completing the ecosystem of the entire data warehouse.

Main MaxCompute solutions

Enterprise data warehouse / data middle platform

Integrate raw data into widely used knowledge for subsequent consumption, including:

• Integrated storage: collect, store, and centrally manage internal and external data

• Processing and analysis: clean, process, and integrate multi-party data; perform business-oriented statistics and mining; unified storage and processing provide elastically scalable storage and compute power, reducing costs

• Standardization: build an enterprise data warehouse model (by layer/subject area), establish data standards, and form reusable data assets; through data governance, manage data life cycle security and costs, continuously guaranteeing data quality and standardization

• Data interworking: transfer and share standardized data within the enterprise, break down data silos, and let connected data deliver greater value

The data middle platform is not only a technology platform; it also entails organizational and management process requirements. It emphasizes realizing "data as a business" through shared data products, and can be regarded as a best practice of data warehousing. MaxCompute + DataWorks provides an out-of-the-box data warehouse solution.

BI analysis/data analysis

BI analysis does not necessarily require a data warehouse; it can, for example, run directly against a transactional database.

A data warehouse helps provide the comprehensive, enterprise-wide data needed for BI analytics.

BI analysts can better discover and understand data through the data warehouse's data asset management.

With strong performance, a data warehouse can also meet the requirements of multi-user concurrency and analysis across different data scales.

MaxCompute provides centralized management, processing, and analysis of data, and can connect directly to BI tools or synchronize data to an external analytical database (OLAP) for BI analysis.

Predictive analytics/intelligent applications

Data warehouses are increasingly integrated with AI.

The data warehouse performs data processing and preparation for machine learning.

Machine learning trains models on the data and makes predictions, and the results can be written directly back to the data warehouse for knowledge sharing, such as predicting customer gender and preferences in user profiling analysis.

MaxCompute seamlessly integrates PAI and Spark ML, enabling intelligent applications based on machine learning, such as CTR estimation and personalized recommendation, to be built on enterprise data within one platform, without data movement.

II. MaxCompute semi-structured data processing

What is semi-structured data

The topic of this article is MaxCompute semi-structured processing capability. Let’s take a look at what semi-structured data is.

Structured data, i.e., row data, is stored in databases and can be logically expressed with a two-dimensional table structure.

Unstructured data includes office documents in all formats, text, images, XML, HTML, various reports, and audio/video information, etc.

Semi-structured data sits between fully structured data (such as data in relational or object-oriented databases) and fully unstructured data (such as sound and image files); HTML documents are one example. It is generally self-describing and constantly changing, with structure and content mixed together, and is typically represented by a triple of tag, type, and object value.

Comparing data models:

Structured data: relational (two-dimensional tables)
Semi-structured data: a rooted directed graph consisting of a set of nodes and a set of arcs (trees, graphs)
Unstructured data: none

Finally, by the definition given by Wikipedia, semi-structured data is characterized by complex, changeable type structures, and computation requires extracting data from its self-describing structure.

Semi-structured data[1] is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes’ order is not important.

Semi-structured data are increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information.

Semi-structured data is widely used because:

• Simple syntax and small size.

• Easy to use and efficient.

• Cross-language: used for front-end/back-end interfaces of Web projects, configuration files, file storage, and more. The popularity of mobile applications further drives the use of semi-structured data such as JSON (see the following JSON example from the web). Such data can serve as a data source and be stored in a data warehouse for analysis.

So semi-structured data processing capability is a typical application scenario for a data warehouse.

Semi-structured data processing

In general, following the data warehouse pipeline of data source -> data processing -> data storage, the processing stage is the main link for handling semi-structured data; by the upper layers, the data has usually been processed and is application-facing, so its semi-structured characteristics are no longer as visible.

There are two ways to process semi-structured data here. One is to synchronize the data into a single field, using a complex type or extracting values on demand with JSON functions. The advantage is that changes in the semi-structured data's structure need not be considered; the disadvantages are poorer performance, having to choose suitable processing functions and methods each time, and more complex development. For example, MaxCompute normally supports values up to 8 MB, which can be raised to 256 MB to handle such data. The other approach is to split the JSON structure into a wide table during import or batch processing, and then modify or rebuild the table structure as the JSON structure changes. The advantage is that both storage and computation can be optimized; the disadvantage is that the table must be modified frequently, which is inconvenient.
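The two approaches can be sketched roughly as follows (the table names, column names, and JSON paths are purely illustrative; GET_JSON_OBJECT stands in for whichever JSON extraction function suits the data):

```sql
-- Approach 1: land the raw JSON in a single STRING column and
-- extract values on demand. Schema changes in the source need no
-- table changes, but every query pays the extraction cost.
CREATE TABLE IF NOT EXISTS ods_event_raw (event_json STRING);

SELECT GET_JSON_OBJECT(event_json, '$.user.id')   AS user_id,
       GET_JSON_OBJECT(event_json, '$.phones[0]') AS first_phone
FROM   ods_event_raw;

-- Approach 2: flatten the JSON into a wide table during import or
-- batch processing. Storage and computation are optimized, but the
-- table must be altered whenever the JSON structure changes.
CREATE TABLE IF NOT EXISTS dwd_event_wide
(
    user_id     STRING,
    first_phone STRING
);

INSERT OVERWRITE TABLE dwd_event_wide
SELECT GET_JSON_OBJECT(event_json, '$.user.id'),
       GET_JSON_OBJECT(event_json, '$.phones[0]')
FROM   ods_event_raw;
```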

MaxCompute Semi-structured data processing

MaxCompute provides the following four capabilities for processing semi-structured data

Provides complex types to support the storage of semi-structured data

First, complex data types are provided to store corresponding semi-structured data

Schema Evolution changes the definition of semi-structured data structures

Schema Evolution then provides the ability to modify tables and nested columns, including:

  • Delete the column
  • Add columns
  • Modify column order
  • Modify the column name
  • Modify column data type (compatible type)
  • Modifying complex type nested structures (same as modifying table structures)

Semi-structured data processing functions process the values of each node of semi-structured data

  • MaxCompute SQL adds a large number of built-in functions to improve the processing power and efficiency for complex data types (ARRAY, MAP, STRUCT). Built-in functions can take complex data types as input or produce complex data types as output.
  • Higher-order functions are also provided to further enhance complex-type processing. Unlike ordinary functions, a higher-order function can take a function itself as an input parameter, so it can process complex-type input data and use lambda expressions to simplify the expression of processing logic.
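For instance, TRANSFORM, one of the higher-order functions covered later in this article, takes a lambda expression as its second argument. A minimal sketch:

```sql
-- TRANSFORM applies the lambda x -> x * 2 to every array element.
-- Returns [2, 4, 6].
SELECT transform(array(1, 2, 3), x -> x * 2);

-- The lambda can take two parameters when the function requires it,
-- e.g. TRANSFORM_VALUES receives each (key, value) pair of a MAP.
SELECT transform_values(map('a', 1, 'b', 2), (k, v) -> v + 1);
```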

Semi-structured data node values can be used directly in computation

CREATE TABLE evol_t2 (id int, name struct<given: string, family: string>,phones array<struct<type: string, num: string>>) ;
insert into table evol_t2 select 1, STRUCT('Bill', 'Gates'), array(STRUCT('work', '1234567890'),STRUCT('cell', '9876543210'));
insert into table evol_t2 select 2, STRUCT('Michael', 'Jordan'), array(STRUCT('work', '1111111111'),STRUCT('cell', '9999999999'));

Insert result as follows:

select name.given as firstname,c.phones[1].num as phonenum
from evol_t2 c
where c.phones[1].type = 'cell';

The query results are as follows:

Modifying the MaxCompute table structure

Flexible modification of the table structure not only accommodates schema changes in semi-structured data sources, but also facilitates adjustments to the data warehouse model: fields can conveniently be added to or removed from existing tables, related fields grouped together, or types modified.

Syntax definitions and examples are as follows:

Delete the column

ALTER TABLE <table_name> DROP COLUMN <column_name>;
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t DROP COLUMN value2;

Add columns

ALTER TABLE <table_name> ADD COLUMNS (col_name1 type1[, col_name2 type2...]);
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t ADD COLUMNS (value3 STRING);

Note: Newly added columns cannot be placed at a specified position; they are appended as the last column by default.

Modify column order

ALTER TABLE <table_name>
CHANGE COLUMN <original_column_name> <new_column_name> <column_type> AFTER <column_name>;
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t CHANGE COLUMN value2 value3 bigint AFTER id;

Note: The BEFORE keyword is not currently supported; the same effect can be achieved with AFTER. It may be added in a later release if needed.

Modify the column name

ALTER TABLE <table_name> 
CHANGE COLUMN <original_column_name> RENAME TO <new_column_name>;

Modifying MaxCompute complex type data structures

The schema of nested columns at every level of a complex type can also be flexibly modified. Nested columns enjoy the same columnar storage optimizations and the same convenience of direct querying as the table structure itself.

CREATE TABLE evol_t (id int, point struct<x: double, y: double>) ;
ALTER TABLE evol_t ADD COLUMNS (points_map map<string, struct<x: double, y: double>>);
ALTER TABLE evol_t ADD COLUMNS (points_arr array<struct<x: double, y: double>>); 

Because all nested columns are treated and identified as a nested table, nested columns also gain the following capabilities:

  • The structure can be modified (add, delete, rename, reorder, change types)
  • Finer-grained columnar storage and compression
  • Storage and computation optimizations by data type
  • Direct computation with node values
  • Richer semi-structured data processing functions

MaxCompute complex type data processing functions

Rich complex-type data processing functions make it easy to process semi-structured data directly, and more easy-to-use functions are continually being introduced

MaxCompute higher-order functions support lambda expressions

Higher-order functions in complex type data processing functions include:

ANY_MATCH, ALL_MATCH, ARRAY_REDUCE, ARRAY_SORT, FILTER, TRANSFORM, ZIP_WITH, MAP_FILTER, MAP_ZIP_WITH, TRANSFORM_KEYS, and TRANSFORM_VALUES.

Lambda expression syntax is supported, simplifying the expression of data processing logic for complex data types.

Descriptions and examples of some functions are as follows:

ANY_MATCH: checks whether ARRAY(1, 2, -10, 100, -30) contains an element that meets the condition x -> x > 3:

-- Returns true.
select any_match(array(1, 2, -10, 100, -30), x -> x > 3);

FILTER: returns a new ARRAY consisting of the elements of ARRAY a that pass func:

-- Returns [2, 3].
select filter(array(1, 2, 3), x -> x > 1);

ZIP_WITH: merges two ARRAYs element by element using func:

-- Returns [2, 4, 6, NULL].
select zip_with(array(1, 2, 3), array(1, 2, 3, 4), (x, y) -> x + y);

MAP_FILTER: filters the entries of a MAP object, keeping only those that meet the predicate:

-- Returns {-30:100, 20:50}.
select map_filter(map(10, -20, 20, 50, -30, 100, 21, null), (k, v) -> (k + v) > 10);

TRANSFORM_VALUES: transforms the input MAP object, keeping keys unchanged and computing new values with func:

-- Returns {-30:71, 10:-10, 20:NULL}.
select transform_values(map(10, -20, 20, null, -30, 101), (k, v) -> k + v);

III. Hands-on demonstration

MaxCompute semi-structured data processing and Schema evolution

Please click on the video to see the demo

Functional specifications

Examples of common commands for modifying the schema of complex type nodes:

ALTER TABLE evol_t ADD COLUMNS (point.z double);
ALTER TABLE evol_t ADD COLUMNS (points_map.value.z double);
ALTER TABLE evol_t ADD COLUMNS (points_arr.element.z double);
ALTER TABLE evol_t2 ADD COLUMNS (phones.element.type2 string);
ALTER TABLE evol_t2 DROP COLUMNS (phones.element.type);
ALTER TABLE evol_t2 CHANGE COLUMN phones.type2 phones.type0 string;
ALTER TABLE evol_t2 CHANGE phones.num phones.num string AFTER type0;

The complete demonstration script used in the demo:

DROP TABLE evol_t2;
CREATE TABLE evol_t2 (id int, name struct<given: string, family: string>, phones array<struct<type: string, num: string>>);
insert into table evol_t2 select 1, STRUCT('Bill', 'Gates'), array(STRUCT('work', '1234567890'), STRUCT('cell', '9876543210'));
insert into table evol_t2 select 2, STRUCT('Michael', 'Jordan'), array(STRUCT('work', '1111111111'), STRUCT('cell', '9999999999'));
select * from evol_t2;
ALTER TABLE evol_t2 ADD COLUMNS (position map<string, struct<x: double, y: double>>);
insert into table evol_t2 select 3, STRUCT('Michael', 'Jackson'), array(STRUCT('work', '1231231231'), STRUCT('cell', '1231231233')), map('p1', struct(1.1, 1.2), 'p2', struct(1.5, 1.3));
select * from evol_t2;
ALTER TABLE evol_t2 ADD COLUMNS (position.value.z double);
insert into table evol_t2 select 4, STRUCT('Ming', 'Yao'), array(STRUCT('work', '5555555555'), STRUCT('cell', '6666666666')), map('p1', struct(5.5, 1.0, 12.0), 'p2', struct(6.5, 3.0, 8.1));
select * from evol_t2;
ALTER TABLE evol_t2 DROP COLUMNS (phones.element.type);
select * from evol_t2;
ALTER TABLE evol_t2 ADD COLUMNS (phones.element.type2 string);
select * from evol_t2;
ALTER TABLE evol_t2 CHANGE COLUMN phones.type2 phones.type0 string;
select * from evol_t2;
insert into table evol_t2 select 5, STRUCT('Lei', 'Li'), array(STRUCT('9999999999', 'work'), STRUCT('8888888888', 'cell')), map('p1', struct(9.5, 6.0, 10.0), 'p2', struct(5.5, 2.0, 3.0));
select * from evol_t2;
ALTER TABLE evol_t2 CHANGE phones.num phones.num string AFTER type0;
select * from evol_t2;
select name.given as firstname, c.phones[1].num as phonenum
from evol_t2 c
where c.phones[1].type0 = 'cell';
select c.name.family || c.name.given, c.phones[1].num,
       SQRT(POW(position['p2'].x - position['p1'].x, 2) + POW(position['p2'].y - position['p1'].y, 2) + POW(position['p2'].z - position['p1'].z, 2))
from evol_t2 c
where name.given in ('Ming', 'Lei');

Currently supported data type conversion relationships

The type-modification feature will be rolled out gradually (gray release) in the near future. The supported data type conversion relationships are as follows:

IV. Evolution direction

Continued enhancement of functionality and evolution direction

  • Add more complex-type data processing functions
  • Support more compatible data type conversions
  • Automatically recognize the schema of complex data types and optimize storage
  • More flexible node value extraction and computation
  • Higher-performance columnar analysis capabilities
  • Version management of schema evolution and data changes via Time Travel

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.