MaxCompute, a cloud-native data warehouse, provides EB-scale near-real-time analytics through high-performance streaming data writes and second-level query capability (query acceleration), enabling fast analysis of constantly changing data to support decision-making. The Demo in this article is built around a near-real-time interactive BI analysis and decision-support scenario, implementing near-real-time indicator cards, near-real-time market monitoring, near-real-time trend analysis, and near-real-time sales breakdown.

The author of this article is Long Zhiqiang, Senior Product Expert at Alibaba Cloud Intelligence.

I. Product function introduction

Data warehouse architecture based on query acceleration

At present, mainstream real-time data warehouses are mostly built on Flink. Flink is a real-time compute engine that supports unified stream and batch processing, so most real-time scenarios are built on Flink + Kafka + a storage layer. The question for this architecture is how to write real-time data from Binlog, Flink, and Spark Streaming into MaxCompute.

Data is written to MaxCompute in real time through its streaming data channel, a built-in product feature. Most data warehouse products have a delay between writing and querying; MaxCompute achieves high-QPS real-time writes, and data is queryable as soon as it is written. Query acceleration (MCQA) can then be used to query the data that has just been written in real time. When a BI tool is connected, ad hoc queries get near-real-time access to the freshly written data.

Binlog data is written to MaxCompute through DataX, covering insert, delete, update, and query operations. In subsequent product iterations, MaxCompute will support upsert, so that additions, modifications, and deletions in the business database can be synchronized. Flink data is written to MaxCompute directly through the Streaming Tunnel plugin, with no code development required; a Kafka plugin is also supported.

Currently, the real-time write path does not perform any computation on the data being written; streaming data, including data from message services, is written directly into MaxCompute through the Streaming Tunnel service. Streaming Tunnel provides plugins for mainstream messaging services such as Kafka and Flink, as well as an SDK, which currently supports Java only. With the Streaming Tunnel SDK, an application can read data, apply its own logic, and then write the result into MaxCompute. Once data lands in MaxCompute, the main processing path is to query it directly, or to join it with offline data already in MaxCompute for combined analysis. During querying, MCQA can be enabled when access is through the SDK or JDBC; when the Web Console or DataWorks is used, query acceleration (MCQA) is enabled by default. Today the main consumers are BI tools and third-party application-layer analysis tools: when they connect to MaxCompute via the SDK or JDBC with query acceleration (MCQA) turned on, data written in real time can be queried at close to second-level latency.
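
As a concrete illustration, the minimal sketch below writes one record through the Streaming Tunnel with the Java SDK. The endpoint, project, table, and column names are placeholders, and the class and method names (TableTunnel.StreamUploadSession, StreamRecordPack) reflect my understanding of the SDK and may differ across versions, so verify them against the SDK documentation.

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.tunnel.TableTunnel;

public class StreamWriteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials and endpoint -- replace with your own.
        AliyunAccount account = new AliyunAccount("<accessId>", "<accessKey>");
        Odps odps = new Odps(account);
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        TableTunnel tunnel = new TableTunnel(odps);
        // Open a stream upload session on the target table (assumed API name).
        TableTunnel.StreamUploadSession session =
                tunnel.createStreamUploadSession("<project>", "<table>");

        // Pack records and flush; flushed data is immediately queryable.
        TableTunnel.StreamRecordPack pack = session.newRecordPack();
        Record record = session.newRecord();
        record.setString("event_name", "page_view");   // assumed column
        record.setBigint("event_count", 1L);           // assumed column
        pack.append(record);
        pack.flush();
    }
}
```

Data flushed in a record pack is immediately visible to queries, which is what enables the write-then-query pattern described above.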

Overall, the current scenario is real-time streaming writing of data, which can be combined with offline data for joint analysis through query acceleration (MCQA). After data enters MaxCompute, no further computation is performed; it is only served for queries. This is the current MaxCompute-based real-time data processing scenario.

Introduction to the streaming data writing feature

Streaming data writing is commercially available in China and is currently free to use.

Specific features

  • Supports highly concurrent, high-QPS (queries-per-second) streaming data writes.
  • Provides an API with streaming semantics: distributed data synchronization services are easy to develop on top of the streaming service API.
  • Automatic partition creation: solves the lock contention caused by data synchronization services creating partitions concurrently.
  • Asynchronous aggregation (merge) of incremental data: improves storage efficiency.
  • Supports asynchronous zorder by sorting of incremental data; for zorder by details, see the documentation on inserting or overwriting data (INSERT INTO | INSERT OVERWRITE). An illustrative sketch follows this list.
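
For illustration only, the sketch below submits a zorder by rewrite through the Java SDK's SQLTask. The table and column names are placeholders, the table is assumed to be non-partitioned, and the exact ZORDER BY syntax should be checked against the INSERT INTO | INSERT OVERWRITE documentation referenced above; the asynchronous sorting of incremental data itself is handled by the service.

```java
import com.aliyun.odps.Instance;
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.task.SQLTask;

public class ZorderSketch {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        // Placeholder statement: rewrite the (non-partitioned) table sorted by zorder on two columns.
        String sql = "INSERT OVERWRITE TABLE events "
                   + "SELECT * FROM events ZORDER BY shop_id, user_id;";
        Instance instance = SQLTask.run(odps, sql);
        instance.waitForSuccess();   // block until the job finishes
    }
}
```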

Performance advantages

  • An optimized data storage structure that solves the file fragmentation problem caused by high-QPS writes.

  • The data path is fully isolated from metadata access, preventing the lock-contention delays and errors that metadata operations cause under highly concurrent writes.

  • Provides an asynchronous processing mechanism for incremental data, so newly written incremental data can be further processed transparently. The following functions are supported:

  • Merge data aggregation: Improves storage efficiency.

  • Zorder by sorting: Improves storage and query efficiency.

Streaming data write – Technical architecture

Stateless, concurrent stream API; written data is visible in real time

The technical architecture is divided into three parts: data channels, stream-computing data synchronization, and self-developed applications.

The data channels currently supported are DataHub, Kafka, TT, and SLS.

Stream-computing data synchronization supports Blink, Spark, DTS, DataX, and Kepler/DD.

Inside MaxCompute, a Tunnel cluster sits in front of the compute cluster and provides the Stream Tunnel service, through which clients write data to the Tunnel server. The write path includes file optimization, and a file merge runs at the end. The merge consumes compute resources within the data channel, but this consumption is free of charge.

Introduction to the query acceleration feature

Real-time data writing and interactive analysis based on query acceleration

Currently, the query acceleration function supports 80% to 90% of daily query scenarios. The syntax for query acceleration is exactly the same as the MaxCompute built-in syntax.

MaxCompute query acceleration – full-link acceleration of MaxCompute query execution for real-time query workloads

  • Uses the MaxCompute SQL syntax and engine, optimized for near-real-time scenarios
  • The system optimizes queries automatically and lets users choose a latency-first or throughput-first execution mode
  • Latency-based resource scheduling policies are used in near-real-time scenarios
  • Full-link optimization for scenarios requiring low latency: independent execution resource pools; multi-level data and metadata caching; optimized interaction protocols

Benefits

  • An integrated solution combining a simplified architecture, query acceleration, and large-scale analysis
  • Several times to tens of times faster than the normal offline mode
  • Combined with MaxCompute streaming upload capability, near-real-time analysis is supported
  • Supports multiple access modes for easy integration
  • Automatically identifies short queries among offline jobs. This is enabled by default in the pay-as-you-go (postpaid) mode; for subscription (annual or monthly resource package) instances, query acceleration is currently free for query jobs whose SQL scans up to 10 GB.
  • Low cost, maintenance-free, highly elastic

Query acceleration – Technical architecture

Adaptive execution engine, multi-level caching mechanism

When SQL is submitted to the MaxCompute compute engine, it falls into one of two modes: offline jobs (optimized for throughput) and short queries (optimized for latency). Technically, a query acceleration task simplifies and optimizes the execution plan, runs on pre-pulled compute resources with vectorized execution, and relies on memory/network shuffle plus a multi-level caching mechanism. An offline job, by contrast, generates its execution plan, shuffles through disk, and then waits in the resource queue for resources. Query acceleration identifies incoming queries and, when the conditions are met, routes them directly to the pre-pulled resources. For data caching, there is a table- and column-level cache built on the Pangu distributed file system.

Query acceleration – Performance comparison

Performance comparison on the TPC-DS test set against an industry-leading competing product:

  • More than 30% faster at the 100 GB scale
  • Comparable performance at the 1 TB scale

II. Application scenarios

Streaming Data Writing – Application Scenarios

Query Acceleration – Application Scenarios

Fast queries for fixed reports

  • ETL processes data into aggregated results ready for consumption
  • Meets the second-level query requirements of fixed reports and online data services
  • Elastic concurrency / data caching / easy integration

Table data in MaxCompute can be read directly by data application tools or BI analysis tools through JDBC or the SDK, as in the sketch below.
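
A minimal JDBC sketch, assuming the MaxCompute JDBC driver (com.aliyun.odps.jdbc.OdpsDriver): the interactiveMode=true URL parameter requests query acceleration (MCQA), and the endpoint, project, credentials, and the table/columns in the report query are placeholders to verify against the driver documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReportQuerySketch {
    public static void main(String[] args) throws Exception {
        // MaxCompute JDBC driver; interactiveMode=true requests query acceleration (MCQA).
        Class.forName("com.aliyun.odps.jdbc.OdpsDriver");
        String url = "jdbc:odps:<odps_endpoint>?project=<project>&interactiveMode=true";

        try (Connection conn = DriverManager.getConnection(url, "<accessId>", "<accessKey>");
             Statement stmt = conn.createStatement();
             // Placeholder aggregation query for a fixed report.
             ResultSet rs = stmt.executeQuery(
                     "SELECT shop_id, SUM(amount) AS total FROM sales GROUP BY shop_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("shop_id") + " -> " + rs.getLong("total"));
            }
        }
    }
}
```

Queries that do not qualify for acceleration are expected to fall back to the normal offline path, so the same connection can serve both light report queries and heavier jobs.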

Ad hoc data exploration and analysis

  • Job characteristics are identified automatically, and the execution mode is chosen based on data size and computational complexity: simple queries return quickly, while complex queries still get the compute they need
  • Works together with storage-layer modeling optimizations such as partitioning and hash clustering to further improve query performance (see the sketch after this list)
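
As an illustration of the storage-layer modeling mentioned above, the sketch below creates a partitioned, hash-clustered table through the Java SDK's SQLTask. The table, columns, and bucket count are placeholders, and the clustering DDL should be verified against the MaxCompute DDL documentation.

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.task.SQLTask;

public class HashClusterSketch {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        // Placeholder DDL: partitioned + hash-clustered table to speed up point/range queries.
        String ddl = "CREATE TABLE IF NOT EXISTS sales ("
                   + "  shop_id STRING, user_id STRING, amount BIGINT) "
                   + "PARTITIONED BY (ds STRING) "
                   + "CLUSTERED BY (shop_id) SORTED BY (shop_id) INTO 128 BUCKETS;";
        SQLTask.run(odps, ddl).waitForSuccess();
    }
}
```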

Near real-time operational analysis

  • Supports batch and streaming data access
  • Combined analysis of historical data and near-real-time data
  • Product-level integrated messaging services:
  • DataHub: logs/messages
  • DTS: database logs
  • SLS: behavior logs
  • Kafka: IoT/log access

III. Tools and access

Streaming data write – access

Messages & Services

  • Message Queue for Kafka (plugin support)
  • Logstash output plugin (plugin support)
  • Built-in plugin in the Flink edition
  • DataHub real-time data channel (built-in plugin)

New SDK interface – Java

  • Simple upload example
  • Multithreaded upload example
  • Asynchronous I/O multithreaded upload example

Following the examples above, you can encapsulate your own business logic; a multithreaded sketch is shown below.
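
Under the same assumptions as the earlier Streaming Tunnel sketch (placeholder credentials, table, and columns; class and method names to be verified against your SDK version), the sketch below shows the multithreaded pattern: worker threads share one stream upload session, and each builds and flushes its own record pack.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.tunnel.TableTunnel;

public class MultiThreadStreamWriteSketch {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        TableTunnel tunnel = new TableTunnel(odps);
        // The stream upload session is assumed to be shareable across threads (stateless streaming semantics).
        TableTunnel.StreamUploadSession session =
                tunnel.createStreamUploadSession("<project>", "<table>");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int workerId = i;
            pool.submit(() -> {
                try {
                    // Each worker builds and flushes its own record pack.
                    TableTunnel.StreamRecordPack pack = session.newRecordPack();
                    for (int n = 0; n < 1000; n++) {
                        Record record = session.newRecord();
                        record.setString("event_name", "worker-" + workerId);  // assumed column
                        record.setBigint("event_count", (long) n);             // assumed column
                        pack.append(record);
                    }
                    pack.flush();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```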

Query acceleration – access

Tools

  • DataWorks (enabled by default)
  • ODPS CMD (configuration required)
  • MaxCompute Studio (configuration required)

SDK interfaces

  • ODPS JavaSDK (see the sketch after this list)
  • ODPS PythonSDK
  • JDBC
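
As a sketch of SDK-based access, the code below uses the SQL executor interface in interactive (MCQA) mode. The class and method names (SQLExecutorBuilder, ExecuteMode.INTERACTIVE) reflect my understanding of the Java SDK and should be verified against the SDK documentation; the query and credentials are placeholders.

```java
import java.util.HashMap;
import java.util.List;

import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.sqa.ExecuteMode;
import com.aliyun.odps.sqa.SQLExecutor;
import com.aliyun.odps.sqa.SQLExecutorBuilder;

public class McqaQuerySketch {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        // Build an executor in interactive (MCQA) mode (assumed builder API).
        SQLExecutor executor = SQLExecutorBuilder.builder()
                .odps(odps)
                .executeMode(ExecuteMode.INTERACTIVE)
                .build();

        // Placeholder query; hints map left empty.
        executor.run("SELECT COUNT(*) FROM sales;", new HashMap<String, String>());
        List<Record> result = executor.getResult();
        System.out.println(result.get(0).get(0));
        executor.close();
    }
}
```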

Old interface compatibility

  • Automatic recognition mode

IV. Demo and summary

Real-time data processing practice based on MaxCompute

The goal is fast, high-performance analysis and decision support over changing data, with queries over one billion records returning in seconds.

This Demo is implemented with MaxCompute + QuickBI. QuickBI now supports a direct-connection mode that uses MaxCompute query acceleration, and QuickBI also has its own acceleration engines such as DLA and CK. Among the current options, direct connection to MaxCompute with query acceleration is the fastest.

Practice

Advantages

  • Streaming Tunnel: writes are visible in real time, and the file fragmentation problem caused by high-QPS writes is solved
  • Query acceleration: low latency (multi-level caching and fast resource scheduling), ease of use (a single set of SQL syntax), elasticity (separation of storage and compute)

Areas for improvement

  • At present, downstream applications can only consume or aggregate the data through full queries each time; there is no further real-time stream-computing step, and data written in real time cannot be modified or deleted.
  • Going forward, MaxCompute will provide a streaming SQL engine to run real-time streaming jobs, unifying stream and batch processing.


This article is original content from Alibaba Cloud and may not be reproduced without permission.