A data warehouse is a basic service a company must provide once its data grows to a certain scale, and it is also a foundational building block of "data intelligence". Fast feedback on data not only improves products and user experience, but also supports the company's scientific decision-making, so acquiring data in real time is particularly important.

At present, most enterprises build two data warehouses: one offline and one real-time. Latency-sensitive business requirements use the real-time warehouse; complex business logic uses the offline warehouse. This architecture is very complex, requiring many systems and computing frameworks, so enterprises must recruit a wide range of specialists, which drives up talent costs and makes troubleshooting difficult; end users also need to be familiar with multiple syntaxes. This article analyzes the current data warehouse architecture, explores whether offline and real-time warehouses can be considered together, and examines whether Flink's unified architecture can solve most of these problems.

There is a bonus at the end of the article: a free ebook download.

Data warehouse architecture

A data warehouse is generally divided into three layers: ODS (Operational Data Store), DW (Data Warehouse) and ADS (Application Data Store).

1. Operational Data Store (ODS)

Raw data from logs or business databases is periodically synchronized into the data warehouse, typically using a Change Data Capture (CDC) tool. Letting a unified Kafka take on this role allows data to land in the warehouse in real time, and also unifies real-time and offline at this layer.
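To make the CDC idea concrete, here is a minimal, hedged sketch in plain Python (not any specific CDC tool): change events are replayed onto a key-value snapshot of the table. The event format with "op", "key" and "row" fields is an illustrative assumption; real tools such as Debezium use richer event envelopes.

```python
def apply_cdc_events(snapshot, events):
    """Replay insert/update/delete change events onto a key -> row snapshot.

    The event shape here is a simplified assumption for illustration only.
    """
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            snapshot[key] = event["row"]        # upsert the latest row image
        elif op == "delete":
            snapshot.pop(key, None)             # remove the row if present
    return snapshot


# A toy change stream for one order table
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "created"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "created"}},
    {"op": "delete", "key": 2, "row": None},
]
table = apply_cdc_events({}, events)
```

Replaying the full event stream from Kafka in this way is exactly what lets the warehouse's ODS layer stay consistent with the source database in real time.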

2. Data Warehouse (DW) layer

DW layer is generally divided into DWD layer and DWS layer:

  • DWD (Data Warehouse Detail) layer: the detail data layer. Data at this layer should be cleaned, consistent and accurate. It contains the same information as the ODS layer, but follows the warehouse's standard schema definitions.
  • DWS (Data Warehouse Service) layer: the summary data layer, which may be lightly aggregated and may be structured as star or snowflake models. This layer performs some business-level computation, and users can derive the data required by data services from it.

3. Application Data Store (ADS) layer

Unlike DWS, this layer serves data directly to users; it requires no further computation and already contains the final data.
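To illustrate the layering, here is a minimal sketch in plain Python on toy order data: ODS holds raw records, DWD cleans them into a standard schema, and DWS lightly aggregates. All field names and cleaning rules are illustrative assumptions, not a fixed standard.

```python
# ODS: raw, possibly dirty records as they arrive from the source
ods = [
    {"id": "1", "amount": "10.5", "status": "PAID "},
    {"id": "2", "amount": "3.0", "status": "paid"},
    {"id": "3", "amount": None, "status": "cancelled"},   # dirty record
]

# DWD: cleaned, typed rows that follow a standard schema
dwd = [
    {
        "id": int(r["id"]),
        "amount": float(r["amount"]),
        "status": r["status"].strip().lower(),
    }
    for r in ods
    if r["amount"] is not None                            # drop unusable rows
]

# DWS: light aggregation by a business dimension (here: status)
dws = {}
for r in dwd:
    dws[r["status"]] = dws.get(r["status"], 0.0) + r["amount"]
```

A data service (the ADS layer) would then read results like `dws` directly, with no further computation.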

The architecture is mainly divided into two pipelines:

  1. Business DB and logs -> Kafka -> real-time data warehouse (Kafka + dimension tables) -> BI DB -> data services
  2. Business DB and logs -> Kafka -> offline data warehouse (Hive MetaStore + HDFS) -> BI DB -> data services

The mainstream data warehouse architecture is still the Lambda architecture. Although the Lambda architecture is complex, it covers the scenarios the business requires and is the most flexible option for the business.

The Lambda architecture is divided into two pipelines:

  • The traditional offline data warehouse offers stability, complex computation and flexibility: batch computing guarantees T+1 report generation and flexible ad-hoc queries.
  • The real-time data warehouse provides low-latency data services. The traditional offline warehouse usually has a T+1 delay, so analysts cannot make real-time decisions, whereas the end-to-end latency of a real-time warehouse can be as low as seconds. This not only speeds up analysis and decision-making, but also enables new business, such as real-time monitoring and alerting. Flink's strength is real-time and streaming computation, and Kafka is the core of real-time warehouse storage.

The figure above shows edges 1-9; each edge represents a data transformation, i.e. a big data computation. This article analyzes these edges and explores the role Flink can play in each of them.

One-stack computation with Flink

Metadata

Kafka does not have the ability to manage metadata. There are two ways to manage metadata:

1. Confluent Schema Registry

After a Schema Registry service is set up, Flink can obtain a table's schema information from the Confluent URL. For tables with hundreds of fields, this saves a lot of effort when writing Flink jobs. Flink is also integrating its schema inference with the Confluent Schema Registry. However, this still does not eliminate the table-creation step, and users still need to fill in the Confluent URL.
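For illustration, here is a hedged sketch of parsing the JSON a Confluent Schema Registry returns for `GET /subjects/<subject>/versions/latest`. The payload below is a hand-written sample in that documented shape, not fetched from a live service; the subject and field names are made up.

```python
import json

# A hand-written sample in the registry's documented response shape:
# {"subject", "version", "id", "schema"} where "schema" is the Avro
# schema serialized as a JSON string.
sample_response = json.dumps({
    "subject": "order_table-value",
    "version": 3,
    "id": 42,
    "schema": json.dumps({
        "type": "record",
        "name": "Order",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "amount", "type": "double"},
        ],
    }),
})

payload = json.loads(sample_response)
avro_schema = json.loads(payload["schema"])   # the schema itself is a JSON string
field_names = [f["name"] for f in avro_schema["fields"]]
```

A table with hundreds of fields would yield the full field list the same way, which is exactly the boilerplate this saves when writing Flink jobs.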

2. Catalog

Flink now provides a built-in HiveCatalog, so Kafka tables can be registered directly in the Hive MetaStore and used in SQL. However, some scenarios require flexible configuration of Kafka start offsets; for this, Flink is also providing LIKE [1] and Table Hints [2].

Flink uses Hive Catalog for both offline and real-time data stores:

```sql
use catalog my_hive;

-- build streaming database and tables
create database stream_db;
use stream_db;
create table order_table (
    id long,
    amount double,
    user_id long,
    status string,
    ts timestamp,
    ...
    ts_day string,
    ts_hour string
) with (
    'connector.type' = 'kafka',
    ...  -- Kafka table configuration
);

-- build batch database and tables
create database batch_db;
use batch_db;
create table order_table like stream_db.order_table (excluding options)
partitioned by (ts_day, ts_hour)
with (
    'connector.type' = 'hive',
    ...  -- Hive table configuration
);
```

With the catalog in place, subsequent computations can be fully reused between batch and streaming, providing the same experience.

Data warehouse import

Computations ① and ⑤ are the imports into the real-time and offline data warehouses respectively. Recently, real-time import into the offline warehouse has become increasingly common practice; importing with Flink makes the offline warehouse's data more real-time.

Previously, import was mainly done with DataStream + StreamingFileSink, but ORC was not supported and the Hive MetaStore could not be updated.

Flink's streaming integration with Hive provides a Hive streaming sink [3]. Using SQL is more convenient and flexible: SQL built-in functions and UDFs can be used, and the same SQL can be reused for stream and batch instead of maintaining two separate jobs.

```sql
insert into [stream_db.|batch_db.]order_table select ... from log_table;
```
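Conceptually, what a partitioned streaming file sink does can be sketched in plain Python: route each record into a partition directory derived from its fields and append it to a file there. This is only an illustration; real sinks additionally handle file rolling, formats such as ORC/Parquet, and partition commit to the metastore.

```python
import os
import tempfile

def write_partitioned(records, base_dir):
    """Append each record to a file under its ts_day/ts_hour partition dir.

    Partition layout and CSV encoding here are illustrative assumptions.
    """
    for r in records:
        part_dir = os.path.join(
            base_dir,
            f"ts_day={r['ts_day']}",
            f"ts_hour={r['ts_hour']}",
        )
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.txt"), "a") as f:
            f.write(f"{r['id']},{r['amount']}\n")

records = [
    {"id": 1, "amount": 10.5, "ts_day": "2020-05-01", "ts_hour": "08"},
    {"id": 2, "amount": 3.0, "ts_day": "2020-05-01", "ts_hour": "09"},
]
base = tempfile.mkdtemp()
write_partitioned(records, base)
```

The `ts_day`/`ts_hour` columns from the `order_table` schema earlier are what make this partition routing possible.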

Data processing

Computations ② and ⑥ are the intermediate data processing of the real-time and offline warehouses respectively. There are mainly three kinds of computation:

  1. ETL: as with data import, there is no difference between batch and streaming.
  2. Dimension table join: enriching rows with fields from a dimension table is a very common warehouse operation. In the offline warehouse this is basically a direct join with a Hive table, but a streaming job is somewhat different, as described in detail below.
  3. Aggregation: in these stateful computations, a streaming job does not produce a fixed, one-off result; the result may change continuously.

Dimension table Join

Unlike offline computing, which only cares about the dimension table's data at a single point in time, a streaming job runs continuously, so it must deal with a dynamic dimension table rather than static data.

In addition, for join efficiency, streaming jobs usually join against a database table, not just a Hive table.

Example:

```sql
-- stream dimension table
use stream_db;
create table user_info (
    user_id long,
    age int,
    address string,
    primary key(user_id)
) with (
    'connector.type' = 'jdbc',
    ...
);

-- periodically import the offline dimension table into the stream dimension table
insert into user_info select * from batch_db.user_info;

-- dimension table join in the streaming job
insert into order_with_user_age
select *
from order_table
join user_info for system_time as of order_table.proctime
on order_table.user_id = user_info.user_id;
```

As shown above, keeping the stream dimension table fresh requires an extra sync job. Support for joining Hive dimension tables directly is also being developed in the community; it needs to address two problems:

  1. The Hive dimension table may be too large to cache:
  • Consider shuffling by key, i.e. a distributed dimension table join, to reduce the amount of data each concurrent task caches
  • Consider putting the dimension table data into Flink state
  2. Dimension table updates:
  • The simple approach is TTL-based expiry
  • A more complex solution is to implement a Hive streaming source combined with Flink's watermark mechanism
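The TTL-expiry approach can be sketched in plain Python: a per-key cache whose entries are reloaded from the dimension table once they are older than the TTL. The `loader` here is a stand-in for a real JDBC/Hive read; all names and the injected clock are illustrative assumptions.

```python
import time

class TtlDimCache:
    """Dimension-table cache where entries expire after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # key -> row, e.g. a database lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._cache = {}             # key -> (row, loaded_at)

    def get(self, key):
        entry = self._cache.get(key)
        now = self._clock()
        if entry is None or now - entry[1] >= self._ttl:
            row = self._loader(key)  # cache miss or expired: reload
            self._cache[key] = (row, now)
            return row
        return entry[0]

# Usage with a fake database and a fake clock to show the expiry behavior
calls = []
def fake_db(key):
    calls.append(key)
    return {"user_id": key, "age": 30}

t = [0.0]
cache = TtlDimCache(fake_db, ttl_seconds=10, clock=lambda: t[0])
cache.get(1)      # loads from the "database"
cache.get(1)      # served from cache, no extra load
t[0] = 11.0
cache.get(1)      # TTL expired: reloaded
```

The trade-off is visible in the sketch: a longer TTL means fewer dimension-table reads but staler join results.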

Stateful computation and data export

Example:

```sql
select age, avg(amount) from order_with_user_age group by age;
```

This simple aggregation SQL executes in radically different modes in batch and stream computing.

The main difference between streaming aggregation and offline aggregation is that the streaming result is a dynamic table [4]: its output changes continuously. Simply put, aggregation over a streaming table is driven by the input, whereas batch only produces output after all input has been consumed. Therefore the result of the dynamic table keeps changing:

  • Inside Flink SQL, the retract mechanism ensures the result is the same as the batch result.
  • If the result goes to external storage, this poses a challenge for the sink.

Output after stateful calculation:

  • If the sink is an updatable database, such as HBase/Redis/JDBC, this seems unproblematic: we just keep updating.
  • But what about non-updatable storage, where the original data cannot be updated? For this, Flink has proposed Changelog support [5], hoping to support such sinks natively and output data with a specific schema so that downstream consumers can work well.

Example:

```sql
-- batch: output to mysql once the computation is complete;
-- only one row per key is written
insert into mysql_table select age, avg(amount) from order_with_user_age group by age;
insert into hive_table select age, avg(amount) from order_with_user_age group by age;

-- streaming: data in Kafka is continuously appended, with an extra
-- column indicating that this is an upsert message
insert into kafka_table select age, avg(amount) from order_with_user_age group by age;
```
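As a hedged illustration in plain Python (not Flink internals), the retract semantics of this streaming aggregation can be simulated: each change to a group's average retracts the old value and emits the new one, and replaying the messages reproduces the batch result. All function names are made up for this sketch.

```python
def retract_avg_stream(rows):
    """Yield (is_insert, age, avg) messages as (age, amount) rows arrive.

    False means "retract this previously emitted result"; True means insert.
    """
    sums, counts = {}, {}
    for age, amount in rows:
        if age in counts:                        # retract the old result first
            yield (False, age, sums[age] / counts[age])
        sums[age] = sums.get(age, 0.0) + amount
        counts[age] = counts.get(age, 0) + 1
        yield (True, age, sums[age] / counts[age])

def materialize(messages):
    """Replay retract/insert messages into a final table, key -> avg."""
    table = {}
    for is_insert, age, value in messages:
        if is_insert:
            table[age] = value
        elif table.get(age) == value:
            del table[age]
    return table

rows = [(30, 10.0), (30, 20.0), (40, 5.0)]
result = materialize(retract_avg_stream(rows))
```

The materialized `result` equals what the batch query would produce over the same rows, which is the guarantee the retract mechanism provides inside Flink SQL.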

Ad-hoc and OLAP

The offline data warehouse can perform computation ⑨: ad-hoc queries over detail data or summary data, which gives data analysts flexible querying.

At present, a big disadvantage of the real-time warehouse is that it cannot serve ad-hoc queries, because it does not retain historical data: Kafka may be able to keep three days or so of data, but storage costs are high and query efficiency is poor.

One idea is to provide batch/stream-unified sink components for OLAP databases:

  • Druid sink
  • Doris sink
  • Clickhouse sink
  • HBase/Phoenix sink

Conclusion

Based on the Lambda architecture, this article analyzed the capability of the Flink stack as a data warehouse computing solution. Some of the new features mentioned here are still iterating and evolving rapidly in Flink. With continued exploration and practice, we hope Flink keeps advancing toward integrated computation, and that the future data warehouse architecture can truly unify users' offline and real-time pipelines and provide a unified experience:

  • Unified metadata
  • Unified SQL development
  • Unified data import and export
  • Consider unified storage in the future

References

[1] cwiki.apache.org/confluence/…

[2] cwiki.apache.org/confluence/…

[3] cwiki.apache.org/confluence/…

[4] ci.apache.org/projects/fl…

[5] cwiki.apache.org/confluence/…

Bonus

To calmly handle technical problems in production, the ebook "Apache Flink: Ten Technical Difficulties in Practice" is available for free download!

[Click here to download the Apache Flink ebook collection for free >>>](https://yq.aliyun.com/download/3925)

  • Deep Dive | 1.02 million lines of code, 1,270 questions, what does Flink 1.10 release?
  • How to determine the planned cluster size from development to production?
  • Demo: Build streaming applications based on Flink SQL
  • Flink Checkpoint Troubleshooting Guide
  • How to analyze and deal with Flink backpressure?
  • Flink on YARN (Part 1): master the infrastructure and startup process in one diagram
  • Flink on YARN (Part 2): FAQs and troubleshooting methods
  • Apache Flink integration with Apache Hive
  • Flink Batch SQL 1.10 practice
  • How to customize Python UDFs in PyFlink 1.10?
  • Flink 1.10 Native Kubernetes principles and practices

This book was compiled and shared by core contributors to Apache Flink and production users at top-tier companies. Its content is comprehensive and rich, covering principle analysis, application practice, demos, troubleshooting and solutions for common production problems, and Flink 1.10 ecosystem principles and practice, helping big data developers really solve Flink production issues!