This article describes how to build a real-time data warehouse with accelerated, unified data services based on Flink + Hologres.
Author: Chen Jianxin, data warehouse development engineer at Call Technology, currently focused on integrating the offline and real-time architectures of the company's big data platform.
Shenzhen Call Technology Co., Ltd. (hereinafter referred to as “Call Technology”) is a pioneer in the shared power bank industry. Its main business covers self-service power bank rental, development of customized mall navigation kiosks, advertising display devices and advertising distribution, among other services. Call Technology offers a multi-form product line spanning small and medium-sized cabinets as well as desktop units; its services have landed in more than 90% of cities nationwide, and it has more than 200 million registered users, covering user needs across all scenarios.
I. Introduction to the big data platform
(1) Development history
The development of the big data platform has gone through three stages:

1. Discrete 0.x: there was no unified big data platform to support data services. Each business line pulled data and ran computations on its own, and daily data needs were maintained with a low-spec Greenplum serving offline workloads.
2. Offline 1.0 (EMR): the architecture was then upgraded to offline 1.0 based on EMR, Alibaba Cloud's elastic, distributed big data cluster service that bundles common components such as Hadoop, Hive and Spark for offline computing. Alibaba Cloud EMR solved three pain points for us: first, storage and computing resources can be scaled horizontally; second, it removed the development and maintenance burden caused by heterogeneous data across business lines, since the platform now cleans and stores data uniformly; third, we could build our own layered data warehouse, divide subject areas, and lay a solid foundation for our metric system.
3. Real-time, unified 2.0 (Flink + Hologres): we are currently in the Flink + Hologres real-time data warehouse stage, which is the core of what this article shares. It has brought two qualitative changes to our big data platform: real-time computing and unified data services. Based on these two points, we have accelerated data exploration and supported the rapid development of the business.
(2) Platform capabilities
In general, version 2.0 of the big data platform provides the following capabilities:
1) Data integration
The platform supports integrating business databases and business logs in either real-time or offline mode.
2) Data development
The platform supports Spark-based offline computing as well as Flink-based real-time computing.
3) Data services
The data service consists of two main components: the analysis services and ad hoc analysis capabilities provided by Impala, and the interactive analysis capabilities for business data provided by Hologres.
4) Data application
At the same time, the platform can connect directly with common BI tools and integrate quickly with business systems.
(3) Achievements
The capabilities provided by the big data platform have brought a number of achievements, summarized in the following five points:

1) Horizontal scaling: the core of the big data platform is its distributed architecture, which lets us scale storage and computing resources at low cost.
2) Resource sharing: the resources of all servers are pooled. In the previous architecture each business department maintained its own cluster, which wasted resources, made reliability hard to guarantee, and drove up operation and maintenance costs; now the platform schedules resources uniformly.
3) Data sharing: all business data and other heterogeneous sources such as service logs are integrated, and the platform cleans and links them uniformly.
4) Service sharing: once data is shared, the platform exposes services uniformly, and each business line can quickly obtain the data support it needs without repeated development.
5) Security guarantees: the platform provides unified security authentication and authorization, granting fine-grained permissions at different levels to different people and keeping data secure.
II. Business demand for data
With the rapid development of the business, building a unified real-time data warehouse became extremely urgent. Based on the 0.x and 1.0 platform architectures and on the current state and future trend of the business, the requirements for the 2.x data platform focus on the following aspects:

1) Real-time large screen: the real-time large screen needs to replace the old quasi-real-time one with a more reliable, lower-latency technical solution.
2) Unified data service: high-performance, high-concurrency and high-availability data services are key to the digital transformation of the enterprise, so a unified data portal needs to be built to provide a single, unified external output.
3) Real-time data warehouse: data timeliness plays an increasingly important role in business operations and demands faster, more timely responses.
III. Technical solution for the real-time data warehouse and unified data service
(1) Overall technical architecture
The technical architecture is divided into four parts: data ETL, the real-time data warehouse, the offline data warehouse, and data applications.
- Data ETL: business databases and business logs are processed in real time, with Flink used for the real-time computation.
- Real-time data warehouse: after real-time processing, data is written to Hologres for storage and analysis.
- Cold business data is stored in the Hive offline data warehouse and synchronized to Hologres for further data analysis.
- Hologres uniformly connects to common BI tools such as Tableau, Quick BI and DataV, as well as to business systems. A minimal Flink SQL sketch of this flow follows the list.
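As a concrete illustration of this flow, below is a minimal Flink SQL sketch that reads business events from Kafka and writes them into Hologres for analysis and BI. The topic, table and connection parameters are placeholders, and the Hologres connector options follow Alibaba Cloud's connector documentation, so they may differ in an actual setup.

```sql
-- Source: raw business events landed in Kafka (hypothetical topic and schema).
CREATE TABLE ods_order_event (
    order_id   BIGINT,
    user_id    BIGINT,
    city       STRING,
    amount     DECIMAL(10, 2),
    event_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'ods_order_event',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'rt-dw-demo',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);

-- Sink: a Hologres table that serves interactive analysis and BI queries.
CREATE TABLE holo_order_event (
    order_id   BIGINT,
    user_id    BIGINT,
    city       STRING,
    amount     DECIMAL(10, 2),
    event_time TIMESTAMP(3)
) WITH (
    'connector' = 'hologres',
    'endpoint'  = 'holo-endpoint:80',    -- placeholder endpoint
    'dbname'    = 'rt_dw',
    'tablename' = 'dwd.order_event',
    'username'  = '${access_id}',
    'password'  = '${access_key}'
);

-- Continuously move cleaned events from Kafka into Hologres.
INSERT INTO holo_order_event
SELECT order_id, user_id, city, amount, event_time
FROM ods_order_event;
```

Because Hologres is PostgreSQL-compatible, BI tools and business systems can then query the sink table directly with standard PostgreSQL drivers.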
(2) Real-time data warehouse data model
As shown above, the layered model of the real-time data warehouse is similar to that of the offline data warehouse, only with fewer layers.
- The first layer is the raw data layer. There are two types of data sources: the binlog of the business databases and the business logs from the servers, both of which are stored uniformly in Kafka.
- The second layer is the detail data layer. Data from the raw layer in Kafka is extracted by ETL and written back to Kafka as real-time detail records, so that different downstream consumers can subscribe to it at the same time and later application layers can use it conveniently. Dimension table data is also stored in Hologres to support subsequent joins and conditional filtering; a sketch of this layer follows the list.
- The third layer is the data application layer. Besides the data that lands in Hologres directly, Hologres is also connected to Hive, and Hologres provides services to the upper-layer applications uniformly.
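As a sketch of how the detail layer is produced, the Flink SQL below reads a raw log topic from the first layer, enriches it with a device dimension table stored in Hologres through a lookup join, and writes the detail records back to a DWD Kafka topic for multiple downstream subscribers. All table names, fields and connection options are assumptions; the Hologres connector options follow Alibaba Cloud's documentation.

```sql
-- ODS: raw device log from Kafka (hypothetical schema).
CREATE TABLE ods_device_log (
    device_id STRING,
    status    STRING,
    log_time  TIMESTAMP(3),
    proc_time AS PROCTIME()
) WITH (
    'connector' = 'kafka',
    'topic' = 'ods_device_log',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Dimension table maintained in Hologres, used for lookup joins.
CREATE TABLE dim_device (
    device_id STRING,
    city      STRING,
    merchant  STRING,
    PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
    'connector' = 'hologres',
    'endpoint'  = 'holo-endpoint:80',
    'dbname'    = 'rt_dw',
    'tablename' = 'dim.device',
    'username'  = '${access_id}',
    'password'  = '${access_key}'
);

-- DWD: enriched detail records written back to Kafka for multiple subscribers.
CREATE TABLE dwd_device_log (
    device_id STRING,
    status    STRING,
    city      STRING,
    merchant  STRING,
    log_time  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_device_log',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Clean and enrich the raw log, then publish the detail stream.
INSERT INTO dwd_device_log
SELECT l.device_id, l.status, d.city, d.merchant, l.log_time
FROM ods_device_log AS l
LEFT JOIN dim_device FOR SYSTEM_TIME AS OF l.proc_time AS d
    ON l.device_id = d.device_id;
```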
(3) Data flow of the overall technical architecture
The following data flow diagram makes the overall architecture and the data flow of the warehouse model more concrete. As the figure shows, it is divided into three modules: integrated processing, the real-time data warehouse, and data applications. Looking at how data flows in and out, there are two core points:
- The first core point is Flink's real-time computation: MySQL binlog data can be consumed from Kafka or read directly with Flink CDC, and the results can be written back to the Kafka cluster (a sketch follows this list).
- The second core point is the unified data service: Hologres now provides the unified data service, which avoids data silos and the consistency problems they bring, and also accelerates the analysis of offline data.
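A minimal sketch of the first core point, using the open-source MySQL CDC connector with placeholder connection details: Flink subscribes to the MySQL binlog directly, and the resulting changelog can then be aggregated in Flink SQL or written on to Kafka or Hologres as described above.

```sql
-- Read the orders table's binlog directly with the Flink MySQL CDC connector
-- (hostname, credentials, database and table names are placeholders).
CREATE TABLE mysql_orders (
    order_id   BIGINT,
    user_id    BIGINT,
    amount     DECIMAL(10, 2),
    created_at TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host',
    'port' = '3306',
    'username' = 'flink_reader',
    'password' = '******',
    'database-name' = 'order_db',
    'table-name' = 'orders'
);
```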
IV. Practice details
(1) Big data technology selection
The implementation of the solution is divided into two parts: real-time computing and analytical services. For real-time computing we chose fully managed Flink on Alibaba Cloud, which has the following advantages:
1) State management and fault tolerance mechanism;
2) Table API and Flink SQL support;
3) High throughput and low latency;
4) Exactly-once semantics;
5) Unified stream and batch processing;
6) The value-added services that come with full management.
For analytical services we chose Alibaba Cloud Hologres for interactive analysis, which brings several benefits:
1) Extremely fast analytical responses;
2) High-concurrency reads and writes;
3) Separation of compute and storage;
4) Easy to use.
(2) Practice: the real-time large screen
The figure shows the comparison between the old and new solutions for the real-time large-screen business. Taking orders as an example, in the old scheme order data was synchronized from the order database to another database through DTS. The synchronization itself was real-time, but the computation and processing relied on scheduled tasks running at intervals of one or five minutes, while sales and management need a more dynamic, real-time grasp of the business, so it was not truly real-time; slow and unstable responses were another big problem. The new scheme adopts the Flink real-time computing + Hologres architecture. Development is done entirely with Flink SQL, which, compared with our previous MySQL-based computation, was essentially a seamless migration and allowed a fast landing. Hologres is used for data analysis and services. Again taking orders as an example, we need today's order revenue, today's order count and the number of ordering users, and as the business diversifies a city dimension may also be required. Hologres' analytical capability fully supports the rapid display of these revenue, order-count, user-count and city-dimension metrics.
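To make this concrete, here is a minimal Flink SQL sketch of such a large-screen job. The table names, fields and connection parameters are assumptions for illustration; the Hologres sink options (including the upsert setting) follow Alibaba Cloud's Hologres connector documentation and may need adjusting to the actual environment.

```sql
-- Detail order stream; in practice this would be the DWD order topic (schema assumed).
CREATE TABLE dwd_orders (
    order_id   BIGINT,
    user_id    BIGINT,
    city       STRING,
    amount     DECIMAL(10, 2),
    created_at TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_orders',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Result table in Hologres; rows are upserted by (stat_date, city) as new orders arrive.
CREATE TABLE ads_order_realtime (
    stat_date STRING,
    city      STRING,
    revenue   DECIMAL(18, 2),
    order_cnt BIGINT,
    user_cnt  BIGINT,
    PRIMARY KEY (stat_date, city) NOT ENFORCED
) WITH (
    'connector'  = 'hologres',
    'endpoint'   = 'holo-endpoint:80',   -- placeholder endpoint
    'dbname'     = 'rt_dw',
    'tablename'  = 'ads.order_realtime',
    'username'   = '${access_id}',
    'password'   = '${access_key}',
    'mutatetype' = 'insertorupdate'      -- upsert semantics (per connector docs)
);

-- Today's revenue, order count and ordering users per city, refreshed continuously.
INSERT INTO ads_order_realtime
SELECT
    DATE_FORMAT(created_at, 'yyyy-MM-dd') AS stat_date,
    city,
    SUM(amount)             AS revenue,
    COUNT(*)                AS order_cnt,
    COUNT(DISTINCT user_id) AS user_cnt
FROM dwd_orders
GROUP BY DATE_FORMAT(created_at, 'yyyy-MM-dd'), city;
```

The large screen or BI tool then only needs to query `ads_order_realtime` in Hologres to render the latest metrics.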
(3) Practice: real-time data warehouse and unified data service
Take one business scenario as an example: a large volume of business logs averaging terabytes of data per day. First, the pain points of the old scheme:
- Poor data timeliness: because of the large data volume, the old scheme computed data on an hourly offline schedule. Its timeliness was poor and could not meet the real-time requirements of many business products. For example, the hardware system needs to know the current status of devices, such as alarms, errors and empty slots, in real time, and make decisions and take action promptly.
- Data silos: in the old solution, a large number of business reports were connected to Tableau, for example to analyze how many records devices reported in the past hour or day and which devices reported exceptions. In other scenarios, Spark results were additionally stored in MySQL or Redis. Multiple systems thus formed data silos, which posed a huge challenge for platform maintenance.
With the 2.0 Flink + Hologres architecture, the business-log pipeline has been transformed:
- The previous terabytes of daily logs are handled without pressure in Flink's low-latency computing framework. For example, the old Flume + HDFS + Spark link was abandoned outright and replaced by Flink, so we only need to maintain one Flink computing framework.
- Device status data is collected as unstructured data; after cleaning it is written back to Kafka, because the consumers are diverse and this makes it easy for multiple downstream consumers to subscribe at the same time.
- In this scenario, the hardware system needs to query the status of tens of millions of devices (power banks) in real time and with high concurrency, which places high demands on the serving capability. Hologres provides high-concurrency read and write capability: a primary-key table for device status is created and updated in real time, satisfying the CRM system's real-time queries on devices (power banks); a sketch follows this list.
- At the same time, Hologres also stores the latest hot detail data and serves it externally directly.
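Below is a sketch of the device status table and the kind of query it serves. Since Hologres is PostgreSQL-compatible, the DDL is plain PostgreSQL-style; the table and column names are assumptions. Flink keeps one row per device up to date through upserts on the primary key (in Hologres such a table would typically also be configured as row-oriented for fast point lookups), and the hardware or CRM side issues simple primary-key queries with high concurrency.

```sql
-- Device (power bank) status table in Hologres: one row per device, updated in place
-- by the Flink job through primary-key upserts. Names and columns are illustrative.
CREATE TABLE device_status (
    device_id  TEXT PRIMARY KEY,
    status     TEXT,        -- e.g. online / alarm / error / empty-slot
    city       TEXT,
    updated_at TIMESTAMPTZ
);

-- High-concurrency point query issued by the CRM / hardware system.
SELECT device_id, status, updated_at
FROM device_status
WHERE device_id = 'D0000001';
```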
(4) Business support results
Through the new Flink + Hologres solution, we support three scenarios:

1) Real-time large screen: the business can iterate on diversified requirements more efficiently while reducing development and operation costs.
2) Unified data service: a single HSAP system integrates serving and analysis, avoiding data silos and problems with consistency and security.
3) Real-time data warehouse: it meets the ever higher requirements for data timeliness in business operations, with second-level responses.
V. Future plans
As the business iterates, we have two major plans for the big data platform: unifying stream and batch processing, and improving the real-time data warehouse.
- At present, the big data platform is largely a mixture of an offline architecture and a real-time architecture. In the future we will drop the redundant offline code and adopt Flink's unified stream-batch computing engine.
- In addition, only part of the business has been migrated so far. We will draw on the mature metric system of the previous offline data warehouse to guide the construction of the current real-time data warehouse, and migrate fully to the 2.0 Flink + Hologres architecture.
Through this planning we hope to build an even better real-time data warehouse on fully managed Flink and Hologres, but we also have further requirements for both:
(1) Requirements for fully managed Flink
The SQL editor of fully managed Flink makes writing Flink SQL jobs efficient and convenient, and it provides many common upstream and downstream SQL connectors that meet development needs. However, there are still some capabilities we hope fully managed Flink will support in future iterations:
- SQL job versioning and compatibility monitoring;
- Hive 3.x integration for SQL jobs;
- Easier packaging of DataStream jobs and faster upload of resource packages;
- Automatic tuning for jobs deployed in session cluster mode.
(2) Requirements for Hologres

Hologres not only supports high-concurrency real-time writes and queries, but is also compatible with the PostgreSQL ecosystem, which makes unified data services easy to access and use. However, there are a few capabilities we hope Hologres will support in later iterations:
- Hot upgrades, to reduce the impact on services;
- Data table backup and read/write separation;
- Accelerated queries against the Alibaba Cloud EMR Hive data warehouse;
- Computing resource management by user group.