The song of agile
I take out several palpable exist | DBus
Everyone to play the stream processing | Wormhole
When we make is the database | Moonbox
Appearance level last ten kilometers | Davinci
Real-time Data Platform (RTDP) is an important and common big Data infrastructure Platform. In the previous part (design), we introduced RTDP from the perspective of modern data warehouse architecture and typical data processing, and discussed the overall design architecture of RTDP. As the second part (technology), this paper introduces the technology selection and related components of RTDP from the technical perspective, and discusses the relevant modes applicable to different application scenarios. The agile path of RTDP begins here
Read more: Learn what agile big Data means by using enterprise real-time data platform as an example
How to Design real-time Data Platform (Design Part)
I. Introduction of technology selection
In the design chapter, we present an overall architectural design of RTDP (Figure 1). In the technical section, we will recommend the overall technical component selection; Each technical component is briefly introduced, especially the four technical platforms (unified data acquisition platform, unified streaming processing platform, unified computing service platform, unified data visualization platform) that we abstract and realize are emphatically introduced. Discuss Pipeline end to end topics, including function integration, data management, data security, etc.
Figure 1 RTDP architecture
1.1 Overall technology selection
Figure 2 Overall technology selection
First, let’s take a brief look at Figure 2:
- Data source, client, lists the common data source types for most data application projects.
- Data bus platform DBus, as a unified data acquisition platform, is responsible for docking various data sources. DBus extracts the data in an incremental or full manner, performs some general data processing, and finally publishes the processed messages on Kafka.
- Distributed message system Kafka connects message producers and consumers with distributed, high availability, high throughput, publish-subscribe and other capabilities.
- Wormhole, as a unified streaming processing platform, is responsible for processing and docking various data target stores on streams. Wormhole consumes messages from Kafka, supports SQL configuration for data processing logic on streams, and supports configuration to send data to different data target stores (sinks) with final consistency (idempotent) effect.
- In the data computing storage layer, the RTDP architecture selects open technology component selection. Users can select appropriate storage according to the actual data characteristics, computing mode, access mode, data volume and other information to solve specific data project problems. RTDP also supports simultaneous selection of multiple different data stores, allowing more flexibility to support different project requirements.
- Moonbox computing service platform, as a unified computing service platform, the client is responsible for the integration of heterogeneous data storage, calculation of optimization, the heterogeneous data storage is mixed virtualization technology (data), such as the data display and interaction is responsible for convergent unified metadata query, data computing and distributed, unified data query language (SQL), unified data service interface, etc.
- Davinci, a visual application platform, as a unified data visualization platform, supports various data visualization and interaction requirements in a configurable way, and can integrate other data applications to provide partial data visualization requirements solutions. In addition, it also supports different data practitioners to cooperate on the platform to complete various daily data applications. Other data terminal consumption systems such as Data development platform Zeppelin and data algorithm platform Jupyter are not introduced in this paper.
- Cutting topics such as data management, data security, development operation and maintenance, and drive engine can be integrated and redeveloped by interconnecting service interfaces of DBus, Wormhole, Moonbox, and Davinci to support end-to-end control and governance.
Next, we will further detail the technical components and section topics involved in the figure above, introduce the functions and characteristics of the technical components, focus on the design ideas of the technical components developed by us, and discuss the section topics.
1.2 Technical Components
1.2.1 Data Bus Platform DBus
Figure 3 DBus of RTDP architecture
1.2.1.1 DBus design idea
1) Look at design ideas from an external perspective
- Interconnect with different data sources and extract incremental data in real time. For databases, extract operation logs. Interconnect with multiple agents for log types.
- All messages in a unified UMS message format published in Kafka, UMS is a standardized metadata information JSON format, through the unified UMS to achieve logical messages and physical Kafka Topic decoupling, so that the same Topic can flow multiple UMS message tables.
- It can pull all data from the database and merge it with incremental data into UMS messages, transparently unaware of downstream consumption.
2) Look at design ideas from an internal perspective
- Data formatting based on Storm computing engine to ensure minimal end-to-end latency for messages.
- Standard formatting of data from different data sources generates UMS information, including:
➤ Generate a unique monotonically increasing ID for each message, corresponding to the system field ums_ID_
➤ Verify that each message has an event timestamp corresponding to the system field ums_ts_
✔ Confirm the operation mode of each message (add, delete, change, or INSERT only), corresponding to the system field ums_OP_
- Real-time perception and version number management of database table structure changes to ensure upstream metadata changes are identified for downstream consumption.
- Ensure strongly ordered (not absolutely ordered) messages and at least once semantics when deploying Kafka.
- Ensure message – to – end probing awareness through the heartbeat table mechanism.
1.2.1.2 DBus Functions and Features
- Supports configuring full data pull
- Support for configuring incremental data pull
- Supports configured online log formatting
- Support visual monitoring and warning
- Multi-tenant security control can be configured
- Supports the collection of separate table data into a single logical table
1.2.1.3 DBus technical architecture
Figure 4 DBus data flow architecture diagram
For more DBus technical details and user interface, see:
Making: github.com/BriData
1.2.2 Distributed Message system Kafka
Kafka has become a de facto standard distributed message processing system with big data streaming. Of course, Kafka is constantly expanding and improving, and now has certain storage capacity and streaming processing capacity. There are plenty of articles and information available on Kafka’s own capabilities and technologies, but this article won’t go into detail on Kafka’s own capabilities.
Here we explore the topic of message Metadata Management and Schema Evolution on Kafka.
Figure 5
Image: cloudurable.com/images/kafk…
Figure 5 shows the introduction of a metadata management component, Schema Registry, in the Confluent solution behind Kafka. This component is responsible for managing the metadata and Topic information that flows messages across Kafka and provides a range of metadata management services. The reason for introducing such a component is that Kafka consumers can understand what data is flowing across different topics, as well as the metadata information of the data, and effectively parse and consume it.
Any data flow link, no matter what system it is on, will have the metadata management problem of this data link, Kafka is no exception. Schema Registry is a centralized metadata management solution for Kafka data links. Based on Schema Registry, Confluent provides corresponding Kafka data security mechanism and Schema evolution mechanism.
For more information about Schema Registry, see:
Kafka Tutorial:Kafka, Avro Serialization and the Schema Registry
Cloudurable.com/blog/kafka-…
So how to solve the problem of Kafka message metadata management and schema evolution in RTDP architecture?
1.2.2.1 Metadata Management
- DBus automatically logs and services real-time aware database metadata changes
- The DBus automatically records the online formatted log metadata and provides services
- DBus publishes unified UMS messages on Kafka. UMS has its own message metadata information. Therefore, downstream consumption does not need to invoke the centralized metadata service and can directly obtain data metadata information from UMS messages
1.2.2.2 Schema Evolution
- UMS messages contain information about a Schema Namespace. A Namespace is a seven-layer location string that uniquely locates the life cycle of any table. The Namespace is equivalent to the IP address of a table in the following format:
[Datastore].[Datastore Instance].[Database].[Table].[TableVersion].[Database Partition].[Table Partition]
Example: oracle. Oracle01. Db1. Table1. V2. Dbpar01. Tablepar01
Where [Table Version] represents the Version number of a Schema for this Table, which is automatically maintained by DBus if the data source is a database.
- In the RTDP Schema, Kafka’s downstream is consumed by Wormhole, and the Wormhole uses [TableVersion] as * when consuming UMS. This means that when the upstream Schema of a table changes, the Version is automatically upgraded. However, Wormhole will ignore this Version change and consume incremental/full data for all versions of the table. In Wormhole, SQL and output fields can be processed on the stream. If the upstream Schema change is a “compatibility change” (adding fields or modifying expanded field types), the correct execution of the Wormhole SQL will not be affected. When incompatibility changes occurred upstream, Wormhole reported an error, and manual intervention was required to repair the logic of the new Schema.
As can be seen from the above, Schema Registry and DBus+UMS are two different design ideas to solve metadata management and Schema evolution. Both of them have advantages and disadvantages. For a simple comparison, see Table 1.
Table 1 Comparison between Schema Registry and DBus+UMS
Here is an example of UMS:
Figure 6 UMS message example
1.2.3 Streaming processing platform Wormhole
FIG. 7 Wormhole of RTDP architecture
1.2.3.1 Wormhole design idea
1) Look at design ideas from an external perspective
- Consuming UMS messages and custom JSON messages from Kafka
- Responsible for docking different data target stores (Sinks) and achieving the final consistency of sinks through idempotent logic
- Support to configure SQL to achieve flow processing logic
- Provide Flow abstractions. Flow is defined by a Source Namespace and a Sink Namespace and is unique. Flow allows you to define processing logic, which is a logical abstraction for stream-on-stream processing. By decoupling from physical Spark Streaming and Flink Streaming, the same Stream can process multiple Flow processing streams, and the Flow can switch between different streams at will.
- Support for backfill based Kappa architecture; Lambda architecture based on Wormhole Job is supported
2) Look at design ideas from an internal perspective
- Spark Streaming and Flink computing engine are used to process data stream. Spark Streaming supports high throughput, batch Lookup, and batch write Sink scenarios. Flink supports scenarios such as low latency and CEP rules.
- Realize idempotent storage logic of different sinks through ums_ID_ and ums_op_
- Through the calculation of the push down to achieve Lookup logic optimization
- Abstract several unity to support functional flexibility and design consistency
✔ Unified DAG higher order fractal abstraction
➤ Unified unified flow message UMS protocol abstraction
✔ Unified data logical table Namespace abstraction
- Abstract several interfaces to support extensibility
➤ Roll back more Sink support
✔ ✔ SwiftsInterface: Support for custom flow processing logic
✔ UDF: More streams on processing UDF support
- Feedback message is used to collect real-time dynamic indicators and statistics of streaming operations
1.2.3.2 Wormhole Function
- Support visualization, configuration, SQL development and implementation of streaming projects
- Supports management, operation, maintenance, diagnosis, and monitoring of instruction dynamic streaming processing
- Supports unified structured UMS messages and customized semi-structured JSON messages
- Supports the processing of add, delete and modify three-state event message flows
- Supports a single physical flow to process multiple logical service flows in parallel
- Support stream Lookup Anywhere, Pushdown Anywhere
- Support event timestamp streaming based on business policy
- Supports UDF registration management and dynamic loading
- Supports concurrent idempotent storage for multi-objective data systems
- Supports multiple levels of data quality management based on incremental messages
- Supports streaming and batch processing based on incremental messages
- Lambda and Kappa architectures are supported
- Supports seamless integration with the three-party system and can serve as the flow control engine of the three-party system
- Supports private cloud deployment, security permission control, and multi-tenant resource management
1.2.3.3 Wormhole Technology Architecture
Figure 8. Wormhole Data flow architecture diagram
For more details on Wormhole technology and user interface, see:
GitHub:github.com/edp963/worm…
1.2.4 Common data calculation storage selection
RTDP architecture takes an open and integrated approach to data computing and storage selection. Different data systems have their own advantages and suitable scenarios, but no one data system can be suitable for a variety of storage computing scenarios. So when appropriate, mature, mainstream data systems become available, Wormhole and Moonbox will scale and integrate support accordingly.
Here are some general choices:
-
Relational database (Oracle/MySQL, etc.) : Suitable for small data volume of complex relationship calculation
-
Distributed column storage system
✔ Kudu: Scan optimization, suitable for OLAP analysis computing scenarios
✔ HBase: Random read and write, suitable for data service scenarios
✔ Cassandra: High performance write, suitable for high frequency writing of massive data
✔ ClickHouse: High performance computing, suitable for INSERT write scenarios only (later will support update deletion operations)
- Distributed file system
✔ HDFS/Parquet/Hive: Append only, suitable for mass data batch computing scenarios
- Distributed document system
➤ Balance ability, suitable for large data volumes and moderately complex computing
- Distributed index system
➤ Index ability, suitable for fuzzy query and OLAP analysis scenarios
- Distributed predictive system
➤ Prediction ability, suitable for high performance OLAP analysis scenarios
1.2.5 Computing service platform Moonbox
Figure 9 Moonbox of RTDP architecture
1.2.5.1 Moonbox design idea
1) Look at design ideas from an external perspective
- Responsible for docking different data systems, support unified way across heterogeneous data system impromptu calculation
- There are three Client invocation modes: RESTful service, JDBC connection, and ODBC connection
- Unified metadata closure; Unified query language SQL closed; Unified permission control closed
- Two query result writing modes are provided: Merge and Replace
- Two interaction modes are provided: Batch mode and Adhoc mode
- Data virtualization implementation, multi-tenant implementation, can be regarded as a virtual database
2) Look at design ideas from an internal perspective
- The SQL is parsed, the parsing process is processed by normal Catalyst, and the logical execution subtree of the push-down data system is finally generated for push-down calculation, and then the results are pulled back for mixing and return
- Supports two levels of Namespace: database.table to provide virtual database experience
- Moonbox Grid provides high availability and concurrency
- Provides fast execution channel for all push-down logic (no mixing)
1.2.5.2 Moonbox Functions and Features
- Support seamless mixing across heterogeneous systems
- Supports unified SQL syntax query calculation and writing
- Supports RESTful service, JDBC connection, and ODBC connection
- Two interaction modes are supported: Batch mode and Adhoc mode
- Supports Cli Command and Zeppelin
- Multi-tenant user rights system is supported
- Supports table – level permissions, column – level permissions, read permissions, write permissions, and UDF permissions
- Supports YARN scheduler resource management
- Support metadata services
- Scheduled Tasks
- Supporting security Policies
1.2.5.3 Moonbox technical architecture
Figure 10 Moonbox logic module
For more technical details and user interface of Moonbox, see:
Making: github.com/edp963/moon…
1.2.6 Visual application platform Davinci
Figure 11 Davinci of RTDP architecture
1.2.6.1 Davinci’s design idea
1) Look at design ideas from an external perspective
- Responsible for various data visualization display functions
- Support for JDBC data sources
- Provide equal user system, each user can establish their own Org, Team and Project
- Support SQL programming data processing logic, support drag-and-drop editing visual display, provide a multi-user social division of labor and collaboration environment
- Provide a variety of different chart interaction capabilities and customization capabilities to meet different data visualization needs
- Provides the ability to embed and integrate into other data applications
2) Look at design ideas from an internal perspective
- Build around views and widgets. A View is a logical View of the data; Widgets are visual views of data
- Select classified data, ordered data and quantitative data by user customization, and automatically display the view according to reasonable visualization logic
1.2.6.2 Davinci Functions and Features
1) data source
- Support for JDBC data sources
- CSV file upload is supported
2) Data view
- Support for defining SQL templates
- SQL highlighting is supported
- Support for SQL tests
- Write back is supported
3) Visual components
- Support for predefined charts
- Supporting controller components
- Free style support
4) Interactive ability
- Supports full screen display of visual components
- Support for visual component local controllers
- Supports filtering linkage between visual components
- Support group controller visual components
- Support for visual component native advanced filters
- Support large amount of data display pages and slider
5) Integration ability
- Support visual component CSV download
- Public sharing of visual components is supported
- Supports visual component authorization sharing
- Support for dashboard public sharing
- Support for dashboard authorization sharing
6) Security permissions
- Data row and column permissions are supported
- LDAP login integration is supported
For more technical details and user interface of Davinci, see:
GitHub:github.com/edp963/davi…
1.3 Discussion of section topics
1.3.1 Data Management
1) Metadata management
- DBus can retrieve metadata from data sources in real time and provide service queries
- Moonbox can access metadata from data systems in real time and provide service queries
- For RTDP architecture, metadata information of real-time data source and AD hoc data source can be collected by calling DBus and Moonbox RESTful service, and enterprise metadata management system can be built based on this
2) Data quality
- Wormhole enables messages to be sent to the HDFS (HDFSlog) in real time. Hdfslog-based Wormhole jobs support Lambda architecture. Hdfslog-based Backfill supports the Kappa architecture. Periodic task can be set to select Lambda architecture or Kappa architecture to refresh Sink periodically to ensure the final consistency of data. Wormhole also provides real-time Feedback to the Wormhole system for messages about abnormal flow processing or Sink writing, and provides RESTful services for three-party applications to process.
- Moonbox’s ability to mix heterogeneous systems at the drop of a hat gives it a Swiss Army knife of convenience. Moonbox can be used to write timing SQL script logic, compare the heterogeneous system data of concern, or make statistics on the data table fields of concern, etc., so as to develop a data quality detection system again based on Moonbox’s ability.
3) Consanguinity analysis
- The Wormhole flow processing logic is usually SQL, which can be aggregated through RESTful services.
- Moonbox controls the unified entry point for data queries, and all the logic is SQL, which can be aggregated through Moonbox logs.
- For RTDP architecture, real-time processing logic and AD hoc processing logic OF SQL can be called Wormhole RESTful service and Log collection of Moonbox, based on which enterprise-level blood analysis system can be built.
1.3.2 Data Security
Figure 12. RTDP data security
The figure above shows that in the RTDP architecture, four open source platforms cover end-to-end data transfer links, and all aspects of data security are considered and supported on each node, ensuring end-to-end data security in the real-time data pipeline.
In addition, as Moonbox serves as a unified gateway for data access at the application layer, operation audit logs based on Moonbox can obtain a lot of information at the security level, and a data security warning mechanism can be established around operation audit logs to build an enterprise-level data security system.
1.3.3 Development operation and Maintenance
1) Operation and maintenance management
- Operation and maintenance management of real-time data processing has always been a pain point. DBus and Wormhole provide visual operation and maintenance management capability through visual UI, making operation and maintenance easier.
- DBus and Wormhole provide RESTful services such as health check, operation management, Backfill and Flow drift, which can be used to develop automatic operation and maintenance systems.
2) Monitoring and warning
- Both DBus and Wormhole provide visual monitoring interfaces, allowing you to view logical table-level throughput and latency information in real time.
- DBus and Wormhole provide RESTful services such as heartbeat, Stats, and status, which can be used to develop automated warning systems.
Second, mode and scene discussion
In the previous chapter, we introduced the design architecture and functional features of the various technical components of the RTDP architecture. So far, readers have a concrete understanding of how the RTDP architecture is implemented. So what common data application scenarios can the RTDP architecture address? Below we explore several usage patterns and what requirements scenarios they fit into.
2.1 Synchronization Mode
2.1.1 Mode Description
In synchronous mode, only real-time data synchronization is configured between heterogeneous data systems and no processing logic is performed on streams.
Specifically, DBus is configured to extract data from data sources in real time and put it into Kafka, and Wormhole is configured to write data from Kafka to Sink storage in real time. Synchronous mode provides two main capabilities:
- The subsequent data processing logic is not executed on the standby service database, reducing the pressure on the standby service database
- Provides the possibility to synchronize data from different physical service standby databases to the same physical data store in real time
2.1.2 Technical difficulties
The concrete implementation is relatively simple.
The IT implementer does not need to know much about the common problems of stream processing, and does not need to worry about the design and implementation of the on-stream processing logic implementation, just needs to understand the basic flow control parameter configuration.
2.1.3 Operation and Maintenance Management
Operation and maintenance management is relatively simple.
It needs people to operate it. However, since there is no processing logic on the flow, it is easy to control the flow velocity, without considering the power consumption of the processing logic on the flow itself, a relatively stable synchronous pipeline configuration can be given. It is also easy to make regular end-to-end data comparisons to ensure data quality because the data on the source and target ends are identical.
2.1.4 Application Scenarios
- Data is shared across departments in real time
- The transaction database is decoupled from the analysis database
- Support data warehouse real-time ODS layer construction
- User self-service real-time simple report development
- , etc.
2.2 Stream computing Mode
2.2.1 Mode Description
Stream computing mode refers to the usage mode of configuring the processing logic on a stream based on the synchronous mode.
In the RTDP architecture, the configuration and support of processing logic on a stream is primarily done on the Wormhole platform. In addition to the capabilities of synchronous mode, streaming mode provides two main capabilities:
- On-stream computing disperses the power consumption of batch computation into incremental continuous power consumption on the stream, which greatly reduces the time delay of result snapshot
- On-stream computing provides a new computing portal (Lookup) for mixing across heterogeneous systems
2.2.2 Technical difficulties
Implementation is relatively difficult.
Users need to understand what on-stream processing can and can do, how to convert full computing logic to incremental computing logic, and so on. The power consumption of the processing logic on the flow itself and the dependent external data system should be considered to adjust and configure more parameters.
2.2.3 Operation and maintenance management
Operation and maintenance management is relatively difficult.
It needs people to operate it. However, it is more difficult than synchronous mode o&M management, mainly reflected in the configuration of flow control parameters, the inability to support end-to-end data comparison, the need to select the final consistency policy of result snapshot, the need to consider the Lookup time alignment policy on the flow and other problems.
2.2.4 Application Scenarios
- Data application projects or reports with high requirements for low latency
- Requires low latency to invoke external services (such as external rule engine invocation on stream, online algorithm model usage, and so on)
- Support data warehouse real – time fact table + dimension table wide table construction
- Real-time multi-table fusion, splitting, cleaning, and standardization Mapping scenarios
- , etc.
2.3 Rotation mode
2.3.1 Mode Description
Rotation mode refers to the use mode of streaming to batch and batch to stream on the basis of real-time data falling in the database. After further calculation of short-time tasks in the database, the results are put into Kafka again and run on the next turn for calculation.
In the RTDP architecture, Kafka->Wormhole->Sink->Moonbox->Kafka can be integrated to achieve any cycle and any frequency of rotation calculation. On top of the capabilities of the flow mode, the main capability provided by the rotation mode is to theoretically support any complex flow calculation logic with low latency.
2.3.2 Technical difficulties
The concrete implementation is difficult.
The introduction of Moonbox to Wormhole capability further increases the number of variables to be considered, such as the choice of multiple sinks, the frequency setting of Moonbox calculation, and the division of calculation between Wormhole and Moonbox.
2.3.3 Operation and Maintenance Management
Operation and maintenance management is difficult.
It needs people to operate it. More data system considerations, more parameter tuning, and more difficult data quality management and diagnostic monitoring are required than in the streaming mode.
2.3.4 Application Scenarios
- Low latency multi-step complex data processing logic scenarios
- Company – level real-time data flow processing network construction
2.4 Intelligent Mode
2.4.1 Mode Description
Intelligent mode refers to the use of rules or algorithmic models for optimization and efficiency.
Points that can be intelligent:
- Intelligent drift of Wormhole Flow (intelligent automatic Operation and Maintenance)
- Intelligent Optimization of Moonbox Prediction (Intelligent automatic Tuning)
- Full computing logic intelligently converted to streaming computing logic and then deployed in Wormhole + Moonbox
- , etc.
2.4.2 Technical difficulties
Implementation is the simplest in theory, but effective technical implementation is the hardest.
Users only need to complete offline logic development, and the rest of the development, deployment, tuning, operation and maintenance by intelligent tools.
2.4.3 Operation and Maintenance Management
Zero operations.
2.4.4 Application Scenarios
The whole scene.
Since then, our discussion on “how to design real-time data platform” has come to an end. We moved from a conceptual background to architectural design, then introduced technical components, and finally looked at pattern scenarios. Since each of the topics involved here is very large, this article has only made a superficial introduction and discussion. Later, we will discuss a specific topic in detail from time to time, and present our practice and experience to draw inspiration from others. If you are interested in the four open source platforms in the RTDP architecture, please find us on GitHub to learn about their use and exchange suggestions.
Author: Lu Shanwei
Source: Creditease Institute of Technology