I. Background
A large amount of an enterprise's business data is stored in the databases of its various business systems. In the past, several approaches were used to synchronize this data:
- Each data consumer extracts the data it needs during off-peak hours (drawbacks: duplicated extraction and inconsistent data)
- A unified data warehouse platform extracts data from each system via Sqoop (drawback: poor timeliness, typically T+1)
- Incremental changes are captured via triggers or timestamps (drawbacks: high intrusiveness on the business side, performance overhead, etc.)
None of these solutions is perfect. After studying the different approaches, we concluded that a more reasonable solution is a log-based one that offers message subscription to downstream systems, addressing data consistency and real-time delivery at the same time.
The DBus (data bus) project was born out of this demand. DBus focuses on data collection and real-time stream processing: through simple, flexible configuration, it captures source data non-intrusively and, using a highly available stream computing framework, gathers the data produced by the business processes of all of the company's IT systems. After conversion, the data takes a unified JSON form (UMS) that different data consumers can subscribe to and consume. DBus thereby acts as the data source for the data warehouse platform, big data analysis platforms, real-time reporting, and real-time marketing.
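To make the format concrete, below is a minimal sketch of what a UMS increment message might look like, written as a Python dict for readability. The envelope and field names (schema, payload, ums_id_, ums_ts_, ums_op_) follow the conventions described in this article, but the exact layout here is illustrative rather than authoritative; consult the DBus documentation for the precise protocol.

```python
# A sketch of a UMS (Unified Message Schema) increment message, expressed as
# a Python dict. Field names such as ums_id_/ums_ts_/ums_op_ mirror the
# conventions described in this article; the exact envelope is illustrative.
ums_message = {
    "schema": {
        "namespace": "mysql.db1.schema1.table1",     # source identity (illustrative)
        "fields": [
            {"name": "ums_id_", "type": "long"},      # monotonically increasing id; guarantees ordering
            {"name": "ums_ts_", "type": "datetime"},  # event timestamp
            {"name": "ums_op_", "type": "string"},    # operation: i(nsert), u(pdate), d(elete)
            {"name": "id",      "type": "long"},      # business columns follow
            {"name": "name",    "type": "string"},
        ],
    },
    # One message may batch several rows; each row is a "tuple" whose values
    # align positionally with the field list above.
    "payload": [
        {"tuple": [1234567890001, "2019-01-01 12:00:00.000", "i", 42, "alice"]},
        {"tuple": [1234567890002, "2019-01-01 12:00:01.000", "u", 42, "bob"]},
    ],
}

# Consumers can rely on ums_id_ to re-establish order after parallel processing.
rows = sorted(ums_message["payload"], key=lambda r: r["tuple"][0])
```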
II. System Architecture and Working Principle
DBus is divided into two parts: source data collection and multi-tenant data distribution, connected to each other through Kafka. Users who do not need multi-tenant resource and data isolation can consume the output of the source data collection layer directly from Kafka, without configuring multi-tenant data distribution.
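As an illustration of consuming the source-collection output directly, the sketch below reads UMS messages from a Kafka topic using the kafka-python client. The topic name and broker address are placeholders; real topic names depend on how data lines are configured in DBus.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic name and broker address are placeholders; in a real deployment the
# topic corresponds to a DBus data line configured in the web console.
consumer = KafkaConsumer(
    "dbus.db1.schema1.table1",               # hypothetical source-collection topic
    bootstrap_servers="kafka-broker:9092",
    group_id="my-downstream-app",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    ums = record.value
    for row in ums.get("payload", []):
        print(row["tuple"])                  # process each UMS tuple
```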
2.1 DBus Source Data Collection
DBus source data collection generally falls into two parts:
- Reading RDBMS incremental logs to obtain incremental data in real time, with support for full pulls.
- Capturing data in real time with tools such as Logstash, Flume, and Filebeat, and structuring it through visually configured rules.
The implementation principle and main modules are as follows:
- Log capture module: reads incremental logs from an RDBMS standby (replica) database and synchronizes them to Kafka in real time.
- Incremental conversion module: converts incremental data into UMS format in real time, handling schema changes, data masking, etc.
- Full extraction module: pulls the full data set from the RDBMS standby database and converts it to UMS format.
- Log operator processing module: structures log data from the different capture agents according to operator rules (see the sketch after this list).
- Heartbeat monitoring module: for RDBMS sources, periodically inserts heartbeat data at the source end, monitors it along the pipeline, and sends alert notifications; for log sources, monitoring and alerting happen directly at the processing end.
- Web management module: manages all of the modules above.
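The log operator module is configured through the web UI rather than in code, but the following sketch conveys the idea: an operator rule (here, a regex extractor written in Python) turns a raw log line into structured fields. The pattern and field names are made up for illustration.

```python
import re
from typing import Optional

# A hypothetical operator rule: extract timestamp, level, and message from a
# raw application log line. In DBus such rules are configured visually in the
# web console; this regex is purely illustrative.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<message>.*)$"
)

def structure_log_line(line: str) -> Optional[dict]:
    """Apply the operator rule; return structured fields, or None if no match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

print(structure_log_line("2019-01-01 12:00:00 [ERROR] payment service timeout"))
# {'ts': '2019-01-01 12:00:00', 'level': 'ERROR', 'message': 'payment service timeout'}
```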
2.2 Multi-tenant Data Distribution
When different tenants need different access permissions and masking treatments for the source data, a Router distribution module is introduced. Based on the configured permissions, the source tables each tenant is entitled to read, and the applicable masking rules, it delivers the source data to the Topics assigned to tenants. Introducing this layer adds user management, sink management, resource allocation, masking configuration, and related functions to the DBus management system; different projects then consume from the Topics assigned to them.
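The sketch below captures what the Router's core decision amounts to: given per-tenant configuration (permitted source tables, assigned Topic, masking rules), each incoming row is either rejected or forwarded to the tenant's Topic with masking applied. The configuration structure and names here are hypothetical; DBus manages all of this through its web console.

```python
from typing import Optional

# Hypothetical per-tenant routing configuration; in DBus this lives in the
# management system (user management, sink management, resource allocation,
# masking configuration), not in application code.
TENANT_CONFIG = {
    "tenant_a": {
        "allowed_tables": {"db1.schema1.orders"},
        "output_topic": "tenant_a.orders",
        "mask_columns": {"customer_phone"},
    },
}

def route(tenant: str, table: str, row: dict) -> Optional[tuple]:
    """Return (topic, masked_row) for a permitted table, else None."""
    cfg = TENANT_CONFIG[tenant]
    if table not in cfg["allowed_tables"]:
        return None  # tenant has no permission on this source table
    masked = {
        col: ("***" if col in cfg["mask_columns"] else val)
        for col, val in row.items()
    }
    return cfg["output_topic"], masked

print(route("tenant_a", "db1.schema1.orders",
            {"order_id": 1, "customer_phone": "13800000000"}))
# ('tenant_a.orders', {'order_id': 1, 'customer_phone': '***'})
```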
Main functions:
- Non-intrusive access to multiple data sources: real-time incremental changes are obtained by reading the database system logs, with no modification to the business systems. Currently the MySQL and Oracle RDBMS sources are supported (for Oracle data sources, please refer to the relevant Oracle protocols), along with log extraction schemes based on Logstash, Flume, and Filebeat.
- Real-time transmission of massive data: a Storm-based stream computing framework delivers second-level latency, and the absence of any single point of failure in the pipeline guarantees high availability.
- Multi-tenancy: provides user management, resource allocation, Topology management, tenant table management, and related functions. Tenants can be granted different access permissions to source table data and have different masking rules applied according to their requirements, achieving multi-tenant resource isolation and differentiated data security.
- Schema change awareness at the source end: when a schema change occurs at the source, DBus automatically detects it, adjusts the UMS version, and notifies downstream consumers through Kafka messages and email.
- Real-time data masking: specified columns can be masked in real time as required. Masking strategies include direct replacement, hash algorithms such as MD5 and Murmur, salting, regular-expression replacement, and more. Users can also supply their own JAR packages to implement custom masking policies not covered by DBus (see the sketch after this list).
- Initial load: supports efficient initial loading and reloading to any specified output Topic, responding flexibly to customer needs.
- Unified, standardized message protocol: data is output in the unified UMS (JSON) message schema format; a per-row ums_id guarantees ordering, and INSERT, UPDATE (with before/after images), and DELETE events are emitted (see the UMS example in the background section).
- Reliable multi-channel message distribution: Kafka stores and delivers the messages, making multi-consumer subscription reliable and convenient.
- Partitioned table/series table data collection: data from a partitioned table can be collected into one logical table, and user-defined series of tables can likewise be aggregated into a single "logical table".
- Real-time monitoring and alerting: a visual monitoring system shows the real-time throughput and latency of each data line at any time; when a data line becomes abnormal, the system automatically sends emails or SMS messages to the responsible person according to the configured policy.
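To make the masking strategies concrete, the sketch below implements three of the techniques mentioned above in Python: salted MD5, Murmur hashing (via the third-party mmh3 package), and regex replacement. The salt and patterns are placeholders; in DBus these are set through masking configuration, and custom strategies ship as user-developed JARs.

```python
import hashlib
import re

import mmh3  # pip install mmh3; third-party Murmur hash implementation

SALT = "my-secret-salt"  # placeholder; real salts come from masking config

def mask_md5(value: str) -> str:
    """Salted MD5: irreversible but stable, so joins across tables still work."""
    return hashlib.md5((SALT + value).encode("utf-8")).hexdigest()

def mask_murmur(value: str) -> int:
    """Murmur hash: fast non-cryptographic hashing for low-sensitivity columns."""
    return mmh3.hash(SALT + value)

def mask_regex(value: str) -> str:
    """Regex replacement: keep the first 3 and last 4 digits of a phone number."""
    return re.sub(r"(?<=^\d{3})\d+(?=\d{4}$)", "****", value)

print(mask_md5("13800000000"))
print(mask_murmur("13800000000"))
print(mask_regex("13800000000"))  # 138****0000
```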
Open source: github.com/BriData/DBu…