Introduction: An in-depth look at core technologies such as real-time data flow and database-warehouse integrated data processing

The “Juyun Cloud · Intelligent Future” Alibaba Cloud Database Innovation on the Cloud Summit and the final award ceremony of the 3rd Database Performance Challenge have concluded successfully. For more in-depth content, watch the live replay of the summit.

Summit live playback 📎 developer.aliyun.com/live/247301

Presentation (PPT) download 📎 developer.aliyun.com/topic/downl…

1. Core pain points

With the explosion of data volume, the rapid rise of enterprise demand for data applications, and the rapid development of data technology, databases and data warehouses have become tightly linked systems. Whether an enterprise follows the Inmon model or the Kimball model, it depends heavily on a variety of databases and data warehouses, on the data integration and processing products that connect them, and on data asset management and security. Even the simplest enterprise reporting requirement needs this whole set of core components to work closely together, to say nothing of more complex requirements.

At the same time, we can see the market trend. According to Gartner, enterprise data integration and analysis are shifting from offline to real time: real-time data is expected to account for 30% of all data by 2025, and more than 50% of new business systems were expected to adopt real-time analysis in 2022.

In current practice, the data warehouse construction process is long and complicated, which conflicts with enterprises' desire to digitalize simply and quickly. The main problems include:

  • Data silos: many kinds of databases and logs coexist, without best practices for design, development, and maintenance.
  • Data interconnection is difficult; there is no simple and reliable data channel, especially a real-time one.
  • ETL pipelines are complicated and insufficiently real-time, and the technology stack is large and hard to maintain.
  • Data governance capabilities are lacking and security problems are prominent.

2. Solutions

To address these problems and the market trend toward cloud native + real time, we launched the new one-stop data management product DMS (DMS + DTS), built on Alibaba's experience over the past 10 years in database and data warehouse construction and real-time development. It provides capabilities including global data assets, database design and development, data integration, data development, data visualization, and data services. With Alibaba's data development specifications and data security best practices built in, DMS helps enterprises produce, process, and consume data efficiently, agilely, securely, and reliably. This article gives a technical interpretation of its key capabilities.

3. Key capabilities

The following sections describe real-time data stream technology, database-warehouse integrated data processing, distributed capabilities, and asset management and security.

Real-time data stream technology

1. Background

Data integration is the foundational capability for building a data warehouse. It includes full (existing) data extraction, incremental data extraction, and data transformation (cleaning / conversion / enrichment). Full data extraction is relatively simple: data is pulled and stored directly through the standard interfaces provided by the database, file system, or third-party application. Data transformation can be divided into single-record processing, aggregation, multi-stream joins, and other types; it focuses on transforming the data itself and applies to both full and incremental data. Incremental data extraction focuses on real-time performance and non-intrusiveness to the business: although it is tightly coupled to the internals of the data source, the goal is to capture changes with minimal impact on the source and minimal application cost, and to store the captured changes in real time. The powerful real-time data stream technology of the new DMS comes from DTS. This chapter focuses on incremental data capture technology and uses DMS as an example to illustrate the overall technical picture and core points of the real-time data stream.

2. Technical approaches to incremental data capture

Real-time data stream technology can be divided into four categories according to how incremental data is captured: pull based on business fields, trigger capture, Change Data Capture (CDC), and WAL parsing.

2.1 Pull based on business fields

If the data source has a timestamp-like field, incremental data can be extracted through periodic queries. The most common such field is a modification-time attribute, such as modify_time. The following is an example of an incremental pull based on this field:
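A minimal sketch of such a periodic pull, assuming a business table orders with a modify_time column and a checkpoint maintained by the extraction job (all names and parameters here are illustrative, not DMS internals):

```sql
-- Pull rows changed since the last checkpoint (illustrative schema).
-- :last_checkpoint and :now are bind parameters kept by the extraction job.
SELECT id, order_status, amount, modify_time
FROM orders
WHERE modify_time > :last_checkpoint
  AND modify_time <= :now;
-- After a successful load, advance the checkpoint to :now and repeat on a schedule.
```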

The advantage of this approach is that it is simple to implement; the drawbacks are that it depends heavily on the business schema, the incremental information it can capture is limited, and latency is only at the level of minutes or hours.

2.2 Trigger capture

Trigger capture means that data changes are recorded into a delta table through the data source's own trigger mechanism, and incremental extraction obtains the delta data by periodically querying the delta table, as shown below:
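As a hedged illustration in MySQL syntax (the table, column, and trigger names are made up for this sketch), a trigger on the source table records changes into a delta table that the extractor then polls:

```sql
-- Delta table recording which rows changed and how (illustrative).
CREATE TABLE orders_delta (
  delta_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
  id          BIGINT   NOT NULL,
  op          CHAR(1)  NOT NULL,      -- 'I' / 'U' / 'D'
  change_time DATETIME NOT NULL
);

-- Record every update on the source table into the delta table.
CREATE TRIGGER orders_after_update
AFTER UPDATE ON orders
FOR EACH ROW
  INSERT INTO orders_delta (id, op, change_time)
  VALUES (NEW.id, 'U', NOW());

-- The incremental extractor polls the delta table periodically:
SELECT delta_id, id, op, change_time
FROM orders_delta
WHERE delta_id > :last_delta_id;
```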

This approach captures every data change, but it is intrusive to the source database, and real-time performance is hard to guarantee.

2.3 CDC

Change Data Capture (CDC) is a capability provided by the data source itself and can be enabled at the table level. For a table with CDC enabled, the data source generates a CDC table based on the source table to record incremental changes to it. Incremental extraction retrieves incremental data by periodically querying the CDC table.
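As a rough SQL Server example (the schema, table, and business column names are assumptions; the CDC procedures and system columns are standard), enabling CDC on a table and polling its change table looks roughly like this:

```sql
-- Enable CDC for the database, then for one table (dbo.orders is illustrative).
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'orders',
     @role_name     = NULL;

-- SQL Server generates a change table (cdc.dbo_orders_CT by default).
-- The extractor polls it for changes newer than the last consumed LSN
-- (@last_lsn is a placeholder the extractor would maintain).
SELECT __$start_lsn, __$operation, id, order_status, amount
FROM cdc.dbo_orders_CT
WHERE __$start_lsn > @last_lsn;
```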

CDC is very similar to trigger capture on the whole; the core difference is that the CDC table is populated by mining and parsing the WAL, which performs better than triggers. Common commercial databases today provide CDC (e.g., Oracle/DB2/SQL Server).

CDC technology has the advantage of being efficient and simple, but has poor support for multiple tables and DDL.

2.4 WAL parsing

WAL parsing depends heavily on the data source: the WAL implementation differs greatly from database to database, yet at the generic database level the technical form is quite uniform. The following figure shows its basic form:

Incremental extraction obtains the master's WAL logs through the master/slave replication protocol and parses them.

WAL parsing has the advantages of real-time performance and zero intrusion on the data source and the business. However, it is technically difficult to implement: for each data source, you need a deep understanding and mastery of its WAL format.
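Taking MySQL's binlog as an example (the concrete checks differ per database, and the binlog file name below is illustrative), a parser typically verifies the log format and picks a starting position before connecting with the replication protocol:

```sql
-- The binlog must be in ROW format (ideally with full row images) for change capture.
SHOW VARIABLES LIKE 'binlog_format';      -- expect: ROW
SHOW VARIABLES LIKE 'binlog_row_image';   -- expect: FULL

-- Current binlog file and position: a replication-protocol client starts reading here.
SHOW MASTER STATUS;

-- Events can also be inspected directly for debugging (file name is illustrative).
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;
```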

3. Key technical points of real-time data flow

The move from the traditional T+1 data warehouse to the real-time data warehouse reflects the business's growing need for real-time data, and on the technical path the real-time data stream is the only way to get there. The industry has not yet formed a standard for building real-time data streams, so here we illustrate with the practical experience of DMS.

3.1 Real-time performance is the core of the real-time data stream

Here, real-time means second-level latency. Only when the real-time data stream reaches this level of latency can a real-time data warehouse built on top of it achieve minute-level or even second-level data analysis, support real-time business decisions, and maximize the value of data.

To achieve second-level latency, DMS chose WAL parsing as its technical route and has met the real-time requirement on many databases this way. Reaching second-level real-time performance requires breakthroughs on many technical points; based on DMS's practical experience, they are summarized in the figure below.

DReader is the incremental capture module in DMS, and DWriter is the incremental data writing module. To meet the real-time requirement, DReader achieves millisecond-level latency through concurrent parsing, pre-image derivation, and other techniques, while DWriter achieves millions of records per second (RPS) of writes through concurrent writing, distributed scaling, and hotspot data merging. End to end, the real-time data stream has second-level latency.

3.2 Stability is a necessary condition for large-scale application of real-time data stream

High performance and real-time behavior are the characteristics and advantages of the real-time data stream, but stability determines whether a technology can be applied in core scenarios and promoted at scale. To meet industrial-grade stability requirements, the DMS real-time data stream is optimized in incremental data capture, incremental data writing, and other areas.

This comes from a deep understanding and mastery of each database kernel, and it has withstood the test of the Double 11 scenario for many years. For commercial data sources, the DMS real-time data stream has built complete test scenarios covering data types, DML, and more for Oracle/DB2/SQL Server, and it supports tens of thousands of commercial database instances on public and private clouds. This section uses an optimization DMS made for the SQL Server data source as an illustration.

Tables in SQL Server are divided into clustered index tables and heap tables, and some data updates on heap tables do not record complete VLF log entries. The master/slave replication protocol offers the highest performance, but because it is based on VLF logs it cannot obtain complete records for heap tables, and looking the rows back up in the source may return incorrect data. To balance stability and performance, DMS introduced, for the first time in its real-time data stream, a hybrid mode of WAL parsing plus CDC, as shown in the following figure:

The master/standby replication protocol handles incremental log pulling for more than 80% of tables, and the remaining special tables (such as SQL Server heap tables) use CDC mode. This hybrid mode ensures stability while still providing high real-time performance.
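A sketch of how those special tables could be identified on SQL Server (a heap is a table whose entry in sys.indexes has index_id = 0); the routing itself is internal to DMS, so this query only illustrates the idea:

```sql
-- List heap tables: route these through CDC, everything else through log parsing.
SELECT s.name AS schema_name, t.name AS table_name
FROM sys.tables  AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.indexes AS i ON i.object_id = t.object_id
WHERE i.index_id = 0;   -- index_id = 0 means the table is a heap
```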

3.3 DDL adaptation

DDL adaptation is essential for long-term stable operation in production. By parsing and mapping DDL from the data source, the DMS real-time data stream not only keeps itself running stably but also excludes from the stream DDL statements (such as DROP) that are not suitable for the data warehouse scenario.

Database-warehouse integrated data processing

1. Background

In the development of data storage there are two main camps. One is the database camp, which stores the online data of the business and guarantees the strong reliability and stability of online services through the ACID properties of transactions. The other is the data warehouse camp, which stores a mirror of the online business data and mines, analyzes, and models it through a series of big data components.

These two domains are usually supported by two different sets of methodologies and system components, and they are connected by data integration or data ETL (extract-transform-load), within which the data flows.

Figure 1. Data flow between databases and data warehouses

There is a very large need for data exchange between these two domains. Data Transmission Service (DTS) supports migration, subscription, and real-time synchronization of various databases, and over the past 6 years it has supported a large number of data integration links from databases to data warehouses. This year, DTS and DMS come together as a new product, DMS 5.0, which integrates data management, data integration, data ETL, and other capabilities into a one-stop online data processing platform and can provide materialized-view-like capability from database to warehouse.

This section describes the new features in the data processing part of DMS 5.0.

In the database-warehouse integration scenario there are usually two types of data processing requirements: the first is building the link from the database to the real-time data warehouse, and the second is the T+1 processing scenario of the offline data warehouse. They are introduced in turn below.

2. Database-warehouse computing scenarios

2.1 Scenario 1: Real-time data ETL

Transferring data from a database to a data warehouse requires configuring a DTS data integration link. The new DMS 5.0 not only supports the original data integration link but also provides new data ETL capability: data is pre-processed during transmission, including common field filtering, field transformation, stream-to-stream table joins, stream-to-dimension-table joins, and other complex ETL functions.

Currently, these capabilities can be configured on a canvas by dragging and dropping, for example:

1. Direct data synchronization

Input and output tables are available in the canvas; drag them onto the canvas to configure a data synchronization link.

Figure 2. ETL Canvas configuration synchronization

2. Data filtering operation

Insert a field calculator between the source table and the target table.

Figure 3. ETL Canvas configuration field calculation

3. Stream table and dimension table join

The two source tables in the figure are MySQL tables. The left input is the stream table, whose change stream is pulled by collecting the binlog; the right input is the dimension table, which is joined directly on fields of the left stream table. In this scenario, records in the stream are usually enriched with fixed dimension-table fields and written to the target table.

Figure 4. ETL canvas: configuring a stream-dimension table join

4. Stream table and stream table join

The two source tables in the figure are MySQL tables, and both the left and right inputs are stream tables. Their binlogs are pulled and the two streams are joined within a certain time window.

Figure 4. ETL canvas: configuring a join between stream tables

The nodes in the canvas include input nodes, transformation nodes, and output nodes. Each node has its own configurable parameters, corresponding to different databases, tables, fields, and operators, and the set of nodes is continually being enriched.

Behind the canvas, the configured computation is converted into Flink SQL statements and submitted directly as a Flink job, forming a running stream-computing task. This establishes a streaming ETL link between the database and the data warehouse.
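As a hedged sketch of what such generated SQL might look like (the actual statements produced by DMS are not shown in this article, and all table and column names below are invented), the stream-to-dimension-table join of the canvas example would translate into Flink SQL roughly as follows:

```sql
-- Stream table fed by the MySQL binlog, joined with a MySQL dimension table,
-- and written into an AnalyticDB result table (all names are illustrative).
-- mysql_orders is assumed to declare a processing-time attribute: proc_time AS PROCTIME().
INSERT INTO adb_order_wide
SELECT o.order_id,
       o.amount,
       u.user_name,
       u.city
FROM mysql_orders AS o
JOIN mysql_user_dim FOR SYSTEM_TIME AS OF o.proc_time AS u  -- lookup (dimension) join
  ON o.user_id = u.user_id;
```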

In the cases above we use MySQL, the most common choice, as the source database and AnalyticDB as the target, forming a seamless connection between database and warehouse and supporting rich computation over the data.

2.2 Scenario 2: Data T+1 Snapshot

2.2.1 Traditional data warehouse scheme

In traditional data warehouses there is a strong demand to create daily or hourly snapshots of data, so that the data as of some moment in the past can be analyzed retrospectively.

There are usually two ways to implement snapshot tables:

1. Full snapshot: extract the full data every day or hour and retain it as that period's snapshot.

2. Zipper table: use a zipper table (a history table that records each record version's validity interval) as an intermediate transition table, and then restore the snapshot of a given hour or day from it.

The first approach is the simplest (a sketch follows below), but each run involves a large amount of computation and the cumulative storage overhead in the warehouse is very large, so it is used only for small data volumes; the zipper table is usually used for massive data.
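A minimal sketch of the first approach on a Hive/ODPS-style warehouse (table, column, and partition names are illustrative): the whole source table is copied into a new partition each period, which is why both compute and storage costs grow quickly.

```sql
-- Approach 1, full snapshot: copy the entire table into today's partition.
INSERT OVERWRITE TABLE dw.user_info_snapshot PARTITION (ds = '20211001')
SELECT user_id, user_name, phone, city, modify_time
FROM ods.user_info;
```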

The zipper table applies to the following scenarios:

1. The table's data volume is very large; for example, a user information table with on the order of 100 million records and about 50 fields.

2. The proportion and frequency of changes to records in the table are not large; for example, less than 10% of the data changes each day.

Traditional data warehouse zipper tables use offline data stores such as Hive or ODPS as the target storage. Data is usually partitioned by hour or day, and computing tasks are also scheduled by hour or day.

The traditional data warehouse scheme can only generate the zipper table at hour or day granularity, the scheduled extraction puts periodic read pressure on the source database, and data freshness is too low to satisfy business scenarios such as quickly generated reports over live data and real-time business decisions, which may require minute-level or even second-level reporting.

2.2.2 DMS 5.0 arbitrary snapshot capability

The arbitrary snapshot capability of DMS 5.0 is built on the real-time data flow and processing technology of DTS, the scheduling and DDL processing capability of DMS, and the real-time data warehouse capability of ADB, and it can generate day-level, hour-level, minute-level, and even second-level snapshots and reports:

Figure 5. DMS5.0 zipper table T+1 snapshot architecture

Zipper table

The zipper table mainly involves changes to the ETL part of the kernel. Within a specific link, the ETL maintains modify_time (a fixed field in the actual table, Dts_SYNC_MODIFY_time) and end_time (a fixed field in the actual zipper table, Dts_SYNC_END_time), which mark when the current record version was modified and whether it has expired.

Creating the zipper table: add the modify_time and end_time columns to the original schema, with the primary key being the original primary key plus modify_time.

Operations on the original table correspond to operations on the zipper table as follows:

An example of how the zipper table is used:
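The following hedged sketch (the table name and timestamps are illustrative; the fixed fields Dts_SYNC_MODIFY_time and Dts_SYNC_END_time come from the description above) restores the snapshot of a table as of a given moment from the zipper table: a record version is visible at time T if it was modified no later than T and had not yet expired at T.

```sql
-- Restore the snapshot as of 2021-10-01 10:00:00 from the zipper table.
-- Assumption: an expired version has dts_sync_end_time set to when it was superseded,
-- while the current version carries a far-future end time.
SELECT *
FROM orders_zipper
WHERE dts_sync_modify_time <= '2021-10-01 10:00:00'
  AND dts_sync_end_time    >  '2021-10-01 10:00:00';
```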

The snapshot table

The snapshot table mainly relies on fixed scheduled tasks in DMS task scheduling, at day level and hour level. The table lifecycle is configurable and relies on ADB's own data expiration mechanism.

Whether a snapshot can be created depends on a pre-check of the zipper table: for example, checking whether the write progress in the zipper table has passed the current hour, i.e. whether the latest records' modify_time and end_time have passed the current hour. If the check passes, the task runs immediately; otherwise it waits, up to a fixed timeout such as 10 minutes.

2.2.3 Comparison of the DMS 5.0 zipper table scheme with the traditional data warehouse scheme

3. Summary

DMS 5.0 puts forward the concept of database-warehouse integration. It has core technical advantages in DDL synchronization, integrated database-warehouse link management, fine-grained snapshot table support, visual ETL processing, and other aspects.

Beyond the traditional database-to-data-warehouse integration link, DMS 5.0 has formed a cross-database/warehouse data ETL link and T+1 snapshot processing backed by the zipper table. Once databases and data warehouses each reach a certain maturity, the fusion of the two will bring new user value and new changes to data storage, computation, processing, and analysis. Let's wait and see!

Distributed capability

1. Background

The new DMS supports migration, subscription, and real-time synchronization of data sources such as relational databases, NoSQL, and big Data (OLAP) through the data transfer service DTS.

In the past, distributed databases mostly appeared as the target in data migration and synchronization scenarios, for example when a standalone database could not meet performance requirements and was synchronized to a distributed database. In recent years, with the booming development of distributed databases, they appear more and more often as the source: Lindorm, PolarDB-X, MongoDB cluster edition, Redis cluster edition, and various sharding-middleware products. A typical example is synchronizing a 32-node PolarDB-X (formerly DRDS) to AnalyticDB for analysis.

Synchronizing or subscribing to a distributed database as the source presents many new challenges:

  • Distributed databases are diverse; how can they be connected in as standardized a way as possible?
  • How can a distributed synchronization task be set up with one click and operated simply and efficiently?
  • Distributed databases hold large amounts of data, both full and incremental; how can full migration be completed quickly while incremental data stays synchronized in real time?
  • How should the database/table structure and shard structure of a distributed instance be migrated?

This part introduces DTS's thinking and practice in connecting to distributed databases, and how a series of mechanisms ensures the overall stability of the link.

2. Building standardized connectivity for distributed data sources

It is important for DTS to support data synchronization and subscription for a given scenario or data source, but it is more important to build a standardized framework so that the vast majority of data sources can be connected conveniently and efficiently through it, while performance, operation, and maintenance are also taken into account, just as we did when building the AnyToAny framework.

2.1 Distributed data source synchronization

Let’s first look at the DTS synchronization link for non-distributed data sources. Why can’t it be used directly?

As shown in the figure above, in a DTS task every module from pre-check to incremental migration is a standalone unit. Its computing, storage, and network performance is limited by a single machine, which cannot meet the data synchronization and storage requirements of distributed databases such as PolarDB-X.

There are other problems as well; for example, many distributed databases do not have a unified binlog, so DTS must be able to fetch the binlog data in a distributed way.

When a distributed database is the source, the original DTS link is split into a parent-child task model: the parts that used to be performance bottlenecks, the incremental data service (reading + storage), full migration, and incremental migration, are pushed down into child tasks, while modules with little performance overhead, such as structure migration, are completed in the parent task. In theory, a standalone RDS MySQL instance corresponds to a standalone set of DTS modules, so as long as the machines are comparable, performance is not a bottleneck.

The PolarDB-X 1.0 example above does not represent all types of distributed data sources. What about the others?

Databases such as HBase, Tablestore, Lindorm, MongoDB, MyCat, and Redis also have their own partitioning or sharding models. DTS provides interfaces through which each data source supplies its own split logic, and each split maps to a DTS child task.

2.2 Distributed Scheduling

After the DTS link is split, it becomes parent-child tasks, and each task and each module within the task need to be executed in strict order. The Master module of DTS is responsible for scheduling the split link from left to right, and the parent-child tasks and their sub-modules will be executed alternately. This completes a synchronization or migration of a distributed data source to another database.

Compared with non-distributed DTS task scheduling, the main difficulty of distributed task scheduling lies in the coordination between multiple links. For example, the child task must wait for the parent task to complete the post-structure migration before performing incremental migration.

2.3 Distributed DDL synchronization

For DDL (Data Definition Language), synchronization from a single-instance database to another single-instance database is relatively simple for DMS: the DDL is synchronized directly to the destination instance (online DDL, of course, needs special handling). DDL synchronization for distributed data sources, however, is one of the harder problems. Take PolarDB-X 1.0 as an example: when a column is dropped at the source, the drop is executed on every underlying RDS instance (horizontal sharding). Because the incremental migration modules of the DTS child tasks run in parallel, their progress is not exactly aligned. Without special handling, the following can happen: child task 1 runs ahead and drops the column on the target, while child task 2 lags behind and is still writing to that column, so the synchronization finally fails.

One way to handle this is to coordinate the DTS child tasks: when a DDL needs to be synchronized, only one child task applies it. That child task waits until the same DDL has arrived in all the other child tasks, then applies the DDL to the target table, and only after that do all the child tasks continue with the subsequent synchronization.

2.4 Database/table mapping

Some distributed databases, such as PolarDB-X 1.0 and MyCat, store data in sharded databases and tables. When DTS reads such a database, it reads the physical tables in the physical instances; after synchronization to the target, the physical databases and tables must be mapped to their logical counterparts. This mapping has to be done in the structure migration, full migration, and incremental migration steps.

However, the processing differs across data sources and can be abstracted into two categories. In the first, the DTS kernel module obtains the mapping relationship in real time while the data stream is being synchronized; this is the ideal case because it is compatible with database/table changes during synchronization. In the second, the mapping relationship is obtained and stored when the child tasks are generated, and the data stream is mapped according to the stored relationship during execution; this suits scenarios where the mapping is not easy to read while processing the data stream, such as DMS logical databases.

2.5 Modifying synchronization objects

“Synchronization objects” are the databases and tables to be synchronized, and modifying them during synchronization is a common, hard requirement. When the synchronization objects of a DTS distributed task are modified, new child tasks are generated: the newly added tables are synchronized by these child tasks, which are then merged into the main task without introducing delay. Of course, if the modification does not add new tables, there is no need to generate child tasks.

3. Integration lifecycle

Users only need to configure once to generate a distributed data synchronization or subscription link. From the console, there is only one main task. DTS aggregates sub-task information into the main task, including progress and so on.

Subtasks are listed on the task details page and can be viewed in detail.

Tasks such as suspending, modifying synchronization objects, restarting, and releasing are performed in the parent task. DTS sends related instructions to sub-tasks for internal collaborative processing. Users only need to perform operations on the parent task.

Similarly, DTS collects alarm, monitoring, diagnosis and other o&M capabilities into the parent task for convenient o&M management.

3.1 Distributed data source subscription

A distributed data source subscription, like distributed synchronization, splits the DTS task; the server-side processing for subscription is simpler than for synchronization because there is no structure migration, full migration, or incremental migration to a target. The rest follows basically the same ideas as distributed data source synchronization and will not be expanded here.

3.2 Subscription SDK

On top of the original non-distributed data source subscription, we upgraded the subscription SDK to consume in parallel with multiple threads, so users only need to start a single client to subscribe to an entire distributed database.

Similar to the database/table mapping problem described for distributed data synchronization, the SDK layer also needs to map, at subscription time, the physical databases and tables it receives to logical ones.

Asset management and security

1. Background

Asset management is the basis of security management. It helps enterprises discover security risks in a more timely and accurate manner and implement effective security control measures through management components. Discovering and managing data, optimizing the cost and quality of data, and ensuring the safe flow and use of data are the three main parts of data management.

Data security and how to manage data in the information age have become a key concern of enterprises, while data encryption and desensitization have also attracted much attention in terms of security compliance requirements. During the life cycle of data assets, enterprises mainly face the following types of problems:

(1) Data is scattered across many places, with no effective means of unified management.

(2) Data authorization is hard to maintain, with database grants and accounts scattered everywhere.

(3) Authorization that is too coarse-grained cannot effectively protect data, while authorization that is too fine-grained results in high management costs.

(4) Sensitive data needs consistent protection globally.

(5) After a data leak, emergency response cannot quickly trace the source and stop the loss.

(6) Security issues arise when data is transferred and shared.

2. Discover and manage data

Data is stored in databases, files, and applications. Databases can be divided into relational databases, non-relational databases, and data warehouses. Files are generally stored statically; for example, Alibaba Cloud OSS is used to store log files and Excel files. Besides data types, the network locations where data is stored also vary: public clouds, VPC networks, self-built networks, and even on-premises servers.

Data asset collection

DMS supports the discovery and management of more than 27 data sources, laying a foundation for asset collection.

1. Scan IP addresses and ports in a specific network environment to discover database instances.

2. Scan instance information and schema information

For example, for MySQL:

Get instance version information: select version();

Additional metadata is retrieved from the various tables under information_schema.
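A sketch of the kind of metadata queries involved (the exact queries DMS runs are not documented here; the schema filter list below is an assumption):

```sql
-- Instance version
SELECT version();

-- Schema metadata: databases, tables, and columns, used to build the asset catalog.
SELECT table_schema, table_name, column_name, data_type, column_comment
FROM information_schema.columns
WHERE table_schema NOT IN ('information_schema', 'mysql', 'performance_schema', 'sys');
```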

3. Data sampling

Data classification by sampling a small amount of data.

4. Data discovery in isolated environments

The database gateway (DG) is used to discover and manage data in isolated (data island) environments in a unified way.

3. Optimize cost and quality

Unified management is the first step of data asset management. On this basis, data categories, quality, lineage, and classification/grading need to be managed in order to detect unnecessary data copies, low-quality data, and data security threats.

Build a knowledge graph of data assets

1. Data asset relationship identification

The graph is constructed from physical relationships, logical relationships, ETL integration tasks, processing lineage, schema matching, relationships between work orders and database changes, relationships between applications and middleware, associations between people and database permissions, and so on.

2. Graph construction

Edges are drawn for each relationship around "points" such as resource-resource, person-resource, person-permission, and person-work-order.

3. Online graph storage

The graph data is stored in Alibaba Cloud's graph database GDB.

4. Data asset services

Based on the online knowledge graph, services such as asset search, change risk association analysis, data security risk association analysis, change early warning, and data security early warning can be provided. For example, when the structure of asset table T needs to change, promptly notifying the upstream and downstream dependencies, or warning about the associated risk, greatly benefits the stability of upstream and downstream data consumption.

4. Safe circulation and use

Data classification and grading, and sensitive data identification

Data classification focuses on grouping data of different types according to attributes and features, while grading divides attributes of the same category into different levels according to specific standards, for example the grading standards defined in GDPR. In terms of regulations, DMS not only meets GDPR requirements but also complies with China's Classified Protection 2.0, the Data Security Law, the Personal Information Protection Law, and other regulations.

1. Configure a classification template and define sensitive data

There is generally consensus on how data should be classified; for example, values with the characteristics of a mobile phone number belong to the personal-information category "mobile phone number". This step configures the grading policy for each category and defines which categories are sensitive. DMS provides a series of regulation-based templates to simplify the configuration.

2. Configure periodic scan rules

As data assets are constantly growing and evolving, timely discovery and classification can help reduce the risk of sensitive data leakage. Periodic scan tasks can detect and protect sensitive data in a timely manner.

Generally, data classification is based mainly on metadata such as column names, table names, and comments, supplemented by sampled data to strengthen identification.

There are several ways to classify data:

  • Keyword-based;
  • Wildcard-based;
  • Regular-expression-based (a sketch follows below);
  • Machine-learning-based.
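For instance, a regular-expression rule for the "mobile phone number" category mentioned earlier could be applied to sampled values roughly like this (the sample table and the mainland-China number pattern are assumptions for illustration):

```sql
-- Flag sampled columns whose values look like mainland-China mobile numbers.
SELECT table_name, column_name, sample_value
FROM column_samples
WHERE sample_value REGEXP '^1[3-9][0-9]{9}$';
```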

5. Sensitive data protection

Sensitive data is defined so that it can be comprehensively protected against leakage as it circulates and is used. The usual practice is data desensitization (masking), which falls into two categories: static and dynamic desensitization.

1. Configure desensitization algorithm

This defines the data desensitization methods. DMS supports 15+ desensitization algorithms, covering masking, replacement, transformation, hashing, and so on; for example, the middle four digits of a mobile phone number can be masked.
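A minimal sketch of that particular masking rule in SQL (the table and column names are illustrative): keep the first three and last four digits and replace the middle four with asterisks.

```sql
-- Mask the middle four digits of an 11-digit mobile number, e.g. 13812345678 -> 138****5678.
SELECT CONCAT(LEFT(phone, 3), '****', SUBSTRING(phone, 8)) AS masked_phone
FROM user_info;
```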

2. Configure a desensitization policy

Define which desensitization method applies to data use in each scenario, subdivided into simple query, data export, data analysis, and so on, each of which can be configured with a different desensitization algorithm.

3. Static and dynamic desensitization

Static desensitization applies the desensitization algorithm to data during synchronization so that the sensitive values never exist on the target end. For example, when DTS desensitizes data during transmission, downstream consumers cannot obtain the sensitive data.

For dynamic desensitization, the DMS secure access proxy is compatible with the HTTP and MySQL protocols, and users access the database directly through this proxy layer. While access goes through the proxy, DMS fully checks the validity of access permissions, detects and intercepts data operations that violate R&D specifications or security rules, and desensitizes sensitive data on the fly.

5.1 Permission Control

Data assets cover all the data in an enterprise. Access between different departments, personnel, and services must be controlled and kept secure, and the access-control scheme must accommodate long-term and short-term access as well as coarse-grained and fine-grained access to the assets.

For example, DMS builds a logical permission system on top of the database to perform access control. By abstracting data, people, and permissions, it achieves fine-grained permission management. The control granularity, from coarse to fine, is instance, database, table, column, and row, and three permission types are defined for operations: query, change, and export. The permission validity period can be customized down to the minute. Through account hosting, R&D personnel never directly hold database account passwords, which improves data security.

5.2 Operation Audit

All SQL executed against data assets, together with meta-information about the results, should be recorded and audited. To make security operations traceable, for every SQL request issued by R&D personnel the system records the requester's IP, the requester's identity, the operation time, the SQL text, the number of rows affected, and the target database information, giving a comprehensive record and audit trail of both the operation and the database it touched.

In addition, logs must be encrypted and tamper-proof to prevent them from being modified and used.

4. Summary

One-stop data management DMS provides a comprehensive, unified data management solution that accommodates the complexity of data storage types and regions and manages the whole lifecycle across data production, storage, transmission, processing, and computation. It supports database design and development and data collection on the production side, real-time and offline data integration on the transmission side, and data development, processing, and data service capabilities on the usage side. Data source coverage, data security and governance, database DevOps, data transmission timeliness, and agility of data development are all major considerations.

This article has discussed some of the key capabilities of one-stop data management DMS, hoping to help readers understand and use the product.


This article is the original content of Aliyun and shall not be reproduced without permission.