DTStack is a cloud-native, one-stop data middle-platform PaaS, and we have an interesting open source project on GitHub and Gitee: FlinkX is a batch and streaming data synchronization tool based on Flink. It can collect both static data and real-time data, and it is an all-domain, heterogeneous, batch-stream-unified data synchronization engine. If you like it, please give us a star! Star! Star!

Github open Source project: github.com/DTStack/fli…

Gitee Open Source project: gitee.com/dtstack_dev…

Table storage (OTS) is a NoSQL data storage service: a distributed storage and management service for structured and semi-structured data, built on cloud computing technology.

The data model for table storage is centered on “two-dimensional tables”.

Tables have the notion of rows and columns, but unlike tables in traditional databases, OTS tables are sparse.

Each row can have different columns, and attribute columns can be added or removed dynamically. When creating a table, you do not need to define a strict schema for its attribute columns.
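For illustration only (the column names below are made up, not from the original article), two rows in the same OTS table can carry completely different attribute columns:

row 1: primary key = "user_001", attribute columns = { "name": "Alice", "balance": "100" }
row 2: primary key = "user_002", attribute columns = { "phone": "138xxxxxxxx", "vip_level": "3" }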

1. Overview

For the OTS data migration, DataX is used to migrate the full data. However, because some data tables are very large, a full migration cannot be completed within the specified time window. In addition, DataX can only restrict a query by a range of primary key values; it does not support extracting data by a range of attribute column values.

Therefore, full + incremental data migration can be implemented in the following two ways:

  • If the partition key contains range information (such as a timestamp or an auto-increment ID), the specified ranges are used as split points and the data is migrated in batches (see the configuration sketch after this list).
  • If the partition key does not contain range information, the application side is switched to dual-write mode, the data is migrated in batches, and the rows are written into the same business table in the target environment. Relying on the primary-key uniqueness of OTS, repeated rows simply overwrite the original rows, which guarantees data uniqueness.
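As an illustration of the first approach, the fragment below is a minimal sketch of how the range in the DataX otsreader parameters (shown in full in the configurations later in this article) could be narrowed to one batch. The value syntax and the boundaries are assumptions for illustration and should be checked against the DataX otsreader documentation:

"range": {
    "begin": [{ "type": "STRING", "value": "2021-01-01" }],
    "end":   [{ "type": "STRING", "value": "2021-02-01" }]
}

Each batch then moves the begin and end boundaries forward until the whole primary-key range has been covered.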

In short, we don't produce data; at the moment, we are just porters of big data.

Next, taking the application-side switch to dual-write mode as an example, this article explains the OTS data migration and verification process in detail.

The OTS data migration process is shown in the figure below.

In preparation, the OTS data migration is divided into two stages:

  • Pre-migration stage: full migration of the large tables in dual-write mode
  • Formal migration stage: full migration of the incremental tables in dual-write mode and full migration of the other small tables

2. Pre-migration stage

1. Preparation

To ensure data consistency between the old and new environments, clear the data from the OTS data tables in the target environment before the migration. The data is cleared with a Delete operation: the DataX tool deletes the data in each table without recreating the table.

Specific operations are as follows:

1) Configure the DataX task

Before using DataX to clear the data, configure a JSON file for the corresponding data table that DataX will use to execute the delete task. In this file, configure the connection information of the target environment for both the reader and the writer, and set the write mode to DeleteRow, as follows:

{ "job": { "setting": { "speed": { "channel": "5" } }, "content": [{ "reader": { "name": "otsreader", "parameter": { "endpoint": "http://xxx.vpc.ots.yyy.com/", "accessId": "dest_accessId", "accessKey": "dest_accessKey", "instanceName": " dest_instanceName", "table": " tablename ", "column": [{ "name": "xxxxx" }, { "name": "xxxxx" } ], "range": { "begin": [{ "type": "INF_MIN" }], "end": [{ "type": "INF_MAX" }] } } }, "writer": { "name": "otswriter", "parameter": { "endpoint": "http://xxx.vpc.ots.yun.yyy.com/", "accessId": "dest_accessId", "accessKey": "dest_accessKey", "instanceName": " dest_instanceName", "table": " tablename ", "primaryKey": [{ "name": "xxxxx", "type": "string" }], "column": [{ "name": "xxxxx", "type": "string" }, { "name": "xxxxx", "type": "string" } ], "writeMode": "DeleteRow" } } }] } }

2) Execute the DataX task

  • Log in to the ECS where DataX is located and go to the DataX installation path.
  • Run the del_pre.sh script on the corresponding tool machine to clear the data in the corresponding tables of the target environment.

sh del_pre.sh

  • The content of the del_pre.sh script is as follows:

#!/bin/bash
nohup python datax.py del_table_1.json --jvm="-Xms16G -Xmx16G" > del_table_1.log &
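Because the task runs in the background through nohup, it can be handy to know when it has finished. The following optional sketch (not part of the original procedure; the JSON and log file names are the ones used above) waits for the background DataX process and reports its exit status:

#!/bin/bash
# start the delete task in the background, as in del_pre.sh
nohup python datax.py del_table_1.json --jvm="-Xms16G -Xmx16G" > del_table_1.log &
pid=$!
# block until the DataX process exits and report its status
if wait "$pid"; then
    echo "delete task finished successfully"
else
    echo "delete task failed, check del_table_1.log"
fi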

2. Data migration

Without stopping the service, all data tables with a large data volume in the source environment are migrated to the corresponding data tables in the target environment.

1) Configure the DataX task

Configure the corresponding JSON file for the data table in DataX. The details of the migration configuration are as follows:

{
  "job": {
    "setting": {
      "speed": {
        "channel": "5"
      }
    },
    "content": [{
      "reader": {
        "name": "otsreader",
        "parameter": {
          "endpoint": "xxx.vpc.ots.yyy.com/",
          "accessId": "src_accessId",
          "accessKey": "src_accessKey",
          "instanceName": "src_instanceName",
          "table": "tablename",
          "column": [
            { "name": "xxxxx" },
            { "name": "xxxxx" }
          ],
          "range": {
            "begin": [{ "type": "INF_MIN" }],
            "end": [{ "type": "INF_MAX" }]
          }
        }
      },
      "writer": {
        "name": "otswriter",
        "parameter": {
          "endpoint": "http://xxx.vpc.ots.yun.zzz.com/",
          "accessId": "dest_accessId",
          "accessKey": "dest_accessKey",
          "instanceName": "dest_instanceName",
          "table": "tablename",
          "primaryKey": [
            { "name": "xxxxx", "type": "string" }
          ],
          "column": [
            { "name": "xxxxx", "type": "string" },
            { "name": "xxxxx", "type": "string" }
          ],
          "writeMode": "PutRow"
        }
      }
    }]
  }
}

Note that OTS is a NoSQL store: the migration configuration must list every attribute column of the table; otherwise, the values of any attribute column that is not listed will be missing in the target table.
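For example (a sketch with made-up column names, not taken from the original configuration): if a table has attribute columns col_a, col_b, and col_c, all three must appear in the reader's column list, and again with their types in the writer's column list:

"column": [
    { "name": "col_a" },
    { "name": "col_b" },
    { "name": "col_c" }
]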

2) Execute the DataX task

  • Log in to the ECS where DataX is located and go to the DataX installation path.
  • Run the pre_transfer.sh script on the corresponding tool machine to start migrating the data from the source domain to the private-cloud OTS.

sh pre_transfer.sh

  • The pre_transfer.sh script contains the following information:

#!/bin/bash
nohup python datax.py table_1.json --jvm="-Xms16G -Xmx16G" > table_1.log &

At this point, everything is ready; the data just needs to be migrated!

Before migrating, let's take one final look at the data migration goals.

Next, we enter the formal migration stage!

3. Formal migration stage

1. OTS data silence

Whether the OTS data has gone silent is determined by observing whether the data in the corresponding tables is still changing. The verification methods include row-count statistics and content statistics.

1) Row count statistics

Because OTS does not provide a count interface, create an external table over OTS in Hive, read the OTS data through it, and count the rows of the corresponding data tables.

  • Create an external table
  • Start Hive and create the external tables corresponding to the preceding data tables. To improve statistical efficiency, the external table can read only the primary key column of OTS. The table creation statement is as follows:

CREATE EXTERNAL TABLE t_oilcard_expenses_old (
  h_card_no string
)
STORED BY 'com.aliyun.openservices.tablestore.hive.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  "tablestore.columns.mapping" = "card_no"
)
TBLPROPERTIES (
  "tablestore.endpoint" = "$endpoint",
  "tablestore.instance" = "instanceName",
  "tablestore.access_key_id" = "ak",
  "tablestore.access_key_secret" = "sk",
  "tablestore.table.name" = "tableName"
);

  • Go to the script path
  • Log in to the ECS where the Hadoop cluster master resides and go to the Hive directory.
  • Perform row-count statistics
  • Run the pre_all_count.sh script to start counting the rows of the OTS tables in the source environment.

nohup sh pre_all_count.sh >pre_all_count.log &

  • The pre_all_count.sh script contains the following information:

#!/bin/bash
./bin/hive -e "use ots; select count(h_card_no) from tableName;" > table.rs &

If the row counts from two consecutive runs are the same, the data is considered silent and data writing has stopped.
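As a minimal sketch of this check (the result file names and the wait interval are illustrative, not from the original article), the count can be run twice and the results compared:

#!/bin/bash
# first count
./bin/hive -e "use ots; select count(h_card_no) from tableName;" > table_count_1.rs
sleep 600   # wait before counting again
# second count
./bin/hive -e "use ots; select count(h_card_no) from tableName;" > table_count_2.rs
if diff -q table_count_1.rs table_count_2.rs > /dev/null; then
    echo "row counts match: data is silent"
else
    echo "row counts differ: data is still being written"
fi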

2) Content statistics

Because the partition key of some data tables has only a single value, all of their data is stored in one partition and counting rows with Hive takes a long time. For these tables, DataX is used to export the OTS data to OSS, and the statistics are collected there.
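The export job referenced below as table.json is not shown in the original article. The following is a rough sketch of what an OTS-to-OSS DataX configuration might look like; it assumes the DataX osswriter plugin and its parameters (bucket, object, fieldDelimiter, writeMode), and all endpoints and names are placeholders to be verified against the DataX documentation:

{
  "job": {
    "setting": { "speed": { "channel": "5" } },
    "content": [{
      "reader": {
        "name": "otsreader",
        "parameter": {
          "endpoint": "http://xxx.vpc.ots.yyy.com/",
          "accessId": "src_accessId",
          "accessKey": "src_accessKey",
          "instanceName": "src_instanceName",
          "table": "tablename",
          "column": [{ "name": "xxxxx" }],
          "range": {
            "begin": [{ "type": "INF_MIN" }],
            "end": [{ "type": "INF_MAX" }]
          }
        }
      },
      "writer": {
        "name": "osswriter",
        "parameter": {
          "endpoint": "http://oss-endpoint",
          "accessId": "oss_accessId",
          "accessKey": "oss_accessKey",
          "bucket": "check-bucket",
          "object": "ots_export/table_check01",
          "fieldDelimiter": ",",
          "writeMode": "truncate"
        }
      }
    }]
  }
}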

  • Go to the script path
  • Log in to the ECS corresponding to the above tables and go to the path where DataX is located.
  • Perform content verification

A. Run the check_table.sh script to export the OTS data tables from the source environment to OSS.

sh check_table.sh

  • The check_table.sh script contains the following contents:

#!/bin/bash
nohup python datax.py table.json --jvm="-Xms8G -Xmx8G" > ots2oss01.log &

B. Obtain the ETag value of the exported OSS object and write it to the corresponding file table_check01.rs. Run the content statistics twice in a row and compare the ETag values of the two exported objects.
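A minimal sketch of this comparison, assuming the ossutil command-line tool is available and that the two exports were written to different objects (the bucket and object names are illustrative, and the actual object names produced by DataX may carry a suffix):

#!/bin/bash
# the exact object names can be found with: ossutil ls oss://check-bucket/ots_export/
ossutil stat oss://check-bucket/ots_export/table_check01 | grep -i etag > table_check01.rs
ossutil stat oss://check-bucket/ots_export/table_check02 | grep -i etag > table_check02.rs

# identical ETags mean the exported content is identical, i.e. the data is silent
if diff -q table_check01.rs table_check02.rs > /dev/null; then
    echo "ETags match: data is silent"
else
    echo "ETags differ: data is still changing"
fi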

2. OTS data migration

1) Preparation

To ensure data consistency between the old and new environments after the migration and to prevent dirty data in the target environment, clear all of the remaining OTS data tables before the migration.

Data clearing modes include Drop and Delete. The differences between the two modes are as follows:

A. Drop table operations

Log in to the machine where the OTS graphical client resides, use the following connection information to connect to the specified OTS instance, and drop the corresponding tables.

AK: dest_accessId
SK: dest_accessKey
InstanceName: InstanceName
Endpoint: Endpoint

After confirming the deletion, recreate the corresponding data tables on the client.

B. Delete table operations

The Delete mode deletes the data in a table with the DataX tool without recreating the table. Refer to 2.1.1 for the configuration file required by DataX.

  • Log in to the ECS where DataX is located and go to the DataX installation path.
  • Run the delete script on the corresponding machine to clear the data from the OTS tables.

sh del_table_01.sh

  • The content of the del_table_01.sh script is as follows:

#!/bin/bash
nohup python datax.py del_table.json --jvm="-Xms16G -Xmx16G" > del_table.log &

2) Data migration

After the source environment stops serving traffic, all of the incremental tables in dual-write mode and the other small tables are migrated to the corresponding data tables in the target environment.

Specific operations are as follows:

A. Configure the DataX task

Configure the corresponding JSON files for the above data tables in DataX. Refer to 2.2.1 for the specific content of the migration configuration. As before, all attribute columns must be listed in the configuration.

B. Execute the DataX task

  • Log in to the ECS where DataX is located and go to the DataX installation path.
  • Run the transfer.sh script on the corresponding tool machine to start migrating the data from the source domain to the private-cloud OTS.

sh transfer.sh

  • The transfer.sh script contains the following content:

#!/bin/bash
nohup python datax.py Table.json > Table.log &

3. OTS data verification

The OTS data in both the new and old environments is verified with row-count statistics and content statistics, as follows:

1) Source environment data statistics

The data volume statistics of the OTS data tables in the source environment are taken from the last statistics result collected during data silence.

2) Target environment data statistics

A. Row count statistics

OTS does not provide a count interface, but ODPS in the target environment supports creating external tables over OTS. Therefore, create an OTS external table in ODPS, read the OTS data through it, and count the rows of the corresponding data tables.

  • Create an external table
  • Log in to ODPSCMD and create the external tables corresponding to the above data tables; a sketch of such a statement is shown below.
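The original article does not show the ODPS table creation statement. The following is a rough sketch modeled on the Hive DDL above; the storage handler class, the SERDEPROPERTIES keys (including the ':' prefix for primary key columns), and the LOCATION format are assumptions based on the MaxCompute (ODPS) Table Store integration and should be checked against the ODPS documentation:

CREATE EXTERNAL TABLE t_oilcard_expenses_new (
  h_card_no string
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping' = ':card_no',
  'tablestore.table.name' = 'tableName'
)
LOCATION 'tablestore://instanceName.endpoint';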
  • Go to the script path
  • Log in to the ECS where the ODPSCMD tool resides and go to the path where the ODPS tool resides.
  • Perform row-count statistics
  • Run the newots_count.sh script to collect the row counts of the OTS tables in the target environment.

nohup sh newots_count.sh >newots_count.log &

  • The newots_count.sh script contains the following:

#!/bin/bash
./bin/odpscmd -e "select count(h_card_no) from tableName;" > table.rs &

B. Content statistics

Because some data tables in the source environment are verified with content statistics, the target environment also uses content statistics so that data consistency can be compared directly. Refer to 3.1.2 for the specific operations.