The previous article, on synchronizing MySQL data to HBase in real time with the powerful StreamSets, showed you how to use StreamSets to synchronize (migrate) MySQL data to HBase in real time. The advantages of using StreamSets are simple deployment, flexible configuration, and no coding required.

If you read the previous article carefully, you'll notice that StreamSets can track data changes in real time and synchronize the updates to HBase. However, that article did not cover synchronizing historical data (and records whose data has not changed). Of course, it's not that StreamSets can't handle these cases; I simply haven't used it for historical data migration yet. So for migrating historical data, today I will introduce another tool: Sqoop. You are probably more familiar with Sqoop than with StreamSets.

If you need to synchronize data from a relational database (MySQL, Oracle, etc.) to Hadoop (HDFS, Hive, HBase), or from Hadoop to a relational database, this article should help you.

Note: StreamSets and Sqoop are both popular data synchronization and migration solutions, and there are many similar tools, such as DataX and Kettle. You can learn more about them and find the solution that suits your own scenario.

Meet Sqoop

Sqoop's name looks like a splice of SQL and Hadoop, and the name says it all: it is a tool for moving data between relational databases and Hadoop, and it works in both directions, as the figure below shows at a glance:

Moving data from a relational database to Hadoop is called import, and moving data from Hadoop to a relational database is called export. Later in the article you will see both the import and export commands in action.

Installation

The installation of Sqoop will not be covered here; there are many tutorials on the Internet, and because it is just a tool, it is easy to install. It is important to note that there are two major versions of Sqoop: Sqoop1 and Sqoop2.

Visit the Sqoop official website, where 1.4.x is Sqoop1 and 1.99.* is Sqoop2.

The differences between Sqoop1 and Sqoop2 are:

  1. Sqoop1 is just a client-side tool, while Sqoop2 adds a server to centrally manage connectors
  2. Sqoop1 only works from the command line, while Sqoop2 offers more ways to work with it, such as a REST API and a web UI
  3. Sqoop2 adds a permission and security mechanism

For my purposes, Sqoop is just a synchronization tool, the command line is sufficient, and most of the data synchronization happens within the same LAN (so there are no data security concerns), so I chose Sqoop1 (version 1.4.6).

Framework and principles

As can be seen from the figure above, the Sqoop client drives Sqoop through shell commands, and the Task Translator inside Sqoop converts those commands into MapReduce jobs that Hadoop runs to perform the actual data operations. For example, when a MySQL table is synchronized to Hadoop, Sqoop splits the table's records into several parts and hands each part to its own mapper running on Hadoop (to keep the synchronization efficient). You may notice that this MapReduce job has no Reduce phase, only Map.
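To illustrate the splitting, here is a hedged sketch reusing the placeholder connection details used throughout this article; -m sets the number of mappers and --split-by picks the column used to divide the table among them:

sqoop import --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password 12345678 \
  --table tablename \
  --split-by id \
  -m 4 \
  --target-dir /hadoopDir/tablename
# With -m 4 and --split-by id, Sqoop reads min(id) and max(id), divides that range
# into four intervals, and launches one map task per interval; there is no Reduce phase.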

Hands-on

Now that you know what Sqoop is, what it can do, and how its framework works, let's run a few Sqoop commands directly to get a feel for how simple and effective Sqoop is to use. The relational database used in this article is MySQL; Oracle and other relational databases accessed over JDBC work in much the same way, with only small differences (an Oracle example follows the list-tables command below).

Run sqoop help to see the tools Sqoop provides, as shown in the figure below.

Each of these tools corresponds to a runnable script in Sqoop's bin directory; if you want the details, you can open those scripts.

Common operations or commands are as follows (two quick examples follow the list):

  1. list-databases: lists the available databases
  2. list-tables: lists the tables in a database
  3. import: synchronizes data from a relational database to Hadoop
  4. export: synchronizes data from Hadoop to a relational database
  5. version: displays the Sqoop version
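
You can check your installation and explore the available options directly from the shell, for example:

sqoop version        # print the installed Sqoop version
sqoop help import    # list the options accepted by the import tool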

Listing databases

sqoop list-databases --connect jdbc:mysql://192.168.1.123:3306/ --username root --password 12345678

Listing tables

sqoop list-tables --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password 12345678
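
The same operation against Oracle (or any other JDBC-accessible database) differs mainly in the JDBC URL. A hedged sketch follows; the host, port, service name, and credentials are placeholders, and the Oracle JDBC driver jar must first be copied into Sqoop's lib directory:

sqoop list-tables --connect jdbc:oracle:thin:@//192.168.1.124:1521/orclservice --username scott --password tiger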

MySQL to HDFS

sqoop import --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password 12345678 \
  --table tablename \
  --target-dir /hadoopDir/ \
  --fields-terminated-by '\t' \
  -m 1 \
  --check-column id \
  --last-value num \
  --incremental append

--connect: the JDBC URL of the database; the trailing databasename is the database you want to connect to

--table: the database table

--username: the database username

--password: the database password

--target-dir: the target HDFS directory

--fields-terminated-by: the delimiter between fields after the data is imported

-m: the number of concurrent mappers

--check-column: the reference column for incremental imports, here id (the primary key)

--last-value: the last value imported in the previous run

--incremental append: incremental (append) import mode

Note: when incremental synchronization is required, we can use --incremental append together with --last-value. For example, here we use id as the reference column: if the last synchronized id was 1000 and we only want to synchronize the new data, we pass --last-value 1000.
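
Putting that note into a concrete command (an illustrative sketch with the same placeholder connection details as above; only rows with id greater than 1000 will be imported and appended):

sqoop import --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password 12345678 \
  --table tablename \
  --target-dir /hadoopDir/ \
  --fields-terminated-by '\t' \
  -m 1 \
  --incremental append \
  --check-column id \
  --last-value 1000
# At the end of the run, Sqoop logs the new --last-value to pass on the next incremental import.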

MySQL to Hive

Use import with --create-hive-table:

sqoop import --create-hive-table -m 1 --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password 12345678 \
  --table tablename \
  --hive-import \
  --hive-database databasename_hive \
  --hive-overwrite \
  --hive-table tablename_hive
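
After the job finishes, a quick sanity check from the command line (assuming the hive CLI is available; the database and table names are the ones used above):

hive -e "SELECT * FROM databasename_hive.tablename_hive LIMIT 10"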

MySQL to HBase

hbase shell
create_namespace 'database_tmp'
create 'database_tmp:table_tmp', 'info'

sqoop import --connect jdbc:mysql://192.168.1.123:3306/databasename --username 'root' --password '12345678' \
  --table 'tablename' \
  --hbase-table 'database_tmp:table_tmp' \
  --hbase-row-key 'id' \
  --column-family 'info'

First, go to the HBase shell and create the namespace and the table: database_tmp is the namespace and table_tmp is the table, with a column family named info. Then run the Sqoop import.

HDFS to MySQL

sqoop export --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password '12345678' \
  --table tablename \
  -m 1 \
  --export-dir /hadoopDir/ \
  --input-fields-terminated-by '\t' \
  --columns="column1,column2"

--columns: specifies which columns to export

Hive to MySQL

If you know Hive, you know that Hive's data is really just files stored on HDFS. Therefore, synchronizing from Hive to MySQL is essentially the same as synchronizing from HDFS to MySQL; a sketch follows below.
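
A minimal sketch under two assumptions: the Hive table created by the import example above sits in Hive's default warehouse location, and the table uses Hive's default field delimiter, Ctrl-A ('\001'); adjust the path and delimiter to match your tables.

sqoop export --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password '12345678' \
  --table tablename \
  -m 1 \
  --export-dir /user/hive/warehouse/databasename_hive.db/tablename_hive \
  --input-fields-terminated-by '\001'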

HBase to MySQL

Currently, Sqoop does not provide a way to synchronize HBase data directly to MySQL; a common workaround is sketched below.
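
The workaround is not a built-in Sqoop feature: stage the HBase data into a plain Hive/HDFS table first (for example via a Hive table mapped onto the HBase table), then export the staged files with sqoop export as shown earlier. The staging table names below are hypothetical.

# Hypothetical staging step: copy an HBase-backed Hive table into a plain Hive table.
hive -e "INSERT OVERWRITE TABLE table_tmp_export SELECT * FROM table_tmp_hbase_mapped"
# Then export the staged table's files ('\001' is Hive's default field delimiter).
sqoop export --connect jdbc:mysql://192.168.1.123:3306/databasename --username root --password '12345678' \
  --table tablename \
  --export-dir /user/hive/warehouse/table_tmp_export \
  --input-fields-terminated-by '\001'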

Bottom line: Sqoop is an effective and flexible tool for data synchronization in SQL to Hadoop and Hadoop to SQL scenarios.