Sqoop Preface

1.1 Introduction to Sqoop

Sqoop (SQL-to-Hadoop) is a convenient tool for migrating data between traditional relational databases and Hadoop. It takes full advantage of the parallelism of MapReduce to speed up batch data transfer. So far it has evolved into two major versions, Sqoop 1 and Sqoop 2.

Sqoop: developed by Cloudera

Production background

  1. Importing data from MySQL into Hadoop

  2. Exporting data from Hadoop into MySQL

Note: "Hadoop" above refers to Hive, HBase, HDFS, etc.

1.2 Characteristics of Sqoop

The Sqoop architecture is simple: it integrates with Hive, HBase, and Oozie and transfers data through MapReduce jobs, which provides concurrency and fault tolerance.

Sqoop 2 consists of two parts: a client and a server. The server is installed on one node in the cluster and serves as the entry point for all Sqoop clients.

Hadoop must be installed on the server-side node. Clients can be installed on any number of machines, and Hadoop does not need to be installed on the client machines.

Sqoop website: https://sqoop.apache.org
Sqoop 1.4.5 documentation: https://sqoop.apache.org/docs/1.4.5/
Why Sqoop 2 is not recommended: http://blog.csdn.net/robbyo/article/details/50737356

1.3 Advantages and Disadvantages of Sqoop

Advantages

  1. Efficient and controllable use of resources: task parallelism and timeouts can be configured.

  2. Automatic data type mapping and conversion, which the user can also customize.

  3. Supports a variety of mainstream databases: MySQL, Oracle, SQL Server, DB2, and so on.

Disadvantages

  1. Command-line-based operation is error-prone and insecure.

  2. Data transport and data format are tightly coupled, which prevents connectors from supporting all data formats.

  3. The username and password are exposed on the command line (a mitigation is sketched after this list).
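Regarding item 3, Sqoop 1.4.x can avoid the plain-text --password flag by reading the password interactively with -P or from a protected file with --password-file. A minimal sketch; the database name, table name, and file path are hypothetical:

  # Prompt for the password instead of passing it as an argument
  sqoop import --connect jdbc:mysql://localhost:3306/testdb --username root -P \
    --table example_table --target-dir /tmp/example_table

  # Or read it from a file with restrictive permissions (path is hypothetical)
  sqoop import --connect jdbc:mysql://localhost:3306/testdb --username root \
    --password-file /user/hadoop/.mysql.password \
    --table example_table --target-dir /tmp/example_table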

1.4 Principles of Sqoop

1.4.1 Sqoop Import Principle

When Sqoop imports data, the --split-by parameter needs to be specified.

Sqoop partitions the input according to the values of the --split-by column and assigns each partition to a different map task. Each map task processes the rows it fetches from the database one by one and writes them to HDFS. How the column is split depends on its type: for a simple int column, Sqoop takes the minimum and maximum values of the split-by field and divides that range into partitions based on the --num-mappers value passed in.

For example, suppose select max(split_by), min(split_by) from table returns 1000 and 1, and --num-mappers is 2. The range is divided into two partitions, (1, 500) and (501, 1000), which yield two SQL statements for two map tasks: select XXX from table where split_by >= 1 and split_by <= 500 and select XXX from table where split_by >= 501 and split_by <= 1000. Each map task then imports the data returned by its own SQL statement.
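A minimal sketch of the example above; the connection details, table, and column names are placeholders, not from the original text:

  sqoop import \
    --connect jdbc:mysql://localhost:3306/testdb \
    --username root --password secret \
    --table example_table \
    --split-by id \
    --num-mappers 2 \
    --target-dir /user/hadoop/example_table
  # Sqoop first runs a boundary query along the lines of:
  #   SELECT MIN(id), MAX(id) FROM example_table
  # and then issues one query per map task, e.g. for min=1 and max=1000:
  #   SELECT ... FROM example_table WHERE id >= 1 AND id <= 500
  #   SELECT ... FROM example_table WHERE id >= 501 AND id <= 1000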

1.4.2 Sqoop Export Principle

Based on the MySQL table name, Sqoop generates a Java class named after the table; this class inherits from SqoopRecord. The export runs as a map-only MapReduce job, and the output fields can be customized.
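The generated class can be inspected with the codegen tool. A minimal sketch, assuming a hypothetical testdb database and example_table table:

  # Generate the table's Java class without running an import or export job
  sqoop codegen \
    --connect jdbc:mysql://localhost:3306/testdb \
    --username root --password secret \
    --table example_table \
    --outdir /tmp/sqoop-codegen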

A typical export invocation from a shell script (here $1 is the MySQL table name, $2 is the export directory, and $3 is the database name):

  sqoop export \
    --connect "jdbc:mysql://${url}:3306/$3?characterEncoding=utf8" \
    --username $username --password $password \
    --table $1 \
    --export-dir $2 \
    --input-fields-terminated-by '|' \
    --input-null-non-string '0' --input-null-string '0'

1.5 Sqoop Usage Examples

Environment

Sqoop: sqoop-1.4.5+cdh5.3.6+78
Hive: hive-0.13.1+cdh5.3.6+397
HBase: hbase-0.98.6+cdh5.3.6+115

1.5.1 MySQL to Hadoop

• MySQL to HDFS

  # Note: with --query, ${exec_sql} must contain the literal token $CONDITIONS
  # (e.g. exec_sql='select id, name from example_table where $CONDITIONS')
  # so that Sqoop can inject each map task's range predicate.
  sqoop import \
    --connect ${jdbc_url} --username ${jdbc_username} --password ${jdbc_passwd} \
    --query "${exec_sql}" \
    --split-by ${id} -m 10 \
    --target-dir ${target_dir} \
    --fields-terminated-by "\001" --lines-terminated-by "\n" \
    --hive-drop-import-delims \
    --null-string '\\N' --null-non-string '\\N'
• MySQL to Hive

  sqoop import \
    --connect ${jdbc_url} \
    --username ${jdbc_username} --password  ${jdbc_passwd} \
    --table ${jdbc_table} --fields-terminated-by "\001" --lines-terminated-by "\n" \
    --hive-import --hive-overwrite --hive-table ${hive_table} \
    --null-string '\\N' --null-non-string '\\N'
• MySQL to HBase
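A minimal sketch of a MySQL-to-HBase import, assuming a hypothetical info column family and an id row-key column (adjust to your schema):

  sqoop import \
    --connect ${jdbc_url} --username ${jdbc_username} --password ${jdbc_passwd} \
    --table ${jdbc_table} \
    --hbase-table ${hbase_table} \
    --column-family info \
    --hbase-row-key id \
    --hbase-create-table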

1.5.2 Hadoop to MySQL

• HDFS to MySQL

  sqoop export \
    --connect ${jdbc_url} --username ${jdbc_username} --password ${jdbc_passwd} \
    --table category \
    --export-dir /dc_ext/xbd/dm/tmp/ods_dm_category_tmp \
    --input-fields-terminated-by '\001' \
    --input-null-non-string '\\N' \
    --input-null-string '\\N'
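Hive to MySQL works the same way: a Hive-managed table is just files under the warehouse directory, so point --export-dir at that directory. A minimal sketch, assuming the default warehouse path and hypothetical database and table names:

  sqoop export \
    --connect ${jdbc_url} --username ${jdbc_username} --password ${jdbc_passwd} \
    --table ${mysql_table} \
    --export-dir /user/hive/warehouse/${hive_db}.db/${hive_table} \
    --input-fields-terminated-by '\001' \
    --input-null-string '\\N' --input-null-non-string '\\N'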
