This is the sixth day of my participation in the August Text Challenge.

I. Environmental information

Sqoop version: 1.4.7-cdh6.3.2

Sqoop import

1. Purpose

The import tool imports a single table from a relational database into HDFS. Each row in a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

2. Syntax

I. Connecting to a database server

Sqoop is designed to import tables from a database into HDFS. To do this, you must specify a connection string that describes how to connect to the database. In simple terms, we must specify the source table and the JDBC connection URL of the relational database.

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test
import: the import keyword; the arguments that follow describe an import from the relational database into the big data platform
--connect: import argument, the JDBC connection string of the database holding the source data
--username: import argument, the database user name
--password: import argument, the database password
--table: import argument, the name of the source table

Official warning: Sqoop will read the entire contents of the password file and use it as a password. This will include any trailing whitespace characters, such as the newline character that most text editors add by default. You need to make sure that your password file contains only the characters that belong to your password.
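For example, a minimal sketch of switching from --password to --password-file (the HDFS path /user/sqoop/sqoop.pwd is only an illustrative assumption); writing the file with printf avoids the trailing newline the warning describes:

# Write the password without a trailing newline and restrict its permissions
printf '%s' '123456' > sqoop.pwd
hdfs dfs -put sqoop.pwd /user/sqoop/sqoop.pwd
hdfs dfs -chmod 400 /user/sqoop/sqoop.pwd

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password-file /user/sqoop/sqoop.pwd \
--table sqoop_test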

Sqoop has built-in support for a variety of databases, including MySQL. Connection strings that begin with jdbc:mysql:// are handled automatically by Sqoop. (The Supported Databases section provides a complete list of databases with built-in support. For some databases, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import and install the .jar file in the $SQOOP_HOME/lib directory on the client machine. (This is /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar also contains a specific driver class that defines the entry point to the driver. For example, the driver class of MySQL's Connector/J library is com.mysql.jdbc.Driver. Refer to your database vendor's documentation to identify the main driver class. This class must be supplied to Sqoop with the --driver argument.
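As a hedged illustration of --driver with a non-MySQL database (the SQL Server host, database name and credentials below are made up, and the Microsoft JDBC driver .jar is assumed to already be in $SQOOP_HOME/lib):

bin/sqoop import \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect "jdbc:sqlserver://10.0.10.119:1433;databaseName=sqoop_test" \
--username sa \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1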

Sqoop typically imports data in a table-centric manner. Use the --table argument to select the table to import, for example --table employees. This argument can also identify a VIEW or other table-like entity in the database.

II. Basic Parameters

--append                         Append data to an existing dataset in HDFS
--as-avrodatafile                Import the data as Avro data files
--as-sequencefile                Import the data as SequenceFiles
--as-textfile                    Import the data as plain text (default)
--as-parquetfile                 Import the data as Parquet files
--boundary-query <statement>     Boundary query used to create splits
--columns <col,col,col...>       Columns to import from the table
--delete-target-dir              Delete the import target directory if it exists
--direct                         Use the direct connector if one exists for the database
--fetch-size <n>                 Number of entries to read from the database at once
--inline-lob-limit <n>           Set the maximum size for an inline LOB
-m,--num-mappers <n>             Use n map tasks to import in parallel
-e,--query <statement>           Import the results of statement
--split-by <column-name>         Column of the table used to split work units; cannot be used with the --autoreset-to-one-mapper option
--split-limit <n>                Upper limit for each split size; applies only to integer and date columns; for date or timestamp fields it is measured in seconds
--autoreset-to-one-mapper        Use one mapper if the table has no primary key and no split column is provided; cannot be used with the --split-by <col> option
--table <table-name>             The table to read
--target-dir <dir>               HDFS destination directory
--temporary-rootdir <dir>        HDFS directory for temporary files created during the import (overrides the default "_sqoop")
--warehouse-dir <dir>            HDFS parent directory for the table destination
--where <where clause>           WHERE clause to use during the import
-z,--compress                    Enable compression
--compression-codec <c>          Use the given Hadoop codec (default gzip)
--null-string <null-string>      String to write for a null value in a string column
--null-non-string <null-string>  String to write for a null value in a non-string column
--fields-terminated-by <char>    Set the field separator character
--lines-terminated-by <char>     Set the end-of-line character
--mysql-delimiters               Use MySQL's default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '

The --mysql-delimiters argument is a shorthand that uses mysqldump's default delimiters. If you use the mysqldump delimiters together with a direct-mode import (with --direct), very fast imports can be achieved.
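A sketch of a direct-mode import of this article's test table with mysqldump's delimiters; it assumes the mysqldump client is installed on every worker node, otherwise direct mode will fail:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--direct \
--mysql-delimiters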

III. Basic usage

1. Import mysql to HDFS

Note: run the following commands as the hdfs user, i.e. prefix them with sudo -u hdfs.

I. Import a whole table

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

II. Free-form query import

If the results of a query are to be imported in parallel, each map task needs to execute a copy of the query, partitioned by boundary conditions inferred by Sqoop. Your query must contain the token $CONDITIONS, which each Sqoop process replaces with a unique condition expression. You must also select a splitting column with --split-by.

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (the degree of parallelism) with the -m or --num-mappers argument. Each of these arguments takes an integer value corresponding to the degree of parallelism to use. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what your MapReduce cluster can provide; tasks will run serially and the import will likely take longer. Likewise, do not increase parallelism beyond what your database can reasonably support: connecting 100 concurrent clients to your database may increase the load on the database server to the point where performance suffers.

Sqoop needs a criterion by which to split the workload when performing a parallel import: it uses a splitting column. By default, Sqoop identifies the primary key column of the table (if there is one) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if you have a table whose primary key column id has a minimum of 0 and a maximum of 1000, and Sqoop is instructed to use four tasks, Sqoop runs four processes that each execute SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750) and (750, 1001) in the different tasks.

If the actual values of the primary key are not evenly distributed across their range, the tasks can be unbalanced. In that case you should explicitly choose a different column with the --split-by argument, for example --split-by employee_id. Sqoop cannot currently split on multi-column indexes. If your table has no indexed column, or has a multi-column key, you must also choose a splitting column manually.
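For example, a sketch of a four-way parallel import of the test table that sets the splitting column explicitly; it assumes the values of id are spread fairly evenly between their minimum and maximum:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 4 \
--split-by id \
--fields-terminated-by "\t"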

Users can influence the number of mappers with the --split-limit option, which caps the size of each split. If a split would be larger than the size specified by this argument, its size is adjusted to fit the limit and the number of splits changes accordingly. This affects the actual number of mappers: if the split size calculated from the --num-mappers argument exceeds the --split-limit value, the number of actual mappers is increased. If the value given to --split-limit is 0 or negative, the argument is ignored entirely and the split size is calculated from the number of mappers.

If the table has no primary key defined and no --split-by is provided, the import will fail unless the number of mappers is explicitly set to 1 with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The --autoreset-to-one-mapper option is usually used together with the import-all-tables tool to automatically handle tables without a primary key.

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select id,name,sex from sqoop_test where id>2 and $CONDITIONS'

III. Conditional import

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id<=2" \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

IV. Import specified columns

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id>=2" \
--columns name,sex \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

V. Transaction isolation level

By default, Sqoop imports data using the read committed transaction isolation level in the mappers. This may not be ideal for all ETL workflows, and it may be desirable to relax the isolation guarantees. The --relaxed-isolation option instructs Sqoop to use the read uncommitted isolation level instead.

The read uncommitted isolation level is not supported by all databases (for example, Oracle), so the --relaxed-isolation option may not work with every database.
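A minimal sketch of adding the option to the imports used in this article; it is only useful if the source MySQL instance actually permits the READ UNCOMMITTED level:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--relaxed-isolation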

VI. Incremental import

Argument                Description
--check-column (col)    Specifies the column to examine when deciding which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode are append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

Each import generates a new file:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental append \
--check-column id \
--last-value 5 \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t"

Each time new data or updated data is merged with the original data:

sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental lastmodified \
--check-column date \
--last-value "2021-07-26 11:53:50" \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t" \
--merge-key id

You should use the append mode when importing a table to which new rows are continually added with increasing row id values. Specify the column containing the row id with --check-column. Sqoop imports the rows whose check-column value is greater than the value given with --last-value.

The other table-update strategy supported by Sqoop is called lastmodified mode. You should use it when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp. Rows whose check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value that should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure that only new or updated data is imported. This is handled automatically when the incremental import is created as a saved job, which is the preferred mechanism for performing recurring incremental imports. See the section on saved jobs later in this document for more information.
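A sketch of wrapping the append import above in a saved job; the job name sqoop_test_incr is just an example, and depending on the metastore settings you may be prompted for the password at execution time (or you can use --password-file instead):

# Create the saved job; note the space between "--" and "import"
bin/sqoop job --create sqoop_test_incr -- import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental append \
--check-column id \
--last-value 0 \
--target-dir /user/company \
--num-mappers 1

# Each run imports only new rows and records the new --last-value in the metastore
bin/sqoop job --exec sqoop_test_incr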

For example, when we use time increments:

--incremental lastmodified \
--check-column date_col \
--last-value "2021-08-02 17:17:23"

2. Import mysql to Hive

I. Basic parameters

--hive-home <dir>             Override $HIVE_HOME
--hive-import                 Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite              Overwrite existing data in the Hive table
--create-hive-table           If set, the job fails if the target Hive table already exists
--hive-table <table-name>     Set the table name to use when importing into Hive
--hive-drop-import-delims     Drop \n, \r and \01 from string fields when importing into Hive
--hive-delims-replacement     Replace \n, \r and \01 in string fields with a user-defined string when importing into Hive
--hive-partition-key          Name of the Hive field to partition on
--hive-partition-value <v>    String value to use as the partition key for this import job
--map-column-hive <map>       Override the default mapping from SQL type to Hive type for configured columns. If you specify commas in this argument, use URL-encoded keys and values, for example DECIMAL(1%2C%201) instead of DECIMAL(1, 1).

II. Introductory demo

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table sqoop_test_hive

This actually happens in two steps: when Sqoop synchronizes to Hive, it first imports the data into HDFS and then loads it from HDFS into Hive.

Sqoop writes the data to a temporary HDFS directory and then moves it into the /user/hive/warehouse/sqoop_test_hive directory.

III. Official warning

Official: if you have multiple Hive installations, or hive is not on your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from there.

Although Hive supports escape characters, it does not handle escaping of the newline character. In addition, it does not support the notion of enclosing characters that allow field delimiters to appear inside enclosed strings. It is therefore recommended that you choose unambiguous field and record terminators when working with Hive, rather than relying on escaping and enclosing characters; this is due to the limitations of Hive's input parsing. Sqoop will print a warning message if you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive.

If your database rows contain string fields that include Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters), Hive will have problems with data imported by Sqoop. You can use the --hive-drop-import-delims option to remove these characters at import time and produce Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace these characters with a user-defined string at import time. These options should only be used if you rely on Hive's default delimiters; they should not be used if different delimiters are specified.

Sqoop imports NULL values as the string null by default. Hive, however, uses the string \N to represent NULL values, so predicates that handle NULL (such as IS NULL) will not work correctly. You should add the --null-string and --null-non-string arguments for import jobs, or --input-null-string and --input-null-non-string for export jobs, if you want NULL values to be stored properly. Because Sqoop uses these values in the generated code, you need to escape \N as \\N:

Such as:

sqoop import ... --null-string '\\N' --null-non-string '\\N'

By default, the table name used in Hive is the same as that of the source table. You can control the output table name with the --hive-table option.

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. See the Hive documentation for more details on partitioning.
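For example, a sketch that loads the test table into one Hive partition; the partition column dt and its value are assumptions made for illustration:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive \
--fields-terminated-by "\t" \
--hive-partition-key dt \
--hive-partition-value "2021-08-06"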

You can import compressed tables into Hive using the --compress and --compression-codec options. One drawback of compressing imported Hive tables is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop automatically indexes the files for splitting and configures the new Hive table with the correct InputFormat. This feature currently requires that all partitions of the table be compressed with the lzop codec.
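A sketch of a compressed Hive import with the lzop codec; it assumes the hadoop-lzo library (which provides com.hadoop.compression.lzo.LzopCodec) is installed on the cluster, and the target table name is only an example:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive_lzo \
--fields-terminated-by "\t" \
--compress \
--compression-codec com.hadoop.compression.lzo.LzopCodec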

3. Import mysql to HBase

I. Basic parameters

--column-family <family>    Sets the target column family for the import
--hbase-create-table        If specified, create missing HBase tables
--hbase-row-key <col>       Specifies which input column to use as the row key; if the input table has a composite key, <col> must be a comma-separated list of the composite key columns
--hbase-table <table-name>  Specifies the HBase table to use as the target instead of HDFS
--hbase-bulkload            Enable bulk loading

If the input table has a composite key, --hbase-row-key must be a comma-separated list of the composite key attributes. In that case, the HBase row key is generated by concatenating the values of those attributes with an underscore as the separator. Note: Sqoop can only import tables with a composite key if --hbase-row-key is specified.

If the target table and column family do not exist, the Sqoop job exits with an error. You should create the target table and column family before running the import. If --hbase-create-table is specified, Sqoop creates the target table and column family using the default parameters from the HBase configuration if they do not exist.

Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode) and then inserting the UTF-8 bytes of that string into the target cell. Sqoop skips any row that contains null values in all columns except the row key column.

To reduce the load on HBase, Sqoop can perform bulk loading rather than direct writes. To use bulk loading, pass --hbase-bulkload.
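A sketch that combines a composite row key with bulk loading; whether id,name is a sensible key for this table is an assumption, and the resulting row keys would have the form id_name (for example 1_Zhang San):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id,name" \
--hbase-table "sqoop_test_hbase" \
--hbase-bulkload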

Argument                            Description
--accumulo-table <table-name>       Specifies the Accumulo table to use as the target instead of HDFS
--accumulo-column-family <family>   Sets the target column family for the import
--accumulo-create-table             If specified, create the Accumulo table if it is missing
--accumulo-row-key <col>            Specifies which input column to use as the row key
--accumulo-visibility <vis>         (Optional) Visibility token to apply to all rows inserted into Accumulo. Default is the empty string.
--accumulo-batch-size <size>        (Optional) Size in bytes of Accumulo's write buffer. Default is 4MB.
--accumulo-max-latency <ms>         (Optional) Maximum latency in milliseconds for the Accumulo batch writer. Default is 0.
--accumulo-zookeepers <host:port>   Comma-separated list of ZooKeeper servers used by the Accumulo instance
--accumulo-instance <name>          Name of the target Accumulo instance
--accumulo-user <username>          Name of the Accumulo user to import as
--accumulo-password <password>      Password for the Accumulo user

II. Introductory demo

Example:

Create the HBase table:

hbase shell

create 'sqoop_test_hbase','info'
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "sqoop_test_hbase" \
--split-by id

IV. Export

Export goes in the opposite direction: from the big data cluster to a non-big-data system (for example, from HDFS or Hive to MySQL). Note that HBase data cannot be exported to MySQL this way.

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

The default action is to transform these into a set of INSERT statements that inject the records into the database. In update mode, Sqoop generates UPDATE statements that replace existing records in the database, while in call mode Sqoop makes a stored procedure call for each record.
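A hedged sketch of call mode: the stored procedure insert_sqoop_test is hypothetical and would have to exist in MySQL already, taking one parameter per exported column:

sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--call insert_sqoop_test \
--num-mappers 1 \
--export-dir /user/test.txt \
--input-fields-terminated-by ","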

1. Syntax format

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

2. Common parameters

Argument                        Description
--connect <jdbc-uri>            Specify the JDBC connection string
--connection-manager <class>    Specify the connection manager class to use
--driver <class-name>           Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>      Override $HADOOP_MAPRED_HOME
--help                          Print usage instructions
--password-file <path>          Set the path to a file containing the authentication password
-P                              Read the password from the console
--password <password>           Set the authentication password
--username <username>           Set the authentication user name
--verbose                       Print more information while working
--connection-param-file <file>  Optional properties file providing connection parameters
--relaxed-isolation             Set connection transaction isolation to read uncommitted for the mappers

Export control parameters:

debate describe
– the columns < col, col, col… > The column to export to the table
–direct Use the direct export fast path
–export-dir Source path of the EXPORTED HDFS
-m,–num-mappers usenExport map tasks in parallel
–table The table to populate
–call Called stored procedure
–update-key Anchor column for updates. If there is more than one column, use a comma-separated list of columns.
–update-mode Specifies how to perform updates when a new row with a mismatched key is found in the database. Legal valuesmodeincludingupdateonly(Default) andallowinsert
–input-null-string Strings to be interpreted as null for string columns
–input-null-non-string Strings to be interpreted as NULL for non-string columns
–staging-table The table in which data will be held temporarily before being inserted into the target table.
–clear-staging-table Indicates that any data existing in the staging table can be deleted.
–batch Use batch mode for low-level statement execution.

The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.

By default, all columns of the table are selected for export. You can select a subset of columns, and control their ordering, with the --columns argument. This should be a comma-separated list of the columns to export, for example --columns "col1,col2,col3". Note that columns not included in --columns must have defined default values or allow NULL values; otherwise your database will reject the imported data, which in turn will make the Sqoop job fail.
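For example, a sketch that exports only three columns of the test table; the input path /user/sqoop_test_cols.txt is hypothetical, and it assumes the files there contain exactly these three fields in this order and that any omitted MySQL columns are nullable or have defaults:

sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--columns "id,name,sex" \
--num-mappers 1 \
--export-dir /user/sqoop_test_cols.txt \
--input-fields-terminated-by ","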

3. Export hive to mysql

bin/sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/hive/warehouse/sqoop_test_hive \
--input-fields-terminated-by "\t"

4. Export HDFS data to mysql

vim test.txt
1,Zhang San,male,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test.txt \
--input-fields-terminated-by ","

5. Update export (only existing data is updated)

vim test1.txt
1,Zhang San,unknown,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test1.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode updateonly

updateonly mode: only existing rows are updated; no new rows are inserted.

6. Incremental export

vim test3.txt
1,Zhang San,female,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sudo -u hdfs hdfs dfs -put ./test3.txt /user/
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test3.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode allowinsert

allowinsert mode: update existing rows and insert rows that do not yet exist.

V. Script export (imports work the same way)

vim sqp_export.opt

Script information:

export
--connect
jdbc:mysql://10.0.10.118:3306/sqoop_test
--username
root
--password
123456
--table
sqoop_test
--num-mappers
1
--export-dir
/user/hive/warehouse/sqoop_test_hive
--input-fields-terminated-by
"\t"

Execute command:

sudo -u hdfs sqoop --options-file ./sqp_export.opt