I. Environmental information
Sqoop version: 1.4.7-cdh6.3.2
Sqoop import
1. Purpose
The import tool imports a single table from a relational database into HDFS. Each row in a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
2. Syntax
I. Connecting to the database server
Sqoop is designed to import tables from a database into HDFS. To do this, you must specify a connect string that describes how to connect to the database. In simple terms, you must specify the source table and the JDBC URL used to connect to the relational database.
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test
Argument | Description |
---|---|
import | Import marker: the arguments that follow all belong to the import from the relational database into the big data platform |
--connect | JDBC connection string of the database holding the source data |
--username | Database username |
--password | Database password |
--table | Name of the source table |
Official warning: Sqoop will read the entire contents of the password file and use it as a password. This will include any trailing whitespace characters, such as the newline character that most text editors add by default. You need to make sure that your password file contains only the characters that belong to your password.
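For example, a minimal sketch of supplying the password through a file instead of on the command line. The path /home/sqoop/.mysql.pwd is only an illustration, and whether Sqoop resolves the path against HDFS or the local filesystem depends on your Hadoop configuration, so treat the file:// prefix as an assumption to verify:

# create the password file without a trailing newline (see the warning above)
echo -n "123456" > /home/sqoop/.mysql.pwd
chmod 400 /home/sqoop/.mysql.pwd

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password-file file:///home/sqoop/.mysql.pwd \
--table sqoop_test

The echo -n avoids the trailing newline that the warning above calls out.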
Sqoop has built-in support for a variety of databases, including MySQL. Connection strings that start with jdbc:mysql:// are handled automatically by Sqoop. (The Supported Databases section of the documentation provides a complete list of databases with built-in support. For some databases, you may need to install the JDBC driver yourself.)
You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import and install the .jar file in the $SQOOP_HOME/lib directory on the client machine. (This is /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class that defines the entry point to the driver. For example, the driver class of MySQL's Connector/J library is com.mysql.jdbc.Driver. Refer to your database vendor's documentation to identify the main driver class. This class must be supplied to Sqoop with the --driver argument.
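As a sketch of this generic JDBC path (the PostgreSQL host, port, and database below are made up for illustration; the driver class shown is the PostgreSQL JDBC driver's, and your vendor's class will differ):

bin/sqoop import \
--connect jdbc:postgresql://pg-host:5432/sqoop_test \
--driver org.postgresql.Driver \
--username root \
--password 123456 \
--table sqoop_test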
Sqoop typically imports data in a table-centric manner. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in the database.
II. Basic Parameters
Argument | Description |
---|---|
--append | Append data to an existing dataset in HDFS |
--as-avrodatafile | Import data as Avro data files |
--as-sequencefile | Import data as SequenceFiles |
--as-textfile | Import data as plain text (default) |
--as-parquetfile | Import data as Parquet files |
--boundary-query | Boundary query used to create splits |
--columns <col,col,col…> | Columns to import from the table |
--delete-target-dir | Delete the import target directory if it exists |
--direct | Use the direct connector if it exists for the database |
--fetch-size | Number of entries to read from the database at once |
--inline-lob-limit | Set the maximum size of an inline LOB |
-m,--num-mappers | Use n map tasks to import in parallel |
-e,--query | Import the results of the given statement |
--split-by | Table column used to split work units; cannot be used together with --autoreset-to-one-mapper |
--split-limit | Upper limit for each split size; applies only to integer and date columns; for date or timestamp fields the value is in seconds |
--autoreset-to-one-mapper | Use one mapper if the table has no primary key and no split column is provided; cannot be used together with --split-by <col> |
--table | Name of the table to read |
--target-dir | HDFS destination directory |
--temporary-rootdir | HDFS directory for temporary files created during the import (overrides the default "_sqoop") |
--warehouse-dir | HDFS parent directory for the table destination |
--where | WHERE clause to use during the import |
-z,--compress | Enable compression |
--compression-codec | Hadoop compression codec to use (default gzip) |
--null-string | String to write for a null value in a string column |
--null-non-string | String to write for a null value in a non-string column |
--fields-terminated-by | Set the field separator character |
--lines-terminated-by | Set the end-of-line character |
--mysql-delimiters | Use MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' |
The --mysql-delimiters argument is a shorthand that uses mysqldump's default delimiters. If you use the mysqldump delimiters together with a direct-mode import (using --direct), very fast imports can be achieved.
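A sketch combining the two, assuming mysqldump is installed on the nodes running the map tasks (direct mode requires it):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--direct \
--mysql-delimiters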
3. Basic usage
1. Import mysql to HDFS
Run the commands as the hdfs user, i.e. prefix them with sudo -u hdfs.
I. Full table import
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
II. Free-form query import
If you want to import the results of a query in parallel, then each map task must execute a copy of the query, partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use with the -m or --num-mappers argument. This argument takes an integer value corresponding to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what is available in your MapReduce cluster; tasks will run serially and will likely increase the time required to perform the import. Likewise, do not increase parallelism beyond what your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to the point where performance suffers.
Sqoop needs a criterion by which to split the workload when performing parallel imports; it uses a splitting column for this. By default, Sqoop identifies the primary key column of the table (if present) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if you have a table whose primary key column id has a minimum value of 0 and a maximum of 1000, and Sqoop is directed to use four tasks, Sqoop would run four processes that each execute SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values of the primary key are not uniformly distributed across its range, the tasks can become unbalanced. In that case you should explicitly choose a different column with the --split-by argument, for example --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column or has a multi-column key, you must manually choose a splitting column.
Users can override --num-mappers with the --split-limit option. Using the --split-limit argument limits the size of the splits created. If a split larger than the size specified in this argument would be created, the split size is reduced to fit this limit, and the number of splits changes accordingly. This affects the actual number of mappers: if the split size calculated from the supplied --num-mappers argument exceeds the --split-limit value, the number of actual mappers is increased. If the value specified for --split-limit is 0 or negative, the argument is ignored entirely and the split size is calculated from the number of mappers.
If a table has no primary key defined and no --split-by <col> is provided, the import will fail unless the number of mappers is explicitly set to 1 with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The --autoreset-to-one-mapper option is typically used together with the import-all-tables tool to automatically handle tables in a schema that have no primary key.
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select id,name,sex from sqoop_test where id>2 and $CONDITIONS'
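The example above uses a single mapper, which sidesteps splitting. A sketch of the same query imported with four parallel map tasks, where Sqoop replaces $CONDITIONS with a per-task range on the id column chosen via --split-by:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 4 \
--split-by id \
--fields-terminated-by "\t" \
--query 'select id,name,sex from sqoop_test where id>2 and $CONDITIONS'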
III. Conditional import
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id<=2" \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
IV. Import specified columns
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id>=2" \
--columns name,sex \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
V. Transaction isolation level
By default, Sqoop uses the read-committed transaction isolation level in the mappers when importing data. This may not be ideal in all ETL workflows, and you may wish to relax the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use the read-uncommitted isolation level.
The read-uncommitted isolation level is not supported by all databases (for example, Oracle), so the --relaxed-isolation option may not work with every database.
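As a minimal sketch, the option is simply appended to an ordinary import (only meaningful if your database accepts read-uncommitted reads):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--relaxed-isolation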
VI. Incremental import
Argument | Description |
---|---|
--check-column (col) | Column to examine when deciding which rows to import (the column must not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR) |
--incremental (mode) | How Sqoop determines which rows are new; legal values for mode are append and lastmodified |
--last-value (value) | Maximum value of the check column from the previous import |
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
Each import generates a new file:
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental append \
--check-column id \
--last-value 5 \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t"
Each import merges new or updated data with the existing data:
sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental lastmodified \
--check-column date \
--last-value "2021-07-26 11:53:50" \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t" \
--merge-key id
You should use append mode when importing a table where new rows are continually added with increasing row id values. Specify the column containing the row id with --check-column. Sqoop imports rows whose check column value is greater than the one specified with --last-value.
The other table-update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp. Rows whose check column holds a timestamp more recent than the one specified with --last-value are imported.
At the end of an incremental import, the value that should be passed as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure that only new or updated data is imported. This is handled automatically by creating the incremental import as a saved job, which is the preferred mechanism for performing recurring incremental imports (a saved-job sketch follows below). For more information, see the section on saved jobs.
For example, when we use time increments:
--incremental lastmodified \
--check-column date_col \
--last-value "2021-08-02 17:17:23"
2. Import mysql to Hive
I. Basic parameters
Argument | Description |
---|---|
--hive-home <dir> | Override $HIVE_HOME |
--hive-import | Import tables into Hive (uses Hive's default delimiters if none are set) |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job fails if the target Hive table already exists |
--hive-table | Table name to use when importing into Hive |
--hive-drop-import-delims | Drop \n, \r, and \01 from string fields when importing into Hive |
--hive-delims-replacement | Replace \n, \r, and \01 in string fields with a user-defined string when importing into Hive |
--hive-partition-key | Name of the Hive field to partition on |
--hive-partition-value | String value to use as the partition key for this import into Hive |
--map-column-hive | Override the default SQL-type-to-Hive-type mapping for the configured columns. If the value contains commas, use URL-encoded keys and values, for example DECIMAL(1%2C%201) instead of DECIMAL(1, 1) |
II. Introductory demo
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table sqoop_test_hive
In fact, this happens in two steps: when Sqoop synchronizes to Hive, it first imports the data into HDFS and then transfers the data from HDFS into Hive.
It writes the data to a temporary HDFS directory first and then copies it into the /user/hive/warehouse/sqoop_test_hive directory.
III. Official warning
Official: If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from there.
Although Hive supports escape characters, it does not handle the escaping of newline characters. It also does not support the notion of enclosing characters that may contain field delimiters within an enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without relying on escaping and enclosing characters when working with Hive; this is due to the limitations of Hive's input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
If your database rows contain string fields that include Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters), Hive will have problems with data imported using Sqoop. You can use the --hive-drop-import-delims option to drop those characters on import and produce Hive-compatible text data, or the --hive-delims-replacement option to replace them with a user-defined string on import. These options should only be used if you use Hive's default delimiters and should not be used if different delimiters are specified.
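For example, a sketch of the Hive demo above with the problematic characters dropped at import time:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive \
--hive-drop-import-delims

Note that, per the warning above, no custom field delimiter is set here, so Hive's defaults apply.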
Sqoop imports NULL values as the string null by default. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (such as IS NULL) will not work correctly. You should append the arguments --null-string and --null-non-string for import jobs, or --input-null-string and --input-null-non-string for export jobs, if you wish to preserve NULL values properly. Because Sqoop uses these parameters in the generated code, you need to properly escape the value \N to \\N:
For example:
sqoop import ... --null-string '\\N' --null-non-string '\\N'
By default, the table name used in Hive is the same as that of the source table. You can control the output table name with the --hive-table option.
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Refer to the Hive documentation for more details on partitioning.
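For example, a sketch that loads the import into a single partition (the partition column dt and its value are made up for illustration):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive \
--hive-partition-key dt \
--hive-partition-value "2021-08-06"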
You can import compressed tables into Hive using the --compress and --compression-codec options. One downside of compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop automatically indexes the files for splitting and configures the new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
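A sketch of a compressed Hive import, assuming the hadoop-lzo package is installed and provides the com.hadoop.compression.lzo.LzopCodec class (the target table name is illustrative):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive_lzo \
--compress \
--compression-codec com.hadoop.compression.lzo.LzopCodec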
3. Import mysql to HBase
I. Basic parameters
Argument | Description |
---|---|
--column-family <family> | Set the target column family for the import |
--hbase-create-table | If specified, create missing HBase tables |
--hbase-row-key | Input column(s) to use as the row key; a comma-separated list if the input table has a composite key |
--hbase-table | Target HBase table to populate instead of HDFS |
--hbase-bulkload | Enable bulk loading |
If the input table has a composite key, --hbase-row-key must be a comma-separated list of the composite key attributes. In this case, the HBase row key is generated by combining the values of the composite key attributes using an underscore as the separator. Note: Sqoop import of a table with a composite key works only if --hbase-row-key is specified.
If the target table and column family do not exist, the Sqoop job exits with an error. You should create the target table and column family before running the import. If --hbase-create-table is specified, Sqoop creates the target table and column family (if they do not exist) using the default parameters from the HBase configuration.
Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string into the target cell. Sqoop skips all rows that contain null values in all columns except the row key column.
To reduce the load on HBase, Sqoop can perform bulk loading rather than direct writes. To use bulk loading, enable it with --hbase-bulkload.
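A sketch illustrating both points: treating id and name as a composite row key (an assumption made only for illustration; the demo table below uses a plain id key) and enabling bulk loading:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id,name" \
--hbase-table "sqoop_test_hbase_bulk" \
--hbase-bulkload

With a composite key, each row key is formed as the id and name values joined by an underscore.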
Argument | Description |
---|---|
--accumulo-table | Target Accumulo table to populate instead of HDFS |
--accumulo-column-family | Set the target column family for the import |
--accumulo-create-table | If specified, create the Accumulo table if it is missing |
--accumulo-row-key | Input column to use as the row key |
--accumulo-visibility | (Optional) Visibility token to apply to all rows inserted into Accumulo; defaults to the empty string |
--accumulo-batch-size | (Optional) Size in bytes of Accumulo's write buffer; defaults to 4MB |
--accumulo-max-latency | (Optional) Maximum latency in milliseconds for the Accumulo batch writer; defaults to 0 |
--accumulo-zookeepers host:port | Comma-separated list of Zookeeper servers used by the Accumulo instance |
--accumulo-instance | Name of the target Accumulo instance |
--accumulo-user | Name of the Accumulo user to import as |
--accumulo-password | Password for the Accumulo user |
II. Introductory demo
Example:
Create the HBase table first:
hbase shell
create 'sqoop_test_hbase','info'
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "sqoop_test_hbase" \
--split-by id
4. Export
That is, exporting from the big data cluster to a non-big-data cluster; note that HBase cannot be exported to MySQL.
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop generates UPDATE statements that replace existing records in the database, and in "call mode," Sqoop makes a stored procedure call for each record.
1. Syntax format
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
2. Common parameters
Argument | Description |
---|---|
--connect | Specify the JDBC connection string |
--connection-manager | Specify the connection manager class to use |
--driver | Manually specify the JDBC driver class to use |
--hadoop-mapred-home | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password | Set the authentication password |
--username | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file | Optional properties file providing connection parameters |
--relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers |
Export control parameters:
Argument | Description |
---|---|
--columns <col,col,col…> | Columns to export to the table |
--direct | Use the direct export fast path |
--export-dir | HDFS source path for the export |
-m,--num-mappers | Use n map tasks to export in parallel |
--table | Table to populate |
--call | Stored procedure to call |
--update-key | Anchor column(s) to use for updates; use a comma-separated list if there is more than one column |
--update-mode | How updates are performed when new rows with non-matching keys are found in the database; legal values for mode are updateonly (default) and allowinsert |
--input-null-string | String to be interpreted as null for string columns |
--input-null-non-string | String to be interpreted as null for non-string columns |
--staging-table | Table in which data is staged before being inserted into the destination table |
--clear-staging-table | Indicates that any data present in the staging table can be deleted |
--batch | Use batch mode for underlying statement execution |
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call) and the directory in HDFS that contains the source data.
By default, all columns in the table are selected for export. You can select a subset of columns and control their ordering with the --columns argument. This should be a comma-separated list of columns to export, for example --columns "col1,col2,col3". Note that columns not included in the --columns argument must have defined default values or allow NULL values; otherwise your database will reject the imported data, which in turn causes the Sqoop job to fail.
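A sketch of an export that stages the data first, so a failed job leaves no partial rows in the target. The staging table sqoop_test_stage is hypothetical and must already exist with the same schema as sqoop_test; the source directory is illustrative:

bin/sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/company \
--input-fields-terminated-by "\t" \
--staging-table sqoop_test_stage \
--clear-staging-table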
3. Export hive to mysql
bin/sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/hive/warehouse/sqoop_test_hive \
--input-fields-terminated-by "\t"
4. Export HDFS data to mysql
vim test.txt
1,Zhang San,male,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test.txt \
--input-fields-terminated-by ","
5. Update export: only existing data is updated
vim test1.txt
1,Zhang San,unknown,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test1.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode updateonly
Updateonly mode: Only existing data is updated. No insert is performed to add new data
6. Incremental export
vim test3.txt
1,Zhang San,female,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sudo -u hdfs hdfs dfs -put ./test3.txt /user/
Copy the code
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test3.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode allowinsert
Allowinsert mode: Update existing data and add non-existing data
5. Script export (the same approach works for import)
vim sqp_export.opt
Script information:
export
--connect
jdbc:mysql://10.0.10.118:3306/sqoop_test
--username
root
--password
123456
--table
sqoop_test
--num-mappers
1
--export-dir
/user/hive/warehouse/sqoop_test_hive
--input-fields-terminated-by
"\t"
Execute command:
sudo -u hdfs sqoop --options-file ./sqp_export.opt