I. Environmental information
Sqoop version: 1.4.7-cdh6.3.2
Sqoop import
1. Purpose
The import tool imports a single table from a relational database into HDFS. Each row in a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
2. Syntax
I. Connecting to the database server
Sqoop is designed to import tables from a database into HDFS. To do this, you must specify a connect string that describes how to connect to the database. In simple terms, you must specify the source table and the JDBC URL used to connect to the relational database.
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test
Argument | Description |
---|---|
import | Import marker: the arguments that follow all belong to the import from the relational database into the big data platform |
--connect | JDBC connection string of the database holding the source data |
--username | Database username |
--password | Database password |
--table | Name of the source table |
Official warning: Sqoop will read the entire contents of the password file and use it as a password. This will include any trailing whitespace characters, such as the newline character that most text editors add by default. You need to make sure that your password file contains only the characters that belong to your password.
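For example, a minimal sketch of supplying the password through a file instead of on the command line. The path /home/sqoop/.mysql.pwd is only an illustration, and whether Sqoop resolves the path against HDFS or the local filesystem depends on your Hadoop configuration, so treat the file:// prefix as an assumption to verify:

# create the password file without a trailing newline (see the warning above)
echo -n "123456" > /home/sqoop/.mysql.pwd
chmod 400 /home/sqoop/.mysql.pwd

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password-file file:///home/sqoop/.mysql.pwd \
--table sqoop_test

The echo -n avoids the trailing newline that the warning above calls out.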
Sqoop has built-in support for a variety of databases, including MySQL. Connection strings that start with jdbc:mysql:// are handled automatically by Sqoop. (The Supported Databases section of the documentation provides a complete list of databases with built-in support. For some databases, you may need to install the JDBC driver yourself.)
You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import and install the .jar file in the $SQOOP_HOME/lib directory on the client machine. (This is /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class that defines the entry point to the driver. For example, the driver class of MySQL's Connector/J library is com.mysql.jdbc.Driver. Refer to your database vendor's documentation to identify the main driver class. This class must be supplied to Sqoop with the --driver argument.
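As a sketch of this generic JDBC path (the PostgreSQL host, port, and database below are made up for illustration; the driver class shown is the PostgreSQL JDBC driver's, and your vendor's class will differ):

bin/sqoop import \
--connect jdbc:postgresql://pg-host:5432/sqoop_test \
--driver org.postgresql.Driver \
--username root \
--password 123456 \
--table sqoop_test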
Sqoop typically imports data in a table-centric manner. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in the database.
II. Basic Parameters
Argument | Description |
---|---|
--append | Append data to an existing dataset in HDFS |
--as-avrodatafile | Import data as Avro data files |
--as-sequencefile | Import data as SequenceFiles |
--as-textfile | Import data as plain text (default) |
--as-parquetfile | Import data as Parquet files |
--boundary-query | Boundary query used to create splits |
--columns <col,col,col…> | Columns to import from the table |
--delete-target-dir | Delete the import target directory if it exists |
--direct | Use the direct connector if it exists for the database |
--fetch-size | Number of entries to read from the database at once |
--inline-lob-limit | Set the maximum size of an inline LOB |
-m,--num-mappers | Use n map tasks to import in parallel |
-e,--query | Import the results of the given statement |
--split-by | Table column used to split work units; cannot be used together with --autoreset-to-one-mapper |
--split-limit | Upper limit for each split size; applies only to integer and date columns; for date or timestamp fields the value is in seconds |
--autoreset-to-one-mapper | Use one mapper if the table has no primary key and no split column is provided; cannot be used together with --split-by <col> |
--table | Name of the table to read |
--target-dir | HDFS destination directory |
--temporary-rootdir | HDFS directory for temporary files created during the import (overrides the default "_sqoop") |
--warehouse-dir | HDFS parent directory for the table destination |
--where | WHERE clause to use during the import |
-z,--compress | Enable compression |
--compression-codec | Hadoop compression codec to use (default gzip) |
--null-string | String to write for a null value in a string column |
--null-non-string | String to write for a null value in a non-string column |
--fields-terminated-by | Set the field separator character |
--lines-terminated-by | Set the end-of-line character |
--mysql-delimiters | Use MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' |
The --mysql-delimiters argument is a shorthand that uses mysqldump's default delimiters. If you use the mysqldump delimiters together with a direct-mode import (using --direct), very fast imports can be achieved.
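A sketch combining the two, assuming mysqldump is installed on the nodes running the map tasks (direct mode requires it):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--direct \
--mysql-delimiters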
3. Basic usage
1. Import mysql to HDFS
Run the commands as the hdfs user, i.e. prefix them with sudo -u hdfs.
I. Full table import
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
II. Free-form query import
If you want to import the results of a query in parallel, then each map task must execute a copy of the query, partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use with the -m or --num-mappers argument. This argument takes an integer value corresponding to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond what is available in your MapReduce cluster; tasks will run serially and will likely increase the time required to perform the import. Likewise, do not increase parallelism beyond what your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to the point where performance suffers.
Sqoop needs a criterion by which to split the workload when performing parallel imports; it uses a splitting column for this. By default, Sqoop identifies the primary key column of the table (if present) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if you have a table whose primary key column id has a minimum value of 0 and a maximum of 1000, and Sqoop is directed to use four tasks, Sqoop would run four processes that each execute SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values of the primary key are not uniformly distributed across its range, the tasks can become unbalanced. In that case you should explicitly choose a different column with the --split-by argument, for example --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column or has a multi-column key, you must manually choose a splitting column.
Users can override --num-mappers with the --split-limit option. Using the --split-limit argument limits the size of the splits created. If a split larger than the size specified in this argument would be created, the split size is reduced to fit this limit, and the number of splits changes accordingly. This affects the actual number of mappers: if the split size calculated from the supplied --num-mappers argument exceeds the --split-limit value, the number of actual mappers is increased. If the value specified for --split-limit is 0 or negative, the argument is ignored entirely and the split size is calculated from the number of mappers.
If a table has no primary key defined and no --split-by <col> is provided, the import will fail unless the number of mappers is explicitly set to 1 with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The --autoreset-to-one-mapper option is typically used together with the import-all-tables tool to automatically handle tables in a schema that have no primary key.
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select id,name,sex from sqoop_test where id>2 and $CONDITIONS'
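The example above uses a single mapper, which sidesteps splitting. A sketch of the same query imported with four parallel map tasks, where Sqoop replaces $CONDITIONS with a per-task range on the id column chosen via --split-by:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 4 \
--split-by id \
--fields-terminated-by "\t" \
--query 'select id,name,sex from sqoop_test where id>2 and $CONDITIONS'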
III. Conditional import
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id<=2" \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
IV. Import specified columns
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--where "id>=2" \
--columns name,sex \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
V. Transaction isolation level
By default, Sqoop uses the read-committed transaction isolation level in the mappers when importing data. This may not be ideal in all ETL workflows, and you may wish to relax the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use the read-uncommitted isolation level.
The read-uncommitted isolation level is not supported by all databases (for example, Oracle), so the --relaxed-isolation option may not work with every database.
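As a minimal sketch, the option is simply appended to an ordinary import (only meaningful if your database accepts read-uncommitted reads):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--relaxed-isolation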
VI. Incremental import
Argument | Description |
---|---|
--check-column (col) | Column to examine when deciding which rows to import (the column must not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR) |
--incremental (mode) | How Sqoop determines which rows are new; legal values for mode are append and lastmodified |
--last-value (value) | Maximum value of the check column from the previous import |
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
Each import generates a new file:
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental append \
--check-column id \
--last-value 5 \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t"
Each import merges new or updated data with the existing data:
sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--incremental lastmodified \
--check-column date \
--last-value "2021-07-26 11:53:50" \
--target-dir /user/company \
--num-mappers 1 \
--fields-terminated-by "\t" \
--merge-key id
You should use append mode when importing a table where new rows are continually added with increasing row id values. Specify the column containing the row id with --check-column. Sqoop imports rows whose check column value is greater than the one specified with --last-value.
The other table-update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp. Rows whose check column holds a timestamp more recent than the one specified with --last-value are imported.
At the end of an incremental import, the value that should be passed as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure that only new or updated data is imported. This is handled automatically by creating the incremental import as a saved job, which is the preferred mechanism for performing recurring incremental imports (a saved-job sketch follows below). For more information, see the section on saved jobs.
For example, when we use time increments:
--incremental lastmodified \
--check-column date_col \
--last-value "2021-08-02 17:17:23"
2. Import mysql to Hive
I. Basic parameters
Argument | Description |
---|---|
--hive-home <dir> | Override $HIVE_HOME |
--hive-import | Import tables into Hive (uses Hive's default delimiters if none are set) |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job fails if the target Hive table already exists |
--hive-table | Table name to use when importing into Hive |
--hive-drop-import-delims | Drop \n, \r, and \01 from string fields when importing into Hive |
--hive-delims-replacement | Replace \n, \r, and \01 in string fields with a user-defined string when importing into Hive |
--hive-partition-key | Name of the Hive field to partition on |
--hive-partition-value | String value to use as the partition key for this import into Hive |
--map-column-hive | Override the default SQL-type-to-Hive-type mapping for the configured columns. If the value contains commas, use URL-encoded keys and values, for example DECIMAL(1%2C%201) instead of DECIMAL(1, 1) |
II. Introductory demo
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table sqoop_test_hive
In fact, this happens in two steps: when Sqoop synchronizes to Hive, it first imports the data into HDFS and then transfers the data from HDFS into Hive.
It writes the data to a temporary HDFS directory first and then copies it into the /user/hive/warehouse/sqoop_test_hive directory.
III. Official warning
Official: If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from there.
Although Hive supports escape characters, it does not handle the escaping of newline characters. It also does not support the notion of enclosing characters that may contain field delimiters within an enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without relying on escaping and enclosing characters when working with Hive; this is due to the limitations of Hive's input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
If your database rows contain string fields that include Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters), Hive will have problems with data imported using Sqoop. You can use the --hive-drop-import-delims option to drop those characters on import and produce Hive-compatible text data, or the --hive-delims-replacement option to replace them with a user-defined string on import. These options should only be used if you use Hive's default delimiters and should not be used if different delimiters are specified.
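For example, a sketch of the Hive demo above with the problematic characters dropped at import time:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive \
--hive-drop-import-delims

Note that, per the warning above, no custom field delimiter is set here, so Hive's defaults apply.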
Sqoop imports NULL values as the string null by default. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (such as IS NULL) will not work correctly. You should append the arguments --null-string and --null-non-string for import jobs, or --input-null-string and --input-null-non-string for export jobs, if you wish to preserve NULL values properly. Because Sqoop uses these parameters in the generated code, you need to properly escape the value \N to \\N:
For example:
sqoop import ... --null-string '\\N' --null-non-string '\\N'
By default, the table name used in Hive is the same as that of the source table. You can control the output table name with the --hive-table option.
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Refer to the Hive documentation for more details on partitioning.
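For example, a sketch that loads the import into a single partition (the partition column dt and its value are made up for illustration):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive \
--hive-partition-key dt \
--hive-partition-value "2021-08-06"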
You can import compressed tables into Hive using the --compress and --compression-codec options. One downside of compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop automatically indexes the files for splitting and configures the new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
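A sketch of a compressed Hive import, assuming the hadoop-lzo package is installed and provides the com.hadoop.compression.lzo.LzopCodec class (the target table name is illustrative):

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--hive-import \
--hive-table sqoop_test_hive_lzo \
--compress \
--compression-codec com.hadoop.compression.lzo.LzopCodec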
3. Import mysql to HBase
I. Basic parameters
Argument | Description |
---|---|
--column-family <family> | Set the target column family for the import |
--hbase-create-table | If specified, create missing HBase tables |
--hbase-row-key | Input column(s) to use as the row key; a comma-separated list if the input table has a composite key |
--hbase-table | Target HBase table to populate instead of HDFS |
--hbase-bulkload | Enable bulk loading |
If the input table has a composite key, --hbase-row-key must be a comma-separated list of the composite key attributes. In this case, the HBase row key is generated by combining the values of the composite key attributes using an underscore as the separator. Note: Sqoop import of a table with a composite key works only if --hbase-row-key is specified.
If the target table and column family do not exist, the Sqoop job exits with an error. You should create the target table and column family before running the import. If --hbase-create-table is specified, Sqoop creates the target table and column family (if they do not exist) using the default parameters from the HBase configuration.
Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string into the target cell. Sqoop skips all rows that contain null values in all columns except the row key column.
To reduce the load on HBase, Sqoop can perform bulk loading rather than direct writes. To use bulk loading, enable it with --hbase-bulkload.
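A sketch illustrating both points: treating id and name as a composite row key (an assumption made only for illustration; the demo table below uses a plain id key) and enabling bulk loading:

bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id,name" \
--hbase-table "sqoop_test_hbase_bulk" \
--hbase-bulkload

With a composite key, each row key is formed as the id and name values joined by an underscore.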
Argument | Description |
---|---|
--accumulo-table | Target Accumulo table to populate instead of HDFS |
--accumulo-column-family | Set the target column family for the import |
--accumulo-create-table | If specified, create the Accumulo table if it is missing |
--accumulo-row-key | Input column to use as the row key |
--accumulo-visibility | (Optional) Visibility token to apply to all rows inserted into Accumulo; defaults to the empty string |
--accumulo-batch-size | (Optional) Size in bytes of Accumulo's write buffer; defaults to 4MB |
--accumulo-max-latency | (Optional) Maximum latency in milliseconds for the Accumulo batch writer; defaults to 0 |
--accumulo-zookeepers host:port | Comma-separated list of Zookeeper servers used by the Accumulo instance |
--accumulo-instance | Name of the target Accumulo instance |
--accumulo-user | Name of the Accumulo user to import as |
--accumulo-password | Password for the Accumulo user |
II. Introductory demo
Example:
Create the HBase table first:
hbase shell
create 'sqoop_test_hbase','info'
bin/sqoop import \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "sqoop_test_hbase" \
--split-by id
4. Export
That is, exporting from the big data cluster to a non-big-data cluster; note that HBase cannot be exported to MySQL.
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop generates UPDATE statements that replace existing records in the database, and in "call mode," Sqoop makes a stored procedure call for each record.
1. Syntax format
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
2. Common parameters
Argument | Description |
---|---|
--connect | Specify the JDBC connection string |
--connection-manager | Specify the connection manager class to use |
--driver | Manually specify the JDBC driver class to use |
--hadoop-mapred-home | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password | Set the authentication password |
--username | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file | Optional properties file providing connection parameters |
--relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers |
Export control parameters:
Argument | Description |
---|---|
--columns <col,col,col…> | Columns to export to the table |
--direct | Use the direct export fast path |
--export-dir | HDFS source path for the export |
-m,--num-mappers | Use n map tasks to export in parallel |
--table | Table to populate |
--call | Stored procedure to call |
--update-key | Anchor column(s) to use for updates; use a comma-separated list if there is more than one column |
--update-mode | How updates are performed when new rows with non-matching keys are found in the database; legal values for mode are updateonly (default) and allowinsert |
--input-null-string | String to be interpreted as null for string columns |
--input-null-non-string | String to be interpreted as null for non-string columns |
--staging-table | Table in which data is staged before being inserted into the destination table |
--clear-staging-table | Indicates that any data present in the staging table can be deleted |
--batch | Use batch mode for underlying statement execution |
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call) and the directory in HDFS that contains the source data.
By default, all columns in the table are selected for export. You can select a subset of columns and control their ordering with the --columns argument. This should be a comma-separated list of columns to export, for example --columns "col1,col2,col3". Note that columns not included in the --columns argument must have defined default values or allow NULL values; otherwise your database will reject the imported data, which in turn causes the Sqoop job to fail.
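A sketch of an export that stages the data first, so a failed job leaves no partial rows in the target. The staging table sqoop_test_stage is hypothetical and must already exist with the same schema as sqoop_test; the source directory is illustrative:

bin/sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/company \
--input-fields-terminated-by "\t" \
--staging-table sqoop_test_stage \
--clear-staging-table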
3. Export hive to mysql
bin/sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/hive/warehouse/sqoop_test_hive \
--input-fields-terminated-by "\t"
4. Export HDFS data to mysql
vim test.txt
1,Zhang San,male,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test.txt \
--input-fields-terminated-by ","
5. Update export: only existing data is updated
vim test1.txt
1,Zhang San,unknown,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test1.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode updateonly
Updateonly mode: Only existing data is updated. No insert is performed to add new data
6. Incremental export
vim test3.txt
1,Zhang San,female,2021-08-03 15:05:55
2,Li Si,female,2021-08-02 15:05:55
3,Zhao Liu,male,2021-08-01 15:05:55
4,Wang Wu,male,2021-07-01 15:05:51
sudo -u hdfs hdfs dfs -put ./test3.txt /user/
Copy the code
sqoop export \
--connect jdbc:mysql://10.0.10.118:3306/sqoop_test \
--username root \
--password 123456 \
--table sqoop_test \
--num-mappers 1 \
--export-dir /user/test3.txt \
--input-fields-terminated-by "," \
--update-key id \
--update-mode allowinsert
Allowinsert mode: Update existing data and add non-existing data
5. Script export (the same approach works for import)
vim sqp_export.opt
Script information:
export
--connect
jdbc:mysql://10.0.10.118:3306/sqoop_test
--username
root
--password
123456
--table
sqoop_test
--num-mappers
1
--export-dir
/user/hive/warehouse/sqoop_test_hive
--input-fields-terminated-by
"\t"
Execute command:
sudo -u hdfs sqoop --options-file ./sqp_export.opt