CDH Hadoop directory:
Hadoop Deployment (3) _ Virtual machine building CDH full distribution mode
Hadoop Deployment (4) _Hadoop cluster management and resource allocation
Hadoop Deployment (5) _Hadoop OPERATION and maintenance experience
Hadoop Deployment (8) _CDH Add Hive services and Hive infrastructure
Hadoop Combat (9) _Hive and UDF development
Sqoop syntax description
Sqoop official learning documentation:
http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.0/
Sqoop import is defined relative to HDFS: it imports data from a relational database into HDFS.
Place the MySQL JDBC driver jar in sqoop/lib.
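If the connector jar is not there yet, copying it in is enough. A minimal sketch (the connector version and the CDH parcel path are assumptions; adjust to your installation):
# copy the MySQL JDBC driver into Sqoop's lib directory (version and path are examples)
cp mysql-connector-java-5.1.34-bin.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/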
Case 1: Import data into HDFS
# /root/project
mkdir sqoop_prj
cd sqoop_prj/
mkdir DBS
cd DBS/
touch DBS.opt
hadoop fs -mkdir /user/hive/warehouse/DBS
which sqoop
sqoop --options-file aa.opt runs an opt file without passing parameters on the command line. -m specifies the number of map tasks; increase it when a large amount of table data is extracted. If -m is set to 5, five map tasks run in parallel and five files are generated in HDFS.
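As a minimal sketch, DBS.opt could look like the following; a Sqoop options file holds one option or argument per line, and the values here simply mirror the script below:
# DBS.opt -- one option or argument per line
import
--connect
jdbc:mysql://cdhmaster:3306/hive
--username
root
--password
123456
--table
DBS
-m
1
--target-dir
/user/hive/warehouse/DBS
Run it with sqoop --options-file DBS.opt.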
The advantage of wrapping Sqoop in a shell script is that you can pass parameters.
#! /bin/sh
. /etc/profile

# remove the target directory so the import can be re-run
hadoop fs -rmr /user/hive/warehouse/DBS

sqoop import --connect "jdbc:mysql://cdhmaster:3306/hive" \
--username root \
--password 123456 \
-m 1 \
--table DBS \
--columns "DB_ID,DESC,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE" \
--target-dir "/user/hive/warehouse/DBS"

# optional filters:
#--where "length(DESC)>0" \
#--null-string ''
Bug: driver issue.
ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@3c1a42fa is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
Fix: add a driver parameter; see:
https://stackoverflow.com/questions/29162447/sqoop-import-issue-with-mysql
https://stackoverflow.com/questions/26375269/sqoop-error-manager-sqlmanager-error-reading-from-database-java-sql-sqlexcept
--driver com.mysql.jdbc.Driver
A warning appears after the parameter is added:
WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
Bug: SQL syntax problem.
Error: java.io.IOException: SQLException in nextKeyValue
Remove the reserved-keyword column DESC; see:
https://community.cloudera.com/t5/Data-Ingestion-Integration/sqoop-throws-SQLException-in-nextKeyValue/m-p/42653
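Here that simply means dropping DESC from the column list; the corrected option would look like:
--columns "DB_ID,DB_LOCATION_URI,NAME,OWNER_NAME,OWNER_TYPE"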
Case 2: Write data to a regular Hive table (non-partitioned)
# mysql
create table test (id int, pdate date);
insert into test(id, pdate) values (1, '2017-11-05');
insert into test(id, pdate) values (2, '2017-11-06');
insert into test(id, pdate) values (3, '2017-11-05');
insert into test(id, pdate) values (4, '2017-11-06');
# hive
drop table if exists test;
create table test(id int, pdate string);
--hive-import: write the data into a Hive table; the option takes no value.
--hive-overwrite: overwrite any existing data in the Hive table.
--hive-table test: the target Hive table.
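Putting these options together, a minimal sketch of the Case 2 import (assuming the MySQL test table lives in the test database and reusing the credentials from Case 1):
sqoop import --connect "jdbc:mysql://cdhmaster:3306/test" \
--username root \
--password 123456 \
--table test \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-table test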
Case 3: Create a Hive partitioned table, so (SalesOrder)
Notes:
1. Which field should the partition use? The order creation time, not last_modify_time.
Q: Data is extracted into Hive partitions by creation time, but an order's status may still change within 45 days. How is the Hive data kept in sync after an order status changes?
A: Hive does not support UPDATE, so every day the orders created in the last 15 days are re-imported into their Hive partitions. Hive is used for statistical analysis, and what usually matters most is yesterday's situation.
# cdhmaster
cd ~
mysql -uroot -p123456 < so.sql
ERROR 1046 (3D000) at line 3: No database selected
vi so.sql
use test;
mysql -uroot -p123456 < so.sql
# hive
CREATE TABLE so (
order_id bigint,
user_id bigint,
order_amt double ,
last_modify_time string
) partitioned by (date string);
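The import command for Case 3 is not shown above; the sketch below loads each of the last 15 days into its own partition, assuming the MySQL so table is in the test database, its data columns match the Hive table, and it has a (hypothetical) order_date column holding the creation date:
#! /bin/sh
. /etc/profile
# re-import the last 15 days of orders, one Hive partition per day
# (order_date is a hypothetical creation-date column in the MySQL table)
for i in $(seq 1 15); do
day=$(date -d "$i days ago" +%Y-%m-%d)
sqoop import --connect "jdbc:mysql://cdhmaster:3306/test" \
--username root \
--password 123456 \
--table so \
--columns "order_id,user_id,order_amt,last_modify_time" \
--where "order_date = '$day'" \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-table so \
--hive-partition-key date \
--hive-partition-value "$day"
done
Because Hive has no UPDATE, re-running this every day simply overwrites the partitions for the most recent 15 days, which matches the note above.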
After Sqoop runs, note:
- A directory named after the source table is created under the user's HDFS home directory, for example /user/root/so. If the import into Hive succeeds, this directory is deleted automatically.
- A Java file is generated in the directory where the command is executed; it is the MapReduce job code generated from the opt file.
- With --hive-import, Sqoop adapts to the Hive table automatically, whatever its column delimiter is.
Encapsulating Sqoop in an extraction framework:
- Create a MySQL configuration table that records which tables to extract and the related information;
- Generate the opt file from that MySQL configuration;
- From Java, call the local command via the Process class: sqoop --options-file <opt file>.
sqoop-imp -task 1 "2015-04-21"
sqoop-imp "2015-04-21"
Sqoop export
# mysql test
create table so1 as
select * from so where 1 = 0;
For export, the source must be HDFS/Hive and the target a relational database.
Change the date and last_modify_time columns of table so1 to varchar.
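The export command itself is not shown above; a minimal sketch, assuming the Hive warehouse path of one day's partition, the default Hive field delimiter \001, and the same credentials as before:
# export one day's partition from the Hive warehouse into MySQL table so1 (paths are examples)
sqoop export --connect "jdbc:mysql://cdhmaster:3306/test" \
--username root \
--password 123456 \
--table so1 \
--columns "order_id,user_id,order_amt,last_modify_time" \
--export-dir "/user/hive/warehouse/so/date=2015-04-21" \
--input-fields-terminated-by '\001' \
-m 1
--columns is listed because the partition column date is not stored in the data files, so only the remaining columns of so1 are populated.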
Sqoop tool encapsulation
Flow ETL extracts all of the configured tables. Example invocations:
flow etl -task 1
flow etl -task 1 2017-01-01
- Read the MySQL tables extract_to_hdfs and extract_db_info, and generate the .opt files from that configuration information.
- Call the Linux command from Java's Process class: sqoop --options-file <opt file>.
Package flow.jar in IDEA. If the error 'D:/Java/idea/IdeaProjects/Hive_Prj/src/META-INF/MANIFEST.MF already exists in the VFS' appears, delete the META-INF folder.
db.properties holds the configuration for accessing the MySQL database.
extract_db_info holds the configuration of the databases from which the tables are extracted.
Copy the jar to /root/project/lib.
Create the flow command under /root/project/bin.
Configure FLOW_HOME:
vi /etc/profile
export FLOW_HOME=/root/project
source /etc/profile
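With FLOW_HOME set, the flow command under /root/project/bin can be a thin wrapper around the packaged jar. A minimal sketch (the jar name flow.jar and the working-directory convention are assumptions):
#! /bin/sh
# /root/project/bin/flow -- wrapper around the packaged jar (jar name is an assumption)
. /etc/profile
cd "$FLOW_HOME"
java -jar "$FLOW_HOME/lib/flow.jar" "$@"
Make it executable with chmod +x and, if you want to call flow directly, add /root/project/bin to PATH.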
Configure the properties:
# FLOW_HOME
mkdir conf
vi db.properties
db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://cdhmaster:3306/test
db.user=root
db.password=123456
Configure the sqoop option directory sqoop/opts.
# FLOW_HOME
mkdir -p sqoop/opts
If you want logging at execution time, configure log4j when developing the jar.
ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@310d117d is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
Fix: add --driver com.mysql.jdbc.Driver, as in Case 1.
The scheduled job can be adapted accordingly, for example sh ./so.sh:
# /root/project/sqoop_prj/DBS
vi so.sh

yesterday=$(date -d "1 day ago" +%Y-%m-%d)
flow etl -task 1 "$yesterday"
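To run it every night, so.sh could be driven from cron; a sketch (the schedule and log path are assumptions):
# crontab -e  -- run at 01:00 every day; the log path is an example
0 1 * * * sh /root/project/sqoop_prj/DBS/so.sh >> /root/project/logs/so.log 2>&1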
You might also want to read:
Data analysis/data mining/machine learning
Python Data Mining and Machine Learning — Communication Credit Risk Assessment (1) — Reading data
Python Data Mining and Machine Learning — Communication Credit Risk Assessment (part 2) — Data preprocessing
Python Data Mining and Machine Learning — Communication Credit Risk Assessment (3) — Feature Engineering
Python Data Mining and Machine Learning — Communication Credit Risk Assessment In action (4) — Model training and tuning
The crawler
Simple single-page crawler for Python
Make the crawler bigger
The Python crawler is stored in Python
Python crawler combat to climb lianjia Guangzhou housing prices _04 Lianjia simulation login (record)
Sogou Thesaurus crawler (1) : Basic crawler architecture and crawler thesaurus classification
Sogou Thesaurus crawler (2) : the running process of the basic crawler framework
The WeChat official account "Data Analysis" shares the self-cultivation of data scientists. Since we have met, let's grow together.
For reprints, please credit the WeChat official account "Data Analysis".
Reader communication telegraph group:
https://t.me/sspadluo