Hive (hive.apache.org) is an Apache top-level project.
Why Hive
Not everyone who does data analysis is a programmer, and writing MapReduce jobs is costly. Hive's answer is to let you analyze data in HDFS with SQL instead.
The metadata is the most valuable part: Spark, for example, reuses Hive's MetaStore.
Introduction to Hive
- A data warehouse
- An interpreter (SQL), a compiler (translates HQL to MapReduce), an optimizer, etc.
- Hive metadata (table metadata) is stored in MySQL
The metadata of the HDFS data itself lives in the NameNode.
Hive architecture
- User interfaces:
- Command line (CLI): starting the CLI launches a Hive instance that acts as a MetaStore client to access the metadata
- JDBC
- Web UI (hardly anyone uses it)
- Hive stores its metadata in a relational database such as MySQL. The metadata includes the table name, its columns, partitions, and attributes; whether the table is a managed table (internal, self-managed) or an external table; and the directory where the table data is stored.
- A compiler, optimizer, and executor complete the parsing, optimization, and execution of HQL.
- Hive data is stored in HDFS, and most queries run as MapReduce jobs. (Queries like select * from tbl do not generate MapReduce tasks.)
Hive deployment modes
Embedded: use Derby, the built-in database
An embedded metastore database is mainly used for unit tests. Only one process can connect to the metastore database at a time, so it is not a practical production setup, but it works well for unit tests.
Local: connect to MySQL over the network
In this configuration, you would use a traditional standalone RDBMS server. The following example configuration will set up a metastore in a MySQL server. This configuration of metastore database is recommended for any real use.
Remote: access metadata through a MetaStore server via Thrift
In a remote metastore setup, all Hive clients connect to a metastore server, which in turn queries the datastore (MySQL in this example). The metastore server and client communicate using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift metastore server with hive --service metastore (default port 9083).
Setup
The client connects to the MetaStore service, and the MetaStore service connects to the MySQL database to access metadata. With the MetaStore service in between, multiple Hive clients can connect at the same time, and they only need the MetaStore address, not the user name and password of the MySQL database.
- Use two machines: one as the MetaStore server and one as the client. The MetaStore server needs the MySQL driver JAR.
- Rename hive-default.xml.template to hive-site.xml, clear out the default configuration (for example with the vi command .,$-1d), and add the following.
Server-side hive-site.xml:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive_remote/warehouse</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://node01:3306/hive_remote?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
</property>
Client-side hive-site.xml:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://node03:9083</value>
</property>
On the server side, initialize the metadata database with schematool -dbType mysql -initSchema.
Hive SQL
- The default database is named default, and it cannot be dropped.
- desc formatted psn: the formatted option shows the table's detailed information. The MetaStore database also records Hive table and column information.
- insert into xxx values ... runs as a map-only operation.
- The default field separator is ^A (\001).
Hive data types
- String type: string
- Complex types such as array, map, and struct are also supported
DDL
- See the official documentation
- Create-table statement
Sample data (data.txt):
1,Xiao Ming 1,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
2,Xiao Ming 2,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
3,Xiao Ming 3,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
4,Xiao Ming 4,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
5,Xiao Ming 5,lol-movie,Beijing:Mashibing-Shanghai:Pudong
6,Xiao Ming 6,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
7,Xiao Ming 7,lol-book,Beijing:Mashibing-Shanghai:Pudong
8,Xiao Ming 8,lol-book,Beijing:Mashibing-Shanghai:Pudong
9,Xiao Ming 9,lol-book-movie,Beijing:Mashibing-Shanghai:Pudong
create table psn
(id int, name string, likes array<string>, address map<string,string>)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
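Once data is loaded (see below), the array and map columns can be read by index and key; a quick sketch against the sample data above:

-- likes[0] is the first array element; address['Beijing'] looks up a map key
select id, name, likes[0], address['Beijing'] from psn;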
- View table information with desc xxx; see the detailed information with desc formatted xxx
- Inserting data (DML)
- load is a pure copy operation. With LOCAL, the file is uploaded from the local filesystem; without LOCAL, the path refers to data already in HDFS.
-- Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
-- Example:
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn;
If you upload a file directly to /user/hive_remote/warehouse/psn, the psn table picks it up: Hive checks data on read rather than on write (schema-on-read).
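A quick way to verify this, assuming the warehouse directory from the configuration above:

hdfs dfs -put data.txt /user/hive_remote/warehouse/psn

Then in Hive the new rows appear immediately:

select * from psn;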
- insert overwrite local directory 'local directory' select ... writes the query result into a local directory
- insert overwrite table tablename select ... writes the query result into a table
Managed vs. external tables
create external table psn1
(id int, name string, likes array<string>, address map<string,string>)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
-- location corresponds to a directory in HDFS
-- as long as that directory contains data, it is mapped into the Hive table
location '/data';
- Managed (internal) table: create table xxx; dropping the table also deletes its data. Tables are managed by default.
- External table: create external table xxx; dropping the table deletes only the metadata.
- With external tables, the data can exist before the table does.
Partitioned tables
create table psn2
(id int, name string, likes array<string>, address map<string,string>)
-- the partition column is itself a column of the table
partitioned by (gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
-- load into a specified partition
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn2 partition(gender='man');
Multi-level partitioning: gender and age
create table psn3
(id int, name string, likes array<string>, address map<string,string>)
partitioned by (gender string, age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
-- load must specify both partition columns; their order in the PARTITION clause does not matter
load data local inpath '/opt/hive-2.3.5/data.txt' into table psn3 partition(gender='man', age=12);
Copy the code
- Adding a partition: on a multi-level partitioned table, all partition columns must be specified
alter table psn3 add partition(gender='girl');
-- FAILED: ValidationFailureSemanticException partition spec {gender=girl} doesn't contain all (2) partition columns
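With every partition column specified, the statement succeeds:

alter table psn3 add partition(gender='girl', age=12);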
- Dropping a partition: specifying only some of the partition columns removes all matching partitions
alter table psn3 drop partition(age=12);
-- Dropped the partition gender=girl/age=12
-- Dropped the partition gender=man/age=12
-- OK
- Control partition granularity in practice. Partitioning exists to improve query efficiency, for example partitioning by day, as sketched below.
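A day-partitioned table might look like this (the table name and the dt column are illustrative):

create table logs (line string)
partitioned by (dt string);
-- a query filtering on dt only scans that day's directory
select * from logs where dt = '2024-01-01';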
Repairing partitions
If partition directories are created manually in HDFS, there are no corresponding records in the metadata, so the partitions need to be repaired with msck repair:
hive> msck repair table psn4;
OK
Partitions not in metastore: psn4:age=10 psn4:age=12
Repair: Added partition to metastore psn4:age=10
Repair: Added partition to metastore psn4:age=12
Time taken: 0.313 seconds, Fetched: 3 row(s)
Dynamic partitioning
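With dynamic partitioning, the partition value comes from the query result instead of being hard-coded. A minimal sketch (the settings are standard Hive parameters; the source table psn_src is hypothetical):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- gender is filled in per row from the last column of the select
insert into table psn2 partition(gender)
select id, name, likes, address, gender from psn_src;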
- Deleting and updating data requires extra configuration, as in the sketch below.
- Hive does support transactions, but there are no commit or rollback operations, and the table must also be bucketed, among other requirements.
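A hedged sketch of what enabling transactions typically involves (exact settings vary by Hive version):

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- transactional tables must be bucketed and stored as ORC
create table psn_tx (id int, name string)
clustered by (id) into 2 buckets
stored as orc
tblproperties('transactional'='true');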
Hive SerDe
SerDe stands for Serializer/Deserializer: it controls how rows are serialized when written and deserialized when read, for example with a regex-based SerDe.
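For example, a table backed by the built-in RegexSerDe; the log format and the regex below are illustrative only:

create table logtbl (host string, ts string, request string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([^ ]*) \\[(.*)\\] \"(.*)\""
);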
HiveServer2
- To connect to Hive over JDBC, HiveServer2 is required.
HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results (a more detailed intro here). The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
HiveServer2 setup and use
Data volume = number of records × size of each record.
Why HiveServer2? HiveServer2 acts as a server (and as a client of the MetaStore server). It lets multiple clients analyze data through it, for example Java JDBC programs or the Beeline client (which also connects via JDBC). Such access is generally restricted to queries. Both the Hive CLI and HiveServer2 find the metadata service through the configuration file.
The official site recommends HiveServer2 for production environments.
When setting up the HiveServer2 service, you need to grant HDFS proxy-user rights to the superuser:
<!-- Add the following to core-site.xml in the HDFS cluster configuration -->
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<!-- Restart the cluster after configuring, or run the following on the NameNode nodes -->
hdfs dfsadmin -fs hdfs://node1:8020 -refreshSuperUserGroupsConfiguration
hdfs dfsadmin -fs hdfs://node2:8020 -refreshSuperUserGroupsConfiguration
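After the proxy-user configuration takes effect, start the service and connect with Beeline; node03 and the default port 10000 below are assumptions based on the earlier setup:

hiveserver2    # or: hive --service hiveserver2
beeline -u jdbc:hive2://node03:10000 -n root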
Hive functions
Just look it up when you use it.
name  subject  score
zs    yuwen    100
zs    wuli     20
zs    yingyu   80
ls    yuwen    98
ls    wuli     39
ls    yingyu   67
========================
name  yuwen  yingyu  wuli
zs    100    80      20
ls    98     67      39
========================
select name,
sum(case when subject = 'yuwen' then score else 0 end),
sum(case when subject = 'yingyu' then score else 0 end),
sum(case when subject = 'wuli' then score else 0 end)
from cj
group by name;
- UDAF: many-to-one aggregate functions
- UDTF: one-to-many functions, for example explode
- UDF: one-in, one-out user-defined functions
- For example, data-masking functions, a function that converts 15-digit ID card numbers to 18 digits, and so on.
-- Use Hive for Wordcount
select m.word, count(m.word) from
(select explode(split(str, ' ')) as word from wc) as m
group by m.word;
Hive parameters and running modes
- Hive queries print no column header by default; add a parameter when starting the Hive CLI:
hive --hiveconf hive.cli.print.header=true
This is valid only for the current session.
- In the CLI, the set command shows all configurations. Modify one in the CLI with set hive.cli.print.header=true
- The hidden file .hivehistory in the user's home directory keeps the history of Hive operations.
- You can create a hidden file .hiverc in the user's home directory; it may contain statements and set commands, and it is read and executed when Hive starts, as in the example below.
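For example, a one-line .hiverc that always prints column headers, reusing the setting shown above:

set hive.cli.print.header=true;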