This is the 8th day of my participation in the genwen Challenge

Hive definition

Hive is a data warehouse technology used to query and manage large data sets stored in a distributed environment. A data warehouse built on HDFS and MapReduce of Hadoop to manage and query and analyze structured/unstructured data;

  • Use HQL (SQL-like statement) as query interface;
  • HDFS is used as the underlying storage.
  • Using MapReduce as the execution layer, HQL statements are translated into M/R jobs and executed in Hadoop

Hive tables are directories or folders of HDFS. Separate folders by table names. If it is a partitioned table, the partition value is a subfolder;

Hive Overview – Metadata, data, and directories

  • Metadata is stored in DB (Derby/MySQL), including the name of the table, the columns and partitions of the table and their attributes. The attributes of the table include whether it is an external table and the directory where the data of the table resides.
  • The data is in the cluster directory:
  1. Internal table: /user/hive/warehouse/ table name;
  2. External table: user – defined directory;
  • The data of the table is the file (under the corresponding folder of the table); Uploading a file is equivalent to uploading data to a data table; There can be more than one file; Table partitions correspond to subdirectories;

Hive Overview – Hive Shell

The Hive Shell converts HiveQL queries into a series of MapReduce jobs for parallel processing and returns the processing results.

Hive organizes data in the form of RDBMS tables and provides metadata or table schema for displaying data stored on Hadoop. Stored in an RDBMS database as metaStore.

Hive usage – Differences between internal and external tables

Similarities: Metadata needs to be specified; All support partitioning

Difference: Actual data is stored in different locations

  1. The inner table. The actual data is stored in the data warehouse directory (default cluster /user/hive/warehouse). When a table is deleted, data and metadata in the table are deleted simultaneously.
  2. The external table. The actual data is stored in the HDFS path specified in the location statement and will not be moved to the database directory. If you drop an external table, only metadata is deleted; data in the table is not deleted.

Hive uses – Partition

In real projects, it is common to “design the database according to the schema of the day table! A Hive partition is similar to an index of the corresponding partition column in a database. A partition in a Hive table corresponds to a directory in the table. Data in all partitions is stored in their respective subdirectories

For example, if htable contains two partitions, DS and City, the HDFS directories of the same date and different cities are as follows:

  • /datawarehouse/htable/ds=20100301/city=GZ
  • /datawarehouse/htable/ds=20100301/city=BJ

Hive uses – Bucket Bucket

When a bucket hashes a specified column, it splits data according to the hash value. Each bucket corresponds to a file. For example, if the attribute column user is distributed among 32 buckets, the corresponding files with hash values 0 and 10 are as follows:

  • /datawarehouse/htable/ds=20100301/city=GZ/part-00000
  • /datawarehouse/htable/ds=20100301/city=GZ/part-00010

Hive usage – Partition, bucket, and index

  • The biggest difference between indexes and partitions is that indexes do not split tables. Partitions split tables.
  • The biggest difference between partitions and buckets is that buckets randomly split tables, while partitions are non-randomly split tables.

Hive Uses – Hive table DDL operations

  1. The Create/Drop/Alter database
  2. The Create/Drop/Truncate table
  3. Alter table/partition/column
  4. The Create/Drop/Alter the view
  5. The Create/Drop/Alter index
  6. The Create/Drop function
  7. Create/Drop/Grant/Revoke Role and permission

Hive Uses – Hive table DML operations

  1. Load data from files to Hive tables
  2. Select the operating
  3. Insert the SELECT query result into the Hive table
  4. Write the select query results to a file
  5. Hive table ACID transaction feature

Hive uses – Hive data type

Basic data types:

  • Tinyint/smallint/int/bigint integer types
  • Float /double Floating point type
  • Boolean Indicates the Boolean type
  • String/vARCHar /char The value is a string

Complex data types:

  • Array: An array type consisting of a series of elements of the same data type
  • Map: Collection type, containing key->value key-value pairs, through which elements are accessed.
  • Struct: Structural type that can contain elements of different data types that can be “dot-synched” to get the desired elements.

Time type:

  • Date: (year month Date)
  • Timestamp: is a Unix time offset
  • select unix_timestamp(); View the time offset of the system

Hive uses the – Hive table operation syntax

Create a table

  • External External table, similar to the CSV engine of mysql
  • Partitioned BY Specifies a partition field
  • Clustered by sorted by Can be used to bucket a column for tables and partitions, and sorted by can be used to sort a field
  • Row format delimited fields terminated by ‘\t’ Specifies the separators and data row separators between fields in the data row
  • Stored as Specifies the data file format: TextFile Sequence RCFile InputFormat (user-defined InputFormat class)
  • Location Specifies the HDFS directory where the data file is stored

Internal TABLE creation TABLE instruction (example) CREATE TABLE POkes (foo INT, bar STRING) Row format Delimited fields terminated by ‘\t’;

CREATE external TABLE ext_pokes (foo INT, Bar STRING) row format delimited fields terminated by ‘\t’ location ‘/data/extpokes’

Delete table

  • drop table [IF EXISTS] table_name
  • Metadata and table data files are deleted when internal tables are deleted
  • When an external table is deleted, only metadata is deleted

Modify the table

Hive uses – HiveQL to load data

LOAD file DATA :(local or HDFS) LOAD DATA local INPATH ‘/user/myname/kv2. TXT ‘OVERWRITE INTO TABLE invites PARTITION (ds=’2008-08-15’);

Select data from the specified table and insert it into another table

FROM src 
INSERT OVERWRITE TABLE dest1 SELECT src.* 
WHERE src.key < 100 
INSERT OVERWRITE TABLE dest2 SELECT src.key, 
src.value WHERE src.key >= 100 and src.key < 200 
Copy the code