1. Hive database
A database in Hive is essentially just a directory, or namespace, but the concept is useful on clusters with many users and groups. First, it avoids table name conflicts; second, it mirrors the database concept in relational databases: a set of tables, or a logical group of tables, which makes it easy to understand.
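For example, separate databases can be created per team or project so that the same table name does not collide. A minimal sketch, with hypothetical database and table names:

create database if not exists sales_db comment 'tables owned by the sales team';
use sales_db;
create table orders (id int, amount double);
-- the same table name can exist in another database without conflict
create database if not exists marketing_db;
create table marketing_db.orders (id int, campaign string);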
2. Hive table
A Table in Hive is similar in concept to a table in a relational database. Each table has a corresponding directory in which Hive stores its data. If no location is specified for the table, Hive uses the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file, whose default value is usually /user/hive/warehouse, and all table data (excluding external tables) is stored under this directory.
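As a quick check, the current warehouse location can be printed from the Hive CLI or Beeline, and a table's actual storage directory can be confirmed once it exists. A minimal sketch; some_table is just a placeholder name:

set hive.metastore.warehouse.dir;
-- after a table is created, the Location field in the output shows where its data lives
describe formatted some_table;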
Hive tables are classified into internal (managed) tables and external tables. Internal tables are managed by Hive both logically (the metadata) and physically (the data): when an internal table is created, its data is stored in the table's own directory. External tables, on the other hand, are managed only logically; the new table simply points to an external directory. Likewise, when an external table is deleted, the external directory is not deleted, only the reference and the table definition. Consider the following statements:
create table stu (name string);
load data inpath '/user/stu/stu.txt' into table stu;
The statements above move hdfs:///user/stu/stu.txt into the table's directory, hdfs:///user/hive/warehouse/stu. The load is very fast because Hive only moves the data into the corresponding directory and does not check whether the data conforms to the defined schema; that check happens at read time (Schema On Read). When stu is dropped with the drop statement, the data and the table metadata are both removed and no longer exist:
drop table stu;
For external tables, the creation and deletion of the data are not controlled by Hive at all; Hive does not manage such data. The location of the data is specified at creation time:
create external table external_table (name string)
location '/user/external/external_table';
load data inpath '/user/external/external_table.txt' into table external_table;
When external is specified, Hive does not move the data into its warehouse directory. In fact, Hive does not even verify that the external table's directory exists, which allows the data to be created after the table is created. When an external table is deleted, Hive deletes only the metadata, not the data stored in HDFS.
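A minimal sketch of that behavior, reusing the external_table defined above: dropping it removes only the table definition, and the files under /user/external/external_table stay in HDFS, so recreating the table at the same location makes the data queryable again.

drop table external_table;
-- the files in HDFS are untouched; re-attach them by recreating the table at the same location
create external table external_table (name string)
location '/user/external/external_table';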
Should you choose an internal table or an external table? In most cases the distinction is not clear-cut. If all processing of the data takes place in Hive, internal tables are preferred. But if Hive and other tools work on the same data set, external tables are more suitable. A common pattern is to use external tables to access the initial data stored in HDFS (often created by other tools), then use Hive to transform the data and put the results in internal tables; a sketch of this pattern appears below. Conversely, external tables can also be used to export Hive processing results for use by other applications. Another scenario for external tables is associating multiple schemas with a single data set.
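A minimal sketch of that pattern, assuming hypothetical paths and table names: raw tab-separated files landed by another tool are exposed through an external table, then cleaned into an internal table with an insert-select.

-- raw data written by another tool, exposed without moving it
create external table raw_events (line string)
location '/data/ingest/events';
-- transformed copy managed entirely by Hive
create table clean_events (event_time string, user_id string);
insert overwrite table clean_events
select split(line, '\t')[0], split(line, '\t')[1] from raw_events;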
3. Partition and bucket
Hive divides a table into partitions based on partition fields. Partitioning makes queries over a slice of the data faster. A table or partition can be further divided into buckets. Bucketing adds extra structure to the raw data that can be used for efficient queries; for example, bucketing by user ID can make user-based queries very fast.
3.1. Partition
Assume a log table in which each record has a timestamp. If the table is partitioned by time, records from the same day end up in the same partition. Querying one day or a few days of data is then very efficient, because only the files of those partitions need to be read; at the same time, partitioning does not make queries that span many partitions inefficient. Partitioning can also use multiple dimensions: for example, after partitioning by date, the data can be further partitioned by country. Partitions are defined with a partitioned by clause when creating a table, which takes a list of column definitions:
create table partition_table (id int, name string) partitioned by (data string, country string);
When importing data into a partitioned table, the partition values must be explicitly specified:
load data inpath '/user/root/data' into table partition_table partition (data = '2020-11-01',country = 'cn');
In the file system, each partition is a subdirectory at the next level under the table's directory (for example, data=2020-11-01/country=cn). In actual SQL, specifying the partition in the query greatly improves efficiency; the following statement scans only the data=2020-11-01/country=cn directory:
select id , name from partition_table where data = '2020-11-01' and country = 'cn';
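The partition layout can also be inspected directly from Hive with a show partitions statement (a minimal sketch; the output lists whatever partitions have actually been loaded):

show partitions partition_table;
-- typical output, one line per partition directory:
-- data=2020-11-01/country=cn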
3.2. Bucket
Buckets are usually used in tables or partitions for two reasons. First, efficient queries: bucketing adds extra structure to the table, and Hive can use this structure to improve query efficiency. For example, if two tables are bucketed on the same field, a map-side join can be performed efficiently when they are joined, provided the join field is the bucketing field. Second, efficient sampling: when analyzing a large data set it is often necessary to inspect and analyze a sample of the data, and bucketing makes such sampling efficient. In Hive, buckets are specified with a clustered by clause when creating a table:
create table bucketed_table (id int, name string)
clustered by (id) into 4 buckets;
This specifies that the table is divided into four buckets based on the id field. Hive determines which bucket a row goes into from the hash of the field modulo the number of buckets, so each bucket is effectively a random sample of the whole data set. In a map-side join, the two tables are bucketed on the same field, so when processing a bucket of the left table, data can be pulled directly from the corresponding bucket of the right table for the join. The two tables in a map-side join do not need exactly the same number of buckets; one just needs to be a multiple of the other. Note that Hive does not verify that the data actually conforms to the bucketing in the table definition; an error is reported only when an exception occurs during a query. A better approach is therefore to let Hive handle the bucketing itself (set hive.enforce.bucketing to true).
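A minimal sketch of that approach: with hive.enforce.bucketing enabled, populating the table through an insert-select lets Hive write the four bucket files, and a tablesample clause can then read a single bucket for sampling. The source table partition_table is reused from above purely for illustration.

set hive.enforce.bucketing = true;
-- let Hive distribute rows into the 4 buckets defined on bucketed_table
insert overwrite table bucketed_table
select id, name from partition_table;
-- sample roughly a quarter of the data by reading only the first bucket
select id, name from bucketed_table tablesample(bucket 1 out of 4 on id);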