1. Introduction
Hive is a data warehouse built on Hadoop. It maps structured data files into tables and provides SQL-like query functions. SQL statements used for query are converted into MapReduce jobs and submitted to Hadoop for running.
2. Hive features
- The operation interface adopts SQL-like syntax to avoid writing MapReduce, reduce the learning cost of developers, and provide the ability of rapid development.
- Hive has a high execution latency. Therefore, Hive is commonly used for data analysis and does not require high real-time performance.
- Hive also has a high execution delay. Hive has the advantage of processing large data, but not small data.
- Hive supports user-defined functions. Users can implement user-defined functions based on their own requirements.
3. Hive system architecture
3.1 User Interface: Client
CLI (Hive shell), JDBC/ODBC(Java to Access Hive), WEBUI (Browser to Access Hive)
3.2 Metadata: Metastore
Metadata includes the table name, the database to which the table belongs, the owner of the table, the column/partition fields, the type of the table (inside and outside), and the directory where the table data resides.
By default, it is stored in the Derby database (single user access only). MySQL is recommended to store Metastore.
3.3 Driver: Driver
-
SQL Parser: Converts SQL strings into abstract syntax tree AST (usually completed with third-party libraries such as ANTLR), and analyzes AST syntax, such as whether tables exist, fields exist, and WHETHER SQL semantics are wrong.
-
Compiler (Physical Plan) : Compiles the AST to produce a logical execution Plan.
-
Query Optimizer: Optimizes the logical execution plan.
-
Execution: Converts a logical Execution plan into a physical plan that can be executed. In Hive’s case, this is MR/Spark.
4. Compare Hive and databases
Hive uses the Hive Query Language (HQL), which is similar to SQL. Therefore, Hive can be easily understood as a database. In fact, Hive has nothing in common with databases structurally, except that they share a similar query language.
4.1 Data Storage Location
Hive is built on Hadoop, and all Hive data is stored in HDFS.
Databases can store data on block devices or local file systems.
4.2 Data Update
Hive is designed for data warehouse applications, and data warehouse content is read and written frequently. Therefore, data cannot be rewritten or added in Hive. All data is determined during loading.
Data in a database usually needs to be modified frequently, so you can use inserts or updates to UPDATE the data.
4.3 the index
Hive does not process data during data loading and does not create indexes for certain keys in data.
In a database, one or more columns are usually indexed, so the database can have high efficiency and low latency for accessing a small amount of data under certain conditions.
4.4 Execution Engine
Most Hive queries are executed using MapReduce provided by Hadoop.
Databases usually have their own execution engine.
4.5 Execution Delay
Hive does not have indexes and scans the entire table during data query, resulting in high latency. Another factor that causes high Hive execution latency is the MapReduce framework. Because MapReduce has a high latency, Hive query execution using MapReduce also has a high latency. (Hive is not suitable for online data query due to high data access latency.)
In contrast, the database execution latency is low. Of course, the latency here is low relative to the size of the data.
With the introduction of MapReduce, Hive can access data in parallel. So even without indexes, Hive’s parallel computing has obvious advantages when the data size is too large to handle the database.
4.6 Scalability
Because Hive is built on top of Hadoop, Hive’s scalability is the same as Hadoop’s.
The database has very limited extended rows due to ACID semantics.
4.7 Data Scale
Because Hive is built on clusters and can use MapReduce for parallel computing, it can support large amounts of data.
Correspondingly, the database can support a smaller data scale.