What is Hive

The definition of Hive

What is Hive?

Apache Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL for querying and managing large datasets stored in distributed storage. It provides the following functions:

  1. A set of tools for accessing data through SQL, which can be used for extract/transform/load (ETL) tasks;
  2. A mechanism to add schemas to various types of data;
  3. The ability to query and analyze large-scale data stored in HDFS (or HBase);
  4. Query execution via MapReduce, Tez, or Spark (not all queries require MapReduce; for example, `SELECT * FROM xxx` can read the data directly);
  5. Sub-second queries via Hive LLAP, Apache YARN, and Apache Slider.
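As an illustration of point 4 above, the execution engine can be switched per session, assuming the chosen engine is installed on the cluster; the table name here is hypothetical:

```sql
-- Choose the execution engine for this session (mr, tez, or spark).
SET hive.execution.engine=tez;

-- A simple fetch like this needs no MapReduce/Tez/Spark job at all:
-- Hive can read the underlying files directly (the fetch-task optimization).
SELECT * FROM web_logs LIMIT 10;
```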

Summed up in one sentence: Hive is a data warehouse architecture built on the Hadoop file system that analyzes and manages data stored in HDFS.

If we do not want to write MapReduce jobs by hand, we can develop a tool to do it for us; that tool can be Hive or something similar. In the overall Hive architecture, the task of translating SQL into big data jobs takes place in the Driver.

For more information about Hive architecture design, see Hive Architecture Design.

How does Hive query and manage data

Hive provides standard SQL functionality, including features from SQL:2003, SQL:2011, and SQL:2016. Its dialect is called HQL (Hive Query Language), so users familiar with SQL can query data with Hive directly.

In addition to standard SQL functionality, Hive SQL also supports extensions: User Defined Functions (UDFs), User Defined Aggregation Functions (UDAFs), and User Defined Table-Generating Functions (UDTFs).
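Besides Java-based UDFs, Hive's TRANSFORM clause lets you stream rows through an external script, which is a quick way to prototype custom row-level logic. Below is a minimal sketch of such a script in Python; the table and column names shown in the comment are hypothetical:

```python
import sys

def normalize_row(line):
    """Upper-case the first tab-separated column of one input row."""
    cols = line.rstrip("\n").split("\t")
    cols[0] = cols[0].upper()
    return "\t".join(cols)

if __name__ == "__main__":
    # Hive streams rows to stdin as tab-separated text and reads
    # the transformed rows back from stdout, roughly like this:
    #   ADD FILE normalize.py;
    #   SELECT TRANSFORM (name, visits)
    #     USING 'python normalize.py' AS (name, visits)
    #   FROM page_views;
    for line in sys.stdin:
        print(normalize_row(line))
```

The script is independent of Hive, so it can be developed and tested locally by piping tab-separated text through it.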

The key point about HDFS is that data stored in it has no schema (a schema describes a table's columns: field names, field types, and the separators between fields); a file in HDFS is just plain text. Without a schema, there is no way to query the data with SQL. This raises the question: can we add schema information to files on HDFS, and then process them with SQL? The answer is Hive: Hive attaches a schema to files in HDFS and then processes them with SQL. This differs from a traditional relational database because the SQL is ultimately translated into MapReduce jobs and run on the cluster.
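The mechanism described above can be sketched in HiveQL. The path, table, and column names here are hypothetical:

```sql
-- Declare a schema over comma-delimited text files already sitting in HDFS.
-- EXTERNAL means Hive only records the metadata; dropping the table
-- leaves the files in place.
CREATE EXTERNAL TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- The schema is applied on read, so ordinary SQL now works:
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

This "schema on read" approach is the essential difference from a traditional relational database, which validates the schema when data is written.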

Hive is a data warehouse analysis system built on Hadoop. It provides rich SQL query capabilities for analyzing data stored in the Hadoop distributed file system: it maps structured data files to database tables, offers complete SQL query functionality, and converts SQL statements into MapReduce tasks. Users who are not familiar with MapReduce can easily use SQL to query, summarize, and analyze data, while MapReduce developers can plug in their own mappers and reducers to support more complex analysis.

Hive design positioning

Hive is different from traditional relational databases. Hive parses a query into a MapReduce execution plan, and starting a MapReduce job is a high-latency process: submitting and executing a task takes significant time. This means Hive is only suitable for high-latency (batch) applications. (If you need to handle low-latency applications, consider HBase instead.)

Because of these different design goals, Hive historically did not support transactions: table data could not be modified in place (no row-level update, delete, or insert; you could only append data or re-import it from files), and indexing columns does not improve Hive query speed (Hive supports index creation, but if you want faster queries, learn to use Hive partitions and buckets instead).

In fact, update operations are possible in current Hive versions, but they require separate configuration. In practice this feature is rarely used; overwriting data is the more common way to update.
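A rough sketch of the configuration mentioned above: transactional (ACID) tables must be bucketed ORC tables in older Hive versions, and the settings look roughly like this (the exact property set varies by Hive version, and the table and column names are hypothetical):

```sql
-- Session/cluster settings needed for ACID tables:
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- The table itself must opt in to transactions:
CREATE TABLE users (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- With that in place, row-level changes become possible:
UPDATE users SET name = 'alice' WHERE id = 1;
DELETE FROM users WHERE id = 2;
```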

Hive is not designed for OLTP. It is mainly used as a data warehouse tool for data processing and management. Because Hive is built on Hadoop, it is highly scalable.

Hive's role in big data

In the current big data ecosystem, Hive's main role is as a data warehouse tool. It is often described as a tool that translates SQL into MapReduce, but that is not quite accurate, because it now also supports the Tez and Spark execution engines. Either way, it lowers the barrier for business development: developers write SQL instead of MapReduce code.

A data warehouse is a service concept or system, and Hive supports one of its main links: processing the warehouse's data.

Hive has little storage or processing capacity of its own; its main role is that of a translator. In the big data ecosystem, many components have their own processing and storage capacity, so Hive translates SQL and submits the result to other big data components, which then do the actual data processing.

Comparison with HBase

It is also worth covering the connection and differences between Hive and HBase. Both rely on Hadoop at the bottom and are commonly used components, so the similarities and differences are as follows. The similarity: both HBase and Hive are based on Hadoop and use Hadoop as the underlying storage.

The differences:

  1. Hive is a batch processing system based on Hadoop that reduces the cost of writing MapReduce jobs; HBase is a project that makes up for Hadoop's shortcomings in real-time operation.
  2. Imagine operating a relational database: Hive+Hadoop behaves like a full table scan, while HBase+Hadoop behaves like index-based access.
  3. A Hive query is compiled into MapReduce jobs that can run from five minutes to several hours; HBase is very efficient, definitely much more efficient than Hive for this kind of access.
  4. Hive does not store or compute data itself; it relies on HDFS and MapReduce, executing Hive commands as Hadoop MapReduce jobs.
  5. HBase tables are physical tables, not logical tables, providing a large in-memory hash table.
  6. HBase is a column-oriented store.
  7. HDFS serves as the underlying storage system: HDFS stores the files, and HBase organizes them.
  8. Hive uses HDFS to store files and MapReduce as its computing framework.

Conclusion

  1. Hive is a data warehouse framework built on top of Hadoop. It was originally developed by Facebook and later handed over to the Apache Software Foundation as an open source project. It is an analytical application.

  2. Hive is used to analyze reports and support decision-making, similar to a traditional data warehouse. It differs from traditional data warehouses in that it can process large-scale data and has strong scalability and fault tolerance.

  3. Hive stores all data in HDFS and is built on Hadoop. Most queries and computations are performed by MapReduce, although later versions of Hive can use other engines instead of relying on the MapReduce execution engine.