The last article covered Hadoop interview questions. This one is about Hive, a data warehouse component that stands on the shoulders of giants (Hadoop).

1. What is Hive? Why Hive?

Interviewers often ask what Hive is, and why Hive exists when we already have MapReduce. Many candidates don't answer this well, so let's walk through it.

Hive is a data warehouse tool based on Hadoop. It maps structured data files to a database table and provides SQL-like query functions (converting SQL statements into MapReduce tasks for execution).

Why Hive? Context: Hadoop is powerful, but it is hard to use, expensive to develop against, and has a steep learning curve. Purpose: make Hadoop easier for programmers to use and reduce the learning cost.

2. Hive Components

  • Parser (parses SQL statements)
  • Compiler (compiles SQL statements into MapReduce programs)
  • Optimizer (optimizes the MapReduce program)
  • Executor (submits the MapReduce job for execution and writes the results to HDFS)
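This parse → compile → optimize → execute pipeline can be observed with `EXPLAIN`, which prints the MapReduce stages Hive compiles a query into. A minimal sketch (the `orders` table and its columns are hypothetical example names):

```sql
-- Show the plan Hive's compiler and optimizer produce for a query.
EXPLAIN
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;
-- The output lists the stage plan: a map phase (TableScan, Select,
-- partial Group By) and a reduce phase (final Group By, File Output).
```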

3. Advantages and disadvantages of Hive

Note: Hive's metadata is stored in a MySQL database, the data itself is stored on HDFS, and the compute engine is MapReduce.

Advantages: 1. The operation interface uses SQL, which enables rapid development (simple and easy to use). 2. It avoids writing MapReduce by hand, reducing the learning cost for developers. 3. Hive is well suited to processing big data but not small data, because Hive's execution latency is high. 4. Hive supports user-defined functions, so users can implement custom functions for their own requirements.

Disadvantages: 1. HQL's expressiveness is limited:

  • Iterative algorithms cannot be expressed
  • Not good at data mining

2. Hive is inefficient

  • Hive automatically generates MapReduce jobs, which are usually not very intelligent
  • Hive is difficult to tune and has coarse granularity

4. Common functions (high frequency)

When I interviewed at ten big data companies, whenever Hive came up I was asked to list a few common functions.

  • Relational operators (=, >, <, <>)
  • Arithmetic operators (+, -, *, /, %)
  • Numerical functions (round, floor, ceil, ceiling)
  • Date functions (from_unixtime, unix_timestamp, to_date)
  • Conditional functions (if, COALESCE, CASE)
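A quick sketch of these built-in functions in use (the literal values are chosen purely for illustration):

```sql
SELECT
  round(3.14159, 2)                            AS rounded,        -- 3.14
  floor(2.7)                                   AS floored,        -- 2
  ceil(2.1)                                    AS ceiled,         -- 3
  from_unixtime(1609459200, 'yyyy-MM-dd')      AS ts_to_str,      -- '2021-01-01' (in UTC)
  to_date('2021-01-01 10:30:00')               AS date_part,      -- '2021-01-01'
  if(1 > 0, 'yes', 'no')                       AS if_demo,        -- 'yes'
  COALESCE(NULL, NULL, 'fallback')             AS first_non_null, -- 'fallback'
  CASE WHEN 10 > 5 THEN 'big' ELSE 'small' END AS case_demo;      -- 'big'
```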

There are many more; only the most common are listed here.

5. Differences between internal and external tables in Hive

When you drop an internal (managed) table, both the data and the metadata are deleted, so managed tables are not suitable for shared data. When you drop an external table, only the metadata is deleted and the underlying data on HDFS is kept, so external tables are suitable for shared data.
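A minimal sketch of the two table types (table names and the HDFS path are hypothetical):

```sql
-- Managed (internal) table: DROP TABLE removes both metadata and data.
CREATE TABLE user_log_managed (
  uid    BIGINT,
  action STRING
);

-- External table: DROP TABLE removes only the metadata; the files
-- under LOCATION stay in HDFS, so the data can be shared.
CREATE EXTERNAL TABLE user_log_external (
  uid    BIGINT,
  action STRING
)
LOCATION '/warehouse/shared/user_log';
```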

6. Order By, Sort By, Distribute By, Cluster By

  • Order By: globally ordered, with only one Reducer
  • Sort By: ordered within each Reducer, not globally
  • Distribute By: similar to an MR partitioner; usually combined with Sort By
  • Cluster By: when the Distribute By and Sort By fields are the same, Cluster By can be used instead. Cluster By both distributes and sorts, but the sort can only be ascending; ASC or DESC cannot be specified.
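The four clauses side by side, assuming a hypothetical `emp` table with `deptno` and `salary` columns:

```sql
-- Global order, single Reducer:
SELECT * FROM emp ORDER BY salary DESC;

-- Ordered within each Reducer only:
SELECT * FROM emp SORT BY salary DESC;

-- Rows with the same deptno go to the same Reducer, sorted there:
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY salary DESC;

-- Shorthand when the distribute and sort field are the same
-- (ascending only; ASC/DESC cannot be specified):
SELECT * FROM emp CLUSTER BY deptno;
-- equivalent to: DISTRIBUTE BY deptno SORT BY deptno
```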

7. Hive Optimization (key points)

1. MapJoin. If MapJoin is not enabled or its conditions are not met, the Hive parser converts the join into a Common Join: the join is completed in the Reduce phase, where data skew is likely to occur. With MapJoin, the small tables are loaded into memory on the Map side and the join is done there, avoiding Reduce-phase processing. 2. Column and partition pruning

  1. In SELECT, take only the columns you need, and use partition filters wherever possible; avoid SELECT *.
  2. With partition pruning and outer joins, if the filter condition on the secondary table is written in the WHERE clause, the full-table join runs first and the filtering happens afterwards; put the condition in the ON clause or a subquery instead.
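A sketch of both points, assuming hypothetical tables `big` and `small`, each partitioned by `dt` (the settings are real Hive parameters; the values are illustrative):

```sql
-- 1. MapJoin: let Hive load the small table into Map-side memory.
SET hive.auto.convert.join = true;                 -- enable automatic MapJoin
SET hive.mapjoin.smalltable.filesize = 25000000;   -- "small table" threshold, bytes

-- 2. With an outer join, keep the filter on the secondary table in
-- the ON clause, not WHERE, so it is applied before the join:
SELECT b.id, s.name
FROM big b
LEFT JOIN small s
  ON b.id = s.id AND s.dt = '2021-01-01'
WHERE b.dt = '2021-01-01';   -- partition filter on the driving table
```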

Also, use partitioning.

3. Set the number of Map tasks properly. Typically, one or more Map tasks are generated from the input directory.

  • The main determinants are: the total number of input files, the size of the input file, and the block size set by the cluster.

Is the more maps the better?

  • If a task has many small files, each small file will be treated as a block and completed by a map task. The time required to start and initialize a map task is much longer than the time required for logical processing, resulting in a huge waste of resources. Also, the number of maps that can be executed at the same time is limited.

Is it always optimal for each Map task to process a block close to 128 MB?

  • Not necessarily
  • For example, a 127 MB file would normally be handled by a single Map task. But if the file has only one or two fields yet tens of millions of records, and the Map processing logic is complex, a single Map task will definitely be slow.
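In that case, lowering the maximum split size before running the query spreads the file over more Map tasks. A sketch (the value is illustrative):

```sql
-- Each input split, and hence each Map task, covers at most ~32 MB,
-- so a 127 MB file is handled by ~4 Map tasks instead of 1.
SET mapreduce.input.fileinputformat.split.maxsize = 33554432;  -- 32 MB in bytes
```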

4. Small file merging

CombineHiveInputFormat (the default input format) merges small files into fewer splits before the Map phase, reducing the number of Map tasks; HiveInputFormat does not merge small files.
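The relevant settings, as a sketch (these are real Hive parameters; the values are illustrative, not all defaults):

```sql
-- Default input format; combines small input files into fewer splits:
SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Also merge small output files at the end of the job:
SET hive.merge.mapfiles = true;             -- merge outputs of map-only jobs
SET hive.merge.mapredfiles = true;          -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task = 268435456;   -- target merged file size, 256 MB
```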

5. More Reduce tasks are not always better

  • Starting and initializing too many Reduce tasks also consumes time and resources
  • In addition, each Reduce task produces its own output file; if many small files are generated and then used as input to the next job, the too-many-small-files problem reappears
  • When setting the number of Reduce tasks, balance two principles: give each Reduce task an appropriate amount of data, and use an appropriate number of Reduce tasks
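The knobs that control this, as a sketch (real Hive/MapReduce parameters; values are illustrative):

```sql
-- Let Hive derive the count: one Reducer per this many bytes of input:
SET hive.exec.reducers.bytes.per.reducer = 268435456;  -- 256 MB
-- Upper bound on the derived count:
SET hive.exec.reducers.max = 1009;
-- Or fix the number explicitly (overrides the estimate):
SET mapreduce.job.reduces = 10;
```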

8. What are Hive's metadata storage options and their characteristics?

  • Embedded Derby database: lightweight, but supports only one session at a time, so it is unstable for real use
  • MySQL database: metadata is persisted, the storage mode is configurable, and it is easy to inspect

9. Hive partitions and their advantages

  • Hive databases, tables, and partitions are all abstractions over directories stored in HDFS
  • A partition in Hive corresponds to a directory in HDFS; the directory name encodes the partition field and its value
  • If a table holds a large amount of data and every query scans the whole table, queries are slow and time-consuming
  • With partitions, a query can read only the data in the relevant partition, which is convenient and improves query efficiency
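A minimal sketch of a partitioned table (table name, columns, and paths are hypothetical):

```sql
-- Each dt value becomes an HDFS directory, e.g. .../dept_log/dt=2021-01-01/
CREATE TABLE dept_log (
  uid    BIGINT,
  action STRING
)
PARTITIONED BY (dt STRING);

-- Load data into one partition:
LOAD DATA INPATH '/tmp/log-2021-01-01'
INTO TABLE dept_log PARTITION (dt = '2021-01-01');

-- The partition filter means only that one directory is scanned:
SELECT action, COUNT(*) FROM dept_log
WHERE dt = '2021-01-01'
GROUP BY action;
```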

10. Hive storage formats, their differences, and compression

Storage format

  • The default is TextFile
  • ORC: columnar storage with a high compression ratio
  • Parquet: a binary, columnar storage format oriented toward analytics

Compression

  • We use the LZO compression format for the raw data. Because the raw data is relatively large, we chose LZO, which supports splitting.

  • The cleaned data is stored in the DWD layer. Since the cleaned data is analyzed in the DWS layer, the DWD layer uses Parquet as the storage format and Snappy as the compression format.

  • At one point the project team leader asked us to use Snappy + ORC storage. I found that Snappy + ORC occupied nearly half again as much space as ORC alone (ORC compresses with ZLIB by default). After testing various combinations of compression and storage formats, we settled on ORC alone, which saves a lot of space.
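The combinations above can be declared per table. A sketch (table and column names are hypothetical):

```sql
-- ORC with its default ZLIB compression:
CREATE TABLE dwd_event_orc (id BIGINT, payload STRING)
STORED AS ORC;

-- ORC with Snappy instead (faster, but a lower compression ratio,
-- hence the larger files observed above):
CREATE TABLE dwd_event_orc_snappy (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Parquet + Snappy, as used in the DWD layer:
CREATE TABLE dwd_event_parquet (id BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```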

Bonus

Search [big data brother] on WeChat to get this article's bonus content: a super detailed reference for Hive functions.