Hive achieves parallelism by compiling a query into one or more MapReduce jobs. Each job may contain multiple map and reduce tasks, at least some of which can run in parallel.
The optimal number of mappers and reducers depends on several variables, such as the size of the input data and the type of operations performed on it.
Balance is essential. For a big data system like Hadoop or Spark, a large volume of data is not itself a problem; what hurts is data skew, where the work is distributed unevenly across nodes.
Too many map or reduce tasks cause excessive overhead during job startup, scheduling, and execution; too few leave the cluster's inherent parallelism underused.
mapred.reduce.tasks: the number of reducers the job is submitted with; taken from the Hadoop client configuration. Default: 1
hive.mapred.mode: the map/reduce mode. If set to strict, three types of query are disabled: 1. queries on a partitioned table whose WHERE filter does not include a partition column; 2. ORDER BY queries without a LIMIT clause (ORDER BY sends the entire result set to a single reducer for sorting, so adding LIMIT keeps that reducer from running for a very long time); 3. Cartesian-product queries, i.e. joins with a WHERE clause but no ON clause. Default: 'nonstrict'
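For illustration, here is the kind of query strict mode rejects together with a compliant rewrite (a sketch; the table logs, its partition column dt, and the column ts are hypothetical):
SET hive.mapred.mode=strict;
-- Rejected: no partition-column filter, and ORDER BY without LIMIT.
-- SELECT * FROM logs ORDER BY ts;
-- Accepted: filters on the partition column and bounds the sort with LIMIT.
SELECT * FROM logs WHERE dt = '2013-06-01' ORDER BY ts LIMIT 100;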
hive.merge.mapfiles: whether to merge small files at the end of a map-only job. If the Hive input consists of many small files, each small file starts its own map task, and starting and initializing a map task can take longer than its actual processing, wasting resources and even causing OOM. So when a job shows little input data but a large number of tasks, take care to merge the input on the map side; likewise, watch the output file size when writing data to a table. Default: true
hive.merge.mapredfiles: whether to merge small files at the end of a map-reduce job. Default: false
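A session that wants small output files merged might combine these switches with the size thresholds described further down this list (a sketch using only parameters from this document):
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;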
hive.exec.parallel: whether to enable concurrent submission of map/reduce jobs. Default: false
hive.limit.optimize.enable: when a LIMIT statement is used, sample the data source instead of executing the full query and then returning partial results. The drawback is that useful input data may never be processed.
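For example, sampling can be switched on just for an exploratory LIMIT query (a sketch; the table src is hypothetical):
SET hive.limit.optimize.enable=true;
SELECT * FROM src LIMIT 10;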
hive.exec.reducers.bytes.per.reducer: the average load per reducer, in bytes. Default: 1000000000
hive.exec.reducers.max: the upper limit on the number of reducers, which prevents a single query from consuming too many reducer resources. A suggested formula for this value is (total reduce slots in the cluster * 1.5) / (average number of queries executing); the multiple of 1.5 is a rule-of-thumb factor used to prevent underutilization of the cluster. Default: 999
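As a worked example of the formula: a cluster with 100 reduce slots that averages 5 concurrent queries gives (100 * 1.5) / 5 = 30, so the cap would be set as:
SET hive.exec.reducers.max=30;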
hive.exec.rowoffset: Hive provides virtual columns for the mapper's input file name and the block offset within the file; enabling this setting adds the row offset within the block. These virtual columns can be used to diagnose queries that return unexpected or null results; with these "fields" the user can see which file, and even which row, is causing the problem: SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK, line FROM hive_text WHERE line LIKE '%hive%' LIMIT 2; Default: true
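Note that the row-offset virtual column is only populated while this setting is on, so a diagnostic session would first run:
SET hive.exec.rowoffset=true;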
hive.multigroupby.singlemr: a special optimization; whether to assemble a query's multiple GROUP BY operations into a single MapReduce job. Default: false
hive.exec.dynamic.partition: whether dynamic partitioning is enabled. Default: false
hive.exec.dynamic.partition.mode: the dynamic partitioning mode once dynamic partitioning is enabled. Two values are available: strict requires at least one static partition column; nonstrict has no such requirement. Default: strict
hive.exec.max.dynamic.partitions: the maximum number of dynamic partitions allowed. Default: 1000
hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions allowed on a single reduce node. Default: 100
hive.exec.default.partition.name: the default dynamic partition name, used when the dynamic partition column is '' or null. Default: '__HIVE_DEFAULT_PARTITION__'
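Putting the dynamic-partitioning settings together (a sketch; the tables raw_logs and logs, their columns, and the partition column dt are hypothetical):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT uid, action, dt FROM raw_logs;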
hive.exec.mode.local.auto: whether Hive automatically decides, based on input file size, to run the job locally (on the gateway). Default: true
hive.exec.mode.local.auto.inputbytes.max: if hive.exec.mode.local.auto is true, jobs whose input is smaller than this threshold run automatically in local mode; the default is 128 MB. Default: 134217728L
hive.exec.mode.local.auto.tasks.max: if hive.exec.mode.local.auto is true, a Hive task whose number of Hadoop jobs is below this threshold runs automatically in local mode. Default: 4
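A typical combination for small ad-hoc queries, using the defaults quoted above (a sketch):
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.tasks.max=4;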
hive.auto.convert.join: whether to automatically convert a reduce-side common join into a map join based on the size of the small input table, speeding up joins of a large table with a small one. Default: false
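For instance, enabling the conversion lets a join against a small dimension table run map-side (a sketch; the tables fact and dim and their columns are hypothetical):
SET hive.auto.convert.join=true;
SELECT f.id, d.name FROM fact f JOIN dim d ON f.dim_id = d.id;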
hive.mapred.local.mem: the maximum amount of memory in local mode, in bytes; 0 means unlimited. Default: 0
hive.exec.scratchdir: an HDFS path used to store the execution plans of the different map/reduce stages and the intermediate output of those stages. Default: /tmp/<user.name>/hive
hive.metastore.warehouse.dir: the default storage path for Hive data files, usually a writable HDFS path. Default: ""
hive.groupby.skewindata: whether skewed data is supported in GROUP BY operations. Default: false
hive.default.fileformat: Hive's default output file format, the same as specified at table creation; options are 'TextFile', 'SequenceFile', and 'RCFile'. Default: 'TextFile'
hive.security.authorization.enabled: whether Hive authorization checking is enabled. Default: false
hive.exec.plan: the Hive execution plan path; set automatically by the program. Default: null
hive.exec.submitviachild: determines whether map/reduce jobs are submitted through a separate JVM (a child process); by default they are submitted from the same JVM as the HQL compiler. Default: false
hive.exec.script.maxerrsize: the maximum serialization error count allowed for user scripts executed through TRANSFORM/MAP/REDUCE. Default: 100000
hive.exec.script.allow.partial.consumption: whether a script is allowed to consume only part of its input; if true, a broken pipe with unconsumed data is treated as normal. Default: false
hive.exec.compress.output: determines whether the output of a query's final map/reduce job is in compressed format. Default: false
hive.exec.compress.intermediate: determines whether the output of a query's intermediate map/reduce jobs is in compressed format. Default: false
hive.intermediate.compression.codec: the compression codec class name for intermediate map/reduce jobs (one codec may cover several compression types); the value may be set automatically by the program.
hive.intermediate.compression.type: the compression type for intermediate map/reduce jobs, such as "BLOCK" or "RECORD".
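A session might compress both intermediate and final output like this (a sketch; SnappyCodec is one common choice and assumes that codec is installed on the cluster):
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;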
hive.exec.pre.hooks: statement-level hook; the class name of the hook invoked before the entire HQL statement executes. Default: ""
hive.exec.post.hooks: statement-level hook; the class name of the hook invoked after the entire HQL statement executes. Default: ""
hive.exec.parallel.thread.number: the number of threads used for concurrent submission. Default: 8
hive.mapred.reduce.tasks.speculative.execution: whether to enable reducer speculative execution; has the same effect as mapred.reduce.tasks.speculative.execution. Default: false
hive.exec.counters.pull.interval: the interval, in milliseconds, at which the client pulls progress counters. Default: 1000L
hadoop.bin.path: the path of the Hadoop client executable script used to submit jobs through a separate JVM; taken from the Hadoop client configuration. Default: $HADOOP_HOME/bin/hadoop
hadoop.config.dir: the path of the Hadoop client configuration files; taken from the Hadoop client configuration. Default: $HADOOP_HOME/conf
fs.default.name: the NameNode URL; taken from the Hadoop client configuration. Default: file:///
map.input.file: the input file of the map; taken from the Hadoop client configuration. Default: null
mapred.input.dir: the input directory of the map; taken from the Hadoop client configuration. Default: null
mapred.input.dir.recursive: whether the input directory is recursed into nested subdirectories; taken from the Hadoop client configuration. Default: false
mapred.job.tracker: the JobTracker URL; taken from the Hadoop client configuration. If set to 'local', local mode is used. Default: local
mapred.job.name: the map/reduce job name; if not set, a generated job name is used; taken from the Hadoop client configuration. Default: null
mapred.reduce.tasks.speculative.execution: speculative execution for map/reduce; taken from the Hadoop client configuration. Default: null
hive.metastore.metadb.dir: the Hive metadata repository path. Default: ""
hive.metastore.uris: the Hive metastore URIs; multiple thrift:// addresses, separated by commas. Default: ""
hive.metastore.connect.retries: the maximum number of retries for connecting to the Thrift metastore service. Default: 3
javax.jdo.option.ConnectionPassword: the JDO connection password. Default: ""
hive.metastore.ds.connection.url.hook: the class name of the JDO connection URL hook, used to obtain the JDO metadata database connection string; the class implements the JDOConnectionURLHook interface. Default: ""
javax.jdo.option.ConnectionURL: the metadata database connection URL. Default: ""
hive.metastore.ds.retry.attempts: the maximum number of attempts to connect to the backend datastore after a JDO data connection error. Default: 1
hive.metastore.ds.retry.interval: the interval, in milliseconds, between attempts to connect to the backend datastore. Default: 1000
hive.metastore.force.reload.conf: whether to force a reload of the metadata configuration; once reloaded, the value resets to false. Default: false
hive.metastore.server.min.threads: the minimum number of threads in the Thrift service thread pool. Default: 8
hive.metastore.server.max.threads: the maximum number of threads in the Thrift service thread pool. Default: 0x7fffffff
hive.metastore.server.tcp.keepalive: whether the Thrift service keeps TCP connections alive. Default: true
hive.metastore.archive.intermediate.original: the suffix of the original intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_ORIGINAL'
hive.metastore.archive.intermediate.archived: the suffix of the compressed intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_ARCHIVED'
hive.metastore.archive.intermediate.extracted: the suffix of the extracted intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_EXTRACTED'
hive.cli.errors.ignore: whether to ignore errors; for SQL files containing multiple lines, the failing line can be skipped and execution continues with the next line. Default: false
hive.session.id: the identifier of the current session, in the format "username_time"; recorded in the job conf and not normally set manually. Default: ""
hive.session.silent: whether the current session runs in silent mode; if not silent, all INFO-level log messages are written to the console on the standard error stream. Default: false
hive.query.string: the query string currently being executed. Default: ""
hive.query.id: the ID of the query currently being executed. Default: ""
hive.query.planid: the ID of the map/reduce plan currently being executed. Default: ""
hive.jobname.length: Hive truncates the middle of the job name according to this length. Default: 50
hive.jar.path: the path where hive_cli.jar sits when jobs are submitted through a separate JVM.
hive.aux.jars.path: the path of the various plugin jars made up of user-defined UDFs and SerDes. Default: ""
hive.added.files.path: the paths of ADD FILE files. Default: ""
hive.added.jars.path: the paths of ADD JAR files. Default: ""
hive.added.archives.path: the paths of ADD ARCHIVE files. Default: ""
hive.table.name: the name of the current Hive table; passed into user scripts through the ScriptOperator. Default: ""
hive.partition.name: the name of the current Hive partition; passed into user scripts through the ScriptOperator. Default: ""
hive.script.auto.progress: whether scripts periodically send a heartbeat to the JobTracker, so that a long-running script is not assumed by the JobTracker to have hung. Default: false
hive.script.operator.id.env.var: the name of the environment variable used to identify the ScriptOperator ID. Default: 'HIVE_SCRIPT_OPERATOR_ID'
hive.alias: the current Hive alias; passed into user scripts through the ScriptOperator. Default: ""
hive.map.aggr: determines whether aggregation can happen on the map side. Default: true
hive.join.emit.interval: the Hive join emit interval, i.e. how many rows of the right-most join operand are buffered before join results are emitted. Default: 1000
hive.join.cache.size: the cache size of the Hive join operation, in rows. Default: 25000
hive.mapjoin.bucket.cache.size: the bucket cache size of a Hive map join. Default: 100
hive.mapjoin.size.key: the size of the key in each row of a Hive map join. Default: 10000
hive.mapjoin.cache.numrows: the number of rows a Hive map join caches. Default: 25000
hive.groupby.mapaggr.checkinterval: the check interval for map-side aggregation of a GROUP BY operation, in number of rows. Default: 100000
hive.map.aggr.hash.percentmemory: the proportion of virtual machine memory that the hash store for Hive map-side aggregation may occupy. Default: 0.5
hive.map.aggr.hash.min.reduction: the minimum reduction ratio of the hash store for Hive map-side aggregation. Default: 0.5
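Enabling map-side aggregation typically combines the three parameters above (a sketch using their default values):
SET hive.map.aggr=true;
SET hive.groupby.mapaggr.checkinterval=100000;
SET hive.map.aggr.hash.percentmemory=0.5;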
hive.udtf.auto.progress: whether a Hive UDTF periodically reports a heartbeat; useful when the UDTF runs for a long time without outputting rows. Default: false
hive.fileformat.check: whether Hive checks the output file format. Default: true
hive.querylog.location: the directory for Hive real-time query logs; if the value is null, no real-time query logs are created. Default: '/tmp/$USER'
hive.script.serde: the SerDe for Hive user scripts. Default: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
hive.script.recordreader: the RecordReader for Hive user scripts. Default: 'org.apache.hadoop.hive.ql.exec.TextRecordReader'
hive.script.recordwriter: the RecordWriter for Hive user scripts. Default: 'org.apache.hadoop.hive.ql.exec.TextRecordWriter'
hwi.listen.host: the host or IP that HWI binds to. Default: '0.0.0.0'
hwi.listen.port: the HTTP port on which HWI listens. Default: 9999
hwi.war.file: the path of the HWI war file. Default: $HWI_WAR_FILE
hive.test.mode: whether to run Hive in test mode. Default: false
hive.test.mode.prefix: the table-name prefix used in Hive test mode. Default: 'test_'
hive.test.mode.samplefreq: the sampling frequency used in Hive test mode. Default: 32
hive.test.mode.nosamplelist: the list of tables excluded from sampling in Hive test mode, comma-separated. Default: ""
hive.merge.size.per.task: the target size of the merged file at the end of each task; the number of reducers is determined from this size; the default is 256 MB. Default: 256000000
hive.merge.smallfiles.avgsize: the average size below which a group of small files needs to be merged; the default is 16 MB. Default: 16000000
hive.optimize.skewjoin: whether to optimize joins over skewed data; when enabled, a new map/reduce job is started to handle the skewed keys. Default: false
hive.skewjoin.key: the skew key threshold; keys exceeding this row count are treated as skew join keys. Default: 1000000
hive.skewjoin.mapjoin.map.tasks: the limit on the number of map tasks for the map join that handles the skewed data. Default: 10000
hive.skewjoin.mapjoin.min.split: the minimum data split size, in bytes, for the map join that handles the skewed data; the default is 32 MB. Default: 33554432
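A session that expects a handful of heavily skewed join keys might enable the whole group (a sketch using the defaults above):
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=1000000;
SET hive.skewjoin.mapjoin.map.tasks=10000;
SET hive.skewjoin.mapjoin.min.split=33554432;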
mapred.min.split.size: the minimum split size of a map/reduce job; same as the Hadoop client configuration. Default: 1
hive.mergejob.maponly: whether to enable map-only merge jobs. Default: true
hive.heartbeat.interval: the heartbeat interval of a Hive job, in milliseconds. Default: 1000
hive.mapjoin.maxsize: the maximum number of rows processed by a map join; if exceeded, the map join process exits abnormally. Default: 1000000
hive.hashtable.initialCapacity: a Hive map join dumps the small table into an in-memory hashtable; this parameter specifies that hashtable's initial capacity. Default: 100000
hive.hashtable.loadfactor: a Hive map join dumps the small table into an in-memory hashtable; this parameter specifies that hashtable's load factor. Default: 0.75
hive.mapjoin.followby.gby.localtask.max.memory.usage: the maximum proportion of memory that may be used when a MapJoinOperator is followed by a GroupByOperator. Default: 0.55
hive.mapjoin.localtask.max.memory.usage: the maximum proportion of heap memory that a map join local task may use. Default: 0.9
hive.mapjoin.localtask.timeout: the map join local task timeout; a Taobao-specific feature. Default: 600000
hive.mapjoin.check.memory.rows: check the memory footprint every this many rows; if hive.mapjoin.localtask.max.memory.usage is exceeded, the task exits abnormally and the map join fails. Default: 100000
hive.debug.localtask: whether to debug the local task; this parameter currently has no effect. Default: false
hive.task.progress: whether counters are enabled to record job execution progress; when enabled, clients also pull the progress counters. Default: false
hive.input.format: Hive's InputFormat; the default is org.apache.hadoop.hive.ql.io.HiveInputFormat, and the other option is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
hive.enforce.bucketing: whether to enforce bucketing. Default: false
hive.enforce.sorting: whether to enforce sorting. Default: false
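When loading a bucketed, sorted table, the two enforcement flags make sure the data actually lands bucketed and sorted (a sketch; the tables user_buckets and src are hypothetical):
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
CREATE TABLE user_buckets (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
INSERT OVERWRITE TABLE user_buckets
SELECT id, name FROM src;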
hive.mapred.partitioner: Hive's Partitioner class. Default: 'org.apache.hadoop.hive.ql.io.DefaultHivePartitioner'
hive.exec.script.trust: whether the Hive script operator is trusted. Default: false
hive.hadoop.supports.splittable.combineinputformat: whether a splittable CombineInputFormat is supported. Default: false
hive.optimize.cp: whether to optimize column pruning. Default: true
hive.optimize.ppd: whether to optimize predicate pushdown. Default: true
hive.optimize.groupby: whether to optimize GROUP BY. Default: true
hive.optimize.bucketmapjoin: whether to optimize the bucket map join. Default: false
hive.optimize.bucketmapjoin.sortedmerge: whether to try to force a sorted-merge bucket map join while optimizing the bucket map join. Default: false
hive.optimize.reducededuplication: whether to optimize away redundant reduce stages. Default: true
hive.hbase.wal.enabled: whether writes through the HBase storage handler are written to the HBase write-ahead log (WAL). Default: true
hive.archive.enabled: whether archiving to HAR files is enabled. Default: false
hive.archive.har.parentdir.settable: whether the parent directory of a HAR file can be set. Default: false
hive.outerjoin.supports.filters: whether outer joins support filter conditions. Default: true
hive.fetch.output.serde: the SerDe class of the fetch task. Default: 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe'
hive.semantic.analyzer.hook: Hive semantic analyzer hooks, called before and after the semantic analysis phase; used to inspect and modify the AST and the generated execution plan; comma-separated. Default: null
hive.cli.print.header: whether to display column names in query results. Default: false
hive.cli.encoding: Hive's default command-line character encoding. Default: 'UTF8'
hive.log.plan.progress: whether to log execution plan progress. Default: true
hive.exec.script.wrapper: the wrapper around script operator invocations, usually a script interpreter. For example, setting the value to "python" causes the script passed to the script operator to be invoked as "python <script command>"; if the value is null or unset, the script is invoked directly as "<script command>". Default: ""
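For example, with the wrapper set to "python" a TRANSFORM script is launched as "python my_script.py" (a sketch; my_script.py, its columns, and the table src are hypothetical):
SET hive.exec.script.wrapper=python;
ADD FILE my_script.py;
SELECT TRANSFORM (col1, col2) USING 'my_script.py' AS (out1, out2) FROM src;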
hive.check.fatal.errors.interval: the interval, in milliseconds, at which the client checks for fatal errors by pulling counters; a Taobao-specific configuration item. Default: 5000L