Hive achieves parallelism by compiling a query into one or more MapReduce jobs. Each job may contain multiple map and reduce tasks, at least some of which can run in parallel.
The optimal number of mappers and reducers depends on several variables, such as the size of the input data and the type of operations performed on it.
Balance is essential. For a big data system like Hadoop or Spark, a large volume of data is not itself a problem; what hurts is data skew, where the work is distributed unevenly across nodes.
Too many map or reduce tasks cause excessive overhead during job startup, scheduling, and execution; too few leave the cluster's inherent parallelism underused.
mapred.reduce.tasks: the number of reducers the job is submitted with; taken from the Hadoop client configuration. Default: 1
hive.mapred.mode: the map/reduce mode. If set to strict, three types of query are disabled: 1. queries on a partitioned table whose WHERE filter does not include a partition column; 2. ORDER BY queries without a LIMIT clause (ORDER BY sends the entire result set to a single reducer for sorting, so adding LIMIT keeps that reducer from running for a very long time); 3. Cartesian-product queries, i.e. joins with a WHERE clause but no ON clause. Default: 'nonstrict'
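For illustration, here is the kind of query strict mode rejects together with a compliant rewrite (a sketch; the table logs, its partition column dt, and the column ts are hypothetical):
SET hive.mapred.mode=strict;
-- Rejected: no partition-column filter, and ORDER BY without LIMIT.
-- SELECT * FROM logs ORDER BY ts;
-- Accepted: filters on the partition column and bounds the sort with LIMIT.
SELECT * FROM logs WHERE dt = '2013-06-01' ORDER BY ts LIMIT 100;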
hive.merge.mapfiles: whether to merge small files at the end of a map-only job. If the Hive input consists of many small files, each small file starts its own map task, and starting and initializing a map task can take longer than its actual processing, wasting resources and even causing OOM. So when a job shows little input data but a large number of tasks, take care to merge the input on the map side; likewise, watch the output file size when writing data to a table. Default: true
hive.merge.mapredfiles: whether to merge small files at the end of a map-reduce job. Default: false
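A session that wants small output files merged might combine these switches with the size thresholds described further down this list (a sketch using only parameters from this document):
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;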
hive.exec.parallel: whether to enable concurrent submission of map/reduce jobs. Default: false
hive.limit.optimize.enable: when a LIMIT statement is used, sample the data source instead of executing the full query and then returning partial results. The drawback is that useful input data may never be processed.
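For example, sampling can be switched on just for an exploratory LIMIT query (a sketch; the table src is hypothetical):
SET hive.limit.optimize.enable=true;
SELECT * FROM src LIMIT 10;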
hive.exec.reducers.bytes.per.reducer: the average load per reducer, in bytes. Default: 1000000000
hive.exec.reducers.max: the upper limit on the number of reducers, which prevents a single query from consuming too many reducer resources. A suggested formula for this value is (total reduce slots in the cluster * 1.5) / (average number of queries executing); the multiple of 1.5 is a rule-of-thumb factor used to prevent underutilization of the cluster. Default: 999
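As a worked example of the formula: a cluster with 100 reduce slots that averages 5 concurrent queries gives (100 * 1.5) / 5 = 30, so the cap would be set as:
SET hive.exec.reducers.max=30;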
hive.exec.rowoffset: Hive provides virtual columns for the mapper's input file name and the block offset within the file; enabling this setting adds the row offset within the block. These virtual columns can be used to diagnose queries that return unexpected or null results; with these "fields" the user can see which file, and even which row, is causing the problem: SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK, line FROM hive_text WHERE line LIKE '%hive%' LIMIT 2; Default: true
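Note that the row-offset virtual column is only populated while this setting is on, so a diagnostic session would first run:
SET hive.exec.rowoffset=true;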
hive.multigroupby.singlemr: a special optimization; whether to assemble a query's multiple GROUP BY operations into a single MapReduce job. Default: false
hive.exec.dynamic.partition: whether dynamic partitioning is enabled. Default: false
hive.exec.dynamic.partition.mode: the dynamic partitioning mode once dynamic partitioning is enabled. Two values are available: strict requires at least one static partition column; nonstrict has no such requirement. Default: strict
hive.exec.max.dynamic.partitions: the maximum number of dynamic partitions allowed. Default: 1000
hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions allowed on a single reduce node. Default: 100
hive.exec.default.partition.name: the default dynamic partition name, used when the dynamic partition column is '' or null. Default: '__HIVE_DEFAULT_PARTITION__'
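Putting the dynamic-partitioning settings together (a sketch; the tables raw_logs and logs, their columns, and the partition column dt are hypothetical):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT uid, action, dt FROM raw_logs;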
hive.exec.mode.local.auto: whether Hive automatically decides, based on input file size, to run the job locally (on the gateway). Default: true
hive.exec.mode.local.auto.inputbytes.max: if hive.exec.mode.local.auto is true, jobs whose input is smaller than this threshold run automatically in local mode; the default is 128 MB. Default: 134217728L
hive.exec.mode.local.auto.tasks.max: if hive.exec.mode.local.auto is true, a Hive task whose number of Hadoop jobs is below this threshold runs automatically in local mode. Default: 4
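A typical combination for small ad-hoc queries, using the defaults quoted above (a sketch):
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.tasks.max=4;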
hive.auto.convert.join: whether to automatically convert a reduce-side common join into a map join based on the size of the small input table, speeding up joins of a large table with a small one. Default: false
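For instance, enabling the conversion lets a join against a small dimension table run map-side (a sketch; the tables fact and dim and their columns are hypothetical):
SET hive.auto.convert.join=true;
SELECT f.id, d.name FROM fact f JOIN dim d ON f.dim_id = d.id;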
hive.mapred.local.mem: the maximum amount of memory in local mode, in bytes; 0 means unlimited. Default: 0
hive.exec.scratchdir: an HDFS path used to store the execution plans of the different map/reduce stages and the intermediate output of those stages. Default: /tmp/<user.name>/hive
hive.metastore.warehouse.dir: the default storage path for Hive data files, usually a writable HDFS path. Default: ""
hive.groupby.skewindata: whether skewed data is supported in GROUP BY operations. Default: false
hive.default.fileformat: Hive's default output file format, the same as specified at table creation; options are 'TextFile', 'SequenceFile', and 'RCFile'. Default: 'TextFile'
hive.security.authorization.enabled: whether Hive authorization checking is enabled. Default: false
hive.exec.plan: the Hive execution plan path; set automatically by the program. Default: null
hive.exec.submitviachild: determines whether map/reduce jobs are submitted through a separate JVM (a child process); by default they are submitted from the same JVM as the HQL compiler. Default: false
hive.exec.script.maxerrsize: the maximum serialization error count allowed for user scripts executed through TRANSFORM/MAP/REDUCE. Default: 100000
hive.exec.script.allow.partial.consumption: whether a script is allowed to consume only part of its input; if true, a broken pipe with unconsumed data is treated as normal. Default: false
hive.exec.compress.output: determines whether the output of a query's final map/reduce job is in compressed format. Default: false
hive.exec.compress.intermediate: determines whether the output of a query's intermediate map/reduce jobs is in compressed format. Default: false
hive.intermediate.compression.codec: the compression codec class name for intermediate map/reduce jobs (one codec may cover several compression types); the value may be set automatically by the program.
hive.intermediate.compression.type: the compression type for intermediate map/reduce jobs, such as "BLOCK" or "RECORD".
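A session might compress both intermediate and final output like this (a sketch; SnappyCodec is one common choice and assumes that codec is installed on the cluster):
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;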
hive.exec.pre.hooks: statement-level hook; the class name of the hook invoked before the entire HQL statement executes. Default: ""
hive.exec.post.hooks: statement-level hook; the class name of the hook invoked after the entire HQL statement executes. Default: ""
hive.exec.parallel.thread.number: the number of threads used for concurrent submission. Default: 8
hive.mapred.reduce.tasks.speculative.execution: whether to enable reducer speculative execution; has the same effect as mapred.reduce.tasks.speculative.execution. Default: false
hive.exec.counters.pull.interval: the interval, in milliseconds, at which the client pulls progress counters. Default: 1000L
hadoop.bin.path: the path of the Hadoop client executable script used to submit jobs through a separate JVM; taken from the Hadoop client configuration. Default: $HADOOP_HOME/bin/hadoop
hadoop.config.dir: the path of the Hadoop client configuration files; taken from the Hadoop client configuration. Default: $HADOOP_HOME/conf
fs.default.name: the NameNode URL; taken from the Hadoop client configuration. Default: file:///
map.input.file: the input file of the map; taken from the Hadoop client configuration. Default: null
mapred.input.dir: the input directory of the map; taken from the Hadoop client configuration. Default: null
mapred.input.dir.recursive: whether the input directory is recursed into nested subdirectories; taken from the Hadoop client configuration. Default: false
mapred.job.tracker: the JobTracker URL; taken from the Hadoop client configuration. If set to 'local', local mode is used. Default: local
mapred.job.name: the map/reduce job name; if not set, a generated job name is used; taken from the Hadoop client configuration. Default: null
mapred.reduce.tasks.speculative.execution: speculative execution for map/reduce; taken from the Hadoop client configuration. Default: null
hive.metastore.metadb.dir: the Hive metadata repository path. Default: ""
hive.metastore.uris: the Hive metastore URIs; multiple thrift:// addresses, separated by commas. Default: ""
hive.metastore.connect.retries: the maximum number of retries for connecting to the Thrift metastore service. Default: 3
javax.jdo.option.ConnectionPassword: the JDO connection password. Default: ""
hive.metastore.ds.connection.url.hook: the class name of the JDO connection URL hook, used to obtain the JDO metadata database connection string; the class implements the JDOConnectionURLHook interface. Default: ""
javax.jdo.option.ConnectionURL: the metadata database connection URL. Default: ""
hive.metastore.ds.retry.attempts: the maximum number of attempts to connect to the backend datastore after a JDO data connection error. Default: 1
hive.metastore.ds.retry.interval: the interval, in milliseconds, between attempts to connect to the backend datastore. Default: 1000
hive.metastore.force.reload.conf: whether to force a reload of the metadata configuration; once reloaded, the value resets to false. Default: false
hive.metastore.server.min.threads: the minimum number of threads in the Thrift service thread pool. Default: 8
hive.metastore.server.max.threads: the maximum number of threads in the Thrift service thread pool. Default: 0x7fffffff
hive.metastore.server.tcp.keepalive: whether the Thrift service keeps TCP connections alive. Default: true
hive.metastore.archive.intermediate.original: the suffix of the original intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_ORIGINAL'
hive.metastore.archive.intermediate.archived: the suffix of the compressed intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_ARCHIVED'
hive.metastore.archive.intermediate.extracted: the suffix of the extracted intermediate directory used for archive compression; the exact name does not matter as long as it avoids conflicts. Default: '_INTERMEDIATE_EXTRACTED'
hive.cli.errors.ignore: whether to ignore errors; for SQL files containing multiple lines, the failing line can be skipped and execution continues with the next line. Default: false
hive.session.id: the identifier of the current session, in the format "username_time"; recorded in the job conf and not normally set manually. Default: ""
hive.session.silent: whether the current session runs in silent mode; if not silent, all INFO-level log messages are written to the console on the standard error stream. Default: false
hive.query.string: the query string currently being executed. Default: ""
hive.query.id: the ID of the query currently being executed. Default: ""
hive.query.planid: the ID of the map/reduce plan currently being executed. Default: ""
hive.jobname.length: Hive truncates the middle of the job name according to this length. Default: 50
hive.jar.path: the path where hive_cli.jar sits when jobs are submitted through a separate JVM.
hive.aux.jars.path: the path of the various plugin jars made up of user-defined UDFs and SerDes. Default: ""
hive.added.files.path: the paths of ADD FILE files. Default: ""
hive.added.jars.path: the paths of ADD JAR files. Default: ""
hive.added.archives.path: the paths of ADD ARCHIVE files. Default: ""
hive.table.name: the name of the current Hive table; passed into user scripts through the ScriptOperator. Default: ""
hive.partition.name: the name of the current Hive partition; passed into user scripts through the ScriptOperator. Default: ""
hive.script.auto.progress: whether scripts periodically send a heartbeat to the JobTracker, so that a long-running script is not assumed by the JobTracker to have hung. Default: false
hive.script.operator.id.env.var: the name of the environment variable used to identify the ScriptOperator ID. Default: 'HIVE_SCRIPT_OPERATOR_ID'
hive.alias: the current Hive alias; passed into user scripts through the ScriptOperator. Default: ""
hive.map.aggr: determines whether aggregation can happen on the map side. Default: true
hive.join.emit.interval: the Hive join emit interval, i.e. how many rows of the right-most join operand are buffered before join results are emitted. Default: 1000
hive.join.cache.size: the cache size of the Hive join operation, in rows. Default: 25000
hive.mapjoin.bucket.cache.size: the bucket cache size of a Hive map join. Default: 100
hive.mapjoin.size.key: the size of the key in each row of a Hive map join. Default: 10000
hive.mapjoin.cache.numrows: the number of rows a Hive map join caches. Default: 25000
hive.groupby.mapaggr.checkinterval: the check interval for map-side aggregation of a GROUP BY operation, in number of rows. Default: 100000
hive.map.aggr.hash.percentmemory: the proportion of virtual machine memory that the hash store for Hive map-side aggregation may occupy. Default: 0.5
hive.map.aggr.hash.min.reduction: the minimum reduction ratio of the hash store for Hive map-side aggregation. Default: 0.5
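Enabling map-side aggregation typically combines the three parameters above (a sketch using their default values):
SET hive.map.aggr=true;
SET hive.groupby.mapaggr.checkinterval=100000;
SET hive.map.aggr.hash.percentmemory=0.5;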
hive.udtf.auto.progress: whether a Hive UDTF periodically reports a heartbeat; useful when the UDTF runs for a long time without outputting rows. Default: false
hive.fileformat.check: whether Hive checks the output file format. Default: true
hive.querylog.location: the directory for Hive real-time query logs; if the value is null, no real-time query logs are created. Default: '/tmp/$USER'
hive.script.serde: the SerDe for Hive user scripts. Default: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
hive.script.recordreader: the RecordReader for Hive user scripts. Default: 'org.apache.hadoop.hive.ql.exec.TextRecordReader'
hive.script.recordwriter: the RecordWriter for Hive user scripts. Default: 'org.apache.hadoop.hive.ql.exec.TextRecordWriter'
hwi.listen.host: the host or IP that HWI binds to. Default: '0.0.0.0'
hwi.listen.port: the HTTP port on which HWI listens. Default: 9999
hwi.war.file: the path of the HWI war file. Default: $HWI_WAR_FILE
hive.test.mode: whether to run Hive in test mode. Default: false
hive.test.mode.prefix: the table-name prefix used in Hive test mode. Default: 'test_'
hive.test.mode.samplefreq: the sampling frequency used in Hive test mode. Default: 32
hive.test.mode.nosamplelist: the list of tables excluded from sampling in Hive test mode, comma-separated. Default: ""
hive.merge.size.per.task: the target size of the merged file at the end of each task; the number of reducers is determined from this size; the default is 256 MB. Default: 256000000
hive.merge.smallfiles.avgsize: the average size below which a group of small files needs to be merged; the default is 16 MB. Default: 16000000
hive.optimize.skewjoin: whether to optimize joins over skewed data; when enabled, a new map/reduce job is started to handle the skewed keys. Default: false
hive.skewjoin.key: the skew key threshold; keys exceeding this row count are treated as skew join keys. Default: 1000000
hive.skewjoin.mapjoin.map.tasks: the limit on the number of map tasks for the map join that handles the skewed data. Default: 10000
hive.skewjoin.mapjoin.min.split: the minimum data split size, in bytes, for the map join that handles the skewed data; the default is 32 MB. Default: 33554432
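A session that expects a handful of heavily skewed join keys might enable the whole group (a sketch using the defaults above):
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=1000000;
SET hive.skewjoin.mapjoin.map.tasks=10000;
SET hive.skewjoin.mapjoin.min.split=33554432;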
mapred.min.split.size: the minimum split size of a map/reduce job; same as the Hadoop client configuration. Default: 1
hive.mergejob.maponly: whether to enable map-only merge jobs. Default: true
hive.heartbeat.interval: the heartbeat interval of a Hive job, in milliseconds. Default: 1000
hive.mapjoin.maxsize: the maximum number of rows processed by a map join; if exceeded, the map join process exits abnormally. Default: 1000000
hive.hashtable.initialCapacity: a Hive map join dumps the small table into an in-memory hashtable; this parameter specifies that hashtable's initial capacity. Default: 100000
hive.hashtable.loadfactor: a Hive map join dumps the small table into an in-memory hashtable; this parameter specifies that hashtable's load factor. Default: 0.75
hive.mapjoin.followby.gby.localtask.max.memory.usage: the maximum proportion of memory that may be used when a MapJoinOperator is followed by a GroupByOperator. Default: 0.55
hive.mapjoin.localtask.max.memory.usage: the maximum proportion of heap memory that a map join local task may use. Default: 0.9
hive.mapjoin.localtask.timeout: the map join local task timeout; a Taobao-specific feature. Default: 600000
hive.mapjoin.check.memory.rows: check the memory footprint every this many rows; if hive.mapjoin.localtask.max.memory.usage is exceeded, the task exits abnormally and the map join fails. Default: 100000
hive.debug.localtask: whether to debug the local task; this parameter currently has no effect. Default: false
hive.task.progress: whether counters are enabled to record job execution progress; when enabled, clients also pull the progress counters. Default: false
hive.input.format: Hive's InputFormat; the default is org.apache.hadoop.hive.ql.io.HiveInputFormat, and the other option is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
hive.enforce.bucketing: whether to enforce bucketing. Default: false
hive.enforce.sorting: whether to enforce sorting. Default: false
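When loading a bucketed, sorted table, the two enforcement flags make sure the data actually lands bucketed and sorted (a sketch; the tables user_buckets and src are hypothetical):
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
CREATE TABLE user_buckets (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
INSERT OVERWRITE TABLE user_buckets
SELECT id, name FROM src;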
hive.mapred.partitioner: Hive's Partitioner class. Default: 'org.apache.hadoop.hive.ql.io.DefaultHivePartitioner'
hive.exec.script.trust: whether the Hive script operator is trusted. Default: false
hive.hadoop.supports.splittable.combineinputformat: whether a splittable CombineInputFormat is supported. Default: false
hive.optimize.cp: whether to optimize column pruning. Default: true
hive.optimize.ppd: whether to optimize predicate pushdown. Default: true
hive.optimize.groupby: whether to optimize GROUP BY. Default: true
hive.optimize.bucketmapjoin: whether to optimize the bucket map join. Default: false
hive.optimize.bucketmapjoin.sortedmerge: whether to try to force a sorted-merge bucket map join while optimizing the bucket map join. Default: false
hive.optimize.reducededuplication: whether to optimize away redundant reduce stages. Default: true
hive.hbase.wal.enabled: whether writes through the HBase storage handler are written to the HBase write-ahead log (WAL). Default: true
hive.archive.enabled: whether archiving to HAR files is enabled. Default: false
hive.archive.har.parentdir.settable: whether the parent directory of a HAR file can be set. Default: false
hive.outerjoin.supports.filters: whether outer joins support filter conditions. Default: true
hive.fetch.output.serde: the SerDe class of the fetch task. Default: 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe'
hive.semantic.analyzer.hook: Hive semantic analyzer hooks, called before and after the semantic analysis phase; used to inspect and modify the AST and the generated execution plan; comma-separated. Default: null
hive.cli.print.header: whether to display column names in query results. Default: false
hive.cli.encoding: Hive's default command-line character encoding. Default: 'UTF8'
hive.log.plan.progress: whether to log execution plan progress. Default: true
hive.exec.script.wrapper: the wrapper around script operator invocations, usually a script interpreter. For example, setting the value to "python" causes the script passed to the script operator to be invoked as "python <script command>"; if the value is null or unset, the script is invoked directly as "<script command>". Default: ""
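For example, with the wrapper set to "python" a TRANSFORM script is launched as "python my_script.py" (a sketch; my_script.py, its columns, and the table src are hypothetical):
SET hive.exec.script.wrapper=python;
ADD FILE my_script.py;
SELECT TRANSFORM (col1, col2) USING 'my_script.py' AS (out1, out2) FROM src;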
hive.check.fatal.errors.interval: the interval, in milliseconds, at which the client checks for fatal errors by pulling counters; a Taobao-specific configuration item. Default: 5000L