1. Problem phenomenon

The client was running a join analysis across several tables in Hive SQL, using the MR (MapReduce) engine. Because one of the tables has more than 50,000 partitions and more than 800 billion rows in total, the job failed with the following error:

    org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
    Job init failed org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
    java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_1558160008053_0002

In other words, the size of the job's job.splitmetainfo file exceeded the allowed upper limit.

Searching the Hadoop source code shows that this is caused by the parameter mapreduce.job.split.metainfo.maxsize, whose default value is 10,000,000.

Why isn't the default of 10,000,000 enough? That becomes clear from what mapreduce.job.split.metainfo.maxsize actually controls:

The job.splitmetainfo file records the metadata of the job's input splits. When there are too many input files, typically a large number of small files or directories, this structural information grows past the default limit and the job is aborted with the error above.

This limit is also a safeguard for the Hadoop cluster: files should not be too small and there should not be too many directories, otherwise loading metadata becomes a bottleneck on the NameNode. With the default block size of 128 MB, individual files should ideally be at least that large.

Why did this Hive SQL job fail? The Hive table being computed has more than 50,000 partitions, over 800 billion rows, and over 1.4 million data files stored in HDFS. Each split contributes on the order of tens to a few hundred bytes of metadata (offset, length, and host locations), so the default mapreduce.job.split.metainfo.maxsize of 10,000,000 bytes (roughly 10 MB) is not enough to record the metadata for that many files.

2. How to fix it

In the mapred-site.xml configuration file:

modify the parameter mapreduce.jobtracker.split.metainfo.maxsize = 200000000 (about 200 MB); on YARN/MR2 clusters the corresponding property name is mapreduce.job.split.metainfo.maxsize.

Then restart the MapReduce2 and YARN components. If the jobs are submitted through Hive SQL, restart Hive as well.
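
For reference, a minimal sketch of what the mapred-site.xml entry could look like, using the mapreduce.job.split.metainfo.maxsize name that applies on YARN/MR2 (check the exact property name against your own Hadoop release):

    <property>
      <name>mapreduce.job.split.metainfo.maxsize</name>
      <value>200000000</value>
      <description>Upper limit, in bytes, on the size of the job.splitmetainfo file (raised from the 10000000 default).</description>
    </property>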

Of course, the root cause of this problem is still too many input files or directories, so it is recommended to merge small files (see the sketch below). And if the amount of Hive table data to be computed is very large, write the SQL against an appropriate set of partitions instead of scanning the whole table.
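
To reduce the number of small files that Hive itself writes, its output-merge settings can be enabled. The sketch below shows the relevant hive-site.xml entries; the property names are standard Hive settings, but the size thresholds are illustrative values rather than figures from this case:

    <property>
      <name>hive.merge.mapfiles</name>
      <value>true</value>
      <!-- merge small files produced by map-only jobs -->
    </property>
    <property>
      <name>hive.merge.mapredfiles</name>
      <value>true</value>
      <!-- merge small files produced by map-reduce jobs -->
    </property>
    <property>
      <name>hive.merge.smallfiles.avgsize</name>
      <value>128000000</value>
      <!-- illustrative: trigger a merge pass when the average output file size is below ~128 MB -->
    </property>
    <property>
      <name>hive.merge.size.per.task</name>
      <value>256000000</value>
      <!-- illustrative: target size of the merged files -->
    </property>

These settings only affect data written from now on; small files already sitting in existing partitions still need to be compacted separately, for example by rewriting those partitions.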

Notes on setting the Hadoop platform parameter mapreduce.job.split.metainfo.maxsize

1. MR program execution error:

YarnRuntimeException: java.io.IOException: Split metadata size exceeded 10000000.

2. Cause analysis:

The input includes a large number of small files or directories, causing the split metadata (job.splitmetainfo) to exceed the default size limit.

3. Solutions:

In the mapred-site.xml configuration file:

Change the parameter mapreduce.jobtracker.split.metainfo.maxsize = 100000000,

or set mapreduce.jobtracker.split.metainfo.maxsize = -1 to remove the limit entirely (the default value is 10000000). The default entry is:

    <property>
      <name>mapreduce.job.split.metainfo.maxsize</name>
      <value>10000000</value>
    </property>

4. In-depth analysis:

The job.splitmetainfo file records split metadata. If there are too many input files, this structural information exceeds the default limit and the job fails with the error above.

The Hadoop cluster expects files that are not too small and directory counts that are not excessive, to avoid making NameNode metadata loading a bottleneck; this situation typically arises in scenarios such as picture/image storage, where huge numbers of small files accumulate. With the default block size of 128 MB, individual files should ideally be at least that large.

5. Source code analysis:

org.apache.hadoop.mapreduce.split.JobSplit

SplitMetaInfo stores the per-split file structure information, i.e. each split's data size, start offset, and host locations:

    @Override
    public String toString() {
      StringBuffer buf = new StringBuffer();
      buf.append("data-size : " + inputDataLength + "\n");
      buf.append("start-offset : " + startOffset + "\n");
      buf.append("locations : " + "\n");
      for (String loc : locations) {
        buf.append("  " + loc + "\n");
      }
      return buf.toString();
    }

end