1 Common Hadoop port numbers
| Purpose | Hadoop 2.x | Hadoop 3.x |
| --- | --- | --- |
| HDFS NameNode web UI | 50070 | 9870 |
| YARN web UI (MR job status) | 8088 | 8088 |
| JobHistory server web UI | 19888 | 19888 |
| Client access to the cluster (NameNode RPC) | 9000 | 8020 |
2 Hadoop configuration files and simple cluster setup
- Configuration files:
  - Hadoop 2.x: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves
  - Hadoop 3.x: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers
- Simple cluster setup process:
  1. Install the JDK
  2. Configure SSH password-free login
  3. Configure the Hadoop core files
  4. Format the NameNode
3 HDFS small file processing
- What impact do small files have?
  - Storage: each file, directory, or block occupies about 150 bytes of NameNode memory regardless of file size, so 100 million small files x 150 bytes costs as much metadata as 100 million full blocks x 150 bytes. How many blocks fit in 128 GB of NameNode memory? 128 x 1024 x 1024 x 1024 bytes / 150 bytes ≈ 900 million blocks.
  - Computing: each small file becomes its own MapTask and consumes a lot of computing resources.
- How to solve it?
  - Use HAR (Hadoop Archive) to archive small files.
  - Use CombineTextInputFormat so that many small files share one input split (see the driver sketch after the JVM-reuse config below).
  - Enable JVM reuse in small-file scenarios; if there are no small files, do not enable JVM reuse, because a used task slot is held until the whole job completes.
JVM reuse lets JVM instances be reused up to N times within the same job. N can be configured in Hadoop's mapred-site.xml file, usually to a value between 10 and 20:
```xml
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
</property>
```
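For the CombineTextInputFormat option above, here is a minimal driver sketch. The class name, job name, and the 4 MB split cap are illustrative; tune the cap to your file sizes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        // Pack many small files into shared splits instead of one MapTask per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 4 MB.
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);
        // ... set mapper/reducer and input/output paths, then job.waitForCompletion(true)
    }
}
```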
4 HDFS NameNode memory
- In the Hadoop 2.x series, NameNode memory is 2000 MB by default.
- In the Hadoop 3.x series, NameNode memory is allocated dynamically: a minimum of 1 GB, plus 1 GB for every additional one million blocks.
5 Hadoop downtime
- If MR jobs bring the system down, control the number of concurrent Yarn tasks and the maximum memory each task may request. Parameter to adjust: yarn.scheduler.maximum-allocation-mb (the maximum physical memory that can be allocated to a single task; default 8192 MB). A job-side sketch for capping task memory follows this list.
- If the NameNode crashes because files are written too fast, increase Kafka's storage capacity and throttle the write speed from Kafka to HDFS, for example by adjusting the batchSize parameter that controls Flume's per-batch data volume.
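As a job-side illustration of the memory controls above, the sketch below caps what each task requests through the MRv2 properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, keeping containers under the cluster's yarn.scheduler.maximum-allocation-mb limit. The class name and values are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryCappedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-task container requests (MB); must stay below the YARN maximum.
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        // The JVM heap inside each container is usually set somewhat lower.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        Job job = Job.getInstance(conf, "memory-capped-job");
        // ... set mapper/reducer and input/output paths, then job.waitForCompletion(true)
    }
}
```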
6 Hadoop solutions to data skew
- Combine on the map side ahead of time to reduce the amount of data transferred.
  Adding a combiner to the mapper is equivalent to reducing in advance: identical keys within a mapper are aggregated, which reduces both the data shuffled and the computation on the reducer side (a combiner sketch follows this list).
  This approach is not very effective if the keys causing the skew are spread in large numbers across different mappers.
- Local aggregation plus global aggregation, for when the skewed keys are spread across many mappers (see the salting mapper sketch after this list).
  In the first map stage, attach a random prefix from 1 to N to the keys that cause the skew, so identical keys are split across multiple reducers for local aggregation and their counts shrink dramatically.
  In the second MapReduce job, strip the random prefix from each key and aggregate globally.
  Idea: run MapReduce twice. The first pass hashes keys randomly to different reducers to balance the load; the second removes the random prefix and performs the final reduce.
  Because this runs MapReduce twice, performance is slightly lower.
- Increase the number of reducers to improve parallelism: JobConf.setNumReduceTasks(int).
- Custom partitioning: based on the data distribution, define a custom hash function to spread keys evenly across reducers (see the partitioner sketch below).
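For the combiner approach, here is a minimal sum-style sketch. The class name is illustrative and it assumes Text/IntWritable pairs as in word count; a combiner must be safe to apply zero or more times, which summing is.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pre-aggregates identical keys on the map side, so the shuffle carries
// one partial sum per key instead of every individual record.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
// Wiring it in: job.setCombinerClass(SumCombiner.class);
```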
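For local plus global aggregation, a hedged sketch of the first-pass salting mapper. N, the class name, and using the whole input line as the key are illustrative assumptions; a second job would strip the "salt#" prefix and aggregate globally.

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Prefixes each key with a random salt 0..N-1, so one hot key spreads
// over up to N reducers for local aggregation.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int N = 10; // number of salts; tune to the reducer count
    private static final IntWritable ONE = new IntWritable(1);
    private final Random random = new Random();
    private final Text saltedKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        saltedKey.set(random.nextInt(N) + "#" + line.toString());
        context.write(saltedKey, ONE);
    }
}
```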
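For custom partitioning, a hedged sketch that isolates one known hot key; "hotKey" is a placeholder, and the code assumes at least two reducers are configured.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes a known hot key to its own reducer and hashes the remaining keys
// across the rest, so the skewed key no longer dominates one partition.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if ("hotKey".equals(key.toString())) {
            return numPartitions - 1; // dedicated reducer for the hot key
        }
        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
// Wiring it in: job.setPartitionerClass(SkewAwarePartitioner.class);
```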
7 Project experience: benchmarking
After building a Hadoop cluster, test HDFS read/write performance and MR computing capability. The test JAR package is in the Hadoop share folder.
Total cluster throughput = bandwidth x number of cluster nodes / number of replicas
For example: 100 MB/s x 10 nodes / 3 replicas ≈ 333 MB/s
Note: if the test client runs on a cluster node, one replica is written locally and does not consume network bandwidth, so divide by (number of replicas - 1) instead. If data is uploaded from a client outside the cluster, every replica crosses the network, so do not subtract 1.