
Common commands
1. -ls: list the contents of the specified directory
hadoop fs -ls [directory]  eg: hadoop fs -ls /user/wangwu
2. -cat: display the contents of a file
hadoop fs -cat [file path]  eg: hadoop fs -cat /user/wangwu/data.txt
3. -put: upload a local file to Hadoop
hadoop fs -put [local path] [Hadoop directory]  eg: hadoop fs -put /home/t/file.txt /user/t (file.txt is a file name)
4. -put: upload a local folder to Hadoop
hadoop fs -put [local directory] [Hadoop directory]  eg: hadoop fs -put /home/t/dir_name /user/t (dir_name is a folder name)
5. -get: download a file from Hadoop to a local directory
hadoop fs -get [file path] [local directory]  eg: hadoop fs -get /user/t/ok.txt /home/t
6. -rm: delete the specified file on Hadoop
hadoop fs -rm [file path]  eg: hadoop fs -rm /user/t/ok.txt
7. -rm -r: delete the specified folder on Hadoop, including its subdirectories (the -r flag makes the delete recursive)
hadoop fs -rm -r [directory]  eg: hadoop fs -rm -r /user/t
8. -mkdir: create a directory under the specified Hadoop path
hadoop fs -mkdir [directory]  eg: hadoop fs -mkdir /user/t
9. -touchz: create an empty file under a Hadoop path
hadoop fs -touchz [file path]  eg: hadoop fs -touchz /user/new.txt
10. -mv: rename (move) a file on Hadoop
hadoop fs -mv [source path] [target path]  eg: hadoop fs -mv /user/test.txt /user/ok.txt
11. -setrep: set the replication factor of an HDFS file
hadoop fs -setrep [replication] [file path]  eg: hadoop fs -setrep 10 /tmp/tt/student.txt
12. kill a running Hadoop job
eg: hadoop job -kill [job-id]
13. -help: print help for a command
eg: hadoop fs -help rm
14. -moveFromLocal: cut and paste from local to HDFS
hadoop fs -moveFromLocal [local path] [HDFS directory]  eg: hadoop fs -moveFromLocal ./student.txt /tmp/test/
15. -appendToFile: append a file to the end of an existing file
eg: hadoop fs -appendToFile liubei.txt /sanguo/shuguo/zhangsan.txt
16. -chgrp, -chmod, -chown: change a file's group, permissions, or owner, just as in the Linux file system
eg: hadoop fs -chmod 666 /sanguo/shuguo/zhangsan.txt  eg: hadoop fs -chown itcast:itcast /sanguo/shuguo/zhangsan.txt
17. -copyFromLocal: copy a file from the local file system to an HDFS path
eg: hadoop fs -copyFromLocal readme.txt /
18. -copyToLocal: copy data from HDFS to the local machine
eg: hadoop fs -copyToLocal /sanguo/shuguo/zhangsan.txt ./
19. -cp: copy data from one HDFS path to another HDFS path
eg: hadoop fs -cp /sanguo/shuguo/zhangsan.txt /zhuge.txt
20. -tail: display the end of a file
eg: hadoop fs -tail /sanguo/shuguo/zhangsan.txt
21. -rmdir: delete an empty directory
eg: hadoop fs -mkdir /test  eg: hadoop fs -rmdir /test
22. -du: report folder and file sizes
eg: hadoop fs -du -s -h /user/itcast/test
2.7 K  /user/itcast/test
eg: hadoop fs -du -h /user/itcast/test
1.3 K  /user/itcast/test/README.txt
15     /user/itcast/test/jinlian.txt
1.4 K  /user/itcast/test/nihao.txt

Common Hadoop tuning parameters

The following parameters take effect when configured in the user's own MR application (mapred-default.xml):

mapreduce.map.memory.mb: Upper limit on the resources (in MB) a single MapTask may use. Default 1024. A MapTask that actually uses more than this is forcibly killed.
mapreduce.reduce.memory.mb: Upper limit on the resources (in MB) a single ReduceTask may use. Default 1024. A ReduceTask that actually uses more than this is forcibly killed.
mapreduce.map.cpu.vcores: Maximum number of CPU cores a MapTask may use. Default 1.
mapreduce.reduce.cpu.vcores: Maximum number of CPU cores a ReduceTask may use. Default 1.
mapreduce.reduce.shuffle.parallelcopies: Number of parallel copiers each Reduce uses to fetch data from the Maps. Default 5.
mapreduce.reduce.shuffle.merge.percent: Fraction of the buffer at which the fetched data starts being merged and written to disk. Default 0.66.
mapreduce.reduce.shuffle.input.buffer.percent: Fraction of the Reduce task's available memory used for the shuffle buffer. Default 0.70.
mapreduce.reduce.input.buffer.percent: Fraction of memory used to retain map outputs in the buffer during reduce. Default 0.0.
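As a sketch, a job could override these per-task limits in its own mapred-site.xml (or via `Configuration.set`). The values below are illustrative examples only, not recommendations:

```xml
<!-- Illustrative mapred-site.xml fragment; values are examples, not recommendations -->
<configuration>
  <!-- Raise the MapTask memory ceiling from the 1024 MB default to 2 GB -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <!-- Give each ReduceTask up to 4 GB -->
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- Fetch map outputs with 10 parallel copiers instead of the default 5 -->
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
  </property>
</configuration>
```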

The following parameters must be configured in the server's configuration file (yarn-default.xml) before YARN starts:

yarn.scheduler.minimum-allocation-mb: Minimum physical memory (MB) a single task can request. Default 1024.
yarn.scheduler.maximum-allocation-mb: Maximum physical memory (MB) a single task can request. Default 8192.
yarn.scheduler.minimum-allocation-vcores: Minimum number of CPU cores a single task can request. Default 1.
yarn.scheduler.maximum-allocation-vcores: Maximum number of CPU cores a single task can request. Default 32.
yarn.nodemanager.resource.memory-mb: Total physical memory YARN may use on a server node. Default 8192.
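A minimal yarn-site.xml sketch for a node with more than 8 GB of RAM might look like this; the 16 GB figures are assumed for illustration:

```xml
<!-- Illustrative yarn-site.xml fragment; set before starting YARN, values are examples -->
<configuration>
  <!-- Let a single container request up to 16 GB instead of the 8192 MB default -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
  <!-- Total physical memory YARN may use on this NodeManager -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
</configuration>
```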

Shuffle performance optimization parameters (mapred-default.xml), to be set before YARN starts:

mapreduce.task.io.sort.mb: Size of the Shuffle ring buffer. Default 100 MB.
mapreduce.map.sort.spill.percent: Threshold at which the ring buffer spills to disk. Default 80%.
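For example, a job producing large map outputs might enlarge the ring buffer so it spills less often; a sketch with assumed values:

```xml
<!-- Illustrative mapred-site.xml fragment for the shuffle; values are examples -->
<configuration>
  <!-- Double the ring buffer from 100 MB to 200 MB -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Start spilling at 90% full instead of the default 80% -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.90</value>
  </property>
</configuration>
```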

Fault-tolerant parameters (MapReduce performance optimization)

mapreduce.map.maxattempts: Maximum number of retries for each Map Task. A Map Task retried more than this many times is considered to have failed. Default 4.
mapreduce.reduce.maxattempts: Maximum number of retries for each Reduce Task. A Reduce Task retried more than this many times is considered to have failed. Default 4.
mapreduce.task.timeout: Task timeout, a parameter that often needs tuning. If a Task neither reads new input nor produces output for this long, it is assumed to be blocked, possibly stuck forever. To keep a hung user program from holding resources indefinitely, the framework enforces this timeout, in milliseconds; the default is 600000 (10 minutes). If your program takes a long time to process each input record (for example, it accesses a database or pulls data over the network), consider increasing this value. When it is set too small, you will often see errors like "AttemptID: attempt_14267829456721_123456_M_000224_0 Timed out after 300 secs. Container killed by the ApplicationMaster."
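Raising the timeout is a one-line config change; the 20-minute value below is an assumed example, not a recommendation:

```xml
<!-- Illustrative mapred-site.xml fragment; 1200000 ms = 20 minutes -->
<configuration>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>1200000</value>
  </property>
</configuration>
```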