
Common commands
1. -ls: list the contents of the specified directory
hadoop fs -ls [directory]  eg: hadoop fs -ls /user/wangwu
2. -cat: display the contents of a file
hadoop fs -cat [file path]  eg: hadoop fs -cat /user/wangwu/data.txt
3. -put: upload a local file to Hadoop
hadoop fs -put [local path] [Hadoop directory]  eg: hadoop fs -put /home/t/file.txt /user/t (file.txt is a file name)
4. -put: upload a local folder to Hadoop
hadoop fs -put [local directory] [Hadoop directory]  eg: hadoop fs -put /home/t/dir_name /user/t (dir_name is a folder name)
5. -get: download a file from Hadoop to a local directory
hadoop fs -get [file path] [local directory]  eg: hadoop fs -get /user/t/ok.txt /home/t
6. -rm: delete the specified file on Hadoop
hadoop fs -rm [file path]  eg: hadoop fs -rm /user/t/ok.txt
7. -rm -r: delete the specified folder on Hadoop, including its subdirectories (the -r flag makes the delete recursive)
hadoop fs -rm -r [directory]  eg: hadoop fs -rm -r /user/t
8. -mkdir: create a directory under the specified Hadoop path
hadoop fs -mkdir [directory]  eg: hadoop fs -mkdir /user/t
9. -touchz: create an empty file under a Hadoop path
hadoop fs -touchz [file path]  eg: hadoop fs -touchz /user/new.txt
10. -mv: rename (move) a file on Hadoop
hadoop fs -mv [source path] [target path]  eg: hadoop fs -mv /user/test.txt /user/ok.txt
11. -setrep: set the replication factor of an HDFS file
hadoop fs -setrep [replication] [file path]  eg: hadoop fs -setrep 10 /tmp/tt/student.txt
12. kill a running Hadoop job
eg: hadoop job -kill [job-id]
13. -help: print help for a command
eg: hadoop fs -help rm
14. -moveFromLocal: cut and paste from local to HDFS
hadoop fs -moveFromLocal [local path] [HDFS directory]  eg: hadoop fs -moveFromLocal ./student.txt /tmp/test/
15. -appendToFile: append a file to the end of an existing file
eg: hadoop fs -appendToFile liubei.txt /sanguo/shuguo/zhangsan.txt
16. -chgrp, -chmod, -chown: change a file's group, permissions, or owner, just as in the Linux file system
eg: hadoop fs -chmod 666 /sanguo/shuguo/zhangsan.txt  eg: hadoop fs -chown itcast:itcast /sanguo/shuguo/zhangsan.txt
17. -copyFromLocal: copy a file from the local file system to an HDFS path
eg: hadoop fs -copyFromLocal readme.txt /
18. -copyToLocal: copy data from HDFS to the local machine
eg: hadoop fs -copyToLocal /sanguo/shuguo/zhangsan.txt ./
19. -cp: copy data from one HDFS path to another HDFS path
eg: hadoop fs -cp /sanguo/shuguo/zhangsan.txt /zhuge.txt
20. -tail: display the end of a file
eg: hadoop fs -tail /sanguo/shuguo/zhangsan.txt
21. -rmdir: delete an empty directory
eg: hadoop fs -mkdir /test  eg: hadoop fs -rmdir /test
22. -du: report folder and file sizes
eg: hadoop fs -du -s -h /user/itcast/test
2.7 K  /user/itcast/test
eg: hadoop fs -du -h /user/itcast/test
1.3 K  /user/itcast/test/README.txt
15     /user/itcast/test/jinlian.txt
1.4 K  /user/itcast/test/nihao.txt

Common Hadoop tuning parameters

The following parameters take effect when configured in the user's own MR application (mapred-default.xml):

mapreduce.map.memory.mb: Upper limit on the resources (in MB) a single MapTask may use. Default 1024. A MapTask that actually uses more than this is forcibly killed.
mapreduce.reduce.memory.mb: Upper limit on the resources (in MB) a single ReduceTask may use. Default 1024. A ReduceTask that actually uses more than this is forcibly killed.
mapreduce.map.cpu.vcores: Maximum number of CPU cores a MapTask may use. Default 1.
mapreduce.reduce.cpu.vcores: Maximum number of CPU cores a ReduceTask may use. Default 1.
mapreduce.reduce.shuffle.parallelcopies: Number of parallel copiers each Reduce uses to fetch data from the Maps. Default 5.
mapreduce.reduce.shuffle.merge.percent: Fraction of the buffer at which the fetched data starts being merged and written to disk. Default 0.66.
mapreduce.reduce.shuffle.input.buffer.percent: Fraction of the Reduce task's available memory used for the shuffle buffer. Default 0.70.
mapreduce.reduce.input.buffer.percent: Fraction of memory used to retain map outputs in the buffer during reduce. Default 0.0.
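As a sketch, a job could override these per-task limits in its own mapred-site.xml (or via `Configuration.set`). The values below are illustrative examples only, not recommendations:

```xml
<!-- Illustrative mapred-site.xml fragment; values are examples, not recommendations -->
<configuration>
  <!-- Raise the MapTask memory ceiling from the 1024 MB default to 2 GB -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <!-- Give each ReduceTask up to 4 GB -->
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- Fetch map outputs with 10 parallel copiers instead of the default 5 -->
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
  </property>
</configuration>
```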

The following parameters must be configured in the server's configuration file (yarn-default.xml) before YARN starts:

yarn.scheduler.minimum-allocation-mb: Minimum physical memory (MB) a single task can request. Default 1024.
yarn.scheduler.maximum-allocation-mb: Maximum physical memory (MB) a single task can request. Default 8192.
yarn.scheduler.minimum-allocation-vcores: Minimum number of CPU cores a single task can request. Default 1.
yarn.scheduler.maximum-allocation-vcores: Maximum number of CPU cores a single task can request. Default 32.
yarn.nodemanager.resource.memory-mb: Total physical memory YARN may use on a server node. Default 8192.
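A minimal yarn-site.xml sketch for a node with more than 8 GB of RAM might look like this; the 16 GB figures are assumed for illustration:

```xml
<!-- Illustrative yarn-site.xml fragment; set before starting YARN, values are examples -->
<configuration>
  <!-- Let a single container request up to 16 GB instead of the 8192 MB default -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
  <!-- Total physical memory YARN may use on this NodeManager -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
</configuration>
```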

Shuffle performance optimization parameters (mapred-default.xml), to be set before YARN starts:

mapreduce.task.io.sort.mb: Size of the Shuffle ring buffer. Default 100 MB.
mapreduce.map.sort.spill.percent: Threshold at which the ring buffer spills to disk. Default 80%.
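For example, a job producing large map outputs might enlarge the ring buffer so it spills less often; a sketch with assumed values:

```xml
<!-- Illustrative mapred-site.xml fragment for the shuffle; values are examples -->
<configuration>
  <!-- Double the ring buffer from 100 MB to 200 MB -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Start spilling at 90% full instead of the default 80% -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.90</value>
  </property>
</configuration>
```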

Fault-tolerant parameters (MapReduce performance optimization)

mapreduce.map.maxattempts: Maximum number of retries for each Map Task. A Map Task retried more than this many times is considered to have failed. Default 4.
mapreduce.reduce.maxattempts: Maximum number of retries for each Reduce Task. A Reduce Task retried more than this many times is considered to have failed. Default 4.
mapreduce.task.timeout: Task timeout, a parameter that often needs tuning. If a Task neither reads new input nor produces output for this long, it is assumed to be blocked, possibly stuck forever. To keep a hung user program from holding resources indefinitely, the framework enforces this timeout, in milliseconds; the default is 600000 (10 minutes). If your program takes a long time to process each input record (for example, it accesses a database or pulls data over the network), consider increasing this value. When it is set too small, you will often see errors like "AttemptID: attempt_14267829456721_123456_M_000224_0 Timed out after 300 secs. Container killed by the ApplicationMaster."
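Raising the timeout is a one-line config change; the 20-minute value below is an assumed example, not a recommendation:

```xml
<!-- Illustrative mapred-site.xml fragment; 1200000 ms = 20 minutes -->
<configuration>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>1200000</value>
  </property>
</configuration>
```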