1/ The role of Streaming
<1> The biggest benefit of the Hadoop Streaming framework is that Map/Reduce programs written in any language can be run on a Hadoop cluster.
<2> The Map/Reduce programs simply read data from standard input (stdin) and write data to standard output (stdout).
<3> Because the stages communicate through stdin/stdout, the whole streaming job can be simulated locally by connecting the pieces with pipes, which makes it easy to debug the map/reduce programs before submitting them: cat inputfile | mapper | sort | reducer > output (see the sketch below).
<4> Finally, the Streaming framework also provides rich parameter control over job submission directly through streaming parameters, without having to modify any Java code; many advanced MapReduce features can be used simply by setting and adjusting the streaming parameters.
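A minimal sketch of that local debugging workflow, assuming a simple word-count style mapper.py and reducer.py (the file names and the word-count logic are illustrative, not part of the original job):

#!/usr/bin/env python
# mapper.py (illustrative): read lines from stdin, emit "word<TAB>1" for each word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (illustrative): stdin arrives sorted by key, so equal words are adjacent;
# sum the counts for each word and emit "word<TAB>total"
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The local test is then simply: cat inputfile | python mapper.py | sort | python reducer.py > output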
2 / Streaming limitations
<1> By default, Streaming can only process text files (TextFile). For binary data, a better method is to base64-encode the binary keys and values so that they are turned into text (see the sketch below).
<2> Data has to pass through standard input and standard output before and after the mapper and reducer, so there is some overhead in copying and parsing the data.
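A minimal sketch of the base64 idea, using only the Python standard library (the helper names are illustrative; where exactly you encode and decode depends on your own job):

# Illustrative helpers: wrap binary data as base64 text so it can travel through the
# text-only streaming channel, and unwrap it on the other side.
import base64

def encode_value(raw_bytes):
    # bytes -> ASCII-safe text that can be used as a streaming value
    return base64.b64encode(raw_bytes).decode("ascii")

def decode_value(text):
    # text produced by encode_value -> the original bytes
    return base64.b64decode(text)

# For example, a mapper could print(key + "\t" + encode_value(binary_value)),
# and the reducer would call decode_value() on the value it reads from stdin.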
3/Streaming command form
hadoop jar hadoop_streaming.jar [common options] [streaming options]
Note: the [common options] must be written before the [streaming options].
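For example, a complete submission could look like the following (the job name and paths reuse the examples from later sections and are only illustrative); note that the -D common options come before the streaming options:

hadoop jar hadoop_streaming.jar \
    -D mapred.job.name="Test001" \
    -D mapred.reduce.tasks=2 \
    -input /user_hadoop/houzhen03/input \
    -output /user_hadoop/houzhen03/output/999 \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file /home/hadoop/mapper.py \
    -file /home/hadoop/reducer.py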
4/ What are the common options
Reference website: https://www.cnblogs.com/shay-zhangjin/p/7714868.html

-D is the most commonly used common option; it replaces the older -jobconf (which is deprecated and will be removed).

-D mapred.job.name="Test001"    # job name
-D mapred.map.tasks=2           # number of map tasks
-D mapred.reduce.tasks=2        # number of reduce tasks
-D mapred.reduce.tasks=0        # no reduce phase; the mapper output is used directly as the job output
-D stream.map.output.field.separator=.   # delimiter between key and value in the mapper output
-D stream.num.map.output.key.fields=4    # the first 4 fields are the key, the rest is the value (see the illustration below)
  With the delimiter "." and 4 key fields, everything before the 4th "." is the key and the rest (including any further ".") is the value. If the mapper output contains fewer than 4 ".", the whole line is the key and the value is empty.
  By default \t is the delimiter: the part before the first \t is the key and the rest is the value; if the mapper prints no \t, the whole line is the key and the value is empty.
-D stream.reduce.output.field.separator=.   # delimiter between key and value in the reducer output
-D stream.num.reduce.output.key.fields=4    # the first 4 "."-separated fields of the reducer output are the key
-D mapred.job.priority=HIGH     # job priority; VERY_LOW, LOW, NORMAL, HIGH and VERY_HIGH can be set
-D mapred.job.map.capacity=5    # maximum number of map tasks running at the same time
-D mapred.job.reduce.capacity=3 # maximum number of reduce tasks running at the same time
-D mapred.task.timeout=6000     # the longest time (in milliseconds) a task may go without responding before it is killed
Bucketing (partitioning):
-D num.key.fields.for.partition=1   # only the 1st key column is used for bucketing (i.e. the first column)
-D num.key.fields.for.partition=2   # columns 1 and 2, two columns in total, are used as the bucketing key
To bucket by specific fields:
-D mapred.text.key.partitioner.options=-k1,2   # bucket by key columns 1 through 2
-D mapred.text.key.partitioner.options=-k2,2   # bucket by key column 2
-D mapreduce.reduce.memory.mb=512   # the amount of memory each reduce task requests, in MB
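A small pure-Python illustration (not Hadoop code) of how the framework splits a mapper output line once stream.map.output.field.separator=. and stream.num.map.output.key.fields=4 are set:

# Illustration only: mimic how a mapper output line is split into key and value
# under stream.map.output.field.separator=. and stream.num.map.output.key.fields=4
def split_key_value(line, sep=".", num_key_fields=4):
    fields = line.split(sep)
    if len(fields) <= num_key_fields:
        # not enough separators: the whole line becomes the key, the value is empty
        return line, ""
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    return key, value

print(split_key_value("a.b.c.d.e.f"))   # ('a.b.c.d', 'e.f')
print(split_key_value("a.b.c"))         # ('a.b.c', '')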
5 / streaming options
-input    # input data for the mapper; supports the * wildcard; use multiple -input options to specify multiple input files/directories; the files must be uploaded to HDFS manually before the job is submitted
-output   # output directory; the path must not already exist, otherwise it is treated as the output of another job
-mapper   # the mapper command, for example -mapper "python mapper.py"
-reducer  # the reducer command, for example -reducer "python reducer.py"
-file     # the local mapper and reducer program files and any other files the programs need; these are local files and are distributed to the compute nodes
6/ Mapper input/output, bucketing by key, and sorting by key
Mapper input: when each mapper starts running, the input file is split into lines (TextInputFormat splits on \n) and each line is passed to the mapper through stdin; the mapper program simply processes each line it reads from stdin.

Mapper output delimiters: by default, Hadoop separates the mapper output key and value with a TAB. This can be changed with:
-D stream.map.output.field.separator=.   # delimiter between key and value in each mapper output line
-D stream.num.map.output.key.fields=4    # the first 4 fields form the key, the rest is the value

The mapper output then goes through the following steps:
1. Before partitioning, the key and value are separated according to the mapper output delimiter:
-D stream.map.output.field.separator=.   # key/value delimiter of the mapper output
-D stream.num.map.output.key.fields=4    # everything before the 4th "." is the key, the rest is the value
-D map.output.key.field.separator=.      # delimiter between the fields inside the key
2. According to the bucketing delimiter and settings, decide which part of the key is used for the partition (by default all key columns are used, i.e. either the single key column, or all the key columns separated out by the delimiter are used for the partition), and tag each record with its partition. The number of key columns used is specified with:
-D num.key.fields.for.partition=1   # only the 1st key column is used for bucketing, i.e. the first column
-D num.key.fields.for.partition=2   # columns 1 and 2, two columns in total, are used as the bucketing key

Reducer input: what each reducer reads is the content with the partition tag removed (the tag is separated off according to the mapper delimiter), i.e. exactly what the mapper wrote to stdout, except that the records have been sorted. Therefore, if the mapper output delimiter is re-specified, the reducer program must be modified to split key and value according to the new mapper output delimiter (see the sketch below).

Reducer output: by default the reducer separates key and value with a TAB, which is why the reducer program normally joins key and value with a TAB when writing to stdout. If the reducer output delimiter is re-specified, the lines the reducer writes to stdout should also be built with the new delimiter (reducer -> stdout -> OutputFormat -> file: OutputFormat splits key and value according to the reducer output delimiter and writes them to the file).
-D stream.reduce.output.field.separator=.   # delimiter between key and value in the reducer output
-D stream.num.reduce.output.key.fields=4    # everything before the 4th "." is the key, the rest is the value
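A minimal sketch of a mapper/reducer pair written for a non-default delimiter, assuming the job is submitted with -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 (the input format, field layout and file names are illustrative):

#!/usr/bin/env python
# dot_mapper.py (illustrative): emits four "."-separated key fields followed by a value,
# matching stream.map.output.field.separator=. and stream.num.map.output.key.fields=4
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 5:
        continue
    # the first four fields become the composite key, the fifth becomes the value
    print(".".join(fields[:4]) + "." + fields[4])

#!/usr/bin/env python
# dot_reducer.py (illustrative): the reducer input is just the sorted mapper output,
# so key and value must be split with the same "." delimiter the mapper used
import sys

for line in sys.stdin:
    parts = line.strip().split(".")
    key = ".".join(parts[:4])
    value = ".".join(parts[4:])
    # aggregate by key as needed; here we simply re-emit key and value with a TAB,
    # since the default reducer output delimiter is TAB
    print("%s\t%s" % (key, value))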
7/ Running Python scripts on a Hadoop cluster
Upload the input files to HDFS (the distributed file storage system):
hdfs dfs -put test.txt /user_hadoop/houzhen03/input
Execute the Hadoop job on the cluster:
hadoop jar hadoop_streaming.jar -input <input file> -output <output directory> -mapper "python mapper.py" -reducer "python reducer.py" -file /home/hadoop/mapper.py -file /home/hadoop/reducer.py
The paths passed to -file must be absolute paths.
After the job is complete, check the output:
hdfs dfs -cat <output directory>/part-00000
8 / hadoop script
import os

# Note the space after "jar": the two paths below are concatenated into one command
hadoop_path = "/home/hadoop/hadoop-2.9.2/bin/hadoop jar "
hadoop_streaming_path = "/home/hadoop/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar"

# input path: can be a single file, or a directory containing many files
input_path = "/user_hadoop/houzhen03/input/1.txt"
# input_path = "/user_hadoop/houzhen03/input"          # a directory
output_path = "/user_hadoop/houzhen03/output/999"      # the output directory must not already exist

def main(input_path, output_path):
    command = hadoop_path + hadoop_streaming_path + \
              " -input " + input_path + \
              " -output " + output_path + \
              " -mapper 'python mapper.py'" + \
              " -reducer 'python reducer.py'" + \
              " -file /home/hadoop/mapper.py" + \
              " -file /home/hadoop/reducer.py"
    # print("The command to be executed is:\n", command)
    os.system(command)

main(input_path, output_path)
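Assuming this launcher is saved as, say, run_job.py (the file name is illustrative) and that mapper.py and reducer.py actually exist at the -file paths above, the job can be submitted simply with:
python run_job.py
and the result can then be checked with hdfs dfs -cat /user_hadoop/houzhen03/output/999/part-00000, as described in section 7.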