What is MapReduce
MapReduce is the computing model Hadoop uses for multi-node computation; put simply, it is Hadoop's methodology for splitting up a task. When I first encountered the concept of MapReduce, I found it hard to grasp, and I read a lot of material about it, but everyone explained it differently and the more I read the more confused I became. In essence it is a very simple idea. Here is an example to help you understand it; since most examples on the web use Hadoop's official word count (wordcount), this one is built on a different scenario.
Imagine the following report card
```
1,Zhang San,78,87,69
2,Li Si,56,76,91
3,Wang Wu,65,46,84
4,Zhao Liu,89,56,98
...
```
The columns are: ID, student name, Chinese score, math score, and English score. Now we need to find the highest score in each subject. Suppose the report card is very long, with tens of millions of lines; without the help of any computer system, how would you solve this problem by hand?
- Have one person do the counting
Assigning the work to a single person has the advantage of simplicity, but the disadvantage is obvious: it takes a very long time, and once the amount of data reaches a certain scale, one person might not finish the count in a lifetime.
- Have many people do the counting
If enough people are available to do the work, how do you coordinate them? Assume the report card has 1,000,000 lines and 1,000 people are available to count.
- Appoint an administrator. The administrator divides the report card into 1,000 equal parts and hands one part to each of the 1,000 people, so each person counts 1,000 lines.
- The administrator makes a form and asks everyone to fill in their own results. The form looks like this:
Subject | Person 1's result | Person 2's result | … | Person 1000's result
---|---|---|---|---
Chinese | | | |
Math | | | |
English | | | |
- The administrator finally collects data like the following:
Subject | Person 1's result | Person 2's result | … | Person 1000's result
---|---|---|---|---
Chinese | 80 | 85 | … | 76
Math | 89 | 90 | … | 88
English | 94 | 85 | … | 90
- Each subject now has 1,000 results. The administrator splits this table into 100 small tables and hands them to 100 people, so each small table holds 10 results per subject. The small tables look like this:
The small table the first person receives:
Subject | Person 1's result | Person 2's result | … | Person 10's result
---|---|---|---|---
Chinese | 80 | 85 | … | 76
Math | 89 | 90 | … | 88
English | 94 | 85 | … | 90
The small table the second person receives:
Subject | Person 11's result | Person 12's result | … | Person 20's result
---|---|---|---|---
Chinese | 83 | 75 | … | 88
Math | 79 | 95 | … | 58
English | 94 | 85 | … | 90
- The administrator collects everyone's results again and obtains another 100 rows of data. If he wants, he can split this data once more and hand it to several people to count before getting the final maximums; or, since the amount of data is now small, he can simply finish the final count himself.
So in this process, a huge report card went through the following steps to give us the result we wanted:
- The report card was split into multiple parts
- Each part was counted separately
- The results were collected and registered
- The intermediate results could be split and counted again, or counted directly
- Finally the overall result was obtained
In MapReduce terms, the same process reads like this:
- Splitting the report card into multiple parts – split
- Counting each part separately – map
- Collecting and registering the results – shuffle
- Splitting the intermediate results for another round of counting – combine
- Or computing the final result directly – reduce
One more thing to notice: in the administrator's forms, the results are always recorded behind the three subjects. The subject acts as a key and the scores behind it are its values, and this key/value structure is exactly what MapReduce operates on. The sketch below walks through all of these phases in plain Java.
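Before moving to real Hadoop code, here is a minimal single-process sketch of the split/map/shuffle/reduce phases, written in plain Java with ordinary collections. Nothing in it is Hadoop API; the class and variable names are ours, purely for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {
    public static void main(String[] args) {
        // The "report card": one line per student, as in the example above.
        String[] lines = {
                "1,Zhang San,78,87,69",
                "2,Li Si,56,76,91",
                "3,Wang Wu,65,46,84",
                "4,Zhao Liu,89,56,98"
        };
        String[] subjects = {"Chinese", "Math", "English"};

        // map: parse each line and emit one (subject, score) pair per subject.
        List<SimpleEntry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            String[] ss = line.split(",");
            for (int i = 0; i < subjects.length; i++) {
                mapped.add(new SimpleEntry<>(subjects[i], Integer.parseInt(ss[i + 2])));
            }
        }

        // shuffle: group the emitted scores by key, i.e. by subject.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (SimpleEntry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // reduce: find the maximum score for each subject.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int max = -1;
            for (int score : e.getValue()) {
                if (score > max) {
                    max = score;
                }
            }
            System.out.println(e.getKey() + "\t" + max);
        }
    }
}
```

In real Hadoop, the map loop runs on many nodes in parallel, the framework performs the grouping, and the reduce loop runs in separate reduce tasks; the code in the next section maps onto these three blocks almost one-to-one.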
Development
Now let's solve the problem above with real Java code, assuming you have set up a Hadoop cluster environment as described in the previous tutorial.
- Create a project
You can create an ordinary Java project in your favorite IDE. It is recommended to use Maven for dependency management and to add the following dependency:
```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.5.1</version>
</dependency>
```
- Create Mapper
Mapper corresponds to the map phase in MapReduce. Here is the Mapper code:
StudentMapper.java
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StudentMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable longWritable, Text text,
                    OutputCollector<Text, IntWritable> outputCollector, Reporter reporter) throws IOException {
        // Each input line looks like: 1,Zhang San,78,87,69
        String[] ss = text.toString().split(",");
        // Emit one (subject, score) pair per subject.
        outputCollector.collect(new Text("Chinese"), new IntWritable(Integer.parseInt(ss[2])));
        outputCollector.collect(new Text("Math"), new IntWritable(Integer.parseInt(ss[3])));
        outputCollector.collect(new Text("English"), new IntWritable(Integer.parseInt(ss[4])));
    }
}
```
StudentMapper implements the Mapper<LongWritable, Text, Text, IntWritable> interface, which takes four type parameters with the following meanings:
- LongWritable: Hadoop splits the text file by line; this value is the offset of the line within the file and is normally not used
- Text: the content of the line, for example the first line `1,Zhang San,78,87,69`
- Text: as mentioned above, we aggregate by subject and compute the highest score, so the subject name is the key and each computed result is the value; we use Text for the key because we need to store the subject name
- IntWritable: the value; in the Mapper output it holds a single score, and in the final result it holds the highest score of the subject
The work happens in the method map(LongWritable longWritable, Text text, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter). Note that the OutputCollector can collect multiple results, and the Reporter can report the task's progress back to Hadoop. In this Mapper we do not perform any calculation; we simply parse the scores out of the line and put them into the OutputCollector keyed by subject. It is just like the first round in the analogy, where nobody computed anything and the data was only regrouped. After passing through the Mapper, the data
```
1,Zhang San,78,87,69
2,Li Si,56,76,91
3,Wang Wu,65,46,84
4,Zhao Liu,89,56,98
...
```
becomes:
Subject | | | | |
---|---|---|---|---|---
Chinese | 78 | 56 | 65 | 89 | …
Math | 87 | 76 | 46 | 56 | …
English | 69 | 91 | 84 | 98 | …
- Create Reducer
StudentReducer.java
```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StudentReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text text, Iterator<IntWritable> iterator,
                       OutputCollector<Text, IntWritable> outputCollector, Reporter reporter) throws IOException {
        // Scan all scores collected for this subject and keep the maximum.
        int max = -1;
        while (iterator.hasNext()) {
            int score = iterator.next().get();
            if (score > max) {
                max = score;
            }
        }
        outputCollector.collect(new Text(text.toString()), new IntWritable(max));
    }
}
```
Now the Reducer performs the actual calculation. The parameters of its reduce(Text text, Iterator<IntWritable> iterator, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter) method are as follows:
- Text: the key; it matches the third type parameter of the Mapper, so here it is the subject name
- Iterator<IntWritable>: the values written into the OutputCollector by the Mapper for this key; together with the first parameter it corresponds to the Mapper's output
- OutputCollector<Text, IntWritable>: the Reducer writes its calculated result to this parameter. Here we write entries with a structure like key: Chinese, value: 90, hence the type <Text, IntWritable>
As mentioned earlier, the Mapper organizes the data by writing scores into the OutputCollector grouped by subject. At the Reducer step, Hadoop gathers the data written by the Mappers, groups it by key (i.e., by subject), and hands each group to the Reducer. The Reducer computes the highest score within the group and writes the result into its own OutputCollector.
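Schematically, for the four-line sample file, the framework invokes the Reducer once per subject with the grouped scores, along these lines (an illustration of the data flow, not actual program output):

```
reduce("Chinese", [78, 56, 65, 89]) -> writes ("Chinese", 89)
reduce("Math",    [87, 76, 46, 56]) -> writes ("Math",    87)
reduce("English", [69, 91, 84, 98]) -> writes ("English", 98)
```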
- Create the startup class
StudentProcessor.java
```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class StudentProcessor {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(StudentProcessor.class);
        conf.setJobName("max_score_poc1");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(StudentMapper.class);
        conf.setReducerClass(StudentReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0] is the HDFS input directory, args[1] the output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```
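The analogy earlier also had a combiner step (splitting the intermediate results and counting them again). Since taking a maximum is associative, and StudentReducer's input and output types are both <Text, IntWritable>, the same class can double as a combiner that pre-aggregates each map task's output before the shuffle. With the mapred API used here, that is one optional extra line in main (the non-zero Combine counters in the job output below suggest the recorded run did have one configured):

```java
// Optional: reuse the reducer as a combiner to pre-aggregate map output
// locally on each map node before it is shuffled to the reducer.
conf.setCombinerClass(StudentReducer.class);
```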
The startup class above contains the main function and submits the job. Run the mvn package command to build the project; suppose the resulting jar is named hadoop-score-job.jar. Upload the jar to a directory on the server with a tool such as FTP.
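For reference, with a standard Maven layout the build and upload could look like the following; the server address and target directory are placeholders, so substitute your own:

```
$ mvn package
$ scp target/hadoop-score-job.jar hadoop@master:~/
```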
- Upload data
Hadoop uses HDFS, a distributed file system that stores large files across multiple nodes; with the HDFS CLI tools it feels like working with local files. In the code above, FileInputFormat.setInputPaths(conf, new Path(args[0])) lets the user specify the directory whose files serve as the data source. That directory lives in HDFS, it must already exist, and the data must be uploaded into it. Execute the following command to create the directory:
```
hadoop fs -mkdir poc01_input
```
Execute the following command to import the data into HDFS
```
hadoop fs -put score.txt poc01_input
```
score.txt contains:
```
1,Zhang San,78,87,69
2,Li Si,56,76,91
3,Wang Wu,65,46,84
4,Zhao Liu,89,56,98
```
Use the ls command to check whether the file was uploaded successfully
```
$ hadoop fs -ls poc01_input
Found 1 items
-rw-r--r--   1 hadoop supergroup         72 2020-12-13 15:43 poc01_input/score.txt
```
- Run the job
Execute the following command to start running the Job
```
$ hadoop jar hadoop-score-job.jar com.hadoop.poc.StudentProcessor poc01_input poc01_output
20/12/13 16:01:33 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.8.42:18040
20/12/13 16:01:33 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.8.42:18040
20/12/13 16:01:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/12/13 16:01:34 INFO mapred.FileInputFormat: Total input files to process : 1
20/12/13 16:01:35 INFO mapreduce.JobSubmitter: number of splits:2
20/12/13 16:01:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1607087481584_0005
20/12/13 16:01:35 INFO conf.Configuration: resource-types.xml not found
20/12/13 16:01:35 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/12/13 16:01:35 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
20/12/13 16:01:35 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
20/12/13 16:01:36 INFO impl.YarnClientImpl: Submitted application application_1607087481584_0005
20/12/13 16:01:36 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1607087481584_0005/
20/12/13 16:01:36 INFO mapreduce.Job: Running job: job_1607087481584_0005
20/12/13 16:01:43 INFO mapreduce.Job: Job job_1607087481584_0005 running in uber mode : false
20/12/13 16:01:43 INFO mapreduce.Job:  map 0% reduce 0%
20/12/13 16:01:51 INFO mapreduce.Job:  map 100% reduce 0%
20/12/13 16:01:57 INFO mapreduce.Job:  map 100% reduce 100%
20/12/13 16:01:57 INFO mapreduce.Job: Job job_1607087481584_0005 completed successfully
20/12/13 16:01:57 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=84
		FILE: Number of bytes written=625805
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=316
		HDFS: Number of bytes written=30
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=12036
		Total time spent by all reduces in occupied slots (ms)=3311
		Total time spent by all map tasks (ms)=12036
		Total time spent by all reduce tasks (ms)=3311
		Total vcore-milliseconds taken by all map tasks=12036
		Total vcore-milliseconds taken by all reduce tasks=3311
		Total megabyte-milliseconds taken by all map tasks=12324864
		Total megabyte-milliseconds taken by all reduce tasks=3390464
	Map-Reduce Framework
		Map input records=4
		Map output records=12
		Map output bytes=132
		Map output materialized bytes=90
		Input split bytes=208
		Combine input records=12
		Combine output records=6
		Reduce input groups=3
		Reduce shuffle bytes=90
		Reduce input records=6
		Reduce output records=3
		Spilled Records=12
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=395
		CPU time spent (ms)=1790
		Physical memory (bytes) snapshot=794595328
		Virtual memory (bytes) snapshot=5784080384
		Total committed heap usage (bytes)=533200896
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=108
	File Output Format Counters
		Bytes Written=30
```
- hadoop-score-job.jar is the jar packaged above; you need to cd into the jar's directory before executing the command
- com.hadoop.poc.StudentProcessor is the class that contains the main function
- poc01_input is the data source directory
- poc01_output is the data output directory
When the job finishes, the results are stored in the specified output directory:
```
$ hadoop fs -ls poc01_output2
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2020-12-13 16:01 poc01_output2/_SUCCESS
-rw-r--r--   1 hadoop supergroup         30 2020-12-13 16:01 poc01_output2/part-00000

$ hadoop fs -cat poc01_output2/part-00000
Math	87
English	98
Chinese	89
```