Preface:
We’ve already run Hadoop’s built-in WordCount example; this time we’ll write our own, which is to Hadoop roughly what Hello World is to a programming language. First, let’s look at which parts of MapReduce we actually have to write. We know that Hadoop processes a file by splitting it into many parts, processing those parts separately, and finally aggregating the results. Let’s use the word-count example to see which parts of that MapReduce process we are responsible for writing.
A concrete MapReduce example
Suppose we have a file with the following content:
hello world hello java
hello hadoop
It’s a very simple file with only two lines. So how does Hadoop count the words? Let’s describe it step by step:
Step 1: Split each line into words and map every word to a < word,1> pair:
< Hello,1> < World,1> < Hello,1> < Java,1>
< Hello,1> < Hadoop,1>
Step 2: Sort the pairs by key:
< Hadoop,1> < Hello,1> < Hello,1> < Hello,1> < Java,1> < World,1>
Step 3: Group the values that share the same key:
< Hadoop,[1]> < Hello,[1,1,1]> < Java,[1]> < World,[1]>
Step 4: Aggregation result: < Hadoop,1> < Hello,3> < Java,1> < World,1>
By the end of step 4 the word count is complete. After this concrete example you should have a clear picture of the MapReduce process. The key point is that steps 2 and 3 are done for us by the Hadoop framework; the code we actually write is for steps 1 and 4. Step 1 corresponds to the Map phase and step 4 corresponds to the Reduce phase.
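To make these four steps concrete before touching Hadoop at all, here is a minimal plain-Java sketch that simulates them on the two-line file above (an illustration only; the class name and details are made up, and in the real job the sorting, grouping and distribution are handled by the Hadoop framework):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSimulation {
    public static void main(String[] args) {
        String[] lines = {"hello world hello java", "hello hadoop"};

        // Step 1 (Map): turn every word of every line into a <word, 1> pair.
        List<String> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                mapped.add(word); // stands for the pair <word, 1>
            }
        }

        // Steps 2 and 3 (done by the Hadoop framework): sort by key and
        // group the 1s that share the same key; a TreeMap keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapped) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // Step 4 (Reduce): sum the grouped 1s for each word.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int count = 0;
            for (int one : entry.getValue()) {
                count += one;
            }
            System.out.println("<" + entry.getKey() + "," + count + ">");
        }
    }
}
Running it prints < hadoop,1> < hello,3> < java,1> < world,1>, the same result as step 4 above.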
Write MapReduce code
Now all we need to do is complete the code for steps 1 and 4:
1. Create the project.
2. Import the Hadoop dependency packages.
3. After importing the packages, create a Java file called WordCount and start writing code. Pay attention to the imports: several classes with the same names come from different jars, and it is easy to import the wrong one (see the note in the import list below).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
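// Note: use the classes from the org.apache.hadoop.mapreduce packages above;
// the older org.apache.hadoop.mapred packages contain classes with the same
// names (Mapper, Reducer, FileInputFormat, ...) and are easy to import by mistake.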
import java.io.IOException;
import java.util.StringTokenizer;
/**
* @author wxwwt
* @since 2019-09-15
*/
public class WordCount {
/**
* Object: input key type (the offset of the line in the file)
* Text: input value type (the content of one line)
* Text: output key type (a word)
* IntWritable: output value type (the count 1)
*/
private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
context.write(new Text(itr.nextToken()), new IntWritable(1));
}
}
}

/**
* Text: key type input from the Mapper (a word)
* IntWritable: value type input from the Mapper (a count of 1)
* Text: key type output by the Reducer (the word)
* IntWritable: value type output by the Reducer (the total count)
*/
private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable item : values) {
count += item.get();
}
context.write(key, new IntWritable(count));
}
}

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Create the configuration
Configuration configuration = new Configuration();
// Set jobName to WordCount
Job job = Job.getInstance(configuration, "WordCount");
// Set the jar
job.setJarByClass(WordCount.class);
// Set the Mapper class
job.setMapperClass(WordCountMapper.class);
// Set the Reducer class
job.setReducerClass(WordCountReducer.class);
// Set the output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Set the input/output paths
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Exit the program after the job is executed
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Mapper program:
/**
* Object: input key type (the offset of the line in the file)
* Text: input value type (the content of one line)
* Text: output key type (a word)
* IntWritable: output value type (the count 1)
*/
private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
context.write(new Text(itr.nextToken()), new IntWritable(1));
}
}
}
Context is the context object through which the Mapper talks to the rest of the Hadoop framework, and StringTokenizer breaks value (the data of one line) into tokens by whitespace. If no delimiter is passed to StringTokenizer, it defaults to " \t\n\r\f" (space, tab, newline, carriage return and form feed). context.write(new Text(itr.nextToken()), new IntWritable(1)) then writes each word and the count 1 into the context as a key/value pair. Note: in Hadoop programming, String becomes Text and Integer becomes IntWritable; these are classes Hadoop wraps itself, and they behave much like the original classes. So here the key is the word as a Text and the value is an IntWritable of 1.
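As a quick standalone check of that default behaviour, the following small sketch (an illustration only, not part of the job) tokenizes the first line of our sample file:
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // No delimiter given, so StringTokenizer splits on " \t\n\r\f"
        // (space, tab, newline, carriage return, form feed).
        StringTokenizer itr = new StringTokenizer("hello world hello java");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken()); // prints hello, world, hello, java
        }
    }
}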
Reducer program:
/**
* Text: key type input from the Mapper (a word)
* IntWritable: value type input from the Mapper (a count of 1)
* Text: key type output by the Reducer (the word)
* IntWritable: value type output by the Reducer (the total count)
*/
private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable item : values) {
count += item.get();
}
context.write(key, new IntWritable(count));
}
}
Reduce completes step 4. Looking at the example above, the input to reduce is something like < Hello,[1,1,1]>, so the loop over values sums the three 1s and produces < Hello,3>.
Program entry:
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Create the configuration
Configuration configuration = new Configuration();
// Set jobName to WordCount
Job job = Job.getInstance(configuration, "WordCount");
// Set the jar
job.setJarByClass(WordCount.class);
// Set the Mapper class
job.setMapperClass(WordCountMapper.class);
// Set the Reducer class
job.setReducerClass(WordCountReducer.class);
// Set the output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Set the input/output paths
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Exit the program after the job is executed
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here we set the configuration, the classes, and the input/output paths that the MapReduce job needs. One point deserves a little extra attention:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
When we run hadoop jar WordCount.jar /input/wordcount/file1 /output/wcoutput, the last two arguments are the input path and the output path of the file. If our code changes the position of these parameters, or takes additional parameters, we have to make sure the args indices still match.
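A small guard at the top of main() (a sketch assuming the two-argument form above, not part of the original program) makes a wrong invocation fail fast instead of throwing an ArrayIndexOutOfBoundsException or writing to the wrong path:
// Sketch: check the command-line arguments before building the job.
if (args.length < 2) {
    System.err.println("Usage: hadoop jar WordCount.jar <input path> <output path>");
    System.exit(2);
}
// args[0] is the input path and args[1] is the output path, matching the
// FileInputFormat.addInputPath / FileOutputFormat.setOutputPath calls above.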
To package the jar in IDEA, select File -> Project Structure -> Artifacts -> + -> JAR -> From modules with dependencies.
For how to run the packaged jar, refer to the earlier article: Running the first Hadoop example, WordCount.
Things to note:
If you run the jar directly:
hadoop jar WordCount.jar /input/wordcount/file1 /output/wcoutput
it may fail with an exception like this:
Exception in thread "main" java.io.IOException: Mkdirs failed to create /XXX/XXX
at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106)
at org.apache.hadoop.util.RunJar.main(RunJar.java:150)
If you see something like this, you need to delete the LICENSE file from the jar package (see the Stack Overflow link in the references). First list the license entries in the jar:
jar tvf xxx.jar | grep -i license
then delete META-INF/LICENSE from it:
zip -d xxx.jar META-INF/LICENSE
Conclusion:
1. Understand the steps of MapReduce: we only need to write the Map and Reduce parts, since the Hadoop framework handles the intermediate steps, and other programs can follow the same pattern in the future.
2. In Hadoop, use Text in place of String and IntWritable in place of Integer.
3. If you hit "Mkdirs failed to create /XXX/XXX", first check whether the path is wrong; if not, delete the META-INF/LICENSE file from the jar package.
References:
1. hadoop.apache.org/docs/stable…
2. stackoverflow.com/questions/1…