Preface:
We’ve already run Hadoop’s built-in WordCount example; this time we’ll write our own, which is to Hadoop roughly what Hello World is to a programming language. First, let’s look at which parts of MapReduce we actually have to write. We know that Hadoop processes a file by splitting it into many parts, processing those parts separately, and finally aggregating the results. Let’s use the word-count example to see which parts of that MapReduce process we are responsible for writing.
A concrete MapReduce example
Suppose we have a file with the following content:
hello world hello java
hello hadoop
It’s a very simple file with only two lines. So how does Hadoop count the words? Let’s describe it step by step:
Step 1: Split each line into words and map every word to a < word,1> pair:
< Hello,1> < World,1> < Hello,1> < Java,1>
< Hello,1> < Hadoop,1>
Step 2: Sort the pairs by key:
< Hadoop,1> < Hello,1> < Hello,1> < Hello,1> < Java,1> < World,1>
Step 3: Group the values that share the same key:
< Hadoop,[1]> < Hello,[1,1,1]> < Java,[1]> < World,[1]>
Step 4: Aggregation result: < Hadoop,1> < Hello,3> < Java,1> < World,1>
By the end of step 4 the word count is complete. After this concrete example you should have a clear picture of the MapReduce process. The key point is that steps 2 and 3 are done for us by the Hadoop framework; the code we actually write is for steps 1 and 4. Step 1 corresponds to the Map phase and step 4 corresponds to the Reduce phase.
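To make these four steps concrete before touching Hadoop at all, here is a minimal plain-Java sketch that simulates them on the two-line file above (an illustration only; the class name and details are made up, and in the real job the sorting, grouping and distribution are handled by the Hadoop framework):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSimulation {
    public static void main(String[] args) {
        String[] lines = {"hello world hello java", "hello hadoop"};

        // Step 1 (Map): turn every word of every line into a <word, 1> pair.
        List<String> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                mapped.add(word); // stands for the pair <word, 1>
            }
        }

        // Steps 2 and 3 (done by the Hadoop framework): sort by key and
        // group the 1s that share the same key; a TreeMap keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapped) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // Step 4 (Reduce): sum the grouped 1s for each word.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int count = 0;
            for (int one : entry.getValue()) {
                count += one;
            }
            System.out.println("<" + entry.getKey() + "," + count + ">");
        }
    }
}
Running it prints < hadoop,1> < hello,3> < java,1> < world,1>, the same result as step 4 above.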
Write MapReduce code
Now all we need to do is complete the code for steps 1 and 4:
1. Create the project.
2. Import the Hadoop dependency packages.
3. After importing the packages, create a Java file called WordCount and start writing code. Pay attention to the imports: several classes with the same names come from different jars, and it is easy to import the wrong one (see the note in the import list below).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
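// Note: use the classes from the org.apache.hadoop.mapreduce packages above;
// the older org.apache.hadoop.mapred packages contain classes with the same
// names (Mapper, Reducer, FileInputFormat, ...) and are easy to import by mistake.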
import java.io.IOException;
import java.util.StringTokenizer;
/**
* @author wxwwt
* @since 2019-09-15
*/
public class WordCount {
/**
* Object: input key type (the offset of the line in the file)
* Text: input value type (the content of one line)
* Text: output key type (a word)
* IntWritable: output value type (the count 1)
*/
private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
context.write(new Text(itr.nextToken()), new IntWritable(1));
}
}
}

/**
* Text: key type input from the Mapper (a word)
* IntWritable: value type input from the Mapper (a count of 1)
* Text: key type output by the Reducer (the word)
* IntWritable: value type output by the Reducer (the total count)
*/
private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable item : values) {
count += item.get();
}
context.write(key, new IntWritable(count));
}
}

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Create the configuration
Configuration configuration = new Configuration();
// Set jobName to WordCount
Job job = Job.getInstance(configuration, "WordCount");
// Set the jar
job.setJarByClass(WordCount.class);
// Set the Mapper class
job.setMapperClass(WordCountMapper.class);
// Set the Reducer class
job.setReducerClass(WordCountReducer.class);
// Set the output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Set the input/output paths
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Exit the program after the job is executed
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Mapper program:
/**
* Object: input key type (the offset of the line in the file)
* Text: input value type (the content of one line)
* Text: output key type (a word)
* IntWritable: output value type (the count 1)
*/
private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
@Override
protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
context.write(new Text(itr.nextToken()), new IntWritable(1));
}
}
}
Context is the context object through which the Mapper talks to the rest of the Hadoop framework, and StringTokenizer breaks value (the data of one line) into tokens by whitespace. If no delimiter is passed to StringTokenizer, it defaults to " \t\n\r\f" (space, tab, newline, carriage return and form feed). context.write(new Text(itr.nextToken()), new IntWritable(1)) then writes each word and the count 1 into the context as a key/value pair. Note: in Hadoop programming, String becomes Text and Integer becomes IntWritable; these are classes Hadoop wraps itself, and they behave much like the original classes. So here the key is the word as a Text and the value is an IntWritable of 1.
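As a quick standalone check of that default behaviour, the following small sketch (an illustration only, not part of the job) tokenizes the first line of our sample file:
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // No delimiter given, so StringTokenizer splits on " \t\n\r\f"
        // (space, tab, newline, carriage return, form feed).
        StringTokenizer itr = new StringTokenizer("hello world hello java");
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken()); // prints hello, world, hello, java
        }
    }
}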
Reducer program:
/**
* Text: key type input from the Mapper (a word)
* IntWritable: value type input from the Mapper (a count of 1)
* Text: key type output by the Reducer (the word)
* IntWritable: value type output by the Reducer (the total count)
*/
private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable item : values) {
count += item.get();
}
context.write(key, new IntWritable(count));
}
}
Reduce completes step 4. Looking at the example above, the input to reduce is something like < Hello,[1,1,1]>, so the loop over values sums the three 1s and produces < Hello,3>.
Program entry:
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Create the configuration
Configuration configuration = new Configuration();
// Set jobName to WordCount
Job job = Job.getInstance(configuration, "WordCount");
// Set the jar
job.setJarByClass(WordCount.class);
// Set the Mapper class
job.setMapperClass(WordCountMapper.class);
// Set the Reducer class
job.setReducerClass(WordCountReducer.class);
// Set the output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Set the input/output paths
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Exit the program after the job is executed
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here we set the configuration, the classes, and the input/output paths that the MapReduce job needs. One point deserves a little extra attention:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
When we run hadoop jar WordCount.jar /input/wordcount/file1 /output/wcoutput, the last two arguments are the input path and the output path of the file. If our code changes the position of these parameters, or takes additional parameters, we have to make sure the args indices still match.
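A small guard at the top of main() (a sketch assuming the two-argument form above, not part of the original program) makes a wrong invocation fail fast instead of throwing an ArrayIndexOutOfBoundsException or writing to the wrong path:
// Sketch: check the command-line arguments before building the job.
if (args.length < 2) {
    System.err.println("Usage: hadoop jar WordCount.jar <input path> <output path>");
    System.exit(2);
}
// args[0] is the input path and args[1] is the output path, matching the
// FileInputFormat.addInputPath / FileOutputFormat.setOutputPath calls above.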
To package the jar in IDEA, select File -> Project Structure -> Artifacts -> + -> JAR -> From modules with dependencies.
For how to run the packaged jar, refer to the earlier article: Running the first Hadoop example, WordCount.
Things to note:
If you run the jar directly:
hadoop jar WordCount.jar /input/wordcount/file1 /output/wcoutput
it may fail with an exception like this:
Exception in thread "main" java.io.IOException: Mkdirs failed to create /XXX/XXX
at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:106)
at org.apache.hadoop.util.RunJar.main(RunJar.java:150)
If you see something like this, you need to delete the LICENSE file from the jar package (see the Stack Overflow link in the references). First list the license entries in the jar:
jar tvf xxx.jar | grep -i license
then delete META-INF/LICENSE from it:
zip -d xxx.jar META-INF/LICENSE
Conclusion:
1. Understand the steps of MapReduce: we only need to write the Map and Reduce parts, since the Hadoop framework handles the intermediate steps, and other programs can follow the same pattern in the future.
2. In Hadoop, use Text in place of String and IntWritable in place of Integer.
3. If you hit "Mkdirs failed to create /XXX/XXX", first check whether the path is wrong; if not, delete the META-INF/LICENSE file from the jar package.
References:
1. hadoop.apache.org/docs/stable…
2. stackoverflow.com/questions/1…