Preface

This article introduces Hadoop's built-in data types and then walks through WordCount, the introductory MapReduce example that counts the number of occurrences of each word in a text. Along the way you will learn how Hadoop's Writable types differ from Java's native types and how to get started with MapReduce.

I. Introduction to built-in data types

Java type   Hadoop Writable type
boolean     BooleanWritable
byte        ByteWritable
int         IntWritable
float       FloatWritable
long        LongWritable
double      DoubleWritable
String      Text (text stored in UTF-8 format)
Map         MapWritable
Array       ArrayWritable
null        NullWritable (used when the key or value is null)
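
The Writable types exist because Hadoop must serialize keys and values into a compact binary form when moving them between tasks, which plain Java types cannot do by themselves. The following minimal sketch (not part of the original article's example) shows the serialization round trip that every Writable supports:

import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize an IntWritable to a byte stream
        IntWritable out = new IntWritable(42);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buffer));

        // Deserialize it back from the bytes
        IntWritable in = new IntWritable();
        in.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(in.get()); // prints 42
    }
}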

Use of data types:

package xl;

import org.apache.hadoop.io.*;
import org.junit.Test;

public class HadoopDataTypeTest {

    @Test
    public void testText() {
        System.out.println("==== text ===");
        Text text = new Text("lei study hadoop");
        System.out.println(text.getLength());
        System.out.println(text.find("o"));
        System.out.println(text.toString());

        System.out.println("=======ArrayWritable======");
        ArrayWritable arrayWritable = new ArrayWritable(IntWritable.class);
        IntWritable year = new IntWritable(2021);
        IntWritable month = new IntWritable(4);
        IntWritable date = new IntWritable(24);
        arrayWritable.set(new IntWritable[]{year, month, date});
        System.out.println(arrayWritable.get()[0]);
        System.out.println(arrayWritable.get()[1]);
        System.out.println(arrayWritable.get()[2]);

        System.out.println("========== MapWritable =======");
        MapWritable mapWritable = new MapWritable();
        Text k1 = new Text("name");
        Text k2 = new Text("passwd");
        mapWritable.put(k1, new Text("lei"));
        mapWritable.put(k2, NullWritable.get());
        System.out.println(mapWritable.get(k1));
        System.out.println(mapWritable.get(k2));
    }
}

Output result:

==== text ===
16
13
lei study hadoop
=======ArrayWritable======
2021
4
24
========== MapWritable =======
lei
(null)

II. Introduction to WordCount

2.1 MapReduce programming specification

The program we write is divided into three parts: Mapper, Reducer, and Driver.

1. Mapper stage

  • A user-defined Mapper must inherit from the Mapper parent class provided by Hadoop
  • The input data of the Mapper is in the form of KV pairs
  • The business logic of the Mapper is written in the map() method
  • The output data of the Mapper is in the form of KV pairs
  • The map() method is called once for each <K,V> pair

2. Reducer stage

  • A user-defined Reducer must inherit from the Reducer parent class provided by Hadoop
  • The input data type of the Reducer corresponds to the output data type of the Mapper, which is also KV pairs
  • The business logic of the Reducer is written in the reduce() method
  • The ReduceTask process calls the reduce() method once for each group of <K,V> pairs that share the same key
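
Both parent classes are generic over their input and output KV types. As a reference, here is a simplified sketch of how the real classes in org.apache.hadoop.mapreduce declare these methods (bodies elided):

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // Called once for each input <K,V> pair
    protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException { /* ... */ }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // Called once for each group of values sharing the same key
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException { /* ... */ }
}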

3. Driver stage

The Driver is equivalent to a client of the YARN cluster: it submits the whole program to the YARN cluster. What is submitted is a Job object that encapsulates the run parameters of the MapReduce program.

2.2 Introduction to WordCount

Prepare two text files, file1.txt and file2.txt, and count the number of occurrences of each word in them.

The input files are first divided into splits; because each file here is small, each file becomes one split. Each split is then parsed line by line into <key, value> pairs, where the key is the byte offset of the line within the file, computed automatically by MapReduce, and includes the carriage-return/line-feed characters. For example, the offset of the line "hello lei" is 0 and the offset of "bye lei" is 12. The map method emits a <word, 1> pair for each word, and these key-value pairs are sent to Reduce. Reduce accumulates the values of the same word to obtain the final key-value pairs and writes them to a file.
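
To make the flow concrete, here is a hypothetical trace, assuming one split contains the two lines "hello lei" and "bye lei" and using the offsets given above:

map input:     <0, "hello lei">   <12, "bye lei">
map output:    <hello, 1>  <lei, 1>  <bye, 1>  <lei, 1>
after shuffle: <bye, [1]>  <hello, [1]>  <lei, [1, 1]>
reduce output: <bye, 1>  <hello, 1>  <lei, 2>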

2.3 Environment Preparation

1. Create a Maven project

2. Add the following dependencies to the pom.xml file:

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>

3. In the resources directory of the project, create a new file named "log4j2.xml" with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error" strict="true" name="XMLConfig">
    <Appenders>
        <!-- Console appender -->
        <Appender type="Console" name="STDOUT">
            <!-- PatternLayout, e.g. [INFO] [2018-01-22 17:34:01][org.test.Console]I'm here -->
            <Layout type="PatternLayout"
                    pattern="[%p] [%d{yyyy-MM-dd HH:mm:ss}][%c{10}]%m%n" />
        </Appender>
    </Appenders>
    <Loggers>
        <!-- Logger for the test package -->
        <Logger name="test" level="info" additivity="false">
            <AppenderRef ref="STDOUT" />
        </Logger>
        <!-- Root logger configuration -->
        <Root level="info">
            <AppenderRef ref="STDOUT" />
        </Root>
    </Loggers>
</Configuration>

2.4 Case practice

1. Write the Mapper class

package xl;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * A custom Mapper class must inherit from the Mapper class provided by Hadoop
 * and override the map() method. There are four generic types, i.e. two KV pairs.
 * For the current wordcount program:
 * KEYIN:    LongWritable, the offset at which data is read from the file,
 *           i.e. the position the current line starts at
 * VALUEIN:  Text, the line of data actually read from the file
 * KEYOUT:   Text, represents a word
 * VALUEOUT: IntWritable, represents one occurrence of the word
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text k = new Text();
    private IntWritable v = new IntWritable(1);

    /**
     * @param key     KEYIN, the key of the input data
     * @param value   VALUEIN, the value of the input data
     * @param context the context object used to write the output
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Get one line of input data and convert it to a Java String
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Write each word out as a <word, 1> KV pair
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

2. Write the Reducer class

package xl;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * A custom Reducer class must inherit from the Reducer class provided by Hadoop.
 * For the current wordcount program, the four generic types are:
 * Input KV type:
 *   KEYIN:    Text, a word written out by the map side
 *   VALUEIN:  IntWritable, one occurrence of that word
 * Output KV type:
 *   KEYOUT:   Text, a word
 *   VALUEOUT: IntWritable, the number of occurrences of that word
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int sum = 0;
    private IntWritable v = new IntWritable();

    /**
     * @param key     KEYIN, represents a word
     * @param values  an iterator over the counts recorded for the current word
     * @param context the context object used to write the output
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Accumulate the counts of the current word
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Write out the word and its total count
        v.set(sum);
        context.write(key, v);
    }
}

3. Write the Driver class

package xl;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Obtain configuration information and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2. Set the jar loading path
        job.setJarByClass(WordCountDriver.class);

        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4. Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:/cs/input"));
        FileOutputFormat.setOutputPath(job, new Path("D:/cs/output"));

        // 7. Submit the job and wait for it to complete
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

4. Local testing

Run the Driver class locally and check the results in the output directory (D:/cs/output); the word counts appear in files named like part-r-00000. Note that the output directory must not exist before the job runs, or the job will fail.

5. Cluster testing

Package the program as a JAR and submit it to the server. Execute command:

# jar file + main class name + input path + output path
hadoop jar wc.jar xl.WordCountDriver /wcinput /wcoutput
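
On the cluster the input path refers to HDFS, so the input files must be uploaded first. A sketch of the preparation steps, assuming file1.txt and file2.txt are in the current directory:

hadoop fs -mkdir /wcinput          # create the input directory in HDFS
hadoop fs -put file1.txt /wcinput  # upload the test files
hadoop fs -put file2.txt /wcinput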

III. A small problem

An error may occur when the MapReduce program is executed, as follows:

Task attempt_1562862697087_0005_m_000003_1001, Status: FAILED
[2019-07-12 00:51:52.484] Exception from container-launch.
Container id: container_1562862697087_0005_02_000011
Exit code: 127

[2019-07-12 00:51:52.490] Container exited with a non-zero exit code 127. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory

The message is easy to understand: /bin/java cannot be executed. Running /bin/java in a shell terminal fails in the same way.

The fix is to create a soft link at /bin/java that points to the real java executable under the JDK directory. My JDK directory is /opt/sofeware/java8/, so I run the following in the shell (note that the link must point to the java binary inside the JDK's bin directory, not to the directory itself):

ln -s /opt/sofeware/java8/bin/java /bin/java

After that, executing /bin/java works. If Hadoop runs as a cluster, the soft link must be created on every machine. MapReduce jobs will then run without this error.