I. Introduction of MR framework
1. Mr Refers to MapReduce, which is first a programming idea and second a distributed computing framework. Spark and Flink urban distribution computing framework facilitate the development of our data operations into a distributed computing program. M refers to Map. The function of Map is to read original data and Map it into key-value. R refers to reduce. The main task of mapTask is (1)** reads data **, which reads a row by default. (2) For each row read, ** calls user-defined Mapper's map method for grouping. ** (3) Collects key-value pairs returned by map methods. In the framework of data partitioning, the main implementation of the program in the reduce stage is reduceTask. The main task of reduceTask is (1) to pull data from the storage results returned by Map, ** Pull data from the partition ** Merge (2)** Call the user's customized Reduce method for aggregation ** operation (3) Collect the key value returned by reduce method and write it to the target storage file (HDFS by default)Copy the code
2. A Simple MapReduce program (Word Count)
1. To customize a Mapper classCopy the code
/** * LongWritable specifies the type of key returned by calling mapTask. * Text specifies the type of value returned by maptak. The default is Text. * IntWritable indicates the number of words to be counted, Public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> { @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split(" "); for (String word : words) { context.write(new Text(word),new IntWritable(1)); }}}Copy the code
2. Customize a Reducer classCopy the code
/** * Text indicates the type of the key returned by mapTask. * IntWritable indicates the type of the value returned by mapTask. * Textreducetask Indicates the type of the key to be output IntWritable indicates the type of the value to be returned */ public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> { @Override protected void reduce(Text key, Iterable<IntWritable> iter, Context context) throws IOException, InterruptedException { int count = 0; for (IntWritable v : iter) { count += v.get(); } // Iterator<IntWritable> iterator = iter.iterator(); // while (iterator.hasNext()){ // IntWritable v = iterator.next(); // count += v.get(); // } context.write(key,new IntWritable(count)); }}Copy the code
3. Define an MRAppMaster class to start the jobCopy the code
public class MrAppMaster { public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { Configuration conf = new Configuration(); Job job = job. GetInstance (conf); // Create a job. Job.setmapperclass (wcmapper.class); // Specify the map and reduce classes to run; job.setReducerClass(WcReducer.class); // Specify the map output key value type and reduce output key value type job.setMapOutputKeyClass(text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); / / set the input and output data Path FileInputFormat setInputPaths (job, new Path (" E: \ \ test \ \ b.t xt ")); FileOutputFormat.setOutputPath(job,new Path("E:\\test\\out")); TextOutputFormat Job.setinputFormatClass (textinputFormat.class); TextOutputFormat Job.setinputFormat.class; job.setOutputFormatClass(TextOutputFormat.class); // Start job job.waitForcompletion (true); }}Copy the code
3. Introduction of some concepts to be used (for the following)
Resourcemanager provides the following functions: 1. Processes client request 2. Monitor NodeManager 3. Start or monitor Application 4. Resource allocation and scheduling NodeManager 1. Manage resources of a node 2. The concept of request containers for processing ResourceManager and MRAppMaster is as follows: NodeManager divides the resources on the server into N virtual containers, which is equivalent to N local VMSCopy the code
4. How YRAN works
Mapreduce is controlled by YARN. Learn about the working mechanism of YARN 1. Start the client 2. Apply to Resourcemanager for running an appliction 3. Resourcemanager returns a jobid and a resource path 4. Job. jar------ The client traverses the resource path and slices the resource path. If a file is larger than 128 x 1.1, the file is sliced into one piece (each piece is 128 MB in size). Check whether the remaining files are larger than 128*1.1, obtain a FileSplit object, put it in the ArrayList, convert it into an array, and serialize it into a job. Split job. XML ------ serialize the job object, read the Configuration, and output it as job. XML job.jar------ If the program is running on a cluster, output the jar package to the resource path to form job.jar 5. The client applies for an MRAppMaster container (1.5 gb +1core) but the unit is 1 gb. 6.RM(resourcemanager) checks the idle container task queue and allocates tasks to create the MRAppMaster container (2 gb +1core) 7. The idle container will be heartbeat detected, collect the task and create MRAppMaster container, copy the files under the resource path to its own container working directory 8. Client sends shell instruction to MRAppMaster, java-cp.. MrAppMaster package name to create a process, the start MRAppMater container 9. MrAppMaster according to the job. The split all calculate need more MapTask container and want to apply RM 10. The RM check free container task queue, assigned tasks, 11.MRAppMaster sends shell instructions to the container where MapTask resides and starts yarnChild process 12. After the task is completed, nodeManager releases resources, reports to Resourcemanager that the container is idle, and reports to MRAppMaster that the task is complete. MR Reports to the client that the task is complete, and to RM that the task is complete. Idle NodeManager periodically sends heartbeat checks to Resourcemanger 1. Prove oneself alive 2. Report oneself usage, how many free containers 3. Pick up your own tasksCopy the code
There are incorrect places welcome to point out, thank you!