This post continues the introduction of the Sina Weibo TF-IDF algorithm from Hadoop 2.5.2 learning 13 - MR.
In the previous post, the first MapReduce job was implemented: it counts the term frequency TF and the total number of microblogs N. In this post we count DF, the number of microblogs in which each term appears. To obtain DF, we only need to count the keys in the first MapReduce's output.
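For example (hypothetical values): if the first job's output contains the two records

soymilk_3823890201582094	2
soymilk_3823890210294392	1

then DF(soymilk) = 2, because the term appears in two microblogs, regardless of its TF inside each one.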
The second job mainly reads the four output files of the first MapReduce and picks out the three that contain TF data.
It does this by obtaining the current input split as a FileSplit:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
In MapReduce, the raw data is divided into splits that serve as the input to the map tasks: each map task has exactly one split, and each split belongs to one file.
So by filtering on the split's file name and skipping part-r-00003, the output file that holds the total microblog count, we read only the other three files.
package com.chb.weibo2;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Count DF.
 * Input data: key = w + "_" + id, value = TF
 */
public class SecondMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        // The output of the first MR is divided into four reduce files. In the custom
        // partitioner, the last partition (part-r-00003) holds the total microblog
        // count, so skip it and process only the three TF files.
        if (!fileSplit.getPath().getName().equals("part-r-00003")) {
            if (key.toString().split("_").length == 2) {
                String w = key.toString().split("_")[0];
                // Each w_id key is unique, so emitting (w, 1) once per key sums to DF
                context.write(new Text(w), new IntWritable(1));
            }
        }
    }
}
In reduce, we simply sum the counts:
package com.chb.weibo2;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SecondReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Execution code (KeyValueTextInputFormat splits each input line at the first tab, so the mapper receives w_id as its key and the TF count as its value):
package com.chb.weibo2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondRunJob {
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "chb");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance();
        job.setJobName("SecondRunJob");
        job.setJar("C:\\Users\\12285\\Desktop\\weibo.jar");
        job.setJarByClass(SecondRunJob.class);

        job.setMapperClass(SecondMapper.class);
        job.setReducerClass(SecondReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        Path in = new Path("/user/chb/output/outweibo1");
        FileInputFormat.addInputPath(job, in);
        Path out = new Path("/user/chb/output/outweibo2");
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);

        boolean f = job.waitForCompletion(true);
        if (f) {
            System.out.println("Second job executed.");
        }
    }
}
DF statistics: the DF of each term, i.e. the number of microblogs it appears in.
Some	1
excited	2
trading	1
trading day	2
to	1
also	1
products	4
enjoy	14
enjoy	6
enjoy	7
Jingdong	1
bright	1
appearance	7
relatives	10
parent-child	2
intimate	1
3 The third MapReduce calculates the weight of terms
The first MapReduce computed TF and N, and the second computed DF. The third MapReduce combines all three to compute each term's weight in each microblog, i.e. its TF-IDF value.
Of the three datasets, the TF data is by far the largest: if it reaches the TB level with 1 GB per split, thousands of map tasks are needed, and having each of them load the N and DF data separately would be very resource-intensive. Fortunately, both are small: N is a single record, and since a vocabulary contains only so many common words, the DF table isn't big either. So we load these small tables into memory via the distributed cache:
// Load the small tables into memory: the total microblog count
job.addCacheFile(new Path("/user/chb/output/outweibo1/part-r-00003").toUri());
// DF
job.addCacheFile(new Path("/user/chb/output/outweibo2/part-r-00000").toUri());
They are read in the mapper's setup method, which runs only once per map task.
// Use the distributed cache to read the total microblog count and the DF
// statistics into memory, stored in maps.
HashMap<String, Integer> countMap = null;
HashMap<String, Integer> dfMap = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Get the URIs of the cached files from the context
    URI[] uris = context.getCacheFiles();
    if (uris != null) {
        for (URI uri : uris) {
            BufferedReader br = new BufferedReader(new FileReader(uri.getPath()));
            if (uri.getPath().endsWith("part-r-00003")) {
                // The total microblog count: a single "count \t N" line
                countMap = new HashMap<String, Integer>();
                String line = br.readLine();
                if (line.startsWith("count")) {
                    countMap.put("count", Integer.parseInt(line.split("\t")[1]));
                }
            } else if (uri.getPath().endsWith("part-r-00000")) {
                // The DF file: one "word \t count" line per term
                dfMap = new HashMap<String, Integer>();
                String line = null;
                while ((line = br.readLine()) != null) {
                    String word = line.split("\t")[0];
                    String count = line.split("\t")[1];
                    dfMap.put(word, Integer.parseInt(count));
                }
            }
            br.close();
        }
    }
}
There is a problem with the above code: it keeps reporting that /user/chb/output/outweibo1/part-r-00003 does not exist. Pulling the path out into a separate line and opening the file by its name alone fixed it. I didn't understand why at the time; most likely, Hadoop localizes cached files into the task's working directory under their file names, so a plain FileReader can open the localized copy by name but not by the original HDFS path.
// Fix: build a Path first, then open the localized cache file by its name only
if (uri.getPath().endsWith("part-r-00003")) {
    Path path = new Path(uri.getPath());
    BufferedReader br = new BufferedReader(new FileReader(path.getName()));
    countMap = new HashMap<String, Integer>();
    String line = br.readLine();
    if (line.startsWith("count")) {
        countMap.put("count", Integer.parseInt(line.split("\t")[1]));
    }
    br.close();
}
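The post doesn't show the map() method of LastMapper itself, so here is a minimal sketch of what it plausibly does, assuming the standard TF-IDF formula weight = TF × ln(N / DF), the w_id → TF output format of the first job, and the countMap/dfMap fields filled in by the setup() above; treat it as an illustration, not the original code:

// Sketch only (not the original source): the map() of LastMapper,
// assuming countMap and dfMap were filled in setup() as shown above.
@Override
protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    // Skip part-r-00003 again: it holds N, not TF data.
    if (!fileSplit.getPath().getName().equals("part-r-00003")) {
        String[] parts = key.toString().split("_");
        if (parts.length == 2) {
            String w = parts[0];
            String id = parts[1];
            int tf = Integer.parseInt(value.toString());       // TF from the first job
            int n = countMap.get("count");                     // total number of microblogs
            int df = dfMap.containsKey(w) ? dfMap.get(w) : 1;  // DF from the second job
            double weight = tf * Math.log((double) n / df);    // TF-IDF weight
            // Key the output by microblog id so the reducer can
            // collect all term weights of one microblog on one line
            context.write(new Text(id), new Text(w + ":" + weight));
        }
    }
}

Keying the output by microblog id matches the sample results at the end of this post, where each line starts with an id followed by term:weight pairs.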
Reduce
The reduce side is simple: for each microblog, it just concatenates the weight of every term and writes them out on one line:
package com.chb.weibo3;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LastReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text text : values) {
            sb.append(text.toString() + "\t");
        }
        context.write(key, new Text(sb.toString()));
    }
}
Execution code:
package com.chb.weibo3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LastRunJob {
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "chb");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance();
        job.setJar("C:\\Users\\12285\\Desktop\\weibo3.jar");
        job.setJarByClass(LastRunJob.class);
        job.setJobName("LastRunJob");

        job.setMapperClass(LastMapper.class);
        job.setReducerClass(LastReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Load the small tables into memory: the total microblog count
        job.addCacheFile(new Path("hdfs://TEST:9000/user/chb/output/outweibo1/part-r-00003").toUri());
        // DF
        job.addCacheFile(new Path("hdfs://TEST:9000/user/chb/output/outweibo2/part-r-00000").toUri());

        FileInputFormat.addInputPath(job, new Path("hdfs://TEST:9000/user/chb/output/outweibo1/"));
        Path out = new Path("hdfs://TEST:9000/user/chb/output/outweibo3/");
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);

        boolean f = job.waitForCompletion(true);
        if (f) {
            System.out.println("Last MapReduce execution completed");
        }
    }
}
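To spot-check the output (assuming the hdfs client is configured for this cluster and the job ran with the default single reducer):

hdfs dfs -cat /user/chb/output/outweibo3/part-r-00000 | head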
The results are as follows:
3823890201582094	ME:7.783640596221253 Delicious:12.553286978683289 more:7.65728279297819 Sleep:8.713417653379183 Soybean milk:13.183347464017316 Morning:8.525359754082631 want to about:10.722584331418851 drink:10.722584331418851 can:10.525380377809771 cook:10.525380377809771 About:7.16703787691222 rice cooker:11.434055402812444 :2.1972245773362196 on:14.550344638905543 good:13.328818040700815 After:8.37930948405285 automatic:12.108878692538742 Today:5.780743515792329 Churros:9.54136924893133 Rice:11.166992617563398 Soymilk machine:4.1588830833596715 one hour:10.927663610051221 several hours:13.130529940070723 natural:13.130529940070723 also:8.580918882296782 Let:7.824046010856292 :13.183347464017316 get up:9.436997742590188
3823890210294392	about:3.58351893845611 I:3.8918202981106265 Soybean milk:4.394449154672439 today:5.780743515792329 :4.394449154672439 churros:9.54136924893133
3823890235477306	For a while:15.327754517406941 to:5.991464547107982 Son:9.911654115202522 approx.:3.58351893845611 Zoo:12.553286978683289 From:9.043577154098081
3823890239358658	Continue:11.166992617563398 Support:7.522400231387125
3823890256464940	Times:13.130529940070723 up:7.052721049232323 about:3.58351893845611 meal:11.166992617563398 to:5.991464547107982
3823890264861035	About:3.58351893845611 I:3.8918202981106265 :4.394449154672439 eat:9.326878188224134 Oh:7.221835825288449
3823890281649563	And family:11.166992617563398 together:6.089044875446846 eat:12.108878692538742 meet:8.788898309344878 meal:11.166992617563398
3823890285529671	:4.394449154672439 Today:5.780743515792329 square:12.553286978683289 Roller skating:15.327754517406941 Together:6.089044875446846 about:3.58351893845611
3823890294242412	Joyang:2.1972245773362196 I:3.8918202981106265 Global:12.553286978683289 Breakfast:6.516193076042964 Double:5.780743515792329 You:6.591673732008658 together:6.089044875446846 First:6.516193076042964 Reservation:5.545177444479562 la:6.8679744089702925 coming:11.434055402812444 eat:6.8679744089702925 Offer:10.187500401613525 Soymilk machine:4.1588830833596715
3823890314914825	together:6.089044875446846 about:3.58351893845611 Go to:5.991464547107982 shopping:10.047761041692553 sisters:11.744235578950832 today:5.780743515792329 weather fine:13.94146015628705 From:9.043577154098081 to:7.700295203420117
3823890323625419	Mail:11.166992617563398 Nationwide:12.108878692538742 Jyl-:15.327754517406941 :9.54136924893133 joyoung sun:9.656627474604603 :2.1972245773362196
3823890335901756	Of:2.1972245773362196 this year:12.553286978683289 warm:12.553286978683289 decisive:15.327754517406941 out:11.434055402812444 Shopping:10.047761041692553 one day:9.780698256443507 most:8.37930948405285 Today is:10.722584331418851
3823890364788305	Out:11.166992617563398 come:8.15507488781144 go:13.94146015628705 go:5.991464547107982 Friends:12.108878692538742 For flowers:11.744235578950832 Outing:9.780698256443507 About:3.58351893845611 Together:6.089044875446846 Spring:7.16703787691222
3823890369489295	Let:7.824046010856292 practice:26.261059880141445 download:13.130529940070723 Joyang:2.1972245773362196 bar:6.664409020350408 I:3.8918202981106265 Nine Yin Zhenjing:13.94146015628705 free:12.553286978683289 Hang:15.327754517406941 bar:15.327754517406941 Pinghu:15.327754517406941 misfire:11.434055402812444 True:10.927663610051221 boy:15.327754517406941 open:10.525380377809771 You:6.591673732008658 Trigeminal nerve:15.327754517406941 in:6.802394763324311 destroyed:15.327754517406941 changed:15.327754517406941
3823890373686361	together:6.089044875446846 about:3.58351893845611 haircut:12.553286978683289 :4.394449154672439 partner:9.436997742590188 Go to:5.991464547107982
3823890378201539	eat:6.8679744089702925 very:13.130529940070723 ah:8.1886891244442 today:5.780743515792329 Sister:11.744235578950832 happy:8.788898309344878 to:5.991464547107982 play:8.439015410352214 Weekend:8.317766166719343 Shopping:10.047761041692553 :4.394449154672439 about:3.58351893845611 food:8.954673628956414