Application Case 1: Find the total salary of each department

Problem analysis: MapReduce joins fall into three categories: Reduce Side Join, Map Side Join, and Semi Join.

  • Reduce Side Join: all join records are transferred during the Shuffle phase, so the large volume of network I/O makes it inefficient.
  • Map Side Join: useful when a large table is joined with one or more small tables.

Map Side Join application scenario: two tables are to be joined; one is very large and the other is small enough to fit entirely in memory. The small table is loaded into an in-memory hash table. For each key/value record of the large table, the hash table is probed for records with the same key; matching records are joined and output.
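The probing step described above can be sketched in plain Java, independent of Hadoop. The class name, field layout (`empName,deptId,salary`), and sample data below are illustrative assumptions, not part of the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {
    // Probe the in-memory hash table of the small table for each large-table record.
    // Records are assumed to be CSV lines of the form "empName,deptId,salary".
    public static List<String> join(Map<String, String> dept, List<String> emp) {
        List<String> out = new ArrayList<>();
        for (String record : emp) {
            String[] f = record.split(",");
            String deptName = dept.get(f[1]);   // hash lookup on the join key
            if (deptName != null) {             // join succeeds only when the key is present
                out.add(f[0] + "\t" + deptName + "\t" + f[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Small table (department id -> department name), held entirely in memory
        Map<String, String> dept = new HashMap<>();
        dept.put("d1", "Sales");
        dept.put("d2", "Research");
        // Large table, streamed record by record
        List<String> emp = Arrays.asList("Alice,d1,3000", "Bob,d2,2500", "Dave,d9,1000");
        join(dept, emp).forEach(System.out::println);  // "Dave" is dropped: d9 has no match
    }
}
```

In a real job the hash table would be built once per task from the cached small-table file, not hard-coded.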

To support this file replication, Hadoop provides the DistributedCache class, which is used as follows:

  • Users call the static method DistributedCache.addCacheFile() to specify which file to replicate; its parameter is the file's URI (for a file on HDFS, e.g. hdfs://jobtracker:50030/home/XXX/file). The JobTracker retrieves this list of URIs before the job starts and copies the corresponding files to each TaskTracker's local disk.
  • Users call DistributedCache.getLocalCacheFiles() to obtain the local file paths, then read the files with the standard file I/O API.
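The two calls above can be sketched as follows. This is a non-runnable fragment using the classic `org.apache.hadoop.filecache.DistributedCache` API (deprecated in newer Hadoop releases in favor of `Job.addCacheFile`); the HDFS path is the illustrative one from the text, not a real location:

```java
// Driver side: register the small file before submitting the job
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("hdfs://jobtracker:50030/home/XXX/file"), conf);

// Mapper side (typically in setup()): locate the local copy and read it normally
Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
```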

In the following code, the smaller table (the department table Dept) is cached in memory. In the Mapper phase, each employee's department number is mapped to the department name, which is emitted as the key. In the Reduce phase, the total salary of each department is computed by summing the salaries grouped under that key.
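The map and reduce steps just described can be simulated in plain Java (no Hadoop dependencies), which makes the data flow explicit. The class name, record format, and sample data are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeptSalarySketch {
    // Map phase: translate deptId -> deptName via the cached small table,
    // emitting (deptName, salary) pairs. Records are "empName,deptId,salary".
    public static List<Map.Entry<String, Integer>> map(Map<String, String> deptCache,
                                                       List<String> empRecords) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String rec : empRecords) {
            String[] f = rec.split(",");
            String deptName = deptCache.get(f[1]);
            if (deptName != null) {
                pairs.add(Map.entry(deptName, Integer.parseInt(f[2])));
            }
        }
        return pairs;
    }

    // Reduce phase: sum salaries grouped by department name (the key).
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, String> dept = Map.of("d1", "Sales", "d2", "Research");
        List<String> emp = List.of("Alice,d1,3000", "Bob,d2,2500", "Carol,d1,4000");
        System.out.println(reduce(map(dept, emp)));  // prints {Research=2500, Sales=7000}
    }
}
```

In the real job, the grouping done here by `reduce` is performed by Hadoop's shuffle, and each reducer simply sums the values it receives for one key.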

Processing flow chart:

The test code: