Abstract: Hadoop Streaming is built on the MapReduce framework and can be used to write applications that process large amounts of data.

This article, by Donglian Lin, is adapted from "Hadoop Streaming: Writing a Hadoop MapReduce Program Using Python."

With the advent of digital media, the Internet of Things, and more, the amount of digital data generated every day is growing exponentially. This situation creates challenges for the next generation of tools and technologies that must store and manipulate this data. This is where Hadoop Streaming comes in! Global data generation has grown steeply every year since 2013, and IDC estimates that by 2025 the amount of data generated each year will reach 180 zettabytes!

IBM says nearly 2.5 quintillion bytes (2.5 billion gigabytes) of data are created every day, with 90% of the world's data created in the last two years! Storing such a large amount of data is a challenging task. Hadoop can process large amounts of structured and unstructured data more efficiently than a traditional enterprise data warehouse, storing these huge data sets across distributed clusters of computers. Hadoop Streaming uses the MapReduce framework, which can be used to write applications to process large amounts of data.

Since the MapReduce framework is based on Java, you may be wondering how developers without Java experience can work with it. With Hadoop Streaming, developers can write mapper/reducer applications in their favorite language without much Java knowledge, instead of switching to other tools or technologies such as Pig and Hive.

1. What is Hadoop Streaming?

Hadoop Streaming is a utility that comes with the Hadoop distribution and can be used to run big data analysis programs. Hadoop Streaming jobs can be written in Python, Java, PHP, Scala, Perl, UNIX shell, and other languages. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:


$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc

Parameter description:

-input: HDFS directory (or file) that serves as input to the mapper
-output: HDFS directory where the reducer's output is written
-mapper: executable or script to run as the mapper
-reducer: executable or script to run as the reducer

Python MapReduce code:

mapper.py

#!/usr/bin/python
import sys

# Word Count Example
# input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()   # remove leading and trailing whitespace
    words = line.split()  # split the line into a list of words
    for word in words:
        # write the result to standard output (STDOUT);
        # Hadoop Streaming expects tab-separated key/value pairs
        print('%s\t%s' % (word, 1))  # emit the word with a count of 1
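As a quick sanity check (a usage example added for this walkthrough, not part of the original article), the mapper can be fed a line directly:

Command: echo "cat cat mouse" | python mapper.py

This should print:

cat	1
cat	1
mouse	1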

reducer.py

#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN (the sorted output of the mappers)
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently skip this line
        continue
    # comparing consecutive keys works only because Hadoop sorts
    # the mapper output by key before it reaches the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
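To see why comparing consecutive keys is enough, remember that Hadoop's shuffle phase delivers the mapper output to the reducer sorted by key. The following self-contained sketch simulates the whole map -> sort -> reduce flow in one Python script (an illustration written for this walkthrough, not part of the original code):

# simulate_wordcount.py - in-process sketch of map -> shuffle/sort -> reduce
lines = ["cat mouse lion", "deer tiger lion", "elephant lion deer"]

# Map phase: emit a (word, 1) pair for every word, as mapper.py does
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: Hadoop groups the mapper output by key
pairs.sort(key=lambda kv: kv[0])

# Reduce phase: sum consecutive counts for the same key, as reducer.py does
current_word, current_count = None, 0
for word, count in pairs:
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, count
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))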

Run:

1. Create a file with the following contents and name it word.txt.

cat mouse lion deer tiger lion elephant lion deer

2. Copy mapper.py and reducer.py to the same directory as this file.

3. Open the terminal and navigate to the directory where the files are stored.

ls: lists all the files in the directory
cd: changes the directory/folder

4. View the contents of the files.

Command: cat file_name

> Contents of mapper.py

Command: cat mapper.py

> Contents of reducer.py

Command: cat reducer.py

We can run the mapper and reducer on local files (for example, word.txt). To run the Map and Reduce steps on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming JAR. So before we run the scripts on HDFS, let's run them locally to make sure they work.

> Run the mapper

Command: cat word.txt | python mapper.py

> Run the reducer

Command: cat word.txt | python mapper.py | sort -k1,1 | python reducer.py
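Given the word.txt above, this pipeline should print each word with its total count, sorted alphabetically:

cat	1
deer	2
elephant	1
lion	3
mouse	1
tiger	1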

We can see that the mapper and reducer are working as expected, so we shouldn't face any further problems.

Run Python code on Hadoop

Before running MapReduce on Hadoop, copy the local data (word.txt) to HDFS.

> Syntax: hdfs dfs -put source_directory hadoop_destination_directory

Command: hdfs dfs -put /home/edureka/graphs/word.txt /user/edureka
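To confirm the upload, you can list the destination directory (a standard HDFS command, not shown in the original article):

Command: hdfs dfs -ls /user/edureka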

Find the Hadoop Streaming JAR in your terminal and copy its path. The path depends on the Hadoop version:

/usr/lib/hadoop-2.2.X/share/hadoop/tools/lib/hadoop-streaming-2.2.X.jar

Command:

ls /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar

Run a MapReduce job

Command:

hadoop jar /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-file /home/edureka/mapper.py -mapper mapper.py \
-file /home/edureka/reducer.py -reducer reducer.py \
-input /user/edureka/word -output /user/edureka/Wordcount
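A practical note (not from the original article): Hadoop executes these scripts on the cluster nodes, so the shebang line (#!/usr/bin/python) at the top of each script must point to a valid Python interpreter there, and the scripts may need to be marked executable first:

Command: chmod +x mapper.py reducer.py

Alternatively, the interpreter can be named explicitly, for example -mapper "python mapper.py".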

Hadoop provides a basic web interface for statistics and information. While the Hadoop cluster is running, open http://localhost:50070 in a browser to view the Hadoop web interface.

Now browse the file system and locate the generated Wordcount directory to see the output.

We can use the following command to see the output in the terminal:

Command: hadoop fs -cat /user/edureka/Wordcount/part-00000

You have now learned how to use Hadoop Streaming to execute MapReduce programs written in Python!
