The overall process of developing a Hadoop job is as follows:
- Install Hadoop on your development machine (personal or lab computer)
- Compile the Hadoop job, create a JAR file
- Run the Hadoop job JAR file on your development machine, for testing and debugging
1. Installing Hadoop
This section shows you how to download Hadoop and prepare it for use on a Mac. Note: Hadoop versions after 0.19.2 require Java 1.6. The following instructions take this into account.
- Obtain the latest stable Hadoop release. The file is named hadoop-version.tar.gz and can be obtained here. Unzip the downloaded file and place the resulting folder on your Desktop (or other location).
- To make Hadoop run on a Mac, you will need to edit two files. First, in your favorite text editor, open the file conf/hadoop-env.sh inside the Hadoop folder you just unzipped. Find the following line in the file:
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun
and change it to:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
Save the file. Second, in your favorite text editor, open the file bin/hadoop inside the Hadoop folder. Search the file for the following line:
JAVA=$JAVA_HOME/bin/java
and change it to:
JAVA=$JAVA_HOME/Commands/java
Save the file and exit the editor. You have now set up Hadoop for development purposes on your computer.
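The two edits above can also be applied from the Terminal. The following is only a minimal sketch and not part of the original instructions: patch_hadoop_for_mac is a hypothetical helper name, and the sketch assumes the two lines appear in your Hadoop release as quoted above. It keeps a .orig backup of each file so the change can be undone.

```shell
# Hypothetical helper: apply the two Mac-specific edits to an unpacked Hadoop folder.
# Argument 1: path to the Hadoop folder (for example the one on your Desktop).
patch_hadoop_for_mac() {
  hadoop_dir=$1
  env_file="$hadoop_dir/conf/hadoop-env.sh"
  launcher="$hadoop_dir/bin/hadoop"

  # Keep backups of the original files
  cp "$env_file" "$env_file.orig" || return 1
  cp "$launcher" "$launcher.orig" || return 1

  # Uncomment JAVA_HOME and point it at Apple's Java 1.6 framework
  sed 's|^# *export JAVA_HOME=.*|export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/|' \
    "$env_file.orig" > "$env_file" || return 1

  # Apple's framework keeps the java binary under Commands/ rather than bin/
  sed 's|^\( *\)JAVA=\$JAVA_HOME/bin/java|\1JAVA=$JAVA_HOME/Commands/java|' \
    "$launcher.orig" > "$launcher"
}
```

For example, patch_hadoop_for_mac ~/Desktop/hadoop-0.19.2 would patch a copy unpacked on the Desktop; adjust the folder name to match your release.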
2. Compiling a Hadoop job into a JAR file
This section guides you through compiling the WordCount example available in the Hadoop Map-Reduce Tutorial. This section assumes you are using the Eclipse IDE. If this is not the case, you should be able to adapt these instructions for your IDE.
- Create a new Java Project.
Launch Eclipse, and from the File menu select New, then use the wizard to create a new Java Project. Enter a project name, in this example WordCount. Make sure that the selected JRE is version 1.6.0. Click Finish.
- Add the Hadoop library to the project
In Eclipse, right-click (control-click) on your project, go to Build Path, then Add External Archives. Browse to the Hadoop folder on your Desktop, select the file hadoop-version-core.jar, and click Open.
- Add source code file
From the File menu, select New, then File. Select the parent folder WordCount/src (make sure this is right, or you will run into trouble when exporting the JAR file below) and name the new file WordCount.java. Click Finish. Copy this code, paste it into the new file, and save it. Eclipse will compile the file as soon as you save it.
- Export JAR file
From the File menu, select Export. Under Java, select JAR file and click Next. Select all resources to be exported; in this case, select the entire WordCount project. Make sure the export classes checkbox is checked. Select an export destination for your JAR file: you can use your Desktop or some other directory. For simplicity, name the file WordCount.jar and export it to your Desktop.
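If you are not using Eclipse, the same JAR file can be built from the command line. The sketch below is one possible way to do it, assuming javac and jar from a Java 1.6 JDK are on your PATH. build_wordcount_jar and the wordcount_classes folder are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: compile WordCount.java against the Hadoop core library
# and package the resulting classes into a JAR file.
# Arguments: 1 = path to hadoop-version-core.jar
#            2 = path to WordCount.java
#            3 = path of the JAR file to create
build_wordcount_jar() {
  core_jar=$1
  source_file=$2
  jar_file=$3
  classes_dir=wordcount_classes   # scratch folder for .class files (name is an assumption)

  mkdir -p "$classes_dir" || return 1

  # Compile with the Hadoop library on the classpath
  javac -classpath "$core_jar" -d "$classes_dir" "$source_file" || return 1

  # Package everything under classes_dir into the JAR
  jar cf "$jar_file" -C "$classes_dir" .
}
```

You would call it with the path to the hadoop-version-core.jar file from your Hadoop folder, the path to WordCount.java, and the destination, for example ~/Desktop/WordCount.jar.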
3. Running a Hadoop job on your development machine
This section shows you how to run your job on your own machine, for testing purposes. Hadoop will run in "standalone mode", which means that it runs within a single process and does not take advantage of any parallel processing. This will be much slower than running on the cluster, so you may want to reduce the size of the data set used for testing.
- Create or obtain test data
For this example, the input data will be this web page. Copy this entire web page and, using your favorite text editor, save it as a plain text file named testing.txt. Place this file in a folder called input on your Desktop.
- Run the job
First, go to the command line. (To access the command line, go to the Finder, then to "Applications", then "Utilities", and finally launch "Terminal".) If you are not familiar with the UNIX command line, here is a basic guide. Change into your Hadoop directory, ~/Desktop/hadoop-0.19.2 or similar. Execute the following command:
./bin/hadoop jar ~/Desktop/WordCount.jar WordCount ~/Desktop/input ~/Desktop/output
You may need to alter the paths if any of the files were saved to different places.
- Retrieve the results
The results have been written to a new folder called output on your Desktop. There should be one file, named part-00000, which lists all the words on this web page along with their occurrence counts. Note that before running Hadoop again you will need to delete the entire output folder, since Hadoop will not do this for you.
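Because the output folder has to be removed before every run, it can be convenient to wrap the steps above in a small shell function. The sketch below is only an illustration: run_wordcount and local_wordcount are hypothetical helper names, and run_wordcount must be called from inside your Hadoop directory. The second function is a rough cross-check that uses standard UNIX tools, assuming the tutorial's WordCount splits its input on whitespace and writes one word and count per line, separated by a tab.

```shell
# Hypothetical helper: delete the old output folder, run the job, print the result.
# Arguments: 1 = JAR file, 2 = input folder, 3 = output folder
run_wordcount() {
  jar_file=$1
  input_dir=$2
  output_dir=$3

  # Refuse to continue without an output folder, so rm -rf never sees an empty path
  [ -n "$output_dir" ] || return 1

  # Hadoop will not overwrite an existing output folder, so remove it first
  rm -rf "$output_dir"

  ./bin/hadoop jar "$jar_file" WordCount "$input_dir" "$output_dir" || return 1

  # Standalone mode writes a single result file
  cat "$output_dir/part-00000"
}

# Hypothetical cross-check: count whitespace-separated words without Hadoop.
# Argument 1 = input folder. Prints "word<TAB>count", sorted in byte order.
local_wordcount() {
  cat "$1"/* | tr -s '[:space:]' '\n' | grep -v '^$' \
    | LC_ALL=C sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

For example, after changing into your Hadoop directory, run_wordcount ~/Desktop/WordCount.jar ~/Desktop/input ~/Desktop/output repeats the run shown above, and local_wordcount ~/Desktop/input prints counts you can compare against part-00000.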