Programming Assignment - Distributed Systems
CSCI/ECEN 5673, University of Colorado at Boulder, Spring 2011
Prof. Rick Han

Due Friday March 4 by 11:55 pm

The students in this class created the following code for their Hadoop programming assignments on Amazon's Elastic Compute Cloud.  We have placed all of the class's code at http://code.google.com/p/univ-colorado-distributed-systems-spring2011-amazon-cloud/.  Thanks to Amazon for funding these efforts.

--

The goal of this programming assignment is to give you experience with:

* Amazon Web Services, specifically the EC2 cloud and S3 storage

* the Hadoop open source framework

* breaking a task down into a parallel, distributed MapReduce model

You may form groups of two to complete this assignment.

Sign up for an account on Amazon Web Services.  Use your Amazon key code for $100 credit.  Redeem it at http://aws.amazon.com/awscredits.  Learn about EC2 at http://aws.amazon.com/ec2/.

Find a data set.  There are some publicly available data sets on Amazon at http://aws.amazon.com/datasets.  For example, there are a global weather data set at http://aws.amazon.com/datasets/2759 (20 GB), the 2000 Census, transportation and economics data, and Wikipedia page statistics (320 GB).  Wikipedia itself can be obtained at http://dumps.wikimedia.org/.  You don't have to use a complete data set; roughly 10-100 MB should be sufficient, though you're welcome to try gigabytes of data.

Install Hadoop on your Amazon EC2 instances (your own AMI).  Process this data set using the Hadoop framework on EC2.  See http://hadoop.apache.org/ for the source code.  You are not allowed to use Amazon's Elastic MapReduce service; you will have to install Hadoop on your own VM instances.  See an example of how to install Hadoop on AWS at http://www.cloudera.com/blog/2009/05/using-clouderas-hadoop-amis-to-process-ebs-datasets-on-ec2/.

Using Hadoop, calculate the most frequent data item in the data set for a given category, say temperature for the weather data set, or the most popular Wikipedia page.  Sort the results from most frequent item to least frequent item.  You should use at least 10 instances to perform the MapReduce calculations.  Feel free to vary the number of instances and the instance type to see how the performance varies.  Does performance scale linearly with the number of instances for a given instance type?
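To make the counting stage concrete, here is a minimal sketch written against the Hadoop 0.20 mapreduce API.  The class names are illustrative and it assumes one data item per input line; adapt the parsing in the mapper to whatever data set you choose.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ItemFrequency {

    // Emits (item, 1) for every input record.
    public static class FrequencyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder parsing: assumes each line holds exactly one data item
            // (e.g., a temperature reading or a page title).
            item.set(value.toString().trim());
            context.write(item, ONE);
        }
    }

    // Sums the 1s emitted for each item.
    public static class FrequencyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "item frequency");
        job.setJarByClass(ItemFrequency.class);
        job.setMapperClass(FrequencyMapper.class);
        job.setCombinerClass(FrequencyReducer.class); // safe because the reduce is a sum
        job.setReducerClass(FrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One straightforward way to get the required ordering is a second job that reads these (item, count) pairs, swaps them into (count, item), and lets Hadoop's shuffle sort by count; a sketch of that stage follows the tutorial links below.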

Here's a MapReduce example: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v2.0.  Also see http://code.google.com/edu/parallel/mapreduce-tutorial.html.
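For the sort-by-frequency step described above, here is one possible sketch of the second stage (again illustrative, not the only way to do it): a mapper that swaps each "item <tab> count" line from the counting job into (count, item), plus a comparator that reverses Hadoop's default ascending key order.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Mapper;

public class SortByFrequency {

    // Reads the "item <tab> count" lines produced by the counting job and swaps
    // them into (count, item) so the shuffle sorts by frequency.
    public static class SwapMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) {
                return; // skip malformed lines
            }
            context.write(new IntWritable(Integer.parseInt(fields[1])), new Text(fields[0]));
        }
    }

    // Reverses the default ascending key order so the most frequent item comes first.
    public static class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() {
            super(IntWritable.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((IntWritable) b).compareTo((IntWritable) a);
        }
    }
}

In the driver for this stage, you would set SwapMapper as the mapper, pass DescendingIntComparator to job.setSortComparatorClass(), set the map output key/value classes to IntWritable and Text, and run with a single reduce task so the final output is one file ordered from most to least frequent.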

Store the results persistently in Amazon S3 or another persistent data store.  Visualize the sorted results in a suitable way, so that it is clear the data items were properly sorted.
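If you write the final job's output directly to S3, Hadoop's native S3 ("s3n") filesystem lets the output path point straight at a bucket.  A minimal sketch, assuming the 0.20-era configuration property names (verify them against the Hadoop version you install); the bucket name and credentials are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3OutputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholders: substitute your own AWS credentials (or put them in core-site.xml).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

        Job job = new Job(conf, "write results to S3");
        job.setJarByClass(S3OutputExample.class);
        // ... set the mapper, reducer, key/value classes, and input path
        //     exactly as in the counting sketch ...

        // The output path can point straight at a bucket via the s3n filesystem.
        FileOutputFormat.setOutputPath(job, new Path("s3n://your-results-bucket/sorted-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Alternatively, write to HDFS first and copy the results into S3 afterwards (for example with hadoop distcp).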

Grading: upload a zip file of your source code.  We'll arrange grading meetings in early to mid March to go over the results of your Hadoop implementation on EC2.

Extra Credit: use MapReduce to compute a histogram of your data frequencies.  Binning should be helpful here.  You may need multiple stages of MapReduce, feeding the output of one stage into the input of the next, in order to sort/order the histogram.
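A sketch of the binning idea, assuming numeric readings with one per input line and fixed-width bins (the bin width and parsing are placeholders to adapt): the mapper below maps each reading to a bin label and emits (bin, 1), the summing reducer from the counting sketch can be reused unchanged, and a later stage can then order the bins.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps each numeric reading to a fixed-width bin label and emits (bin, 1).
// Bin width and parsing are placeholders; adapt them to your data set.
public class BinningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private static final double BIN_WIDTH = 5.0; // e.g., 5-degree temperature bins

    private final Text bin = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        double reading;
        try {
            reading = Double.parseDouble(value.toString().trim());
        } catch (NumberFormatException e) {
            return; // skip malformed records
        }
        long binIndex = (long) Math.floor(reading / BIN_WIDTH);
        double lo = binIndex * BIN_WIDTH;
        bin.set(String.format("[%.1f, %.1f)", lo, lo + BIN_WIDTH));
        context.write(bin, ONE);
    }
}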