Skip Links

Blog

Posts tagged with "mapreduce".

Managing large datasets using AWS Elastic MapReduce

Sid

Sid

20 Jul 2011 14:02

Recently we’ve seen a growing number requests from clients and potential clients to help tame their data mountain to get new insights and also report on day-to-day business activity.

We’ve used Infobright and latterly Hadoop to achieve this and have been able to serve up some good results quickly using both tools.

One slight headache has been the learning curve in setting up and configuring Hadoop. We’ve made our own Hadoop AMIs which we can spin up on Amazon Web Services but there’s still a little bit of work to do in manually setting up job configurations, spinning
up servers etc.

Amazon provide their own managed Hadoop service, Elastic MapReduce which takes away a fair bit of the pain for you although at a slightly increased cost. We got out our calculators and figured that the time saved was worth the money as it meant we would get to the business issues earlier.

This post is a brief intro to setting up an example Elastic MapReduce job based on the UFO sightings dataset distributed by Infochimps (we’re using the TSV file if you want to reproduce).

There’s a perfectly good
and well-written intro
on the AWS site so if you’re just
getting started then best read that first as this post dives in to an example.

The UFO dataset is our data “Hello World!” and we already have several Hadoop Streaming tasks that manipulate this.

In true example style, the one here is the simplest – it takes the sightings since 1900 and counts them by year.

There are four steps to running the job on AWS:

  1. Upload your data to S3: the input file, any mappers or reducers
  2. Create a new job flow (using the Elastic MapReduce CLI – see instructions in the AWS guide above)
  3. Check the job has run ok and is terminated (otherwise you’ll be charged as it runs)
  4. Download your data for further processing

Upload data

This speaks for itself. There are only a couple of things to note really. First, make sure that your output folder doesn’t already exist in S3 otherwise Hadoop will fail. Secondly, you can have different buckets in S3 for any of the items. The job flow script specifies where these all live.

Below are snippets from the map and reduce jobs.

Mapper – map_yyyy.rb


POS_SIGHTING_YEAR=0
CUTOFF_YEAR = 1900


STDIN.each_line do |line|
tokens=line.split("\t")  sighting_year=tokens[POS_SIGHTING_YEAR].slice(0,4)
puts "#{sighting_year}" if sighting_year.to_i > CUTOFF_YEAR
end

Reducer – reduce_yyyy.rb

last_year,count=nil,0
puts "year,count"
STDIN.each_line do |year|
year.chomp!
if last_year && last_year != year
  puts "#{last_year},#count}"
  count = 1			
else				
  count+=1			
end
last_year = year  
end
puts "#{last_year},#{count}"

Create a new job flow

This is done via the AWS Elastic MapReduce Command Line Interface (CLI).

Instructions on how to install and use are in the link above but the syntax is pretty self-explanatory:

ruby elastic-mapreduce --create --name "UFO sightings by year" --stream --mapper s3://<path to your bucket>/map_yyyy.rb --reducer s3://<path to your bucket>/reduce_yyyy.rb --input s3://<path to your bucket>/ufo_awesome.tsv --output s3://<path to your bucket>/output

The key points to note are the fact it’s a streaming Hadoop job, and the fact that the output location can be called whatever you want – just make sure it doesn’t already exist.

After running this, it will create a new Elastic MapReduce job on an EC2 instance (by default a small one), run the job using the mapper and reducer specified and store the output in your S3 bucket.

You can track the progress of the job (and terminate if necessary) in the AWS Management Console on the Elastic MapReduce tab.

Download the data!

All that’s left to do is download the data from the location you specified and do any additional processing on it. You’ll notice there’s also a folder in your S3 bucket with job statistics which can be useful when tuning a job.

Here’s a slightly more complex visualisation of the data where we show sightings by US state, year, and month

Happy mapping (and reducing)!

Tagged in: data visualisation, hadoop, mapreduce, aws, infobright, ufo