Amazon Elastic MapReduce
- 11 May, 2009 09:00
- Comments
Have you got a few hundred gigabytes of data that need processing? Perhaps a dump of radio telescope data that could use some combing through by a squad of processors running Fourier transforms? Or maybe you're convinced some statistical analysis will reveal a pattern hidden in several years of stock market information? Unfortunately, you don't happen to have a grid of distributed processors to run your application, much less the time to construct a parallel processing infrastructure.
Well, cheer up: Amazon has added Elastic MapReduce to its growing list of cloud-based Web services. Currently in beta, Elastic MapReduce uses Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3) to implement a virtualized distributed processing system based on Apache Hadoop.
Hadoop's internal architecture is the MapReduce framework. The mechanics of MapReduce are well documented in a paper by J. Dean and S. Ghemawat [PDF], and a full treatment is beyond the scope of this article. Instead, I'll illustrate by example.
Suppose you have a set of 10 words and you want to count the number of times those words appear in a collection of e-books. Your input data is a set of key/value pairs, the value being a line of text from one of the books and the key being the concatenation of the book's name and the line's number. This set might comprise a few megabytes big -- or gigabytes. MapReduce doesn't much care about size.
You write a routine that reads this input, a pair at a time, and produces another key/value pair as output. The output key is a word (from the original set of 10) and the associated value is the number of times that word appears in the line. (Zero values are not emitted.) This routine is the map part of map/reduce. Its output is referred to as the intermediate key/value pairs.
The intermediate key/value pairs are fed to another function (another "step" in the parlance of MapReduce). For this step, you write a routine that iterates through the intermediate data, sums up the values, and returns a single pair whose key is the word and whose value is the grand total. You don't have to worry about grouping the results of like keys (i.e., gathering all the intermediate key/values for a given word), because Hadoop does that grouping for you in the background.
- Bookmark this page
- Share this article
- Got more on this story? Email Computerworld
- Follow Computerworld on twitter
- IBM agility@scale™: Become as Agile as You Can Be
- 10 Ways to Stretch your storage budgets in virtualised, consolidated environments
- Workshifting: How IT is Changing the Way Business is Done
- Synergy gains sustainable competitive edge with HP printers, services and solutions
- Control your Print Environment
-
CSIRO claims world's fastest wireless link
-
CeBit 2012: Social media a legal minefield
-
VOIP a wake-up call for global phone competition
-
CeBIT 2012: Will NBN speed up freight delivery times?
-
HTC announces Titan 4G
-
Teach Yourself Visually Windows 7
-
Windows 7 for Seniors for Dummies®
-
Computers for Seniors for Dummies, 2nd Edition
-
Microsoft Office
-
Office 2007 All-In-One Desk Reference for Dummies
-
MYOB Software for Dummies 6E Australian Edition
-
Excel 2007 All-In-One Desk Reference for Dummies
-
Windows 7 for Dummies®
-
Office 2007 for Dummies









Comments
Post new comment