Of course, 20 years later those numbers look hilariously small when I'm carrying around an iPhone 6 Plus with a 64-bit, 1.1 GHz processor (11x faster than one SPARCStation) and 1 GB of RAM on-chip. But back in the day, the computing power Pixar was using was astounding.
Also in that time frame were many other experiments. SETI@home, for example, installed a small client program on each participant's PC that would download data sets and do calculations when the PC had been idle for a certain length of time. The idea was to leverage the fact that many people leave their computers on 24x7, yet the machines sit idle for a large percentage of that time.
Along comes a Google
In 2004, Google employees Jeffrey Dean and Sanjay Ghemawat published their paper, "MapReduce: Simplified Data Processing on Large Clusters," which detailed how Google was able to process such massive amounts of data. They didn't invent the two algorithms, Map and Reduce, but they put them to work in a new way along with a revolutionary idea: the data is much larger than the code, so instead of storing the data on a big hard drive and then moving it to the machine that has the code, break the data into smaller chunks and store the chunks on the same machines that will do the computation.

Although Google didn't release any code with the paper, they gave enough hints that some smart individuals at Yahoo could recreate their success. I'm a little hazy on the exact course of events, and the timeline is only marginally relevant to the rest of the discussion. Suffice it to say that eventually the project went to the Apache Software Foundation and was named Hadoop, after a toy stuffed elephant that belonged to the son of one of the developers. That's also why Hadoop's mascot is an elephant.
Map? Reduce? MapReduce!
The MapReduce algorithm as implemented by Hadoop leverages the Map and Reduce algorithms from functional programming (a style of programming modeled after mathematical functions) on a cluster of computers called "compute nodes."

- Map: The map algorithm "maps" some function onto each element of an input list. For example, say you have a list of 100 employees that you need to give a 3% raise. A map function could be used on that list to apply the raise to every employee.
- Reduce: The reduce algorithm boils a list down to some kind of summary value. Continuing with our list of 100 employees, this time the list identifies all workers who put in overtime last week. A reduce could be used to sum up all the overtime hours for an end-of-week report (see the sketch after this list).
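To make the two building blocks concrete, here's a minimal sketch using Python's built-in map and functools.reduce. The employee records and numbers are invented for illustration; this is plain functional programming, not Hadoop's API.

```python
from functools import reduce

# Hypothetical employee records: (name, hourly_rate, overtime_hours)
employees = [
    ("Alice", 100.00, 4),
    ("Bob", 80.00, 0),
    ("Carol", 60.00, 6),
]

# Map: apply a 3% raise to every employee's hourly rate.
raised = list(map(lambda e: (e[0], round(e[1] * 1.03, 2), e[2]), employees))

# Reduce: sum all the overtime hours for the end-of-week report.
total_overtime = reduce(lambda total, e: total + e[2], employees, 0)

print(raised)          # each record now carries the raised rate
print(total_overtime)  # 10
```

The key property MapReduce exploits is that the map step touches each element independently, so the work can be split across as many machines as you like; only the reduce step needs to see related items together.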
MapReduce, as it is implemented in Hadoop, processes data in a series of steps. The "hello world" example is counting word occurrences in some large body of text, such as the collected works of Shakespeare. The flow looks like this (a rough code sketch follows the list):
- Data is loaded onto the Hadoop Distributed File System (HDFS), where it is segmented into blocks and the blocks are stored on compute nodes (with redundancy for fault tolerance).
- The MapReduce job is started from a master node and the code is distributed to the compute nodes.
- The mappers read their local data files, breaking sentences apart at the spaces.
- Each mapper emits a big list of every word it encountered paired with the number 1, even if the same word appears many times.
- An intermediate shuffle-and-sort step gathers the output from all the mappers and groups it into lists, one per word, containing a 1 for each occurrence of that word.
- These grouped lists go to reducers, which simply sum all the 1s in each list and output the word along with its total.
- All the reducer outputs are gathered and presented as the final answer.
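Here's a rough sketch of what the mapper and reducer might look like, written in Python in the spirit of Hadoop Streaming (which pipes text to scripts over stdin/stdout). It's not the actual Hadoop example code, and the local sort below simply stands in for the shuffle-and-sort work the cluster does between the map and reduce steps.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Break each line apart at the spaces and emit (word, 1) pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(sorted_pairs):
    """Sum the 1s for each word and emit (word, count)."""
    # Relies on the pairs arriving grouped by word, which the
    # shuffle-and-sort step guarantees on a real cluster.
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the cluster flow: map, shuffle/sort, then reduce.
    mapped = mapper(sys.stdin)
    shuffled = sorted(mapped)  # the intermediate shuffle-and-sort
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```

Running `echo "to be or not to be" | python wordcount.py` prints each word with its count. On a real cluster, Hadoop runs the mapper on each node's local blocks, handles the shuffle between machines, and feeds each reducer its own slice of the words; the single-machine pipeline above just makes the data flow easy to see.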
Image courtesy of Wikimedia Commons: https://commons.wikimedia.org/wiki/File:WordCountFlow.JPG
In the next few posts we'll look at why we need tools like MapReduce and some alternatives that have been developed in the last few years.