Thursday, May 5, 2016

Hadoop - Not a Sha Na Na Song

We can trace modern distributed computing back decades to the first "clusters" of computers that were networked directly together and set to work on parts of a larger problem. A good example can be seen in the early days of Pixar's 3D-rendered movies. According to a SunWorld Magazine article from 1995, Toy Story was rendered on 87 dual-processor and 30 quad-processor 100MHz SPARCstation 20 computers, each with between 192 and 384MB of RAM. The 110,000-frame movie took the equivalent of 46 days of continuous rendering (http://sunsite.uakom.sk/sunworldonline/swol-11-1995/swol-11-pixar.html).

Of course, 20 years later those numbers look hilariously small when I'm carrying around an iPhone 6 Plus with a 64-bit, 1.1GHz processor (11x faster than one of those SPARCstations) and 1GB of RAM. But back in the day, the computing power Pixar was using was astounding.

There were many other experiments in that time frame as well. SETI@home, for example, installed a small client program on each participant's PC that downloaded data sets and ran calculations whenever the PC had been idle for a certain length of time. The idea was to take advantage of the fact that many people leave their computers on 24x7 even though they sit idle for a large percentage of that time.

Along comes a Google

In 2004, Google employees Jeffrey Dean and Sanjay Ghemawat published their paper "MapReduce: Simplified Data Processing on Large Clusters," which detailed how Google was able to process such massive amounts of data. They didn't invent the two algorithms, Map and Reduce, but they put them to work in a new way along with a revolutionary idea: the data is much larger than the code, so instead of storing the data on one big hard drive and then moving it to the machine that has the code, break the data into smaller chunks and store those chunks on the same machines that will do the computation.

Although Google didn't release any code with the paper, they gave enough hints that some smart individuals at Yahoo could recreate their success. I'm a little hazy on the exact course of events, and the timeline is only marginally relevant to the rest of the discussion. Suffice it to say that the project eventually went to the Apache Software Foundation and was named Hadoop, after a stuffed toy elephant that belonged to the son of one of the developers. That's also why Hadoop's mascot is an elephant.

Map? Reduce? MapReduce!

The MapReduce algorithm as implemented by Hadoop leverages the Map and Reduce algorithms from Functional programming (a style of programming modeled after mathematical functions) on a cluster of computers called "compute nodes."
  • Map: The map algorithm "maps" some function onto each element of an input group. For example, say you have a list of 100 employees who each need to be given a 3% raise. The Map function could be used on that list to apply the raise (see the sketch just after this list).
  • Reduce: Some type of summation over a list. Continuing with our 100 employees, this time the list records the overtime hours each worker put in last week. A reduction could be used to sum up all of those hours for an end-of-week report.
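
Here's a minimal sketch of those two ideas in plain Python - no Hadoop involved, and the salary and overtime numbers are invented purely for illustration:

    # Plain-Python sketch of Map and Reduce (no Hadoop). Numbers are made up.
    from functools import reduce

    salaries = [52000, 61000, 48500]   # one entry per employee
    overtime_hours = [4, 0, 7, 2]      # overtime logged last week

    # Map: apply "give a 3% raise" to every element of the list.
    raised = list(map(lambda s: round(s * 1.03, 2), salaries))

    # Reduce: collapse the overtime list into a single total for the weekly report.
    total_overtime = reduce(lambda a, b: a + b, overtime_hours, 0)

    print(raised)          # [53560.0, 62830.0, 49955.0]
    print(total_overtime)  # 13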
Map and Reduce have been used like this for a long time. Dean and Ghemawat's major contribution was the idea of breaking up large data files into smaller chunks and storing those chunks on the compute nodes -- the workers. All of the overhead for this operation is incurred when the data is stored, not as part of the processing. Later, at the time of processing, a scheduler program coordinates which nodes are doing the Map process and which are doing the Reduce process and moves the Map or Reduce to the node containing the data, if needed.
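
To make the "move the code to the data" idea concrete, here is a toy sketch of the principle. This is not Hadoop's actual scheduler; the block names, node names, and function are all invented for the example:

    # Toy illustration of data locality -- NOT Hadoop's real scheduler.
    # Every name below is made up for the example.
    block_locations = {
        "block-1": ["node-a", "node-c"],   # each block is stored on more than
        "block-2": ["node-b", "node-a"],   # one compute node (redundant copies)
        "block-3": ["node-c", "node-b"],
    }

    def assign_map_task(block, idle_nodes):
        """Prefer a node that already holds the block; ship the data only as a last resort."""
        for node in block_locations[block]:
            if node in idle_nodes:
                return node, "local read"        # the code moves to the data
        return idle_nodes[0], "remote read"      # fallback: the data moves to the code

    print(assign_map_task("block-2", ["node-a", "node-c"]))  # ('node-a', 'local read')
    print(assign_map_task("block-3", ["node-a"]))            # ('node-a', 'remote read')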

MapReduce, as it is implemented in Hadoop, processes data in several steps. The "hello world" example is counting word occurrences in some large body of text, such as the collected works of Shakespeare. The steps run roughly like this (a code sketch follows the diagram discussion below):
  • Data is loaded onto the Hadoop Distributed File System (HDFS), where it is segmented into blocks and the blocks are stored on compute nodes (with redundancy for fault tolerance).
  • The MapReduce job is started from a master node and the code is distributed to the compute nodes.
  • The mappers read their local data files, breaking sentences apart at the spaces.
  • Each mapper emits a list of every word it encountered, paired with the number 1, even when a word repeats.
  • An intermediate shuffle-and-sort step gathers the output from all the mappers into lists, one per distinct word, containing a 1 for every occurrence of that word.
  • These shuffled lists go to reducers, which simply sum all the 1s in each list and output the word along with its total.
  • All the reducer outputs are gathered and presented as the final answer.
Image courtesy of Wikimedia Commons: https://commons.wikimedia.org/wiki/File:WordCountFlow.JPG
The image above, showing the workflow of a MapReduce job, could be considered a type of "Directed Acyclic Graph." Wait a minute! Any idiot can see there's no X-Y axis - how can that be called a graph? Well, it's a different kind of graph that comes from "Discrete Math" - a branch of mathematics that computer scientists study and that is kind of a mish-mash of many disciplines. The point is, each box in the image is called either a "node" or a "vertex," each line is called an "edge," and the arrows show that processing starts at a well-defined point and follows a well-defined path without ever looping back - which is what makes it "acyclic."
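
To tie the walkthrough together, here is a minimal, runs-on-your-laptop sketch of word count in the spirit of Hadoop's classic example - not the actual Hadoop API. The sample text, function names, and the sort standing in for the shuffle step are all my own simplifications; on a real cluster the mappers and reducers run on separate nodes and the framework handles the shuffle.

    # Local word-count sketch in the spirit of Hadoop's classic example.
    # On a real cluster, mapper and reducer would run on different nodes and
    # the framework would perform the shuffle/sort between them.
    from itertools import groupby
    from operator import itemgetter

    sample_text = [
        "to be or not to be",
        "that is the question",
    ]

    def mapper(lines):
        """Emit (word, 1) for every word, repeats and all."""
        for line in lines:
            for word in line.split():          # break the sentence apart at the spaces
                yield (word, 1)

    def reducer(pairs):
        """Sum the 1s for each word; assumes the pairs arrive grouped by word."""
        for word, group in groupby(pairs, key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    # The "shuffle" step: sort mapper output so identical words sit next to each other.
    shuffled = sorted(mapper(sample_text))

    for word, total in reducer(shuffled):
        print(word, total)                     # one line per distinct word, e.g. "be 2"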

In the next few posts we'll look at why we need tools like MapReduce and some alternatives that have been developed in the last few years.