Thursday, May 5, 2016

Hadoop - Not a Sha Na Na Song

We can trace modern distributed computing back decades to the first "clusters" of computers that were networked directly together and set to work on parts of a problem. A good example can be seen in the early days of Pixar's 3D-rendered movies. According to a SunWorld Magazine article from 1995, Toy Story was rendered on 87 dual-processor and 30 quad-processor 100MHz SPARCStation 20 computers, each having between 192 and 384MB of RAM. The 110,000-frame movie took the equivalent of 46 days of continuous rendering (http://sunsite.uakom.sk/sunworldonline/swol-11-1995/swol-11-pixar.html).

Of course, 20 years later those numbers look hilariously small when I'm carrying around an iPhone 6 Plus with a 64-bit, 1.1GHz processor (11x the clock speed of one SPARCStation) and 1GB of RAM on-chip. But back in the day, the computing power Pixar was using was astounding.

Many other experiments fall in that same time frame. SETI@home, for example, installed a small client program on each participant's PC that would download data sets and do calculations whenever the PC had been idle for a certain length of time. The idea was to leverage the fact that many people leave their computers on 24x7 even though they sit idle for a large percentage of that time.

Along comes a Google

In 2004 Google employees Jeffrey Dean and Sanjay Ghemawat published their paper, "MapReduce: Simplified Data Processing on Large Clusters," which detailed how Google was able to process such massive amounts of data. They didn't invent the two operations, Map and Reduce, but they put them to work in a new way along with a revolutionary idea: the data is much larger than the code, so instead of storing the data on a big hard drive and then moving it to the machine that has the code, break the data into smaller chunks and store the chunks on the same machines that will do the computation.

Although Google didn't release any code with the paper, they gave enough hints that some smart individuals at Yahoo could recreate their success. I'm a little hazy on the exact course of events, and the timeline is only marginally relevant to the rest of the discussion. Suffice it to say that the project eventually went to the Apache Software Foundation and was named Hadoop after a stuffed toy elephant that belonged to the son of one of the developers. That's also why Hadoop's mascot is an elephant.

Map? Reduce? MapReduce!

The MapReduce algorithm as implemented by Hadoop leverages the Map and Reduce operations from functional programming (a style of programming modeled after mathematical functions) on a cluster of computers called "compute nodes."
  • Map: The map operation "maps" some function onto each element of an input group. For example, say you have a list of 100 employees and you need to give each one a 3% raise. A Map over that list could apply the raise to every record.
  • Reduce: A reduce boils a list down to a single value, typically some kind of summation. Continuing with our list of 100 employees, this time the list identifies all workers who put in overtime last week. A reduction could sum up all the overtime hours for an end-of-week report. (A small Python sketch of both operations follows the list.)
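To make those two operations concrete, here is a minimal Python sketch of both, using a made-up list of employee records (the names and numbers are purely for illustration):

    from functools import reduce

    # Made-up employee records: (name, hourly_rate, overtime_hours)
    employees = [
        ("Alice", 25.00, 4),
        ("Bob",   22.50, 0),
        ("Carol", 30.00, 6),
    ]

    # Map: apply a 3% raise to every employee's hourly rate.
    def give_raise(employee):
        name, rate, overtime = employee
        return (name, round(rate * 1.03, 2), overtime)

    raised = list(map(give_raise, employees))

    # Reduce: sum up all the overtime hours for the end-of-week report.
    total_overtime = reduce(lambda total, emp: total + emp[2], employees, 0)

    print(raised)
    print(total_overtime)   # 10

Nothing here needs a cluster, of course - the point is only that "apply this function to every element" and "boil this list down to one value" are the two building blocks that MapReduce scales out.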
Map and Reduce have been used like this for a long time. Dean and Ghemawat's major contribution was the idea of breaking up large data files into smaller chunks and storing those chunks on the compute nodes -- the workers. All of the overhead for this operation is incurred when the data is stored, not as part of the processing. Later, at the time of processing, a scheduler program coordinates which nodes are doing the Map process and which are doing the Reduce process and moves the Map or Reduce to the node containing the data, if needed.

MapReduce, as it is implemented in Hadoop, processes data in several steps. The "hello world" example is counting word occurrences in some large body of text, such as the collected works of Shakespeare. The steps look roughly like this:
  • Data is loaded onto the Hadoop Distributed File System (HDFS), where it is segmented into blocks and the blocks are stored on compute nodes (with redundancy for fault tolerance).
  • The MapReduce job is started from a master node and the code is distributed to the compute nodes.
  • The mappers read their local data files, breaking sentences apart at the spaces.
  • Each mapper emits a big list of every word it encountered paired with the number 1, even if a word repeats.
  • An intermediate process shuffles and sorts the output from all the mappers so that all of the (word, 1) pairs for a given word end up in the same list.
  • These shuffled lists go to reducers, which simply sum all the 1s in each list and output one word and its total.
  • All the reducer outputs are gathered and presented as the final answer.
(Image of the word-count workflow courtesy of Wikimedia Commons: https://commons.wikimedia.org/wiki/File:WordCountFlow.JPG)
The image above, showing the workflow of a MapReduce job, could be considered a type of "Directed Acyclic Graph." Wait a minute! Any idiot can see there's no X-Y axis - how can that be called a graph? Well, it's a different kind of graph that comes from discrete math, a branch of mathematics that computer scientists study and that is kind of a mish-mash of many disciplines. The point is, each box in the image is called a "node" or a "vertex," each line is called an "edge," and the arrows show that processing starts at a well-defined place and follows a well-defined path without ever looping back on itself.
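To make those steps concrete, here is a minimal, single-machine Python sketch of the word-count job. It is not real Hadoop code - the mapper, shuffle, and reducer are simulated in one script so you can see each stage from the list above:

    from itertools import groupby

    def mapper(lines):
        # Map step: emit (word, 1) for every word, repeats and all.
        for line in lines:
            for word in line.strip().lower().split():
                yield (word, 1)

    def reducer(pairs):
        # Reduce step: pairs arrive sorted by word; sum the 1s per word.
        for word, group in groupby(pairs, key=lambda pair: pair[0]):
            yield (word, sum(count for _, count in group))

    if __name__ == "__main__":
        text = ["to be or not to be", "that is the question"]
        mapped = list(mapper(text))        # the map step
        shuffled = sorted(mapped)          # the shuffle/sort step
        for word, total in reducer(shuffled):
            print(word, total)

On a real cluster the mappers and reducers run on different machines and HDFS supplies the input blocks, but the logic of each stage is exactly this simple.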

In the next few posts we'll look at why we need tools like MapReduce and some alternatives that have been developed in the last few years.


Monday, April 25, 2016

Spreading the... love?

I originally started this blog as a place to explore and discuss topics that interest me and projects I'm working on, but aside from the scarcity of posts, I haven't upheld that mission very well. In the beginning I was reluctant to discuss my dissertation project for fear of someone else beating me to the punch, but at this point I think I've narrowed it enough that my work will be unique even if someone were to find the general outlines.

Speaking of spreading something

So what am I really going to waste your time with? While at least half a dozen smart-ass remarks instantly spring to mind, I'll leave those as an exercise for the reader and mention that much of my research time over the past three years has been spent teaching myself the Python language and distributed computing, with particular emphasis on Big Data analytics.

But why distributed?

A few months ago, on a rare occasion when I happened to come up for air, I noticed that processor speed (X.XX GHz) wasn't really mentioned much in advertising anymore, but every ad talked about how many cores the computer has. This seemed curious to me, so, armed with a two-year degree in electronics, I set about researching the shift.

In reasonably short order I found a research paper that explained the shift. As with so many things in tech, it all boils down to physics - specifically, the physics of transistors.

Review of computers and electronics

First, recall that three of the most important computer subsystems -- the CPU, RAM, and ALU (Arithmetic Logic Unit) -- are composed in large part of transistors, and transistors have been shrinking on a fairly regular schedule for decades, which is what has let chip makers roughly double the number of transistors on a chip every couple of years (Moore's Law).

Now if we look at a simplified diagram of a transistor, we see four areas of interest:
  • G - Gate
  • S - Source
  • D - Drain
  • B - Body
Transistors are made through a process similar to photography: areas of the silicon are masked off, the unmasked areas are treated in some way, then the mask is washed away and a different layer is masked.

For our transistor, imagine that the S and D areas have been bombarded with ions that leave them negatively doped, while the area under the gate is doped positive and an insulating layer is placed on top of it. This creates an n-channel field effect transistor (FET). Metal wires are connected to the S, D, and G areas. When a positive voltage is applied to the gate, the positive charge in the region under the gate is repelled, creating a negative channel, and current can flow between S and D.

The problem: as transistors get smaller, all of those regions get smaller. Remember that the gate is the actual "switch" part of the switch, and it sits on an insulator that helps control the voltage applied to it. But if that insulator gets too thin, some current will leak through even when it isn't wanted, and the transistor can no longer be definitively "on" or "off." That's roughly where we are right now: we are running out of room to make transistors much smaller while still controlling the gate leakage.

For this reason, chip makers have largely stopped chasing ever-higher clock speeds and started focusing on ways to do more things at the same time. Thus, multi-core CPUs.

Divide and Conquer

Computer science has long recognized the power of dividing a problem into smaller pieces and then solving those pieces independently - the divide-and-conquer strategy behind two bedrock algorithms, binary search and Quicksort. Multi-core CPUs simply let us take the next step and work on several of those pieces at the same time.
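As a tiny illustration of divide and conquer, here is a bare-bones Quicksort in Python; note that the two pieces it creates are completely independent of each other:

    def quicksort(items):
        # Split the list around a pivot, then sort each piece separately.
        if len(items) <= 1:
            return items
        pivot = items[0]
        smaller = [x for x in items[1:] if x < pivot]
        larger = [x for x in items[1:] if x >= pivot]
        # Because the two pieces don't interact, they could even be
        # sorted at the same time on different cores.
        return quicksort(smaller) + [pivot] + quicksort(larger)

    print(quicksort([7, 2, 9, 4, 1, 8]))   # [1, 2, 4, 7, 8, 9]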

Now, let's imagine you have a task headed for the CPU that can be broken up into pieces - and a surprisingly large number of tasks can be. Picture the task as a row of four grey rectangles, each one a quarter of a long list of numbers, where the job is to see whether a specific number exists anywhere in the whole list. It is reasonably easy to see that searching 1/4 of the list takes less time than searching the whole list, and doing all four of those quarters simultaneously is a further optimization. This is the idea behind multiprocessing with multi-core CPUs.
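Here is a minimal sketch of that idea using Python's multiprocessing module - the list, the target value, and the choice of four chunks are all made up for illustration:

    from multiprocessing import Pool

    def chunk_contains(args):
        # Search one piece of the list for the target value.
        chunk, target = args
        return target in chunk

    if __name__ == "__main__":
        numbers = list(range(1000000))
        target = 876543
        quarter = len(numbers) // 4
        chunks = [numbers[i:i + quarter] for i in range(0, len(numbers), quarter)]

        # Four worker processes each search one quarter of the list.
        with Pool(processes=4) as pool:
            results = pool.map(chunk_contains, [(chunk, target) for chunk in chunks])

        print(any(results))   # True, since 876543 is somewhere in the list

In practice the overhead of starting processes and copying data means this only pays off for genuinely large tasks, but the shape of the solution is the same one Hadoop uses at cluster scale.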

Next time we'll talk about Hadoop and MapReduce.

Tuesday, January 26, 2016

Thoughts on Teaching Computer Programming

First, let me say that I firmly believe anyone can learn to write code. If a person is alive, upright, and able to do things like get dressed in the morning or feed themselves, they have the necessary skills - namely, problem solving and the ability to formulate and then follow steps in an orderly fashion. That being said, I often struggle with helping newbies connect those "real life" skills to the world of programming. I started learning to program more than 30 years ago, so I sometimes have trouble understanding the problems that new programmers have.

I often think about this problem, and I am constantly trying new approaches to help students learn how to approach programming tasks. One such effort is to point out to students that the computer must go through exactly the same steps that they do - it's just that the computer's steps are smaller.

One example I use is summing a column of numbers. Usually when I ask students how to perform the task I get answers like, "Just add up the numbers!" Well, yes, but is that the way you really do it? Can you look at a column of numbers and instantly know the sum? Or do you perhaps do some other steps that you're taking for granted?
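Written out, those taken-for-granted steps look a lot like the first loop a beginner learns. Here is "just add up the numbers" in Python with every small step made explicit (the numbers are arbitrary):

    column = [12, 7, 30, 5, 19]    # the column of numbers on the page

    total = 0                      # start a running total at zero
    for number in column:          # look at each number, one at a time
        total = total + number     # add it to the running total

    print(total)                   # 73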

This morning I had a bit of an epiphany on the topic while taking a shower (where many of my best epiphanies happen). Earlier in the quarter we had an exercise where I asked students to write instruction steps for the new "Lunch-A-Matic 2000 Peanut Butter and Jelly Sandwich Making Robot," keeping in mind that a robot ONLY knows how to do what you tell it. When all the students had handed in their instructions, I pulled out bread, a knife, peanut butter, and jelly, and we attempted to make a sandwich based on the student instructions. The key word is "attempted." Only one student remembered to take the twist-tie off the bread bag :).

The epiphany this morning was the realization that humans think in terms of functions. To put it another way, if you were to try to teach a six-year-old child how to make a PB & J sandwich when they had never been allowed in the kitchen before, you would have to spell out every small step:


  • Walk to the cupboard
  • Open the cupboard
  • Look inside the cupboard
  • Choose bread type
  • Reach dominant hand into the cupboard
  • Grasp the bread 
  • Remove bread from the cupboard
  • Close the cupboard
  • Walk to the counter

etc.

But after making a few PB & J sandwiches, it is possible to tell that same child "Go make a PB & J sandwich" without guiding them through all of the steps. In other words, the set of steps has become a function call!
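In Python terms, the child has gone from following each statement one at a time to exposing a single callable. A toy sketch - the function and step names are mine, purely for illustration:

    def get_bread():
        print("get the bread from the cupboard")

    def remove_twist_tie():
        print("take the twist-tie off the bag")   # the step everyone forgets

    def spread(topping):
        print("spread the " + topping)

    def assemble():
        print("put the slices together")

    def make_pbj_sandwich():
        # All the little steps, bundled into one reusable function.
        get_bread()
        remove_twist_tie()
        spread("peanut butter")
        spread("jelly")
        assemble()

    # Once the steps are bundled, the whole task is a single call:
    make_pbj_sandwich()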

The same kind of thing can be seen with driving. When we first learn how to drive, it looks and feels like an insurmountable task with all the details one must keep in mind. But as we gain experience, driving becomes automatic and we don't really think about all the steps and details that go into the activity. "Driving" to work or to the store becomes a parameterized function call.

Armed with this analogy, I can see that my task as a professor of computer science is to help students look back inside those functions and remember all the little steps.