Friday, August 16, 2013

Genome Assembly For Dummies

Assembly vs. Alignment

I promised in my first blog entry to explain what I'm interested in and why.  I mean, we sequenced the human genome a decade ago (An Overview of the Human Genome Project), right?  Why are we even discussing this?  

The first relevant fact to note is that there are really two operations in which bioinformaticians are interested:  assembly and alignment, but what do these terms really mean?  In order to answer that question we need some background on the process that the actual DNA is put through and before that we have to have some info on genetics.

Genetics 101

I think most folks are at least superficially aware that DNA looks a lot like a twisted ladder - called a double helix.  The "rungs" of the ladder are where the real excitement happens because each one is composed of two chemicals, one from each side, that meet in the middle.  DNA has four such chemicals (adenine, thymine, cytosine, and guanine), called bases, and they are always paired A-T* and C-G.  Each of these "pairings" across the ladder is called a "base pair" and the (estimated) 20,500 human genes are composed of about 3.5 billion base pairs.  

Step 0

Before anything really fun happens, the DNA strand is cut up into little pieces for reasons explained below and injected into bacteria which are then allowed to reproduce, thus giving us many copies of each piece.  

Step 1

The first step to the process happens on the chemical/physical level.  The has to be run through a machine (called a sequencer) that reads each and every base pair but we do not have the technology to read all of the base pairs at once so we chop the DNA strand into smaller pieces.  How long these pieces are is dependent on the sequencing machine and technology being used, but for our purposes there are two types: long read and short read.  Long read sequencers work with strands several thousand base pairs long, whereas short read sequencers give data only a few hundred bp (base pairs) long.  Long read is more reliable and easier to assemble but it is much slower and more expensive.  Short read sequencers can now read a genome in only a few hours for under $1000.  Short read sequencing is many times called "next generation" sequencing and is the method of choice these days.

The drawback is that there is no method for ensuring the exact length of the DNA segments and there is no way to keep them "in order."  That is where I come in.

Step 2

The result from the sequencer is a data file composed of long strings of ACTG... etc. In order for the genome to be considered "sequenced" these strings have to be put in the correct order.  The IEEE Spectrum article entitled The DNA Data Deluge by Michael Schatz and Ben Langmead sums up this step very well when they compare sequencer output to running the book "A Tale of Two Cities" through a paper shredder and then trying to reassemble the shreds.  The process is made even more complicated by the fact that typically not just one strand of DNA is run through the shredder, rather for redundancy and error checking anywhere from 50 to 100 copies of the DNA strand are run at a time.  

So to sum up, you've run 50 to 100 copies of "A Tale of Two Cities" through a paper shredder and gotten a bin full of paper slips with a random number of words on each, and you need to reassemble one copy of the book from that big mess.

The output from this step is a text file with a bunch of contiguous lengths of base pairs (called contigs) which can then be manipulated and annotated further.

Back Where We Started

Assembly and alignment both work with the Step 2 output.

  • Alignment takes the resultant paper slips and compares them, word by word, to a fully assembled version of "A Tale of Two Cities," aligning each slip to its' proper place.  Aligning is by far the much easier process.
  • Assembly, in contrast, attempts to order the slips without a reference.  Many times you will also see the Latin phrase de novo, meaning "for the first time" or to "begin again," associated with assembly.  An example would be a de novo assembly of different humans than the reference.

As I mentioned in the last article, we are having some trouble with de novo assembly of short read data of more complicated organisms like mammals.  An MIT article from 2011, High-quality draft assemblies of mammalian genomes from massively parallel sequence data, states they developed software that can assemble a mouse genome in 3 weeks and a human genome in about 3 1/2 weeks.  

Always listen to experts. They'll tell you what can't be done, and why.Then do it.  
-- Lazarus Long, from Robert A. Heinlein's "Time Enough For Love"

Possibly because I'm new to this whole "bioinformatics" thing or possibly because I'm just too stupid to listen to experts who say "it can't be done," I think I've found a loophole that has been overlooked.  I'm still not ready to talk about the loophole - primarily because I don't want to listen to said "experts" until I have some data to back up my claims.  Until that time, y'all are just going to have to wonder what I'm going to pull out of my... hat.

* Actually, another genetic structure called RNA substitutes uracil for thymine, but that's a slightly different story and not relevant to what I'm showing :).


Schatz, M. C., & Langmead, B. (2013). The DNA data deluge. Spectrum, IEEE50(7), 28-33.

Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., ... & Jaffe, D. B. (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences108(4), 1513-1518.

Friday, August 9, 2013

"For Everything There is a First Time..."

A Star Trek quote? Really? That's how I'm going to start my stuffy, stodgy, scholarly blog? Yep! Sure is!

So what is this "Biozenformatics," anyway? Well, it's like this: I've used the word/name "Zen" in email addresses, gaming character names, nicknames and online forum names for almost two decades now, so it seemed fairly natural that when I went back to school for my doctorate and decided to make my "computer science" dissertation project focus on genome sequencing that the related blog should have the word "zen" in it somewhere.

Wait, what? 

Now that I'm almost three-quarters done with my first year of doctoral studies I thought it might be a good idea to have a place to discuss my research interests and other topics that have relevance to my studies. And the occasional book/TV/movie quote.

Wow, that sounds really pretentious.  OK, the truth is I'm really just starting out in the field and I'm having to teach myself a lot of biology, genetics and some advanced topics in computer science because unlike M.I.T., or Berkeley, etc., Colorado Technical University doesn't have any advanced biological studies.  In those other schools there are whole research teams with their own labs to work on these topics together.  I'm doing this all on my own.

The Story So Far

My master's project was about grid computing.  You remember grid computing, right?  That was "The Next Big Thing" before Cloud Computing.  I worked with grid computing because before that I was intrigued by cluster computing.  You see, I grew up on a farm, and on the farm you frequently have to make do with what is at hand.  Sometimes that means using a shovel when a hoe would work better.  Very often it meant "repurposing" a piece of equipment - usually with a cutting torch, grinder and arc welder.

How does cluster computing and growing up on a farm relate?  Ever since those days I've been a bit obsessed with doing more with less.  More computing power with less money.  When I first read about Pixar's "render farm" - a cluster of 2000 Sun workstations used to render the graphics of their early movies I was hooked.  I love the idea of taking old computers and getting more "horsepower" out of them.

Of course the other big buzzword these days is "Big Data."  Well, for all you non-computer geeks out there, let me let you in on a secret:  computer scientists have been doing "big data" for at least a decade under the name "High Performance Computing."  As soon as I figured this part out, I knew the general direction of my dissertation.

But why genome sequencing, especially when my school doesn't even have any advanced biological studies?  Well, something I know about myself: if a project is going to go to completion, it has to be interesting to me and I spent most of the first 18 years of my life wanting to be a veterinarian.  It seemed like a very natural step that when I started looking for Big Data topics to study, genome sequencing would be at the top of the list.

Where Do We Go From Here?

In a few days I'll post a summary of why genome sequencing is a "big data" topic.  For now, suffice to say that the age of tabletop gene sequencers is beginning, and any one sequencer is capable of putting out terabytes of data a week on dozens of samples, be it animal, vegetable, fungus, or human.  Even with all of our amazing "i7 quad-core 64 bit" desktop computers, our ability to generate the data is growing faster than our ability to process it.

Case in point, I was reading a scholarly paper from late 2012 by a research group out of M.I.T. that very proudly proclaimed their software could sequence a human genome from scratch in "only" three and a half weeks.  One sample.  Three and a half WEEKS!

I think I've found a research avenue that could make a difference.  Even 10% would save days per sample.  I want to do a little more "proof of concept" work before I publicly reveal that avenue.  Hopefully soon I'll have some hard data to post which may prove surprising to a few people.