Monday, November 4, 2013

de Bruijn Graphs or How I Learned to Stop Worrying and Love Discrete Mathematics

I wasn't exactly a math-phobe in junior high and high school; I just didn't like math and consequently wasn't very good at it.  When I got to college and took calculus and physics, that all changed.  Suddenly it all made sense conceptually, and I really liked math.  I still wasn't very good at it, but I liked it.

On a related note, all Computer Science undergrads have to take a class called "Discrete Mathematics" in about their sophomore year.  For many it is a weed-out class - it almost was for me, and I confess to having to take it more than once to get past it.  Now, most CS students tend to be heavily oriented toward the "applied science" end of the spectrum, but Discrete Math is a very abstract hodgepodge of logic, probability, functions and relations, and graph theory.  At the time I saw no reason whatsoever to be taking that stuff.  Twenty-five years later I can look back and see exactly why I was taking it.

Case in point: while studying genome assembly methods I found out that for data produced by short read sequencers (the kind that are fast and cheap to use and popping up in every lab and hospital), building a particular type of graph is one of the most common assembly approaches.

Not Your Father's Kind of Graph

I'm not talking about a graph that is an X, Y plot of a parabola or something.  In graph theory you have "nodes" and "edges."  Still essentially dots and lines, but it's not drawn on an XY axis, and many times the graph's lines have a direction, pointing to the next dot (node).  In 1735, Leonhard Euler (pronounced "Oiler," of all things) worked on a problem called "The Seven Bridges of Königsberg," in which the Pregel River divided the city into four land masses, including an island.  Seven bridges connected those land masses, and the problem was to walk a path through the city that crossed every bridge exactly once.
[Map of Königsberg and its seven bridges, modified by Bogdan Giuşcă for Wikipedia]

Euler realized that the path taken over land to reach a bridge is irrelevant; all that matters is which land areas the bridges connect.  That leaves 4 land areas (nodes) and 7 lines (edges) connecting them.  That information could be redrawn like this:

[Graph of the Seven Bridges problem, created by Mark Foskey for Wikipedia: http://en.wikipedia.org/wiki/User:Mark_Foskey]
That's the basis of what a graph is.  Now, the specific application in genomics is based on tracking overlapping sequences.  You may remember from my previous post that the DNA strands get chopped into small pieces and then copied many times, and that we have no good method of marking the divisions for reassembly.  We also can't guarantee the exact size of these fragments.

Somewhat counterintuitively, the de Bruijn method divides the short read data into pieces of a uniform size and then builds a graph out of the overlaps between them.  Maybe an example will help.

Imagine we have the sequence CTGAAGTCCAGTACA and our uniform size is 5 (each piece is called a k-mer, in this case a 5-mer); then the overlaps would look something like this:

CTGAAGTCCAGTACA
CTGAA
.TGAAG
..GAAGT
...AAGTC
....AGTCC
.....GTCCA
......TCCAG
.......CCAGT
........CAGTA
.........AGTAC
..........GTACA

As you can see, each k-mer is shifted one base to the right.  Representing it more like a graph:

CTGAA -> TGAAG -> GAAGT -> AAGTC -> AGTCC -> GTCCA -> TCCAG -> CCAGT -> CAGTA -> AGTAC -> GTACA

Part of the key to all of this is that there are many overlapping reads covering each part of the sequence.  If we add in the concept of a "weight" for each line, then every time we draw an edge (line) from one node to the next we can increase its weight, and when it is all over, the path with the heaviest weight is the final assembly.
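To make that concrete, here is a minimal Python sketch of the counting, under my own simplifying assumptions: it treats each k-mer as a node and adds one to the weight of the edge joining consecutive k-mers (which overlap by k - 1 bases).  The function names and the toy read list are my inventions for illustration, not anyone's production assembler.

    from collections import defaultdict

    def kmers(read, k):
        """Yield every k-mer in a read, shifted one base at a time."""
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

    def build_graph(reads, k):
        """Build a weighted de Bruijn-style graph: each k-mer is a node, and
        an edge joins consecutive k-mers (which overlap by k - 1 bases).
        Seeing the same edge in another read bumps its weight."""
        weights = defaultdict(int)        # (from_kmer, to_kmer) -> weight
        for read in reads:
            ks = list(kmers(read, k))
            for left, right in zip(ks, ks[1:]):
                weights[(left, right)] += 1
        return weights

    # Toy data: three error-free copies of the example sequence above.
    reads = ["CTGAAGTCCAGTACA"] * 3
    for (left, right), weight in sorted(build_graph(reads, 5).items()):
        print(left, "->", right, " weight", weight)

With three identical, error-free copies of the sequence, every edge along the true path comes out with weight 3; in real data the counts are messier, which is exactly why the weights matter.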

In that last paragraph I skipped a couple of ideas for simplicity's sake.  One is that a full genome has sections that repeat, which makes it much harder to place a k-mer.  The second is that the reading process isn't perfect and errors creep in.  Hence the need for many copies and for the weights.

If, for example, the second k-mer above ("TGAAG") had an error on its first base and read as "GGAAG," we can see that it wouldn't fit in this sequence.  Rather, it would form a short branch off the main path with a low weight.  Low-weight branches are pruned off in the final assembly.
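Continuing the sketch above, and again only as an illustration under my own assumptions: a single erroneous read produces edges with weight 1 sitting next to the heavy edges of the true path, and a simple threshold is enough to drop them (real assemblers use more sophisticated pruning than a bare cutoff).

    def prune(weights, min_weight=2):
        """Drop edges whose weight falls below a threshold.  Real assemblers
        look at graph structure as well, but this shows the basic idea."""
        return {edge: w for edge, w in weights.items() if w >= min_weight}

    # A hand-built corner of the graph: 50 clean reads plus one read
    # carrying the GGAAG error described above.
    graph = {
        ("CTGAA", "TGAAG"): 50,   # true path
        ("TGAAG", "GAAGT"): 50,   # true path
        ("CGGAA", "GGAAG"): 1,    # branch created by the single-base error
        ("GGAAG", "GAAGT"): 1,    # the bad branch rejoins the true path here
    }
    print(prune(graph))           # only the two heavy, true-path edges survive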

Even though all of these algorithms are based on the same theory, as in so many things, the devil is in the details.  Enough so that there is an annual event where researchers pit their assemblers against each other in an "Assemblathon."

Assemblathon 2 (2013) had 43 submissions by 21 teams assembling a fish, a bird and a snake.  They narrowed roughly 100 metrics down to 10 categories used for evaluation.  The conclusion?

"... the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly..."
Perhaps I'll have an entry in 2015 or 2016 :).

References:

Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, İ., ... & MacCallum, I. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406.

Homologus. (2011-13). de Bruijn Graphs for NGS Assembly. In Homolog.us. Retrieved 11/4/2013, from http://www.homolog.us/Tutorials/index.php?p=1.1&s=1.

Wikipedia. (2013). Seven Bridges of Königsberg. In Wikipedia, the free encyclopedia. Retrieved 11/4/2013, from http://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg. 

Friday, August 16, 2013

Genome Assembly For Dummies

Assembly vs. Alignment


I promised in my first blog entry to explain what I'm interested in and why.  I mean, we sequenced the human genome a decade ago (An Overview of the Human Genome Project), right?  Why are we even discussing this?  

The first relevant fact to note is that there are really two operations in which bioinformaticians are interested: assembly and alignment.  But what do these terms really mean?  In order to answer that question we need some background on the process the actual DNA is put through, and before that we need a little info on genetics.

Genetics 101

I think most folks are at least superficially aware that DNA looks a lot like a twisted ladder - called a double helix.  The "rungs" of the ladder are where the real excitement happens because each one is composed of two chemicals, one from each side, that meet in the middle.  DNA has four such chemicals (adenine, thymine, cytosine, and guanine), called bases, and they are always paired A-T* and C-G.  Each of these "pairings" across the ladder is called a "base pair," and the human genome, which contains an estimated 20,500 genes, is roughly 3 billion base pairs long.
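Because of that pairing rule, one side of the ladder completely determines the other.  Here's a tiny Python sketch of the rule; the function name and the example sequence are just my own picks for illustration.

    # The pairing rule from above: A always pairs with T, C always with G.
    PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def opposite_strand(strand):
        """Return the bases that would sit across the "ladder" from this strand."""
        return "".join(PAIR[base] for base in strand)

    print(opposite_strand("GATTACA"))   # prints CTAATGT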


Step 0


Before anything really fun happens, the DNA strand is cut up into little pieces (for reasons explained below) and injected into bacteria, which are then allowed to reproduce, thus giving us many copies of each piece.

Step 1


The first step of the process happens on the chemical/physical level.  The DNA has to be run through a machine (called a sequencer) that reads each and every base pair, but we do not have the technology to read all of the base pairs at once, so we chop the DNA strand into smaller pieces.  How long these pieces are depends on the sequencing machine and technology being used, but for our purposes there are two types: long read and short read.  Long read sequencers work with strands several thousand base pairs long, whereas short read sequencers give data only a few hundred bp (base pairs) long.  Long read is more reliable and easier to assemble, but it is much slower and more expensive.  Short read sequencers can now read a genome in only a few hours for under $1000.  Short read sequencing is often called "next generation" sequencing and is the method of choice these days.

The drawback is that there is no method for ensuring the exact length of the DNA segments and there is no way to keep them "in order."  That is where I come in.

Step 2


The result from the sequencer is a data file composed of strings of ACTG... etc.  In order for the genome to be considered "sequenced," these strings have to be put in the correct order.  The IEEE Spectrum article entitled The DNA Data Deluge by Michael Schatz and Ben Langmead sums up this step very well when they compare sequencer output to running the book "A Tale of Two Cities" through a paper shredder and then trying to reassemble the shreds.  The process is made even more complicated by the fact that typically not just one strand of DNA is run through the shredder; rather, for redundancy and error checking, anywhere from 50 to 100 copies of the DNA strand are run at a time.

So to sum up, you've run 50 to 100 copies of "A Tale of Two Cities" through a paper shredder and gotten a bin full of paper slips with a random number of words on each, and you need to reassemble one copy of the book from that big mess.

The output from this step is a text file with a bunch of contiguous lengths of base pairs (called contigs) which can then be manipulated and annotated further.
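As a small aside, assembled contigs are commonly stored in a plain-text format called FASTA (a ">" header line followed by lines of bases).  Here's a minimal sketch of reading such a file; the format choice and the file name are my assumptions, since this post doesn't name a specific output format.

    def read_contigs(path):
        """Read contigs from a FASTA-style text file: each contig is a ">"
        header line followed by one or more lines of bases."""
        contigs = {}
        name = None
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    name = line[1:]
                    contigs[name] = []
                elif name is not None:
                    contigs[name].append(line)
        return {name: "".join(parts) for name, parts in contigs.items()}

    # Hypothetical usage (the file name is made up):
    # contigs = read_contigs("assembly_output.fa")
    # print(len(contigs), "contigs,", sum(len(c) for c in contigs.values()), "bases")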


Back Where We Started


Assembly and alignment are two different ways of accomplishing Step 2; both start from the pile of reads the sequencer produces.

  • Alignment takes the resultant paper slips and compares them, word by word, to a fully assembled version of "A Tale of Two Cities," aligning each slip to its proper place.  Alignment is by far the easier of the two processes (a naive sketch of this matching follows the list).
  • Assembly, in contrast, attempts to order the slips without a reference.  Many times you will also see the Latin phrase de novo, meaning "anew" or "from the beginning," associated with assembly.  An example would be a de novo assembly of a human genome other than the one used as the reference.
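Here is that deliberately naive Python sketch of the alignment idea: slide a read along a reference and report where it matches exactly.  Real aligners tolerate mismatches and use clever indexes instead of brute force, and the toy data below is my own.

    def naive_align(read, reference):
        """Return every position where the read matches the reference exactly.
        Real aligners allow mismatches and use indexes (suffix arrays, the
        Burrows-Wheeler transform) rather than scanning every position."""
        hits = []
        for start in range(len(reference) - len(read) + 1):
            if reference[start:start + len(read)] == read:
                hits.append(start)
        return hits

    reference = "CTGAAGTCCAGTACACTGAAG"      # a toy "fully assembled book"
    print(naive_align("TCCAG", reference))   # [6] -- the slip's proper place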

As I mentioned in the last article, we are having some trouble with de novo assembly of short read data from more complicated organisms like mammals.  A 2011 paper from the Broad Institute of MIT and Harvard, High-quality draft assemblies of mammalian genomes from massively parallel sequence data, states they developed software that can assemble a mouse genome in 3 weeks and a human genome in about 3 1/2 weeks.

Always listen to experts. They'll tell you what can't be done, and why. Then do it.
-- Lazarus Long, from Robert A. Heinlein's "Time Enough For Love"

Possibly because I'm new to this whole "bioinformatics" thing or possibly because I'm just too stupid to listen to experts who say "it can't be done," I think I've found a loophole that has been overlooked.  I'm still not ready to talk about the loophole - primarily because I don't want to listen to said "experts" until I have some data to back up my claims.  Until that time, y'all are just going to have to wonder what I'm going to pull out of my... hat.




* Actually, a related molecule called RNA substitutes uracil for thymine, but that's a slightly different story and not relevant to what I'm showing :).


References:

Schatz, M. C., & Langmead, B. (2013). The DNA data deluge. IEEE Spectrum, 50(7), 28-33.

Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., ... & Jaffe, D. B. (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4), 1513-1518.

Friday, August 9, 2013

"For Everything There is a First Time..."

A Star Trek quote? Really? That's how I'm going to start my stuffy, stodgy, scholarly blog? Yep! Sure is!

So what is this "Biozenformatics," anyway? Well, it's like this: I've used the word/name "Zen" in email addresses, gaming character names, nicknames and online forum names for almost two decades now, so it seemed fairly natural that when I went back to school for my doctorate and decided to make my "computer science" dissertation project focus on genome sequencing that the related blog should have the word "zen" in it somewhere.

Wait, what? 


Now that I'm almost three-quarters done with my first year of doctoral studies I thought it might be a good idea to have a place to discuss my research interests and other topics that have relevance to my studies. And the occasional book/TV/movie quote.

Wow, that sounds really pretentious.  OK, the truth is I'm really just starting out in the field, and I'm having to teach myself a lot of biology, genetics and some advanced topics in computer science because, unlike M.I.T. or Berkeley, Colorado Technical University doesn't have any advanced biological studies program.  At those other schools there are whole research teams with their own labs to work on these topics together.  I'm doing this all on my own.

The Story So Far


My master's project was about grid computing.  You remember grid computing, right?  That was "The Next Big Thing" before Cloud Computing.  I worked with grid computing because before that I was intrigued by cluster computing.  You see, I grew up on a farm, and on the farm you frequently have to make do with what is at hand.  Sometimes that means using a shovel when a hoe would work better.  Very often it meant "repurposing" a piece of equipment - usually with a cutting torch, grinder and arc welder.

How do cluster computing and growing up on a farm relate?  Ever since those days I've been a bit obsessed with doing more with less.  More computing power with less money.  When I first read about Pixar's "render farm" - a cluster of 2000 Sun workstations used to render the graphics of their early movies - I was hooked.  I love the idea of taking old computers and getting more "horsepower" out of them.

Of course the other big buzzword these days is "Big Data."  Well, for all you non-computer geeks out there, let me let you in on a secret:  computer scientists have been doing "big data" for at least a decade under the name "High Performance Computing."  As soon as I figured this part out, I knew the general direction of my dissertation.

But why genome sequencing, especially when my school doesn't even have any advanced biological studies?  Well, something I know about myself: if a project is going to go to completion, it has to be interesting to me, and I spent most of the first 18 years of my life wanting to be a veterinarian.  It seemed like a very natural step that when I started looking for Big Data topics to study, genome sequencing would be at the top of the list.

Where Do We Go From Here?


In a few days I'll post a summary of why genome sequencing is a "big data" topic.  For now, suffice to say that the age of tabletop gene sequencers is beginning, and any one sequencer is capable of putting out terabytes of data a week on dozens of samples, be it animal, vegetable, fungus, or human.  Even with all of our amazing "i7 quad-core 64 bit" desktop computers, our ability to generate the data is growing faster than our ability to process it.

Case in point, I was reading a scholarly paper from late 2012 by a research group out of M.I.T. that very proudly proclaimed their software could assemble a human genome from scratch in "only" three and a half weeks.  One sample.  Three and a half WEEKS!

I think I've found a research avenue that could make a difference.  Even 10% would save days per sample.  I want to do a little more "proof of concept" work before I publicly reveal that avenue.  Hopefully soon I'll have some hard data to post which may prove surprising to a few people.