>> Bill Bolosky: I am very pleased to introduce to you Kristal and Matei who are both fifth-year
grad students at Berkeley, although, this does not mean that they are equidistant from
graduating because Matei is finishing next year and Kristal just told me two, but they have
spent quite a bit of time working with a bunch of people including me on gene sequencing. I
can't speak for them, but for me this is one of the most fun things I have ever done. It's just
amazingly cool to learn that knowledge about hash tables translates into biology. They are
going to get to tell you about the project that we worked on and they are going to team speak
and I don't know which order they are in.
>> Matei Zaharia: I am Matei and I will start off. This is a project from a bunch of people who were interested in it; a lot of the people you see up here are from the AMPLab at UC Berkeley, which
is a new lab that started about a year ago which is focused on big data and large-scale machine
learning and data processing and we are also working with Bill and Harvey from Microsoft and
with some folks from UCSF including Taylor Sittler who is here today. He is a medical
researcher there. Let me just jump into this, why we're doing this. I think if you have seen
some biology in the past you will know that DNA is the molecule and the code that orchestrates
all of the activity of a living cell, and because of this, DNA is a central element in a lot of
diseases. Either there are differences in the DNA that cause the disease, or the DNA is actually involved in
propagating it somehow. In particular, what DNA does is it encodes how to build these worker
molecules called proteins in the cell and also signaling information about essentially when to
build them, so everything the cell does is guided by that. Two ways in which it ties to diseases
is first of all in cancer, cancer is essentially caused by the DNA in some cells accumulating
enough mutations that the regulatory mechanisms break down and those cells start multiplying
without the right regulation and start consuming all of the resources of the body. On top of
that there are hereditary diseases, where it is suggested that there is some mutation that is
passed down that causes a problem. DNA also affects susceptibility to various drugs and it's an
important thing to understand what's going on in the cell. One of the things that's happened in
the past decade is that the cost of actually reading the DNA from a cell has fallen dramatically
and it's fallen enough that it's starting to actually be used in clinical medicine. Here is a picture
of that and this compares against Moore's law. This is the cost of sequencing one human
genome and if you go back to 1999 and 2000, the human genome project cost several billion
dollars to sequence the first human genome and that was a great accomplishment. If you fast
forward a bit, this year you can actually get your genome sequenced for about $3000 and next
year a couple of companies have announced that they will do it for $1000, so this sort of
magical thousand dollar genome is really on the horizon and I think if you work out the rate, the
cost has been falling by something like a factor of three or four per year and it's actually quite a
bit faster than Moore's law, so we are able to actually read this stuff now. Being able to read
this stuff can impact in a lot of areas of medicine and of course of biological research as well.
Here are just some examples. In cancer the most exciting thing about this is that cancer is
typically caused by a bunch of mutations that affect many different gene pathways, like these
are groups of genes that work together to perform a function in the cell and often cancers that
look very different like a skin cancer and a lung cancer might involve the same pathway, and by
actually sequencing it you might be able to figure out which pathway is involved and also which
drugs you can use to target that particular pathway, so there are many targeted cancer drugs
that will work really well if one of the possible things is going on but will have no effect if
something else is happening. For infectious disease like just common viruses and bacteria one
of the nice things you can do with sequencing is you can really quickly find out which pathogen
a patient is infected with. You don't have to do these sort of crazy diagnostic tests where you
look at the symptoms. You can actually get some DNA and see what's going on, and another
thing this has been used for is identifying new diseases such as the H1N1 flu by actually
sequencing and saying hey, this looks different from stuff that we've seen before. Finally you
can personalize medicine to an individual’s genome so you can identify things they are
susceptible to in advance so for example, some drug allergies or side effects only happen to
people with specific genes, and you can know which inheritable diseases they have, and then you can
estimate the risk of various things going on. There are some examples of this happening
already. Just a couple of weeks ago there were a series of New York Times articles about
sequencing driven personalized treatment for cancer, so this was a case at Washington
University where they sequenced the cancer that, actually one of the researchers there had and
they figured out that it was actually susceptible to a drug that had been designed for a
completely different type of cancer and they applied it and actually managed to induce a
remission in that patient. There is another one that was, well this was the case where in the
end it didn't work out. The remission was only for a short amount of time, but similarly they
were able to use a drug from one cancer to treat a different one that it normally wouldn't have
been prescribed for. Some examples in noncancer things as well, for example, diagnosing
metabolic diseases, which is usually very hard. They can sometimes be due to mutations in
mitochondrial DNA and here they were able to just sequence mitochondria and figure out some
of these diseases. So this is all great in terms of affecting medicine, but we are here talking
about a computer research lab. So what is the computational challenge? It turns out that
although the cost of the wet lab part of sequencing is dropping dramatically, you have to do a
lot of processing to actually put this data to use, both to understand what is happening, like
how to actually put together the sequence for one human and of course beyond that to
understand which genes affect which diseases. We are not even going to get into that in this
talk, but that is also a very hard kind of machine learning problem. The reason that the
processing is difficult is that the way the sequencing machines work is essentially by using
massive parallelism by reading many small random substrings of the genome in parallel and you
get together these little random strings kind of like puzzle pieces and you have to put them
together to see the whole picture of what was actually going on in that person's genome, and I
will show what that looks like and what are the steps you need to do that. The bottom line is
that the current pipelines for doing this take multiple days for each genome, and if you actually work out the compute time, on something like Amazon or Azure, or to some extent even if you buy your own machines, it can cost into the thousands of dollars just for one genome, so this is starting to exceed the cost of the actual sequencing itself.
Question?
>>: [inaudible] just the chemistry purpose?
>> Matei Zaharia: Yes. Just the chemistry part, exactly. So this is not great for clinical use, but
it's especially bad if you want to scale up the number of genomes you sequence for research
and like really understand these diseases, because people have to pay that many times over to
sequence hundreds of cancer patients and see what is going on. The goal of our team is to
build a faster and more scalable and more accurate pipeline that does this genome
reconstruction and that can be used both in actual medicine and in research where you get a
lot of different genomes and you want to compare them or you just want to do this kind of
computation at scale. We have a bunch of people involved from a bunch of different areas.
We have a lot of systems people. This started out essentially with a bunch of systems people
talking with Taylor and figuring out that we can do some of these steps faster and Bill and
Harvey from Microsoft. We have Taylor who is from UCSF and we have some folks in machine
learning and in computational biology and theory that are helping out as well at Berkeley. So
what will we have in this talk? I am going to first have an overview of how the sequencing
process works and what are the computational problems and then we will mostly talk about the
first processing step which is sequence alignment and also it turns out to be the most expensive
in terms of CPU time. We will talk about a new algorithm we developed called SNAP that cuts
down the cost of this step by a factor of 10 to 100 compared to the existing tools, so this is
something that takes a problem that was essentially CPU bound initially and took more than a
day for a human genome and it turns it into a problem that's now basically I/O bound and we
can do it in an hour and a half. Then we'll talk about some ideas we have for further improving
alignment, especially the accuracy by taking advantage of the structure of the genome and
finally we will also talk about some downstream processing steps beyond alignment. Hopefully
this will give you a taste of like the different types of problems that exist. Let me just show you
quickly with some pictures how the actual sequencers work and what you have to do with the
data. DNA itself that you have is a long molecule, well it's actually a bunch of long molecules
but in total it is 3 billion of these letters or bases long that encodes the sequence. We have a
sequence of these 3 billion things and the first thing you do is you replicate it so you have a
bunch of copies or you just create a bunch of copies from a cell. Next to actually sequence it,
everything is done in parallel, so you split it randomly into fragments and you can just do it by
heating it up or something like that and you get these little fragments of DNA. After that you
can read the fragments in parallel. There are machines that do this by doing some clever
chemistry to be able to read the sequences of each of them and the way it works is they kind of
attach it to a little thing on a plate and they float around the complementary bases to the ones
that are part of your DNA and they read what actually sticks to it in what order and as they do
this they end up reading the sequence. These little fragments are called reads and about 100
bases or letters long today, but the whole thing as I said is 3 billion. In a typical human genome
sequencing run you are going to get 10 to the nine of these reads which is basically covering
each location in the genome about 30 times and they are each 100 bases long. Now, how do
you actually put these things back together? You can view it as a puzzle. You just get this
sequence of reads and one way to treat it as a puzzle is, like, let's see what the picture on the top of the puzzle box is: you can use a reference genome, which is the genome that was sequenced
back in 1999, 2000 and has been refined since then where people use longer read technology
and know how everything ties together. When you look at the reads from the person they are
going to have some differences from the reference genome of course because it is a different
person and they will also have differences because of variances in the sequencing itself, but
what you can do is you can take each read and see where in the reference genome its sequence matches best, and then just place it at that location, like placing a piece of the puzzle on the
sheet in front of you. So this step is called sequence alignment. This is the one that we will talk
a lot about. And what this lets you do is that once you've lined up all of the reads at each
location, you'll get something like this where a bunch of them are aligned together and, you
know, they might show some difference from the reference genome or you might have some of
them show a difference, and some of them show noise because there is some error in the actual
reading process, but then you can do kind of a voting algorithm to actually figure out what the
base at that location was. This is really simple, but it turns out that there is a lot of stuff that
makes this hard, so this step is called variant calling, which we'll talk about a little bit later. Just to
give you a sense of the rate of differences between these, two people only differ in about one
in 1000 bases from each other, so that's pretty small, but the sequencing machines can have
error rates of up to a few percent although it depends and it turns out some of the errors are
also pretty biased. So this is kind of the error rate you are looking at. Let's jump into the first
step of this, which is alignment and what we ended up doing for that. The alignment problem, as I said, is: given one of these reads and the reference genome, which is a big string, find the position in there that minimizes the edit distance to the genome, and again to give you a sense,
the genome is 3 billion bases and the reads are about 100. If you look at the current status,
alignment is today the most expensive step of processing and it's also important because you
really have to map the reads to the right location in order to do anything downstream of that,
so depending on the accuracy of the tools you use it can take a few hundred to a few thousand
CPU hours and if you work out the math it can take sort of hundreds to a thousand dollars of
compute time and the issue there is also that the faster aligners lose accuracy so they typically
don't align as many of the reads and they support fewer differences inside each read. The
problem with that is if you have a place in the genome where a person really has like five
differences in a row and your aligner doesn't support that, you are going to systematically miss
all of the reads that map there and you are not going to see that difference in the downstream
analysis. So we built SNAP, the Scalable Nucleotide Alignment Program, a tool that is 10 to 100
times faster than the current ones and at the same time it improves the accuracy, so it has
higher accuracy and it also has a richer error model that allows for more types of differences
and basically you can give it a parameter k, the number of edits you allow from the read to the reference genome, and it will find the best location with at most k edits. As a result, as I was saying, we cut down this step from about 1.5 days to 1.5 hours, and this is done while cutting the number of errors in half on sort of real human data. What do current
aligners do and how do we do this better? There are two methods to do alignment. One of the
earliest ones, if you've heard of BLAST, this is what it does. It's based on seeds and so the idea
here is you have the genome, you index just short substrings of it, of say 10 characters and
what you do is then you take your read and you take every 10 characters in here and look for
an exact match in the index. Here we have just four. Say these are the first four of them and
this is where it matches and so you've got some candidate locations to try and then you place
the read on each one and compute the edit distance and you end up picking up the best one at
the end. So this is the seed-based method. And of course you don't know like maybe the first
10 bases match somewhere but actually that's just by chance. Maybe there were some errors
in there, so you have to try multiple seeds to really find the best location and so you keep doing
this work continuing with the seeds. This is one method. The other method, which many of the faster tools today use, is the Burrows-Wheeler transform to encode kind of a prefix tree of
the genome of like all of the substrings in the genome and then they search through this tree
using backtracking. This is you are just going down the tree and trying to insert this into
different locations, but I'm not going to talk much more about this because it's a bit more
complicated to explain. What we do in SNAP is we have actually taken the seed-based method
but we've done a bunch of algorithmic changes and also sort of systems changes that reduce
the cost of the most expensive step, which turns out to be the local edit distance check against candidate strings. On the one hand we leverage just improving resources. One of these is
the actual read length, so the read lengths used to be about 25 bases and now they've gotten
longer to about 150 and using this it turns out you can really change the form of hash index you
have and do quite a bit better. We also leverage higher memory so our algorithm is designed
for servers with about 50 gigabytes of memory, so we can use that. On the algorithm side we
have a way to prune the search to reject most of the local alignment locations without fully
computing the score and this turns out to save a lot of time as well. Let me just explain the first
part. We're going to use these seeds to match a hash table of just exact matches in the
reference genome, but we've chosen to use longer seeds than a lot of the previous aligners, and
the reason for this is that there is a trade-off in general between the seed size, the probability of actually finding a seed that matches the genome, and the amount of false
positive hits you have. The human genome is about 4 to the 16 bases and if you have a seed of
10 bases and if you have a 2% sequencing error, it turns out there is a 19% chance that your
seed contains an error. Otherwise 80% of the time it will match, but also you will have four to
the six or 4000 matches just by chance against the whole genome just on average, so this is why
people have used it in the past and it made sense when you have very short reads. You
couldn't take much longer seeds than that and expect them to match. If you go up to a 20-base
seed, you have a higher chance of an error, but you also have almost 0 chance of it matching
just randomly at a particular location. At least if the genome is a random string, you expect to
test a lot fewer candidate locations. The reason this makes sense for a longer read is because
the read is longer you can take many more independent seeds to try. Yes?
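A quick sketch of the arithmetic behind those figures, assuming a uniformly random genome of roughly 4^16 bases and independent 2% per-base errors, which is why the results land near, but not exactly on, the numbers quoted above:

```python
# Back-of-the-envelope check of the seed-size trade-off (illustrative only).
GENOME_LEN = 4 ** 16      # the talk approximates the ~3-billion-base genome as 4^16
ERROR_RATE = 0.02         # ~2% per-base sequencing error

def p_seed_has_error(seed_len, error_rate=ERROR_RATE):
    """Probability that at least one base of the seed was misread."""
    return 1.0 - (1.0 - error_rate) ** seed_len

def expected_chance_hits(seed_len, genome_len=GENOME_LEN):
    """Expected matches of one seed against a random genome of this length."""
    return genome_len / (4 ** seed_len)

for s in (10, 20):
    print(s, round(p_seed_has_error(s), 2), expected_chance_hits(s))
# 10-base seed: ~0.18 chance of containing an error, ~4096 chance hits
# 20-base seed: ~0.33 chance of containing an error, ~0.004 chance hits
```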
>>: Have you made any assumptions about high repeat regions?
>> Matei Zaharia: Right. This is very simplified, so actually high repeat regions mess this up, so
some regions, some seeds are much more common than others and just big regions are
replicated. Yes. So you still have to search a lot; you have to search more than one hit for each
seed usually, yeah. We will actually talk about that idea. That's one of the things that we are
looking at. So just to show this, so you have this short read of 25 bases, you have an error
there. If you take a 20-base seed, there is pretty much no place you can put it without touching that error, but if you have a long read of 100 bases, there are a bunch of places to try, and even with a 2% error rate you do expect some seeds to work well, and so you can actually find
matches for this in the index. So this is just an observation with some math of something we
can do. Yes?
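To make the seed-based scheme concrete, here is a toy sketch: index every overlapping 20-base substring of the reference in a hash table, look up a few disjoint seeds from the read, and check each candidate location with an edit-distance computation. This is only an illustration with made-up names, not SNAP's implementation, which packs its seeds into a compact index of tens of gigabytes and prunes much more aggressively.

```python
from collections import defaultdict

SEED_LEN = 20

def build_index(reference):
    """Map each overlapping seed to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - SEED_LEN + 1):
        index[reference[pos:pos + SEED_LEN]].append(pos)
    return index

def edit_distance(a, b):
    """Plain O(len(a) * len(b)) Levenshtein distance (no early cutoff here)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align(read, reference, index):
    """Return (best_position, best_edit_distance) among seed-hit candidates."""
    candidates = set()
    for offset in range(0, len(read) - SEED_LEN + 1, SEED_LEN):   # a few disjoint seeds
        for pos in index.get(read[offset:offset + SEED_LEN], []):
            candidates.add(pos - offset)                          # implied read start
    best = (None, len(read) + 1)
    for start in candidates:
        if 0 <= start <= len(reference) - len(read):
            d = edit_distance(read, reference[start:start + len(read)])
            if d < best[1]:
                best = (start, d)
    return best
```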
>>: What technology allows you to go from 25 to 100 [inaudible] thousand [inaudible]?
>> Matei Zaharia: Yeah, definitely. Regions are improving and basically what it is is I showed
that picture at the bottom like if you attach one end of the DNA string and then you float these
molecules around and they are actually fluorescent and there's a camera pointed at it to see which one attached, and what happened is they've been able to attach more of them before they have to stop. Before, like, after about 25 of them there was too much noise to
be able to attach more and like see which one actually got put there, and they've improved
both chemistry of how those things float around and the camera technology to be able to do
more.
>>: [inaudible] technology as well [inaudible] electronic reads [inaudible] there are other
technologies that are more electronic reads so you have way less complications so we will see
longer reads coming in the not too distant future.
>>: I was wondering what is the order of magnitude we can expect?
>> Matei Zaharia: I think 10,000.
>>: 10,000?
>>: There are other people working on mega based reads.
>> Matei Zaharia: Yeah.
>>: Yeah [inaudible] [laughter].
>>: It has a ways to go but it will be interesting.
>>: Yeah, the 10,000 base pair reads, they seem to be doable. There are some prototypes that
are out there.
>>: Turns out that backtracking is a bad strategy for like 10,000 or higher. [laughter].
>> Matei Zaharia: Yeah, even for 100. We'll see. Yep. Yeah, that's a good question. These
things are going to be getting longer. The other thing we do is very simple on the index site, but
it actually helps a lot. A lot of the existing tools were designed at a time when server memories
were a lot lower, and for example, if you indexed 10-base-pair seeds, they only took the non-overlapping ones, like the one at position 0 to 9 and then 10 to 19, whatever, all of these disjoint
seeds. What we do is we just do a sliding window and index every substring of that length and
if you think about it this means that our index has 10 times, or actually 20 times more seeds in
it because we are using 20-base-pair seeds, but it turns out that if you pack the bytes nicely
into a hash table, you can fit that into a very reasonably sized memory, so we are able to do this
with 39 gigabytes of memory for the human genome. This is important because looking for a
seed and not finding it in the hash table is actually really expensive. It's like hundreds of cycles
because it's basically a cache miss; even with a smaller hash table than they had before, it's not going to fit in your processor cache, so it really helps to actually have this denser index. The other part, the
algorithmic part that is really different beyond this is the way we do the local alignment check.
As I mentioned before, it turns out that the genome is not a random string. There are a
lot of areas that are quite similar to each other, so for many seeds you will find them in a lot of
locations, and in general for many reads you will have to test a bunch of candidates before
you find the best one. This is where all of the cycles in the algorithm go essentially, at least 90%
are going into this. What is our insight here? The thing we figured out is that in the end of the
day, to actually map the read you only care about the best and second best locations where it aligns. If the place where it aligns best, say, has edit distance one and the second
best place has edit distance five, you can be pretty sure that it came from that best one and you
are just going to align it there. If the best is at distance one and the second best is at distance
two, then maybe you are not sure and you are just going to tell the downstream analysis okay, I
am not sure where this read goes. You might give it both locations or something like that, but
these are kind of the confidently called reads; you just need to know the best and second best
locations. How can we use this? We replace the traditional edit distance algorithm which
always computes the full edit distance at each location and is a quadratic-time algorithm, with one
where you can give it a limit on the edit distance and you can tell it if the distance is bigger than
this, I don't care how big it is. Just stop early and tell me it's bigger, and so we have this
algorithm, the complexity is only n times the distance limit, and we lower the limit as we find
more hits and improve the best and second best locations. Yes?
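Here is a rough, self-contained sketch of a distance-limited edit distance of the kind just described: it fills in only a band around the main diagonal and gives up as soon as the distance must exceed the limit, so the work is about n times the limit. The exact implementation in SNAP differs; this is just to show the shape of the computation.

```python
def limited_edit_distance(a, b, limit):
    # Returns the exact edit distance if it is at most `limit`, and limit + 1 as
    # soon as it is known to exceed the limit. Only cells within `limit` of the
    # main diagonal are filled in, so the work is roughly O(len(a) * limit).
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    INF = limit + 1
    prev = [j if j <= limit else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - limit), min(len(b), i + limit)
        cur = [INF] * (len(b) + 1)
        if i <= limit:
            cur[0] = i
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1,                            # delete from a
                         cur[j - 1] + 1,                         # insert into a
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # match / substitute
        if min(cur[lo - 1:hi + 1]) > limit:      # the whole band is over the limit
            return limit + 1
        prev = cur
    return prev[len(b)] if prev[len(b)] <= limit else limit + 1

# limited_edit_distance("GATTACA", "GACTACA", 2) -> 1
```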
>>: [inaudible].
>> Matei Zaharia: Yeah, n is the larger of the two strings, but we are only comparing locally at each place, so n is like a hundred basically; it is the length of the read. We've mapped it to one place with a seed, and we compare those hundred bases against the hundred that we have, yeah. With this
algorithm you can lower the limit and you can also arrange things so that the first hit you find
actually has a low edit distance and start out with a low one.
>>: Basically a banded edit distance?
>> Matei Zaharia: It's a little kind of like filling in only the diagonal of the matrix, but it's
actually a bit nicer than that because it only uses like order d space, so it only tracks how far
you can go down each diagonal, so it's actually kind of cool because it fits in the L1 cache of the
processor as well. That's what it is. Here is like the actual algorithm. I'll just step through this
to show you the way we actually do the pruning. Basically what we do, when we start, we start
the d limit to be the maximum distance plus this confidence threshold c which is how far away
we want the best and second best rate to be to call it unambiguous. We go out and extract
seeds from it and for each seed, so this is the confidence threshold. For each seed we go out
and find the location where it matches in the genome and we actually prune seeds that have
too many locations because some things are just too repetitive, we look for a better seed that
doesn't have that property. We add candidates to a list. Once we've tested a minimum
number of seeds, we look at the candidates that match the most of them and the idea here is
we are going to do an expensive edit distance computation. Let's find a candidate that will give us
a low d limit for the next one, so we score that candidate. Next thing we do is we update the
distance limit. There are two cases for this one. We look at the best and second best hit and
we keep tracking them as we go along. There are two possibilities. If the best is much better
than the second best and we have this confidence threshold c, then we only care about finding
other hits within best plus c, because the idea is if there are hits with a bigger edit distance than
this one, it's not going to change our results, but if we find one in this window of up to c more
we’re going to say maybe we are not confident about the match. That's the first case. Second
case is if the best and second are already within distance c then there is no way that finding
guys bigger than the best is going to help us. The only thing that will help us is if we find
something much better than the best that is up in this window here and there is nothing
between it and best so we can be confident about it, so we only search up to best -1 in this
case. So yes, that's what we do with that. The final thing is there is a trick that you can do to
stop early. If you found that the best is at distance two and maybe you only care about things out to distance four beyond that, what you can do is, if you have done at least five disjoint seeds from
the read, you can actually stop right there because if you've tested a bunch of disjoint seeds
you know that any location that you haven't yet put in your candidate list, so it didn't match any of
the seeds, has at least that many errors in it. So if you've tested five seeds and there is a
location that you haven't yet found through the exact matches on those seeds, then it must
have at least one error in each one of these seeds and so it must be at least distance five. So
you can actually stop early over here, so this lets us stop this search as well. In terms of the
results, here are some numbers comparing SNAP to two of the commonly used aligners today,
BWA and SOAP. They are both actually based on the Burrows-Wheeler approach. We can see where the existing aligners are. We're showing three numbers here. This is on simulated data, which lets us know where each read actually came from, and we are showing the percentage we aligned, the error rate, and the speed in reads per second. Here you can see BWA and SOAP give
you a trade-off between accuracy and speed. SOAP is a little faster but makes a lot of errors
and SNAP actually matches, in fact beats, the percent aligned of BWA. It has half
the error rate and it's also going about 30 times faster.
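Pulling together the pruning rules Matei stepped through just before these numbers (best and second-best tracking, the confidence window c, and the shrinking distance limit), here is a rough illustrative sketch. The names are made up, the disjoint-seed early stop, which additionally ends the seed search, is left out for brevity, and this is not SNAP's actual code.

```python
def choose_alignment(candidates, score_at, max_dist, confidence):
    """candidates: locations ordered most promising first (most seed hits first),
    so the first score is likely low and tightens the limit right away.
    score_at(loc, limit): distance-limited edit distance at loc, returning a
    value greater than `limit` when it bails out early."""
    d_limit = max_dist + confidence            # start: max distance plus threshold c
    best_loc, best_d, second_d = None, d_limit + 1, d_limit + 1
    for loc in candidates:
        d = score_at(loc, d_limit)
        if d > d_limit:
            continue                           # pruned without a full score
        if d < best_d:
            best_loc, best_d, second_d = loc, d, best_d
        elif d < second_d:
            second_d = d
        if second_d > best_d + confidence:
            d_limit = min(d_limit, best_d + confidence)    # case 1 in the talk
        else:
            d_limit = min(d_limit, max(best_d - 1, 0))     # case 2 in the talk
    confident = best_loc is not None and second_d > best_d + confidence
    return best_loc, best_d, confident
```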
>>: [inaudible] and we also scaled better.
>> Matei Zaharia: Yes. That is true. So this is 100-base reads with 2% differences. This is
kind of the current reads you have today. Another cool thing is in SNAP you can actually tune
some of the parameters, the max hits you check for each seed and trade-off between accuracy
and the speed, so if you don't need that much speed and you are okay waiting for a day to align this stuff, then you can tune SNAP to give you higher accuracy, so you could align with only a 0.01% error if you go about three times slower. Or if you are okay with the error rate of BWA
you could tune it to get higher speed, so you can trade-off between these. Yes?
>>: [inaudible] aligned, so you are missing 8 to 9 percent of the reads [inaudible]?
>> Matei Zaharia: They are, no, they are just reads we couldn’t align, so the main reason why
we couldn't align them is--these are the ones that we could confidently return one location for.
For the other ones there were multiple locations where they matched well. There is also
usually like half a percent where they have too many differences and we don't find any location
for them. This is how many we are confident about.
>>: So if you had a perfect [inaudible] would you have a hundred percent alignment and zero
errors?
>> Matei Zaharia: No, it would actually only be, like, I think in this case it would only be 93% or
something like that, or 94, because some regions are exactly identical and you can't know with
accuracy, so we're not counting those, but it would have zero errors. Actually, even the error,
because of the model you might have two errors that puts you closer to something than where
you actually came from, but I think you could calculate where that is.
>>: [inaudible] errors [inaudible] so the idea is if you get a higher mutation or an error in the
read by the machine it may move this string from the place that it came from to be actually
closer in edit distance space to a wrong location and then it's impossible to do this. You might
be able to just say I'm not going to call it because it's too close to too many of them and we
actually did that to some.
>> Matei Zaharia: Yeah. So this is kind of with today's data. One other neat thing is that if you
have reads with a lot more errors which some of the future technologies will have and also
which will happen if the person has a lot of mutations in one place, SNAP still performs pretty
well, so here with 10% error the existing aligners kind of fall over. They align less than 20% of
the reads and they actually go a little faster because it's easy when you are not aligning
anything to go fast, but we still do okay. Yes?
>>: So I would ask, if you want to combine this into just one number for percent error, you would need to incorporate the downstream processing into…
>> Matei Zaharia: Yeah, you would do it yeah.
>>: So what happens when you take just an off the shelf downstream processor and…
>> Matei Zaharia: If you compare SNAP with a… That's a really good question. Yeah. We don't
actually have a complete answer for that yet. We looked at [inaudible] whether alignment affects the
downstream callers and we found places where it does, like where BWA has misaligned some
reads consistently and you call mutations that aren't actually there, but I don't have a good
sense yet. Part of the issue is that there are other things that are hard in the downstream caller
and you have to get right first before this makes a huge difference. But we think it will help
eventually. Yeah?
>>: [inaudible] there's a, in most of the current processing pipelines there is a second step [inaudible] alignment that's done after the initial alignment, and that's because there's about 30 to 35% of the read space, we expect, that doesn't match a location. They are actually [inaudible] but not
yet.
>>: This is now looking at cancer genomes which are a little bit more complicated than
[inaudible] significant [inaudible].
>> Matei Zaharia: In cancer also a lot of the DNA like replication and sort of checking
mechanisms break down so you get a lot more mutations the way things happen. Let me see
what else I wanted to show you. Another thing we wanted to show is that as reads get
longer as they will in the future, SNAP actually scales better as well, so the existing tools, the
fastest ones today they have this backtracking which is a bad idea when the reads are longer
because you have to backtrack at more locations. Whereas, SNAP actually does better because
you can use longer seeds and you can get more disjoint seeds and filter out a lot of the
locations just by number of seeds matching. We also did some analysis of the speed up. One of
the nice things is this heuristic where we test first the candidate with the most exact seed matches; it means that usually the first candidate we score is actually the best one we will find, and we get to lower the limit at the start, and we eliminate 94% of the locations without actually scoring
them and 40% are just because the number of seeds didn't match, so this cuts down on the
time and the adaptive edit distance threshold also helped by a factor of four. Finally, just to
wrap up this part, we have been doing a lot of this stuff after and I don't have too much time to
go into it. One of the things is we generalize the algorithm to work with what I call paired end
reads. This is actually the most common type of read. You get a bigger molecule like 500 bases
and the machine reads 100 bases off one end and 100 off the other, but it can't go all the way into the middle, so now you have to align kind of two strings in a place and you know some
constraint on how far they can be from each other and there is a bit, you can use similar ideas
in this problem to actually align them as well. As Bill was saying we have spent a bunch of time
making this scale well, and it does. This is up to 32 cores, and part of it is because we are
careful about how many memory accesses and cache misses we do. We are careful to prefetch
data and basically like having systems people look at this actually does help improve the speed
to the point where it matters to practitioners, so that's kind of cool. We've also run it on some
real data. This is numbers that I showed before. These are real reads. The interesting thing
with real data is that a lot fewer of them align because there is some contamination and just
like stranger things happening in them, but in this one we are able to again sort of mirror the results in the simulation and get higher accuracy and also go about 20 times faster. So that's
the part of the talk that I had. Kristal is going to talk about what we are doing next. Yeah.
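A brute-force toy illustration of the paired-end constraint Matei mentioned (two ends of a roughly 500-base fragment, with a bound on how far apart they may land on the reference); SNAP's actual paired-end algorithm is smarter than this quadratic scan, and the names here are invented for the example.

```python
def pair_ends(end1_hits, end2_hits, min_sep, max_sep):
    """end1_hits, end2_hits: (position, edit_distance) candidates for each end.
    Pick the pair whose separation on the reference is within the allowed range
    and whose combined edit distance is lowest; returns None if no pair fits."""
    best = None
    for p1, d1 in end1_hits:
        for p2, d2 in end2_hits:
            if min_sep <= abs(p2 - p1) <= max_sep and (best is None or d1 + d2 < best[2]):
                best = (p1, p2, d1 + d2)
    return best
```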
>>: Would you consider this a solved problem given the low error rates and the speeds that
you are saying you are at I/O bound? Are you guys done?
>> Matei Zaharia: That's a good question. I think the accuracy can be improved even further.
We want to explore this more to see when it matters, but especially for things like cancer there
will be more mutations and also there will be like pieces of the chromosomes that usually aren't
together get cut and pasted next to each other and you might want to detect that, so now half of a read aligns in one place and half in another. In terms of speed I think the speed is
pretty good. Although there is always, you know, this is one genome in one hour but there are
people that have 1000 genome data set. If they want to realign that using SNAP or using higher
accuracy, that might take a while. So we will talk a little bit about what we are doing for
accuracy, but we also want to look at the downstream steps next because there are more
unknowns in there.
>> Kristal Curtis: As Matei mentioned, we are kind of capping out on speed but the accuracy is
actually something that we still want to look at more, because as far as alignment goes there are two parts of the process. One is finding good candidates to check the read against and the other is quickly checking against all of the candidates so that you can find the best match. In the
first case we have pretty much narrowed that down and we are getting a good list of
candidates, but in the second one we still want to make some improvements there. What
we've noticed in doing a sensitivity analysis to what parameters tend to affect the performance
of SNAP is that really the only parameter that makes a big difference is this max hits and so as
Matei mentioned this is the kind of the cut off for when you have a seed, how many hits it has
in the table as to whether you will consider it or not. And so what we noticed when we varied
that max parameter is that the error rate does go way down as you get a higher max hits and
you can also align a higher percentage of reads. So what this means is that at first cut you
would think that a seed that matches in a bunch of places is just indicating that your read is going to be ambiguous because it's matching so many places, but
what we actually find when we are willing to consider more and more places is that for some of
those reads we actually can find an unambiguous best match in the genome and the reason this
happens this way is as you were kind of alluding to in your question is the genome is not a
random string. It is actually highly redundant and what makes it difficult as well is that it is not
exact duplication that is really the main factor; it is similar duplications. This is why we really want to be able to consider more hits per seed, but then of course the downside is if we do consider more hits per seed the speed takes a big hit. How can we improve our accuracy, reduce our errors, and get a higher percent aligned while avoiding dropping off to the bottom
right of that curve? That's the part of this that we are looking at now. As I was mentioning, we
have these similar regions in the genome that make alignment difficult. Back to the seeds
explanation and the false positives, the odds of finding too many matches in the genome would be very low if we had the random string assumption, but the fact that we have similar regions is what
makes alignment take longer for us. This is an example here. You see that it is colored by the
type of base, so when we see a solid color column, that means that all of the strings are
matching exactly, and when we see different colors in a column that means that some of the
strings have differences from each other. These are some strings that we found via clustering
in chromosome 22 and they are all unique. There are around 400 of them. However, if we find
the consensus string for that group of substrings and we find the average distance from each
string in the cluster to that consensus, it's very low, only about six edits. This is the
phenomenon that is causing us to spend most of the cycles in the alignment process.
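The consensus-and-average-distance measurement Kristal just described can be pictured with a small sketch. This assumes the cluster members are equal-length strings compared position by position, which is a simplification of the real edit-distance computation, and the names are made up.

```python
from collections import Counter

def consensus(strings):
    """Column-wise majority vote over equal-length cluster members."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def avg_distance_to_consensus(strings):
    """Mean number of positions at which each member differs from the consensus
    (a Hamming-style stand-in for the per-string edit distances quoted above)."""
    cons = consensus(strings)
    return sum(sum(a != b for a, b in zip(s, cons)) for s in strings) / len(strings)
```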
>>: Are those all from the same person?
>> Kristal Curtis: This is from the reference genome.
>>: People are almost identical. There is one part per thousand difference between people so
[inaudible].
>> Kristal Curtis: Yeah.
>>: [inaudible] person?
>> Kristal Curtis: No it's actually kind of a mix because different sequencing centers were
working on it so they all kind of submitted part of it and it's all kind of stitched together. What
we really wanted to be able to do is test our read against the entire group of strings, however,
that takes a really long time. So we have this trade-off then which we kind of looked at and if
we are too aggressive in lowering the max hits then we are going to be paying the price in error.
Just to illustrate this graphically, what we have in many cases are these reads that are matching
against these similar regions and so based on how we set max hits we have some different
problems. If we take it too low that means we won't try any of the locations that the read is
going to match against because all of the seeds have too many hits, so we won't be able to align
that one at all. If we take it too high, we will try all of the locations, but then we will spend an
inordinate amount of time trying to align that read, so what happens in practice is because we
have some middle-of-the-road setting for max hits, what we end up doing is we test some of
the locations that we would match, but not all of them, and therefore this is where most
of our alignment errors are coming from. What happens is we test it against a few locations.
We find one that works well and we record that; however, there was actually a better one
somewhere else that we didn't test. So this is the main thing that we are trying to address.
This is important definitely for the downstream processing, so we really want to get this right.
What is our approach to fixing this? Well, first we are working on pre-computing those similar
regions in advance so that we can use that information quickly during the alignment process.
Second, rather than, you know, getting the cluster and then comparing individually against each string in that cluster, we are working on a representation that kind of reuses the work across those comparisons so that you can do it efficiently. Currently this is a work in progress, but what we have so far is a parallelizable algorithm for detecting similar regions using SPARK, which is a project that Matei has worked on with some folks
at Berkeley and it is a really good framework for cluster computing and we also have a group
edit algorithm that gets quite a good speedup over comparing against all of the strings fully.
The way that we are detecting these similar regions is very simple, but I'll just quickly illustrate
it graphically. What we have is for each substring in the genome, we represent it as a node in
the graph and then we look at the other strings in the genome and we compute adjacencies. So
this one only has two edits from the original substring so we draw an edge. However, in some
cases they have too many edits and we don't draw an edge. Then whenever we find that two clusters share a member, or have members that match each other well, we are going to merge those clusters, and we continue to do that, and eventually we end up with disjoint clusters and
each one of those would be the group that we want to compare a read against. Of course this
is expensive because we have this n by n adjacency matrix that we want to compute over the
whole genome, so it takes too long to do it naïvely, so the way that we have worked on this is
we have partitioned this matrix so that each SPARK task will be working on a separate piece, and what we do is we index only the short part of the genome that this task is responsible
for. Then what we do is we run our union find so that we get a set of clusters and we do this
for each of our tasks. Then, now we want to be able to merge these clusters so that we get the
actual set for the entire genome and the way that we do that, remember that each cluster has a
bunch of different strings in it and whenever we find strings that match across clusters we will
merge those, and we continue that for the whole batch and then we get some final set of
clusters over the entire genome. Very simple. Yes?
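A single-machine sketch of the clustering step just described: treat each substring as a node, connect any two that are within a small edit distance, and read off the connected components with union-find. The real pipeline partitions the n-by-n comparison across SPARK tasks and then merges clusters that share members; this naive quadratic version is only meant to show the idea, and the names are made up.

```python
def cluster_similar_substrings(substrings, max_edits, edit_distance):
    """Group substrings into clusters: any two within `max_edits` of each other
    (under the supplied edit_distance function) end up in the same cluster."""
    parent = list(range(len(substrings)))       # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]       # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Naive O(n^2) pass over all pairs; the real version partitions this work.
    for i in range(len(substrings)):
        for j in range(i + 1, len(substrings)):
            if edit_distance(substrings[i], substrings[j]) <= max_edits:
                union(i, j)

    clusters = {}
    for i, s in enumerate(substrings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())
```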
>>: Are you going to do this one per species?
>> Kristal Curtis: Definitely, you would. So far…
>>: Why not just do it, take a few weeks and [laughter].
>> Kristal Curtis: Yes, I guess you could certainly do that. We've been focusing on only human
[laughter]. We have been focusing on only human. If you, yeah, you could certainly run this on
whatever reference genomes you wanted to work with, but, so we focused on getting a parallel
implementation just because it's proved to be intractable to work on in a single node.
>>: I thought you only needed to learn it once per species when it's done when you're
developing it [laughter].
>>: [inaudible] once per species for Kristal’s own sanity.
>>: Yeah, the length of the clusters might change. You don't know what the best way
[inaudible].
>> Kristal Curtis: Yeah, we are still working on fine tuning the parameters, so we would like to
have a shorter cycle for running it. Yeah, this is something that we precompute so it's not
something that's going to affect alignment performance. How do we use this information in
SNAP once we’ve gotten it? Back to the case where we have a read that is going to match
against a bunch of different places in the genome, what we do is we notice that one seed is
matching against one of these flagged locations, so we can kind of flag them in advance if they
belong to a cluster. Then what we do is we just grab all of the fellow cluster members of
that location and we just use our, that edit distance kind of aggregate algorithm I was
mentioning to test against the entire group rather than testing it against some fraction, like the
current version of SNAP does. In this way we can kind of get some of the best of both worlds in
that we are saving time but we are still able to compare against all of the strings in that group,
so we are able to avoid errors and, in fact, what we see is that once we've incorporated this
into SNAP, if we compare against the standard form of SNAP, we get a big reduction in error.
This confirms the fact that really the errors that we were getting from the standard version of
SNAP are caused by these similar regions. As I'll kind of get into in the last part of the talk,
these alignment errors can really be important downstream, so even though it's kind of a small
fraction overall, it can lead us to drawing some bad conclusions if we have those errors. What
is the rest of the pipeline? Back in the beginning of the talk, Matei kind of talked about how the
idea of reconstructing a genome is taking all of these reads and aligning them. That's kind of
what we have been talking about so far and then the main goal is to know what that individual
actually has for their genome in each position. That is a process called variant calling. In the
kind of simplest view it is just this idea of taking a consensus of all of the reads that map to a
particular location and seeing what they indicate about the genome. Actually this is a very
difficult process and we really are focusing on trying to improve the accuracy of this. So why is
variant calling difficult? There are several factors. First, there are some that are just inherent
to the genome that make it difficult. These aren't going to be going away. The other ones have
to do with technology, so they can potentially be targeted, but there are things that we have to
cope with for now. One, when we are actually taking the DNA out of the cells and producing
the sample that is going to be sequenced by the machine, we have this process called PCR,
polymerase chain reaction, and this is not an unbiased process for kind of amplifying the DNA
so that we have enough to sequence. Again, we also have systematic problems that the
sequencers make when they are producing the DNA reads and as we talked about the
alignment errors; that's also going to confound variant calling somewhat. The heterozygosity,
what this means is that for all of your chromosomes you have pairs of chromosomes that are
the same length and you get one from your mother and one from your father. At each point on
those pairs of chromosomes, if you have the same value on both copies of that
chromosome, then this is called homozygosity, but if you have different values on the two then
this is called heterozygosity. So why is that a problem? Well, when we get reads that map to
that same location on the genome, they could tell you different things about what the person
actually has at that site, so it's not always straightforward to know is this actual like true
heterozygosity or is this just that we have gotten some sequencing errors from a couple of the
reads? So this is one of the challenges. Another challenge, as Matei was alluding to, is we have something called structural variants, which are larger scale changes. So far we have been
talking about mostly mutations that are one, maybe up to a few bases in length, but these are
changes that can be arbitrarily large even up into the mega bases. Especially in the case of
cancer you really see a lot of these large-scale changes, so what do they look like? Well, first
you can have a deletion from the reference genome and one way that you can detect this is
that you have a bunch of reads that map normally, but then you have a segment of your
reference genome were no reads map at all. Also, Matei mentioned that you get these reads in
pairs. Usually you know how far apart those pairs are and sometimes you see that the distance
between pairs is off from what it has to be given the parameter of the sequencer, so in this case
reads that actually come from, you know, a certain amount are looking like they are way too
close together. That's one large-scale problem. Another thing is you can have a sequence
inserted. This can be from somewhere else in the genome or it can be a completely novel
sequence that you won't be able to detect at all. In the case of a completely novel sequence
being inserted, you'll have some reads from the patient's genome that can't be mapped
anywhere in the reference genome because it is a novel sequence. Then also you will have
some reads that came from, or should be, close together in the reference genome mapping
way too far apart in the patient's genome. These are also some of the things that make variant
calling more difficult. Another is you can have…
>>: How would you know where they mapped in the patient's genome to know that they were
far apart in mapping? Or is that the two paired ends?
>> Kristal Curtis: Yeah, yeah you would definitely always see some paired ends. That's a very
valuable signal. One of the trickier types of structural variants to look at are these duplications.
In this case this is called the tandem duplication because you have part of the sequence
duplicated so it's adjacent to the original sequence and what makes this difficult is that one
thing is that you see too many reads mapping to that portion on the reference genome, but
then you also have some reads that are kind of from the overlap of these two that are not going
to be mappable because they are just going to look funny. These are very difficult to detect and
they are definitely an open problem for developing better tools there and we are looking into
that. Another thing that makes variant calling challenging is that as I was kind of alluding to,
the process of converting DNA to a bunch of short fragments that can be sampled during the
sequencing is biased and what are the biases? One is the molecule length and another is the
composition of the molecule.
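One of the paired-end signals Kristal described, the distance between the two ends being off from what the sequencing library should produce, can be sketched as a simple filter; this is only a toy hint-generator with invented names, not a real structural-variant caller.

```python
def flag_discordant_pairs(pairs, expected_insert, tolerance):
    """pairs: (pos1, pos2) reference positions where the two ends of a fragment
    aligned. A separation far from the library's expected insert size hints at
    a structural variant (deletion, insertion, or other rearrangement) nearby."""
    return [(p1, p2, abs(p2 - p1) - expected_insert)
            for p1, p2 in pairs
            if abs(abs(p2 - p1) - expected_insert) > tolerance]
```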
>>: Are existing tools trying to find those or are they mostly [inaudible]?
>> Kristal Curtis: There has definitely been a lot of work in finding structural variants, but the
accuracy is still pretty bad and also each tool is kind of like a point solution, so some are good at
finding short insertions. Some are good at finding large deletions. Some can find more types of
structural variants but not very precisely, so it's still kind of not clear how to do it in a way that
is holistic and can find all of the different structural variants. That is why that problem is still
very interesting and especially because structural variants actually have a lot of interesting
influence on cancer and other kinds of diseases.
>>: In a mutation, does it look like kind of heterozygosity, or does a mutation take a complete,
go to both chromosomes?
>> Kristal Curtis: Both can happen. Of course, usually you would have only one, but then in
some cases you might have both of the genes knocked out on both of the chromosomes.
>>: I guess I don't know what causes mutations. [inaudible] would only affect one
chromosome but…
>> Kristal Curtis: Yeah, they can, but that's kind of why cancer takes a long time to get, because
it's just like accumulating the bad luck or like many, many years.
>>: So what happens is like in the human population there were mutations on the [inaudible]
that are now part of the population. There isn't really one gold standard like human genome
and people have mistakes relative to it. There is just diversity.
>>: You can wind up inheriting two copies of--you look at a bit of the population tree and
maybe there is some mutation that happened in chromosome 14 and one individual copying of
that in the germ line and is spread throughout the population and maybe both of your parents
may have gotten that through, you know, the fact that your great, great, great grandparents
sometimes are the same as each other… [laughter].
>>: [inaudible] necessary.
>>: That's not true for everybody, right? The whole thing when you start coming back
generations in you go to the population that is bigger than the planet and people get confused
and say what happened. Well, what happened is somebody had multiple types of grandparent
and that kind of thing can lead to getting mutations and both copies of [inaudible]
chromosomes that came to you.
>> Kristal Curtis: For sample prep what makes this difficult is that there are biases towards the
size and the motif and what this looks like is you start with your original full strand of DNA that
you have replicated, sorry, that you have gathered from the sample. Then you send it through
some process that fragments the DNA, but then what we do is apply PCR to amplify those
fragments so that they can be sequenced through the sequencer, and then you see that
some fragments are preferentially amplified. This means that, like I was showing you here with the higher coverage depth over that duplication, sometimes a higher coverage depth can just
be because of the way the sample is prepared. Another thing that is pretty subtle is strand bias.
I will walk through this. This is from a visualization tool from some folks that created a lot of
libraries for looking at genomic data and what you see at the top this T, T, T, C et cetera is the
reference genome at this position and this is in chromosome 20. What you see are these little
dots and commas. These are reads. The dots are reads that are mapping in the forward
direction and the commas are reads that are mapping in the reverse direction. We talked
about paired end and what happens in paired end is you read them in the reverse directions, so
one gets read forward and one gets read reverse.
>> Bill Bolosky: DNA is paired; you have the double helix thing that you always see pictures of
and in double helix the bases, there are bases that go together like types of bases that go
together and they are directional. The direction of one helix is one way and one is the other, so
the information in the two copies of the helix is exactly the same but they are in opposite senses, and when the machine goes out and reads them you can't tell which one you've got,
so one of the problems that comes in alignment and mapping that we didn't mention is that
you have to check both senses, whether it's forward or reverse complement, and then this
just tells you what sense the thing was matched.
>> Kristal Curtis: Thanks.
>>: This isn't a class in biology. [laughter].
>> Kristal Curtis: So what we have here, the reads that you can see that are these commas are
showing a bunch of T’s so where you see letters the reads are recording the difference between
the individual and the reference genome. You see that these T’s are only showing up on the
ones that are going in the reverse direction and so this is a case where we don't want to trust
the reads when they are telling us the difference from the reference genome because they are
only showing up in one direction and not the other. This is just one and probably the simplest
example of a sequencing systematic error that you can get, but definitely we have some other
cases as well. Also, these systematic errors in sequencing can also show up based on the motif,
so what we see here is that for one thing we see more errors when the reference sequence is a
T, which we see with the red T there. The other thing is that we see more errors when the
preceding two bases to the position are both G's. Just because of the way the chemistry
works in reading off the DNA in the sequencer, it tends to favor, or I guess it tends to have a
harder time in these cases. The fact that these errors aren't uniform makes this process more
A further difficulty is that variant calling is also very hard to evaluate, because we have very limited ground truth. One thing you can do is sequence samples of the genome with different sorts of technologies and then compare what you get with the short reads to what you get with those other technologies, but there really aren't a lot of standard benchmarks you can use to evaluate variant calling; in fact, all of the variant-calling work so far has relied on fairly ad hoc methods of evaluation. One thing we are working on is developing more standard benchmarks for our own development, just so we know whether the things we are doing are actually improvements. Another thing you can do, if you have the data, which is really valuable, is to use trio data. So what is a trio?
A trio is two parents and their child, and the reason it is really valuable is that you can assume the child's DNA has to be consistent with the parents' DNA. The reason you can make this assumption is that the rate of novel mutations passed from parents to child is very low, so if you see something strange, it is much more likely to be a problem with the sequencing, the variant calling and whatnot, than a truly novel mutation. In some of these examples you can see that the child's DNA is consistent with the parents' DNA, because at each site the child's alleles can be explained by taking one from each parent, but other cases are invalid because the child's DNA can't be matched against one of the parents. So this is another really valuable signal that we can use when we are evaluating SNP calling, whenever we have trio data.
This is still very preliminary, but one thing we did is check against the existing tools to get a clearer benchmark of how they were performing, and also to try out some simple ideas and see if we could do as well as or even better than those tools. So what did we do? We took some real data, the Yoruban trio, from individuals from Africa, so we have two parents and a child, and we looked at only one chromosome. We aligned all of the data against the reference, took the reads that mapped to chromosome 20, which is one of the smallest ones, and realigned them. Once we had the reads filtered onto chromosome 20, we ran a prototype SNP caller that we developed and evaluated it using those Mendelian conflict assumptions I just mentioned. The other thing we did is look at simulated data. This is a simulator we have been working on in our group, and it's meant to be more realistic than existing simulators because it uses real SNPs that people have found in various sequencing projects. We restricted the simulator to single-base substitutions, so it doesn't have any short [inaudible] or any structural variants; in our thinking this simulated case should have been pretty easy. We did something similar there, but of course in the simulated case we know what the variants are, so we can measure accuracy directly.
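A small sketch of that direct scoring against simulated ground truth, with invented call sets; each call is treated as a (position, alternate allele) pair and compared against the planted variants.

    def score_calls(called, truth):
        called, truth = set(called), set(truth)
        tp = called & truth     # correctly called variants
        fp = called - truth     # called but not planted: false positives
        fn = truth - called     # planted but missed: false negatives
        return len(tp), len(fp), len(fn)

    truth  = {(60125, "T"), (61993, "C"), (63410, "A")}
    called = {(60125, "T"), (61993, "C"), (64002, "G")}
    print(score_calls(called, truth))  # (2, 1, 1): two correct, one false positive, one missed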
So, what are the preliminary results? This is a preliminary evaluation, but we looked at two different pipelines. One is the CASAVA pipeline, produced by Illumina, the main sequencing company we have been talking about; it called about 128,000 SNPs and had 162 conflicts. We also ran the GATK on the simulated data. GATK stands for Genome Analysis Toolkit; it comes from the Broad Institute in Massachusetts, it's pretty much the state of the art for variant calling, and it's the tool most widely used by practitioners in the field. What we found on the simulated data is that even though the data was very simple, there were still 66 false positives in a pretty short stretch of DNA, just chromosome 20. It's also very slow; it took four hours to do this. So what did we do? On the trio, while calling about the same number of SNPs, we got about half the conflict rate, and on the simulated data we got an order of magnitude fewer errors, with something that runs in two minutes instead of four hours. We think we have some insights from this that should let us improve it further, make it work on the full genome, and extend it to other types of mutations beyond plain SNPs. Looking forward, our efforts over the next few months are focusing
on formalizing this feasibility study into a full-scale pipeline that takes us from DNA reads to a
fully reconstructed genome. Looking at the way that this is currently done, people have broken
this up into individual stages and then separately optimized each stage. To get the data from one stage to the next, they use a lot of thresholding: whenever there is uncertainty, you have to just collapse it to one reported value and push the data on for the next stage to handle. Building on all of the work that has been done, instead of that kind of disjoint pipeline we are working on an end-to-end system that, rather than truncating information very early in the process, propagates it through the entire process, so we can use that uncertainty in a better way than just throwing it away early. Just to pop up a level, as Matei mentioned, DNA sequencing is becoming realistically affordable, its accuracy is improving, and
this is really going to have great impact on medicine in a variety of cases. One of the things that
we are really excited about is producing better treatments for cancer. The cool thing for us is
that it's actually involving a lot of computational challenges in addition to wet lab challenges, so
we feel like we can participate because a lot of the existing tools lack in both speed and
accuracy and that's becoming a big problem if you really want to use this in a clinical setting.
Based on our work mostly with SNAP but also with our initial work with the rest of the pipeline,
we have some evidence that we can help with both of those cases. If you are interested you
can check out our website. We have made this open-source. We have it available for
download. Actually, we presented it at a conference a few days ago and we've already gotten a lot of interest and people trying it out. So, I don't know if there are any more questions.
>>: Speaking of the end-to-end process: we saw the error rates for alignment, and I'm curious whether you have run any studies to see how those propagate down to variant calling.
>> Kristal Curtis: We've done some benchmarking, but it's been mostly just running BWA, since that's the standard baseline, and seeing what our ultimate variant-calling results are. We haven't really done something where we vary the parameters to get better or worse alignment accuracy and see how that affects the variant calling. That is definitely something we are interested in looking at.
>>: Even in the comparison of SNAP versus BWA, can you see that the difference in error rates actually helps with the variant calling?
>> Matei Zaharia: In what was shown there was at least one example of reads that BWA mismapped and we didn't, which caused it to call mutations, but that is a very anecdotal setting. We need to do the same thing on the whole genome, and we just haven't gotten very far with it yet.
>> Kristal Curtis: And part of the obstacle is just that GATK is so slow, right? We haven't even
been able to run it on a whole genome because it's so expensive to run.
>> Bill Bolosky: At a conference we were at, I listened to one of the 1000 Genomes Project people talking about this, and the errors that have come out of BWA have caused them such problems that he was talking about giving up on re-mapping altogether and doing an assembly for all of the genomes, because it was driving him bonkers. So that's some evidence that it's really a problem. We're hoping we can do better, and by building an integrated pipeline you can correct for the errors later. I think [inaudible]; does GATK do anything with unmapped reads?
>> Kristal Curtis: No, not with unmapped reads.
>> Bill Bolosky: Our plan is to use the unmapped reads and to remap things as you're looking at them, because there is a lot of stuff you can tell is just wrong. [inaudible] reason you can't tell [inaudible].
>> Kristal Curtis: Other tools do look at unmapped reads, but they are not integrated with GATK, so that's another obstacle; it's just making everything work together. Even in our benchmarking process it was pretty painful when we tried some of these things out.
>>: I was wondering about something after seeing part of the talk and [inaudible] part of the talk. The approach is to take seeds from the reads and match them to the genome, and depending on the depth of coverage you are going to have many more seeds to look at than places in the genome to compare against. But the theoretical complexity, not the real complexity, seems to be the same if you go the other way around, if you actually take the seeds from the genome and try to find them in the reads. Again, you could do the same trick with stopping early, but now the stopping would be based on a different criterion: you want to see a certain number of matches, because you know roughly what the coverage is, so you stop after you've matched a certain number of times. Has anybody thought about whether that could be worked out to be almost as efficient as the other way around? Because then some of these structural-variant and other issues would perhaps be easier after the fact. You would pay a bit of a price, but I don't know; I haven't thought about it much, it just occurred to me now.
>> Bill Bolosky: We have thought about doing something similar to that for structural variants. The thought was that you go through and get rid of the easy stuff first, because most of it is easy: most reads map unambiguously with few errors, so you can look at them, call them and set them aside, and then go back to the remaining reads and try to build them into contigs. You can find overlaps [inaudible]. [inaudible] overlap, then you build the long stretches of DNA they seem to represent and try to fit them into the places where you found breakpoints in the genome. At that point you could do that. Whether it makes sense to invert the process from the beginning, I'm not sure.
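A toy version of the inverted scheme the questioner is describing: build a hash index over fixed-length seeds drawn from the reads, then slide along the reference and look each reference seed up in that index. This is only a sketch of the idea, not SNAP's algorithm or any production tool, and the seed length and sequences are arbitrary.

    from collections import defaultdict

    SEED = 8  # arbitrary seed length for the example

    def index_reads(reads):
        idx = defaultdict(list)  # seed -> list of (read id, offset within read)
        for rid, read in enumerate(reads):
            for off in range(len(read) - SEED + 1):
                idx[read[off:off + SEED]].append((rid, off))
        return idx

    def scan_reference(reference, idx):
        hits = defaultdict(set)  # read id -> candidate alignment positions
        for pos in range(len(reference) - SEED + 1):
            for rid, off in idx.get(reference[pos:pos + SEED], ()):
                hits[rid].add(pos - off)
        return hits

    reads = ["ACGTACGGATTA", "TTAGGCCAACGT"]
    reference = "GGGACGTACGGATTAGGCCAACGTCCC"
    print(dict(scan_reference(reference, index_reads(reads))))  # {0: {3}, 1: {12}}

The early-stopping criterion the questioner mentions would go inside the reference scan: once a read has accumulated roughly coverage-many consistent hits, you could stop collecting seeds for it.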
>>: The thing I don't understand is that it doesn't really help with structural variants, because it still tells you these reads look like they came from here; it seems like ultimately you need to do that detective work at the end. Maybe it would help with the reads that are split across both ends. But some algorithms have done this; for example, some versions of BLAST take the reads, say a million at a time, index those, and then scan through the reference genome when it's too large for them, so in a way it's been done. I'd have to think more about how it affects what you can do.
>>: I haven't thought about it much, but when you think about this problem, you always think of it as mapping the reads.
>>: Yeah, yeah.
>>: But I don't know. In the end, if it comes down to the alignment, then maybe inverting the process could be worked out. It's kind of straightforward.
>> Kristal Curtis: There definitely have been some aligners that choose to index the reads instead of the reference. They are a little bit old, so I don't know if we could really compare directly against their performance, but there is also one aligner whose goal is to find all of the matches for every read, and it indexes both the reference and the reads and then sort of walks through them together. There has also been some work on detecting structural variants where you use assembly to reconstruct the patient's genome, then align the reads both to the reference and to that reconstructed genome, and that gives you a signal for how to locate those structural variants.
>>: Even though that may have been done in the past, just like the way BLAST was using the
seeds, they were able to use that idea much better because a lot of these things went
unwritten. They are imperfect so they…
>> Kristal Curtis: Right. That could be an interesting thing to incorporate.
>>: Yeah [inaudible] subset of the reads for which it works very well. If you are dealing with
reads that only have a couple of matches in the genome, the current process is probably the
best, right? It really depends on how easy it is to find the match in the reference string.
>>: But for reads which have a bunch of matches in the genome, the first approach would be better, so I think that's [inaudible].
>>: Suppose you have a specific goal in mind for one particular patient: you don't need the entire sequence, but you want to know which [inaudible] colon cancers they have. Can you make it holistic even beyond that and say, this is the end goal; I want to adapt the whole pipeline to optimize for answering that one question I have?
>> Kristal Curtis: I see, yeah. One thing people do is narrow their focus when they gather the initial reads: they capture what's called the exome, which is the coding sequence, the genes in your genome, because that's actually only about 1% of the entire genome. You also have all of this other stuff that orchestrates which genes get expressed, or that could have just been randomly inserted over the course of evolution; we don't really have a good sense for it. Some people do just pull out that exome, sequence it, and then see if they have mutations where they think they might be, but the problem with doing that is that there could be a lot of other related mutations that you are missing. So it's kind of a looking-under-the-lamp-post approach, which is somewhat problematic when we still don't understand a lot of the other things that could be related. But in terms of how the pipeline might change, if you're focusing on certain areas then everything becomes a lot easier, because you have really narrowed down the uncertainty: you know which parts of the genome you're looking at, and it also makes a lot of the stuff we were talking about with similar regions much less of an issue.
>> Bill Bolosky: Okay. Anything else? Thank you. [applause].