>> Yuval Peres: We are starting the next talk. I was once in a lecture by Chi Lu who explained
that at least when we talk to the outside world we shouldn't talk about exploration and
exploitation but about learn and earn. Similarly instead of greedy algorithms they should be
locally optimizing algorithms. In the same spirit we'll next hear a talk from Miklos Racz who will
talk about sequence assembly from perturbed bow and arrow reads.
>> Miklos Racz: Thanks Yuval. My mom really loves this new title. I'll be talking about joint
work with Shirshendu Ganguly and Elchanan Mossel. The motivation is DNA sequencing. DNA
sequencing has really revolutionized the biotech industry in the past couple of decades because
the DNA sequences of organisms contain so much information about them. There are tons of
applications. Unfortunately, we cannot just read the DNA sequence of an individual from left to
right. That would be great. What the current technologies allow us to do is to get a bunch of
short or not so short substrings of the long sequence that you're interested in. So x is the DNA
sequence and so this is just a sequence composed of A's, C's, G's and T's and what you get is a
bunch of short substrings which are called reads. The problem we are facing is to reconstruct
this original sequence x from the reads r. There are a bunch of sequencing technologies that
have been developed in the past couple of decades. Basically, what we would ideally like are
long reads that are error-free or have very few errors. However, this is hard to achieve, so
typically the technologies trade off between these two quantities. Originally, Sanger had
developed a type of sequencing which was unfortunately very expensive. What's really brought
DNA sequencing to widespread use was the development of next generation sequencing
methods such as Illumina about a decade or so ago. This allowed us to get high throughput
data really cheaply. Unfortunately, in order to get really low error reads, this was restricted to
having very short reads on the order of length of maybe one hundred.
>>: When you say the error rate, is that a percent of wrong symbols?
>> Miklos Racz: Yes.
>>: I'm assuming they are uniformly spread?
>> Miklos Racz: Yeah, so the point is that these different sequencing technologies have
different types of error profiles. There's some noise involved, but there are some biochemical
things going on and so there are some systematic errors that could go wrong depending on how
it's done. Because these next-generation sequencing methods only produce short reads, it led
to highly incomplete assemblies and they were really fragmented because there were big parts
of the genome that you couldn't connect properly. Partially, in order to get around this
problem the new kinds of technologies that are emerging are ones that get much longer reads,
so over 10,000 base pairs, but unfortunately, the current versions of these still have high error
rates, so upwards of 10 percent. The two main examples are PacBio's Single Molecule Real-Time sequencing and Oxford Nanopore Technologies. What we see is that there is a
multitude of technologies and they have different error rates and different error profiles. The
current assembly algorithms typically are tailored towards the particular technology that is
used to produce the reads. This has a good side because if you're using a particular technology,
you want your algorithm to exploit what's going on there. On the other hand, these
technologies are evolving at a rapid rate and so 10 years from now we are probably going to
have different technologies and the question is whether these same algorithms will still be good, and
how much tailoring to a particular technology will make them unsuitable for other types of technologies.
We'd like to understand what are the types of general robust sequencing algorithms that can
work well with respect to different sequencing technologies. In order to study this what we're
going to do is not commit to one particular error profile, but instead, we'll study an adversarial
corruption or error model for the reads. What we assume is that instead of getting a true read,
R, what you get is a corrupted read. What was your phrase? I forget already.
>>: Perturbed.
>> Miklos Racz: Perturbed, yeah you get a perturbed read. The types of errors that happen are
deletions, insertions and substitutions. In this example these two A's in red got deleted. This
yellow G turned into an A and this C was inserted in between the G and the A. These
are the typical types of errors that happen. The edit distance is the distance that quantifies
how much of this error has happened. The edit distance between two sequences is the
minimum number of insertions, deletions and substitutions required to take one sequence to
the other. The only thing we're going to assume is that we know that the edit distance of the
perturbed reads is at most something from the true reads. In particular, we're going to assume
that the edit distance is at most epsilon times the length of the true read, which I'll denote by
capital L. This means that an epsilon fraction of the read can be arbitrarily perturbed,
corrupted. Given this error model, here is the approximate reconstruction problem. We are
going to model the sequence of interest x as a uniformly random sequence over this four letter
alphabet, of length n. Of course, this is not realistic but it makes things simpler. You can
study this problem for arbitrary sequences and so I'll mention something along those lines later.
For simplicity, let's stick to the simple setting. Once you have this sequence x you draw capital
N many reads. What this means is that you just pick uniformly at random positions along the
sequence and then you look at a substring of length L there. Then there's some adversary that
can apply any perturbation they want up to a fraction of epsilon of the read length. Then we
are given the set of corrupted reads and our goal is then to approximately reconstruct the
original sequence x. Our algorithm will output some sequence x hat, not necessarily of the
same length n because we don't even necessarily know what that is. What we want it to satisfy
is that it is close in edit distance to the original sequence. Because of this adversarial error
model you cannot really hope to get smaller than linear in epsilon n, because of what the
adversary could do. Here is your sequence, and you chop it up into length L parts, and the
adversary says it will never give you this piece. These are of length L and this piece is of length
epsilon L. Then an epsilon fraction of the whole sequence is not given to you, so you cannot really
do much better. What we want is some algorithm that gives you a small constant factor
approximation. Any questions about the set up?
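As a concrete illustration of this setup, here is a minimal sketch: a uniformly random sequence over {A, C, G, T}, reads drawn at uniformly random positions, and one possible adversary that spends its edit-distance budget by deleting symbols. The function names and the particular adversary are illustrative only; the model allows any perturbation within edit distance epsilon L of the true read.

```python
import random

ALPHABET = "ACGT"

def random_sequence(n, rng):
    # x is a uniformly random string over {A, C, G, T} of length n.
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def draw_reads(x, num_reads, read_len, rng):
    # Each read is a length-L substring starting at a uniformly random position.
    starts = [rng.randrange(len(x) - read_len + 1) for _ in range(num_reads)]
    return [x[s:s + read_len] for s in starts]

def adversary_example(read, eps):
    # One possible adversary: delete the last floor(eps * L) symbols.
    # Any perturbation within edit distance eps * L of the true read is allowed.
    budget = int(eps * len(read))
    return read[:len(read) - budget]

rng = random.Random(0)
x = random_sequence(10_000, rng)
reads = draw_reads(x, num_reads=2_000, read_len=100, rng=rng)
corrupted = [adversary_example(r, eps=0.05) for r in reads]
print(len(corrupted), "perturbed reads; first one has length", len(corrupted[0]))
```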
>>: Do we know anything about the [indiscernible] error model? Like if the errors are random
[indiscernible]?
>> Miklos Racz: Yeah. If you just have IID noise then there are results that say you can
average them and you can do well. I'll refer to that a little bit later.
>>: Is it somewhat realistic that the errors might not be just IID, because they depend on patterns in
the sequence and what goes on?
>> Miklos Racz: Yeah. This problem, not this particular problem but the general problem of
reconstruction from reads has been studied by a lot of people in the past several decades. So the
two main obstructions to reconstruction are very well known. The first is that if your reads are
too short then this will lead to repeats and then this will lead to ambiguity in reconstruction.
Let me explain in more detail what I mean by this. Suppose you have a substring A here that
repeats in the sequence and also a substring B here that repeats and that your reads are
shorter than the length of A and B. Even if you had all the reads, you wouldn't be
able to distinguish between the sequence at the top where Y is on the left and Z is on the right
and the sequence at the bottom where Z is on the left and Y is on the right.
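A toy example of this ambiguity (the strings below are made up for illustration): two interleaved repeats with the segments Y = G and Z = T swapped between them give two different sequences whose multisets of length-3 reads are identical.

```python
from collections import Counter

# Two copies of the repeat "AAAA" and two copies of "CCCC", with Y = "G" and
# Z = "T" in between; x2 is x1 with Y and Z swapped.
x1 = "AAAA" + "G" + "CCCC" + "AAAA" + "T" + "CCCC"
x2 = "AAAA" + "T" + "CCCC" + "AAAA" + "G" + "CCCC"

def read_multiset(x, L):
    # Multiset of all length-L substrings, one per starting position.
    return Counter(x[i:i + L] for i in range(len(x) - L + 1))

L = 3  # reads shorter than the repeats
assert x1 != x2
assert read_multiset(x1, L) == read_multiset(x2, L)
print("different sequences, identical length-3 read multisets")
```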
>>: When you get the reads is there no ordering of where they come from?
>> Miklos Racz: No. They are just in a bag. So somehow the technology that gives you this,
you create a bunch of replicas of the DNA sequence and then you put it in this big soup and you
blow it up and you get out these reads. That's how it goes. Okay. Your reads have to be long
enough. And the other main obstruction is that you have to have enough of them to cover the
sequence. Again, if there are some parts of the sequence that you have no information about, then
you have no way of reconstructing what's there. Also, even if you could reconstruct everything
on the left and everything on the right you don't know which is on which side.
>>: Why are you giving them credit for coupon collection?
>> Miklos Racz: This is essentially a coupon collector problem, but that's okay. In this setting
they were the ones who did this, but they also applied it to real sequences and computed these
various quantities and things like that. This is a coupon collector problem. This is the number
of reads you need in order to cover the sequence with probability of at least one minus delta.
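As a rough sanity check (a back-of-the-envelope coupon-collector estimate, not the exact constant from the slide), about (n / L) * log(n / delta) uniformly placed reads should cover everything; the simulation below treats the sequence as circular to sidestep edge effects.

```python
import math
import random

def coverage_estimate(n, L, delta):
    # Heuristic coupon-collector estimate for the number of length-L reads
    # needed to cover a length-n circular sequence with probability >= 1 - delta.
    return math.ceil((n / L) * math.log(n / delta))

def fully_covered(n, L, num_reads, rng):
    # Monte Carlo check: draw reads with uniform starts and test coverage.
    hit = [False] * n
    for _ in range(num_reads):
        s = rng.randrange(n)
        for i in range(s, s + L):
            hit[i % n] = True
    return all(hit)

rng = random.Random(1)
n, L, delta = 10_000, 100, 0.1
N = coverage_estimate(n, L, delta)
trials = 50
successes = sum(fully_covered(n, L, N, rng) for _ in range(trials))
print(f"N = {N} reads: full coverage in {successes}/{trials} trials")
```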
>>: I must have missed this on the slide. The reads are selected from uniformly random locations?
The adversary only enters when the correct…
>> Miklos Racz: Right. So this first obstruction goes back to Ukkonen and this goes back to
whoever it goes back to. [laughter]
>>: The coupon collector.
>> Miklos Racz: Yeah, the coupon collector. And this is often referred to as the repeat limited
regime and this is referred to as the coverage limited regime. If epsilon is equal to zero so you
have error-free reads, then this problem is the same as if you want to reconstruct this sequence
exactly. So this has been studied in various settings. In particular, in this exact setting where
you have shotgun sequencing and a random string, this was studied by Motahari, Bresler and
Tse who showed that these are the only obstructions to reconstruction. So more precisely,
back to the random sequence, they do more than that. It's more general but let's stick to the
simple setting and the reads are logarithmic in n. So if you have a random sequence of length n
then the longest repeats will be on the order of log n. Suppose we require the probability that
the algorithm is correct to be greater than one half.
Then if the reads are shorter than this threshold so that you do have repeats in your random
sequence of this length, then this exact reconstruction is impossible because of the
aforementioned fact. And if you are above this threshold so you don't have repeats then
basically if you have enough reads to cover the sequence, then you are good and you can
reconstruct. So the ratio of the minimum number of reads you need for reconstruction to the
minimum number of reads you need for coverage goes to one. They
looked also at more general models so the sequence can come from a Markov chain, for
instance. And of course for the real-life application you are also interested in arbitrary
sequences; for arbitrary sequences, Bresler, Bresler and Tse studied this problem and
they showed a threshold based on the repeat statistics of the sequence. This is very natural: for a
random sequence you know what the repeat statistics are, so you can recover the specific thresholds
this way. For approximate reconstruction, our basic result is that you can do approximate
reconstruction if you have enough reads and if the reads are long enough. So more precisely,
and this is a bit of a mouthful, but basically, you can get an
approximation factor of anything greater than 3 if epsilon is small and the read lengths are long
enough, again, on this logarithmic scale and you have enough reads. Let me make some
comments. I'll show in a second that the simple sequential algorithm works and that's good
because it's kind of the most natural thing you might expect to do. You just patch the reads
one by one. Here in the statement I have some dependence on epsilon in the required read
length and the number of reads; this is not necessary if you just want a finite
approximation factor, but I won't get into this. One thing I should mention is that in the
previous picture when epsilon was equal to zero, then there is this very nice picture of a clear
sharp threshold. This was made easier by the fact that in that case you either reconstructed
the sequence exactly or you didn't. It's a binary thing. Here the measure of success depends
on this approximation factor so it's not something that is binary so it might be the case that the
best achievable approximation factor depends on how many reads you are given and how long
they are. So there might not be such a nice picture. We would like to understand this picture
better. Let me just mention related work. People have looked at various noise and error
models. In particular, Motahari, Ramchandran, Tse and Ma looked at exactly this setup of
just adding IID noise, and there you can imagine averaging out the noise: because of the coupon
collector fact, once you have coverage most places are actually covered about log n many times,
so you have many reads covering each coordinate, and with some amount of noise you can still
average it out. And Shomorony, Courtade
and Tse looked at an adversarial model, but it was a somewhat weaker adversarial model, so I can
explain more off-line.
>>: If you just use a constant multiple of the coverage then everything is covered many times.
>> Miklos Racz: Right. Before showing you the algorithm, let me very briefly say that in order
to analyze the algorithm -- the only assumption we're making is how the edit distance can
change. What we need to understand is what is the edit distance of two reads that don't
overlap and what is the edit distance of two reads that do overlap and how it depends on how
much they overlap by. Because the adversary can change things in edit distance so this is
something we really need to understand. The picture is the following. If you take two
independent sequences of length m, then their edit distance will be linear in m. This follows
from the subadditive ergodic theorem, so there is going to be some limiting constant
here, and it is a hard problem to determine this constant explicitly, but empirically you can
simulate this and estimate it. In particular, when you have four symbols, this constant is about
0.51, and using a volume argument you can get a lower bound of 0.33. This means that if you have
two reads that don't overlap then they are pretty far apart. So if the adversary can only change
that distance by a little bit they will still be far apart. The other end of the picture is when you
have two reads that really overlap by a lot. Let's look at this picture here. If they overlap by a
lot then the edit distance is at most twice the shift between the two reads, because you can
get from the read at the top to the one at the bottom by deleting the blue part on the left and
adding the red part on the right. For random sequences it turns out that this is exactly the edit
distance. You can't do better than this because of the randomness of these sequences. This is
true up until a constant fraction shift between the two sequences, and empirically it is true
exactly until half of this constant here. And then the edit distance just becomes as if they were
completely independent. We cannot prove this curve but we can prove that this is true until
some point here, and then we have a lower bound of that form, and that tells us that if the two
guys overlap by a lot, then they are close. So the adversary cannot
make them really far apart in edit distance.
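The two regimes can be seen in a small simulation (illustrative parameters; the constant near 0.51 is only approached as the length grows, and twice the shift is an upper bound that is typically tight for random strings when the shift is small):

```python
import random

def edit_distance(a, b):
    # Standard O(|a| * |b|) dynamic program for edit (Levenshtein) distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution or match
        prev = cur
    return prev[-1]

rng = random.Random(2)
L = 300
x = "".join(rng.choice("ACGT") for _ in range(2 * L))
r1, r2 = x[:L], x[L:]               # disjoint substrings: distance is linear in L
shift = 20
s1, s2 = x[:L], x[shift:shift + L]  # overlapping reads: distance <= 2 * shift
print("disjoint reads:", edit_distance(r1, r2), "(compare 0.51 * L ≈", round(0.51 * L), ")")
print("overlapping reads:", edit_distance(s1, s2), "(compare 2 * shift =", 2 * shift, ")")
```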
>>: You don't prove that the profile has just two lines?
>> Miklos Racz: No. I'm almost out of time so let me just very briefly say what the algorithm is.
You take a read and then you look at an appropriate length suffix, and then you look into your
bag of reads and try to find a read that has a prefix of that length that is really close in edit
distance. There are a couple of things you have to guarantee. You have to guarantee that this
is larger than 2 epsilon L, because otherwise the adversary can get rid of the overlap. You also have to
make sure that this is larger than the length of the longest repeat. So there are a couple of
things you have to [indiscernible] and that's why we have the various assumptions on the read
length and the number of reads. But then at each step you basically make a gain that is
linear in L; in particular, if epsilon is small then it is almost L, and you only make an error of
about 3 epsilon L.
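A rough sketch of this sequential algorithm (the parameter names and stopping rule are illustrative, not the exact conditions from the theorem; edit_distance is the same dynamic program as in the earlier sketch):

```python
def edit_distance(a, b):
    # Same dynamic program as in the earlier sketch.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def greedy_assemble(reads, suffix_len, match_threshold):
    # Start from an arbitrary read; repeatedly look for an unused read whose
    # prefix is close in edit distance to the current suffix, and patch it on.
    remaining = list(reads)
    assembly = remaining.pop(0)
    while remaining:
        suffix = assembly[-suffix_len:]
        best_i, best_d = None, None
        for i, r in enumerate(remaining):
            d = edit_distance(suffix, r[:suffix_len])
            if best_d is None or d < best_d:
                best_i, best_d = i, d
        if best_d > match_threshold:
            break  # no remaining read overlaps the current end closely enough
        nxt = remaining.pop(best_i)
        # Keep what we have and append the part of the new read that extends
        # past the matched prefix.
        assembly += nxt[suffix_len:]
    return assembly
```

In the analysis described above, the suffix length is chosen longer than both 2 epsilon L and the longest repeat, so that a genuinely overlapping read is the only kind that can match closely.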
>>: So even computing edit distance is harder [indiscernible]
>> Miklos Racz: Computing edit distance?
>>: Do we know the complexity [indiscernible]
>> Miklos Racz: [indiscernible] squared.
>>: [indiscernible] squared.
>> Miklos Racz: Let me conclude. Here we've introduced an adversarial model for the read
errors and we show that approximate reconstruction is possible using a very simple algorithm.
There are challenges so we would like to determine what are the really fundamental limits of
this approximate reconstruction. Also we would like results for arbitrary sequences and they
would be roughly of the form of saying if the reads are long enough so that non-overlapping
reads have large edit distance and overlapping reads have small edit distance then you will be
able to do it. So this is the type of result that should be true. More generally, of course the
models where you just have pure noise, those are not realistic. But this is also a bit going
overboard because nature is not necessarily adversarial. I think there is space for models in
between the previous models and this current one, and it would be good to find such a model. In
particular, one could think about heterogeneous error rates, for instance, and various other things. With
that, let me conclude. Thank you. [applause].
>> Yuval Peres: Any questions?
>>: I have a really stupid question. Is DNA directional? I mean can you tell which end is which?
>> Miklos Racz: Okay. I'm not an expert in this, but yeah.
>>: [indiscernible], yes.
>>: Thanks.
>>: Don't occasionally small pieces get reversed?
>>: Oh yes. When you look at the full string, then transcription always happens in one direction.
>>: But if you just have one of these sections [indiscernible] or whatever, do you know which
strand it is?
>>: No. But I believe when you get the reads you know which direction they are in. If I just gave
you a sequence you wouldn't know, but the read sequencing [indiscernible], to my
understanding that's [indiscernible] but I'm not an expert.
>>: It's my understanding that the two strands go in opposite directions, so when you get a
read it's ambiguous whether it's forward from one strand or backward [indiscernible] from the
other strand.
>>: [indiscernible]
>> Yuval Peres: Any other questions? I have one. Can you go back one slide? We said the
edit distance is exactly [indiscernible]. Were you suppressing a little bit of error there?
>> Miklos Racz: No.
>>: [indiscernible]. Less than or equal to.
>> Miklos Racz: Less than or equal to 2k follows from this construction.
>>: [indiscernible]
>> Miklos Racz: Yeah, with probability at least one minus epsilon.
>>: So it's full as big is it…
>> Yuval Peres: Any other questions? Then let's thank the speaker again. [applause]