>> Yuval Peres: We are starting the next talk. I was once in a lecture by Chi Lu, and he explained that, at least when we talk to the outside world, we shouldn't talk about exploration and exploitation but about learn and earn. Similarly, instead of greedy algorithms, they should be locally optimizing algorithms. In the same spirit, we'll next hear a talk from Miklos Racz, who will talk about sequence assembly from perturbed bow and arrow reads.
>> Miklos Racz: Thanks, Yuval. My mom really loves this new title. I'll be talking about joint work with Shirshendu Ganguly and Elchanan Mossel. The motivation is DNA sequencing. DNA sequencing has really revolutionized the biotech industry in the past couple of decades, because the DNA sequences of organisms contain so much information about them. There are tons of applications. Unfortunately, we cannot just read the DNA sequence of an individual from left to right. That would be great. What the current technologies allow us to do is to get a bunch of short, or not so short, substrings of the long sequence that you're interested in. So x is the DNA sequence, just a sequence composed of A's, C's, G's and T's, and what you get is a bunch of short substrings which are called reads. The problem we are facing is to reconstruct this original sequence x from the reads r. There are a bunch of sequencing technologies that have been developed in the past couple of decades. Basically, what we would ideally like are long reads that are error-free or have very few errors. However, this is hard to achieve, so typically the technologies trade off between these two quantities. Originally, Sanger developed a type of sequencing which was unfortunately very expensive. What really brought DNA sequencing to widespread use was the development of next-generation sequencing methods such as Illumina, about a decade or so ago. This allowed us to get high-throughput data really cheaply. Unfortunately, in order to get really low-error reads, this was restricted to very short reads, on the order of length one hundred.
>>: When you say the error rate, is that a percentage of wrong symbols?
>> Miklos Racz: Yes.
>>: I'm assuming they are uniformly spread?
>> Miklos Racz: Yeah, so the point is that these different sequencing technologies have different types of error profiles. There's some noise involved, but there are also biochemical things going on, so there are systematic errors that can occur depending on how it's done. Because these next-generation sequencing methods only produce short reads, they led to highly incomplete, really fragmented assemblies, because there were big parts of the genome that you couldn't connect properly. Partly in order to get around this problem, the new kinds of technologies that are emerging are ones that produce much longer reads, over 10,000 base pairs, but unfortunately these currently still have high error rates, upwards of 10 percent. The two main examples are PacBio's Single Molecule Real-Time sequencing and Oxford Nanopore Technologies. What we see is that there is a multitude of technologies, and they have different error rates and different error profiles. The current assembly algorithms are typically tailored to the particular technology that is used to produce the reads. This has a good side, because if you're using a particular technology, you want your algorithm to exploit what's going on there.
On the other hand, these technologies are evolving at a rapid rate, so 10 years from now we are probably going to have different technologies, and the question is whether these same algorithms will still be good, and whether tailoring to a particular technology makes an algorithm unsuitable for other types of technologies. We'd like to understand what kinds of general, robust assembly algorithms can work well across different sequencing technologies. In order to study this, what we're going to do is not commit to one particular error profile; instead, we'll study an adversarial corruption, or error, model for the reads. What we assume is that instead of getting a true read, R, what you get is a corrupted read. What was your phrase? I forget already.
>>: Perturbed.
>> Miklos Racz: Perturbed, yeah, you get a perturbed read. The types of errors that happen are deletions, insertions and substitutions. In this example these two A's in red got deleted, this yellow G turned into an A, and this C in green was inserted in between the G and the A. These are the typical types of errors that happen. The edit distance is the distance that quantifies how much error has happened. The edit distance between two sequences is the minimum number of insertions, deletions and substitutions required to take one sequence to the other. The only thing we're going to assume is that we know the edit distance of each perturbed read from the true read is at most something. In particular, we're going to assume that the edit distance is at most epsilon times the length of the true read, which I'll denote by capital L. This means that an epsilon fraction of the read can be arbitrarily perturbed, corrupted. Given this error model, here is the approximate reconstruction problem. We are going to model the sequence of interest x as a uniformly random sequence of length n from this four-letter alphabet. Of course, this is not realistic, but it makes things simpler. You can study this problem for arbitrary sequences, and I'll mention something along those lines later. For simplicity, let's stick to the simple setting. Once you have this sequence x, you draw capital N many reads. What this means is that you pick positions uniformly at random along the sequence and then you look at a substring of length L there. Then there's some adversary that can apply any perturbation they want, up to an epsilon fraction of the read length. We are given the set of corrupted reads, and our goal is then to approximately reconstruct the original sequence x. Our algorithm will output some sequence x hat, not necessarily of the same length n, because we don't even necessarily know what that is. What we want it to satisfy is that it is close in edit distance to the original sequence. Because of this adversarial error model you cannot really hope to get an edit distance smaller than linear in epsilon n, because of what the adversary can do. Here is your sequence: you chop it up into length-L parts, and within each part the adversary says, I will never give you this piece of information. The parts are of length L and each withheld piece is of length epsilon L. Then an epsilon fraction of the whole sequence is not given to you, so you cannot really do much better. What we want is some algorithm that gives you a small constant-factor approximation. Any questions about the setup?
>>: Do we know anything about the [indiscernible] error model? Like if the errors are random [indiscernible]?
>> Miklos Racz: Yeah.
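To make this setup concrete, here is a minimal Python sketch of the read model and the edit distance assumption. The names (random_sequence, draw_reads, edit_distance, within_budget) are illustrative rather than from the talk, and the edit distance routine is just the standard dynamic program.

import random

ALPHABET = "ACGT"

def random_sequence(n, rng=random):
    # Uniformly random sequence of length n over the four-letter alphabet.
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def draw_reads(x, num_reads, L, rng=random):
    # Each read is the length-L substring starting at a uniformly random position.
    starts = (rng.randrange(len(x) - L + 1) for _ in range(num_reads))
    return [x[s:s + L] for s in starts]

def edit_distance(a, b):
    # Minimum number of insertions, deletions and substitutions taking a to b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

def within_budget(true_read, perturbed_read, eps):
    # The only assumption on the adversary: each perturbed read is within
    # eps * L of the true read in edit distance.
    return edit_distance(true_read, perturbed_read) <= eps * len(true_read)

A perturbed read is admissible in this model exactly when within_budget returns True; everything else about the adversary is unconstrained.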
If you just have IID noise, then there are results that say you can average it out and do well. I'll refer to that a little bit later.
>>: It's somewhat realistic that the errors might not be just IID, because they can depend on the pattern of the sequence and what goes on?
>> Miklos Racz: Yeah. This problem -- not this particular problem, but the general problem of reconstruction from reads -- has been studied by a lot of people in the past several decades. The two main obstructions to reconstruction are very well known. The first is that if your reads are too short, then repeats in the sequence will lead to ambiguity in reconstruction. Let me explain in more detail what I mean by this. Suppose you have a substring A here that repeats in the sequence, and also a substring B here that repeats, and that your reads are shorter than the lengths of A and B. Even if you had all the reads, you wouldn't be able to distinguish between the sequence at the top, where Y is on the left and Z is on the right, and the sequence at the bottom, where Z is on the left and Y is on the right.
>>: When you get the reads, is there no ordering of where they come from?
>> Miklos Racz: No. They are just in a bag. The technology that gives you this, you create a bunch of replicas of the DNA sequence, then you put them in this big soup and you blow it up and you get out these reads. That's how it goes. Okay. So your reads have to be long enough. And the other main obstruction is that you have to have enough of them to cover the sequence. Again, if there are some parts of the sequence that you have no information about, then you have no way of reconstructing what's there. Also, even if you could reconstruct everything on the left and everything on the right, you wouldn't know which is on which side.
>>: Why are you giving them credit for coupon collection?
>> Miklos Racz: This is essentially a coupon collector problem, but that's okay. In this setting they were the ones who did this, but they also applied it to real sequences and computed these various quantities and things like that. This is a coupon collector problem: this is the number of reads you need in order to cover the sequence with probability at least one minus delta.
>>: I must have missed it on this slide. The reads are selected from uniformly random locations? The adversary only enters when the correct…
>> Miklos Racz: Right. So this first obstruction goes back to Ukkonen, and this one goes back to whoever it goes back to. [laughter]
>>: The coupon collector.
>> Miklos Racz: Yeah, the coupon collector. This is often referred to as the repeat-limited regime and this as the coverage-limited regime. If epsilon is equal to zero, so you have error-free reads, then this problem becomes that of reconstructing the sequence exactly. This has been studied in various settings. In particular, in this exact setting, with shotgun sequencing of a random string, it was studied by Motahari, Bresler and Tse, who showed that these are the only obstructions to reconstruction. More precisely -- back to the random sequence; they do more than that, it's more general, but let's stick to the simple setting -- the reads are logarithmic in n. If you have a random sequence of length n, then the longest repeats will be on the order of log n. I haven't specified the error probability; I mean the probability that the algorithm is correct, and you want that to be greater than a half.
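As an aside on the coverage obstruction above, here is a small hedged sketch that estimates by simulation how many uniformly placed length-L reads are needed to cover a length-n sequence. The function names and the parameter values are illustrative; the quantity on the slide is the exact coupon-collector-style bound for coverage with probability at least one minus delta.

import random

def covers(n, L, num_reads, rng=random):
    # True if every coordinate 0..n-1 is contained in at least one read.
    covered = [False] * n
    for _ in range(num_reads):
        s = rng.randrange(n - L + 1)
        for i in range(s, s + L):
            covered[i] = True
    return all(covered)

def coverage_probability(n, L, num_reads, trials=200):
    # Monte Carlo estimate of the probability that num_reads reads cover the sequence.
    return sum(covers(n, L, num_reads) for _ in range(trials)) / trials

# Illustrative usage (numbers are arbitrary, not from the talk):
# print(coverage_probability(n=10_000, L=100, num_reads=800))

The number of reads needed this way exceeds the n / L reads that would suffice under perfect tiling by roughly a logarithmic factor, which is the coupon collector phenomenon being referred to.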
Then if the reads are shorter than this threshold, so that you do have repeats of this length in your random sequence, then exact reconstruction is impossible because of the aforementioned fact. And if you are above this threshold, so you don't have repeats, then basically if you have enough reads to cover the sequence, you are good and you can reconstruct. So the ratio of the minimum number of reads you need for reconstruction to the minimum number of reads you need for coverage goes to one. They also looked at more general models, so the sequence can come from a Markov chain, for instance. And of course, for the real-life application you are also interested in arbitrary sequences; Bresler, Bresler and Tse studied this problem for arbitrary sequences and showed a threshold based on the repeat statistics of the sequence. For a random sequence you know what the repeat statistics are, so you can recover the specific thresholds this way. As for approximate reconstruction, our basic result is that you can do approximate reconstruction if you have enough reads and the reads are long enough. More precisely -- and this is a bit of a mouthful -- basically, you can get an approximation factor of anything greater than 3 if epsilon is small, the read lengths are long enough, again on this logarithmic scale, and you have enough reads. Let me make some comments. I'll show in a second that a simple sequential algorithm works, and that's good because it's kind of the most natural thing you might expect to do: you just patch the reads together one by one. Here in the statement there is some dependence on epsilon in the required read length and number of reads; this is not necessary if you just want a finite approximation factor, but I won't get into this. One thing I should mention is that in the previous picture, when epsilon was equal to zero, there is this very nice picture of a clear, sharp threshold. This was made easier by the fact that in that case you either reconstruct the sequence exactly or you don't; it's a binary thing. Here the measure of success depends on this approximation factor, so it's not binary, and it might be the case that the best achievable approximation factor depends on how many reads you are given and how long they are. So there might not be such a nice picture. We would like to understand this picture better. Let me just mention related work. People have looked at various noise and error models. In particular, Motahari, Ramchandran, Tse and Ma looked at exactly this setup with IID noise added to the reads, and then you can imagine averaging out this noise: because of the coupon collector fact, once you have coverage most places are actually covered log n many times, so you have many reads covering each coordinate, and if you have some amount of noise then you can still average it out. And Shomorony, Courtade and Tse looked at an adversarial model, but it was a somewhat weaker adversarial model, so I can explain more off-line.
>>: If you just use a constant multiple of the coverage bound, then everything is covered log n times.
>> Miklos Racz: Right. Before showing you the algorithm, very briefly, let me say that in order to analyze the algorithm, the only assumption we're making is about how the edit distance can change.
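As an illustrative aside on the averaging idea for IID noise mentioned above (this is only a cartoon of that idea, not the algorithm from the Motahari, Ramchandran, Tse and Ma paper): if the noise consists of IID substitutions only and, as a big simplification, the true start position of each read is assumed known, then a per-coordinate majority vote over the covering reads washes the noise out.

from collections import Counter

def majority_vote_reconstruct(n, positioned_reads):
    # positioned_reads: list of (start, read) pairs; assuming start positions are
    # known is a simplification used only to illustrate the averaging effect.
    votes = [Counter() for _ in range(n)]
    for start, read in positioned_reads:
        for offset, symbol in enumerate(read):
            votes[start + offset][symbol] += 1
    # At uncovered coordinates (coverage gaps) there is nothing to vote on; mark them.
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)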
What we need to understand is the edit distance of two reads that don't overlap, the edit distance of two reads that do overlap, and how the latter depends on how much they overlap by. Because the adversary can change things in edit distance, this is something we really need to understand. The picture is the following. If you take two independent sequences of length m, then their edit distance will be linear in m. This follows from the subadditive ergodic theorem, so there's going to be some limiting constant here, and it's a hard problem to determine this constant explicitly, but empirically you can simulate this and find it -- in particular, when you have four symbols, this constant is about .51, and using a volume argument you can get a lower bound of .33. This means that if you have two reads that don't overlap, then they are pretty far apart, so if the adversary can only change that distance by a little bit, they will still be far apart. The other end of the picture is when you have two reads that overlap by a lot. Let's look at this picture here. If they overlap by a lot, then the edit distance is at most twice the shift between the two reads, because you can get from the read at the top to the one at the bottom by deleting the blue part on the left and adding the red part on the right. For random sequences it turns out that this is exactly the edit distance; you can't do better than this because of the randomness of the sequences. This is true up until the shift is a constant fraction of the length -- empirically, it's true exactly until that fraction is half of this constant here -- and then the edit distance just becomes as if they were completely independent. We cannot prove this curve, but we can prove that it is true until some point here, and then we have a lower bound of that form, and that tells us that if the two reads overlap by a lot, then they are close, and the adversary cannot make them really far apart in edit distance.
>>: You don't prove that the profile has just two lines?
>> Miklos Racz: No. I'm almost out of time, so let me just very briefly say what the algorithm is. You take a read, you look at an appropriate-length suffix, and then you look into your bag of reads and try to find a read that has a prefix of that length that is really close in edit distance. There are a couple of things you have to guarantee. You have to guarantee that this length is larger than 2 epsilon L, because otherwise the adversary can get rid of the overlap. You also have to make sure that it is larger than the length of the longest repeat. So there are a couple of things you have to [indiscernible], and that's why we have the various assumptions on the read length and the number of reads. But then basically at each step you make a gain that's linear in L -- in particular, if epsilon is small then it's almost L -- and you only make an error of about 3 epsilon L.
>>: So even computing edit distance is hard [indiscernible]
>> Miklos Racz: Computing edit distance?
>>: Do we know the complexity [indiscernible]
>> Miklos Racz: [indiscernible] squared.
>>: [indiscernible] squared.
>> Miklos Racz: Let me conclude. Here we've introduced an adversarial model for the read errors, and we show that approximate reconstruction is possible using a very simple algorithm. There are open challenges: we would like to determine the really fundamental limits of this approximate reconstruction.
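For reference, here is a minimal sketch of the sequential patching algorithm described above. The parameter names (suffix_len, threshold) and the exact acceptance rule are illustrative stand-ins for the precise conditions in the paper; as noted in the talk, the suffix length should exceed both 2 epsilon L and the longest repeat length for the analysis to go through, and edit_distance can be any edit distance implementation, such as the one sketched earlier.

def greedy_assemble(reads, suffix_len, threshold, edit_distance):
    # Start from an arbitrary read, then repeatedly extend the current assembly by a
    # read whose length-suffix_len prefix is close in edit distance to the current
    # suffix, patching on its remaining (non-overlapping) part.
    remaining = list(reads)
    assembly = remaining.pop(0)
    while remaining:
        suffix = assembly[-suffix_len:]
        best_i, best_d = None, None
        for i, r in enumerate(remaining):
            d = edit_distance(suffix, r[:suffix_len])
            if d <= threshold and (best_d is None or d < best_d):
                best_i, best_d = i, d
        if best_i is None:  # no remaining read continues the current end; stop here
            break
        nxt = remaining.pop(best_i)
        assembly += nxt[suffix_len:]
    return assembly

Each accepted merge extends the assembly by roughly L minus the suffix length, which is the per-step gain linear in L mentioned above.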
We would also like results for arbitrary sequences, and they would be roughly of the form: if the reads are long enough that non-overlapping reads have large edit distance and overlapping reads have small edit distance, then you will be able to do it. This is the type of result that should be true. More generally, of course, the models where you just have pure noise are not realistic, but this adversarial model is also a bit going overboard, because nature is not necessarily adversarial. I think there is space for models in between the previous models and this current one, and it would be good to find a good model in between. In particular, one can think about heterogeneous error rates, for instance, and various other things. With that, let me conclude. Thank you. [applause]
>> Yuval Peres: Any questions?
>>: I have a really stupid question. Is DNA directional? I mean, can you tell which end is which?
>> Miklos Racz: Okay. I'm not an expert in this, but yeah.
>>: [indiscernible], yes.
>>: Thanks.
>>: Don't occasionally small pieces get reversed?
>>: Oh yes. When you look at the full string, then transcription always happens in one direction.
>>: But if you just have one of these sections [indiscernible] or whatever, do you know what strand it is?
>>: No. But I believe when you get the reads you know which direction they are read in. If I just gave you a sequence you wouldn't know, but the read sequencing [indiscernible], to my understanding that's [indiscernible], but I'm not an expert.
>>: It's my understanding that the two strands go in opposite directions, so when you get a read it's ambiguous whether it's forward from the left strand or backward [indiscernible] from the right strand.
>>: [indiscernible]
>> Yuval Peres: Any other questions? I have one. Can you go back one slide? You said the edit distance is exactly [indiscernible]. Here, were you suppressing a little bit of error there?
>> Miklos Racz: No.
>>: [indiscernible]. Less than or equal to.
>> Miklos Racz: Less than or equal to 2k follows from this construction.
>>: [indiscernible]
>> Miklos Racz: Yeah, with probability at least one minus epsilon.
>>: So it's full as big is it…
>> Yuval Peres: Any other questions? Then let's thank the speaker again. [applause]