>> Bill Bolosky: I am very pleased to introduce Kristal and Matei, who are both fifth-year grad students at Berkeley, although this does not mean that they are equidistant from graduating, because Matei is finishing next year and Kristal just told me two more years. They have spent quite a bit of time working with a bunch of people, including me, on gene sequencing. I can't speak for them, but for me this is one of the most fun things I have ever done. It's just amazingly cool to learn that knowledge about hash tables translates into biology. They are going to tell you about the project that we worked on, and they are going to tag-team; I don't know which order they are in. >> Matei Zaharia: I am Matei and I will start off. This is a project from a bunch of people who were interested; a lot of the people you see up here are from the AMPLab at UC Berkeley, which is a new lab that started about a year ago, focused on big data and large-scale machine learning and data processing. We are also working with Bill and Harvey from Microsoft and with some folks from UCSF, including Taylor Sittler, who is here today; he is a medical researcher there. Let me just jump into why we're doing this. If you have seen some biology in the past you will know that DNA is the molecule and the code that orchestrates all of the activity of a living cell, and because of this, DNA is a central element in a lot of diseases: either differences in the DNA cause the disease, or the DNA is involved in propagating it somehow. In particular, DNA encodes how to build the worker molecules called proteins in the cell, and also signaling information about when to build them, so everything the cell does is guided by it.
Two ways in which it ties to diseases: first of all, cancer is essentially caused by the DNA in some cells accumulating enough mutations that the regulatory mechanisms break down, and those cells start multiplying without the right regulation and start consuming all of the resources of the body. On top of that there are hereditary diseases, where some mutation that is passed down causes a problem. DNA also affects susceptibility to various drugs, so it's an important thing to understand what's going on in the cell. One of the things that's happened in the past decade is that the cost of actually reading the DNA from a cell has fallen dramatically, enough that it's starting to actually be used in clinical medicine. Here is a picture of that, compared against Moore's law. This is the cost of sequencing one human genome. If you go back to 1999 and 2000, the Human Genome Project cost several billion dollars to sequence the first human genome, and that was a great accomplishment. Fast forward a bit: this year you can actually get your genome sequenced for about $3000, and next year a couple of companies have announced that they will do it for $1000, so this sort of magical thousand-dollar genome is really on the horizon. If you work out the rate, the cost has been falling by something like a factor of three or four per year, which is actually quite a bit faster than Moore's law, so we are able to actually read this stuff now. Being able to read this stuff can have impact in a lot of areas of medicine, and of course in biological research as well. Here are just some examples.
In cancer, the most exciting thing about this is that cancer is typically caused by a bunch of mutations that affect different gene pathways; these are groups of genes that work together to perform a function in the cell, and often cancers that look very different, like a skin cancer and a lung cancer, might involve the same pathway. By actually sequencing you might be able to figure out which pathway is involved and also which drugs you can use to target that particular pathway; there are many targeted cancer drugs that will work really well if one particular thing is going on but will have no effect if something else is happening. For infectious disease, like common viruses and bacteria, one of the nice things you can do with sequencing is really quickly find out which pathogen a patient is infected with. You don't have to do these sort of crazy diagnostic tests where you look at the symptoms; you can actually get some DNA and see what's going on. Another thing this has been used for is identifying new diseases, such as the H1N1 flu, by actually sequencing and saying hey, this looks different from stuff that we've seen before. Finally, you can personalize medicine to an individual's genome: you can identify things they are susceptible to in advance, since for example some drug allergies or side effects only happen to people with specific genes; you can know which inheritable diseases they have; and you can estimate the risk of various things going on. There are some examples of this happening already.
Just a couple of weeks ago there was a series of New York Times articles about sequencing-driven personalized treatment for cancer. There was a case at Washington University where they sequenced the cancer that one of the researchers there actually had, and they figured out that it was susceptible to a drug that had been designed for a completely different type of cancer; they applied it and actually managed to induce a remission in that patient. There was another case where, in the end, it didn't work out; the remission was only for a short amount of time, but similarly they were able to use a drug from one cancer to treat a different one that it normally wouldn't have been prescribed for. There are examples in noncancer things as well, for example diagnosing metabolic diseases, which is usually very hard. They can sometimes be due to mutations in mitochondrial DNA, and here they were able to just sequence the mitochondria and figure out some of these diseases. So this is all great in terms of affecting medicine, but we are here talking at a computer research lab, so what is the computational challenge? It turns out that although the cost of the wet-lab part of sequencing is dropping dramatically, you have to do a lot of processing to actually put this data to use: both to understand what is happening, like how to actually put together the sequence for one human, and of course beyond that to understand which genes affect which diseases. We are not even going to get into that in this talk, but that is also a very hard kind of machine learning problem.
The reason that the processing is difficult is that the sequencing machines work essentially by using massive parallelism, reading many small random substrings of the genome in parallel. You get these little random strings, kind of like puzzle pieces, and you have to put them together to see the whole picture of what was actually going on in that person's genome; I will show what that looks like and what steps you need to do that. The bottom line is that the current pipelines for doing this take multiple days for each genome, and if you actually work out the compute time on something like Amazon or Azure, or to some extent even if you buy your own machines, it can cost into the thousands of dollars just for one genome, so this is starting to exceed the cost of the actual sequencing itself. Question? >>: [inaudible] just the chemistry part? >> Matei Zaharia: Yes. Just the chemistry part, exactly. So this is not great for clinical use, but it's especially bad if you want to scale up the number of genomes you sequence for research and really understand these diseases, because people have to pay that many times over to sequence hundreds of cancer patients and see what is going on. The goal of our team is to build a faster, more scalable, and more accurate pipeline that does this genome reconstruction and that can be used both in actual medicine and in research, where you get a lot of different genomes and you want to compare them, or you just want to do this kind of computation at scale. We have a bunch of people involved from a bunch of different areas. We have a lot of systems people; this started out essentially with a bunch of systems people talking with Taylor and figuring out that we can do some of these steps faster, along with Bill and Harvey from Microsoft.
We have Taylor, who is from UCSF, and we have some folks in machine learning and in computational biology and theory who are helping out as well at Berkeley. So what will we have in this talk? I am going to first give an overview of how the sequencing process works and what the computational problems are, and then we will mostly talk about the first processing step, which is sequence alignment; it also turns out to be the most expensive in terms of CPU time. We will talk about a new algorithm we developed called SNAP that cuts down the cost of this step by a factor of 10 to 100 compared to the existing tools. This takes a problem that was essentially CPU bound initially and took more than a day for a human genome and turns it into a problem that's now basically I/O bound, and we can do it in an hour and a half. Then we'll talk about some ideas we have for further improving alignment, especially the accuracy, by taking advantage of the structure of the genome, and finally we will also talk about some downstream processing steps beyond alignment. Hopefully this will give you a taste of the different types of problems that exist. Let me just show you quickly with some pictures how the actual sequencers work and what you have to do with the data. DNA itself is a long molecule, well, it's actually a bunch of long molecules, but in total it is 3 billion of these letters, or bases, long. So you have a sequence of these 3 billion things, and the first thing you do is replicate it so you have a bunch of copies, or you just take a bunch of copies from cells. Next, to actually sequence it, everything is done in parallel, so you split it randomly into fragments, which you can do just by heating it up or something like that, and you get these little fragments of DNA. After that you can read the fragments in parallel.
There are machines that do this with some clever chemistry to read the sequence of each fragment: they attach it to a little spot on a plate, float around the complementary bases to the ones that are part of your DNA, and read what actually sticks to it in what order, and as they do this they end up reading the sequence. These little fragments are called reads, and they are about 100 bases or letters long today, while the whole genome, as I said, is 3 billion. In a typical human genome sequencing run you are going to get 10^9 of these reads, which is basically covering each location in the genome about 30 times, and they are each 100 bases long. Now, how do you actually put these things back together? You can view it as a puzzle. You get this collection of reads, and to see the picture on the top of the puzzle box, you can use a reference genome, which is the genome that was sequenced back in 1999-2000 and has been refined since then, where people used longer-read technology and know how everything ties together. The reads from the person are going to have some differences from the reference genome, of course, because it is a different person, and they will also have differences because of errors in the sequencing itself. What you can do is take each read and see where in the reference genome its sequence matches best, and then just place it at that location, like placing a piece of the puzzle on the sheet in front of you. This step is called sequence alignment; this is the one that we will talk a lot about. What this lets you do is that once you've lined up all of the reads, at each location you'll get something like this, where a bunch of them are aligned together and they might show some difference from the reference genome.
Some of them show noise, because there is some error in the actual reading process, but then you can do kind of a voting algorithm to figure out what the base at that location was. This sounds really simple, but it turns out that there is a lot of stuff that makes it hard; this step is called variant calling, and we'll talk about it a little bit later. Just to give you a sense of the rate of differences: two people only differ in about one in 1000 bases from each other, so that's pretty small, but the sequencing machines can have error rates of up to a few percent, although it depends, and it turns out some of the errors are also pretty biased. So that is the error rate you are looking at. Let's jump into the first step of this, which is alignment, and what we ended up doing for that. The alignment problem, as I said, is: given one of these reads and the reference genome, which is a big string, find the position in there that minimizes the edit distance to the genome; again, to give you a sense, the genome is 3 billion bases and the reads are about 100. If you look at the current status, alignment is today the most expensive step of processing, and it's also important because you really have to map the reads to the right location in order to do anything downstream of that. Depending on the accuracy of the tools you use it can take a few hundred to a few thousand CPU hours, and if you work out the math it can cost hundreds to a thousand dollars of compute time. The issue is also that the faster aligners lose accuracy: they typically don't align as many of the reads and they support fewer differences inside each read. The problem with that is if you have a place in the genome where a person really has, say, five differences in a row and your aligner doesn't support that, you are going to systematically miss all of the reads that map there and you are not going to see that difference in the downstream analysis.
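The majority-vote idea behind variant calling, described above, can be sketched in a few lines. This is a toy caller, not a real one: real variant callers model diploid genotypes, base quality scores, and the biased errors just mentioned, and the thresholds below are made-up numbers for illustration only.

```python
from collections import Counter

def call_base(pileup, min_coverage=4, min_fraction=0.75):
    """Call the consensus base at one genome position from aligned reads.

    pileup: list of bases observed at this position across the reads
    that overlap it. Returns the consensus base, or None if coverage
    or agreement is too low to call confidently.
    """
    if len(pileup) < min_coverage:
        return None  # not enough reads cover this position
    base, count = Counter(pileup).most_common(1)[0]
    if count / len(pileup) < min_fraction:
        return None  # ambiguous: could be a heterozygous site or noise
    return base

# 30x coverage with one sequencing error still calls 'A'
print(call_base(list("A" * 29 + "G")))  # -> A
```

With 30x coverage, a few percent of machine error rarely outvotes the true base, which is why the roughly 30-fold oversampling mentioned earlier makes this voting feasible at all.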
So we built SNAP, the Scalable Nucleotide Alignment Program, a tool that is 10 to 100 times faster than the current ones and at the same time improves the accuracy. It has higher accuracy and it also has a richer error model that allows for more types of differences: basically you can give it a parameter k, the number of edits you allow from the read to the reference genome, and it will find the best location with at most k edits. As a result, as I was saying, we cut down this step from about 1.5 days to 1.5 hours, and this is done while cutting the number of errors in half on real human data. So what do current aligners do, and how do we do better? There are two methods to do alignment. One of the earliest ones, if you've heard of BLAST, this is what it does, is based on seeds. The idea here is you have the genome and you index short substrings of it, of say 10 characters. Then you take your read, take each 10-character substring in it, and look for an exact match in the index. Here we have just four; say these are the first four of them and this is where one matches, so you've got some candidate locations to try. You place the read at each one, compute the edit distance, and pick the best one at the end. So this is the seed-based method. Of course, you don't know: maybe the first 10 bases match somewhere, but actually that's just by chance, or maybe there were some errors in there, so you have to try multiple seeds to really find the best location, and so you keep doing this work, continuing with the seeds. This is one method. The other method, which many of the faster tools today use, is the Burrows-Wheeler transform, which encodes kind of a prefix tree of the genome, of all of the substrings in the genome, and then they search through this tree using backtracking.
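Before going on, the seed-based method just described can be sketched end to end. This is a minimal illustration of the idea, not BLAST or SNAP: the genome and parameters below are tiny made-up examples, and a real aligner would use a packed index and a bounded edit-distance check rather than this plain quadratic one.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Plain O(len(a) * len(b)) Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(genome, seed_len):
    """Hash every overlapping substring of length seed_len to its positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index

def align(read, genome, index, seed_len):
    """Seed-and-extend: look up disjoint seeds, score each candidate locus."""
    candidates = set()
    for off in range(0, len(read) - seed_len + 1, seed_len):
        for pos in index.get(read[off:off + seed_len], []):
            start = pos - off  # where the whole read would start
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    # score every candidate; a real aligner prunes most of these
    return min(candidates,
               key=lambda s: edit_distance(read, genome[s:s + len(read)]),
               default=None)

genome = "AAAACCCCGGGGTTTTACGTACGT"
index = build_index(genome, 4)
print(align("GGGGTATT", genome, index, 4))  # read with one error -> 8
```

Even with an error in one seed, another disjoint seed still hits exactly, which is the property the longer-read discussion below relies on.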
That is, you are just going down the tree and trying to fit the read at different locations, but I'm not going to talk much more about this because it's a bit more complicated to explain. What we do in SNAP is we have actually taken the seed-based method, but we've made a bunch of algorithmic changes and also systems changes that reduce the cost of the most expensive step, which turns out to be the local edit distance computation against candidate strings. On the one hand we leverage improving resources. One of these is the actual read length: reads used to be about 25 bases and now they've gotten longer, to about 150, and using this it turns out you can really change the form of hash index you have and do quite a bit better. We also leverage higher memory; our algorithm is designed for servers with about 50 gigabytes of memory, so we can use that. On the algorithm side we have a way to prune the search to reject most of the local alignment locations without fully computing the score, and this turns out to save a lot of time as well. Let me just explain the first part. We're going to match these seeds against a hash table of exact substrings of the reference genome, but we've chosen to use longer seeds than a lot of the previous aligners. The reason is there is a trade-off in general between the seed size, the probability of actually finding a seed that matches the genome, and the amount of false positive hits you have. The human genome is about 4^16 bases, and if you have a seed of 10 bases and a 2% sequencing error rate, it turns out there is a 19% chance that your seed contains an error; the other roughly 80% of the time it will match, but you will also have 4^6, or about 4000, matches just by chance against the whole genome on average. So this is why people have used short seeds in the past, and it made sense when you had very short reads.
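The arithmetic behind this trade-off is easy to reproduce. The sketch below uses the talk's approximation of the genome as 4^16 bases and a 2% per-base error rate; the error probability comes out near 18%, in line with the roughly 19% quoted.

```python
def seed_tradeoff(seed_len, per_base_error=0.02, genome_size=4**16):
    """For a given seed length, return (probability the seed contains at
    least one sequencing error, expected number of chance matches of the
    seed against a random genome of genome_size bases)."""
    p_error = 1 - (1 - per_base_error) ** seed_len
    chance_hits = genome_size / 4 ** seed_len
    return p_error, chance_hits

for s in (10, 20):
    p, hits = seed_tradeoff(s)
    print(f"seed length {s}: P(error in seed) = {p:.0%}, chance hits ~ {hits:g}")
# seed length 10: ~18% error chance, 4096 chance hits (the 4^6 in the talk)
# seed length 20: ~33% error chance, ~0.004 chance hits (essentially none)
```

So a 20-base seed is more likely to be spoiled by an error, but almost never matches by chance, which is why it pays off once reads are long enough to supply many independent seeds.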
You couldn't take much longer seeds than that and expect them to match. If you go up to a 20-base seed, you have a higher chance of an error in the seed, but you also have almost zero chance of it matching just randomly at a particular location, at least if the genome were a random string, so you expect to test a lot fewer candidate locations. The reason this makes sense for a longer read is that because the read is longer, you can take many more independent seeds to try. Yes? >>: Have you made any assumptions about high repeat regions? >> Matei Zaharia: Right, this is very simplified, so actually high repeat regions mess this up: some seeds are much more common than others, and big regions are replicated. So you still have to search a lot; you usually have to check more than one hit for each seed. We will actually talk about that; it's one of the things that we are looking at. So just to show this: if you have a short read of 25 bases and you have an error in it, and you take a 20-base seed, there is pretty much no place you can put it without touching that error. But if you have a long read of 100 bases, there are a bunch of places, and even with a 2% error rate you do expect some seeds to work, so you can actually find matches for this in the index. So this is just an observation, with some math, of something we can do. Yes? >>: What technology allows you to go from 25 to 100 [inaudible] thousand [inaudible]? >> Matei Zaharia: Yeah, definitely, read lengths are improving. Basically, as in that picture at the bottom, you attach one end of the DNA string and then you float these molecules around, and they are actually fluorescent and there's a camera pointed at them to see which one attached, and what's happened is they've been able to attach more of them before they have to stop.
Before, after about 25 of them there was too much noise to be able to attach more and see which one actually got put there, and they've improved both the chemistry of how those things float around and the camera technology to be able to do more. >>: [inaudible] technology as well [inaudible] electronic reads [inaudible] there are other technologies that are more electronic reads, so you have way fewer complications, so we will see longer reads coming in the not too distant future. >>: I was wondering, what is the order of magnitude we can expect? >> Matei Zaharia: I think 10,000. >>: 10,000? >>: There are other people working on megabase reads. >> Matei Zaharia: Yeah. >>: Yeah [inaudible] [laughter]. >>: It has a ways to go but it will be interesting. >>: Yeah, the 10,000-base-pair reads seem to be doable. There are some prototypes out there. >>: Turns out that backtracking is a bad strategy for 10,000 or higher. [laughter]. >> Matei Zaharia: Yeah, even for 100, we'll see. Yep, that's a good question; these things are going to be getting longer. The other thing we do is very simple on the index side, but it actually helps a lot. A lot of the existing tools were designed at a time when server memories were a lot smaller, and, for example, if they indexed 10-base-pair seeds, they only took the nonoverlapping ones, the one at position 0 to 9 and then 10 to 19 and so on, all of these disjoint seeds. What we do is a sliding window: we index every substring of that length, and if you think about it this means that our index has 10 times, or actually 20 times, more seeds in it, because we are using 20-base-pair seeds. But it turns out that if you pack the bytes nicely into a hash table, you can fit that into a very reasonably sized memory, so we are able to do this with 39 gigabytes of memory for the human genome. This is important because looking for a seed and not finding it in the hash table is actually really expensive.
It's hundreds of cycles, because it's a last-level cache miss; even with the smaller hash tables they had before, the index is not going to fit in your processor cache, so it really helps to actually have this. The other part, the algorithmic part that is really different beyond this, is the way we do the local alignment check. As I mentioned before, it turns out that the genome is not a random string. There are a lot of areas that are quite similar to each other, so for many seeds you will find hits in a lot of locations, and in general for many reads you will have to test a bunch of candidates before you find the best one. This is where all of the cycles in the algorithm go, essentially; at least 90% are going into this. What is our insight here? The thing we figured out is that at the end of the day, to actually map the read, you only care about the best and second-best locations where it aligns. If the place where it aligns best has, say, edit distance one and the second-best place has edit distance five, you can be pretty sure that it came from that best one and you are just going to align it there. If the best is at distance one and the second best is at distance two, then maybe you are not sure, and you are just going to tell the downstream analysis, okay, I am not sure where this read goes; you might give it both locations or something like that. But these are kind of the confidently called reads; you just need to know the best and second-best locations. How can we use this? We replace the traditional edit distance algorithm, which always computes the full edit distance at each location and is a quadratic-time algorithm, with one where you can give it a limit on the edit distance and tell it: if the distance is bigger than this, I don't care how big it is.
Just stop early and tell me it's bigger. With this algorithm the complexity is only n times the distance limit, and we lower the limit as we find more hits and improve the best and second-best locations. Yes? >>: [inaudible]. >> Matei Zaharia: Yeah, n is the length of the read; we are only comparing locally at each place, so n is like a hundred, basically. We've mapped it to one place with a seed, and we compare the hundred bases of the read against the hundred that we have there, yeah. With this algorithm you can lower the limit as you go, and you can also arrange things so that the first hit you find actually has a low edit distance, and start out with a low limit. >>: So you basically abandon the edit distance search early? >> Matei Zaharia: It's kind of like filling in only a band around the diagonal of the matrix, but it's actually a bit nicer than that because it only uses order d space; it only tracks how far you can go down each diagonal, so it's actually kind of cool because it fits in the L1 cache of the processor as well. That's what it is. Here is the actual algorithm; I'll just step through it to show you the way we actually do the pruning. When we start, we set the d limit to be the maximum distance plus this confidence threshold c, which is how far apart we want the best and second-best locations to be to call the alignment unambiguous. We go and extract seeds from the read, and for each seed we find the locations where it matches in the genome; we actually discard seeds that have too many locations, because some things are just too repetitive, and look for a better seed that doesn't have that property. We add candidates to a list. Once we've tested a minimum number of seeds, we look at the candidates that match the most of them, because the idea is that we are about to do an expensive edit distance computation.
Let's find a candidate that will give us a low d limit for the next one, so we score that candidate first. The next thing we do is update the distance limit, and there are two cases. We look at the best and second-best hits and keep tracking them as we go along. If the best is much better than the second best, more than this confidence threshold c apart, then we only care about finding other hits within best plus c, because hits with a bigger edit distance than that are not going to change our result, but if we find one in this window of up to c more, we're going to say maybe we are not confident about the match. That's the first case. The second case is if the best and second best are already within distance c of each other; then there is no way that finding hits worse than the best is going to help us. The only thing that will help is if we find something much better than the best, up in this window here, with nothing between it and best, so that we can be confident about it; so we only search up to best minus one in this case. So that's what we do with the limit. The final thing is a trick that you can do to stop early. Say you've found that the best is at distance two and you only care about things out to distance four beyond that. If you have tested at least five disjoint seeds from the read, you can actually stop right there, because any location that you haven't yet put in your candidate list didn't match any of those seeds, so it must have at least one error in each one of these seeds, and so it must be at least distance five away. So you can actually stop early over here; this lets us cut the search short as well.
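The bounded edit-distance piece of this can be sketched as below. This is a generic banded Levenshtein computation in the spirit of what was just described, filling in only a band of width about 2d+1 around the diagonal and giving up as soon as the whole band exceeds the limit; it is a sketch of the idea, not SNAP's actual implementation, which is further optimized to track distances per diagonal in O(d) space.

```python
def bounded_edit_distance(a, b, limit):
    """Banded Levenshtein distance that gives up early.

    Returns the edit distance if it is <= limit, otherwise limit + 1.
    Only cells within `limit` of the diagonal are filled in, so the cost
    is O(len(a) * limit) rather than O(len(a) * len(b)).
    """
    if abs(len(a) - len(b)) > limit:
        return limit + 1  # length difference alone exceeds the limit
    INF = limit + 1
    prev = [j if j <= limit else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - limit), min(len(b), i + limit)
        cur = [INF] * (len(b) + 1)
        if i <= limit:
            cur[0] = i
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
        if min(prev[lo:hi + 1] + [prev[0]]) > limit:
            return limit + 1  # the whole band exceeded the limit: stop early
    return min(prev[-1], INF)

print(bounded_edit_distance("kitten", "sitting", 3))  # -> 3
print(bounded_edit_distance("kitten", "sitting", 2))  # -> 3, i.e. "more than 2"
```

Lowering `limit` as better hits come in, as the algorithm above does, makes each subsequent candidate cheaper to reject, which is where much of the speedup comes from.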
In terms of results, here are some numbers comparing SNAP to two of the commonly used aligners today, BWA and SOAP, which are both based on the Burrows-Wheeler approach. This is on simulated data, which lets us know where each read actually came from, and we are showing three numbers: the percentage aligned, the error rate, and the speed in reads per second. You can see BWA and SOAP give you a trade-off between accuracy and speed: SOAP is a little faster but makes a lot more errors. SNAP actually matches, in fact beats, the percent aligned of BWA; it has half the error rate and it's also going about 30 times faster. >>: [inaudible] and we also scaled better. >> Matei Zaharia: Yes, that is true. This is 100-base reads with 2% differences, which is typical of the reads you have today. Another cool thing is that in SNAP you can actually tune some of the parameters, like the max hits you check for each seed, and trade off between accuracy and speed. So if you are okay waiting a day to align this stuff, you can tune SNAP to give you higher accuracy; you could align with only a 0.01% error rate if you go about three times slower. Or if you are okay with the error rate of BWA, you could tune it to get higher speed. So you can trade off between these. Yes? >>: [inaudible] aligned, so you are missing 8 to 9 percent of the reads [inaudible]? >> Matei Zaharia: No, those are just reads we couldn't align confidently; these are the ones that we could confidently map to one location. For the other ones there were multiple locations where they matched well. There is also usually about half a percent that have too many differences, and we don't find any location for them. This is how many we are confident about.
>>: So if you had a perfect [inaudible] would you have a hundred percent alignment and zero errors? >> Matei Zaharia: No, it would actually only be, I think in this case, 93% or 94%, because some regions are exactly identical and you can't align there with confidence, so we're not counting those, but it would have zero errors. Actually, even then there could be errors, because of the error model: you might have two errors that put a read closer to something other than where it actually came from, but I think you could calculate where that happens. >>: [inaudible] errors [inaudible] so the idea is if you get a mutation or an error in the read from the machine, it may move this string from the place that it came from to be actually closer in edit distance space to a wrong location, and then it's impossible to align it correctly. You might be able to just say I'm not going to call it because it's too close to too many of them, and we actually did that to some extent. >> Matei Zaharia: Yeah. So this is kind of with today's data. One other neat thing is that if you have reads with a lot more errors, which some of the future technologies will have, and which will also happen if the person has a lot of mutations in one place, SNAP still performs pretty well. Here, with 10% error, the existing aligners kind of fall over: they align less than 20% of the reads, and they actually go a little faster because it's easy to go fast when you are not aligning anything, but we still do okay. Yes? >>: To combine this into just one number for percent error, you would need to incorporate the downstream processing into… >> Matei Zaharia: Yeah, you would, yeah. >>: So what happens when you take just an off-the-shelf downstream processor and… >> Matei Zaharia: If you compare SNAP with a… That's a really good question. We don't actually have a complete answer for that yet.
We looked at whether [inaudible] alignment affects the downstream callers, and we found places where it does, like where BWA has misaligned some reads consistently and you call mutations that aren't actually there, but I don't have a good sense yet. Part of the issue is that there are other things in the downstream caller that are hard and that you have to get right first before this makes a huge difference. But we think it will help eventually. Yeah? >>: [inaudible] in most of the current processing pipelines there is a second step [inaudible] alignment that's done after the initial alignment, and that's because there's about 30 to 35% of the read space [inaudible] don't match a location. They are actually [inaudible] but not yet. >>: This is now looking at cancer genomes, which are a little bit more complicated than [inaudible] significant [inaudible]. >> Matei Zaharia: In cancer, also, a lot of the DNA replication and checking mechanisms break down, so you get a lot more mutations than usual. Let me see what else I wanted to show you. Another thing we wanted to show is that as reads get longer, as they will in the future, SNAP actually scales better as well. The fastest existing tools today use this backtracking, which is a bad idea when the reads are longer because you have to backtrack at more locations, whereas SNAP actually does better, because you can use longer seeds, get more disjoint seeds, and filter out a lot of the locations just by the number of seeds matching. We also did some analysis of the speedup.
One of the nice things is this heuristic where we test the read with the most exact matches of seeds; it means that usually the first candidate we score is actually the best one we will find, so we can lower the limit right at the start, and we eliminate 94% of the locations without actually scoring them, and 40% are just because the number of seeds didn't match, so this cuts down on the time, and the adaptive edit distance threshold also helped by a factor of four. Finally, just to wrap up this part, we have been doing a lot of other stuff after this and I don't have too much time to go into it. One of the things is we generalized the algorithm to work with what are called paired-end reads. This is actually the most common type of read. You get a bigger molecule, like 500 bases, and the machine reads 100 bases off one end and 100 off the other and it can't go all the way into the middle, so now you align kind of two strings in a place and you know some constraint on how far they can be from each other, and you can use similar ideas in this problem to actually align them as well. As Bill was saying, we have spent a bunch of time making this scale well and it does. This is up to 32 cores and part of it is because we are careful about how many memory accesses and cache misses we do. We are careful to prefetch data, and basically having systems people look at this actually does help improve the speed to the point where it matters to practitioners, so that's kind of cool. We've also run it on some real data. These are like the numbers I showed before, but on real reads. The interesting thing with real data is that a lot fewer of them align because there is some contamination and just stranger things happening in them, but on this data we are able to again sort of mirror the results from the simulation and get higher accuracy and also go about 20 times faster. So that's the part of the talk that I had. Kristal is going to talk about what we are doing next. Yeah.
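To make the seed-and-filter heuristics described above concrete, here is a minimal sketch of the idea: hash every seed of the reference, vote for candidate start locations, score the most-supported candidates first, and tighten the edit-distance limit as better matches are found. This is a toy illustration, not SNAP's actual implementation; the reference string, seed length, and distance limit in the example are made-up assumptions.

```python
from collections import defaultdict

def build_index(ref, seed_len):
    """Hash table from every seed (substring of length seed_len) to its locations."""
    index = defaultdict(list)
    for i in range(len(ref) - seed_len + 1):
        index[ref[i:i + seed_len]].append(i)
    return index

def edit_distance(a, b, limit):
    """Plain Levenshtein distance, bailing out with limit+1 once it exceeds limit."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:
            return limit + 1
        prev = cur
    return prev[-1]

def align(read, ref, index, seed_len, max_dist):
    """Vote for candidate starts by seed hits, then score candidates with the
    most seed matches first, shrinking the edit-distance limit adaptively."""
    votes = defaultdict(int)
    for off in range(len(read) - seed_len + 1):
        for loc in index.get(read[off:off + seed_len], ()):
            votes[loc - off] += 1
    best_loc, limit = None, max_dist
    for start, _ in sorted(votes.items(), key=lambda kv: -kv[1]):
        if start < 0 or start + len(read) > len(ref):
            continue
        d = edit_distance(read, ref[start:start + len(read)], limit)
        if d <= limit:
            best_loc, limit = start, d - 1  # future candidates must strictly beat this
        if limit < 0:
            break
    return best_loc
```

Because candidates are visited in order of seed support, the first one scored is usually the eventual winner, which is what lets the adaptive limit prune the rest cheaply.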
>>: Would you consider this a solved problem, given the low error rates and the speeds where you are saying you are I/O bound? Are you guys done? >> Matei Zaharia: That's a good question. I think the accuracy can be improved even further. We want to explore this more to see when it matters, but especially for things like cancer there will be more mutations, and also there will be pieces of the chromosomes that usually aren't together getting cut and pasted next to each other, and you might want to detect that, so now half of a read aligns in one place and half in another. In terms of speed I think the speed is pretty good. Although there is always, you know, this is one genome in one hour, but there are people that have a 1000-genome data set. If they want to realign that using SNAP, or realign with higher accuracy, that might take a while. So we will talk a little bit about what we are doing for accuracy, but we also want to look at the downstream steps next because there are more unknowns in there. >> Kristal Curtis: As Matei mentioned, we are kind of capping out on speed, but the accuracy is actually something that we still want to look at more, because as far as alignment goes there are two parts of the process. One is finding good candidates to check the read against and the other is quickly checking against all of the candidates so that you can find the best match. The first part we have pretty much nailed down and we are getting a good list of candidates, but in the second one we still want to make some improvements. What we've noticed in doing a sensitivity analysis of which parameters tend to affect the performance of SNAP is that really the only parameter that makes a big difference is this max hits, and as Matei mentioned this is kind of the cutoff, when you have a seed, for how many hits it has in the table as to whether you will consider it or not.
And so what we noticed when we varied that max hits parameter is that the error rate goes way down as you allow higher max hits and you can also align a higher percentage of reads. At first cut you would think that a seed that matches in a bunch of places is just indicating that your read is going to be ambiguous because it's matching so many places, but what we actually find when we are willing to consider more and more places is that for some of those reads we actually can find an unambiguous best match in the genome, and the reason this happens, as you were kind of alluding to in your question, is that the genome is not a random string. It is actually highly redundant, and what makes it difficult as well is that it is not exact duplication that is really the main factor; it is similar duplications. This is why we really want to be able to consider more hits per seed, but then of course the downside is that if we do consider more hits per seed the speed takes a big hit. How can we improve our accuracy, reduce our errors, and get a higher percent aligned while avoiding dropping off to the bottom right of that curve? That's the part of this that we are looking at now. As I was mentioning, we have these similar regions in the genome that make alignment difficult. Back to the seeds explanation and the false positives: the odds of finding too many matches in the genome would be very low if the random string assumption held, but the fact that we have similar regions is what makes alignment take longer for us. Here is an example. You see that it is colored by the type of base, so when we see a solid color column, that means that all of the strings are matching exactly, and when we see different colors in a column that means that some of the strings have differences from each other. These are some strings that we found via clustering in chromosome 22 and they are all unique. There are around 400 of them.
However, if we find the consensus string for that group of substrings and we find the average distance from each string in the cluster to that consensus, it's very low, only about six edits. This is the phenomenon that is causing us to spend most of the cycles in the alignment process. >>: Are those all from the same person? >> Kristal Curtis: This is from the reference genome. >>: People are almost identical. There is one part per thousand difference between people so [inaudible]. >> Kristal Curtis: Yeah. >>: [inaudible] person? >> Kristal Curtis: No, it's actually kind of a mix, because different sequencing centers were working on it, so they all submitted part of it and it's all kind of stitched together. What we really want to be able to do is test our read against the entire group of strings; however, that takes a really long time. So we have this trade-off, which we kind of looked at, and if we are too aggressive in lowering the max hits then we are going to be paying the price in error. Just to illustrate this graphically, what we have in many cases are these reads that match against these similar regions, and so based on how we set max hits we have some different problems. If we set it too low, that means we won't try any of the locations that the read is going to match against, because all of the seeds have too many hits, so we won't be able to align that one at all. If we set it too high, we will try all of the locations, but then we will spend an inordinate amount of time trying to align that read. So what happens in practice, because we have some middle-of-the-road setting for max hits, is that we end up testing some of the locations where the read would match, but not all of them, and this is where most of our alignment errors are coming from. We test it against a few locations, we find one that works well, and we record that one; however, there was actually a better one somewhere else that we didn't test.
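The consensus-and-average-distance observation above can be illustrated with a small sketch (toy eight-base strings; the actual clusters contain hundreds of much longer substrings, and distances there are edit distances rather than the simple per-column mismatches used here):

```python
from collections import Counter

def consensus(strings):
    """Per-column majority vote over equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def avg_dist_to_consensus(strings):
    """Average number of per-column mismatches from each string to the consensus."""
    cons = consensus(strings)
    dists = [sum(a != b for a, b in zip(s, cons)) for s in strings]
    return sum(dists) / len(dists)
```

A cluster whose members sit only a handful of mismatches from their consensus is exactly the "similar but not identical duplication" case that defeats a naive max-hits cutoff.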
So this is the main thing that we are trying to address. This is important, definitely, for the downstream processing, so we really want to get this right. What is our approach to fixing this? Well, first we are working on precomputing those similar regions in advance so that we can use that information quickly during the alignment process. Second, rather than getting the cluster and then comparing individually against each string in that cluster, we are exploring a representation that reuses the work across those comparisons so that you can do them efficiently. Currently this is a work in progress, but what we have so far is a parallelizable algorithm for detecting similar regions using Spark, which is a project that Matei has worked on with some folks at Berkeley and is a really good framework for cluster computing, and we also have a group edit distance algorithm that gets quite a good speedup over comparing against all of the strings fully. The way that we are detecting these similar regions is very simple, but I'll just quickly illustrate it graphically. For each substring in the genome, we represent it as a node in the graph and then we look at the other strings in the genome and we compute adjacencies. So this one has only two edits from the original substring, so we draw an edge. However, in some cases that have too many edits we don't draw an edge. Then whenever we find that two clusters share a member, or have members that match each other well, we merge those clusters, and we continue to do that, and eventually we end up with disjoint clusters, and each one of those would be the group that we want to compare a read against.
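The merge procedure just described is essentially union-find over an implicit similarity graph. A single-node sketch might look like the following (quadratic and in-memory, unlike the partitioned Spark version the talk turns to next; the per-column distance function and threshold are simplified stand-ins for the real edit-distance adjacency test):

```python
def mismatches(a, b):
    """Per-column mismatch count between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_substrings(strings, max_edits):
    """Link any two substrings within max_edits of each other, then read off
    the connected components as the similar-region clusters."""
    uf = UnionFind(len(strings))
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if mismatches(strings[i], strings[j]) <= max_edits:
                uf.union(i, j)
    clusters = {}
    for i in range(len(strings)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())
```

Merging whenever two clusters share a well-matching member is exactly the union step; the disjoint clusters at the end are the find-roots.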
Of course this is expensive because we have this n by n adjacency matrix that we want to compute over the whole genome, so it takes too long to do it naïvely. The way that we have worked on this is we have partitioned this matrix so that each Spark task will be working on a separate piece, and we index only the part of the genome that that task is responsible for. Then we run our union-find so that we get a set of clusters, and we do this for each of our tasks. Now we want to be able to merge these clusters so that we get the actual set for the entire genome, and the way that we do that, remember that each cluster has a bunch of different strings in it, and whenever we find strings that match across clusters we will merge those, and we continue that for the whole batch and then we get some final set of clusters over the entire genome. Very simple. Yes? >>: Are you going to do this once per species? >> Kristal Curtis: Definitely, you would. So far… >>: Why not just do it, take a few weeks and [laughter]. >> Kristal Curtis: Yes, I guess you could certainly do that. We've been focusing on only human [laughter]. We have been focusing on only human. You could certainly run this on whatever reference genomes you wanted to work with, but we focused on getting a parallel implementation just because it's proved to be intractable to work on in a single node. >>: I thought you only needed to learn it once per species when it's done when you're developing it [laughter]. >>: [inaudible] once per species for Kristal’s own sanity. >>: Yeah, the length of the clusters might change. You don't know what the best way [inaudible]. >> Kristal Curtis: Yeah, we are still working on fine tuning the parameters, so we would like to have a shorter cycle for running it. This is something that we precompute, so it's not something that's going to affect alignment performance.
How do we use this information in SNAP once we’ve gotten it? Back to the case where we have a read that is going to match against a bunch of different places in the genome: we notice that one seed is matching against one of these flagged locations, and we can flag them in advance if they belong to a cluster. Then we just grab all of the fellow cluster members of that location and use that aggregate edit distance algorithm I was mentioning to test against the entire group, rather than testing against some fraction like the current version of SNAP does. In this way we can get some of the best of both worlds, in that we are saving time but we are still able to compare against all of the strings in that group, so we are able to avoid errors. In fact, what we see is that once we've incorporated this into SNAP, if we compare against the standard form of SNAP, we get a big reduction in error. This confirms that the errors we were getting from the standard version of SNAP really are caused by these similar regions. As I'll get into in the last part of the talk, these alignment errors can really be important downstream, so even though it's a small fraction overall, it can lead us to draw some bad conclusions if we have those errors. What is the rest of the pipeline? Back in the beginning of the talk, Matei talked about how the idea of reconstructing a genome is taking all of these reads and aligning them. That's what we have been talking about so far, and then the main goal is to know what that individual actually has in their genome at each position. That is a process called variant calling. In the simplest view it is just this idea of taking a consensus of all of the reads that map to a particular location and seeing what they indicate about the genome.
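That simplest consensus view of variant calling can be sketched minimally (the vote-fraction and depth thresholds below are arbitrary toy assumptions; real callers model base qualities and error rates probabilistically rather than counting votes):

```python
from collections import Counter

def call_site(bases, min_frac=0.2, min_depth=4):
    """Naive genotype call from the bases that reads report at one site.
    Any allele seen in at least min_frac of the reads is called: two alleles
    suggest a heterozygous site, one suggests a homozygous site."""
    if len(bases) < min_depth:
        return None  # too little coverage to call anything
    counts = Counter(bases)
    alleles = sorted(b for b, c in counts.items() if c / len(bases) >= min_frac)
    return "/".join(alleles) if alleles else None
```

Even this toy version shows why the problem is hard: a couple of erroneous reads can push a spurious allele over any fixed threshold, which is exactly the difficulty the next part of the talk describes.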
Actually this is a very difficult process and we really are focusing on trying to improve its accuracy. So why is variant calling difficult? There are several factors. First, there are some that are just inherent to the genome that make it difficult. These aren't going to be going away. The other ones have to do with technology, so they can potentially be targeted, but they are things that we have to cope with for now. One, when we are actually taking the DNA out of the cells and producing the sample that is going to be sequenced by the machine, we have this process called PCR, polymerase chain reaction, and this is not an unbiased process for amplifying the DNA so that we have enough to sequence. We also have systematic errors that the sequencers make when they are producing the DNA reads, and as we talked about, the alignment errors are also going to confound variant calling somewhat. Then heterozygosity: what this means is that all of your chromosomes come in pairs that are the same length, and you get one from your mother and one from your father. At each point on a pair of chromosomes, if you have the same value on both chromosomes of the pair, then this is called homozygosity, but if you have different values on the two then this is called heterozygosity. So why is that a problem? Well, when we get reads that map to that same location on the genome, they could tell you different things about what the person actually has at that site, so it's not always straightforward to know: is this actual, true heterozygosity, or is it just that we have gotten some sequencing errors from a couple of the reads? So this is one of the challenges. Another challenge, as Matei was alluding to, is that we have something called structural variants, which are larger-scale changes.
So far we have been talking about mostly mutations that are one, maybe up to a few, bases in length, but these are changes that can be arbitrarily large, even up into the megabases. Especially in the case of cancer you really see a lot of these large-scale changes, so what do they look like? Well, first you can have a deletion from the reference genome, and one way that you can detect this is that you have a bunch of reads that map normally, but then you have a segment of your reference genome where no reads map at all. Also, Matei mentioned that you get these reads in pairs. Usually you know how far apart those pairs are, and sometimes you see that the distance between pairs is off from what it should be given the parameters of the sequencer, so in this case reads that actually come from a certain distance apart are looking like they are way too close together. That's one large-scale problem. Another thing is you can have a sequence inserted. This can be from somewhere else in the genome or it can be a completely novel sequence that you won't be able to detect at all. In the case of a completely novel sequence being inserted, you'll have some reads from the patient's genome that can't be mapped anywhere in the reference genome because it is a novel sequence. Then also you will have some reads that should be close together in the reference genome mapping way too far apart in the patient's genome. These are also some of the things that make variant calling more difficult. Another is you can have… >>: How would you know where they mapped in the patient's genome to know that they were far apart in mapping? Or is that the two paired ends? >> Kristal Curtis: Yeah, yeah, you would definitely always see some paired ends. That's a very valuable signal. One of the trickier types of structural variants to look at are these duplications.
In this case this is called a tandem duplication, because you have part of the sequence duplicated so that it's adjacent to the original sequence. What makes this difficult is, for one thing, that you see too many reads mapping to that portion of the reference genome, but then you also have some reads that come from the overlap of the two copies that are not going to be mappable because they are just going to look funny. These are very difficult to detect and they are definitely an open problem for developing better tools, and we are looking into that. Another thing that makes variant calling challenging is that, as I was alluding to, the process of converting DNA to a bunch of short fragments that can be sampled during the sequencing is biased, and what are the biases? One is the molecule length and another is the composition of the molecule. >>: Are existing tools trying to find those or are they mostly [inaudible]? >> Kristal Curtis: There has definitely been a lot of work on finding structural variants, but the accuracy is still pretty bad, and also each tool is kind of a point solution, so some are good at finding short insertions, some are good at finding large deletions, and some can find more types of structural variants but not very precisely. It's still not clear how to do it in a way that is holistic and can find all of the different structural variants. That is why that problem is still very interesting, especially because structural variants actually have a lot of interesting influence on cancer and other kinds of diseases. >>: With a mutation, does it look like kind of heterozygosity, or does a mutation go to both chromosomes? >> Kristal Curtis: Both can happen. Of course, usually you would have only one, but then in some cases you might have the gene knocked out on both of the chromosomes. >>: I guess I don't know what causes mutations.
[inaudible] would only affect one chromosome but… >> Kristal Curtis: Yeah, they can, but that's kind of why cancer takes a long time to get, because it's just accumulating the bad luck over many, many years. >>: So what happens is, in the human population there were mutations on the [inaudible] that are now part of the population. There isn't really one gold standard human genome and people have mistakes relative to it. There is just diversity. >>: You can wind up inheriting two copies of--you look at a bit of the population tree and maybe there is some mutation that happened in chromosome 14, and one individual got a copy of that in the germ line and it spread throughout the population, and maybe both of your parents got it through, you know, the fact that your great, great, great grandparents sometimes are the same as each other… [laughter]. >>: [inaudible] necessary. >>: That's not true for everybody, right? The whole thing is, when you start counting back generations you get to a population that is bigger than the planet, and people get confused and say what happened. Well, what happened is somebody had multiple lines back to the same grandparent, and that kind of thing can lead to getting mutations in both copies of [inaudible] chromosomes that came to you. >> Kristal Curtis: For sample prep, what makes this difficult is that there are biases toward the size and the motif, and what this looks like is you start with your original full strand of DNA that you have gathered from the sample. Then you send it through some process that fragments the DNA, and then we apply PCR to amplify those fragments so that they can be sequenced through the sequencer, and then you see that some fragments are preferentially amplified. This means, like I was showing you here with the higher coverage depth over that duplication--well, sometimes a higher coverage depth can just be because of the way the sample is prepared.
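The paired-end distance signal for structural variants mentioned a little earlier, which this amplification bias complicates, can be sketched as a simple filter. The insert-size numbers and tuple layout below are made-up; real structural variant callers combine this signal with coverage depth and split-read evidence rather than using it alone:

```python
def flag_insert_anomalies(pairs, expected, tolerance):
    """Flag read pairs whose mapped distance on the reference deviates from
    the sequencer's expected insert size: a much larger span suggests a
    deletion in the sample, a much smaller span suggests an insertion."""
    flags = []
    for name, left, right in pairs:  # (pair name, leftmost pos, rightmost pos)
        span = right - left
        if span > expected + tolerance:
            flags.append((name, "possible deletion"))
        elif span < expected - tolerance:
            flags.append((name, "possible insertion"))
    return flags
```

The PCR bias just described is why this signal is noisy in practice: anomalous coverage or spacing can come from sample preparation rather than from a real variant.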
Another thing that is pretty subtle is strand bias. I will walk through this. This is from a visualization tool from some folks that created a lot of libraries for looking at genomic data, and what you see at the top, this T, T, T, C et cetera, is the reference genome at this position; this is in chromosome 20. What you see are these little dots and commas. These are reads. The dots are reads that are mapping in the forward direction and the commas are reads that are mapping in the reverse direction. We talked about paired end, and what happens with paired end is you read them in opposite directions, so one gets read forward and one gets read in reverse. >> Bill Bolosky: DNA is paired; you have the double helix thing that you always see pictures of, and in the double helix there are types of bases that pair together, and the strands are directional. The direction of one strand is one way and the other is the other, so the information in the two strands of the helix is exactly the same but in opposite senses, and when the machine goes out and reads them you can't tell which one you've got. So one of the problems that comes up in alignment and mapping that we didn't mention is that you have to check both senses, whether it's forward or reverse complement, and then this just tells you in what sense the thing was matched. >> Kristal Curtis: Thanks. >>: This isn't a class in biology. [laughter]. >> Kristal Curtis: So what we have here: the reads that you can see with these commas are showing a bunch of T’s, and where you see letters, the reads are reporting a difference between the individual and the reference genome. You see that these T’s are only showing up on the reads that are going in the reverse direction, and so this is a case where we don't want to trust the reads when they are telling us there is a difference from the reference genome, because they are only showing up in one direction and not the other.
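A very crude version of the strand-bias filter being described might look like this (the 10% cutoff is an arbitrary assumption for illustration; production callers typically apply a statistical test such as a Fisher exact test to the 2x2 table of strand-by-allele read counts instead):

```python
def strand_biased(fwd_alt, rev_alt, fwd_ref, rev_ref):
    """Distrust a variant whose supporting reads come (almost) entirely from
    one strand even though reference-matching reads cover both strands."""
    if fwd_alt + rev_alt == 0 or fwd_ref == 0 or rev_ref == 0:
        return False  # no variant evidence, or no two-strand baseline to judge by
    frac_fwd = fwd_alt / (fwd_alt + rev_alt)
    return frac_fwd < 0.1 or frac_fwd > 0.9
```

In the pileup example above, the T's appearing only on reverse-strand reads would trip exactly this kind of check.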
This is just one, and probably the simplest, example of a systematic sequencing error that you can get, but we definitely have some other cases as well. These systematic errors in sequencing can also show up based on the motif, so what we see here is that, for one thing, we see more errors when the reference sequence is a T, which we see with the red T there. The other thing is that we see more errors when the two bases preceding the position are both G’s. Just because of the way the chemistry works in reading off the DNA in the sequencer, it tends to have a harder time in these cases. The fact that these errors aren't uniform makes this process more difficult. A further thing is that variant calling is also very difficult to evaluate, because we have very limited ground truth for it. One thing that you can do is sequence samples of the genome with different sorts of technologies and then compare what you get with the short reads to what you get with these other technologies, but there really aren't a lot of standard benchmarks that you can use to evaluate variant calling, and in fact all of the variant calling work thus far has kind of relied on ad hoc methods of evaluation. One thing that we are working on is developing more standard benchmarks for our own development, just so we can know if we are actually improving based on the things that we are doing. Another thing that you can do, if you have the data, that is really cool is using trio data. So what is a trio? That is two parents and their child, and the reason that this is really valuable is because there are certain assumptions that you can make: the child's DNA has to be consistent with the parents’ DNA.
The reason you can make this assumption is that the rate of novel mutations from the parents’ DNA to their child is very low, so if you are seeing something strange, it is much more likely that it is a problem with the sequencing, the variant calling and whatnot, rather than a truly novel mutation. In some of these examples, you see that the child's DNA is consistent with the parents’ DNA because each of the child's alleles at a site comes from one of the parents, but then, for example, these cases are invalid because the child's DNA doesn't match against one of the parents, so this is another really valuable signal that we can use when we are evaluating SNP calling, when we have trio data. This is still very preliminary, but one thing we did is we wanted to check against the existing tools, get more of a clear benchmark as to how they were performing, and also be able to try out some simple ideas and see if we could do as well as or even better than existing tools. So what did we do? We took some real data. It is called the Yoruban trio; it's from some individuals from Africa, so we have two parents and a child. We looked at only one chromosome. We aligned all of the data against the reference and took the reads that match chromosome 20, which is one of the smallest ones. We realigned it, and once we had those reads filtered onto chromosome 20, we ran this SNP caller that we developed as a prototype, and we evaluated it based on those Mendelian conflict assumptions that I just mentioned. The other thing that we did is we looked at simulated data. This is from the simulator that we have been working on in our group, and it's meant to be more realistic than existing simulators because it uses data with real SNPs that people have found in various sequencing projects. We restricted the simulator to only include single substitutions, so it doesn't have any short [inaudible] or any structural variants.
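The Mendelian conflict check used for that evaluation can be sketched as follows (genotypes as unordered allele pairs; this toy version ignores the sex chromosomes and treats any inconsistency as a conflict, even though a genuine de novo mutation is occasionally the real explanation):

```python
def mendelian_conflict(child, mother, father):
    """A child's genotype is consistent if one allele can come from the
    mother and the other from the father; otherwise flag a conflict."""
    a, b = child
    return not ((a in mother and b in father) or (b in mother and a in father))

def conflict_rate(trio_calls):
    """Fraction of called sites with a Mendelian conflict, given
    (child, mother, father) genotype triples."""
    flags = [mendelian_conflict(c, m, f) for c, m, f in trio_calls]
    return sum(flags) / len(flags)
```

Because true de novo mutations are so rare, a high conflict rate is a usable proxy for caller error even without ground truth.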
In our thinking this simulated case should have been pretty easy. We did something similar. But of course in the simulated case we know what the variants are, so we can actually measure accuracy there. Here are the preliminary results. We looked at two different pipelines. This is kind of a preliminary evaluation, but we looked at the CASAVA pipeline, which is produced by Illumina, the main sequencing company that we have been talking about, and it called about 128,000 SNPs and had 162 conflicts. We also looked at the GATK on the simulated data; GATK stands for genome analysis toolkit. It comes from the Broad Institute in Massachusetts and it's pretty much the state of the art for variant calling. This is the one that is most widely used by practitioners throughout the field, and what we found on the simulated data is that even though it was very simple, there were still 66 false positives in a pretty short amount of DNA, just from chromosome 20. It's also very slow; it took four hours to do this process. So how did we do? On the trio, while calling about the same number of SNPs, we got about half the conflict rate, and on simulated data we got an order of magnitude fewer errors. And this is something that can run in two minutes instead of four hours. We think that we have some insights based on this and that we should be able to further improve it to work on the full genome, and also extend to other types of mutations beyond plain SNPs. Looking forward, our efforts over the next few months are focusing on formalizing this feasibility study into a full-scale pipeline that takes us from DNA reads to a fully reconstructed genome. Looking at the way this is currently done, people have broken it up into individual stages and then separately optimized each stage.
In order to get the data from one stage to the next, they use a lot of thresholding, so whenever there is uncertainty, you just cut it off, report one value, and then push the data on to let the next stage handle it. Leveraging all of the work that's been done, instead of this kind of disjoint pipeline, we are working on producing an end-to-end system that, rather than truncating information very early on in the process, propagates it through the entire process, so we are able to use that uncertainty in a better way than just throwing it away early. Just to pop up a level, as Matei mentioned, DNA sequencing is getting really realistic in terms of affordability and its accuracy is improving, and this is really going to have great impact on medicine in a variety of cases. One of the things that we are really excited about is producing better treatments for cancer. The cool thing for us is that it involves a lot of computational challenges in addition to wet lab challenges, so we feel like we can participate, because a lot of the existing tools lack both speed and accuracy, and that's becoming a big problem if you really want to use this in a clinical setting. Based on our work, mostly with SNAP but also our initial work with the rest of the pipeline, we have some evidence that we can help with both of those. If you are interested you can check out our website. We have made this open source and available for download. Actually, we presented it at a conference a few days ago and we've already gotten a lot of interest, people trying it out. So, I don't know if there are any more questions. >>: Speaking of the end-to-end process, we saw the error rates for alignment and I'm curious if you have run any studies to see how that propagates down to variant calling.
>> Kristal Curtis: We've done some benchmarking, but it's been mostly just running BWA, since that's kind of the standard, and seeing what our ultimate variant calling results are. We haven't really done something where we vary the parameters so that we get better or worse alignment accuracy and see how that affects the variant calling. Definitely that is something that we are interested in looking at. >>: Even in the comparison of SNAP versus BWA, can you see that the difference in error rates actually helps with the variant calling? >> Matei Zaharia: In the thing that was shown there was at least one example, like reads that BWA misaligned and we didn't, which caused it to call mutations, but it's a very anecdotal setting. We need to do the same thing on a whole genome and, you know, we just haven't gotten very far with it yet. >> Kristal Curtis: And part of the obstacle is just that GATK is so slow, right? We haven't even been able to run it on a whole genome because it's so expensive to run. >> Bill Bolosky: At a conference we were at, I listened to the guy from the thousand genomes project talking about this, and the errors that have come out of BWA have caused them such problems that he was talking about giving up on re-mapping altogether and doing another assembly for all of the genomes because it was driving him bonkers, so that's some evidence that it's really a problem. We're hoping we can do better, and by building an integrated pipeline you can correct for the errors later. I think [inaudible]; does GATK do anything with unmapped reads? >> Kristal Curtis: No, not with unmapped reads. >> Bill Bolosky: Our plan is to use unmapped reads and to remap stuff as you're looking at it, because there is lots of stuff that you can tell is just wrong. [inaudible] reason you can't tell [inaudible].
>> Kristal Curtis: Other tools do look at unmapped reads, but they are not integrated with GATK, so that's another obstacle: just making everything work together. Even in our benchmarking process it was pretty painful when we tried some of these things out. >>: I was wondering about something after seeing part of the talk. The approach is to take seeds from the reads and match them to the genome, and depending on the depth of coverage you are going to have many more seeds to look at than places in the genome to compare against. But the theoretical complexity, not the real complexity, seems to be the same if you go the other way around, if you actually take the seeds from the genome and try to find them in the reads. Again, you could do the same trick with stopping early, but now the stopping would be based on a different criterion: you want to see a certain number of matches, because you know roughly what the coverage is, so you stop after you've matched a certain number of times. Has anybody thought about whether that could be worked out to be almost as efficient as the other way around? Because then some of these structural variant and other issues would perhaps be easier after the fact. You would pay a bit of a price, but I don't know. I haven't thought about it much; it just occurred to me now. >> Bill Bolosky: We have thought about doing something similar to that for trying to do structural variants. The thought was you go through and get rid of the easy stuff, because most of it is easy. Most of the stuff maps unambiguously with few errors, and you can look at it, call it, and get rid of it, and then go back to the remaining reads and try to build them into context, and you can find overlaps [inaudible]. [inaudible] overlap, then you build long stretches of DNA that they seem to represent and you try to fit them into the places where you found breakpoints in the genome. 
At that point you could do that. Whether it makes sense to invert the process from the beginning, I'm not sure. >>: The thing I don't understand is that it doesn't really help with the structural variants, because it still tells you these reads look like they came from here, but it seems like ultimately you need to do this detective work at the end. Maybe it would help you with the reads that are split across both ends. But some algorithms have done this. For example, some versions of BLAST do this: they take the reads, say one million at a time, index those, and then scan through the reference genome when it's too large for them, so in a way it's been done. I'd have to think more about how it affects what you can do. >>: I haven't thought about it much, but when you think about this problem, you always think about whether you map the reads. >>: Yeah, yeah. >>: But I don't know. In the end, if it's through the alignment, then maybe inverting the process could be worked out. It's kind of straightforward. >> Kristal Curtis: There definitely have been some aligners that choose to index the reads instead of the reference. They are a little bit old, so I don't know if we could really compare directly against their performance. There is also one aligner, whose goal is to find all of the matches for any read, that indexes both the reference and the reads and then sort of walks through them that way. There has also been some work on detecting structural variants where you use assembly to reconstruct the patient's genome, and then you align the reads both to the reference and to that reconstructed genome, and that's a signal for how you can locate these structural variants. >>: Even though that may have been done in the past, just like the way BLAST was using the seeds, they were able to use that idea much better because a lot of these things went unwritten. They are imperfect so they… >> Kristal Curtis: Right. 
That could be an interesting thing to incorporate. >>: Yeah, [inaudible] subset of the reads for which it works very well. If you are dealing with reads that only have a couple of matches in the genome, the current process is probably the best, right? It really depends on how easy it is to find the match in the reference string. >>: But for reads which have a bunch of matches in the genome, the first approach would be better, so I think that's [inaudible]. >>: Suppose you have a specific goal in mind for one particular patient: you don't need the entire sequence, but you want to know which [inaudible] colon cancers they have. Can you make it holistic even beyond that and say, this is the end goal; I want to update the whole pipeline to optimize for answering that one question I have? >> Kristal Curtis: I see, yeah. One thing people do is narrow their focus when they gather the initial reads. They get them from what's called the exome, which is the coding sequence, or the genes that are in your genome, because that's actually only about 1% of the entire genome. You also have all of this other stuff that orchestrates which genes get expressed, or that could have just been randomly inserted by evolution; we don't really have a good sense for it. Some people do just pull out that exome, sequence it, and then see if they have mutations where they think they might be, but the problem with doing that is that there could be a lot of other related mutations that you are missing. So it's kind of a looking-under-the-lamppost approach, which is somewhat problematic when we still don't understand a lot of the other things that could be related. 
But in terms of how the pipeline might change, if you're focusing on certain areas then everything becomes a lot easier, because you have really narrowed down the uncertainty, so you know which parts of the genome you're looking at, and it's also going to make a lot of the stuff we were talking about with similar regions not as much of an issue. >> Bill Bolosky: Okay. Anything else? Thank you. [applause].
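The seed-and-lookup alignment scheme discussed in the questions above, where you index short substrings (seeds) of the reference in a hash table, look up seeds drawn from each read, and stop early once one candidate location has enough supporting hits, can be sketched roughly as follows. This is a minimal toy illustration, not the SNAP implementation; the seed length, the early-stop threshold, and all names are hypothetical, and real aligners additionally verify each candidate with an edit-distance check.

```python
from collections import defaultdict

SEED_LEN = 4  # toy value; practical aligners use seeds of roughly 20 bases

def build_index(reference):
    """Map every length-SEED_LEN substring of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - SEED_LEN + 1):
        index[reference[i:i + SEED_LEN]].append(i)
    return index

def align_read(read, index, enough_hits=2):
    """Look up each seed of the read and vote for the alignment start it implies.
    Stop early as soon as one candidate start has enough supporting seeds."""
    votes = defaultdict(int)
    for offset in range(len(read) - SEED_LEN + 1):
        seed = read[offset:offset + SEED_LEN]
        for pos in index.get(seed, []):
            start = pos - offset  # start position implied by this seed hit
            votes[start] += 1
            if votes[start] >= enough_hits:
                return start      # early stop: confident enough
    # no early stop: return the best-supported candidate, if any
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTTTGACCA"
idx = build_index(ref)
print(align_read("GTTTGA", idx))  # the read occurs at position 6 of ref
```

The inverted variant raised by the questioner would swap the roles: index the seeds of the reads and stream the genome past the index, stopping once a read has accumulated the number of matches expected from the coverage. The hash-table-plus-early-stop structure is the same; only which side is indexed changes.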