>> Michael Zyskowski: Hello. Good afternoon. And thank you all for coming. My name is Mike
Zyskowski and I'm a program manager in the External Research Group in Microsoft, and I have
the pleasure of introducing Xin-Yi Chua, who is an intern in our group this summer.
And she is studying at the Queensland University of Technology, getting her Ph.D. in
bioinformatics. And she's joined us this summer to work on a project called GenoZoom. I'll let her
describe the project in detail. And if there are other follow-up questions after this event, or if you
wish to contact Xin-Yi later about this effort, her contact information is actually included in the
About dialogue box of this application. So without further ado, I'd like to introduce Xin-Yi.
>> Xin-Yi Chua: Thank you very much for the intro, Mike. So I'd like to first thank you for joining
me at the presentation. So I'll just go on ahead. So the name of the project was GenoZoom. I
do have quite a few slides. And they do cover a lot of content, but that was mainly because the
slides will be put up with the application itself.
But I won't necessarily go into every single point. Just a quick overview. So what was the motivation behind this project? Currently there are actually many publicly available genome browsers out there online, and we didn't want to reinvent the wheel, but we did notice a few things with the current browsers. They don't really scale well to the data. They don't provide a seamless user experience when you're going from low to high resolutions rapidly.
It's difficult to view your own sequence data in the application. So, for example, the UCSC
genome browser which is probably the most popular browser out there, to view your own genome
sequence, you would have to actually download the source code and then set it up yourself. And
it's quite difficult for a non-Unix expert.
And they don't really support unformatted user annotations. So you'll get to see a bit of that in my
demo later on. So the proposed solution for the project was to investigate how DeepZoom and
Silverlight would be able to address some of these issues and make the user experience much
more seamless and smooth.
So why DeepZoom, also known as Seadragon? So basically it was because it provided a nice way to navigate large-scale data and it optimized bandwidth. And it did this by creating an image pyramid of your high-resolution image data. And it only downloads the tiles that you are viewing.
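[A quick numeric sketch of what such an image pyramid looks like. This is a generic illustration: the power-of-two levels and the 256-pixel tile size are standard DeepZoom assumptions, not figures from the talk.]

import math

# Rough sketch of a DeepZoom-style pyramid: each level halves the previous
# one down to roughly a single pixel, and the viewer only fetches the tiles
# that intersect the current viewport.
def levels(width, height):
    # top level holds the full image; level 0 is roughly a single pixel
    return int(math.ceil(math.log2(max(width, height)))) + 1

def size_at(width, height, level, max_level):
    scale = 2 ** (max_level - level)
    return math.ceil(width / scale), math.ceil(height / scale)

w, h = 8192, 6144                    # an ordinary high-resolution photo
top = levels(w, h) - 1
for lv in range(top, top - 4, -1):   # the four sharpest levels
    lw, lh = size_at(w, h, lv, top)
    tiles = math.ceil(lw / 256) * math.ceil(lh / 256)
    print(lv, (lw, lh), tiles, "tiles")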
So further advantages of actually using DeepZoom and applying it to the genome space is that it
takes care of the data sampling for you, because it already creates that image pyramid. With the
preprocessed images, the user can pretty much select the region of interest they're interested in,
so it jumps straight into the middle of the genome without downloading the entire file again.
And then with the different images created as different collections, you have the potential to mix
and match and create your own GenoZoom collection. So in the beginning, it all sounded very
cool, very nice. But along the way I did notice some limitations. So the major limitation of
DeepZoom, applying it to the genome browsing domain, was that DeepZoom was primarily designed for images that sort of go with the traditional 4:3 or 16:9 aspect ratios.
So it didn't lend itself well to the conventional genome images, which are really long and thin. So
this is an example of a DNA sequence. So the demo that I'll be demonstrating is actually
showing the E. coli genome. So it's a bacterium that lives in the gut. For a bacterium, it's about four and a half million base pairs. So four and a half million characters. At four pixels per base pair, we're looking at a really long image.
And only at eight pixels high. It's really long and thin. So the problem with using DeepZoom in
this case is that when you zoom out to have a view of the entire genome, that line is pretty much
invisible. You don't see it.
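[A back-of-the-envelope check of those numbers, assuming 256-pixel DeepZoom tiles; the tile size is an illustrative assumption.]

# Rough tile arithmetic for a genome rendered as one long, thin image.
genome_bp = 4_500_000        # ~4.5 million base pairs (E. coli)
px_per_bp = 4                # four pixels per base pair
height_px = 8                # eight pixels high

width_px = genome_bp * px_per_bp     # 18,000,000 pixels wide
tiles_across = -(-width_px // 256)   # ceiling division: ~70,313 tiles
aspect = width_px / height_px        # 2,250,000:1, nothing like 4:3 or 16:9
print(width_px, tiles_across, aspect)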
I tried compensating by increasing the height. So when you're zoomed out you could still see
images, but that has a performance hit, when you're generating the images, and it's pretty much a
waste of space because everything vertically is the same.
So the reason for this obstacle was primarily because of the way that DeepZoom works when
you're zooming into an image. It actually stretches the image horizontally and vertically, whereas
the desired behavior for a genome browser is just to move horizontally.
What I've got is an animation to demonstrate what I mean. So this is your low resolution image at the top there, level N. And then you get a higher resolution here at level N plus 1, and then N plus 2.
So what happens in DeepZoom is when you zoom in, you are actually stretching horizontally and
vertically, and then it replaces the tiles from the higher resolution, and then again and you're
replacing the tiles from the next zoom level.
So that's the actual behavior. The desired behavior for a browser would be at each zoom level,
the height of the image is actually the same. And what happens is when you zoom in, it only
stretches the image horizontally and then replaces the tiles.
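[A minimal sketch of the two behaviors described here, assuming the usual power-of-two relationship between pyramid levels; the numbers are illustrative.]

# DeepZoom-style uniform zoom vs. the horizontal-only zoom a genome
# browser wants: the track height should stay constant at every level.
def deepzoom_scale(level_delta, width, height):
    # actual behavior: both dimensions stretch before tiles are replaced
    factor = 2 ** level_delta
    return width * factor, height * factor

def genome_scale(level_delta, width, height):
    # desired behavior: only the horizontal axis stretches
    factor = 2 ** level_delta
    return width * factor, height

print(deepzoom_scale(2, 1024, 8))  # (4096, 32)
print(genome_scale(2, 1024, 8))    # (4096, 8)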
So I tried different ways to work around that, find out what I could do with images, how I could
play with them, just to get around that zooming obstacle. So one of the trials was I have just a single hosting control, and then I dynamically lay out all the images at runtime, specifying where each collection of images goes.
So at the top there, we've got the entire E. Coli genome. So it's called a gene density track. So
the blue lines are showing genes in the forward direction and the black lines are showing genes
in the reverse direction.
So what the red boxes mean is that if you zoom into a region there, you get sort of more
information coming up the deeper you go. So we go down to actual gene blocks with arrows
pointing in the direction. You get associated graph information. Keep going down and you
eventually get to the DNA sequence.
But the problem with this approach is that you sort of lose context. You can't see all that information in one go at that resolution. So you lose what happened to the peaks of the graphs.
So that was one problem. The next attempt that I did was actually go into the individual zoom
layer of DeepZoom image tiles and manually tweak those image tiles. So I changed the height at different zoom levels.
So the zoom levels are denoted by the numbers. But what we notice in here is that the zooming action
is no longer seamless. You get this sort of transparency happening in a popping type effect and it
just doesn't look really nice when you're zooming in and things are popping out at you.
And then I came across an approach called Dynamic DeepZoom. So with this one, instead of having all your image tiles preprocessed and hosted on the server, what happens is normally the MultiScaleImage control in Silverlight that hosts the DeepZoom images, whenever you zoom and pan, it actually sends HTTP requests to the server and it comes back with image tiles.
So instead of that, we intercept the HTTP request, write our own handler that generates the images on the fly and then sends those tiles back. So the classic example is the Mandelbrot plot. With that one, when you zoom in, it's recalculating what you see and then popping up
the patterns.
And I think this is a possible solution to the problem, and it's worth further investigation. The only reason I couldn't go further with it for the project was because it requires a database back end to store all the genomic data, and I was pretty much weighing doing that against the time left in the internship itself.
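[A minimal sketch of that dynamic-tile idea, written here as a generic Python HTTP handler rather than the Silverlight/ASP.NET handler the talk implies; the tile size, URL scheme, and render_tile placeholder are all illustrative assumptions.]

from http.server import BaseHTTPRequestHandler, HTTPServer
from io import BytesIO
import re

from PIL import Image  # any raster library would do; Pillow used here

TILE = 256  # assumed DeepZoom tile size

def render_tile(level, col, row):
    # Placeholder: render the genome features that fall inside this tile.
    # A real version would query a database of genes and annotations.
    return Image.new("RGB", (TILE, TILE), "white")

class TileHandler(BaseHTTPRequestHandler):
    # Matches DeepZoom-style tile URLs like /genozoom_files/12/34_0.png
    PATTERN = re.compile(r"/genozoom_files/(\d+)/(\d+)_(\d+)\.png$")

    def do_GET(self):
        m = self.PATTERN.match(self.path)
        if not m:
            self.send_error(404)
            return
        level, col, row = map(int, m.groups())
        buf = BytesIO()
        render_tile(level, col, row).save(buf, "PNG")
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.end_headers()
        self.wfile.write(buf.getvalue())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), TileHandler).serve_forever()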
So the proposed approach for our GenoZoom was we arrived at a three component-based view.
So I've got a navigation view at the top that shows the entire genome, and it's a static image. A
region view, which is a zoomed in location, and then a details view, which goes down to the base
pair level.
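[A tiny sketch of how the three views might stay synchronized around genome coordinates; the class and field names here are hypothetical, not the actual GenoZoom code.]

from dataclasses import dataclass

@dataclass
class Window:
    start: int  # base-pair coordinates
    end: int

@dataclass
class ViewState:
    genome_length: int
    region: Window    # red box drawn in the navigation view
    details: Window   # red box drawn in the region view

    def move_region(self, new_start):
        width = self.region.end - self.region.start
        new_start = max(0, min(new_start, self.genome_length - width))
        shift = new_start - self.region.start
        self.region = Window(new_start, new_start + width)
        # keep the details window anchored inside the moved region
        self.details = Window(self.details.start + shift,
                              self.details.end + shift)

state = ViewState(4_500_000, Window(0, 25_000), Window(10_000, 11_000))
state.move_region(2_000_000)
print(state.region, state.details)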
And I'll go into more depth in the demo. So just quickly before the demo, sort of tossing out
reasons why I selected Silverlight over WPF for this project. Basically, with DeepZoom in Silverlight we have the MultiScaleImage control, which is not supported in WPF. If I went with WPF, I would have to reimplement all the animations with zooming and
panning and actually have to handle data sampling as well.
And then with Silverlight, it's accessible via the Web. So that makes it really easy for biologists to share their research. And it also has the out-of-browser option, where you can download it onto your desktop and still interact with the application.
But with Silverlight, there were a few disadvantages. Primarily I couldn't use the MBF library to parse my data file to generate the image, the raw image that's required by DeepZoom. And image loading is dependent on the network. So hopefully we won't see that in the demo. But sometimes the image tiles freeze, so they stay blurry.
And with Silverlight you cannot access the local file system. But there is a workaround, which is
to host the images on your local machine, and it still works.
And then I'll look at the demo. Okay. So this is GenoZoom. So the top layer is the navigation
view. Actually, I'll use the mouse. It's easier. So what I'm showing here is the E. coli genome, which is the bacterium in our gut.
So the top layer is the navigation view, which shows the entire genome. Blue showing genes in
the forward direction and black showing genes in the reverse direction.
It's not very interesting because bacteria are small and pretty much the entire genome contains functional information. So if it was showing [inaudible] you would have a sparser
diagram.
The red box in the navigation view here corresponds to what we see in the region view. So
that's from one to 25,000. And then the red box in the region view there corresponds to the
details view.
So most of the interaction and action happens in these two components here. So what I can do
is pretty much just drag to a random location of interest, and the region and details views will correspond.
So this is something that the current genome browsers lack, that seamless sort of interaction; when you drag to a random region, you get sort of a loading screen, not the smooth animation that's happening here.
So let's see what's down in the details. So I can get actual DNA bases. I don't care about the first track. I'll move that. So GenoZoom supports unformatted user annotation. What that means is you can pretty much put in post-it note type information.
So I can change the coloring of that. And it pops up. And I can edit or delete that if I wish. What
I can also do is search genes in E. Coli. Please be kind. There we go.
So let's have a look at that one. It's related to toxins. So, again, I can add another push pin. And
I can also search my annotations similar to what I did with searching a gene.
Now to simplify sort of -- sorry. Not to simplify. What you can also do is configure the tracks. So
you can turn them on and off. And you can also add in your own custom data.
So if you have some that are already processed, it will come up. But they do have to be hosted,
either on a local machine or somewhere on the Web server. But as long as you have the http
URL you can add that in.
Now, one example of how GenoZoom deals with that horizontal and vertical -- sorry. Just let me restart. So while we wait, basically what I was going to mention is to demo what happens with the horizontal and vertical zooming action. If you have graphs with peaks and troughs, when you keep zooming in, the peaks and troughs pretty much clip. So I was trying to find different ways of visualizing that sort of data but still be able to sort of know what the values are associated with that location.
>>: What's going on with our server? I was actually having trouble with the server this morning. So --
>>: Where is the server?
>>: It's on the DMZ on an external --
>> Xin-Yi Chua: I'll just leave it and move on. So this was one of the issues that we noted, that
with DeepZoom, you can have the image freezing up on you. One of the reasons why this was
happening was because DeepZoom was set up to avoid possible DOS attacks. So if you do a lot
of zooming around and panning, that sort of thing, it sort of just stops you and stops the images
from downloading.
So unfortunately that is happening right now. But I'll move on. So the thing is that you can pretty
much install the application on the desktop, and if you had a net connection you could do all the
interactions I was showing in the browser.
I also created a tool, the GSIC, the GenoZoom Image Generator. Basically it's a tool to convert the GenBank file: put it through the MBF library to parse it, and generate the raw image data, which is then put through the DeepZoom tools to create the image pyramids.
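[An illustrative sketch of that pipeline in Python, with Biopython standing in for the MBF GenBank parser; the chunk size, colors, and file names are assumptions, and the final tiling step is left to the DeepZoom tools she mentions.]

from Bio import SeqIO             # pip install biopython
from PIL import Image, ImageDraw  # pip install pillow

PX_PER_BP = 4      # four pixels per base pair, as in the talk
TRACK_HEIGHT = 8   # eight pixels high
CHUNK_BP = 25_000  # render the long track as many smaller strip images

def render_chunk(record, chunk_index, out_path):
    start = chunk_index * CHUNK_BP
    end = min(start + CHUNK_BP, len(record.seq))
    img = Image.new("RGB", ((end - start) * PX_PER_BP, TRACK_HEIGHT), "white")
    draw = ImageDraw.Draw(img)
    for feat in record.features:
        if feat.type != "gene":
            continue
        f_start, f_end = int(feat.location.start), int(feat.location.end)
        if f_end < start or f_start > end:
            continue  # gene lies outside this chunk
        x0 = (max(f_start, start) - start) * PX_PER_BP
        x1 = (min(f_end, end) - start) * PX_PER_BP
        color = "blue" if feat.location.strand == 1 else "black"
        draw.rectangle([x0, 0, x1, TRACK_HEIGHT - 1], fill=color)
    img.save(out_path)

record = SeqIO.read("ecoli.gb", "genbank")  # hypothetical input file
render_chunk(record, 0, "gene_density_0000.png")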
Okay. Some of the known issues with the application. The image generation can be slow. But if
you have reference data, they don't change that often. So once you create it for the reference
data, it's not too big of an issue.
Data storage can be a problem, because I'm pretty much turning text into large image directories. But the dynamic DeepZoom I was mentioning before, with the database back end on the server, could potentially solve that problem.
Multiple mouse events before the animation has caught up can, I have seen, produce unpredictable behavior with the locations. And then with the details and range slider, so the red windows in your other views, sometimes the synchronization behavior there is a bit unexpected, because of where the input is coming from.
A few disadvantages of using DeepZoom in a genome browser are that I can't dynamically change visuals. What I mean by that is I can't select a group of genes and change their color, because I'm dealing with static images.
The custom data must be hosted by a Web server, which I mentioned. I guess in this case if
there is a multi-scale image control in WPF, that may solve that problem. And the data exposure
from text images is also another issue.
But with images, you could potentially display much more information than you get in the text. So
it's sort of weighing a trade-off there.
Performance is limited by the network connection, and again, if the images were on a local host, performance itself could be faster.
Okay. But the advantages. It does produce a smoother user experience. What I should do is
actually demonstrate one of the other genome browsers that are currently out there at the
moment. If you're interested, I'll be happy to demo that after.
You do get quick navigation from low to high resolution. And with the tagging, it supports
unformatted user annotation. So what that means is basically you don't have to conform to what
the server requires. You can just put in push pin or post-it note type tags. And I have the
intention of converting those tags into sort of tracks so you can save them and upload them for
future use.
It's easier for the user to view their own data. So they don't need to actually download that and set up their own servers; as long as they've got their data converted into the image pyramid, then they can get the URL and add it into the application itself.
And that sort of lends itself to the potential for the user to create their own sort of GenoZoom collection. So if they're happy with all these different tracks and different information, then
they can put them together and save it.
Because I'm dealing with images, the application itself is, to put it bluntly, dumb to what it's hosting. It's sort of just images. Potentially you can generate any sort of image you want as long as they line up with the genome location coordinates. And I have a whole list of
future work but I'll concentrate on a few.
The first one is I would really like to look at creating this using the dynamic approach and using
the database in the back end to see how it would solve some of the issues that I mentioned
previously.
Integration with Pivot. That was actually on the agenda but we just ran out of time. And linking to
external sources. So a whole list up there. And the last one is actually, I would like to compare
the performance of this compared to, like, a pure Silverlight or WPF version, actually drawing all
those features every time you zoom and pan. So just to see what the effects are there.
And to close off, I'd just like to say a special thank you to my mentor, Mike, and to Simon, Bob and Vince, who I worked really closely with throughout this summer and who gave me feedback for the project, which I
really appreciate. And a whole lot of other teams that helped make this happen. Sort of
dangerous putting a thank you slide -- if you don't find your name up there, thank you. And lastly,
I'd like to thank you as well for turning out for the presentation. Really appreciate that.
And questions?
>>: I saw the search can search for something like toxin, which implies that somebody already identified its purpose. But what if I wanted to go look for some [inaudible] sequence I found in some other bacteria, could I type in base pairs and have it find it for me?
>> Xin-Yi Chua: Not in the prototype version. Sort of the issue with that is, as I previously said,
the GenoZoom is sort of [inaudible], because it's only hosting static images. If I wanted to do
that, I would also have to send the underlying base pair sequence. If I did that, it could
essentially do that or you could sort of do all the searches using external tools like MIM or sort of
pattern recognition tools.
>>: [inaudible].
>> Xin-Yi Chua: Sorry?
>>: Then it would tell me where to start.
>> Xin-Yi Chua: Yes, then you could upload that information as a track.
>>: Will this tool be published outside of Microsoft? Other people can access it?
>> Xin-Yi Chua: Yes, I should have mentioned that. The intention is this will be open source and it will be put onto CodePlex, so the source code will be available and you can download it. The
website I was working from is actually the MBF server. It might be a temporary link. We haven't
worked out if it's going to be hosted somewhere yet. That's going to be -- I'm going to talk to my
mentor about that.
>>: We'll end up keeping it there at a minimum so people outside of Microsoft can access it,
demo it as well as the source code. There will be a document as well as this presentation.
>> Xin-Yi Chua: Yes. So there will be a technical document which goes more into sort of the
classes and structures and all that stuff.
>>: [inaudible].
>> Xin-Yi Chua: Yes.
>>: Can you comment a little more on the problems you had with image generation, because I have some of the same problems? The DeepZoom image generation, how long it takes, because a lot of us have those problems. What kind of possibilities are there that might help that?
>> Xin-Yi Chua: Okay. So with image generation, my sort of bottleneck was actually producing
the raw images that would feed as input into DeepZoom itself. So it wasn't actually DeepZoom
making the pyramids. That was all right. But it was actually generating four and a half million characters and then all that. So I was looking at 1500 images. And then for each track I was looking at 1500 images. And those were the bottlenecks. So one way of trying to address that was parallelizing the code.
I think it depends on your underlying structure. At times it saved me time -- sorry, improved performance -- but at other times, because it was a contiguous block, I needed to know what images were in my previous block, so I couldn't do that in parallel. But I think that just needs
a rewrite of the code itself.
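[A minimal sketch of the kind of parallelization described here, for the case where chunks are independent; the worker is a trivial placeholder, and the chunk count comes from the talk.]

from concurrent.futures import ProcessPoolExecutor
from PIL import Image

CHUNK_PX = 12_000  # illustrative chunk width in pixels

def render_strip(i):
    # Placeholder worker: render one strip image for chunk i.
    # Only safe to parallelize when a chunk does not depend on the
    # contents of the previous block, as noted above.
    img = Image.new("RGB", (CHUNK_PX, 8), "white")
    out = f"gene_density_{i:04d}.png"
    img.save(out)
    return out

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # roughly 1500 chunk images per track, as mentioned in the talk
        for path in pool.map(render_strip, range(1500)):
            pass  # each path is one finished chunk image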
>>: So did you have a chance to have anybody in the domain really look at this, to see if it's significantly better, if some of the ideas in the transitions and animations are helping in maintaining context when they're browsing, or what other feedback have you had a chance to get from anybody back in Australia or wherever?
>> Xin-Yi Chua: So short answer, no. Longer answer, earlier sort of mid-internship I was doing
the user experience studies, at that time I had a very, very early prototype version of GenoZoom, and I sort of handed it out to people to get their reaction. One person liked the whole
animation, smooth navigation, that was positive feedback.
But, I think, generally, we didn't see that many people and generally it depends on what they
were working on because the trend is that the biologists tend to stay within a specific zoom level.
They don't really go all the way from the genome down to the DNA. They tend to sort of stay at the top, with the navigation and region views, or with the region and details views.
Yeah, I think we need more sort of user feedback on that. And then in the other space, I did send a
link to David Heckerman so I'm hoping to get some feedback from him as well.
>>: Do you think this will scale to do the [inaudible].
>> Xin-Yi Chua: Yes, because basically you're just doing images. So I'm thinking theoretically it
should just scale. The only issue with human is you're going to have to spend probably a day to
produce those images.
And storage, actually. That's going to be an issue. And an example of storage was the E. coli GenBank file is only 30 megs. Those image pyramids came out to be 12 gigs. So -- yeah. So --
>>: Hard drives are cheap.
>> Xin-Yi Chua: Yes. So that's sort of why I'm sort of really looking forward to doing that
dynamic DeepZoom approach to see what would happen there. But again as I mentioned, in
images you can potentially put in a lot more information than what is in a normal sort of GenBank file. So I, for example, could color code my genes based on function or call categories or
different domains, that sort of thing. That's a single image but I've got a lot more information in
that single image than the normal text-based source.
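[A back-of-the-envelope extrapolation from those figures, assuming storage scales roughly linearly with genome length; real tile counts would depend on the tracks and compression.]

# Rough linear extrapolation of pyramid storage from E. coli to human.
ecoli_bp = 4.5e6          # base pairs, from the talk
ecoli_pyramid_gb = 12     # observed pyramid size, from the talk
human_bp = 3.1e9          # approximate human genome size

scale = human_bp / ecoli_bp                          # ~690x more sequence
print(f"~{ecoli_pyramid_gb * scale / 1024:.1f} TB")  # roughly 8 TB of tiles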
Any other questions?
>>: So aside from research purposes, do you see this being used for anything else?
>> Xin-Yi Chua: So one of the thoughts throughout the internship was actually this could lend
itself to an education domain. So something like ChronoZoom. I'm not sure if you're familiar with ChronoZoom. But basically it might lend itself well to the education domain.
So in that scenario, one could do maybe a cooked-up tutorial that, if you want to aim it at high school level or sort of undergrad level, starts from, this is your entire organism, let's concentrate on the dystrophin gene, which causes certain diseases, and maybe goes down to the protein level. These are the domains we're interested in, mutations in the DNA cause these sorts of diseases, that sort of thing, so it goes straight down to the DNA level. So that type of tutorial may suit this type of application.
Okay. No other questions, then thank you very much. [applause]