>>: Good morning, everybody. Welcome to the Third International Conference on
AstroInformatics. I expect this will be a fun meeting, as these usually are,
with a lot of good discussions, which is, of course, the most important part.
First, let me introduce the co-organizers. They're Pepe Longo, sitting there.
Dan Fay, over there having coffee. Ashish Mahabal, who will join us tomorrow
and, of course, our host Yan Xu from Microsoft Research. And I will yield to
her to introduce the Microsoft Research logistics and our first speaker. She
will chair the first session.
Let me remind everybody, we got a Facebook page, which is intended to be an
online forum for any questions, discussion and all that that you don't do in
person. And since we yielded to the decline of western civilization, we have a
Twitter account now. So we do that too.
But feel free to interact and at this point, I'll just yield to Yan.
>> Yan Xu: So I'm Yan Xu. On behalf of Microsoft Research, a warm welcome to
all of you. Unfortunately, both Ashish Mahabal and Robin Inaba, who have been
working on the logistics, are not available today. But I think we've got
pretty much everything under control. And is everybody here? I feel like
we're missing some of the people on the bus. But we'll figure it out.
So anyway, I have been working with you, the astronomical community, for a few
years. And I think some of you have heard me talking about Microsoft Research
multiple times, probably, because I have one slide that I permanently kept in
all my presentations. But today, I don't have to do that, because I have the
honor and the great pleasure to introduce you to someone who has the vision and
the capacity to talk to you about Microsoft Research at a totally different
level, and who is our corporate vice president of Microsoft Research Redmond,
Dr. Peter Lee.
>> Peter Lee: Thank you, Yan, and thank you all for coming to our campus for
this conference. It's really, really a great honor to host all of you. What I
was asked to do is just to speak just for a few minutes, for ten minutes, about
Microsoft Research and why we are interested in astroinformatics and in science
in general. And then I thought I would do that and if there are questions I
would be happy to take some of those.
So first thing to explain is why on earth does Microsoft invest in basic
research? And it's a hard thing to explain. It's a hard thing to explain even
within Microsoft and sometimes even within Microsoft Research. Our own
researchers sometimes wonder why do we do this? And the best explanation I can
give actually goes back to a lecture I used to give to freshmen when I was a
professor at Carnegie Mellon University. And in that lecture, I used this
image, and this is the kind of image that you use to try to keep early morning
freshmen awake.
And this was a lecture on the problem -- the classic computer science problem
of what's called prefix reversal, sorting by prefix reversal. More popularly
called, if you're 18 years old, the pancake flipping problem. So if you're not
familiar with this, there are many variations of the pancake flipping problem.
But a simple form of it is to imagine you have a stack of pancakes and imagine
that they're flawed, that they're burned on one side so you don't want to stack
the pancakes showing the burned side up. So the burned side should always be
down.
We have one operation, which is to stick a spatula anywhere you want within the
stack of pancakes and then for all the pancakes that are on the spatula, you
can flip them over. So with that one operation, the challenge is how many
pancake flips does it take to reverse the orders of the pancakes and attain a
final state where the pancakes are reversed but still the burned sides are
down.
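To see the rules in code, here is a minimal sketch of the burned-pancake puzzle in Python. It is not the Gates and Papadimitriou analysis, just a brute-force breadth-first search that works for small stacks; the tuple representation and the goal of a sorted stack with burned sides down are illustrative choices.

    from collections import deque

    # A stack is a tuple of (size, burnt_side_up) pairs, listed from top to bottom.

    def flip(stack, k):
        """Slide the spatula under pancake k: the top k pancakes reverse their
        order and each one's burnt side toggles between up and down."""
        flipped = tuple((size, not burnt_up) for size, burnt_up in reversed(stack[:k]))
        return flipped + stack[k:]

    def min_flips(start):
        """Breadth-first search for the fewest flips that leave the stack sorted
        (smallest pancake on top) with every burnt side facing down."""
        goal = tuple((size, False) for size in sorted(size for size, _ in start))
        frontier, seen = deque([(start, 0)]), {start}
        while frontier:
            stack, depth = frontier.popleft()
            if stack == goal:
                return depth
            for k in range(1, len(stack) + 1):
                nxt = flip(stack, k)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))

    # Example: three pancakes, out of order, with one burnt side facing up.
    print(min_flips(((3, False), (1, True), (2, False))))

Exhaustive search like this only goes so far, which is part of why exact answers are known only for small stacks.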
And so that's a really classic problem, and this was the subject, actually, of
a whole week's worth of lectures to freshmen at Carnegie Mellon. And it's
quite difficult. There are actually now closed-form solutions for stacks of
pancakes up to 13. But beyond 13, it's still an open problem, and
turns out to be quite interesting in terms of pedagogy for young computer
science students.
And, you know, with an image like this, you can keep them interested at least
for the first 15 minutes of the lecture.
One thing that's remarkable here is that the seminal work on this problem was
actually done by our very own Bill Gates. And this was published in a joint
paper with Christos Papadimitriou who some of you may know at Berkeley, who we
still collaborate with actively here at Microsoft Research. Back when
Microsoft was still a very small company, still in Albuquerque, New Mexico.
And so the point here is that really from the very beginning, from the very
origins of Microsoft as a company and its founder, there was a deep belief in
the value of basic research, participation in the scholarly traditions of
academic research and open publication. And in this case, of course, the
problems were directly applicable to some scheduling and resource allocation
issues in the early MS-DOS.
So from that start, I think there was always, I'm guessing, an ambition in
Microsoft to be engaged in basic research. And, of course, it took some time
until about 1991. It was about the time that Microsoft finally exceeded the $1
billion mark in annual revenues. Still a small company, certainly by today's
IT industry standards. But it took until then, when finally a formal step was
taken to create a basic research lab. And so where you're sitting today is in
Building 99, which is sort of the mothership of a global basic research
organization of about 850 Ph.D. researchers, about 300 of them housed here in
this building.
Now, from that start, this laboratory does a really substantial amount of basic
research. It is primarily in computer science and computer engineering, but we
are also deeply engaged in and committed to advancing understanding and
uncovering new truths in all human endeavors of science and
engineering, including mathematics, biology, chemistry, and physics, as I'm
showing here from a recent paper in Science from two months ago. And, of
course, in astronomy, cosmology, and astroinformatics.
It is not just our own desire to be involved directly in research in these
areas, but also, I think, we're very proud to be supporters of external
research. And, in fact, while we in Microsoft Research have a fairly
substantial team doing theoretical physics, particularly in the condensed
matter and quantum domains, the work shown here was actually not done by Microsoft
Research but is the direct outcome of our funding of external experimental
physics research in the field.
And so by providing both infrastructure and computer standards as well as
direct support, we really seek to support the most important new frontiers in
science and engineering.
Now, when I'm talking to senior business leaders in the company, sometimes,
especially for people who are new, I don't have to do this with Steve Ballmer,
but there are new executives who wonder, you're doing quantum physics? You're
doing astronomy? You're doing medicine? Why? What is the purpose here?
And for that, I draw a map of our research investments in the lab. This was
something that Yan asked for specifically here, so bear with me. If you
imagine this blank screen being the space of all kind of research activities in
the lab, we can actually look at the dimensions of this. So along the X axis,
we have research activities that span from kind of short-term, managed-risk
activity -- research activities that are only sensible if we can assume that we
can get some type of answer relatively quickly.
As we go out, research activities that demand patience or require some
patience. And then on the Y dimension, we have kind of choices of problems.
And near the origin, we have what we call reactive problems, classical societal
challenges or challenges from the scientific community or our product groups
coming to us directly with problems that they need help to solve.
And as we move up the Y dimension, we have problems or we approach the more
classical open-ended search for truth and understanding and beauty that marks
classical basic research.
And so then in this space, I cut this into quadrants and attach names to it,
because the names are very useful things for senior business leaders to hang on
to. And so in the upper right quadrant, we have the classic kind of long-term,
open-ended research which I've called here for the benefit of our business
executives, Blue Sky Research. In the lower left, we have very mission-focused
research activities. In the lower right, we have what you can think of as our
drive to do the best at what we do, continuous improvement, oftentimes marked by
a fairly deep, long-term commitment to grand challenges such as
computer vision or machine translation of natural languages and so on. And
then in the upper left, our desire to be very surprising and disruptive and
produce kind of game-changing new ideas.
And so in the lab, what we try to do here is try to embrace the diversity of
different approaches to research and have a very open attitude and try to
reward researchers who play a role in any or all of these four quadrants. And
then for me, when I work with our management team, I try to challenge our
managers to tell me, strategically, how they are investing and supporting
research across all four quadrants.
Now, this is important because you can then justify a lot of what we do by
looking across these four quadrants. And these little pictures are just meant
to show just one little story. But there are hundreds of these stories where
in the upper right quadrant, dating back to 1999, we had some early work on
MVDR beam formers, adaptive beam formers for audio array processing, just a
purely theoretical exercise for this laboratory.
At some point, there was a project stood up called Project Natal that was
attempting to bring beam forming into living rooms that eventually got
green-lighted as a product effort, which resulted in the microphone array in
Kinect that allows you to talk to Kinect even in a very noisy room. And then
today, we're looking at more and more refinements and extensions of the
technology.
And so the idea here is that there is a pipeline where all four quadrants are
really essential for making good things happen. And so there's sort of a
philosophy there that's important to us.
And so for things like astroinformatics, this plays out all of the time. A lot
of the infrastructure, a lot of the database technology, a lot of the
visualization technologies, even the programming support that we develop in
support of the astroinformatics community, in support of basic science in this
area ends up having direct impact in many other ways. And, in fact, as some of
you might know, even in things as mundane as the new version of Excel that will
be hitting the market in a few months, concepts from the WorldWide Telescope
have direct impact and will be realized as new features, even in
things as mundane as Office. So the whole thing is sort of a nice cycle.
So that's about all I have to say. If you have questions, great. If not, if
you want to get along with your conference, I'm happy to let you go off and do
that. Thank you again for being here. It's really, really great for us to see
you all.
>> Yan Xu: Thank you.
>>: What was the disruptive technology in the upper left? You had four pictures.
>> Peter Lee: Oh, yeah. So this is a test rig. There are several of these
made, three or four of these made with nine microphones. And the nine
microphones were used to develop both adaptive calibration in arbitrary living
rooms, up to four meters in depth using pleasant tones. And so actually if you
buy an Xbox with a Kinect and you turn it on for the first time, you hear a
music tone. It's very kind of uplifting and inspirational. That is actually a
calibration tone.
And then this was meant to then test whether MVDR adaptive beam forming could
actually be used at four meters' distance to shut out all of the surround sound
noise and all of the other voices in the room and just focus on the one voice
that is speaking.
And so that was very successful. This was tested in about 500 homes in the Puget
Sound area. Now, of course, the challenge in moving from there to the Kinect was
that there was a 30-cent manufacturing budget, and that meant only four
microphones. And, of course, the device is much smaller and so the distance
between the microphones is much smaller.
And so it really, physically can't possibly work. And, in fact, when we
submitted the early academic papers, the reviewers uniformly rejected them on
that ground. But if you actually own a Kinect, it works remarkably well. And
so we're pretty proud of it.
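For those curious what the MVDR adaptive beamformer mentioned here computes, a generic narrowband sketch in Python with NumPy follows. It is not the Kinect or Project Natal implementation; the array geometry, frequency, noise covariance, and diagonal loading are illustrative assumptions. The key formula is w = R^{-1} d / (d^H R^{-1} d), which minimizes output power while keeping unit gain toward the look direction.

    import numpy as np

    def steering_vector(theta, n_mics, spacing, freq, c=343.0):
        """Far-field steering vector for a uniform linear microphone array."""
        m = np.arange(n_mics)
        delays = m * spacing * np.sin(theta) / c
        return np.exp(-2j * np.pi * freq * delays)

    def mvdr_weights(R, d):
        """MVDR weights w = R^{-1} d / (d^H R^{-1} d): minimize output power
        subject to unit gain in the look direction d."""
        Rinv_d = np.linalg.solve(R, d)
        return Rinv_d / (d.conj() @ Rinv_d)

    # Illustrative setup: 9 mics at 4 cm spacing, look direction 20 degrees, 1 kHz.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((9, 2000)) + 1j * rng.standard_normal((9, 2000))  # noise snapshots
    R = X @ X.conj().T / X.shape[1] + 1e-3 * np.eye(9)  # sample covariance + diagonal loading
    d = steering_vector(np.deg2rad(20.0), n_mics=9, spacing=0.04, freq=1000.0)
    w = mvdr_weights(R, d)
    print(abs(w.conj() @ d))  # distortionless constraint: gain toward the source is ~1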
>>: [inaudible].
>> Peter Lee: Right. And thanks for asking the question, because we love to
brag about the Kinect so it's good. All right. Well, enjoy the conference.
Really looking forward to seeing what you all do. Thanks.
>> Yan Xu: Thanks again.
>> Dan Fay: Couple things. Again, I also would like to thank all of you for
coming out here, coming out to our nice weather out in Redmond. Unfortunately,
you came on the one day in the last 50 that we actually had rain. So you got
to enjoy that success. We're all depressed because we actually wanted to go a
couple more days so we could set the new record of consecutive days without
rain in the Puget Sound area. But my lawn's happy for the rain, I should say.
So as Yan mentioned, I work here, also at Microsoft Research in our research
connections group, and I head up what we do kind of around what's called earth,
energy and environment. And I joke sometimes, we actually cover everything
that isn't allowed to deal with humans. So we're not dealing with health and
well-being or bioinformatics or anything like that. So it's a fun area to kind
of play with.
One of the things I wanted to do is kind of go back about eight years, almost
nine years ago, when we held an eScience workshop here -- the original one that
we did, called the Data Intensive Computing Workshop -- and there was a keynote
by Jim Gray about 20 questions to a better application. And it was really
interesting to look back and see what we did at that time in this cramped little
Sheraton room down in Bellevue. So it was kind of an interesting thing to see,
as we started getting people across different disciplines, what actually
happens in this.
And some of the stuff that came out of, like, Jim's talk and others was around
the online science and how to deal with computational science. There was also
this nice talk by Alex at that time about astrophysics with the terabytes of
data. So it was interesting even seeing what was happening at that time with
some of the work that Alex was doing with Jim and others and the stuff around
the SDSS at the time and SkyQuery and some of the other pieces that kind of
changed how we do some of the science and also had an impact in the way we here
at Microsoft and Microsoft Research work with different science communities as
well.
So I took these slides, these are actually originally from Jim's deck from that
talk, and part of what I did, which is kind of interesting, is if you look at
some of the challenges that were talked about back then, they're actually still
challenges right now. We still haven't solved the information avalanche or the
data tsunami or the -- what is that, the data flood or, you know, whatever the
new term is that we've all assigned this issue of more data as it keeps coming.
In fact, I'll also say that we won't, because the challenges with the speed of
the commoditization of some of these sensors and the devices and the
computation is always increasing faster than we can actually process it and/or
even store it. And in the astronomy space, you guys see that in spades, more
than almost any of the other disciplines.
Some of the other pieces are still very [indiscernible], even to the publishing
of the data. How do we do it, what are the right ways to do it, keeping the
provenance of the information in the datasets. And then, you know, how do we
refer back to it. Can we even keep all of it.
As well, one of the key things we talked about a lot back then was the global
federation of different datasets, which is still the case, especially with more
collaboration not only across universities and research centers, but also across
the disciplines, as we go into more multidisciplinary science.
And this was something Jim had talked about, which was actually really
interesting at the time, really breaking out the roles of different types of
scientists within the spaces. And at the time, you know, there were still a
lot of people, especially in other domains, thinking, oh no, we do it all. We're
doing everything. But as you look at where computation has come and the use of
informatics in a lot of these areas, there are folks that are very good at
collecting data and analyzing within their space. There's folks that create
new types of algorithms and new statistical methods to actually analyze their
dataset.
How do you bring all those pieces together? How do you keep track of all those
technological advances in those other domains as well? And then there's folks
that actually understand the plumbing aspect of actually how do you get things
to disk, what are the types of disks and actually the networking portion. So
it's actually been kind of interesting to see this is still the case. We're
seeing this breakout happening a little bit more within the areas.
The challenge that we see, especially in a lot of the academic spaces, is how
do you reward folks in these different areas for their expertise and have them
get credit for this ongoing work.
So Peter kind of covered a lot of this, which is overall research. Down in the
bottom corner, we have a little map of all the different labs as well so we're
kind of spread, kind of here in Redmond, but different locations around the
world. And so we in our research connections group pull not only from a lot of
these different areas, but also from domains of research that we can use
for scientific interaction.
As I kind of mentioned, in the earth, energy, environment side we cover three
kind of main areas. I'll say three main focus areas that we look at. And
the first one is kind of around the visualizing and experiencing the data and
the information. And really, what that is about is how do you make a
connection with scientific information in the same way that people have a
connection with a great piece of artwork or maybe a red Ferrari or, in the case
of my wife, a nice pair of shoes or handbags. How do you actually have that
emotional, gut-level connection with that information? So that's one of
the things we look at. And a lot of this work came out of the work from the
worldwide telescope of how can you make something beautiful that people can
actually interact with and actually kind of mesmerize them. And why can't we
do that with scientific information when we can do it with gaming, we can do it
with all this other information.
The other key thing we look at a lot is around what we call accessible data.
So there's key pieces in there, which is around the discoverability of the
information, the consumability and also the accessibility of the data and
information. So do you even know where anything's at? If you do, can you
actually get to it through the FTP site or through some sort of password, or is
it hidden on some sort of share?
And then once you can get to it, can you actually do anything with it. And
this, as you look in the different science spaces and based on the technology
acumen of the different scientists, it needs to be done in different ways. How
can I quickly get up to speed and have science-ready information and data to
process?
And the last one we look at a lot is around enabling just scientific
collaborations. How do you do that in a way where it's just part of the
overall process?
So this is the Fourth Paradigm part. We've kind of talked about this a lot,
but it's always, you know, kind of good to revisit that this is where we're at.
All of this data sits in here, more data being captured and, in fact, too much.
Then there's this corollary problem and idea as well of we need to capture all
the data and we need to save it all.
And when you start looking at, you know, are we actually going to be able to do
that, we see on some of the telescopes where you actually can't, you need to
throw it away, you need to process it as it's going by. How do we handle the
information at the -- you know, just in time. And what is the data that we
should be keeping.
Part of that is it's not just about doing more and more brute force processing.
It's about thinking about the problem earlier. And one of the challenges has
been as we move to more this kind of instant access of a lot of information and
data, is this idea, oh, we can just run it anytime. And sometimes people
forget to think about the problem and what is it they're really trying to do
and what's the long-term benefit of the data that they're trying to capture and
how will it be used. And so setting up the problem correctly.
So overall, the book's available. We always do a promotion for this since
it's something we helped edit and author. It's available for free. You can go
to fourthparadigm.org and it's under a Creative Commons license as well, so it
can be reused for other uses.
So overall, kind of on the eScience problem, this is still going on as well.
How do you ask questions of the information and get answers? How do you pull it
from all these different data sources and deal with the different levels of
precision, from historical information on down? And so how quickly -- how can
you do this and, hopefully in a smart and intelligent manner, get to the
information you're querying for?
And the ideal is, you know, just being able to sit, as we kind of joke, in your
lounge chair, able to ask questions, having the results come back to you. Then
being able to write up your paper, submit your next grant request and actually
get funded.
One thing just to highlight on the fourth paradigm part, Jeff Dozier, who is
here, wrote a good article for EOS a year ago, year and a half ago,
highlighting the use of kind of the overall concepts of the Fourth Paradigm and
the work he does around snow hydrology. And this idea of using both remote
sensing and local information, combining all those together. So it's actually
a quick, three-page read, but it's a very good kind of overview.
One of the other things that's also really key that's kind of come out of some
of the work we've been doing with different groups is kind of just the overall
value of information, and this kind of idea of the value of it versus the
amount of time spent processing it versus, you know, who's doing the work on it.
And the value increases as more work's been applied to it.
And that's kind of obvious, but one of the things, sometimes, we forget is how
do you reward also those people along the way that are doing that hard work to
get the data to the science ready point. And a lot of times in areas outside
of -- especially in some of the ecological areas and outside of large
institutions, there's a lot of people that are doing this painstakingly on
their own, or they're hard-won data sets because they actually had to go out in
the field and get them and then process them and you have other people
aggregating them together and doing the work.
But as it's kind of moving up, how do you kind of -- how do you add that value
to it or make sure that that value's usable for different folks. And so this
is something we kind of also look at to say are there ways we can help with the
overall academic community, provide ways of rewarding different groups in this
space for their activity.
One thing I want to kind of just talk about real briefly is something that
we've looked at when we were dealing with some of the environmental spaces and
kind of positioned as the environmental ecosystem. But it actually can be
applied to a lot of different spaces as well.
And it's this idea that you have kind of the knowledge in the earth, in this
case, but the real scientific data and information and based on that, you also
want to have some sort of action that's being taken. So you want the knowledge
that's kind of gained from the scientific understanding of that real-life area
to actually inform policy making decisions or some sort of use cases that you
want to have changed.
Well, you know, as we kind of looked at it more, we were breaking it down into
more areas, it kind of evolves into these kind of two pieces. And one part of
it is there on the left-hand side, which is the traditional, say, scientific
side: collecting data or processing data, running models, and actually
having output of the information and data, and doing more analysis on it.
And then you have this really interesting part, which is the thing the humans
come into, which is actually the insight of that data and the information. And
then that insight gets written up in maybe a publication and then gets
submitted or gets, you know, made available to others, and then goes back to
kind of producing more data, and we kind of keep looping in that area.
And maybe somewhere in there, the data gets made available to others or the
publications are there. But for any of that information to actually have an effect
on, I'll say, that action side, which could be the policy decision making or
actually on the general public or other areas, it has to be communicated in
a way that they can actually consume it and actually understand it. And so a
lot of times, it can't be done in the normal publication mechanism of a paper,
but are there other methods that could actually be done.
And on the right-hand side, on the action side, there's also that decision
making portion. So they look at the information. They make some policy
decision or some sort of -- or even a consumer might make a change in behavior
based off of some information they've learned, and then they will actually
hopefully change their behavior and implement it.
Well, how do you actually track that those implementations are being done? You
actually have to go back and do more tracking of the information. And this
is especially the case on policy decisions, and in some of those, the last thing
you want is a policy written up and implemented and actually no
one knows if it works.
And one of the things is, we really look at this as an area where technology can
hopefully help a lot. But one of the other things we've found was that
making that data and information available to somebody on the right-hand
side to consume, in different ways, increases the credibility of
that data and information, whatever you're trying to communicate, right. So
whatever the information is -- even if they can't really understand the
data, the fact that they can get back to it, actually see it, and go play
with it increases the overall credibility of it.
So finding ways to kind of connect between those two areas and communicating it
in ways that make sense to people. And it doesn't always, you know,
have to be dumbed down or things like that, but put in a context where they
can consume it. That's a kind of key thing to think about.
And it gets even more complex when you look at the overall picture. So if the
boxes in this case were policymakers, they're getting bombarded by different
messages from many different people and constituents and things like that. So
how do you make sure that the message or the information you're trying to
communicate to the policymakers or the general public actually gets through?
Again, it's got to be in a way that they can consume it and understand it.
So let me cover a couple of other things. We talked about some of these issues
around the ecological data flood, and with more and more data coming, we're seeing
this happening; we need more algorithms helping with some of these and
processing those earlier. And part of the challenge that we look at is how do
you help this across the domains. So what can we take from one domain and
utilize it in another one. You know, trying to deal with everything from
field-based data, manual measurements without the precision of some of the
digital instruments that we all kind of deal with all the way up to satellite
model outputs, you know, counts of information, and dealing even back to the
historical photographs on some of these ecological ones. So you
have a lot of different types of data and different datasets within here.
So dealing with all the challenges related to those, about combining those
together, is one of the key things that we try to look at, find where can we
make that easier and have tooling to help as well.
So kind of why is this kind of important? Well, because, you know,
traditionally you want to understand where the data came from and how it got
processed and what are the right ways it got accessed or made available as
well.
What's the uncertainty of the different data sets once you start combining them
together, and does that propagate all the way up?
Also, there's the part on data sharing. Sometimes we deal with
environmental datasets from organizations that may not want them totally
public, or at least not the exact locations where all the information is.
If you have information on where all the teak forests are, do you really want
to have that published on the internet for everybody to find so somebody can go
log it and clean out the remaining teak forests? The same is the case with
tracking of different animals. Do you want everyone to know where all the koalas
are at every moment? Or other endangered species, even?
So there's a lot of challenges also within there. How do you actually handle
that data, make it available for others to use, but also the same with some of
the human datasets, how do you do that in a way that's maybe anonymized or not
giving out all the exact information?
But one of the things that's interesting is this part down here about, you
know, we see that the science really happens when you start bringing together
these multiple datasets of information. And part of this we learned when we
were looking at some of the astronomy information and what was happening, which
was that it wasn't just about a single telescope or a single band or type of
dataset you were getting down, but it started moving into combining across all
the different wavelengths and the information you were getting.
So if it was visible or infrared or microwave, you get information from each
one of those and how do you find actually the signal through all that noise of
signals, right? Combining all those together is the key part. And so you're
seeing this even more now happening in kind of these environmental areas as
well.
So we look a lot to things that are happening in these other disciplines, that
are a little maybe farther ahead. How do we learn from those as well and apply
those.
And this was just one of the projects we ended up doing early on, a number of
years ago. We were using some of our cloud computing, early at the time, Azure,
to process some of the datasets from, in this case, MODIS -- some of the
reprojection and some of those pieces and the calculation on it to get the
evapotranspiration. So we were doing this in conjunction with UC Berkeley,
and Youngryel at the time was trying to do this on one machine, and would
never have been able to actually expand it out to do the 30 years of datasets at
the 1-km resolution that we're looking at.
And so we were looking how could we bridge that gap. And then what can we
learn from that to also make it easier for others to do this? And what
technologies need to be in place to allow that to happen?
And so it's kind of an interesting one to look at, going what is the real
process that they're doing, what are they really trying to have as an outcome
and how can we help and how can the technology help in that way. But actually
make it about, again, the science making the work run or about the insights
coming from the science and not about what the technology could do.
So not trying to over expose the technology, but how do you actually make it do
what the scientist wants.
Couple other things we look at now with some of the stuff that's happening: we
have some work going on in our SQL team as well on the Azure side, trying out
new ideas. So we have things we try out here. They're actually trying out
some ideas, testing them now. These are available to use even right now.
They're putting some numerics up now so you can call those remotely, utilize
some of the numerical engine and the libraries that we have available. There
is also some work that they have going on looking at exploring data. So can
you help people mash up the information in the data sets together. Again, one
of the things we look at is can we span everything from the lower end of folks
on the commodity -- on the technology curve to folks that do programming. And
where is it that they need some of that help.
So are there visual ways to help clean and organize it. One of the things
that's also interesting, there's a piece going on that they're doing around
what's called Data Hub, which allows organizations to have their own
marketplace for information and data. So it's a way to actually publish it
within a community or maybe within an organization as a whole or an enterprise
and share that information just within them. So you can really kind of
restrict access and do some of those pieces as well but not have to worry about
people finding it and what the access points are.
And then there's some work also on trust information to be able to encrypt some
of the data as well. So there's a bunch of work also going on on their side.
So on our end, we focus, as I mentioned earlier, a lot on this overall piece of
discoverability, accessibility and consumeability of the data. And, you know,
how can we enable that both from the user or the person wanting to consume the
information, but also from the person that wants to publish the information.
And this is a lot -- we find this a lot in the space where you have users that
have small amounts of data, and they want to make it available because it's
very useful information that others could use. How do they make it so that
it's available and discoverable? And what are the right formats? And how do
you move that so it's earlier in their collection cycle? So they're not
actually having to go back and add metadata to their description of the
information and blah, blah, blah. All those things that no one wants to do
after the fact. So how can you move it earlier, and are there ways to tag the
data and information.
One of the things, just as an aside, that's really interesting: we'd like to
actually teach people how to deal with digital data at the beginning, at the
same point and in the same way we teach people how to collect physical
samples. So in the geological areas
and some of those different areas, we teach people how to deal with physical
specimens, how to collect it and how to make sure you're getting all the
information and how you would do that over time. How do we do this for
digital-based data as well.
So we also have a bunch of tools coming from different groups as well. This is
kind of an example of one from our group in Cambridge. We have a computational
ecology and environment group there. In this case, we've been looking at the
FetchClimate project, which is about climate information
and how we make that available to others.
But it's bringing together all of the different datasets that people have used, a
lot of the different records of them, and then applying, behind the scenes,
behind the interface, ideas of how to deal with the uncertainty. So if you're
asking for information on a certain area, how do I provide that to you in a way
that I'm limiting the uncertainty on the different datasets that it's coming
from.
And then be able to very quickly look at the information and maybe some images
online and then also be able to download it. So as we look back at that
discoverability and the accessibility and then, really quickly, consumability.
It's a quick, easy way to do it for a lot of the folks. And you can also add this
in, in their case, in a programmatic way as well. So you can either do it in an
interface or programmatically. We're trying to find ways that you can make,
again, this easier.
We've also been looking at things around what we call the visual
informatics framework. This is something Yan's been doing a lot of work on.
Looking at how we can help with some of the space of making access to a lot of
these datasets quicker and easier. This is really a lot of the times the case
that we find in environmental and other areas. And you end up having tons of
different applications on many different platforms and you have access to
different services and different types of datasets.
And so one of the things we looked at was this protocol that's been kind of
developed called the Open Data Protocol, or OData, which is a way to actually
send the data in an almost self-describing way across the internet, across the
network, so you can actually do queries remotely and access different datasets
without actually having to have a library installed locally as well.
And it's available now; it's gone into OASIS for standardization. But one of
the things, why it's kind of interesting to us, is when we look at the history
of how a lot of the protocols have caught on within the internet, so
web-based ones, SMTP, other ones -- having simple ones that could be implemented
on any platform and be usable right away is really key. And so OData is based
off of Atom feeds and RSS feeds, so it allows you to subscribe to information
and data, so you can actually refresh the data later on when you want to get a
new update. You're not having to go back out and try to figure it out. It
stays with it.
There are some other nice pieces in there, and they've added, in the case of
the environmental space, geospatial data support. And while it doesn't solve
everything, when we looked at it, one of the things it does is it actually moves
things forward, beyond people sending CSV files or comma-delimited files that
actually have no type information when you send them -- just the data. So,
you know, this at least sends type information with it, so you can do some sort
of reasoning about the information as well.
So again, as you look at how protocols build up, this was kind of
interesting to look at, because you can quickly start getting to it right away.
So we have a bunch of also products and technologies that are building on that
as well, but it's just an interesting space.
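As a concrete illustration of what an OData query can look like, here is a small sketch in Python. The service root, entity set, and property names are hypothetical placeholders rather than a real endpoint; the $filter, $select, $orderby, and $top query options are standard OData conventions.

    import requests

    # Hypothetical OData service root and entity set -- placeholders, not a real endpoint.
    SERVICE = "https://example.org/research/odata"

    resp = requests.get(
        f"{SERVICE}/Observations",
        params={
            "$filter": "Year ge 2000 and Site eq 'PugetSound'",  # server-side filtering
            "$select": "Site,Year,Temperature",                  # project just these columns
            "$orderby": "Year",
            "$top": "100",                                       # limit the page size
            "$format": "json",
        },
    )
    resp.raise_for_status()

    payload = resp.json()
    # Older OData services wrap rows in payload["d"]["results"]; OData v4 uses payload["value"].
    rows = payload.get("value") or payload.get("d", {}).get("results", [])
    for row in rows:
        print(row["Site"], row["Year"], row["Temperature"])

Because the feed carries type metadata and a feed-style structure, a client can re-issue the same query later to pick up updates, which is the refresh behavior described above.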
The other thing I just wanted to cover was kind of looking at new ways to
analyze and communicate data. So this, you know, everyone here should know
about the work on the Sloan Digital Sky Survey. So some of the work kind of
coming out of that was a new way to publish information and datasets. But one of
the things that we actually found very interesting about it as well, once we
started looking at it, was not only was it a way to publish information out and
you could have it for many different communities, but then it could be
reutilized for things like the Galaxy Zoo as well. So you could actually use
the same exact datasets, position it a little bit differently, put a little bit
of boundaries around it and actually make it available for others to utilize in
another type of activity. But bring more, in this case, science users and
others involved in it.
So not as much to, you know, do the deep research, but actually get them
interested as well, but also have them participate in it.
So, you know, are there other areas we can do that in and we look at this in
other environmental areas and datasets, applying some of these same ideas and
saying, you don't need to have a completely different dataset. You can use the
same dataset, make queries against it differently and provide it in a way that
can be consumed by that constituency in a way that they can actually handle.
And then there was the work through Jonathan and Curtis and many others,
actually, looking at, you know, bringing together the work for Worldwide
Telescope. How do we bring these datasets into one area and actually how can
you experience the data.
So when I was talking about it earlier with our group, looking at kind of
experiencing and visualizing the information data, this was part of the thing
we look at. How do you experience it and through that visualization of it. So
there's been a bunch of work that's gone on on the integration of the datasets
within there. Easy access, quick overview of being able to look at the data,
have it accessible at your fingertips and not have to worry about where it's at.
We've done a bunch of work on adding a new API for the extensibility of it.
We've even added an Excel add-in so a lot of folks could very easily
publish data out. It wasn't so much originally developed for the astronomical
side as we were looking at it, but we see that it could be used
there. We were also looking at how people could get information and put it
directly onto the earth from data points that they have.
And one of the key things that really came out of it was this idea of the
tours. So for sharing the information, the data. But it's really about
telling a story about the data and a way to share that data. The experience
that you have as a scientist or a user of that information and sharing it with
others.
And that is really kind of powerful, because it gives an explanation
of the information without somebody having to read a long paper or the like.
They can use many different types of media to get to it.
And then there's a lot of work that's being done lately using it in
planetariums. So Jonathan's done a lot of work making it easy for it to be
used within the planetarium space. And, in fact, the Cal Academy has a project
going on around earthquakes, a show that's going on. And it's being used in there as
well for their planetariums.
Here are just a couple of the visualizations we've been playing with lately to say,
how can we make it easy for folks to share this data and the visualizations of
it. On the lower left, this is a slab model that we were working on with the
folks from the USGS. So they create these slab models of where they think the
slabs are, and then being able to visualize it with the earthquakes as well to
see whether they're actually in range or not. And it's ways that they hadn't been able
to visualize or deal with the data beforehand.
We also, this last year, did some work around this thing called Layerscape,
which was a way to actually publish out the tours so anyone can publish these
tours in the space and share their interpretation of their data and their
datasets. We wanted it to kind of be a quick, easy way to share this
information. But it also, what's interesting about it, is when you do it that
way, you can -- you're not only sharing this tour and the interpretation of it,
but by sending the tour to someone else, you're actually sharing the handle to
the data as well.
So because the tour has the data within it, you can go into the layer manager
and actually right click on the data and actually copy the data out or get to
the information. You can also put your data in with it so you can see your
data in conjunction with the other datasets.
And so for us, it's an interesting way not only of sharing your
interpretation of it, but also of sharing the data itself with others.
Talked about the Excel portion, and then the other piece we played with as well
was kind of this -- hopefully this will play. There we go. You know, looking
at how we can use things like Kinect to actually interact with Worldwide
Telescope in this case but interact with the scientific information and data.
How can you take advantage of something like Kinect to do that.
Something, really, that we wanted to look at to say, okay, you know, we have
these devices and, as Peter mentioned, it's something we did. It's being used in
gaming and in living rooms. Can we use it for scientific information and data
and actually allow someone to actually very quickly navigate through some of
these datasets. So it's something we want to keep playing with and looking at,
and we actually had some interns over the summer playing with it as well in
different ways to look at other datasets.
In a lot of this area, some of the next steps we're looking at are adding
in more functionality for some of the earth-based datasets, the NetCDF and
other datasets that folks have been asking us for. We're looking at new
clients, new implementations and some new interfaces, both for kind of making
it both easier to consume and to create tours.
And then we're also doing some stuff around looking at Azure for different
pieces. So one of the things that we're really interested in is this platform
as a service and some of the work, especially around using things like Python
directly through Azure. So there's some of the -- there's a lot more of the
work going on in Python and some of these codes. Can we take advantage of
those as well?
One plug real quickly: we have an eScience workshop coming up in Chicago, and
Ian will be there for the overall one as well, which is in conjunction with the
IEEE eScience conference. If folks want to come join us, we'll be there
to hopefully do some of the cross-domain discussions.
Couple things I just want to also cover kind of in conclusion, which I always
think is an interesting one to have as kind of a discussion point: with all
this digital data, it can be open, but who ends up paying the cost for all
the spinning disks and the bandwidth and the cooling and things? And so what
are some of the mechanisms that can be used, and should we be looking at other
areas, like the tolling that you see on roads or, you know, in the case of other
countries in Europe and elsewhere, where you do licensing of TV signals so you pay
for the broadcasting infrastructure?
It's just an interesting concept of seeing where this goes, because you
keep looking at the size of the data increasing, and yeah, we'll be able
to build off more of the commoditization of some of these things and get more and
more per drive. But you still have all these other costs going on, and how do
all those get covered and how does that happen if everything's kind of, you
know, online in an area? It's just a thing we look at to see whether there are
interesting economic models that can be brought into it as well, not only to
help cover those costs, but also to help with the economics of
different groups who are creating these datasets.
So, you know, again, going back to the original slides on looking back from
eight years, you know, still some of these same challenges on algorithms that
will scale across many different datasets, especially as the datasets increase.
We still think there's areas for thinking about the overall data and the
retention of it. Do we keep everything, how long do we keep it, where do we
keep it.
And dealing with other visualization of information.
One of the key things, just I think in any of these is dealing across the
domains and sharing a lot of these best practices, which is not something you
get in the traditional conferences for the domains. You just don't get that.
And the other thing is where do you find places to actually not only share the
information but not have people just talk about what I call the chest beating.
This is how great we were. This is how everything worked. Everything worked
perfectly. And then you go later on, well, you know, we had these little
hitches and gotchas.
So where do you find the place that's okay for people actually to have those
discussions where it's not going to be about, you know, the paper or those
things, but really about here's still the challenges that are happening. We
need to solve these. Does somebody else have an idea or what's worked in other
domains.
The last slide, just kind of covering some of the other things we're looking
at, just the balancing of these things, everything from data to the bandwidth
to the storage and processing. It's like a three-legged stool that's out
there. And they need to kind of be in alignment for it all to kind of work.
And the challenge is they're never going to be in alignment, because you're always
going to have more data coming in from more and more sensors, so you have to
increase the bandwidth speed. Oh, then can you deal with the processing and
the storage at that time. How do you handle all this in kind of a way that
makes sense.
One thing we look at as well is, can we push a lot more of that computation back
towards the sources? So how can you actually do that more directly at the
sensor area so you're not only just filtering information and data, but maybe
processing and aggregating it closer so that you're not just stuffing it into a
data storage and hoping someone later on will process it.
And then there's the challenges on how do you create some of these new types of
scientists to actually deal with the different data sets in different ways and
process it and coming up with new ways of applying algorithms and others to it.
And the other one we looked at is how do you continue to ride the commodity
curve. If you look at what datacenters and cloud computing is really about,
it's about riding those commodity curves of disks and processing and
networking. And so how do you do that just in the scientific area, both from
sensors and other datasets that are there so you can get the benefit from that.
So that's kind of my quick overview of some of the stuff that we look at here
within our area and there will be other talks later on about Worldwide
Telescope and some of the other ones.
If there's any questions or anything, be glad to answer those. Otherwise,
thanks very much and thanks so much for being here as well. So thanks.
>> Yan Xu: Questions?
>>: Well, an important part of your presentation was obviously visualization
of [indiscernible] data. Have you developed anything like that handbook of
principles, something like [indiscernible] work, only expanded for the data
challenges we have now?
>> Dan Fay: No, we haven't. There's many of them out there from like the
visualization communities or the large visualization analytics community and
some of those that have done a lot more within those spaces. What we're also
just looking at is just how are we doing it in some of the areas we are and can
we write it up in little quick either blog articles or some of those.
Maybe putting those together into a more expansive way on here's some of the
things we have just learned would be good as well.
The other thing we do look at, though, and Jonathan's really good about this,
is looking at how do we take advantage and make the most of some of the GPUs
and how do we take advantage of that technology that everyone has on their
devices. Because that's really where the magic kind of happens to give you
that experience. It's still not the thing you can get from doing it through a
browser completely yet or remotely. So there's things like that that we should
probably also communicate.
Actually, Jonathan does it, because he does some of that through like the AMD
conference and some of the other ones. He's actually talked about some of
those, how to use it, what they've learned. But yeah, it's good.
>>: There's nothing like [indiscernible] book.
>> Dan Fay: No, no.
>>: You mentioned cross-disciplinary work. So in your opinion, which are the
areas of eScience the most closely aligned with astronomy that astronomers can
work with to push forward together?
>> Dan Fay: You know, so that question comes up a lot, especially when I talk
to the astronomy community. And it's --
>>: [inaudible].
>> Dan Fay: I know, and I'm not sure why. It's actually interesting
because -- there's couple things that are kind of unique also that we look at
when we look at the astronomy community and kind of the physics as well just in
general.
A lot of the work goes around, let's say, many or small amount of big sensors.
And so there's a nice piece about that, about actually having all the focus on
those and how to deal with those at one time. So you have a lot of people
dealing with them at one time.
One of the challenges with a lot of these other areas in the environmental
space, except for, again, some of the ones that do satellite imaging and some
of those, is that it's a lot of small sensors. But the
techniques that you guys have kind of come up with in some of this stuff, just like
in the SkyQuery idea -- being able to query across multiple datasets, doing
cross-joins and cross-matches in a unique way -- are the types of things that we
look at to say, hey, this could be replicated in other areas as well.
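To illustrate the cross-matching idea in miniature, here is a toy sketch in Python that joins two small catalogs (say, optical and infrared detections) on sky position. The coordinates, the flat-sky small-angle approximation, and the one-arcsecond tolerance are illustrative assumptions, not how SkyQuery itself is implemented.

    import numpy as np

    def cross_match(ra1, dec1, ra2, dec2, tol_arcsec=1.0):
        """Return index pairs (i, j) where catalog-1 source i and catalog-2 source j
        lie within tol_arcsec of each other (small-angle, flat-sky approximation)."""
        tol_deg = tol_arcsec / 3600.0
        pairs = []
        for i in range(len(ra1)):
            dra = (ra2 - ra1[i]) * np.cos(np.deg2rad(dec1[i]))  # account for RA compression
            ddec = dec2 - dec1[i]
            sep = np.hypot(dra, ddec)
            j = np.argmin(sep)          # nearest catalog-2 source
            if sep[j] < tol_deg:
                pairs.append((i, j))
        return pairs

    # Toy catalogs in degrees; values are made up for illustration.
    optical_ra = np.array([150.0012, 150.0100, 150.0205])
    optical_dec = np.array([2.2001, 2.2050, 2.2107])
    infrared_ra = np.array([150.0013, 150.0207])
    infrared_dec = np.array([2.2000, 2.2106])
    print(cross_match(optical_ra, optical_dec, infrared_ra, infrared_dec))

Real systems do this with spatial indexing over billions of rows, but the join-on-position idea is the same.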
One of the things we also look at is you want the people who actually deal with
the data to curate and actually to own it, to keep updating it. There's
something about having that ability. So you don't want it all going to one
big, huge repository where it just sits.
And so to kind of get back to your question, there aren't very many others that
are doing that similar piece that you guys are doing. So I think that's
helped. And we actually utilize a lot of you guys, the astronomy community, as
examples of how you can do it across the multispectral type analysis, about
bringing those together, about having distributed datasets, about having
inclusion of the data with the papers as well and being able to get to both of
those.
So I don't have a good one that you could -- maybe someone else does.
Jonathan?
>>: I think part of it is depending upon is it data, visualization, analysis.
There are different analogies. So other folks have put together, like, medical
imaging, some sort of visualization analysis. Other people looked at data with
physics projects and cross-correlation.
The other thing that's also interesting is whenever you're dealing with any
cross-discipline, the other -- the grass is always greener. Here are all these
people, oh, yeah, the astronomy people, they have it all together. You know,
oftentimes they kind of see the work that's done and they see often the good
parts of it seem successful, but they don't always know how much pain was
involved in it or maybe how much still needed to be worked on to make it work.
>>: There are some ways in which the data is simpler. Also, because it's
always been large scale, the arguments that you have to spend money looking at
it has been one [indiscernible] and that's a radical thing in some other areas.
>>: You have to really think about it.
>> Dan Fay: And there was something else about that, the samples. Oh, well,
they just sit there. They sit on the shelves, yeah. But the challenge, I
think, on a lot of these is how do you -- you want to have those long,
longitudinal studies, and how do you deal with those and keep them for a long
time. And I think all the sciences are still struggling with that.
>> Yan Xu: Last question.
>> Dan Fay: We have a discussion on this later.
>>: I just want to make a comment. Don't feel like you have to respond to
this. When you're talking about the sustainability issue, it reminded me of a
paper I saw last year on data furnaces, and I just looked it up again. It has
two researchers from the University of Virginia and four from Microsoft
Research.
And the idea is that instead of worrying about cooling with your data storage
devices, install them in people's homes and use it to heat people's homes. So
they're not spending additional energy to heat and you're not spending
additional energy to cool. It's the heat that's generated is naturally used in
that environment. And people get some kind of a discount on their taxes or gas
bills or something.
I don't know how it works, but it was a pretty interesting little idea. If you
want a fun thing to read, look up data furnaces.
>> Dan Fay: Good.
>> Yan Xu: Thanks again, Dan.
>> David Reiss: All right. Well, my name is David Reiss. I am a research
scientist at the Institute For Systems Biology here in Seattle and I'm excited
to be here. Actually, just maybe for a little street cred, I got my Ph.D. in
astrophysics back in 1999, working on supernova searches.
So hopefully, I can offer a little bit of fodder for discussion about the
things that I think biologists can really learn from you guys as well as
potentially the other way.
So when I was first asked to give this talk, naturally the first thing I did was to look at Wikipedia to see what bioinformatics actually meant. And as you can see, it's got a pretty broad description, and I decided that I wasn't going to be able to talk about it all in the 20 minutes of time. I actually added a couple of things that I think are important at the bottom there.
But given my particular area of expertise, I thought I'd focus on what I consider to be one important integrative aspect of bioinformatics, or computational biology, as I often refer to it. But it's also going to cover a wide range of the other specific areas of informatics that are dealt with in biology.
So these are the challenges that I kind of came up with that I thought were
particularly in contrast to the way I think of astroinformatics or astrophysics
data analysis. In particular, well, the increase in data size isn't
necessarily different, but in biology you're typically dealing with a very wide
range of different types of data that need to be dealt with in different ways.
And often, for the different types of data, there are different informatics experts who have come up with methods for modeling them, understanding the noise, and understanding the experiments used to create that data. And it's important to understand, whenever you're looking at biological data, that oftentimes you're confounded by things you're not aware are going on in your biological samples.
What you're observing is often not as important as what you're not observing. So to kind of start with a specific story, I thought I'd just go over some basic biology.
Most of you probably remember the central dogma from your high school biology
classes. That was about as much biology as I knew when I started at the
Institute For Systems Biology. It basically describes the basic way that the
cell uses the information in its genome to drive the creation of the molecules
that do all of the rest of the work for the cell.
So as you all know, the information is arranged in a linear sequence of
letters, along the genome, and these contain coding regions or genes and
non-coding regions. Those regions are transcribed into messenger RNAs, which
are molecules that are then -- that carry the message in order to do additional
information processing.
And the standard theory basically then says that those messenger RNAs are then
translated into proteins which are basically the machines that make up the
activity of the cell. These are receptors. They do signaling, they do
additional information processing and regulation.
So that's the central dogma and, of course, this is a very, very simple
overview of what actually goes on in the cell. And so there's actually a whole
lot more going on. These proteins go back and bind to the genome to regulate
what genes are then transcribed and what genes are turned on or off,
essentially.
Additional proteins bind in combinations, and it's important to recognize that
these, each of these combinations occur many thousands of times across the
genome, and they're confounded by different cell types, different environments,
different experiences that the cell has received, and so bioinformatics is
basically trying to use data in which all of these processes are observed to
integrate it and make sense of what's going on in the cell.
>>: Is there a reason you're calling it dogma?
>> David Reiss: I think it's a pretty apropos phrase. And I don't think -- currently, most biologists think of it as kind of a historical paradigm. So as
I'll talk about, there's a lot more going on and biologists recognize that.
But the difficulty is trying to actually observe it, trying to make sense of
it.
So one of the reasons that I brought this up is because these messenger RNAs are -- I don't know if there's a laser pointer -- one of the things that we've actually been able to measure reasonably accurately for a longer period of time. Thanks to the sequencing of the genome, we have these things called microarrays, or gene chips, you may have heard of them, where you can essentially measure the relative levels of all of the messenger RNAs in the cells in a sample.
And you can do this across varied conditions or varied cell types and typically
these are converted into a matrix like what I show here. And then once that
matrix is done, that's where the information processing occurs. So a typical
analysis involves cluster analysis or support vector machines or other types of
more sophisticated analyses that I won't have time to go into.
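A rough sketch of the kind of cluster analysis described here -- the talk doesn't name specific tools or parameters, so the data, library calls, and cluster count below are purely illustrative -- might hierarchically cluster the genes-by-conditions matrix like this:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(500, 20))   # toy matrix: 500 genes x 20 conditions

    # Correlation distance between gene expression profiles is a common choice.
    distances = pdist(expression, metric="correlation")
    tree = linkage(distances, method="average")

    # Cut the tree into, say, 10 co-expression clusters.
    labels = fcluster(tree, t=10, criterion="maxclust")
    print("genes per cluster:", np.bincount(labels)[1:])

With real data, the expression matrix would come from the microarray measurements just described, and the distance metric and cluster count would be chosen to suit the experiment.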
But this has been going on for about ten to twelve years. This has really kind
of led the informatics data processing breakthrough that's been going on in
biology for the past ten years.
So as I said before, so this is the standard model, but there's a lot more
that's going on, and in many cases, those additional bits of processing, or
those additional types of interactions are basically ignored, more or less
because we don't have the adequate data observation technologies to observe
them.
Microarrays are basically our best tool at this point. So we can observe
these messenger RNAs and we can measure them by the thousands and we can use
these to infer what's typically referred to as a transcriptional regulatory
network, which is a network of interactions by which these regulator proteins
turn on or off other genes at the messenger RNA level.
And basically, that's done by just assuming that by measuring their messenger
RNA levels, you're measuring their protein levels. That's a whole other issue,
because there's a whole lot of processing that goes on between the levels of
messenger RNA and protein and these are all things that are basically ignored.
But surprisingly -- and this is from a recent publication that just came out in Nature Methods -- there have been hundreds of publications from people just taking this type of data and trying to infer the networks of interactions that are going on in the cell. And what's been recognized now is that different types of analysis methods do a better or worse job at different aspects of this problem.
And so ensemble approaches are really starting to gain traction in biology.
And so what I show here is an example network that was inferred by an ensemble of methods developed by over 70 different teams of bioinformaticians to create this network of associations in E. coli. This is the best network that has been inferred to date. And depending on your perspective, it's either great or pretty sad, because it only infers the interactions for about one third of the genes in this relatively simple microbe, and the estimate was that the predictions have about 50 percent precision.
So here are some examples of how some of the different algorithms did at making these predictions in E. coli. And you can see that by integrating them, which is the black bar on the right, the groups taking all of the different methods and integrating them do basically a better job than each of the individual groups were able to do on their own.
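One common way such community integrations are carried out -- the exact procedure from that paper isn't described in the talk, so this is only a toy illustration with made-up scores -- is to rank every candidate regulator-to-target edge within each method and then average the ranks:

    import numpy as np

    rng = np.random.default_rng(1)
    n_methods, n_edges = 5, 1000                  # 5 inference methods, 1000 candidate edges
    scores = rng.random((n_methods, n_edges))     # stand-in for each method's edge scores

    # Rank edges within each method (higher score -> higher rank), then average the ranks.
    ranks = scores.argsort(axis=1).argsort(axis=1)
    ensemble_rank = ranks.mean(axis=0)

    # The integrated ("community") network keeps the top-ranked edges.
    top_edges = np.argsort(-ensemble_rank)[:100]
    print("top candidate interactions:", top_edges[:10])

Rank averaging is only one possible aggregation rule; the point is that no single method has to be right everywhere for the ensemble to do better overall.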
Now, one of the things that this paper showed was that if they gave these teams
some synthetic data, where they basically simulated reactions and created data
for these people to -- for these teams to infer these networks, they did about
three times better on the synthetic data than they did on the real data.
And that basically shows how many things are going on in the cell that we
basically don't have any idea about, and we're missing. And then it gets even worse when you get to even slightly more complex organisms, like Baker's yeast, which is about two times more complex. Most of the teams do no better
than random at inferring these networks.
So in terms of these computational methods, there's still a long way to go.
And as I said before, the reason for this is that there's a lot
going on in the cell that we cannot -- that we do not yet have the technologies
to measure at the same rate that we can measure the mRNAs. So we would like to
be able to measure translation and protein levels and small molecules and get
quantitative measurements of what the cell is doing at different times and
those shapes there are supposed to represent rates so we'd like to have rate
equations for all of these interactions.
And basically, the types of data and the types of analysis that we can do are
endless, even for these simple organisms.
Fortunately, we have lots of new technologies that are still being ramped up. Thanks again to having the genome sequence, we can now measure protein levels using mass spec. It's not as well developed a technology as microarrays, but we can use these very expensive mass spec machines to measure the spectra and identify what proteins and what small molecules there are in the cell.
We can actually measure physically what proteins are binding to the genome, and this is very important in terms of regulation. We can measure -- we
can observe proteins interacting with each other, binding to each other in the
cell and create these networks.
And one thing I want to point out with this slide is that each of these data types has its own dedicated journals, basically, with teams that develop methods
simply for analyzing the data, processing it, converting it into information
that modelers can then use. And visualization as well, network analysis and
things like that.
The possibilities are endless. So this is only going to get worse. So genomic
data is getting cheaper and cheaper. I think the rate is probably surpassing
what is going on in astrophysics. When the first human genome was sequenced,
it took about ten years and three billion dollars. And now, we can -- we'll
soon be having desktop machines that for a thousand dollars can sequence the
human genome in a matter of a day.
And each of these sequencing runs results in probably 50 gigabytes of data, if you include all the images and things that are generated. And so if you imagine, this is actually kind of a different paradigm from what happens in astronomy, where it's very democratized, very
decentralized. Every lab or researcher is going to be generating massive
amounts of data and they're going to need to know how to handle it and how to
store it and how to process it.
>>: What happened in 2007?
>> David Reiss: In 2007, that was essentially when commercialization took off. So there are a number of different companies that have developed their own methods for sequencing, where the technologies are essentially a different paradigm from the way it was done with the original sequencing.
>>: [inaudible].
>> David Reiss: Pardon?
>>: Capitalism, learning how to make money from this.
>> David Reiss: Exactly. Of course, there's huge biomedical implications, you
know, going to your doctor and being able to submit a blood sample and get your
genome back. And what they do with that is a different issue altogether. But
that's the goal. So the thousand dollar genome has kind of been the goal for a
long time. And we're almost there.
So these new high-throughput sequencing technologies are enabling a whole wide range of additional technologies. This is a plot of a type of data that I actually developed a method for analyzing, which was kind of inspired by my astrophysics background. It involved deconvolving measurements that were made at high resolution across the genome to identify, at high precision, where proteins are binding across the genome, so the X coordinate there is a genomic coordinate.
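The specific deconvolution method isn't spelled out in the talk, but a minimal sketch of the general idea -- treating the observed coverage as sparse binding positions smeared by a known kernel and recovering them with non-negative least squares -- could look like this. All positions, kernel widths, and values are made up for illustration:

    import numpy as np
    from scipy.optimize import nnls

    n = 200
    kernel = np.exp(-0.5 * (np.arange(-20, 21) / 6.0) ** 2)   # assumed smearing profile

    true_sites = np.zeros(n)
    true_sites[[50, 120, 126]] = [1.0, 0.8, 0.6]               # hypothetical binding positions
    observed = np.convolve(true_sites, kernel, mode="same")
    observed += np.random.default_rng(2).normal(0, 0.02, n)    # measurement noise

    # Build the convolution operator column by column, then solve min ||A x - y|| with x >= 0.
    half = kernel.size // 2
    A = np.zeros((n, n))
    for j in range(n):
        lo, hi = max(0, j - half), min(n, j + half + 1)
        A[lo:hi, j] = kernel[lo - (j - half): hi - (j - half)]

    recovered, _ = nnls(A, observed)
    print("strongest recovered sites:", np.argsort(-recovered)[:3])

The non-negativity constraint encourages a sparse solution, which is what lets nearby binding sites be separated better than in the raw, smeared coverage.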
There are additional technologies for measuring transcript levels at high resolution across the genome. Now there are these genome-wide association studies, which will be using these very inexpensive human genomes to try to identify mutations in the genome that might be associated with disease.
One of the main issues with this is that there's a huge number, or a high rate, of false positives. And so every day, you probably see a newspaper article
that talks about a new discovery of a gene that's associated -- of a gene
mutation that's associated with Alzheimer's or something. And then a couple
months later, you don't see the retraction because it's actually not been shown
to be statistically significant.
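The usual defense against that flood of false positives is a multiple-testing correction over all the variants tested at once. Here is a small sketch of the Benjamini-Hochberg procedure, using toy p-values rather than real GWAS output:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Return a boolean mask of tests rejected at false discovery rate alpha."""
        pvals = np.asarray(pvals)
        m = pvals.size
        order = np.argsort(pvals)
        ranked = pvals[order]
        # Find the largest k with p_(k) <= (k / m) * alpha, then reject hypotheses 1..k.
        below = ranked <= (np.arange(1, m + 1) / m) * alpha
        k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        mask = np.zeros(m, dtype=bool)
        mask[order[:k]] = True
        return mask

    rng = np.random.default_rng(3)
    pvals = np.concatenate([rng.uniform(size=9990),           # null variants
                            rng.uniform(0, 1e-6, size=10)])   # a few genuine signals
    print("variants surviving FDR control:", benjamini_hochberg(pvals).sum())

Controlling the false discovery rate is less conservative than a Bonferroni correction, which matters when you are screening millions of variants for a handful of real associations.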
So there's still a lot of work going on there. And, of course, now there's new high-throughput imaging, where there's a lot of potential cross-talk to be had between you guys and the biology world, imaging cell cultures at high rates in three dimensions.
And I think it's a far more complex task than classifying galaxies or something like that in astronomical data. So there's a lot of work, a lot of things that we can learn from you guys on this front.
So that being said, one of the tasks that I see of computational biology or
bioinformatics is taking all of these different types of data, understanding
that there are going to be false positives and false negatives and different types of systematics in each of them.
But by integrating them, we can hopefully try to identify where the false positives are in some of the data by cross-referencing them with what we call orthogonal types of data. And by integrating it into a complex computational model, we hope to generate -- I guess here, this is an old slide, so a circuit or a picture -- an idea of what's going on in the cell so that we can make predictions.
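One simple way to let orthogonal data types cross-check each other, in the spirit of what's described here, is to combine independent per-gene evidence; Fisher's method for combining p-values is a classic choice. The data sources and numbers below are illustrative only:

    import numpy as np
    from scipy.stats import combine_pvalues

    rng = np.random.default_rng(4)
    expression_p = rng.uniform(size=100)   # per-gene evidence from expression data (toy)
    binding_p = rng.uniform(size=100)      # per-gene evidence from protein-DNA binding data (toy)

    # Combine the two independent lines of evidence gene by gene (Fisher's method).
    combined = np.array([combine_pvalues([e, b], method="fisher")[1]
                         for e, b in zip(expression_p, binding_p)])

    # The smallest combined p-values point to the candidates best supported overall.
    print("strongest combined candidates:", np.argsort(combined)[:5])

In practice the integration can be much more elaborate, but the idea is the same: a signal that shows up in independent data types is less likely to be a false positive from any one of them.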
Now, of course, biological systems are not electronic circuits, and this is a recent opinion paper published by a neuroscientist, but it's equally apropos to biology, showing that for even simple cells like E. coli, if we wanted to go out and measure all of the potential interactions that are going on in the cell, given reasonable expectations for the rate of increase of measurement technologies and computational technologies, it would still take more than a million years.
And for the human genome, the case is even more daunting. So obviously, we're
not going to be able to do it this way, and this is where systems biology comes
in, as I see it.
So systems biology has been around for about ten years and the mantra is that
basically it's a multidisciplinary science in which we have biologists,
classically trained wet lab, wet, you know, wet bench biologists, technologists
who are typically engineers or bioengineers who can develop new technologies
for measuring these molecules, and then computationalists or informaticists who
can take all of the data and try to make sense of it.
And it fills this, what we call this virtuous cycle, where each of them feeds back onto the others. But one of the issues that I think we're still dealing with, in terms of integrating all of these multiple disciplines into systems biology, is the completion of this loop.
Basically, the problem is that as computational modelers or informaticists, we can make as many predictions as we want, given the data. And typically, we don't have sufficient biological knowledge -- I guess we're not part of the biological question enough -- to be able to prioritize these predictions in a biologically meaningful way.
And we can present these predictions to the biologists, and the biologists say, well, these are too many predictions. We can't test them all. How would we rank them? How would we prioritize them? And additionally, we don't understand the way these predictions were made, and therefore don't necessarily believe them.
So what typically happens -- this is kind of a common story -- is that we make predictions, some of which are novel, and they're either biologically novel predictions or they're just wrong. And in many cases, we do make predictions that are right and correlate with what the biologist already knows. So then the biologist looks at the predictions and says, okay, these make sense. And so we can write a paper saying that we've made one round of this cycle. We've made predictions based on your data. The predictions make sense.
But what we'd really like is to be able to use the information that's gathered
from these data to prioritize new experiments and really fill out our knowledge
about what's going on in the cell.
And so that being said, I think I'm running low on time. So here are just some final thoughts. Essentially, over the past ten years, computation has really become a central part of biology. It's not just -- many aspects of bioinformatics are seen as a service, in terms of storage of the data that comes directly off of the instruments and so on. But now it's an integral part of the research, and biologists are becoming more comfortable with interacting with computational biologists and vice versa. And at least in multidisciplinary institutes like the ISB where I work and other
places as well, this is -- it's a slow process, but it's really starting to
happen.
You even see this happening at academic institutions or universities, where there are these new multidisciplinary systems biology programs in which training happens both in conducting biological experiments and in doing some programming and some analysis, so that the biologist can actually speak all of the languages that are necessary.
So that being said, I wanted to just throw this in. Many of you may have seen
this paper that came out or multiple papers that came out in Science and Nature
last week based upon a consortium of over 450 researchers that essentially
mapped at high resolution, using a lot of the technologies that I described
earlier in the talk, the functional elements across the human genome.
This was a massive undertaking, almost as massive as the original sequencing of
the human genome. As you can see, this author list is kind of reminiscent of
what we've seen with Sloan Digital Sky Survey and other things, and the Higgs
discovery and so on. So biology is really kind of starting to get to the point
where we're going to need to learn from you how to deal with publication issues
and authorship and all those aspects as well.
But I would say probably one third of these authors were computationally oriented.
I think I'll end with that.
>>: Lots of questions. [indiscernible] 450 authors.
>> David Reiss: I don't actually know the logic that went into that. It's interesting -- the culture of how authorship lists are ordered is generally different in biology than in astronomy or physics. Typically these days, you have multiple first authors, and then the senior authors or corresponding authors are at the end of the author list. And I think the way that this collaboration worked was that this was the main publication, but there are 160 more detailed publications describing all of the results, and each of those publications has its own sub-teams, I guess.
But this is, I think, really the first foray into kind of the survey world and into making this huge amount of data available to the rest of the world. They have a wide number of databases, and there's even an iPad app that I downloaded the other day for exploring all of this data. And I think those are also things that we can learn from you guys.
>>: A comment and two questions. Comment is that it scares me when I hear
from a field that's funded at several orders of magnitude better than astronomy
that we have much to learn from you.
Two questions about the data mining. Clearly, biology data are vastly more complex and heterogeneous than astronomical data, let alone something so trivial as [indiscernible] physics. There's a high dimensionality problem. Is there a
special effort in developing better algorithms that scale well with high
dimensionality? We worry about tens of dimensions. You probably have tens of
thousands of dimensions. And also the visualization in highly dimensional
spaces.
And my second question is it seems to me that this is more like a text mining
problem than numbers mining problem, because AGTC are letters and genes are
more like words. And so is that fundamentally different kind of data mining
than what we do, say, with large tables of numbers?
>> David Reiss: Yeah, those are all good questions, and I think they're all very valid points. To take on your last question: that issue was one of the things that I really struggled with when I made the transition from astronomy to biology -- trying to understand the completely different set of statistical models that go into modeling genomic sequences, sequences of letters with fixed alphabets, and aligning genomes and so on and so forth.
And that aspect is definitely different. But I think there are other aspects -- and I hinted at those in one of my slides -- other types of informatics that are going to be just as important, if not more so, that are really essentially measurements. The domain dimensionality of the measurement is going to be along the genome, and trying to correlate the measurements with signatures that are in the genome is one of the things that we're struggling with right now.
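As a small illustration of what a fixed-alphabet sequence model looks like -- this is a generic example, not one of the models from the talk -- here is a first-order Markov chain over A/C/G/T estimated from a toy sequence:

    from collections import Counter
    from itertools import product

    alphabet = "ACGT"
    sequence = "ACGTACGGTTACGATCGATCGTTAGC"   # stand-in for a real genomic sequence

    pair_counts = Counter(zip(sequence, sequence[1:]))
    base_counts = Counter(sequence[:-1])

    # Transition probabilities P(next letter | current letter); unseen pairs stay at 0.
    transitions = {(a, b): pair_counts[(a, b)] / base_counts[a]
                   for a, b in product(alphabet, repeat=2) if base_counts[a]}

    print("P(C | A) =", transitions[("A", "C")])

Models of this kind, rather than the tables of continuous measurements familiar from astronomy, are the starting point for sequence alignment and motif finding.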
>> Yan Xu: Three more questions. You go first, and then [indiscernible] you still have a question?
>>: Identical question.
>> Yan Xu: Oh, okay. Pepe go, then.
>>: [indiscernible] real complexities, not so much [indiscernible] but in a
community of thousands of people, thousands of different [indiscernible]
completely different problems [indiscernible]. My idea is that in
bioinformatics, the solution is [indiscernible] finding missing parts and so
on. What is the level of complexity of the problems which you encountered in
informatics? In other words, [indiscernible] there's a very large variety of
tools and [indiscernible] which are required.
>> David Reiss: Yeah, so I tried to give an idea of the different types of data that are involved in bioinformatics. So, for example, there are teams of researchers that are working solely on trying to segment and cluster images of cells in culture.
And so obviously, you can imagine that being a very complex domain of research
which is significantly different from trying to match genome sequence and
identify the evolutionary tree, for example, from genomic sequences of
bacteria.
And so there is a huge range of different -- of fundamentally, as I see it,
different types of data that require different types of expertise and, in many
cases, different types of background than -- and, you know, correct me if I'm
wrong, than the way I see astronomy data being.
So I think there does need to be, you know -- one of the issues is that sometimes you deal with issues of communication just between the different computational methods. You know, you have people who publish in proteomics journals because they analyze mass spectra of protein data, and people who publish in IEEE symposia and journals who are more associated with, for example, the imaging question.
And then there are people who publish solely in bioinformatics journals who are more interested in the analysis of genomic sequencing data.
So oftentimes, just the publications are different enough that you don't get as
much cross-talk between those.
>>: Just this morning, someone forwarded me a message from a biologist arguing that ENCODE was a waste of money. The statement, as far as I could tell, is that biology is so diverse that, in contrast to astronomy, a dataset collected using a particular method is unlikely to meet the needs of many biologists. Do you have a comment on that? Do you think that's a valid statement?
>> David Reiss: I think that perspective comes from what your approach is. I do see some value in the ENCODE mapping, and essentially the way that they sell it is that they're trying to identify potential regions of interest across the genome. Up until now, for about 98 percent of the genome we had no idea what it was doing. Only about two to three percent of the genome is part of these coding sequences that make up genes.

So what they were trying to do was identify the functional regions of the genome that are outside of these coding regions. And I think it goes a long way. It's not quite as helpful as just getting the human genome sequence was, but I think it does take us a long way, and people like myself will be using that data to help constrain our models in significant ways.
And I think where it really comes in helpful is potentially by integrating it
with the data that's coming out from a particular lab for a particular domain,
especially if it's associated with human disease research. It won't be helpful
to a large number of biologists who work on bacteria or things like that,
but --
>> Yan Xu: So thank you again, David.