>> Dennis Gannon: My name is Dennis Gannon, and I'm chair of the program
committee for this thing. There's a number of other people that have been
involved in this. There's Kent Foster over there, who has really done a huge
amount of work getting this thing together. And I'm pointing him out in case
something calamitous happens, you know, talk to Kent.
Juan Vargas is back there. He's been involved. Arkady is probably around here
someplace. There he is. He's also been active in this. Plus we have the staff in
the back. And so we've got a good crew here, and we're all available to help you
in anything you need to do.
Now, I need to give you some basic instructions. First of all, as you can see we
have quite a good-size crowd here, so if any of these plenary sessions are just
full, there's an overflow room; in particular, Microsoft employees can go to the
overflow room during the plenary sessions so our guests can have the space
in the live room.
We also have our sessions set up so that this is the room where most of the
tutorials and invited talks will take place. The technical session talks are in
neighboring rooms. Also, there will be, let's see, what was I -- well, we have a
very full schedule.
Now, one of the things we need to do here, and it's very important because of
the broadcast of the event, is that all speakers have to sign a special form,
which I would be holding up if I had a copy of it, but I don't. So it's a special
form you have to sign and check two boxes.
Now, in addition, we need copies of your presentation so that your slide decks
can go on to the website. Also, we need copies of your presentations so that
when you are actually giving your presentation, our amazing audio visual
people can have you live as well as turn to your slides as needed during the
recording. So they will need your slide deck on their system prior to your
presentation. And in each of the rooms there will be, I think, one of these little
memory sticks; you can use that to put your slides on if you don't have your
own, and get them installed into the system before you talk.
Now, we're going to be ruthless in the way we limit the time spent on these
talks, because we have such a full agenda. We were terribly excited by the
quality of the abstracts that were submitted. We had a tough time picking ones,
and we had an even tougher time throwing out things that we wanted to keep.
So we decided to pack the schedule full because there were so many good
things. And unfortunately that means the schedule is very full. Which means
that we don't have a lot of break time, and in particular, between a couple of the
afternoon sessions there is essentially no break, and so if you're going from one
to the other, you'll have to move quickly. So I will ask the session chairs to make
sure that you end your session at least five minutes before the next one starts.
So that means we're going to keep track of time. We have these extremely
subtle signs that we are using to let speakers know when they're supposed to
terminate their discussion. So -- and that's on this. What else am I -- let's see.
Come to the podium with your deck on a thumb drive. Talks are limited to 20
minutes. I said that. Oh, yeah. On the floor, that's right, Kent is now
demonstrating. In case you haven't found it, if you need power for your laptop,
there are power outlets on the floor, scattered around the place. So don't be shy
about using Microsoft's electricity.
In addition, over here, if you haven't noticed it already, is how you get onto the
wireless: the password and information for the visitor wireless are posted there.
Let's see. Oh, yeah. Last item, you know, make sure that you -- and I see
there's no problem already. It says move to the center of the aisle so there's
plenty of room. Well, the center's full. So that's great. You're already doing that.
Anything else? What have I forgotten? Nothing? We got it all? Good. Okay. So
I'm going to actually -- I'm ahead of schedule. I don't have any serious prepared
remarks. But what I want to do next is introduce our keynote. We have two
keynotes today: one right away and another one right after lunch. And our three
keynotes are really people I consider to be extraordinary.
Until recently I was an academic computer scientist; I've been here for about two
years. And there are certain people among the US academic computer science
community who truly stand out as leaders, and we've got three of, I would say,
the top five here for this meeting. And this morning we have one in particular
whom I consider to be one of the real leaders of our discipline, from the
academic side as well as in the engagement of the computer science community
with the federal government and with the rest of the sciences. And that is Ed
Lazowska.
He is the Bill & Melinda Gates Chair of Computer Science and Engineering at the
University of Washington. He is -- I've got like three pages of stuff about him,
and I'm not going to go through them. Although I almost have enough -- no, I
don't have enough time. But just a few notes that Ed is, he's a member of the
International Academy of Engineering. He's a fellow of the American Academy of
Arts and Sciences, a member of the Washington State Academy of Science. A
fellow of the ACM, a fellow of the IEEE, and a fellow of the American Association
for the Advancement of Science.
He is one of the founding members of the Board of the Computing Research
Association. At least I believe -- Ed, I think you've been there from the very
beginning. He's also been a chair of the Computing Research Association.
He is also currently the chair of the Computing Community Consortium,
whose objective is to expand the engagement of the computing research
community in articulating and addressing the societal challenges of the 21st
Century. It's really an important job that he's doing there.
And, as I said before, I consider him, and I always have, to be one of the
outstanding leaders of our research community. And Ed, why don't you go
ahead and get started five minutes early.
[applause].
>> Ed Lazowska: It's great to be here. And this is not a used talk, it's a new talk,
so you're the dry run for this, and I hope you find it interesting.
For the past couple of years, I've been directing something called the University
of Washington eScience Institute. And eScience is not exactly equal to the
cloud, but they're sort of -- they're married, joined at the hip. So I'm going to give
you an overview of what we've been up to in a set of steps.
I'm going to talk a little bit about eScience, and of course, you folks are already
well aware of most of that. I'm going to talk about what we're trying to do at the
UW eScience Institute, give some examples of scientists whom we have moved
in the direction of managing their data more intelligently and utilizing the cloud,
and make a
few observations, lessons we've learned. And then in my role running the
Computing Community Consortium, I'm obligated to include in every talk a plug
for computer science. And of course, you don't need that either, but I hope this is
something you'll bring back to your home institutions and try and talk to people
about.
So I have to say all the work we've done has been not just inspired by but
personally led by people at Microsoft: Jim Gray, more recently Roger Barga, and
a set of others. And,
you know, I consider eScience just yet another transformation of science by
computer science.
I also want to give a nod to Dan Reed. Dan and I and also Dave Patterson who's
here or will be here, served on the President's Information Technology Advisory
Committee for a couple of years, and Dan chaired the subcommittee of PITAC in
2005 that wrote yet another report on computation-enabled science. And there
were a set of important things that Dan said. First of all, this was in 2005, now
five plus years ago. There was a clear callout of the need to focus on data-driven
science, because with the advent of sensors there's more data, and it's more
complex in various ways. Dan also has a very nice turn of phrase. For those of
you who
read his blogs, you see this all the time. But he had this report card on
computational science.
He said a report card of national performance might record a grade of C minus,
with an accompanying teacher's note that says this student has great potential
but struggles to maintain focus and complete work on time. So I think that's our
story in pushing this field forward.
So the way I think about this: Jim and Microsoft have talked about four paradigms.
I think of there being a fifth actually. You know, in the beginning there was theory
and there was experiment and there was observation. And they're of course
closely linked with one another. And I'll come back to this slide in a minute. This
is how oceanography is done even today. We're trying to change that.
That was complemented 30 plus years ago by simulation-oriented computational
science. And what's happening more recently is we're making a transition to what
I call eScience, which does not obviate simulation-oriented computational
science, it simply shifts the focus from the cycles to the data, and the
management and analysis and exploration of that data. And the Sloan Digital
Sky Survey, in which Jim played a crucial role, was really the prototypical modern
eScience project that really turned the tide. Because Jim was the first person
to show that you could put large scale science data in commercial database
systems.
Now, of course that was at a time when the volume of that data was exploding.
And as you'll hear from other speakers later today, it's not obvious that scaled-up
RDBs are the right solution for the long-term future, but nonetheless this was
important, because typically science data has been stored in flat files with no
metadata. And I'll come back to this in a minute. Sticking this data, with
metadata, into a commercial relational database system is transforming.
I want to mention that with the eScience Institute at UW, I think of science as
being a pyramid and to be honest, our view is that the folks at any institution at
the very, very top of the pyramid are doing okay but there are a huge number of
phenomenal scientists one level down the pyramid who need guidance if they're
going to continue to be successful. And so I'll sometimes say things that you
know don't apply to the very preeminent large scale science projects but they
apply much more broadly below that.
So, you know, eScience is driven by data, huge volumes of data from the
advent of modern sensor networks, and this is the Apache Point telescope that
was used in the Sloan Digital Sky Survey. And remember that this survey took
place over a period of seven years, transformed astronomy from sort of simple
observations to this sort of survey astronomy and really changed the field.
And what it produced was about 80 terabytes of raw image data over the space
of seven years. And that was enormous in its day, really truly enormous. But the
new generation of science tools make this look like nothing. For example, and I'll
come back to this in a minute, the Large Synoptic Survey Telescope, LSST,
which is a project that's in motion now through the National Science
Foundation, will generate 40 terabytes a day. So every two days it generates
seven years worth of Sloan Digital Sky Survey data. All right?
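The arithmetic behind that comparison is worth a quick sanity check, using only the figures quoted in the talk:

```python
# Sanity check of the data rates quoted in the talk (TB = terabytes).
sdss_total_tb = 80            # Sloan Digital Sky Survey: ~80 TB over 7 years
lsst_daily_tb = 40            # LSST: ~40 TB generated per day

# Days for LSST to produce the entire 7-year SDSS archive
days_to_match = sdss_total_tb / lsst_daily_tb
print(days_to_match)          # 2.0 -- "every two days ... seven years worth"

# SDSS's average daily rate over those 7 years, for contrast (GB/day)
sdss_daily_gb = sdss_total_tb * 1000 / (7 * 365)
print(round(sdss_daily_gb, 1))
```

So LSST's sustained rate is roughly three orders of magnitude higher than the survey that was itself considered enormous in its day.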
The telescope itself is located in Chile. There's 400 megabits per second of
sustained bandwidth required between Chile and NCSA just to move the data,
okay, and much higher peak rates. So it's just a completely different, literally,
order of magnitude of data volumes required. But every other large scale
science project is like this. The Large Hadron Collider generates 700 megabytes
of data per second, okay, so 60 terabytes a day. These little Illumina gene
sequencing machines that sit on desktops produce about a terabyte a day per
machine, and big labs have between 25 and 100 of them. So there's a lab at the
University of Washington that soon will have 25. I have friends working at a
project on the East Coast that will have a hundred of these machines. All right.
So in a single lab these are desktop machines that produce enormous amounts
of image data. And it's necessary to keep that image data and analyze that
image data. You can't discard it as has been the pattern up until now, because
you're discarding essentially the raw data from your experiments that you need to
go back to.
That's part of the lesson of the Sloan Digital Sky Survey which is that you're
going to want to ask questions that you didn't think of at the time you wrote the
proposal and wrote the Fortran program and defined the flat file format, right?
And to do that, you need to keep the original data around.
I've been working at the University of Washington, with Jim's help for a number
of years and Roger Barga's more recently, on what's called the regional
scale nodes of the NSF Ocean Observatories Initiative. This is under
construction now. I'll say more about this project later. But there are three
elements to the Ocean Observatories Initiative. But the one that we're most
interested in here involves deploying about 1,000 kilometers of fiberoptic cable
on the sea floor of the Juan de Fuca Plate off Oregon and Washington and
British Columbia and stringing it with thousands and thousands of chemical and
physical and biological sensors that bring data back in realtime, all right? And
the realtime analysis of that data and the democratization that makes that data
available to scientists everywhere and to school kids and teachers along with
those scientists is really transforming and part of what we have to achieve.
Finally of course, The Web is nothing but an enormous data source. And at the
University of Washington we've talked a lot about how the fact that this new form
of data oriented science affects the sociologists as much as it affects the
physicists and chemists and engineers and oceanographers. All right. If you're a
sociologist and you're interested in studying the creation, evolution, dissolution,
and morphing of social cliques, you used to do that by getting together 30
psychology freshmen and paying them six bucks an hour to sit in focus groups at
lunch time. And now you have four hundred million Facebook users whose data
you can in principle analyze, right? So just a total transformation. Point of sale
terminals, on and on.
eScience is obviously about the analysis of this data, the automated or
semi-automated analysis of the data. And it's not just a matter of data volume.
All right? So scientists are confronted with a number of changes, including the
volume of data, the rate of data, and the complexity or dimensionality of data.
And at all stages of that scientific pyramid you have scientists facing data
analysis challenges that they didn't face before. All right. So that's really the
challenge that we have to face as computer scientists and computational
scientists. It's helping the broad array of scientists manage data that they haven't
been confronted with before.
Obviously there's a spectrum of technologies, our technologies, that are utilized:
sensors and sensor networks, large scale backbone networks, databases, data
mining, machine learning, visualization, and cluster computing or cloud
computing at enormous scale.
And eScience is really married to the cloud in a number of ways that I'll discuss.
This frazzle-haired guy is an undergrad of ours who a few years ago now,
actually in, gosh, now, 2007, used his Google 20 percent time to come up to the
University of Washington and help me devise a course on sort of Google-scale
data intensive computing.
It's sort of a funny story. We needed a cluster to run on. There wasn't the
availability of a Google, IBM, Microsoft, or Yahoo! cluster at the time. And he
couldn't get anyone at Google to approve course access to their systems or
purchasing a system for our students to use, so he simply found one on eBay
and put it on his MasterCard and installed it and then sent the bill in, and
blessedly somebody was willing to sign off on reimbursement for him. All right?
And this led to a really interesting curriculum that's evolved over the years in
teaching undergrads how to do big data style computing.
And today, you know, whether you're using Google or the Azure Services
Platform or Amazon Web Services, this is just fundamental to the scale up that
we all have to
do. And I'll again come back to that in a sec.
The important thing from my point of view is that eScience is really going to be
pervasive. And in my view, again, there are a set of people who are offended by
this comment, but it's very different than traditional simulation-oriented
computational science which on most campuses was three physicists, two
astronomers, and a chemist, all right? That's of course a preposterous
exaggeration, but it was a niche on the scale of science, all right? And it's not
that it wasn't important; it was important, and it will continue to be
important. It was transformational. It will continue to be transformational. But
you didn't as a university or most companies or research labs, you didn't have to
excel at it in order to be competitive. And in my view unless you excel at the
tools of data intensive science and data exploration, you're going to be out of
business over the next decade.
So these capabilities have got to be broadly available across any institution,
whether it's a corporation or a research lab or a university. And that's really the
goal of this University of Washington eScience Institute.
So let me give a little history of this from the astronomy point of view. Again, we
talked about the Sloan Digital Sky Survey and its data rates and data volumes
over a period of seven years and the incredible role that Jim Gray played in really
setting an example and putting astronomy ahead of all other fields in terms of
how it handles its data.
The project plan for the Sloan Digital Sky Survey had it budgeted at a 16 million
dollar project. It is literally the case that the software was to be written by
astronomy faculty over their summers when they weren't teaching. Okay? So
this was to provide summer support for astronomy faculty by writing the code.
We had a discussion just last week with Andy Connolly who is a superb
astronomer at the University of Washington who worked on this software as a
graduate student and was sort of giving us the background story. And that was
the plan.
The plan was to use Objectivity as the data store. And Objectivity was going to
be great because a big company, Motorola, had built it to use with the Iridium
satellite project, which of course was going to be a great success. So what could
possibly go wrong, right? So the project reality was, first of all, it grew from 16
million to 80 million dollars. 30 percent of those funds were spent on software.
All right, 30 percent of the funds, which is still out of proportion to the way
current science projects are budgeted. The OOI was originally planned
essentially as SDSS had been planned 10 years before, that is, with no data
management component to it.
And that 30 percent does not count the monumental contribution by Jim Gray and
his colleagues at Microsoft, which transformed the project. Again, a quote from
Andy, who was in this from the beginning, is if it weren't for Jim Gray's
contributions, SDSS would have been more likely to yield 100 research papers
than 5,000. Okay?
And there's an interesting aspect of sociology here. What caused the
astronomers to put their data in a repository so it's accessible to everyone? And
what Andy said -- sorry, what Jim said was the great thing about working with
astronomical data is everyone realizes it has absolutely no value, so you don't
have any intellectual property problems. All right? Andy gets a little upset when
you say that, but what he said was that the people involved in this project
realized pretty early on that there were more papers to be written, more research
to be done than they could possibly do, all right. And therefore, they weren't
relinquishing any competitive advantage by publishing the data for everyone to
use. All right?
So in past versions of projects like this, a small cohort of people who had built the
instruments and had proprietary access to the data would have done the
research and it would have resulted in a tiny fraction of the research being
accomplished that actually was accomplished by the project. And really by the
project -- by the data from the project being put in a Microsoft commercial
database system with metadata so that a whole bunch of people could ask
questions that hadn't necessarily been anticipated by the folks who wrote the
proposal and carried out the project.
All right. So how did this come to be? This is an interesting social story as well.
When Jim arrived at Microsoft, he was immediately set to work on two projects. I
remember he came over to see me in his first couple months with the company
and he was sort of chuckling over the fact that even at that time he was a Turing
award winner and sort of the king of database management. And he had been
asked to do two things. One was to win the TPC benchmark using Microsoft
database technology. So I was one of the people who worked with Jim on a
paper in the 1980s, "A Measure of Transaction Processing Power," that
yielded the TPC benchmark for comparing databases.
And Microsoft wanted to win that with their database technology to prove that it
was real, that it wasn't sort of a toy database.
And the second thing he was asked to do was to mount the world's largest Web
accessible database. And he was trying to think about what data people might
like to utilize. And he came upon the idea of satellite imagery and built
something called TerraServer, which eventually brought him awards from the
federal government, sort of lifetime achievement awards from the Geological
Survey, I believe it was, all right? So the idea was all of this satellite imagery: he
got some of the satellite imagery from the US Government, he got a lot of it from
the Russians who hoped they could develop a market for their satellite imagery,
so there was a little thing on the TerraServer Web page where you could click to
buy images from it.
Anyway, that was sort of Jim's introduction to geodata and geobrowser and
streaming, and again led to a huge number of innovations. Another
contribution Jim made was to realize that in computer science, and I don't know
whether this is good or bad, whenever you add another order of magnitude to
some dimension of a problem, you typically stumble into new research, all right?
We simply don't build our systems so that they scale beyond one or two orders of
magnitude in one dimension, certainly one order of magnitude in two different
dimensions and you sort of have to rethink things. So a huge amount of terrific
work happened.
All right. Meanwhile, Alex Szalay, a phenomenal astronomer at Johns Hopkins
was in charge of building the data systems for the Sloan Digital Sky Survey. And
we've already talked about how they started with Objectivity. The original plan
was the astronomers were going to do this in their summertime. Of course, like
all faculty, the graduate students were doing it. And Connolly was saying the other
day that it was very funny the graduate students would meet at a conference and
they would say, well, you know, who are you masquerading as and who are you
masquerading as? That is, which faculty were they actually writing the code for
while the faculty member was doing something else in the summer. So that was
Alex's story.
Now it gets even more bizarre. Alex and Charles Simonyi, a long time
Microsoftie, had both grown up in Hungary. And their fathers had run the two
preeminent Hungarian physics research institutions. And Alex and Charles had
actually never met. But their mothers had been trying to get them together
forever. I'm
not making this up, okay. Maybe Jim was but he told me this story. And it's just
so phenomenal. Okay.
So Alex's and Charles' moms arranged that when Alex was coming out to Seattle
for a conference, he would get together with Charles Simonyi, and these two
sons of the preeminent Hungarian physicists would get together for lunch. All
right?
So at the lunch, you know, Alex says to Charles, what are you working on?
Charles says I guess Office or something like that. Charles says to Alex what
are you working on? Alex says, oh, man, I got me a problem, okay, and
describes the fact that the data is about to start arriving and it's going to fall on
the floor. Charles says, you need to meet my friend Jim Gray. He understands
data. So the next day they flew down to the Bay area, and Jim and Alex hit it off.
And that's how Jim got engaged in the Sloan Digital Sky Survey. It just shows
how through these sort of coincidental happenstances absolutely great things
can happen.
Okay. So now we've talked about LSST as well, all right, the sort of successor
astronomical survey project. Why? Well, the image on the left, and again this
is from Andy Connolly, is a patch of the sky and essentially what you can resolve
using the Sloan Digital Sky Survey. And the image on the right, you probably
can't see a huge difference here, but on the Web you'll be able to see a great
difference between these photographs, is what you'll be able to resolve using
LSST. So a phenomenal difference in resolution between these two surveys.
And again, survey astronomy in general is totally transforming.
The data management system is widely distributed, all right? So master control
is in the western United States. The archive site is at NCSA. The telescope is in
Chile, all right? As I mentioned, 400 megabits per second sustained continuous
average bandwidth between Chile and NCSA just to get the data into the
repository. So extremely distributed system.
Most of the computation that they do as part of the data pipeline is
embarrassingly parallel. And I'll come back to that theme again and again and
again. A lot of science is embarrassingly parallel, which lends itself to today's
cloud, and a lot of scientists don't realize this. So again, something I'll refer to
a minute is at the University of Washington we found enormous resistance to
moving to the cloud by people saying well, my application can't work on the
cloud. These are folks working on local racks with high bandwidth interconnects.
And I view this as analogous to the resistance, I don't know, 15, 20 years ago, to
moving from Cray vector machines, okay, to racks with closely coupled
interconnects. For example, when Larry Smarr started getting SGI Origins at
NCSA, there was a lot of resistance. And some proportion of the scientists who
resisted that transition were right. Their algorithms, their problems could only be
solved on vector machines.
But for the vast majority of them, either directly or by change in algorithm, they
were able to get on this far more cost effective, far more scalable architecture.
And my just intuition, and I imagine you all share it, is that the same is true today,
that there are a large number of people who say nah, can't do it, right? And
some of them are going to be right. But I'm confident that the majority of them
either directly or with algorithm changes, which are admittedly painful, are going
to be able to get on this next generation of far more scalable, far more cost
effective computation. All right? And LSST lends itself wonderfully to that.
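The "embarrassingly parallel" pattern he keeps returning to can be sketched in a few lines: each work item is processed independently, so a plain map over a pool of workers is all the coordination required, with no high-bandwidth interconnect. The `process_image` function here is a hypothetical stand-in for a real pipeline stage, not anything from LSST.

```python
from multiprocessing import Pool

def process_image(image_id):
    # Hypothetical stand-in for a pipeline stage (calibration, source
    # detection, ...). Each image is handled independently: workers never
    # communicate, which is what "embarrassingly parallel" means.
    return image_id, image_id ** 2  # dummy "measurement"

if __name__ == "__main__":
    image_ids = range(8)
    with Pool(processes=4) as pool:  # 4 local workers; could be 4,000 cloud nodes
        results = pool.map(process_image, image_ids)
    print(results)  # same answers as a sequential map, just spread across workers
```

Because the items never interact, the same code scales from a laptop to a cloud cluster simply by widening the pool, which is exactly why this class of science maps well onto today's cloud.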
Okay. So the project plan for LSST from the beginning was that 30 percent of
the project budget (and for this project, including preliminary grants, it's close to
400 million dollars; these things have grown and the dollar has shrunk) is
allocated to software. So they're taking this very, very seriously from
the get-go on that project.
Astronomy is way ahead of other fields. So at a talk at the University of
Washington last week, a computational astrophysicist who is a research scientist
with the eScience Institute began a talk by describing the data management
tools, the data management API essentially, for computational astrophysics. And
here's what he said: fopen, fread, fwrite, fclose, and secure copy. Right?
Now, you know, again not everybody -- is this too far off? This is about it. All
right? And you might think that the people doing this are yokels, but you'd be
wrong. The people doing this are top scientists across the country and around
the world. This is how data is managed, okay? So in the computational
astrophysics work that Jeff and his faculty colleagues do, every simulation
generates a sequence of snapshots. Every snapshot is a single flat file. And
analysis is by C and Fortran programs. And that's how physics marches forward,
okay?
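The flat-file workflow he's describing looks roughly like this in practice: each snapshot is dumped as a headerless binary file whose layout lives only in the analysis code's head. This is an illustrative sketch in Python rather than the C and Fortran he mentions, and the record format is invented.

```python
import os
import struct
import tempfile

RECORD = "<3d"  # invented layout: one (x, y, z) triple of doubles per particle

def write_snapshot(path, particles):
    # Headerless flat file: no metadata, no schema, nothing but packed
    # numbers -- exactly the pattern the talk is criticizing.
    with open(path, "wb") as f:
        for p in particles:
            f.write(struct.pack(RECORD, *p))

def read_snapshot(path):
    # The reader must already know the layout; the file can't describe itself.
    size = struct.calcsize(RECORD)
    with open(path, "rb") as f:
        data = f.read()
    return [struct.unpack(RECORD, data[i:i + size])
            for i in range(0, len(data), size)]

snap = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]
path = os.path.join(tempfile.mkdtemp(), "snapshot_000.dat")
write_snapshot(path, snap)
assert read_snapshot(path) == snap  # round-trips, but only by convention
```

Any question the analysis code wasn't written to ask requires rereading and reparsing every snapshot, which is the gap a metadata-bearing database fills.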
So here's data management in biology. And this is an example that I'll come
back to again from an absolutely top tier environmental biologist at the University
of Washington. And this is how she manages the data from her environmental
sequencing, okay? It is -- I'm not making this up, by doing manual joins on
spreadsheets, okay? Bill Howe, who is in the audience here has built a tool I'll
describe later that helps them deal with this, okay?
So these biologists get multiple spreadsheets, right, out of their sequencing and
they do joins, you know, either on the screen or by printing listings, right? Mind
boggling. These are absolutely top tier scientists. And this is how they do their
work. A colleague of mine in the venture capital community here cited a study
that claims that 90 percent of all business data is maintained in spreadsheets,
okay? Despite the enormous proliferation of, you know, PeopleSoft and SAP and
SQL Server, right? And I'm confident that 90 percent of all science data, or far
more than that, is in spreadsheets or flat files. All right? So that's a problem.
Now, scientists understand that this is a problem. One thing we did at the
University of Washington a couple of years ago was a survey of 125 top
investigators across all fields. And here's how we began. We identified about twice
that number, about 250 or 300 top scientists and engineers by looking in each
field for who's getting significant grants in that field, who's winning awards at the
younger level like Sloans and CAREER awards, who's winning awards at the
senior level like National Academy membership and things like that, HHMI, right?
Find out who the top people are.
We, the folks in the eScience Institute conducted interviews with 125 of these
folks, one-hour interviews, and then these were tabulated. And what these
people said was the problem they were facing in the future, the big problem, was
the management of their data. All right? Flat files and Excel are in fact the most
common data management tools which is great for Microsoft but not really so
good for science.
And a typical science workflow we find across UW is you have a situation where
the manual intervention in the science data workflow two years ago was taking
half a grad student day per week, and now it's taking one FTE, right? Because
the data complexity, either volume or sophistication, is up 10X in the past two
years. All right? And, you know, extrapolating, which is reasonable, in another
two years it's going to be taking 10 FTE of manual intervention. So what you've
got to have is tools to help these scientists investigate the data, because the data
pipeline, the data workflow, is becoming a gating function to science. So that's
the goal of the eScience Institute. The motivating observations, and I've said all
these things before: like simulation-oriented computational science, eScience,
data-oriented science, is going to be transformational. Unlike simulation-oriented
science, it's going to be pervasive. Even more broadly than simulation-oriented
science, it's going to use new techniques and new technologies from
computer science.
So there's a great opportunity for a marriage between us and them that goes way
beyond sort of compilers and operating systems, sort of simulation techniques,
visualization.
Cloud services are really essential. And from the point of view of a particular
institution, if you're not a leader in this, you're going to get left behind. So that's
the scare for us.
Let me back up, because there's one more thing I wanted to say about this study.
If you go at a university to a provost or a vice president for research, what you're
going to find is they don't believe this. All right? All right? And they don't believe
it -- see, a bunch of people nodding their heads, okay. Because when they want
to know what sort of computing is going to be required by scientists five years
down the road, the people they go to quite naturally are the people who self
identify themselves as computational scientists. All right? And those are the
very important, very successful scientists who are doing simulation oriented
computational science. And what they need is more machine rooms and more
racks and more power and more air conditioning and more networks to access
supercomputer centers and resources like that. All right?
And so in some sense there -- obviously simulations are an enormously important source of data, and the data produced by simulations is growing and
needs to be analyzed. But fundamentally the focus of these folks is on the cycles
rather than the analysis of the data. Right?
So in my view, provosts and vice provosts for research tend to get the wrong answer
because they ask too narrow a swath of people about what the future's going to
be. And the scientists get it. If you ask a broad spectrum of top scientists what
their challenges are, they'll talk to you about the data. And I think it's just really
important to understand that management may not quite get it yet.
Okay. So the goal here is to try to position the University of Washington in a
reasonable situation for this eScience. And the strategy is first of all, to bootstrap
a cadre of research scientists who can lead the way. And these are typically
people with doctorates in science fields or in computer science and a strong
orientation towards the cloud and the management of data.
Secondly, and this is really important and what I'll talk about next, help leading
faculty become exemplars and advocates. All right? So this is to try and
overcome the oh, no, this won't work for my stuff resistance, all right? If they see
great faculty who are renowned as leaders across the campus doing this, they'll
start doing it too.
Broaden the impact by then creating facilities to extend these capabilities across
the campus and adding faculty in key fields. And this was launched as an initiative by the Washington state legislature, with a bunch of subsequent grant funding, for
example from Microsoft and from the Moore Foundation. And again, Microsoft
has been absolutely instrumental in this. And the Moore Foundation, which looks
at sort of advancing science across the board is extremely interested in
approaches to spread data intensive science across the university community.
So we have a large grant with Carnegie Mellon University focused on this sort of
spreading the word.
Here's our technical staff. Dave Beck focuses on biology. Jeff Gardner focuses on
physics, astronomy, the physical sciences. Bill Howe is our main data person.
Chance Reschke, I think of him as sort of the high performance computing -- which now also means high performance cloud -- person. Erik Lundberg is doing outreach for sort of the lower part of the pyramid on data
management.
So now I'm going to give you a set of examples. And these are the top scientists
at UW who we're using as exemplars for what they've done. Ginger Armbrust is
an absolutely top tier researcher in environmental metagenomics. Okay? So
environmental metagenomics means that you're sequencing the broad stew of
what's in a sample from the ocean as opposed to a single organism, okay. So
the data volumes are up enormously. And Ginger has built -- she's an
oceanographer, she's built a bunch of technologies for continuous analysis rather
than batch analysis. And so again, it's studying these microbial populations. And
the sorts of things you try to answer is, you know, who is there, what are they up
to, and comparisons of data sets across, for example, a near shore and deep
ocean or before and after the spring flood or across salinity or temperature
boundaries, right, day and night. So you're just trying to understand what causes
these changes in the ocean biological system.
I'll spend more time on oceanography later, but the fact is that the oceans, since
they cover 70 percent of the Earth's surface, are responsible for an enormous
amount of the environment that we experience. And the future of that
environment. And we know next to nothing about them. It's really shocking how
little we know.
So this is Ginger's business. And here's how she does it. The sorts of questions
she wants to answer. And again, I have to thank Bill Howe for these slides. The
question she wants to answer are in the lower left, okay?
And the way it happens is you do environmental sampling from ships, or
something like that. You pass those through a sequencer. As part of the
sequencing process, you look those sequences up in a set of public databases,
all right? And what that does is yield a phylogenetic analysis in which you find out
where these organisms fit in a phylogenetic tree, all right?
Now, every step of that process produces data, and this data -- I'm not making it
up, goes in a bunch of spreadsheets. All right? And then what happens is
manual analysis of those spreadsheets is done, and that manual analysis is like
database joins except on exponentially increasing amounts of data done by
hand. All right? And again, Ginger is not a yokel. She is an utterly top tier, well
funded biologist whose data volumes two years ago made this the most effective
way for her to do her work. All right.
So the change is: this no longer cuts it. All right. So what Bill has done is to build what are really some very simple tools that have been transforming for these folks. First, a tool that allows them to upload these datasets into SQLShare, all right, without really any schema at all. All right? And then a Web tool that allows the
scientists to execute very, very simple SQL queries against the data that's been
uploaded to the database. And I'll show you that in a sec, but the interface that Bill
and his colleagues have built has the ability for them to enter direct SQL
commands but also a bunch of their standard queries are available in sort of a
click and drop in sort of thing. And as with people who don't really know how to
program but for example in an introductory course are given the code that makes
a robot work, right, modifying the queries is something that's easily within the
reach of all of these scientists and even writing queries from scratch once they
see the set of examples, all right?
So here's what they say: something that took me a week with Excel takes me 15 minutes with this system. I can do science again. All right. So
again, there's no rocket science in here. But it's remarkably transforming for
these scientists. So think of this as a screen shot of the Web interface in which a
set of standard queries are on the left. These are fairly sophisticated queries.
They're not simple queries, okay? Against this database that's been created in a
schema free way by uploading the spreadsheets, all right? So now you could do
your joins and selects using SQL. That's the result.
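To make concrete the kind of join that was previously done by hand across spreadsheets, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and values are all hypothetical, invented for illustration -- this is not SQLShare's actual schema or interface, just the shape of the query it lets scientists run.

```python
import sqlite3

# Two "uploaded spreadsheets": sequence hits and sample metadata.
# All names, columns, and values here are made up for illustration;
# this is not SQLShare's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (sequence_id TEXT, sample_id TEXT, taxon TEXT)")
conn.execute("CREATE TABLE samples (sample_id TEXT, site TEXT, salinity REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)",
                 [("s1", "a", "Prochlorococcus"), ("s2", "b", "Synechococcus")])
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                 [("a", "near-shore", 28.0), ("b", "deep-ocean", 35.0)])

# The join that used to be done by hand across spreadsheets:
rows = conn.execute("""
    SELECT h.taxon, s.site, s.salinity
    FROM hits h JOIN samples s ON h.sample_id = s.sample_id
    WHERE s.salinity > 30
""").fetchall()
print(rows)  # [('Synechococcus', 'deep-ocean', 35.0)]
```

A three-line query like this replaces a manual row-matching exercise that grows exponentially with the data.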
Here's another one that's pretty interesting. David Baker is an absolutely world
class biochemist. David is a 45-year-old -- now 48 -- member of the National Academy of Sciences. And David owns the code called Rosetta, which
is the international standard for protein structure calculation and protein analysis.
All right. So Rosetta is his code, and it's extremely broadly used. Now, this is a
really interesting story on where David began on big racks of machines. And he
still owns something like 20 percent of the University of Washington's machine
room space, all right.
And on these racks some of the computations are tightly coupled and legitimately
require this, but a large number of the computations are embarrassingly parallel.
Because what he's trying to do is to find global minima, minimum energy states in
these proteins. And like any minimum energy calculation, you have to avoid
getting caught in a local minimum. So what that means is you start at millions
and millions of places and do local optimization from those places using his code.
All right? And that's an embarrassingly parallel computation.
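The multi-start idea -- many random starting points, a local optimization from each, keep the best result -- can be sketched in a few lines. This is a toy one-dimensional landscape with an invented energy function and a crude descent loop, not Rosetta's actual energy model or optimizer.

```python
import math
import random

def energy(x):
    # Toy 1-D "energy landscape" with several local minima;
    # the global minimum is at x = 0. Not Rosetta's real energy function.
    return x * x + 10 * (1 - math.cos(x))

def local_descent(x, step=0.01, iters=5000):
    # Crude derivative-free descent: step downhill until stuck.
    for _ in range(iters):
        for dx in (step, -step):
            if energy(x + dx) < energy(x):
                x += dx
                break
        else:
            return x  # neither direction improves: a local minimum
    return x

# Multi-start: launch many independent local optimizations and keep the
# best result, so no single run can trap us in a local minimum. Each
# start is independent, which is what makes it embarrassingly parallel.
random.seed(0)
starts = [random.uniform(-30, 30) for _ in range(200)]
best = min((local_descent(x0) for x0 in starts), key=energy)
print(round(energy(best), 4))
```

A single descent from a bad starting point gets stuck in a side valley; two hundred independent starts reliably find the global minimum, and each one could run on a different machine.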
David realized this just a few years ago and turned it into a screen saver using [inaudible], all right. And hundreds and hundreds of thousands of people are
running David's screen saver. And then what happened was he built this little
animation on the screen saver that would show you what it was doing. It was a
very simple animation. And he started getting e-mail from people running his
screen saver saying your program is dumb as dirt, it's doing this when it should
be doing that, all right. And David said, well, you know, if these guys are so
smart, I'm going to turn this into a Web based video game so that people can
actually help with this.
So he worked with Zoran Popovic and David Salesin and a set of grad students
in our department, Adrien Treuille and several others, to build a Web based video
game called Fold It, okay. And Fold It now has 100,000 people playing it, all
right. It's a very cool Web based video game. If you go back to Luis von Ahn at
Carnegie Mellon who sort of launched this human computation craze, the
visibility of it a few years ago with his thesis, Luis's observation was that people are willing to do almost anything in return for points, right. [laughter]. And Luis, you
know, he used to give his talks by beginning with the observation that he
somehow got a statistic for the number of person hours spent per day worldwide
playing solitaire. All right?
And it turns out -- and I forget the exact number, but roughly nine days of the
world playing solitaire is the number of man hours that were required to build the
Panama Canal, right? So Luis's view is: suppose some fraction of this could be turned to useful work. So that's what David and Zoran and their students have
done for science.
Now, the people -- most interesting thing about this is the people who are great
at this game are not PhD level biochemists. In fact, I've heard David say that
when he starts playing the game, there's of course a chat room associated with
it, people will start saying stuff like who's this Baker guy, he's not very good at
this, right? He's competing against the people who are online.
A year ago, they flew out to Seattle for three days a 13 year old who was sort of
the idiot savant at this folding game, all right? And his parents came out with him
because they wouldn't let the kid travel alone. And they watched this guy play for
a couple of days. I should say that the game is instrumented. Their hope is to extract algorithmic principles from how great players play the game that they can embody in the program. But they couldn't figure out how this 13 year old was doing so well, so they flew him out here and they watched him for a couple of days while his parents toured around. He unfortunately is no longer playing
the game. As of last Christmas, I think he discovered girls. [laughter]. So he's
not playing.
The current guy who is doing extremely well is this 50 year old from Dallas,
Texas, whose name -- this is from the blog associated with the game, is Boots
McGraw. I'm not making this up, okay. Boots says here, I'm a redneck from
Texas, but I was in grad school at the State University of New York at Buffalo, and so on. So Boots is doing unbelievably well at this game. And Boots just won
a prize for having the first protein that was actually synthesized, okay.
There are many goals of this project, but a goal is to produce novel enzyme
catalysts, for example, okay, that catalyze new reactions and that don't exist in
nature. So you design one and then the question is can it be built, can it be
synthesized, okay? And Boots did the first one that could be synthesized and the
company that did the synthesis for them prepared this trophy, which they
awarded to Boots. It's a sort of plastic model of the protein that he designed and
they synthesized. And there's an interview with Boots now on the website that
just showed up two days ago. I like this. He says, you know, I'm real happy to
get this. I'll put it in my office. Now when my co-workers once again ask why I
don't want to play farmville with them, I can show them the model. [laughter].
All right. So having people cure cancer by finding minimum energy states of
proteins is a whole lot sort of better from a societal good point of view than yet
another guy playing farmville, all right. So it's pretty remarkable.
Now, David has a spinoff company called Arzeda. And what they're doing is
specifically enzyme catalysts for energy applications. Right? And Arzeda has
moved all of their work into the Amazon and Microsoft cloud, right? And they've
done it using essentially a Condor interface to the cloud, all right. So again, a
very embarrassingly parallel hugely scalable computation. And this from
somebody who began doing all of his computing on closely coupled racks taking
up 20 percent of our university's machine room space. Okay?
So Arzeda is in the Amazon and Microsoft clouds. That's how they do their
computing. Again, these slides are on the Web. There's a URL on the first page
of this. Okay.
So now I'm going to talk about the oceanography project. And this is work that's been really joined at the hip with Microsoft. And I can't thank Roger and
his colleagues enough for the help they've given us. This is a project we called
Azure Ocean. And it has three components to it. But the big issue here again is
there are all sorts of biological and chemical and physical processes in the ocean
that we have to understand, and oceanography is traditionally done like this. You go
to sea in a ship, you stick your head in the water where you happen to be and
you measure temperature and salinity and pressure. And at the end of the cruise, the chief scientist who's written this stuff down, you know, originally in a
notebook, then in a tape, now in a disk takes that home with him or her and that's
the end of it. All right.
Now, you know, the ships are bigger. UW and Scripps and Woods Hole and Hawaii now have 250-foot sort of diesel electric vessels. But the science is done
in much the same way. It's expeditionary. All right. And the goal of the ocean observatories initiative is to transition oceanography from observational to -- sorry, from expeditionary to observatory-based science and dramatically deepen
our understanding of these ocean processes so we can figure out how to
manage the oceans and the impact on our environment.
So that's the pitch. This project was conceived by a wonderful guy, an
oceanographic geologist actually at the University of Washington named John
Delaney. It's great having oceanographers who actually look like the Old Man and the Sea. You know, Delaney sort of -- he has a sort of Irish brogue. He really
plays the role. But he's a remarkable visionary. He's seen this project through 15 years to funding, and it's going to be completely transformational for
oceanography.
So Azure Ocean has three components. There is a program called COVE, which
is a data visualization tool built by a graduate student of mine, Keith Grochow,
originally again inspired by Jim Gray.
There is Trident, which is a workflow management system built by Roger and his
team at Microsoft, and it is a general science workflow management system that
took as its driving examples oceanography and astronomy and a set of other
fields that Microsoft was actively working in. And then Azure is a data repository.
Okay. And let me make a comment about Trident for a sec because I think it's
particularly important.
There are focused workflow management tools for this scientific subdiscipline
and this scientific subdiscipline and this scientific subdiscipline, and they're
typically lashed together, sort of built up from the dirt by the people working in
those particular disciplines. So they're not very robust by and large, and they're
not very portable across disciplines.
What Roger and his team realized is that Microsoft had a phenomenal workflow
engine that had many, many, many person years of expert development put into
it. And the only problem was the user interface was a business person's user
interface. I'm sort of trivializing the work they did here. But fundamentally it was to put a scientist's graphical UI on Microsoft's extremely robust workflow engine
and to make sure that that UI was applicable across a broad range of scientific
disciplines.
Talk to Roger about this later, but it's a real contribution to the scientific
community by Microsoft.
So again, COVE is visualization, Trident is workflow management, Azure is the
cloud computing platform in which the data resides, and part of what we've done
is to make it possible for any component to be located anywhere. That is the
APIs between these three components are clean enough that you can run all of
them locally. At the other extreme you can run all of them except a thin client
remotely. You can distribute them however you want. And it's pretty clear that
no specific distribution, no one size fits all. All right. So it's a worthwhile
experiment.
Tom Daniel is another person who we've worked with a lot. Tom is a top
biologist at the University of Washington whose overall research effort is flight
control in insects. And as part of that, he's been understanding how muscles
work. And muscles work by adenosine diphosphate and adenosine triphosphate,
ADP and ATP from your high school health course, okay, causing these muscles
which are basically linear motors, okay, to slip -- let me get this right -- slip and
then bind, slip and then bind, slip and then bind. Okay. So that's sort of how
your muscles work as you extend or retract your arm.
And what he's done is to build simulation models of these. And Tom, again, had a big rack that his research funders bought for him, okay, a closely coupled rack of hundreds and hundreds of blades on which he did these simulations.
And he has now given his rack to somebody else who wasn't smart enough to
get with the program. And all of his computing takes place on Amazon Web
services, all right. And in the beginning it was just his grad students using their
credit cards for AWS cycles. Now you sort of get them for free, okay. But it's
absolutely transformational.
And it's an important message for the campus that this top biologist who had his
own computer realizes that it's more cost effective for him to get on the cloud, all
right. So this is the sort of exemplar project that we've got to have. Tom is very,
very influential. And he simply doesn't do any local computing anymore. All
right? So he's moved from his own data center to a cloud. He has really simple Python scripts that automate launching thousands of simultaneous experiments using the EC2 API.
So it's just sort of an embarrassingly parallel Monte Carlo simulation, all right? So the scripts manage this. Again, not rocket science but really useful.
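The pattern -- independent trials fanned out across as many workers as you can get -- can be mimicked locally in a few lines. This is a stand-in sketch, not Tom's actual scripts: a dart-throwing pi estimate replaces the muscle simulation, and a local worker pool stands in for EC2 instances.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def one_trial(seed):
    # One independent Monte Carlo experiment: estimate pi by dart-throwing.
    # In the real pipeline each trial would be a muscle simulation launched
    # on its own cloud instance; the structure is the same.
    rng = random.Random(seed)
    n = 100_000
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4.0 * hits / n

# The trials share nothing, so they can be fanned out to any number of
# workers -- the essence of "embarrassingly parallel".
with ThreadPoolExecutor(max_workers=4) as pool:
    estimates = list(pool.map(one_trial, range(8)))
mean = sum(estimates) / len(estimates)
print(round(mean, 3))
```

The point is that nothing in the driver changes whether there are 4 workers or 4,000 instances; only the pool behind `map` does.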
I'm going to spend a bit of time on this guy, John Rehr. He's a physicist who the National Science Foundation has been funding for the past year and a half to take his code, which is widely used -- it's called FEFF; I don't know what a Green's function is either, okay, but it's described here -- and move it onto the cloud.
So these are slides from a talk that John gave to a physics conference recently. You
know, it's -- so the idea -- the NSF grant was can anyone in physics use cloud
computing, or is this just for people selling books? All right. And, you know, so
he describes the disadvantages of the current approach which you all know, the
advantage of the cloud, which you all know. His strategy was to develop a set of Amazon virtual machine images for this thing, test single-instance performance, develop shell scripts that make it easy to run, and sort of turn it loose on his user community, which is not very sophisticated. You know, it's Linux.
So let me show you his slides. And this just shows how he proceeded. And it
won't be surprising to you. But it's important to realize that this is revelatory to
the physics science community.
So the first slide here just shows -- sorry, the red one, this is elapsed time, okay, is his system at UW running the code on a single processor, single threaded. And the other three are AWS instances. And the goal of this is simply
to show that you don't pay a price for virtualization. Right. So, you know, sort of
blocker number one from someone who doesn't want to try this is: I can't possibly run in a VM environment, it's going to be a dog. Okay? So wrong.
This is a different code. Gasoline. It shows the same thing, okay. No
virtualization penalty. That's his obvious organization. That's his interface. What
this shows is that his speedup on AWS, which gives you unlike Microsoft
systems no real control over the positioning of instances and things like that, his
speedup is greater than it was on his local cluster. All right? And so he just has
a set of simple scripts that make this available to his whole user community.
All right. So he's broadly proselytizing to the physics community the fact that many of their codes are embarrassingly parallel and lend themselves wonderfully to this cloud environment.
A couple more then I'm done. Andy Connolly is the person I mentioned before,
the astronomer. He's now doing a set of analyses in preparation for LSST. So
he's part of the LSST team, UW, University of Washington is one of the sort of
founding members of the consortium. And a lot of his work involves taking
different images of overlapping parts of the sky and sort of merging them. So it's
a natural sort of Map/Reduce calculation in which each mapper picks the
appropriate subset of one of the image planes and the reducer merges them.
Right? So all of this is being done in the cloud now.
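The shape of that computation can be sketched as a toy Map/Reduce in plain Python. The pixel dictionaries and the merge rule (a plain mean) are invented for illustration; real LSST coaddition is of course far more sophisticated, but the decomposition is the same.

```python
from collections import defaultdict

# Toy "image planes": each maps an (x, y) sky pixel to a flux value.
# The data and the merge rule are made up for illustration.
images = [
    {(0, 0): 1.0, (0, 1): 2.0},  # exposure 1
    {(0, 1): 4.0, (1, 1): 3.0},  # exposure 2, overlapping at (0, 1)
]

def mapper(image):
    # Map step: each mapper emits (pixel, flux) pairs for its image plane.
    for pixel, flux in image.items():
        yield pixel, flux

def reduce_pixels(pairs):
    # Reduce step: group measurements by sky pixel and merge them.
    groups = defaultdict(list)
    for pixel, flux in pairs:
        groups[pixel].append(flux)
    return {pixel: sum(vals) / len(vals) for pixel, vals in groups.items()}

pairs = [kv for image in images for kv in mapper(image)]
coadd = reduce_pixels(pairs)
print(coadd[(0, 1)])  # overlapping pixel: mean of 2.0 and 4.0 -> 3.0
```

Because each mapper touches only its own image plane and each reducer only one pixel's group, the whole job scales out across a cluster with no coordination beyond the shuffle.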
This is work that Bill Howe has done with Claudio Silva and others from the
University of Utah. Clever name: Horizon, Where the Ocean Meets the Cloud. And the idea here is to make interactive the exploration of a
set of ocean and environmental questions that involve the access to enormous
volumes of simulation data, right?
And the problem again is that previously for computational reasons you couldn't
ask these questions interactively. In fact, you would probably ask them by going
to a programmer. And now through cloud computing and reasonable interfaces,
you've got the ability for scientists to interactively ask questions and interactively
explore alternatives. And again, you understand the benefit of that.
Here's a final point that Bill makes. EC2 is Google Docs for developers. I meant
to say Azure is B plus for developers, I think. [laughter]. Okay?
The cloud is a phenomenal collaborative environment, okay? And here's Bill's
experience. He was working on a project under the aegis of the ocean
observatories initiative that was NSF funded and involved OOI, the University of Washington and OPeNDAP, right. And so the people at these three organizations -- Scripps and UCSD from the OOI point of view, Bill and collaborators from the UW point of view, and a set of folks at OPeNDAP -- had to do some development together. And there was a deadline on this for a set of reasons. And they horsed around for two weeks waiting for the folks at OOI to create credentials so they could all log into the same system. Okay? So what Bill did was to simply spin up an EC2 instance, and they were rolling in an hour and got the work done.
Okay. Similarly, Lee Hood's Institute for Systems Biology at the University of
Washington uses EC2 and S3 simply for the sharing of computational pipelines
with their collaborators. So it's a phenomenal collaboration environment. Pretty
rudimentary, but something we do deal with all the time, which is the need to
collaborate across institutions. And the fact that you're dealing with an incredibly laborious setup, for security reasons and for general I'm-too-busy-to-deal-with-your-problem reasons.
So here are the observations that summarize what we found -- and I'm definitely wrapping up here. Flat files and Excel spreadsheets are the most common data
management tools for scientists. So really data analysis, data management is
choking science as the volume of data grows exponentially.
Even great scientists are doing things that you absolutely wouldn't believe. And
again, the example -- the bad examples I gave you are not mediocre scientists,
they are people at the very, very top tier of their fields.
Simple tools can change these people's lives. There's really interesting
computer science to do. For example, Magda Balazinska, who is here, and her collaborators are working with David DeWitt and Mike Stonebraker on the
question of what is the right future database for large scale science? That's a
great computer science question. Some of this is much more mundane, but it's
absolutely transformational.
Many of these tools have broad applicability. So you know the spreadsheet to
SQLShare and SQL query interface that Bill built, the Condor-to-cloud interface
that Chance Reschke and the folks at David Baker's company built, those are
tools that can be very, very broadly used. So part of what we're trying to do as
part of the eScience Institute is harden those tools and make them valuable to a
broad range of scientists.
Workflow management is really important. And it makes sense to build on
commercial tools. Flexible client-cloud architectures are winners. There's no
one size fits all. I touched on that.
And finally, a huge proportion of interesting science is, or can be made,
embarrassingly parallel. And many HPC researchers can thrive in the cloud. I
think most of them don't believe it yet, but they're going to find out that it's true
with our help.
Okay. Lots of science apps -- Andy Connolly's is an example -- lend themselves to Map/Reduce or Dryad-style computation. And the cloud is Google Docs; it's a collaboration platform for science developers. Those are the conclusions I draw from this sort of first year of experience.
Now, this is a slide that's now 18 months old. This is from Werner Vogels at
Amazon.com. It shows the EC2 instance usage of a company called Animoto.
Animoto does something incredibly simple and dumb. You give them an audio
file and a set of JPEG images and they produce a syncopated slide show. And
like many computing startups these days, they don't have any computers, they
run on somebody's cloud. And for a number of days here, they were bopping
along at about 35 virtual machine instances. Suddenly tax day two years ago, all
right, they rolled out a Facebook app and they grew to 3500 instances. Okay.
This is not a problem you solve by calling the Dell saleswoman and saying I'd like to have another 3470 machines running on Tuesday, okay. Nor do you solve this by going to your VC and saying, you know, you should build me a 5,000 machine data center on the off chance that I'm successful. Okay?
What's important about this is that most of our science looks the same way,
okay? You're doing trials until a month before a paper deadline and then suddenly, cowabunga, and then you're back to here again. Hopefully that didn't happen to Animoto, right?
So the ability of Microsoft or Amazon to provide these incredibly elastic services
is what we as scientists need. Obviously the bandwidth for AWS is growing. The
number of instances is growing. This is a photo that Werner uses in his talks of
the tasting room at a brewery in Belgium, okay? And the tasting room is in the
building where they used to generate their own electricity, okay? And the big
brass thing in the middle is the apparatus that generated that electricity, right.
And the argument is that five years from now if you're running your own data
center, it's going to be about as appropriate as generating your own electricity.
That doesn't mean it isn't sometimes the right thing to do. But in general, you
should leave this to somebody who knows how to do it and does it at cost
effective scale.
Finally, in my remaining three minutes, my little ad for what we're doing in computer science. Obviously we're changing the world. 40 years ago -- at least three great things happened in 1969. Can you remember what they were? What are 3 things you remember from 1969? Man on the moon. Great. Okay. What
else?
>>: The Internet.
>> Ed Lazowska: The first Internet communication. What else?
>>: Mets won the World Series.
>> Ed Lazowska: Woodstock. Okay. Woodstock, man on the moon, the first
Internet communication. Does anybody remember what the first communication
over the Internet was? This is from Charley Kline, Len Kleinrock's programmer to
the folks at SRI.
>>: [inaudible].
>> Ed Lazowska: You got it, it was LO. The first two letters of login, and then it crashed. All right. So now with 40 years of hindsight, which had the greatest
impact? And you know, unless you're really big into Tang and Velcro, it's really
clear where the impact was. Of course, you know, nobody who was at
Woodstock remembers a damn thing, so it can't have been that, right?
[laughter]. And you know, the reason is we're hitched to exponentials. And other fields that benefit from exponentials, like biotechnology, benefit because of us, right; it's our CCD sensor arrays and it's computation that put those folks on an exponential over the past 30 years. This is an example
that's a little silly. Last year a set of people asked a bunch of people at the
business school at the University of Pennsylvania to identify the greatest impact
innovations in the past 30 years.
Why the past 30 years? I think if you go back further than that, you're competing
with things like the wheel and fire and it's a little hard to come out number one,
okay. So, you know, what do these business school people know? Not much.
But the good news it's somebody other than us, okay, talking about what we've
done. This is their list of 20 top innovations. And half of them are just hard core
computer science. And many of the others are half core or quarter core
computer science. You know, you've got a -- we get some credit for cell phones
and some credit for MRI and things like this.
So these folks at the business school at Penn, the Wharton School, are tasked to come up with the 20 highest impact innovations of the past 30 years, and the vast majority of them are what the people in this room do. You know, the
most recent 10 years: total transformations in search, scalability, digital media, mobility, e-commerce, the cloud, social networking, crowdsourced intelligence. And the cloud is a triumph of computing research, right? It wasn't more than a dozen years ago that you built scalable systems by building reliable hardware.
All right? And that's completely laughable now. What we do is build the data
centers out of the cheapest, junkiest stuff we can get. And software deals with
the failure right? That's transformational. It's the result of 25 years of research in
computer systems on how you do things like elect a new leader in a failure prone
environment where deceit is possible.
I wish I could find a photo of this, because I remember the photo but I can't,
and Amazon can't either. Twelve or thirteen years ago, Jeff Bezos had his
smiling face in a bunch of ads for AlphaServers. Okay. AlphaServers were big
SMPs from DEC and then Compaq and then HP, right? And here's a quote in an ad
from Jeff. He says: to support our rapid growth, we had to find a highly
upgradable and scalable Internet server. The AlphaServer platform provides the
upgrade path we need. Okay? So Amazon is now on the third or fourth generation
of its system architecture. But for the first generation of that thing, which
drove a bunch of their growth, they were scalability-limited by the largest
reliable multiprocessor that could be built. Right? And all they could do was
replicate the database across several of those and use a load balancer to send
different queries to different ones and somehow hope they didn't sell too many
copies of something, right?
And that's just a completely laughable way to build an Internet service these days.
And it's because of what we do.
The National Academy of Engineering identified 14 grand challenges for the
next couple of decades in engineering. The one we're talking about today is
engineering the tools of scientific discovery. That's what we do. But if you
look more broadly at these, again, a significant proportion of them are almost
entirely computer science. They might not be the ones you'd pick or I'd pick,
but, you know, even better, somebody else picked them. And many of the others
have a very significant computing research component, right? So, you know,
what are we doing? We put the smarts in everything smart these days. Homes,
cars, bodies, robots, data analysis, crowds and crowdsourcing. That's sort of
the business we're in.
So it's a fantastic time for this field. And that's the message I want to leave you
with. Thanks for your attention.
[applause].
>> Ed Lazowska: And I didn't get hit by Dennis's slide.
>> Dennis Gannon: So we have time for some questions. Yes. First one in the
back there.
>>: Christine Borgman, UCLA. Thanks, Ed.
>> Ed Lazowska: Hey, Chris.
>>: Great talk as always. I want to pose a multi-part problem back to you,
particularly as far as eScience and pulling this together. We've been
embedded in the Center for Embedded Networked Sensing for eight years now,
studying data practices, and we're working on astronomy and such too. And
we've found not only the diversity that you're seeing; we're also finding that
Excel is the lowest common denominator across all these different disciplines.
But there's a real risk of homogeneity. We have watched environmental
scientists go from experiments to observation systems, but we've also seen them
abandon observatory networks to go off to campaigns with sensor networks.
So it's got to be a hybrid system. It's not going to go one direction or the other.
I'd say the three key problems that we're coming up against would be, one, to
recognize the diversity between investigators and problems. Second is to find
mutually interesting problems between the computer scientists and the
scientists, because very often they get together and it's a trivial computer
science problem or a trivial biology problem; finding that intersection is hard.
Thirdly, once the problem starts to work, the computer scientists walk away
and go on to the next problem. The biologists don't trust it until it works
for a year in the field, by which time they've been left high and dry: they're
not getting their data, they're not getting their science done, and the
computer scientists aren't necessarily working with them to get the papers
out. So it's finding those problems that are mutually beneficial and staying
with them long enough, because otherwise you've got a lot of burnt-out
scientists who don't want to work with the computer scientists anymore.
>> Ed Lazowska: Right. So let me talk just for a minute about the final point
you made, Chris, because I think it's a really important one. There are a
couple of ways to address it. What we're trying to do in the eScience
Institute is we have a combination of faculty, research scientists, and staff
programmers. All right. And the goal is to achieve the transition and the
persistence through that pipeline of people. And these investigators have
their own programmers, right?
So again there's a sociological problem of winning them over. But the goal is to
stick with it long enough that you win over the tech staff that these people have.
Another important issue is using commercial tools in which people have
confidence, all right? So people are going to have confidence in Trident because
folks at Microsoft built it and it really is approaching bulletproof.
Here's another example. You know Debbie Estrin, and you know Gaetano
Borriello at the University of Washington. I had a student several years ago,
Tapan Parikh, who is now on the faculty at Berkeley, and Tapan was one of the
first people to realize that cell phones were the right data collection device
for the rural developing world, okay? They're small, they're light, they're
cheap, the battery lasts a long time, they can operate in disconnected mode,
and voice is a first-class object.
The problem was, because of the technology of the day, Tapan built his system
using, you know, MySQL on a PC as the back-end cloud database and Symbian
on the phone, which some vendor controlled. And what Gaetano did last year
was to rebuild this stuff using Android phones and the Google cloud, right?
And the uptake has been absolutely phenomenal, right, because people have
confidence in an open-source phone OS, because they can't be jerked around by
a vendor, okay, or a service provider. And they have confidence in the cloud.
All right? And so the uptake: dozens and dozens and dozens of international
projects have adopted this thing called Open Data Kit, right? Because they
believe in the platform. So the point is, a huge amount of it is having a
platform in which people have confidence.
There's no magic answer but I think there are things we can do. Yeah, please?
>>: Hi. In the beginning of your talk -- a great talk -- you said you were
going to state a fifth paradigm, and you never formalized it in a sentence.
>> Ed Lazowska: Well, what I meant to say -- I'm going to go back to the slide
with the URL on it, by the way, which is here, down at the bottom. You know,
the recent Microsoft book, which is phenomenal, is called The Fourth Paradigm,
where eScience is the fourth paradigm. I think of theory and experiment and
observation and simulation-oriented computational science and eScience as
being five. That's all I meant. I try to separate out theory and experiment
and observation because I think they're the sort of interlocking triumvirate
that have been the three legs of the stool forever.
Question over there?
>>: Yes, Bertrand Meyer from ETH.
>> Ed Lazowska: Hi.
>>: Thanks for a fascinating talk. There's a widespread suspicion that many of
the results coming out of these big scientific simulation programs are bogus,
in part because you have all these old Fortran programs that no one really
checks. So is there something in the cloud computing paradigm that could
affect this, hopefully for the better?
>> Ed Lazowska: There is something in the sensor paradigm that can affect this
for the better, okay? Because the sensor paradigm is measuring the world as it
is, all right, and putting that data in the cloud, all right? So first of all,
you're observing the world as it is, and you're studying these interfaces, as
I talked about with Ginger. And secondly, you've got a vastly greater amount
of data with which to compare your simulations, all right? So validation
becomes easier.
Finally, some of the oceanographers that Roger and Jim and I have worked with
have done very simple things that have stimulated a lot of science. For
example, at MBARI, the Monterey Bay Aquarium Research Institute, the guy who
was running their sort of eScience initiatives did the following very simple
thing at the behest of the Navy. He built a Web page that had four boxes on
its front. And what it showed was the results, for particular periods of time,
of the three major ocean simulation models -- for example, one is HOPS, the
Harvard ocean something system, okay? I forget what the other two are. And
measurement data for the same time period. All right? And having all this data
in one place caused the scientists to be faced visually with the fact that
their models disagreed with one another and none of them agreed with the
measured data. All right?
So I think, again, you can make progress. But the most important thing to me
is that this broad-based distribution of sensors in the environment changes
our knowledge of what's going on, which allows us to do a better job of seeing
whether the simulations are reasonably believable.
>> Dennis Gannon: At this point we're at our break time. But I want to thank Ed
once more for a fantastic talk.
>> Ed Lazowska: Thanks for your attention.
[applause]