>> Dennis Gannon: My name is Dennis Gannon, and I'm chair of the program committee for this thing. There are a number of other people who have been involved. There's Kent Foster over there, who has done a huge amount of work getting this thing together. And I'm pointing him out in case something calamitous happens, you know, talk to Kent. Juan Vargas is back there. He's been involved. Arkady is probably around here someplace. There he is. He's also been active in this. Plus we have the staff in the back. So we've got a good crew here, and we're all available to help you with anything you need to do.

Now, I need to give you some basic instructions. First of all, as you can see we have quite a good size crowd here, so if any of these plenary sessions are just full, there's an overflow room that the Microsoft employees can go to during the plenaries so that our guests can have the space in the live room. We also have our sessions set up so that this is the room where most of the tutorials and invited talks will take place, and the technical session talks are in neighboring rooms. And we have a very full schedule.

Now, one thing that's very important, because of the broadcast of the event, is that all speakers have to sign a special form, which I would be holding up if I had a copy of it, but I don't. It's a special form you have to sign and check two boxes. In addition, we need copies of your presentations so that the slide decks can go onto the website, and also so that when you are actually giving your presentation, our amazing audio visual people can have you live as well as turn to your slides as needed during the recording. So they will need your slide deck on their system prior to your presentation. In each of the rooms there will be one of these little memory sticks, and you can use that to transfer your slides if you don't have one, and get your deck installed into the system before you talk.

Now, we're going to be ruthless in the way we limit the time spent on these talks, because we have such a full agenda. We were terribly excited by the quality of the abstracts that were submitted. We had a tough time picking, and an even tougher time turning things away, because we wanted a lot of them. So we decided to pack the schedule full because there were so many good things. Unfortunately that means the schedule is very full, which means we don't have a lot of break time; in particular, between a couple of the afternoon sessions there is essentially no break, so if you're going from one to the other, you'll have to move quickly. So I will ask the session chairs to make sure that you end your session at least five minutes before the next one starts. That means we're going to keep track of time. We have these extremely subtle signs that we will use to let speakers know when they're supposed to wrap up. So, what else? Come to the podium with your deck on a thumb drive. Talks are limited to 20 minutes; I said that. Oh, yeah. On the floor, that's right, Kent is now demonstrating.
In case you haven't found it, if you need power for your laptop, there are outlets on the floor scattered around the place. So don't be shy about using Microsoft's electricity. In addition, over here, if you haven't noticed already, is the password and information for getting onto the visitor wireless. Let's see. Oh, yeah. Last item: move to the center of the rows so there's plenty of room. And I see there's no problem already; the center's full, so that's great, you're already doing that. Anything else? What have I forgotten? Nothing? We got it all? Good.

Okay. So I'm actually ahead of schedule, and I don't have any serious prepared remarks. But what I want to do next is to introduce our keynote. We have two keynotes today, one right away and another right after lunch. And our three keynote speakers are people I consider to be extraordinary. Until recently I was an academic computer scientist; I've been here for about two years. There are certain people in the US academic computer science community who truly stand out as leaders, and we've got three of, I would say, the top five here for this meeting.

And this morning we have one in particular whom I consider to be one of the real leaders of our discipline, from the academic side as well as in the engagement of the computer science community with the federal government and with the rest of the sciences. And that is Ed Lazowska. He is the Bill & Melinda Gates Chair of Computer Science and Engineering at the University of Washington. I've got about three pages of material about him, and I'm not going to go through it all. But just a few notes: Ed is a member of the National Academy of Engineering, a fellow of the American Academy of Arts and Sciences, a member of the Washington State Academy of Sciences, a fellow of the ACM, a fellow of the IEEE, and a fellow of the American Association for the Advancement of Science. He is one of the founding members of the board of the Computing Research Association -- at least I believe so; Ed, I think you've been there from the very beginning -- and he has been its chair. He is currently the chair of the Computing Community Consortium, whose objective is to expand the engagement of the computing research community in articulating and addressing the societal challenges of the 21st century. It's a really important job that he's doing there. And as I said before, I consider him, and always have, to be one of the outstanding leaders of our research community. Ed, why don't you go ahead and get started five minutes early. [applause].

>> Ed Lazowska: It's great to be here. And this is not a used talk, it's a new talk, so you're the dry run, and I hope you find it interesting. For the past couple of years, I've been directing something called the University of Washington eScience Institute. And eScience is not exactly equal to the cloud, but they're sort of married, joined at the hip. So I'm going to give you an overview of what we've been up to in a set of steps. I'm going to talk a little bit about eScience, and of course you folks are already well aware of most of that.
I'm going to talk about what we're trying to do at the UW eScience Institute, some examples of scientists whom we have moved in the direction of managing their data more intelligently and utilizing the cloud, and make a few observations about lessons we've learned. And then, in my role running the Computing Community Consortium, I'm obligated to include in every talk a plug for computer science. Of course, you don't need that either, but I hope it's something you'll bring back to your home institutions and talk to people about.

I have to say that all the work we've done has been not just inspired by but personally led by people at Microsoft: Jim Gray, more recently Roger Barga, and a set of others. And, you know, I consider eScience just yet another transformation of science by computer science. I also want to give a nod to Dan Reed. Dan and I, and also Dave Patterson, who's here or will be here, served on the President's Information Technology Advisory Committee for a couple of years, and Dan chaired the subcommittee of PITAC that in 2005 wrote yet another report on computation-enabled science. And there were a set of important things that Dan said. First of all, this was in 2005, now five-plus years ago, and there was a clear callout of the need to focus on data-driven science, because of the advent of sensors and of data that's more complex in various ways. Dan also has a very nice turn of phrase; for those of you who read his blogs, you see this all the time. He had this report card on computational science. He said a report card of national performance might record a grade of C minus, with an accompanying teacher's note that says this student has great potential but struggles to maintain focus and complete work on time. So I think that's our story in pushing this field forward.

So the way I think about this: Jim and Microsoft have talked about four paradigms. I think of there being a fifth, actually. You know, in the beginning there was theory and there was experiment and there was observation, and they're of course closely linked with one another. I'll come back to this slide in a minute; this is how oceanography is done even today, and we're trying to change that. That was complemented 30-plus years ago by simulation-oriented computational science. And what's happening more recently is we're making a transition to what I call eScience, which does not obviate simulation-oriented computational science; it simply shifts the focus from the cycles to the data, and the management and analysis and exploration of that data. And the Sloan Digital Sky Survey, in which Jim played a crucial role, was really the prototypical modern eScience project that turned the tide, because Jim was the first person to show that you could put large scale science data in commercial database systems. Now, of course, that was at a time when the volume of that data was exploding. And as you'll hear from other speakers later today, it's not obvious that scaled-up RDBMSs are the right solution for the long-term future, but nonetheless this was important, because typically science data has been stored in flat files with no metadata. I'll come back to this in a minute. And putting it, with metadata, into a commercial relational database system is transforming.
I want to mention that with the eScience Institute at UW, I think of science as being a pyramid, and to be honest, our view is that the folks at any institution at the very, very top of the pyramid are doing okay, but there are a huge number of phenomenal scientists one level down the pyramid who need guidance if they're going to continue to be successful. So I'll sometimes say things that don't apply to the very preeminent large scale science projects but apply much more broadly below that.

So, you know, eScience is driven by data, huge volumes of data from the advent of modern sensor networks. This is the Apache Point telescope that was used in the Sloan Digital Sky Survey. And remember that this survey took place over a period of seven years, transformed astronomy from sort of simple observations to survey astronomy, and really changed the field. What it produced was about 80 terabytes of raw image data over the space of seven years. And that was enormous in its day, truly enormous. But the new generation of science tools makes this look like nothing. For example, and I'll come back to this in a minute, the Large Synoptic Survey Telescope, LSST, which is a project that's in motion now through the National Science Foundation, will generate 40 terabytes a day. So every two days it generates seven years' worth of Sloan Digital Sky Survey data. All right? The telescope itself is located in Chile. There's 400 megabits per second of sustained bandwidth required between Chile and NCSA just to move the data, okay, and much higher peak rates. So it's literally a different order of magnitude of data volume.

But every other large scale science project is like this. The Large Hadron Collider generates 700 megabytes of data per second, okay, so 60 terabytes a day. These little Illumina gene sequencing machines that sit on desktops produce about a terabyte a day per machine, and big labs have between 25 and 100 of them. There's a lab at the University of Washington that soon will have 25. I have friends working at a project on the East Coast that will have a hundred of these machines. So in a single lab these are desktop machines that produce enormous amounts of image data. And it's necessary to keep that image data and analyze that image data. You can't discard it, as has been the pattern up until now, because you're discarding essentially the raw data from your experiments that you need to go back to. That's part of the lesson of the Sloan Digital Sky Survey: you're going to want to ask questions that you didn't think of at the time you wrote the proposal and wrote the Fortran program and defined the flat file format, right? And to do that, you need to keep the original data around.

I've been working at the University of Washington -- and Jim helped me for a number of years, and Roger Barga more recently -- on what's called the regional scale nodes of the NSF Ocean Observatories Initiative. This is under construction now; I'll say more about this project later. There are three elements to the Ocean Observatories Initiative, but the one that we're most interested in here involves deploying about 1,000 kilometers of fiberoptic cable on the sea floor of the Juan de Fuca Plate off Oregon and Washington and British Columbia and stringing it with thousands and thousands of chemical and physical and biological sensors that bring data back in realtime, all right?
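To get a feel for those numbers, the arithmetic is worth doing once. A quick back-of-the-envelope check in Python, using only the figures quoted above:

    # Back-of-the-envelope check of the data rates quoted above.
    SECONDS_PER_DAY = 24 * 60 * 60

    # SDSS: ~80 TB of raw image data over seven years.
    sdss_total_tb = 80.0

    # LSST: ~40 TB/day, so one SDSS's worth of data every two days.
    lsst_tb_per_day = 40.0
    print(sdss_total_tb / lsst_tb_per_day)   # -> 2.0 days

    # LHC: 700 MB/s sustained really is about 60 TB/day.
    lhc_tb_per_day = 700e6 * SECONDS_PER_DAY / 1e12
    print(round(lhc_tb_per_day, 1))          # -> 60.5 TB/day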
And the realtime analysis of that sensor data, and the democratization of making it available to scientists everywhere, and to school kids and teachers along with those scientists, is really transforming and part of what we have to achieve. Finally, of course, the Web is nothing but an enormous data source. At the University of Washington we've talked a lot about how this new form of data-oriented science affects the sociologists as much as it affects the physicists and chemists and engineers and oceanographers. If you're a sociologist and you're interested in studying the creation and evolution and dissolution and morphing of social cliques, you used to do that by getting together 30 psychology freshmen and paying them six bucks an hour to sit in focus groups at lunch time. And now you have four hundred million Facebook users whose data you can in principle analyze, right? So just a total transformation. Point-of-sale terminals, on and on.

eScience is obviously about the analysis of this data, the automated or semi-automated analysis of it. And it's not just a matter of data volume. Scientists are confronted with a number of changes, including the volume of data, the rate of data, and the complexity or dimensionality of data. At all stages of that scientific pyramid you have scientists facing data analysis challenges that they didn't face before. So that's really the challenge that we have to face as computer scientists and computational scientists: helping the broad array of scientists manage data they haven't been confronted with before. Obviously there's a spectrum of technologies, our technologies, that are utilized: sensors and sensor networks, large scale backbone networks, databases, data mining, machine learning, visualization, cluster computing or cloud computing at enormous scale. And eScience is really married to the cloud in a number of ways that I'll discuss.

This frazzle-haired guy is a former undergrad of ours who a few years ago now, actually in, gosh, 2007, used his Google 20 percent time to come up to the University of Washington and help me devise a course on sort of Google-scale data intensive computing. It's sort of a funny story. We needed a cluster to run on. There wasn't the availability of a Google or IBM or Microsoft or Yahoo! cluster at the time. And he couldn't get anyone at Google to approve course access to their systems or the purchase of a system for our students to use, so he simply found one on eBay, put it on his MasterCard, installed it, and then sent the bill in, and blessedly somebody was willing to sign off on reimbursement for him. All right? And this led to a really interesting curriculum that's evolved over the years in teaching undergrads how to do big-data-style computing. And today, whether you're using Google, or the Azure services platform, or Amazon Web Services, this is just fundamental to the scale-up that we all have to do. And I'll again come back to that in a sec.

The important thing from my point of view is that eScience is really going to be pervasive. And in my view -- again, there are a set of people who are offended by this comment -- it's very different from traditional simulation-oriented computational science, which on most campuses was three physicists, two astronomers, and a chemist, all right? That's of course a preposterous exaggeration, but it was a niche on the scale of science, all right?
And it's not that it wasn't important or wasn't transformational; it was, and it will continue to be. But as a university or a company or a research lab, you didn't have to excel at it in order to be competitive. And in my view, unless you excel at the tools of data intensive science and data exploration, you're going to be out of business over the next decade. So these capabilities have got to be broadly available across any institution, whether it's a corporation or a research lab or a university. And that's really the goal of the University of Washington eScience Institute.

So let me give a little history of this from the astronomy point of view. Again, we talked about the Sloan Digital Sky Survey and its data rates and data volumes over a period of seven years, and the incredible role that Jim Gray played in really setting an example and putting astronomy ahead of all other fields in terms of how it handles its data. The project plan for the Sloan Digital Sky Survey had it budgeted as a 16 million dollar project. It is literally the case that the software was to be written by astronomy faculty over their summers when they weren't teaching. Okay? This was to provide summer support for astronomy faculty by writing the code. We had a discussion just last week with Andy Connolly, who is a superb astronomer at the University of Washington who worked on this software as a graduate student, and he was sort of giving us the background story. And that was the plan. The plan was to use Objectivity as the data store. And Objectivity was going to be great because a big company, Motorola, had built it to use with the Iridium satellite project, which of course was going to be a great success. So what could possibly go wrong, right?

The project reality was, first of all, that it grew from 16 million to 80 million dollars. Thirty percent of those funds were spent on software -- 30 percent of the funds, which still is out of proportion to the way current science projects are budgeted. The OOI was originally planned essentially as SDSS had been planned 10 years before, that is, with no data management component to it. And that 30 percent does not count the monumental contribution by Jim Gray and his colleagues at Microsoft, which transformed the project. Again, a quote from Andy, who was in this from the beginning: if it weren't for Jim Gray's contributions, SDSS would have been more likely to yield 100 research papers than 5,000. Okay?

And there's an interesting aspect of sociology here. What caused the astronomers to put their data in a repository so it's accessible to everyone? What Jim said was, the great thing about working with astronomical data is everyone realizes it has absolutely no value, so you don't have any intellectual property problems. All right? Andy gets a little upset when you say that, but what he said was that the people involved in this project realized pretty early on that there were more papers to be written, more research to be done, than they could possibly do. And therefore, they weren't relinquishing any competitive advantage by publishing the data for everyone to use. All right?
So in past versions of projects like this, a small cohort of people who had built the instruments and had proprietary access to the data would have done the research, and it would have resulted in a tiny fraction of the research being accomplished that actually was accomplished by this project -- really, by the data from the project being put in a Microsoft commercial database system with metadata, so that a whole bunch of people could ask questions that hadn't necessarily been anticipated by the folks who wrote the proposal and carried out the project.

All right. So how did this come to be? This is an interesting social story as well. When Jim arrived at Microsoft, he was immediately set to work on two projects. I remember he came over to see me in his first couple of months with the company, and he was sort of chuckling over the fact that even though by that time he was a Turing Award winner and sort of the king of database management, he had been asked to do two things. One was to win the TPC benchmark using Microsoft database technology. I was one of the people who worked with Jim on a paper in the 1980s called A Measure of Transaction Processing Power, which yielded the TPC benchmark for comparing databases. And Microsoft wanted to win that with their database technology to prove that it was real, that it wasn't sort of a toy database. The second thing he was asked to do was to mount the world's largest Web-accessible database. And he was trying to think about what data people might like to utilize. He came upon the idea of satellite imagery and built something called TerraServer, which eventually brought him awards from the federal government -- sort of lifetime achievement awards from the Geological Survey, I believe it was, all right? So the idea was all of this satellite imagery; he got some of it from the US Government, and he got a lot of it from the Russians, who hoped they could develop a market for their satellite imagery, so there was a little thing on the TerraServer Web page where you could click to buy images. Anyway, that was sort of Jim's introduction to geodata and geobrowsers and streaming, and again it led to a huge number of innovations.

Another contribution Jim made was to realize that in computer science -- I don't know whether this is good or bad -- whenever you add another order of magnitude to some dimension of a problem, you typically stumble into new research, all right? We simply don't build our systems so that they scale beyond one or two orders of magnitude in one dimension, certainly not one order of magnitude in two different dimensions, and you sort of have to rethink things. So a huge amount of terrific work happened.

All right. Meanwhile, Alex Szalay, a phenomenal astronomer at Johns Hopkins, was in charge of building the data systems for the Sloan Digital Sky Survey. And we've already talked about how they started with Objectivity. The original plan was that the astronomers were going to do this in their summertime. Of course, like all faculty, the graduate students were actually doing it. And Connolly was saying the other day that it was very funny: the graduate students would meet at a conference and they would say, well, you know, who are you masquerading as, and who are you masquerading as? That is, which faculty member were they actually writing the code for while that faculty member was doing something else in the summer. So that was Alex's story. Now it gets even more bizarre.
Alex and Charles Simonyi, the longtime Microsoftie, had both grown up in Hungary. And their fathers had run the two preeminent Hungarian physics research institutions. And Alex and Charles had actually never met, but their mothers had been trying to get them together forever. I'm not making this up, okay -- maybe Jim was, but he told me this story, and it's just so phenomenal. Okay. So the mothers arranged that when Alex was coming out to Seattle for a conference, he would get together with Charles Simonyi, and these two sons of the preeminent Hungarian physicists would have lunch. All right? So at the lunch, you know, Alex says to Charles, what are you working on? Charles says, I guess, Office or something like that. Charles says to Alex, what are you working on? Alex says, oh, man, I got me a problem, okay, and describes the fact that the data is about to start arriving and it's going to fall on the floor. Charles says, you need to meet my friend Jim Gray. He understands data. So the next day they flew down to the Bay Area, and Jim and Alex hit it off. And that's how Jim got engaged in the Sloan Digital Sky Survey. It just shows how, through these sort of coincidental happenstances, absolutely great things can happen.

Okay. So now, we've talked about LSST as well, the sort of successor astronomical survey project. Why? Well, the image on the left -- and again this is from Andy Connolly -- is a patch of the sky and essentially what you can resolve using the Sloan Digital Sky Survey. And the image on the right -- you probably can't see a huge difference here, but on the Web you'll be able to see a great difference between these photographs -- is what you'll be able to resolve using LSST. So a phenomenal difference in resolving power between these two surveys. And again, survey astronomy in general is totally transformational. The data management system is widely distributed, all right? Master control is in the western United States. The archive site is at NCSA. The telescope is in Chile. As I mentioned, 400 megabits per second sustained continuous average bandwidth between Chile and NCSA just to get the data into the repository. So, an extremely distributed system.

Most of the computation that they do as part of the data pipeline is embarrassingly parallel. And I'll come back to that theme again and again and again. A lot of science is embarrassingly parallel, which lends itself to today's cloud, and a lot of scientists don't realize this. So again, something I'll refer to in a minute: at the University of Washington we found enormous resistance to moving to the cloud, by people saying, well, my application can't work on the cloud. These are folks working on local racks with high bandwidth interconnects. And I view this as analogous to the resistance, I don't know, 15, 20 years ago to moving from Cray vector machines, okay, to racks with closely coupled interconnects. For example, when Larry Smarr started getting SGI Origins at NCSA, there was a lot of resistance. And some proportion of the scientists who resisted that transition were right: their algorithms, their problems could only be solved on vector machines. But the vast majority of them, either directly or by a change in algorithm, were able to get on this far more cost effective, far more scalable architecture. And my intuition, and I imagine you all share it, is that the same is true today: there are a large number of people who say, nah, can't do it, right?
And some of them are going to be right. But I'm confident that the majority of them, either directly or with algorithm changes, which are admittedly painful, are going to be able to get on this next generation of far more scalable, far more cost effective computation. All right? And LSST lends itself wonderfully to that.

Okay. The project plan for LSST from the beginning was that 30 percent of the project budget -- and for this project, including preliminary grants, it's close to 400 million dollars; these things have grown and the dollar has shrunk -- is allocated to software. So they're taking this very, very seriously from the get-go on that project. Astronomy is way ahead of other fields. At a talk at the University of Washington last week, a computational astrophysicist who is a research scientist with the eScience Institute began by describing the data management tools, the data management API essentially, for computational astrophysics. And here's what he said: fopen, fread, fwrite, fclose, and secure copy. Right? Now, you know -- is this too far off? This is about it. All right? And you might think that the people doing this are yokels, but you'd be wrong. The people doing this are top scientists across the country and around the world. This is how data is managed, okay? So in the computational astrophysics work that Jeff and his faculty colleagues do, every simulation generates a sequence of snapshots. Every snapshot is a single flat file. And analysis is by C and Fortran programs. And that's how physics marches forward, okay?

So here's data management in biology. And this is an example that I'll come back to again, from an absolutely top tier environmental biologist at the University of Washington. And this is how she manages the data from her environmental sequencing, okay? It is -- I'm not making this up -- by doing manual joins on spreadsheets, okay? Bill Howe, who is in the audience here, has built a tool I'll describe later that helps them deal with this, okay? So these biologists get multiple spreadsheets out of their sequencing, and they do joins, you know, either on the screen or by printing listings, right? Mind boggling. These are absolutely top tier scientists, and this is how they do their work. A colleague of mine in the venture capital community here cited a study that claims that 90 percent of all business data is maintained in spreadsheets, okay? Despite the enormous proliferation of, you know, PeopleSoft and SAP and SQL Server, right? And I'm confident that 90 percent of all science data -- probably far more than that -- is in spreadsheets or flat files. All right? So that's a problem.

Now, scientists understand that this is a problem. One thing we did at the University of Washington a couple of years ago was a survey of 125 top investigators across all fields. And here's how we began. We identified about twice that number, about 250 or 300 top scientists and engineers, by looking in each field for who's getting significant grants in that field, who's winning awards at the younger level like Sloans and CAREER awards, who's winning awards at the senior level like National Academy membership and things like that, HHMI, right? Find out who the top people are. We, the folks in the eScience Institute, conducted one-hour interviews with 125 of these folks, and then these were tabulated. And what these people said was that the problem they were facing in the future, the big problem, was the management of their data. All right?
Flat files and Excel are in fact the most common data management tools, which is great for Microsoft but not really so good for science. And a typical science workflow we find across UW is a situation where the manual intervention in the science data workflow two years ago was taking half a grad student day per week, and now it's taking one FTE, right? Because the data complexity, either volume or sophistication, is up 10x in the past two years. And, you know, extrapolating, which is reasonable, in another two years it's going to be taking 10 FTE of manual intervention. So what you've got to have is tools to help these scientists investigate the data, because the data pipeline, the data workflow, is becoming a gating function to science.

So that's the goal of the eScience Institute. The motivating observations, and I've said all these things before: like simulation-oriented computational science, eScience, data-driven science, is going to be transformational. Unlike simulation-oriented science, it's going to be pervasive. Even more broadly than simulation-oriented science, it's going to use new techniques and new technologies from computer science -- so there's a great opportunity for a marriage between us and them that goes way beyond sort of compilers and operating systems and simulation techniques, to things like visualization. Cloud services are really essential. And from the point of view of a particular institution, if you're not a leader in this, you're going to get left behind. So that's the scare for us.

Let me back up, because there's one more thing I wanted to say about this study. If you go at a university to a provost or a vice president for research, what you're going to find is they don't believe this. All right? And they don't believe it -- see, a bunch of people nodding their heads, okay -- because when they want to know what sort of computing is going to be required by scientists five years down the road, the people they go to quite naturally are the people who self-identify as computational scientists. And those are the very important, very successful scientists who are doing simulation-oriented computational science. And what they need is more machine rooms and more racks and more power and more air conditioning and more networks to access supercomputer centers and resources like that. All right? And so in some sense -- obviously simulations are an enormously important source of data, and the data produced by simulations is growing and needs to be analyzed. But fundamentally the focus of these folks is on the cycles rather than the analysis of the data. Right? So in my view, provosts and vice presidents for research tend to get the wrong answer, because they ask too narrow a swath of people about what the future's going to be. And the scientists get it. If you ask a broad spectrum of top scientists what their challenges are, they'll talk to you about the data. And I think it's just really important to understand that management may not quite get it yet.

Okay. So the goal here is to try to position the University of Washington in a reasonable situation for eScience. And the strategy is, first of all, to bootstrap a cadre of research scientists who can lead the way. And these are typically people with doctorates in science fields or in computer science and a strong orientation towards the cloud and the management of data.
Secondly, and this is really important and what I'll talk about next, help leading faculty become exemplars and advocates. All right? This is to try and overcome the oh-no-this-won't-work-for-my-stuff resistance, all right? If they see great faculty who are renowned as leaders across the campus doing this, they'll start doing it too. And then broaden the impact by creating facilities to extend these capabilities across the campus and adding faculty in key fields. This was launched as an initiative by the Washington state legislature, with a bunch of subsequent grant funding, for example from Microsoft and from the Moore Foundation. And again, Microsoft has been absolutely instrumental in this. And the Moore Foundation, which looks at sort of advancing science across the board, is extremely interested in approaches to spreading data intensive science across the university community. So we have a large grant with Carnegie Mellon University focused on this sort of spreading the word.

Here's our technical staff. Dave Beck focuses on biology. Jeff Gardner focuses on physics, astronomy, the physical sciences. Bill Howe is our main data person. Chance Reschke -- I think of him as sort of the high performance computing person, which now also means the high performance cloud person. Erik Lundberg is doing outreach on data management for sort of the lower part of the pyramid.

So now I'm going to give you a set of examples. These are the top scientists at UW whom we're using as exemplars for what they've done. Ginger Armbrust is an absolutely top tier researcher in environmental metagenomics. Okay? Environmental metagenomics means that you're sequencing the broad stew of what's in a sample from the ocean, as opposed to a single organism, okay, so the data volumes are up enormously. And Ginger -- she's an oceanographer -- has built a bunch of technologies for continuous analysis rather than batch analysis. And so again, it's studying these microbial populations. The sorts of things you try to answer are, you know, who is there, what are they up to, and comparisons of data sets across, for example, near shore and deep ocean, or before and after the spring bloom, or across salinity or temperature boundaries, right, day and night. So you're just trying to understand what causes these changes in the ocean biological system. I'll spend more time on oceanography later, but the fact is that the oceans, since they cover 70 percent of the Earth's surface, are responsible for an enormous amount of the environment that we experience, and the future of that environment. And we know next to nothing about them. It's really shocking how little we know.

So this is Ginger's business. And here's how she does it. And again, I have to thank Bill Howe for these slides. The questions she wants to answer are in the lower left, okay? And the way it happens is you do environmental sampling from ships, or something like that. You pass those samples through a sequencer. As part of the sequencing process, you look those sequences up in a set of public databases, all right? And what that does is yield a phylogenetic analysis, in which you find out where these organisms fit in a phylogenetic tree, all right? Now, every step of that process produces data, and this data -- I'm not making it up -- goes into a bunch of spreadsheets. All right?
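To make concrete the kind of join these spreadsheets require -- and that, as you'll hear in a moment, Bill's tool lets scientists express directly in SQL -- here's a toy sketch. The table and column names are invented, and Python's built-in sqlite3 stands in for the real Web-based system:

    # Toy illustration of replacing a manual spreadsheet join with SQL.
    # Tables and columns are invented; this just shows the shape of it.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE reads (seq_id TEXT, sample TEXT, abundance INT)")
    db.execute("CREATE TABLE taxonomy (seq_id TEXT, organism TEXT)")
    db.executemany("INSERT INTO reads VALUES (?, ?, ?)",
                   [("s1", "near_shore", 120), ("s2", "deep_ocean", 40)])
    db.executemany("INSERT INTO taxonomy VALUES (?, ?)",
                   [("s1", "Thalassiosira"), ("s2", "Synechococcus")])

    # The join that used to be done by eye across two spreadsheets:
    for row in db.execute("""
            SELECT t.organism, r.sample, r.abundance
            FROM reads r JOIN taxonomy t ON r.seq_id = t.seq_id
            ORDER BY r.abundance DESC"""):
        print(row)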
And then what happens is manual analysis of those spreadsheets, and that manual analysis is like database joins, except done by hand on exponentially increasing amounts of data. All right? And again, Ginger is not a yokel. She is an utterly top tier, well funded biologist whose data volumes two years ago made this the most effective way for her to do her work. The change is that this no longer cuts it.

So what Bill has done is to build what are really some very simple tools that have been transforming for these folks. First, a tool that allows them to upload these spreadsheets into SQLShare, all right, without really any schema at all. And then a Web tool that allows the scientists to execute very, very simple SQL queries against the data that's been uploaded to the database. I'll show you that in a sec, but the interface that Bill and his colleagues have built has the ability for them to enter direct SQL commands, and also a bunch of their standard queries are available in sort of a click-and-drop-in fashion. And as with people who don't really know how to program but, for example, in an introductory course are given the code that makes a robot work, right, modifying the queries is something that's easily within the reach of all of these scientists, and so is writing queries from scratch once they see the set of examples, all right? So here's what they say: that took me a week with Excel; it takes me 15 minutes with this system; I can do science again. So again, there's no rocket science in here, but it's remarkably transforming for these scientists. Think of this as a screen shot of the Web interface, in which a set of standard queries are on the left. These are fairly sophisticated queries, not simple queries, okay? They run against this database that's been created in a schema-free way by uploading the spreadsheets, all right? So now you can do your joins and selects using SQL. That's the result.

Here's another one that's pretty interesting. David Baker is an absolutely world class biochemist. David is a 48-year-old member of the National Academy of Sciences. And David owns the code called Rosetta, which is the international standard for protein structure calculation and protein analysis. So Rosetta is his code, and it's extremely broadly used. Now, this is a really interesting story, and it starts with David on big racks of machines. He still owns something like 20 percent of the University of Washington's machine room space, all right. And on these racks some of the computations are tightly coupled and legitimately require this, but a large number of the computations are embarrassingly parallel. Because what he's trying to do is to find global minima, minimum energy states, in these proteins. And like any minimum energy calculation, you have to avoid getting caught in a local minimum. So what that means is you start at millions and millions of places and do local optimization from those places using his code. All right? And that's an embarrassingly parallel computation. David realized this just a few years ago and turned it into a screen saver using [inaudible], all right. And hundreds and hundreds of thousands of people are running David's screen saver. And then what happened was he built this little animation on the screen saver that would show you what it was doing. It was a very simple animation.
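That start-at-millions-of-places search is the classic multi-start minimization pattern, and it's worth seeing why it parallelizes so trivially: every start is independent of every other. A toy sketch of the pattern -- the one-dimensional "energy" function here is invented and is nothing like Rosetta's:

    # Multi-start local minimization: embarrassingly parallel because
    # every starting point is independent of every other.
    import math
    import random
    from multiprocessing import Pool

    def energy(x):
        # Invented stand-in for a protein energy landscape:
        # lots of local minima sitting on top of a broad bowl.
        return math.sin(5 * x) + 0.1 * (x - 2.0) ** 2

    def local_minimize(x, step=1e-3, iters=20000):
        # Crude gradient descent with a numerical derivative.
        for _ in range(iters):
            grad = (energy(x + 1e-6) - energy(x - 1e-6)) / 2e-6
            x -= step * grad
        return x

    def one_trial(seed):
        random.seed(seed)
        x = local_minimize(random.uniform(-10.0, 10.0))
        return energy(x), x

    if __name__ == "__main__":
        with Pool() as pool:  # each trial could as easily be a cloud node
            results = pool.map(one_trial, range(200))
        print("best minimum found:", min(results))

Scale the 200 trials to millions, hand each batch to a different machine, and you have the screen saver.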
And he started getting e-mail from people running his screen saver saying, your program is dumb as dirt; it's doing this when it should be doing that, all right. And David said, well, you know, if these guys are so smart, I'm going to turn this into a Web based video game so that people can actually help with this. So he worked with Zoran Popovic and David Salesin and a set of grad students in our department, Adrien Treuille and several others, to build a Web based video game called Fold It, okay. And Fold It now has 100,000 people playing it, all right. It's a very cool Web based video game. If you go back to Luis von Ahn at Carnegie Mellon, who sort of launched this human computation craze -- gave it visibility a few years ago with his thesis -- Luis's observation was that people are willing to do almost anything in return for points, right. [laughter]. And Luis, you know, used to begin his talks with the observation that he had somehow gotten a statistic for the number of person-hours spent per day worldwide playing solitaire. And it turns out -- I forget the exact number, but roughly -- nine days of the world playing solitaire is the number of man-hours that were required to build the Panama Canal, right? So Luis's view was, suppose some fraction of this could be turned to useful work. That's what David and Zoran and their students have done for science.

Now, the most interesting thing about this is that the people who are great at this game are not PhD-level biochemists. In fact, I've heard David say that when he starts playing the game -- there's of course a chat room associated with it -- people will start saying stuff like, who's this Baker guy? He's not very good at this, right? He's competing against the people who are online. A year ago, they flew a 13-year-old out to Seattle for three days, a kid who was sort of the idiot savant of this folding game, all right? And his parents came out with him because they wouldn't let the kid travel alone. And they watched this guy play for a couple of days. I should say that the game is instrumented. Their hope is to extract algorithmic principles from how great players play the game that they can embody in the program. But they couldn't figure out how this 13-year-old was doing so well, so they flew him out here and watched him for a couple of days while his parents toured around. He unfortunately is no longer playing the game. As of last Christmas, I think he discovered girls. [laughter]. So he's not playing.

The current guy who is doing extremely well is a 50-year-old from Dallas, Texas, whose name -- this is from the blog associated with the game -- is Boots McGraw. I'm not making this up, okay. Boots says here, I'm a redneck from Texas, but I was in grad school at the State University of New York in Buffalo, and so on. So Boots is doing unbelievably well at this game. And Boots just won a prize for having the first protein that was actually synthesized, okay. There are many goals of this project, but a goal is to produce novel enzyme catalysts, for example, okay, that catalyze new reactions and that don't exist in nature. So you design one, and then the question is, can it be built, can it be synthesized, okay? And Boots did the first one that could be synthesized, and the company that did the synthesis for them prepared this trophy, which they awarded to Boots. It's a sort of plastic model of the protein that he designed and they synthesized. And there's an interview with Boots now on the website that just showed up two days ago.
I like this. He says, you know, I'm real happy to get this. I'll put it in my office. Now when my co-workers once again ask why I don't want to play FarmVille with them, I can show them the model. [laughter]. All right. So having people cure cancer by finding minimum energy states of proteins is a whole lot better, from a societal good point of view, than yet another guy playing FarmVille, all right. So it's pretty remarkable.

Now, David has a spinoff company called Arzeda. And what they're doing is specifically enzyme catalysts for energy applications. Right? And Arzeda has moved all of their work into the Amazon and Microsoft clouds, right? And they've done it using essentially a Condor interface to the cloud, all right. So again, a very embarrassingly parallel, hugely scalable computation. And this from somebody who began doing all of his computing on closely coupled racks taking up 20 percent of our university's machine room space. Okay? So Arzeda is in the Amazon and Microsoft clouds. That's how they do their computing. Again, these slides are on the Web; there's a URL on the first page of this.

Okay. So now I'm going to talk about the oceanography project. And this is work that's been really joined at the hip with Microsoft. I can't thank Roger and his colleagues enough for the help they've given us. This is a project we call Azure Ocean, and it has three components. But the big issue here, again, is that there are all sorts of biological and chemical and physical processes in the ocean that we have to understand, and oceanography is traditionally done like this: you go to sea in a ship, you stick your head in the water where you happen to be, and you measure temperature and salinity and pressure. And at the end of the cruise, the chief scientist who's written this stuff down -- you know, originally in a notebook, then on tape, now on disk -- takes that home with him or her, and that's the end of it. All right. Now, you know, the ships are bigger. UW and Scripps and Woods Hole and Hawaii now have 250-foot sort of diesel-electric vessels. But the science is done in much the same way. It's expeditionary. And the goal of the Ocean Observatories Initiative is to transition oceanography from expeditionary to observatory-based science and dramatically deepen our understanding of these ocean processes, so we can figure out how to manage the oceans and their impact on our environment. So that's the pitch.

This project was conceived by a wonderful guy, an oceanographic geologist actually, at the University of Washington named John Delaney. It's great having oceanographers who actually look like the Old Man and the Sea. You know, Delaney has a sort of Irish brogue; he really plays the role. But he's a remarkable visionary. He's seen this project through 15 years to funding, and it's going to be completely transformational for oceanography.

So Azure Ocean has three components. There is a program called COVE, which is a data visualization tool built by a graduate student of mine, Keith Grochow, originally again inspired by Jim Gray. There is Trident, which is a workflow management system built by Roger and his team at Microsoft; it is a general science workflow management system that took as its driving examples oceanography and astronomy and a set of other fields that Microsoft was actively working in. And then Azure is the data repository. Okay. And let me make a comment about Trident for a sec, because I think it's particularly important.
There are focused workflow management tools for this scientific subdiscipline and this one and this one, and they're typically lashed together, sort of built up from the dirt, by the people working in those particular disciplines. So they're not very robust, by and large, and they're not very portable across disciplines. What Roger and his team realized is that Microsoft had a phenomenal workflow engine that had many, many, many person-years of expert development put into it. And the only problem was that the user interface was a business person's user interface. I'm sort of trivializing the work they did here, but fundamentally it was to put a scientist's graphical UI on Microsoft's extremely robust workflow engine and to make sure that that UI was applicable across a broad range of scientific disciplines. Talk to Roger about this later, but it's a real contribution to the scientific community by Microsoft. So again, COVE is visualization, Trident is workflow management, and Azure is the cloud computing platform in which the data resides. Part of what we've done is to make it possible for any component to be located anywhere. That is, the APIs between these three components are clean enough that you can run all of them locally; at the other extreme, you can run all of them except a thin client remotely. You can distribute them however you want. And it's pretty clear that no specific distribution, no one size, fits all. So it's a worthwhile experiment.

Tom Daniel is another person we've worked with a lot. Tom is a top biologist at the University of Washington whose overall research effort is flight control in insects. And as part of that, he's been understanding how muscles work. Muscles work by adenosine diphosphate and adenosine triphosphate, ADP and ATP from your high school health course, okay, causing these muscles, which are basically linear motors, okay, to -- let me get this right -- slip and then bind, slip and then bind, slip and then bind. Okay. So that's sort of how your muscles work as you extend or retract your arm. And what he's done is to build simulation models of these. And Tom, again, had a big rack that his research funders bought for him, okay, a closely coupled rack of hundreds and hundreds of blades on which he did these simulations. And he has now given his rack to somebody else who wasn't smart enough to get with the program, and all of his computing takes place on Amazon Web Services, all right. In the beginning it was just his grad students using their credit cards for AWS cycles; now you sort of get them for free, okay. But it's absolutely transformational. And it's an important message for the campus that this top biologist who had his own computer realizes that it's more cost effective for him to get on the cloud, all right. So this is the sort of exemplar project that we've got to have. Tom is very, very influential. And he simply doesn't do any local computing anymore. He's moved from his own data center to the cloud. He has really simple Python scripts that automate launching thousands of simultaneous experiments using the EC2 API. So it's just sort of an embarrassingly parallel Monte Carlo simulation, all right? The scripts manage this. Again, not rocket science, but really useful. I'm going to spend a bit of time on this guy, John Rehr.
He's a physicist whom the National Science Foundation has been funding for the past year and a half to see if his code, which is widely used -- it's called FEFF; I don't know what a Green's function is either, okay, but it's described here -- can move onto the cloud. These are slides from a talk that John gave to a physics conference recently. The NSF grant was essentially: can anyone in physics use cloud computing, or is this just for people selling books? All right. And, you know, he describes the disadvantages of the current approach, which you all know, and the advantages of the cloud, which you all know. His strategy was to develop a set of Amazon virtual machine images for this thing, test single-instance performance, develop shell scripts that make it easy to run, and sort of turn it loose on his user community, which is not very sophisticated. You know, it's Linux.

So let me show you his slides. This just shows how he proceeded, and it won't be surprising to you, but it's important to realize that this is revelatory to the physics community. The first one here: the red bar, this is elapsed time, okay, is his system at UW running the code on a single processor, single threaded. And the other three are AWS instances. And the goal of this is simply to show that you don't pay a price for virtualization. Right? Sort of blocker number one from someone who doesn't want to try this is: I can't possibly run in a VM environment; it's going to be a dog. Okay? So, wrong. This is a different code, Gasoline. It shows the same thing, okay: no virtualization penalty. That's his overall organization. That's his interface. What this shows is that his speedup on AWS, which, unlike Microsoft's systems, gives you no real control over the positioning of instances and things like that, is greater than it was on his local cluster. All right? And so he just has a set of simple scripts that make this available to his whole user community. So he's broadly proselytizing to the physics community the fact that many of their codes are embarrassingly parallel and lend themselves wonderfully to this cloud environment.

A couple more and then I'm done. Andy Connolly is the person I mentioned before, the astronomer. He's now doing a set of analyses in preparation for LSST. He's part of the LSST team; the University of Washington is one of the sort of founding members of the consortium. And a lot of his work involves taking different images of overlapping parts of the sky and sort of merging them. So it's a natural sort of Map/Reduce calculation, in which each mapper picks the appropriate subset of one of the image planes and the reducer merges them. Right? So all of this is being done in the cloud now.

This is work that Bill Howe has done with Claudio Silva and others from the University of Utah. The idea here -- clever name, Horizon: Where the Ocean Meets the Cloud -- is to make interactive the exploration of a set of ocean and environmental questions that involve access to enormous volumes of simulation data, right? And the problem again is that previously, for computational reasons, you couldn't ask these questions interactively. In fact, you would probably ask them by going to a programmer. And now, through cloud computing and reasonable interfaces, you've got the ability for scientists to interactively ask questions and interactively explore alternatives.
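Going back to Andy's merging computation for a second: stripped to its essentials, it's one map step and one reduce step. Mappers emit (pixel, value) pairs from their own image plane, and the reducer combines, say by averaging, the values that land on the same pixel. A toy sketch of that structure, with invented two-exposure data standing in for the real pipeline:

    # Toy map/reduce for merging overlapping sky images: mappers emit
    # (pixel, value) pairs; the reducer averages values per pixel.
    from collections import defaultdict

    images = [  # invented stand-ins for two overlapping exposures
        {(0, 0): 10.0, (0, 1): 12.0},
        {(0, 1): 14.0, (1, 1): 9.0},
    ]

    def mapper(image):
        for pixel, value in image.items():
            yield pixel, value

    def reducer(pairs):
        groups = defaultdict(list)
        for pixel, value in pairs:
            groups[pixel].append(value)
        return {p: sum(vs) / len(vs) for p, vs in groups.items()}

    pairs = (kv for img in images for kv in mapper(img))
    print(reducer(pairs))  # {(0, 0): 10.0, (0, 1): 13.0, (1, 1): 9.0}

Because each mapper touches only its own image and the reduce is a per-pixel combine, a framework like Hadoop or Dryad can spread this across as many machines as there are exposures.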
And again, you understand the benefit of that kind of interactivity. Here's a final point that Bill makes: EC2 is Google Docs for developers. I meant to say Azure is Google Docs for developers, I think. [laughter]. Okay? The cloud is a phenomenal collaborative environment, okay? And here's Bill's experience. He was working on a project under the aegis of the Ocean Observatories Initiative that was NSF funded and involved OOI, the University of Washington, and OPeNDAP, right. And the people at these three organizations -- Scripps and UCSD from the OOI point of view, Bill and collaborators from the UW point of view, and a set of folks at OPeNDAP -- had to do some development together. And there was a deadline on this for a set of reasons. And they horsed around for two weeks waiting for the folks at OOI to create credentials so they could all log into the same system. Okay? So what Bill did was to simply spin up an EC2 instance, and they were rolling in an hour and got the work done. Okay. Similarly, Lee Hood's Institute for Systems Biology here in Seattle uses EC2 and S3 simply for sharing computational pipelines with their collaborators. So it's a phenomenal collaboration environment. Pretty rudimentary, but something we deal with all the time: the need to collaborate across institutions, and the fact that you're otherwise dealing with an incredibly laborious setup, for security reasons and for general I'm-too-busy-to-deal-with-your-problem reasons.

So here are the observations that summarize what we found -- and I'm definitely wrapping up here. Flat files and Excel spreadsheets are the most common data management tools for scientists; so really, data analysis and data management are choking science as the volume of data grows exponentially. Even great scientists are doing things that you absolutely wouldn't believe. And again, the bad examples I gave you are not mediocre scientists; they are people at the very, very top tier of their fields. Simple tools can change these people's lives. There's really interesting computer science to do. For example, Magda Balazinska, who is here, and her collaborators are working with Dave DeWitt and Mike Stonebraker on the question of what is the right future database for large scale science. That's a great computer science question. Some of this is much more mundane, but it's absolutely transformational. Many of these tools have broad applicability. So, you know, the spreadsheet-to-SQLShare and SQL query interface that Bill built, the Condor-to-cloud interface that Chance Reschke and the folks at David Baker's company built -- those are tools that can be very, very broadly used. So part of what we're trying to do at the eScience Institute is harden those tools and make them valuable to a broad range of scientists. Workflow management is really important, and it makes sense to build on commercial tools. Flexible client-cloud architectures are winners; there's no one size fits all -- I touched on that. And finally, a huge proportion of interesting science is, or can be made, embarrassingly parallel, and many HPC researchers can thrive in the cloud. I think most of them don't believe it yet, but they're going to find out that it's true, with our help. Lots of science apps -- Andy Connolly's is an example -- lend themselves to Map/Reduce or Dryad-style computation. And the cloud is Google Docs: it's a collaboration platform for science developers. Those are the conclusions I draw from this sort of first year of experience.
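And to close the loop on the simple-Python-scripts-against-the-EC2-API theme from Tom Daniel's story: the whole elastic pattern is a handful of calls. A sketch using the boto library of that era; the AMI ID, region, and counts are placeholders, and a real script would also poll instance state and collect results from S3:

    # Sketch of elastic scale-out against the EC2 API via boto (2.x era).
    # The AMI ID, region, and instance counts are placeholders.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")  # credentials from env

    N_TRIALS = 100  # one embarrassingly parallel run per instance
    reservation = conn.run_instances(
        "ami-00000000",                              # placeholder image
        min_count=N_TRIALS,
        max_count=N_TRIALS,
        instance_type="m1.small",
        user_data=open("run_simulation.sh").read(),  # boot script starts the job
    )

    for instance in reservation.instances:
        print(instance.id, instance.state)

    # When the runs finish, tear everything down: elasticity means paying
    # for 100 machines for an hour, not one machine for 100 hours.
    conn.terminate_instances([i.id for i in reservation.instances])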
Now, this is a slide that's now 18 months old. This is from Werner Vogels at Amazon.com. It shows the EC2 instance usage of a company called Animoto. Animoto does something incredibly simple: you give them an audio file and a set of JPEG images, and they produce a syncopated slide show. And like many computing startups these days, they don't own any computers; they run on somebody's cloud. And for a number of days here, they were bopping along at about 35 virtual machine instances. Then suddenly, tax day two years ago, all right, they rolled out a Facebook app and they grew to 3,500 instances. Okay. This is not a problem you solve by calling the Dell saleswoman and saying, "I'd like to have another 3,470 machines running on Tuesday," okay? Nor do you solve it by going to your VC and saying, "You know, you should build me a 5,000-machine data center on the off chance that I'm successful." Okay? What's important about this is that most of our science looks the same way, okay? You're doing small trials until a month before a paper deadline, and then suddenly, cowabunga, and then you're back down here again. Hopefully that didn't happen to Animoto, right? So the ability of Microsoft or Amazon to provide these incredibly elastic services is what we as scientists need. Obviously the bandwidth for AWS is growing. The number of instances is growing. This is a photo that Werner uses in his talks of the tasting room at a brewery in Belgium, okay? The tasting room is in the building where they used to generate their own electricity, okay? And the big brass thing in the middle is the apparatus that generated that electricity, right? And the argument is that five years from now, if you're running your own data center, it's going to be about as appropriate as generating your own electricity. That doesn't mean it isn't sometimes the right thing to do. But in general, you should leave this to somebody who knows how to do it and does it at cost-effective scale. Finally, in my remaining three minutes, my little ad for what we're doing in computer science. Obviously we're changing the world. 40 years ago, at least three great things happened, in 1969. Can you remember what they were? What are three things you remember from 1969? Man on the moon. Great. Okay. What else? >>: The Internet. >> Ed Lazowska: The first Internet communication. What else? >>: The Mets won the World Series. >> Ed Lazowska: Woodstock. Okay. Woodstock, man on the moon, the first Internet communication. Does anybody remember what the first communication over the Internet was? This was from Charley Kline, Len Kleinrock's programmer, to the folks at SRI. >>: [inaudible]. >> Ed Lazowska: You got it, it was "LO", the first two letters of "login", and then it crashed. All right. So now, with 40 years of hindsight, which had the greatest impact? And you know, unless you're really big into Tang and Velcro, it's really clear where the impact was. Of course, nobody who was at Woodstock remembers a damn thing, so it can't have been that, right? [laughter]. And the reason is that we're hitched to exponentials. And the other fields that benefit from exponentials, like biotechnology, benefit because of us, right? It's our CCD sensor arrays and it's our computation that put those folks on an exponential, right? The past 30 years. This is an example that's a little silly. Last year, a bunch of people at the business school at the University of Pennsylvania were asked to identify the greatest-impact innovations of the past 30 years. Why the past 30 years?
I think if you go back further than that, you're competing with things like the wheel and fire, and it's a little hard to come out number one, okay? So, you know, what do these business school people know? Not much. But the good news is that it's somebody other than us, okay, talking about what we've done. This is their list of the 20 top innovations. And half of them are just hard-core computer science. And many of the others are half-core or quarter-core computer science. You know, we get some credit for cell phones and some credit for MRI and things like this. So these folks at the business school at Penn, at the Wharton School, are tasked to come up with the 20 highest-impact innovations of the past 30 years, and the vast majority of them are what the people in this room do. You know, the most recent 10 years: total transformations in search, scalability, digital media, mobility, e-commerce, the cloud, social networking, crowd-sourced intelligence. And the cloud is a triumph of computing research, right? It wasn't more than a dozen years ago that you built scalable systems by building reliable hardware. All right? And that's completely laughable now. What we do is build the data centers out of the cheapest, junkiest stuff we can get, and software deals with the failures, right? That's transformational. It's the result of 25 years of research in computer systems on how you do things like elect a new leader in a failure-prone environment where deceit is possible. I wish I could find a photo of this, because I remember the photo, but I can't, and Amazon can't either. Twelve or thirteen years ago, Jeff Bezos had his smiling face in a bunch of ads for AlphaServers. Okay? AlphaServers were big SMPs from DEC, and then Compaq, and then HP, right? And here's a quote from Jeff in one of those ads. He says: "To support our rapid growth, we had to find a highly upgradable and scalable Internet server. The AlphaServer platform provides the upgrade path we need." Okay? So Amazon is now on the third or fourth generation of its system architecture. But for the first generation of that thing, which drove a bunch of their growth, they were scalability-limited by the largest reliable multiprocessor that could be built. Right? And all they could do was replicate the database across several of those and use a load balancer to send different queries to different ones, and somehow hope they didn't sell too many copies of something, right? And that's just a completely laughable way to build an Internet service these days. And it's because of what we do. The National Academy of Engineering identified 14 grand challenges in engineering for the next couple of decades. The one we're talking about today is engineering the tools of scientific discovery. That's what we do. But if you look more broadly at these, again, a significant proportion of them are almost entirely computer science. They might not be the ones you'd pick or I'd pick, but, you know, even better, somebody else picked them. And many of the others have a very significant computing research component, right? So, you know, what are we doing? We put the smarts in everything smart these days: homes, cars, bodies, robots, data analysis, crowds and crowd-sourcing. That's the business we're in. So it's a fantastic time for this field. And that's the message I want to leave you with. Thanks for your attention. [applause]. >> Ed Lazowska: And I didn't get hit by Dennis's slide. >> Dennis Gannon: So we have time for some questions. Yes. First one in the back there.
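A minimal sketch of the replicate-and-load-balance architecture Ed describes above: the whole catalog copied onto a few big servers, with a balancer rotating read queries among them. The class, the round-robin policy, and the bookstore example are illustrative assumptions, not Amazon's actual design.

# Sketch of the first-generation architecture: full database replicas plus a
# round-robin dispatcher for reads. Names and policy are invented.
import itertools

class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas                 # each replica: a full copy
        self._next = itertools.cycle(replicas)   # round-robin dispatcher

    def query(self, key):
        """Reads can go to any replica; the balancer just picks the next one."""
        return next(self._next).get(key)

    def update(self, key, value):
        """Writes must hit every replica, which is exactly what limits scale."""
        for replica in self.replicas:
            replica[key] = value

store = ReplicatedStore([{}, {}, {}])
store.update("book-1234", {"title": "Snow Crash", "stock": 2})
print(store.query("book-1234"))   # served by whichever replica is next in line

The write path is the tell: every update touches every copy, so the biggest machine you can buy, not your software, sets the ceiling, and "hope you don't sell too many copies" is the inventory race that scheme invites.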
>>: Christine Borgman, UCLA. Thanks, Ed. >> Ed Lazowska: Hey, Chris. >>: Great talk, as always. I want to pose a multi-part problem back to you, particularly as far as the eScience Institute, in pulling this together. We've been embedded in the Center for Embedded Networked Sensing for eight years now, studying data practices, and we're working on astronomy and such too. And we've found not only the diversity that you're seeing; we're also finding that Excel is the lowest common denominator across all these different disciplines. But there's a real risk of homogeneity. We have watched environmental scientists go from experiments to observation systems, but we've also seen them abandon observatory networks to go off on campaigns with sensor networks. So it's got to be a hybrid system; it's not going to go one direction or the other. I'd say the three key problems that we're coming up against would be, one, to recognize the diversity among investigators and problems. Second is to find mutually interesting problems between the computer scientists and the scientists, because very often they get together and it's absolutely a trivial computer science problem or it's a trivial biology problem; finding that intersection is hard. Thirdly, once the problem starts to work, the computer scientists walk away and go on to the next problem. The biologists don't trust it until it works for a year in the field, by which time they've been left high and dry: they're not getting their data, they're not getting their science done, and the computer scientists aren't necessarily working with them to get the papers out. So it's finding those problems that are mutually beneficial and staying with them long enough, because otherwise you've got a lot of burnt-out scientists who don't want to work with the computer scientists anymore. >> Ed Lazowska: Right. So let me talk just for a minute about the final point you made, Chris, because I think it's a really important one, and about a couple of ways to address it. What we're trying to do in the eScience Institute is have a combination of faculty, research scientists, and staff programmers. All right. And the goal is to achieve the transition and the persistence through that pipeline of people. And these investigators have their own programmers, right? So again, there's a sociological problem of winning them over. But the goal is to stick with it long enough that you win over the technical staff that these people have. Another important issue is using commercial tools in which people have confidence, all right? So people are going to have confidence in Trident because folks at Microsoft built it and it really is approaching bulletproof. Here's another example, through Debbie Estrin, and you know Gaetano Borriello at the University of Washington. I had a student several years ago, Tapan Parikh, who is now on the faculty at Berkeley, and Tapan was one of the first people to realize that cell phones were the right data collection device for the rural developing world, okay? They're small, they're light, they're cheap, the battery lasts a long time, they can operate in disconnected mode, and voice is a first-class object. The problem was that, because of the technology of the day, Tapan built his system using MySQL on a PC as the back-end "cloud" database and Symbian, which some vendor controlled, on the phone. And what Gaetano did last year was rebuild this stuff using Android phones and the Google cloud, right?
And the uptake has been absolutely phenomenal, right, because people have confidence in an open-source phone OS, because they can't be jerked around by a vendor, okay, or a service provider. And they have confidence in the cloud. All right? And so the uptake: dozens and dozens and dozens of international projects have adopted this thing called Open Data Kit, right? Because they believe in the platform. So the point is that a huge amount of it is having a platform in which people have confidence. There's no magic answer, but I think there are things we can do. Yeah, please? >>: Hi. In the beginning of your talk, a great talk, you said you were going to state a fifth paradigm, and you never formalized it in a sentence. >> Ed Lazowska: Well, what I meant to say, and I'm going to go back to the slide with the URL on it, by the way, which is here, down at the bottom. The recent Microsoft book, which is phenomenal, is called The Fourth Paradigm, where eScience is the fourth paradigm. I think of theory and experiment and observation and simulation-oriented computational science and eScience as being five. That's all I meant. I try to separate out theory and experiment and observation because I think they're the interlocking triumvirate that have been the three legs of the stool forever. Question over there? >>: Yes, Bertrand Meyer from ETH. >> Ed Lazowska: Hi. >>: Thanks for a fascinating talk. There's a widespread suspicion that many of the results coming out of these big scientific simulation programs are bogus, in part because you have all these old Fortran programs that no one really checks. So is there something in the cloud computing paradigm that could affect this, hopefully for the better? >> Ed Lazowska: There is something in the sensor paradigm that can affect this for the better, okay? Because the sensor paradigm is measuring the world as it is, all right, and putting that data in the cloud, all right? So, first of all, you're observing the world as it is, and you're studying those interfaces, as I talked about with Ginger. And secondly, you've got a vastly greater amount of data with which to compare your simulations, all right? So validation becomes easier. Finally, some of the oceanographers that Roger and Jim and I have worked with have done very simple things that have stimulated a lot of science. For example, at MBARI, the Monterey Bay Aquarium Research Institute, the guy who was running their eScience initiatives did the following very simple thing at the behest of the Navy. He built a Web page whose front page had four boxes. What it showed was the results, for particular periods of time, of the three major ocean simulation models (for example, one is HOPS, the Harvard Ocean something System, okay? I forget what the other two are) and the measurement data for the same time period. All right? And having all this data in one place caused the scientists to be confronted visually with the fact that their models disagreed with one another and none of them agreed with the measured data. All right? So I think, again, you can make progress. But the most important thing to me is that this broad-based distribution of sensors in the environment changes our knowledge of what's going on, which allows us to do a better job of seeing whether the simulations are reasonably believable. >> Dennis Gannon: At this point we're at our break time.
But I want to thank Ed once more for a fantastic talk. >> Ed Lazowska: Thanks for your attention. [applause]
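A minimal sketch of the MBARI-style comparison Ed describes above: put several models' output next to measured data for the same period and quantify the disagreement. The model names, the numbers, and the RMSE metric are invented for illustration; the actual page simply displayed the four boxes side by side.

# Sketch of the side-by-side model-versus-measurement comparison. All values
# and model names are made up; RMSE is just one way to quantify disagreement.
import math

measured = [14.1, 14.3, 14.0, 13.8]            # observed sea-surface temp (degC)
models = {
    "model_A": [15.2, 15.0, 14.9, 14.7],       # hypothetical model outputs
    "model_B": [13.2, 13.1, 13.4, 13.0],
    "model_C": [14.9, 13.2, 15.1, 12.9],
}

def rmse(pred, obs):
    """Root-mean-square error between a model run and the measurements."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

for name, run in models.items():
    print(f"{name}: RMSE vs. measurements = {rmse(run, measured):.2f} degC")

Seeing all of these in one place is what confronted the scientists with the fact that the models disagreed with each other and with the data.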