>> Yan Xu: So the first session will be on transient computing and our keynote speech will be from Dennis Gannon from Microsoft Research on Cloud Computing and Scientific Data Analytics Challenges. Thank you. >> Dennis Gannon: Okay. Can you hear me all right? Does this thing work? Okay. So actually I changed the title after I gave it. I was going to talk about Cloud Computing and the Long Tail of Science. And the theme of this relates to some extent, or I hope to a larger extent, to the discussion that we just had in the last session. So, hey, this works. Okay. So I've got to talk about very quickly what are the challenges of long tail science, what is long tail science in our view. But what I really want to get at in this talk is this question: Is there a sustainable financial model for scientific data? And that, I think, relates very closely to the discussion we had. And I'll talk about our data centers that are powering commercial clouds; we've got a lot of them. And I'll talk about what we've been doing with the Azure cloud in research. Data architectures and the role of MapReduce: I deleted those slides because I thought this was an hour talk and actually it's a half-hour, so we'll skip that. And mostly you probably know that anyway. And I'll talk a little bit about managing analytics from your desktop and building communities, which brings me back to the main question. So I think you may have heard of this notion of the Fourth Paradigm and the revolution of science, so I don't have to spend too much time on this discussion. But as you may have noticed, the data explosion is transforming science, not just astronomy but every science out there. Everybody now is a data scientist. This is something that has happened very quickly and it has had a profound impact on many disciplines. And one of the things that all disciplines have told us that they need is the technology for publishing data and sharing data in the cloud, to be able to do analytics especially on a much more massive scale than they have in the past. Another somewhat pejorative name for the scientists in what we call the long tail is the spreadsheet scientists. Right? Spreadsheets are the predominant analytical tool used across disciplines in science. The astronomy community is well beyond that, but I'm sure some of you still do some spreadsheet work now and again. So they need to move beyond that. They've got a desktop machine and laptop, and it's got only so much storage and so much computing capability. And it no longer fits with what they need to do. And there are a number of other issues that have come up that have changed the landscape for how they have to deal with science, so I'll talk about the sustainable economic model. So what do I mean by long tail? So what I've put in the graph here is the size of data collections. And high energy physics is clearly up there. And I put astronomy there too, but I've had astronomers argue with me that they do not consider themselves yet in the same data league as high energy physics. Well, you folks could tell me whether that's right or wrong. Genomics is now, as we've seen, really working its way up there. But what I think about is that collective long tail of science that's full of economic data, social science data, behavioral science data, political science data. A variety of disciplines that you see around your universities that in the past you never really saw very much of in the computing center. 
And one of the things that changed things is, as you know, the National Science Foundation now basically requires all the data that goes into publications to be made public. And the universities are currently struggling with this right now. This data has to be preserved. The data must be sharable and searchable and analyzable. Now you talk to some scientists and you say, you ask them, "Well, how is your data --" Well, first of all this is largely, from the NSF point of view, an unfunded mandate, this idea of... >> : [Inaudible] mandate. I'm sorry, you can't get away with saying that. We do not require all data be made public. We can debate that at length, but we do not require all data to be made public. >> Dennis Gannon: All right. So it's not required to be made public, although people still have to come up with a data management plan, and the fact is in the UK, I believe, much of the data is now required to be made public. I was just in Brussels last week talking to folks there, and so it is happening. At some point we have to deal with it. And it is an unfunded mandate to deal with it at this... >> : No, it isn't. >> Dennis Gannon: Well, I -- Talk to the researchers that I talk to [inaudible]... >> : Yes, but [inaudible]. >> Dennis Gannon: Well, okay. Perhaps they don't understand. >> : It's a debate topic. >> Dennis Gannon: All right. Well, we could debate it. But how many people out here among the astronomy community feel that the data they have is -- there's a pressure to make more of it public from the government? Does anybody sense that? >> : Well, certainly... >> : Pressure is not a requirement. >> Dennis Gannon: I see. >> : There is pressure. >> : Well [inaudible]... >> : There are certain specific requirements in specific fields. There is not [inaudible]... >> : Right. Okay. >> : But those people who get funding from... >> : I'm sorry, guys, but I was involved in writing this policy, and it irritates me that you keep misunderstanding it. >> Dennis Gannon: Well, okay, so you don't -- There may be... >> : Let me [inaudible]. >> Dennis Gannon: ...good reason to make data public, but that is not a mandate. We'll leave it at that. >> : If you want more funding, [inaudible]. >> Dennis Gannon: I'm sorry? >> : If you want more funding in the future, making your data public is certainly in your interest. >> Dennis Gannon: And there are also good scientific reasons for doing that, as we've discussed before. But then there's this issue of sustainability, you know. How can we create an economic model for this long tail of science? First of all, the government will not directly support an exponentially growing data collection. I talk to some scientists and I say, "Well, you know, you want to make your data available for everybody for a long period of time, how are you going to do that? Who's going to pay for it?" And they all say, "Well, my data is a national treasure. It'll be kept forever and the government will support it." Well, that's sort of like declaring yourself to be a national park, and that just doesn't quite work. So what we'd like to do, and this is a hypothesis we have that we've worked on, is that we can create, it's possible to create, an ecosystem that supports a Microsoft -- A Microsoft? How about that? -- a marketplace of research tools and domain expertise. Right? The idea is that we will build critical mass data collections in the cloud. And this relates to the discussion we just had, that there's a lot of scientific tasks that others can do better. 
That was the point that was made earlier, this notion of making research public, which I thought was really intriguing. But there are certain tasks in a research activity where perhaps there's someone out there who is much better at doing that, and that person or that group can provide the expert services to the community to do that. So if we had data collections that were stored someplace that had to be paid for and then, in addition, had expert services available, that would simplify the research life of the people that need those services. For example, I remember talking to someone doing computational fluid dynamics around certain solid bodies, and one of the biggest challenges in doing that is building a really high quality mesh of that body. And that is a really important skill to be able to do that. Well, there are people that are extremely good at it, and there are people that have even set up services that will do that for other folks. And actually NSF funded that to get that going. So the idea is that if you provide something that is high enough quality, people will be willing to pay a small subscription fee -- You'd get the basic data for free if you wanted it because it's there and it's a community collection. But if you want to be able to access services at a higher level, something that is truly useful that accelerates your research -- because why not let a professional do it, or let a group that is particularly good at that carry out that part of your research for you -- then they can also -- that would give you a mechanism, since it's almost contractual or through some arrangement, they would agree to cite you in their publications. That issue came up earlier. And so I believe that this is a way that we could build a subscription-based service for a lot of scientific data analysis. So I think Michigan's Inter-university Consortium for Political and Social Research, the ICPSR, is a very good model. They've been doing this for ten years, providing the social science data community with highly curated, very well developed data collections as well as analysis services. And they've been doing that for at least a decade, I'm not sure how long. And that is completely supported by subscriptions. Most universities, big universities I believe, subscribe to this for any of their researchers to use. So this is a really interesting model. Why can't we do that? Anybody familiar with that service that they do? If you go on their -- Yeah, if you go on their website, you'll be impressed by the amount of stuff that they offer. And for the researcher, since the university is subscribing to it, it feels like it's free. So our hypothesis is that we can do this, as a community we can do this. And so we're going to try to test that hypothesis. So now I'm going to get away from that for a minute and just talk about the data centers and our cloud platform. We've been building data centers for a while at Microsoft. Microsoft and Google and Amazon have really a massive collection of really very large data centers. These data centers are spread around. For our Windows Azure cloud platform we've got two big ones in Europe, two in the U.S., two in Asia and more to come. And in addition to Azure, we've got a bunch of other data centers. And these things range -- Our small data centers are maybe a hundred thousand servers, up to a million in the big ones. So these are really big facilities. Whoops, there's one more. 
And now here's one that is -- To give you an idea of where the technology is going, this one -- Let me back up here -- in the upper right corner we call a fourth generation data center. And the interesting thing about this is this data center, and this is an artist's sketch obviously, there is no building to it. Okay? We've gotten so good at building these data centers and putting them in large, big, nasty buildings the size of many football fields that we've realized that, "Well, you know, actually the way we build the containers for the servers, we don't need the building anymore." We simply pull these massive containers up, plug them into a backbone of power and networking, and now you've got a data center. To give you another idea: these data centers cost half a billion dollars to build. But they're based now on this model of shipping containers. Each shipping container -- Or something the size of shipping containers. We've gotten away from shipping containers. But they're the size of a shipping container, forty feet long, and they have as many as 2,500 servers each in one container. So it's completely packaged. It's got the network in it. It's got a power outlet. There are basically three plugs on one of these things: there is a power plug, a network plug and maybe a cooling plug into which you pump water. Now the water line in the newest ones is literally the size of a garden hose. It's not a big fancy powerful water system because these things here, like this guy -- This is the latest generation -- primarily it is cooled with ambient air, and that's why it can be left outdoors even in places where it gets extremely hot. If it gets really hot, we turn on the garden hose a little bit. It helps cool it down. So these things are driving down the cost of energy consumption. And the simplicity of setting up the data center, that, you know, you don't even go into this thing very often. If servers fail, you don't take it down to repair it. You wait until a whole bunch of them fail and then you can either take it offline and go in to try to repair it, or you just ship in a new container full of servers and send that one back to the factory. So this is the technology that's used right now to build the large data centers. Now the properties then of these data centers: they are really providing, you know, information services to many users simultaneously. The deployment of services is automatically managed on virtual machines. How many of you are familiar with -- Most scientists are familiar with Amazon, I'm sure. Amazon's compute cloud -- Yeah, okay, so people know about deploying things with virtual machines. And it's the same thing in the Windows case except we have a higher level of services, but now we also have something that's identical to Amazon's. These things provide automatic fault recovery for failed resources. The data replication is already built into this. On our cloud if you store, you know, a gigabyte of data on the server, that gigabyte is replicated at least three times and maybe as many as seven times. And it can be geographically replicated or even across international borders, but that's your choice. Yeah? >> : So replication is for... >> Dennis Gannon: Reliability. >> : Okay, so that implies backup or --? >> Dennis Gannon: Basically this is the backup. Your data is replicated so that if the server fails there are at least two more copies of that data, and a third copy will be automatically brought up based on the other two. >> : There was an instance of Amazon losing data. 
>> Dennis Gannon: Well, I'm talking about Microsoft. [ Audience and speaker laughing ] >> Dennis Gannon: Yeah, actually that was an interesting case because, in that case, when Amazon lost services it was really that the customers had failed to select geo-replication, because their whole data center went out. And so the customers that had geo-replicated their data weren't even offline, you know, their services kept running. The customers that were in trouble were the ones that said, "Oh, yeah, it's in the data center. It'll never go down." You know, power is lost in whole regions. You know? A tornado wipes out something. >> : But the question is how much more expensive is geo-replication? >> Dennis Gannon: For the user it's no cost, the replication, it's just a choice. Some people don't like geo, especially across international borders. That gets to be a sensitive issue. And also, for those of you who understand parallel computing, this is designed to support two levels of parallelism in this architecture. One level is giving service, say Netflix or something, with thousands of concurrent users, and then within that you have -- Or a better example is Bing or -- What's that other one? -- Google, you know, it is something that is -- You have a lot of concurrency going on in the construction of a search, the reply to a search, lots of parallelism there, and then parallelism across the user base, many, many concurrent users. So it's really -- Multiply those together and you see these things are designed to support truly massive parallelism. Now, though, it is not the same as a supercomputer; this is an important point. The scale of the data centers is really quite large, but the thing that's different between the data center and most supercomputers lies in the network. Those of you who are familiar with supercomputers -- And this is an old slide because it mentions the old Blue Waters design with forty thousand 8-core servers and Roadrunner, with thirteen thousand Cell processors, and our Chicago data center is a hundred thousand 8-core servers at the time I made this slide. It's bigger now. So they are very big, but the network architecture is different. In a supercomputer you spend an enormous amount of money on the interconnection network between the processors, so for your computation, if you're doing parallel computing, you have really good what's called bisection bandwidth; the data that travels between processors can move at very, very high speeds and with very high bandwidth. In a standard data center, the network is completely different. It's designed to be outward focused so that thousands of concurrent users are coming in and going out. And they're coming in over the Internet. And so the network inside the data center is like an internet. It is based on internet protocols, standard IP protocols. And that means, those of you who are familiar with internet protocols, they're very complicated, multi-level routing and exchanges. It is not the same as you would see in a supercomputer, which has a very, very lean communication protocol stack. Now this is also changing. Amazon was the first to recognize, you know, "Some people want to do some supercomputing-like tasks inside the data center." So they put up a really nice sort of a supercomputing cluster within the data center, and Microsoft has just announced that we're going to do the same thing, and so there'll be such a cluster in Europe, another one in the U.S., another one in China. 
And so that'll give people who need, really need, massively parallel supercomputer-style stuff, things using MPI, that type of communication with low latency -- yeah, we'll have that available. And, of course, you'll pay more for it because it costs more. So what has been my experience working with scientists in the cloud so far? What I've been doing for the last two and a half years is running a program in which we work with various funding agencies -- the NSF, the European Commission, the National Institute of Informatics in Japan, the agencies in Asia, in China and Taiwan -- about making our cloud available to researchers so that the researchers could try experiments using the cloud. And it's been interesting. So far I've got about 90 projects scattered around the world that are up and running on the cloud. Of those 90, I would guess about a third, so about 30, have really done something very, very interesting with it. To give you some ideas of the types of things people have been doing: For example, we did a big project with the University of Washington on protein folding where they had a tool that was based on a technology like [inaudible], which allowed people to use, you know, volunteer computing to do protein folding. And they had a specific problem: there was a pretty long wait time to get anything done. And so we had someone from the Baker Lab, who set up such a system, come to us with a specific challenge that they were working on having to do with something involving, I believe it was, Salmonella -- yeah, a Salmonella virus injecting DNA -- Yeah, it was a really important protein folding problem having to do with the study of Salmonella. And so we got them two thousand concurrent cores, put their system up on it so our cloud became the volunteer, and they were able to handle that computation and get a couple of publications, a Nature and a Science paper, out of that in about a week or so. So that was quite successful. We've got another project in France with INRIA where they're using a thousand cores to compare fMRI scans of brains together with the associated genetic information from the patient, to try to understand how the brain anomalies might be exhibited within the genomic information as particular types of mutations or faults in the genome that would be associated with these brain anomalies. And that project is still going on. Fire risk in Greece, another one from Europe. Greece, a couple of years ago, had some really horrible fires. And they built a service on Azure to be able to take data from around the country, to stream it in, to do the analysis and the prediction and the modeling of where the fires might break out. And this then became -- Because it's on the cloud, it became a web service that could be used from a laptop or an iPad or something that could be in the field. You know, the first responders could actually be querying this and looking at where the fire hazards are and what the situation is. >> : Does this include SQL Azure as well? >> Dennis Gannon: Pardon? >> : Does this include... >> Dennis Gannon: Ah, yes. Some of these things are SQL Azure related. These three cases here did not use SQL Azure. But SQL Azure, for those of you who don't know: in addition to the Azure cloud and the mass data storage that we have, there's also a mass deployment of SQL servers. And that is something that is used by some of the projects but not all. In fact that didn't become available until many of these projects had already gotten started. 
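The Greek fire-risk project illustrates a pattern worth making concrete: the analysis lives in the cloud, and field devices only ever talk to a small web endpoint. The sketch below is illustrative only, assuming a hypothetical fire_risk_for() function and query format; it is not the project's actual service or any Azure API, just the general shape of such an endpoint in plain Python.

```python
# Illustrative sketch: a cloud-hosted analysis exposed as a tiny web service
# that a laptop or tablet in the field could query. fire_risk_for() and the
# URL scheme are hypothetical stand-ins, not the real system.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def fire_risk_for(lat, lon):
    # Placeholder for the real model: the actual service would consult the
    # continuously updated prediction computed from streamed sensor data.
    return {"lat": lat, "lon": lon, "risk": "unknown"}

class RiskHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        lat = float(query.get("lat", ["0"])[0])
        lon = float(query.get("lon", ["0"])[0])
        body = json.dumps(fire_risk_for(lat, lon)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # A first responder's device would then issue requests such as
    # GET /risk?lat=38.0&lon=23.7 against the cloud-hosted endpoint.
    HTTPServer(("", 8080), RiskHandler).serve_forever()
```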
Drug discovery: Newcastle in the UK is using Azure to model properties of various molecules for drug discovery, and that project has been quite successful and it's even got a commercial spin-off going. Paul Watson has been doing that. In Japan we had somebody looking at structural analysis of predicate-argument structure in the Japanese language. So they looked at a whole host of text, and they do a lot of machine learning-based techniques, again on Azure. They used ten thousand cores on Azure to do this analysis. And finally another example: this is done in the U.S. with the University of Virginia and also South Carolina, where they are looking at large scale watershed modeling. And that project is also still ongoing and quite interesting. Now, what have we learned from this? You know, traditional communication-intensive MPI applications belong on supercomputers. We had a few people try to take traditional, say, quantum chemistry modeling that they were doing on a supercomputer -- this is an Australian group -- and do it on our cloud, and they said, "This was horrible." And I said, "Well, I told you it would be, but you wanted to try. Thanks for trying." So aside from that, the cloud had some really wonderful advantages. First of all it is an environment that encourages sharing and access through the web. Excuse me. I'll make it through the half-hour. A lot of this is massive MapReduce data analytics on cloud-resident data, and that works very well. Or massive ensemble computations: you have a thousand things that you're trying to run simultaneously. A thousand cases, okay, that's a typical ensemble calculation. And you can do them all in parallel and that doesn't require lots of communication. It works very well. Also this model of scale-out as needed works well. The users found that they like this pay-as-you-go model better than buying a cluster and then having to devote the life of a graduate student or two to maintaining it. The pay-as-you-go model works pretty well. In particular, you buy a cluster with, say, a thousand cores. You want to then somehow use something with ten thousand cores; well, your thousand-core cluster doesn't expand. Whereas the cloud can scale out if you need it to. And this turned out -- You know, I did a detailed survey of the users that used it, and in all the successful projects they said, "Yes, we would go back and ask our funding agency for money to use the cloud rather than go out and buy specialized hardware." And I see I'm almost out of time. Should I go to one thirty? Is that the time? >> : Yeah. A few minutes. >> Dennis Gannon: A few minutes. Okay, I'm almost done. But what do people do here? They want to bring large scale data analytics to more people; they want scientists to be scientists. Okay? You know, most scientists don't want to deal with system administration. They don't want to learn to operate a supercomputer; they want to focus on their science. They use standard tools like spreadsheets and statistical packages, desktop visualization. What they would like is to be able to push a button and have the big hard parts go off into the cloud, and they want to be able to share results with their collaborators. So what we see as a software design stack is to design a collection of data management and analysis tools that is open, extensible and provides, you know, this economic sustainability model, is accessible from a desktop, encourages collaboration and leverages this capability of these public clouds that we've got. 
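The ensemble pattern described a moment ago, a thousand independent cases with essentially no communication between them, is simple enough to sketch. The snippet below is illustrative only: run_case() and the parameter sets are hypothetical stand-ins, and on Azure or any other cloud the same fan-out would be spread across many worker nodes rather than the cores of one machine.

```python
# A minimal sketch of an ensemble computation: many independent cases,
# no communication between them, so they can simply be fanned out.
from concurrent.futures import ProcessPoolExecutor

def run_case(params):
    # Placeholder for one simulation or analysis case.
    return sum(x * x for x in params)

if __name__ == "__main__":
    cases = [[i, i + 1, i + 2] for i in range(1000)]  # e.g. 1,000 parameter sets
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_case, cases))      # all cases run concurrently
    print(len(results), "cases completed")
```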
So we're working on trying to figure out whether we can build such a thing to do that. That's an interesting -- Oh, let's skip this. Oh, yeah, so here's an example of the types of technologies. We're really interested in not basically selling you our latest piece of software but finding out what are the tools the scientific community wants to use. And we'll put them there on the cloud. So here's an example. We now have some really great support for Python, and I know that's widely used in the scientific community, as is R. And so there's this thing that has been built by a couple of groups in the Python community called the Python Notebook. So this is sort of a web thing that looks kind of like a Mathematica notebook, which has got all sorts of cool things. It's a way of sort of communicating: here's my new way of doing some computation, and here it is. It's an executable notebook. The notebook is actually running in the cloud, and it can invoke cloud resources. It can even do the back-end parallel computation. Another thing that we've worked on is something called Excel DataScope. And what it is, is an extension to your spreadsheet that allows you to, from your spreadsheet -- you get this ribbon that you see in Excel. Well, we have a special ribbon for scientific data analysis for importing and exporting data, doing things like outlier detection, machine learning, a variety of different algorithms. And we've released some of this open source and we're going to be releasing more of it later. So next steps? We've been working very closely with the Internet2 community. And we got started lately with this notion of a large scale project with -- Thirteen university CIOs came to us and they said, "Help us. We've got this challenge now. Our people, in the past we just gave them a workstation and they were fine. Now they need data collections and they want to be able to store the data, share the data; they want to be able to mix data from different sources." And they came to us. The CIOs said, "What can we do?" And so we had a group of us that met in March, and we started something up. We made an agreement with Internet2 to be a provider of cloud resources -- or broker, I should say, of cloud resources to the university communities. And so we've sort of focused on two different things. One is -- I should back up again -- genomics research. We will do a program around genomics research. And you've heard some of this stuff from Microsoft and the great interest in that here, and we'll continue that. But in terms of the long tail, a workshop we've got coming up on October 15 and 16 at the University of Washington on this sort of cyberinfrastructure for social science. And so there we're going to be looking at trying to find out whether we can build a community that is interested in sharing data, in bringing together important data collections and the important tools that people need to study the data. And so, you know, our goal is to try to demonstrate that we can build two sustainable collections, or collections for two communities, within three years, and at the end of three years to see if we've got something that has enough services and capability that people would be willing to support it so it's self-sustained. So if you're interested in this, talk to me. And I'll stop there. [ Audience applause ] >> Yan Xu: Thank you very much. And we have time for a couple of questions, please. >> Dennis Gannon: Yeah? >> : So on the slide that you had about the lessons. >> Dennis Gannon: Yeah? 
>> : So I don't think that any of those seemed particularly surprising necessarily. I was wondering, is there anything that actually struck -- that you didn't know at the time, that you learned from this? >> Dennis Gannon: Oops. Well, I just killed that. Well, I won't go back to it. Well, you know, in hindsight, yes, those things are not very surprising. It wasn't clear to me, though, you know, how many users would -- They're willing to experiment with this cloud as long as I've given it to them -- how many users would, say, afterwards when it's all done say, you know, "Yeah, I would actually ask my funding agency to provide me with, instead of money to buy a cluster, time on Amazon or Azure." And that wasn't in the slide. It was in my more recent analysis that I've done. And that was quite strong, both in the U.S. and in Europe. >> : Your average astronomer, I think, is only just now coming to terms with data access through the cloud [inaudible]. And I think most astronomers really don't want to do any analysis in the cloud. Many of the online tools and [inaudible] very similar; to produce results for their publications they do that on the computer with code that they've written themselves or collaborated on. And so I like the Python Notebook idea, but can you imagine other ways to bring the traditional laptop environment, where astronomers are comfortable with their own codes, into the cloud? Is that a possibility? >> Dennis Gannon: I think so. I mean, it depends upon the -- You're talking about taking the applications that people run on their laptop, pushing them to the cloud and scaling them out. Yeah, absolutely.... >> : [Inaudible] applications up there rather than providing them with a fixed set of applications.... >> Dennis Gannon: Right. So I absolutely agree that you need that capability. We have this thing that we built through the European project that's called the Generic Worker. And it takes some application from your desktop, and through a control panel you can push it out and have one or more instances of this thing running in the cloud. Now it becomes very complicated depending upon whether that application has a special user interface, you know, graphical interface or whatever. That can be dealt with, but it's typically done for each one. But, yeah, I think that's important. That's one of the reasons I am really pleased that we are now able to run Linux VMs on the Azure cloud. >> : It's kind of a comment on the question, but the [inaudible] Campfire, that's what Campfire does. You go on a VM and it's like being on your own desktop. And it's a cloud. So that's [inaudible]. >> Dennis Gannon: I should go look at that later. Yeah. >> : I mean this is [inaudible] need to get over this idea that they can't use other people's tools. [Inaudible]. >> Yan Xu: Yeah. Okay, briefly, last [inaudible]. >> : So sort of following out of this last question, what about the case where you've got software that requires licenses? Specifically MATLAB? And, you know, if I'm running a big problem with, you know, your ten thousand cores, and don't tell me to use R or Python, okay. >> Dennis Gannon: You want MATLAB. >> : I've gone down that path. And so the question is, well, what do I do? [Inaudible]... >> Dennis Gannon: Yeah, so actually in the case of MATLAB we've had discussions for several years now with MathWorks. And they are, I would say, warming up quite nicely to the idea of providing -- I mean this is a business decision on their part. And I think they're kind of getting it now. 
I don't know what will come of that. I hope it is something that happens sooner rather than later. But I'm not involved in that discussion. You and I can talk about that but, yeah. Yeah.... >> : We should talk. Yeah, because I've got a problem I could, you know, run tomorrow if that... >> Dennis Gannon: It has been a discussion. I mean the first time... >> : Okay. >> Dennis Gannon: ...we went to MathWorks about this they said, "No way. Get away from us, you devil." And then over the years they've sort of kind of figured it out. But I don't know how far along it is. >> Yan Xu: Okay. Thank you very much, again. [ Audience applause ] >> Yan Xu: So we move to the second keynote of this session. And Mark Stalzer from Caltech will tell us about Trends in Scientific Discovery Engines. >> Mark Stalzer: All right. Thank you. Can you hear me? Okay, super. So I'm actually going to talk about one of Dennis' slides but in much more detail. And that's supercomputers. And so the reason that supercomputing is important is twofold: there are some applications that you can only do on a tightly coupled machine. All right? The other thing is that supercomputers act as kind of the Formula One race cars of the computing industry, and they drive the progress of the entire industry. So for those of you who saw parts of this talk last year, I've made changes. So I know it's after lunch but, you know, please stay with me here. I've already done that. And so some of the things about computing, and these have been true for quite some time, is that supercomputers are always built out of commercial parts. And this was even true with the Cray-1; the commercial parts they used were just very simple gates. And some of the drivers -- It used to be actually that the reason we had a semiconductor industry in the United States early on was for missile guidance, and now I think it's all for virtual missile guidance, which is video games. But it's all about power and packaging. When you're going to try and get the most performance out of a computer, you have to be very careful about your power and very careful about how you package it. So this is a definite -- Well, we see this in cloud computing too, but they have different constraints they're trying to optimize. And the fact is, a hundred megawatts is very expensive. All right? But there are devices that are extremely power efficient, like the chips that drive the iPhone or the iPad. And these computers are hard to program, but they can be easy to use if you have the right abstractions. And so this talk is kind of broken into two pieces: one, we'll talk about pure high performance computing in terms of how quickly you can do linear algebra, and there are some very interesting things going on there. And then I'm going to try to make the case that we're all messed up on our storage architectures and we need to rethink those. And the way to do this is to come up with a metric -- like the LINPACK benchmark -- for storage. Okay, so trends in what I call the simulation engines, or computing without data, because they compute much faster than they can pull data in. And so the top ten supercomputers: this is just the June list. This year it's Sequoia. And it's running at 16 petaflops, which is quite remarkable. And there's another nice machine in Japan that's running at 10 petaflops. But you'll note that these machines draw nearly 10 megawatts of power when they're running. The other interesting thing is these are not accelerated. 
You have to go down to -- Oh, boy -- This machine is actually accelerated. So people say that, you know, GPUs are all the future, yet the fastest computers we have right now are actually general purpose. This is what Sequoia looks like, and you can actually play soccer in this machine room. It's big enough to do that -- This is at Lawrence Livermore -- if there weren't computers in it. And it's very cold. And so one of my points will be that all the parallelism, or a lot of it, is moving onto the socket. So people ask, okay, is it a processor or is it a core, anything like that? The thing to think about is that the socket is where a lot of the computing is happening. So there's the socket. And of course these things are optimized for dot products. And that's what one looks like in C++, and there's a lot going on here, about ten instructions. And what's interesting about this point in time is that we can compare the fastest machine in 2002 with the fastest machine in 2012, so over ten years. They're the same architecture. So we can compare apples and apples. And so ASCI White got 7 teraflops. It's the Power architecture. It's clocked at 375 megahertz. It has 8,000 sockets and it could complete one of these loops every clock cycle. It was finishing like ten instructions every clock cycle. And so we'll just define that to be one. All right? Sequoia is over 2,000 times faster than ASCI White, but it's the same basic processor architecture. It's running about four times faster. They went up to a hundred thousand sockets, so there are reliability issues beyond that. And this is a tightly coupled machine with very fast networks. But what happens at the socket level is the parallelism has now jumped to 64. And in fact of this gain, about half of it is clock rate and just making the machine bigger and better packaging. The other half is the parallelism on the socket. Okay. And this has a lot of implications, but it's going to get much worse. So you look at the Intel MIC. It will have fifty cores on a socket and each will be four-way threaded, so it'll go over a few hundred threads per socket and a teraflop. And in a few years this is going to scale out to a thousand threads, and hopefully it'll be cache coherent. And we'll have something equivalent to ASCI White in a single socket. All right? This is a qualitatively different programming model. You use MPI between the sockets. It now becomes very high latency, and you have to use massive general threading within the socket. And we're going to have to rewrite our codes. We're going to have to rewrite our codes to exploit this two-level parallelism. So if you think of an image processing algorithm and you had a bunch of computers, you could just throw an image against each socket and it would, you know, do the comparison from, you know, what we just got tonight to a month ago. But you can't do that here. You can't throw hundreds of images against a socket because they just don't have the memory. So what you have to do is rewrite the code so that it actually parallelizes within the image, okay, in little regions in the image. And so this is a qualitatively different model. And now we're asking our students and our post docs and whatever to now learn two kinds of programming models, and it's just inevitable because the physics is driving it. And what you get is what I'm calling a Socket Archipelago. It's just this big sea of sockets. It can get much worse than that. 
We were looking in another program at exascale computing, and every one of these is another layer of parallelism. So I've only talked about two here, but those are the most relevant, you know, the ones that work right at this level. So, you know, we'll get to exascale but, you know, it'll work for maybe one code. And this is a chart from Peter Kogge. And what was really interesting -- some people have talked about how you have to move the data and everything like that. In these supercomputers, you do not want to move the data, because this is a complete trace, as part of LINPACK, of one operation, one floating point operation, and what it takes to stage it all up in terms of power. And it was 475 picojoules. The amount of energy it takes to do the actual floating point operation is 10 picojoules, so it's a factor of 50 just to move the data as opposed to computing on the data. So this is kind of another manifestation of what I'm saying. All right, so let's switch to data engines. And here progress is very much behind what's been going on with the traditional supercomputers. They had a very well defined, easy benchmark. Storage is actually quite a bit behind. And here you get this latency of how long it takes to access data. If it's in the register, it's one cycle. If it gets down to disk, it's, what, ten million. And remote memory, which is memory in somebody else's socket, is ten thousand cycles, and so there's a big gap. And so people have been trying to bridge this using flash-based solid state disk. And they've had some good success. So this is Gordon. And you can look at the architecture. But it's basically a supercomputer but with some solid state drive attached storage nodes. And it has -- What am I looking for? Okay, anyways, it has about 8 terabytes of solid state disk. So you can actually load an 8-terabyte data set onto this machine. And it has typical supercomputing kinds of interconnects. And so this is a particular benchmark. I actually forget what this is. The nodes here are actually the size of the data structure, not the size of the computer. That'd be kind of big. And so you look at the performance with hard disk drives -- This is very new data. It came out in June -- and when they ran it -- So it computes the same, but when they ran it across the hard disk array, and this is a supercomputing hard disk array, it was this amount of time. And then when they moved down to the solid state array, there was a dramatic improvement by a factor of six and a half. You also note that we're actually starting to get an Amdahl's Law effect here because we're actually computing a lot more than we're hitting the disk, which is a good thing. Anyways this is available right now on XSEDE. This is the TeraGrid successor. And so another idea that Alex Szalay and others came up with [inaudible] was this idea of an Amdahl-Balanced Blade. So there are three definitions here. And the Amdahl number is a bit of sequential I/O per second per instruction per second. And simulation codes tend to only need about 10 to the minus 5, which means they're doing very little I/O. They're just completely computing. But they built some machines, again using solid state disks, that were much -- in fact they used the GrayWulf as the standard because it was a very effective data intensive machine and it had some good balance. But moving over to SSD, they were able to get even better Amdahl balance. And their hypothesis is that, for data intensive apps, you need a number of about one on the Amdahl number. 
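To make the definition concrete, the Amdahl number can be computed as the sequential I/O rate in bits per second divided by the instruction rate in instructions per second. The figures in the sketch below are illustrative round numbers, not measurements of Gordon, GrayWulf, or the blades discussed here.

```python
# A small worked example of the Amdahl number as just defined:
# bits of sequential I/O per second divided by instructions per second.

def amdahl_number(io_bytes_per_s, instructions_per_s):
    # The factor of 8 converts bytes/s of sequential I/O into bits/s.
    return (8 * io_bytes_per_s) / instructions_per_s

# A simulation code sustaining ~1e11 instructions/s while reading only ~0.1 MB/s:
print(amdahl_number(1e5, 1e11))   # 8e-06, roughly the 10^-5 simulation regime

# A data-intensive job streaming ~1 GB/s against ~1e10 instructions/s:
print(amdahl_number(1e9, 1e10))   # 0.8, close to the Amdahl-number-of-one target
```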
And they're getting that with some of these machines. And they're also remarkably low power, 30 watts, and they're pretty cheap. And they built a whole cluster out of it, which is what they call Cyberbricks. And they get this whole 36-node Amdahl-balanced cluster in a little over a thousand watts. This is like a hair dryer. This is nothing. And spectacular I/O performance. And so if you have to crunch an astronomical data set this is a nice machine, of course, to do it on. This is going more mainstream, and Calxeda -- I haven't ever heard it pronounced so I hope I pronounced it right -- is working with HP on a program they call Moonshot. And these are very tiny and they're ARM processors, so ARM is in a lot of cell phones and things like that. And these cards are only like this big. And they can put like 200 and -- Does it say here? No. It's over 200 -- close to 300 of these chips in a 4U case. And they all boot Linux, so talk about a system administration nightmare. And each of the little virtual servers -- They're building another card that has a solid state disk -- would only draw about 5 watts of power. And so you can imagine, you know, Yahoo and Google and Microsoft cloud servers moving to technologies like this, which saves a tremendous amount in terms of operating costs. All right. How much time do I have? >> : About five minutes. >> Mark Stalzer: Five minutes. All right. So anyways, you can go even further than this. Yes? >> : You said that you may expect a big data cloud to use this technology... >> Mark Stalzer: It could. >> : But is that -- Do you see any show-stoppers for that? Or is that actually going to happen? >> Mark Stalzer: You would have to ask the people that run it, but it seems that given that they're running a bunch of commodity servers, you know, these servers are going to keep shrinking too. And their workloads are designed to work well on those kinds of things. And so I would imagine it would... >> : But cost-wise are they competitive? >> Mark Stalzer: The power is much lower. So I'm running out of time here and I want to get to the two punch lines on the whole thing. So we could do a lot better, though, for data intensive applications. In fact we can do a hundred times better than what we're doing now with existing technology. And that is by using parts that companies like Apple are putting into things like iPads. And what you can do is you can array these parts up and you can, on a single blade -- All right? -- you can get like 64 of them. They're small. They're cell phone parts. But when you aggregate it all up, you get about six [inaudible] of flash on a blade; you get about six terabytes. But in terms of the performance, in terms of bandwidth and latency, it's about a hundred times faster. And then you can get up to about a four teraflops accelerator, so again this is very Amdahl-balanced. And it also fits on a single blade. It would look something like this. It just got published. And it has huge implications for data to discovery because it can read its entire data set a hundred times faster, just because it has so many I/O channels and it doesn't have software stacks on the I/O channels. It's a hundred times faster at random access. It's Amdahl-balanced. And so this is qualitatively new because it's a factor of two orders of magnitude. You need to have a number of reads much greater than the number of writes. This is for a very technical reason. But you can imagine one rack of these things, just about a half-petabyte, handling all of the LSST processing. Okay? 
And it'd be good for this as well. In fact I looked at one server, and one server could store a billion web pages and handle Google's basic search workload. Now they do a lot more, but this is a factor of a hundred big. Its performance on triple stores -- And, again, I'm going to run out of time because I want to get to something else -- is probably, we think, about a thousand times quicker. And maybe this is starting to look at what a storage metric would be, because this is unstructured, you know, semantic data. And maybe these are the kinds of metrics we should be looking at. You know, things like triple store performance, to get the data field to where the simulation people are at. So this is the fun part of the talk. So there's some speculation. The human mind only does 10 to the 16th Ops. And the whole point of these next two slides is to show how far we have to go in our trends in computing for both storage and computation ability. So I'm just stipulating this. You don't have to believe it, and it's not my number so it's not my fault if it's wrong. All right? So we could build these now. Okay? Sequoia's already running faster than this, and you can build FlashBlade-like engines this fast too, probably at about two megawatts. Okay? That's the important point, two megawatts. Oh, for all of you people who write simulation codes, an engine like this would checkpoint in ten seconds, which is just extraordinary. But now there's another big data engine, all right. This is actually a monkey brain. So by definition we're at 10 to the 16th. The memory actually in a storage-intensive system as I've described would be actually quite a bit larger. And so this would only be about 10% and it forgets. It forgets all the time. The bandwidth is roughly the same; it just depends on where you are in the hierarchy. But the packaging is rather remarkable in that it's really small. And you saw how big Sequoia was in terms of the number of cubic feet of space, and it's a factor of 8,000. All right, so these numbers are kind of plausible. Here's the amazing number: this thing only draws 25 watts. All right? So it's 80,000 times more efficient than our current technologies, even if we do every trick we can with our current technologies. So in terms of trends, we've got a long ways to climb. All right? And then if somebody knows the algorithm at Microsoft Research here or something, please let us know. So, all right, back to the socket archipelago. And I'm almost done. So, again, let's look at where the parallelism is at. So at the cluster level I don't think we can really go beyond 100,000 sockets for reliability issues and size issues and things like that. And the latency between the sockets is about 1,000. A thousand what? In a socket we might go up to about 10,000 threads. This is just in five years. So what we have is a latency difference of 1,000. So this goes back to this whole idea of MPI basically working between all the islands, okay. And this is where the archipelago idea comes from. And then the threads are, you know, the tribes on the islands. And so they can communicate much, much faster here than, you know, rowing a boat over to the next island. So, again, we're going to have to restructure our codes. And there have to be large caches on the sockets because the second you go off socket, that's it. All right? You have tremendous latency. 
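The restructuring Stalzer describes, coarse parallelism between the islands and fine-grained threading over small regions within one island, can be sketched in miniature using the image-processing example from earlier in the talk. The snippet below is illustrative only: the tile size, toy images, and per-tile operation are made up; processes stand in for sockets (where a real code would use MPI), and Python threads stand in for the hardware threads a compiled code would actually run on the socket.

```python
# Two-level decomposition sketch: one whole "image" per process (island),
# and a pool of threads working on small tiles of that image within it.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

TILE = 128  # tile edge length, small enough that a tile fits comfortably in cache

def process_tile(image, x0, y0):
    # Placeholder per-tile work, e.g. differencing tonight's image against last month's.
    return sum(image[y][x] for y in range(y0, y0 + TILE) for x in range(x0, x0 + TILE))

def process_image_on_socket(image):
    # Inner level: many small tiles in flight at once within one "socket".
    n = len(image)
    tiles = [(x, y) for y in range(0, n, TILE) for x in range(0, n, TILE)]
    with ThreadPoolExecutor() as threads:
        return sum(threads.map(lambda t: process_tile(image, t[0], t[1]), tiles))

if __name__ == "__main__":
    images = [[[1] * 512 for _ in range(512)] for _ in range(4)]  # four toy "images"
    # Outer level: one image per "socket"; between real sockets this would be MPI.
    with ProcessPoolExecutor() as sockets:
        totals = list(sockets.map(process_image_on_socket, images))
    print(totals)
```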
And I'll claim that the non-volatile storage, whatever technology it is, is going to have to be stacked on top of the structures as well, otherwise it takes forever to get to. I don't have time for this. How much time do I have? Okay. So anyways, you can look at the slides later. There are some ideas on programming these things. But my concluding remarks, last slide: we're not stuck with clusters. Off-the-shelf technology, you know, it's not what you just get at Fry's, but the things that we know how to do and we can quickly build. What is a Top 500-like benchmark for data? Because this will drive the development of the systems. Okay? It's been fantastically successful for dot product engines. We want something similar for data engines. Get used to threaded programming or find somebody who can write a library to abstract away what you're doing. And also think of what can be done in terms of a shrinking 10 to the 16th Ops system, because that's where the technology is ultimately going. And that's it. >> Yan Xu: Thank you very much. [ Audience applause ] >> Yan Xu: And time for a few questions? Yes, Matthew? >> : I have two questions or two comments. Firstly, on the benchmark for data, a result that I'll show Wednesday of some experiments I've been running comparing [inaudible] against relational databases and proving unstructured data. >> Mark Stalzer: Okay. >> : It turns out that the [inaudible] in relational databases... >> Mark Stalzer: I believe that. >> : ...in just off-the-shelf. On the programming side, so this would mean advocating both, you know, MPI parallelism and GPU-type parallelism [inaudible] as the sort of thing we should be teaching. >> Mark Stalzer: Right. >> : Is essentially what you're saying. >> Mark Stalzer: Actually MPI, yes. But I would say programming with general threads. GPUs are very much SIMD machines. And... >> : Well, [inaudible] is... >> Mark Stalzer: Yeah, yeah. >> : ...a bracket term for that sort of... >> Mark Stalzer: Right. >> : ...threaded. Yeah. >> Mark Stalzer: We have to teach threaded programming. >> : So this is lightweight memory usage but for small computational algorithms. >> Mark Stalzer: Yeah, exactly. And how to synchronize them when 10,000 are flying around at once. Yeah. >> Yan Xu: Please. >> : So you were talking about the [inaudible] reads much larger than writes. >> Mark Stalzer: Yes. >> : And then you followed it up by talking about checkpointing. >> Mark Stalzer: Yeah. >> : It's exactly the opposite. >> Mark Stalzer: No, no. So the problem is that flash memories are programmed by tunneling through their oxide. Right? And eventually you break it down. So there are some technical things, in that it looks like for a given flash part, if you relax non-volatility constraints to like a week instead of ten years, you can get well over a million writes. And so the point is that a machine like I was describing would have to, like, go to sleep, you know, every few days so that you could swap out flash parts. So if a programmer tries to use it as just a read-write system, they're going to burn it out and you should charge them for that. So you want to just think in terms of reading the data a lot and only writing it. But a checkpoint every hour is no big deal. >> : Even though you're never reading those checkpoints? >> : Right. Right, exactly. >> Mark Stalzer: Correct. Right. >> : But we can also repair outside [inaudible]. >> Mark Stalzer: Yeah. >> : [Inaudible]. They just have to [inaudible] every night. >> Mark Stalzer: Oh, okay. That's okay. 
I'll just go to sleep. >> Yan Xu: Yes? Final question. >> : The 10 to the 16th Ops, 25 [inaudible]... >> Mark Stalzer: Yeah. >> : ...machine is a nice piece of technology, but you cannot trust it. [ Audience laughing and commenting simultaneously ] >> : Well, no, you just use the [inaudible]. >> Mark Stalzer: You can't do it with CMOS, okay. But graphene-based structures will get down there. And they're actually -- I'm not a device physicist, but they're actually even more reliable than CMOS parts. But you still can't trust it, I agree. >> Yan Xu: Okay. Thank you very much, again. [ Audience applause ] >> Yan Xu: And we'll move to the panel discussion, so will the speakers please come up for the panel discussion. >> Alexander Szalay: So we need to -- In order to get to sort of exascale and exabytes of data, we need to get on a completely different curve. So I think Mark gave a really wonderful start to this. But we'll see that most likely we will build systems with millions of components, so with very high density and low power. And then there will be all sorts of interesting issues coming from the change in the programming paradigm. Not just about how do we handle the threading, but also how do we handle the frequent failure of the components, and how do we write codes which are also self-healing or recover from [inaudible]. How do we create toolkits where we can basically simulate hardware failure or anomaly at will in the different pieces of the code, so that we can actually see how to debug the code, essentially recovering from all the errors. And on the power side, if we are, for example, focusing entirely on the data intensive part -- So, for example, if you want to build a very heavy [inaudible] engine that really streams data at an Amdahl number of one, we might consider, for example, cache to be an impediment. So today a lot of the power -- So when Intel builds processors they make all sorts of tradeoffs between how much silicon and power they devote in the chip to the floating point units, to the I/O devices, and also how much real estate is spent on cache. And basically cache makes a lot of sense when we do numerical computing, when data locality gives us a lot of advantages. When you want to stream a petabyte of data from the disk through the storage hierarchy onto the CPU, essentially the cache is giving us very little. So in a sense if we could build a special processor, stream processors basically -- this is to some extent why the GPUs are so good at doing what they do, because they have almost no silicon wasted on the I/O. And basically they also have very efficient streaming, so we can build basically all these pipelines. So just to get started. >> Ian Foster: Okay, I'll say just a few words. A topic that I think is worthy of consideration is what will the computer systems that the community needs to build or acquire or pay for look like in the future? We've heard that they need to be able to accumulate very large amounts of data, perform analysis on that data, presumably -- We haven't heard so much about this, but also integrate simulation with data analysis for various reasons. So how many of those systems should we have? Will we be able to perform all of our computation on systems operated by the likes of Microsoft, or will -- I don't think it necessarily makes sense for every university to acquire such a system. Will national centers acquire them and, if so, will they look like our supercomputers today? I think they probably won't. 
So those are some questions that perhaps people have opinions on. >> Mark Stalzer: Well, the advantage of just speaking... >> Ian Foster: Yes. >> Mark Stalzer: ...is that you know what I think. I think, again, power is a crucial issue. Workforce development is very important for all sorts of things, from threaded programming all the way up to, you know, how to use data analysis tools. And those are -- And also finding, like I said, what's the metric to drive data intensive systems? Because we could do a lot better than what we're doing. And flash technologies are actually relatively primitive. There are other emerging technologies that could do a lot better. >> Yan Xu: Are there any [inaudible] comments? >> : So all of you made essentially an important point, that we've been leveraging for scientific computing commercial developments driven by something else, say GPUs or cell phone technology and so on. I'm thinking that likely we're going to see a lot more 3D video coming, both real and simulated, and that that might produce the next generation of GPU equivalency, if you will, or [inaudible]. Is anybody thinking about what might be the architecture of those and how can those be turned into scientific computing engines? Or speculate how would you do it? >> Alexander Szalay: Already I think the current generation of GPUs can render more than the data that we can feed them. So they are already -- So in a sense there is some race going on, but essentially the bottleneck is already how do we get the data and the model into the GPUs basically. So already a tablet can render a very complex game. And so I think the trick will be rather that, okay, so how do we get -- for example, in terms of visualization, how do we get the data again to the engine? How do we -- Maybe we will be trading more and more CPU because CPU will be essentially "free." So using much fancier compression techniques, basically trading more CPU against bandwidth -- that I think might happen. So we come up with extremely clever compression algorithms. >> Mark Stalzer: My guess is that they'll look a lot like better versions of our current GPUs. And the trouble is that these things are difficult to program, so that's why there's a shift back to a model that is actually in some sense more energy inefficient, but they can get more performance out of it. But if you really want to just build a holodeck, the final drivers are probably going to be, you know, GPU-like structures like what we have now. >> Yan Xu: Yes. >> : About power consumption, you said CPUs are essentially free; [inaudible] talked about [inaudible]. And one of our biggest problems there is that we've got this machine [inaudible] which requires 10 megawatts, and that's not cheap, dominating our running costs. If CPUs are free but they're [inaudible]. So do you -- And I know the power costs are slowly coming down, but do you see any bigger leaps in the future? And, otherwise, CPUs won't be free. >> Alexander Szalay: So Andrew Chien, who was the head of Intel Research, down in Chicago, he has a wonderful talk about this. So the ten by ten. So basically the current generation of Intel CPUs is like a Swiss army knife a thousand times over. So it has an instruction set that is way, way too complex. And basically any co-processor, so an FPGA or a signal processor chip or a codec hardware chip, so a DSP chip, each of them can do a hundred times better in a very specific task at the same power budget. 
Okay, so why not build an array of those, a mosaic of those on the same silicon, where we turn all the different components on and off at will, instead of trying to [inaudible] everything in general-purpose hardware? I think that's a very good perspective. Of course it goes kind of [inaudible] Intel's world view, so no wonder that... >> Ian Foster: He left. >> Alexander Szalay: ...left. >> Ian Foster: And there are a lot of people who have a lot of ideas for reducing power, running at a much lower power and accepting higher error rates. I think it's not clear which of those will work out, given the challenging economics of scaling any of these up to mass production. >> : So you said that we needed a different sort of way of teaching people how to program. And at the eScience meeting, the one in North Carolina, Michael [Inaudible] made an interesting comment that surprised me, but all the computer scientists in the room kind of nodded their heads. He said that, with all the improvements from Moore's Law and parallelism, we forget about the improvements in software. And he said we're better off using 1980s hardware with today's codes than the reverse. So the question is, in order to take advantage of this, what are the sorts of software problems that need to be solved in order to really take advantage of these kinds of hardware improvements? Like, I have some problems that, you know, are not solved by Moore's Law and parallelism. >> : Right. >> : But eventually, I think, solved by some clever person figuring out how to do it better. So what should we start to be teaching ourselves and our students? >> Mark Stalzer: Well, you know, an N-log-N algorithm beats an N-squared algorithm any day. And so it's important for people to study, you know, the best known algorithms. And there are numerous examples of this. Okay? The point I was making is, if you want to exploit the capabilities of the chip -- you have massive data sets, or you're trying to model global climate change, or something like that -- then the students, in addition to knowing algorithms, need to know more about parallel programming, thread-based programming. The computer science departments know all of this, and they know how to teach it, but I mean it's not common for scientists to actually take these computer science courses. So it's there; it's just another aspect of education that needs to -- It's one more thing for the students to do. So --. >> : So it's well known but only to those who know it well? Is that --? >> Ian Foster: I'd like to comment. Can I make a comment? Right, so the advances that Michael was talking about were algorithms, not software. Those are different things, of course. So I'm not sure that that many computer science departments teach parallel programming classes. >> Mark Stalzer: We do. >> Ian Foster: Some do but not... >> Mark Stalzer: Yeah. >> Ian Foster: ...I think it's not quite as common as you might think. So it's... >> : [Inaudible]... >> Ian Foster: Yeah. >> : Yeah. >> Ian Foster: So I wanted to observe that, as I understand it, at a place like -- Google I'm not so familiar with -- Microsoft, there are thousands of people writing very large-scale parallel programs, but they don't know about multi-threading or MPI. They do it using libraries that are being developed that meet the particular needs of their applications. And I think you mentioned the importance of libraries in your talk.
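Stalzer's remark that an N-log-N algorithm beats an N-squared one any day is easy to demonstrate on a toy problem; the task (finding repeated values) and the sizes below are invented purely so the scaling gap is visible.

```python
# A better algorithm beats brute force long before the hardware matters. Toy problem:
# find values that occur more than once in a list. Sizes are arbitrary.
import random
import time

data = [random.randrange(10_000) for _ in range(5_000)]

def duplicates_quadratic(xs):
    """O(N^2): compare every pair."""
    return {x for i, x in enumerate(xs) for y in xs[i + 1:] if x == y}

def duplicates_linear(xs):
    """O(N) expected with a hash set (sorting first would give O(N log N))."""
    seen, dups = set(), set()
    for x in xs:
        (dups if x in seen else seen).add(x)
    return dups

for fn in (duplicates_quadratic, duplicates_linear):
    t0 = time.perf_counter()
    fn(data)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.3f} s")
```

Already at a few thousand elements the quadratic version is orders of magnitude slower, and the gap only widens with data volume, independent of threads or accelerators.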
But in a sense, if your students are writing multithreaded code, you've failed in some -- not the professor, but the community has failed in some way to build the right infrastructure. >> Mark Stalzer: I mean, an example of that is OpenGL. And, you know, it renders all these beautiful images and it can be highly parallel on GPUs. But the people who use it don't have to know parallel programming. >> : Is that a question of algorithms or language? If we recode [inaudible] we would certainly get all this for free. >> Ian Foster: Well, I mean -- algorithms and language are different. And certainly writing multithreaded codes using a [inaudible] library is not a recipe for happiness. And doing it using [inaudible] is probably far more effective, but [inaudible] in itself doesn't make your algorithms N-log-N instead of N-squared. >> : No, but if you end up with people having a better natural understanding of how multithreaded things work, they will intuitively come up with their own more naturally threaded algorithms. >> : Something different. You and also Dennis have been talking essentially about supercomputer equivalents for data-driven things, whether it's machines like GrayWulf, whether it's the giant data centers. And they are optimized essentially for web search. And there is the cloud, which is a few of these centers. But don't you think that what we should be going to is a [inaudible] hierarchy of clouds -- cloudlets, clouds, the whole dang climate -- and the architectures need to be optimized for different things like, you know, certain types of data mining? And I'm picturing not just the data center itself, but down to the blades, or maybe down to the processor level, instead of one catch-all thing. >> Mark Stalzer: You're still going to have some basic things that you can assemble the machines out of. I mean, that's my only --. >> Alexander Szalay: But I would like to throw in another --. So I was in [Inaudible] at a data-intensive workshop relating to the Grid Forum. And there was an I/O [inaudible] workshop. So there was a lot of discussion essentially about the file systems. And people are obsessed with POSIX file systems scaling up to petabytes and so on. And when you think about it -- So basically, in the underlying file system we have a very complex set of hierarchies where we know exactly where every piece of data is located, at different granularities. And then we hide everything, and POSIX is basically a simple data stream. And then after that we build yet another tier of data structures to again figure out where the data is actually located on the physical storage. It just really doesn't make sense when we get to very large amounts of data. And so there is this recent trend of exposing a lot of these details in the file systems, including in the new version of Windows. So there is a lot of object-level access, basically, to data items. But this was really -- People also started to kind of realize that this is what Hadoop [inaudible] using these words. So basically it's an ultra-simple scheduler, a very, very dumb I/O scheduler. It's very -- People have spent 20-30 years writing schedulers which work on supercomputers and schedule basically CPU computations. But it is not very easy to co-schedule, basically, different types of I/O operations, random access versus sequential; they mess up each other and interact.
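The "very dumb" sequential-scan scheduling model Szalay attributes to Hadoop-like systems can be sketched in a few lines: split a file into contiguous chunks, give each worker one purely sequential scan, and merge the partial results. This is only the pattern, not Hadoop's actual API; the file name, split size, and per-record work are assumptions, and records straddling chunk boundaries are glossed over.

```python
# Scan-and-aggregate sketch: every task is one linear pass over a byte range, so there
# is no random I/O to co-schedule. Purely illustrative, not any framework's interface.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
import os

CHUNK_BYTES = 64 * 1024 * 1024   # assumed split size

def scan_chunk(path, start, length):
    """Sequentially scan one byte range, counting whitespace-separated tokens."""
    counts = Counter()
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(length).splitlines():
            counts.update(line.split())
    return counts

def sequential_scan_job(path):
    """Split the file into chunks, scan each sequentially in parallel, merge results."""
    size = os.path.getsize(path)
    splits = [(path, offset, min(CHUNK_BYTES, size - offset))
              for offset in range(0, size, CHUNK_BYTES)]
    total = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(scan_chunk, *zip(*splits)):
            total += partial
    return total

if __name__ == "__main__":
    # "catalog.txt" is a hypothetical whitespace-delimited input file
    print(sequential_scan_job("catalog.txt").most_common(5))
```

Because each worker touches its chunk exactly once and in order, the behavior is easy to predict, which is the property Szalay contrasts with the complex I/O mixing inside database engines and supercomputer schedulers.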
And when you have only sequential scans on the Hadoop-like systems, and only a single linear [inaudible] of the whole data, it's very simple to deal with. It's easy to predict the behavior. And basically at the same time, in database systems, people have spent 30-40 years, again, optimizing the complex I/O happening inside the database engine. And so in this world I think there is a convergence, I think, slowly emerging, that somewhere there will be some file system which will merge the best properties of both the traditional file systems and basically the database localities and object granularities. >> Ian Foster: But it seems to me that, you know, we -- as you put it, the web-search-optimized systems would assume that data sits in one place and then computation is performed on it. But, I mean, inevitably it seems that there are some storage devices that are cheap but slow and some that are fast but expensive. And so data is going to have to move between these different sorts of systems. So you may well have heterogeneous systems that are optimized [inaudible]... >> Alexander Szalay: And [inaudible] could even be [inaudible] of those things. >> Yan Xu: Any more questions or comments? >> : Again, something different. So moving data is where most of the trouble is, right? Power, [inaudible]. And so is it worth thinking in terms of changing our algorithms to essentially data-mine streams in real time and never see the same data again? Which would be a very different approach from having a stationary archive and talking at it from all different directions. >> Alexander Szalay: [Inaudible]. >> : Well, [inaudible]. >> Alexander Szalay: [Inaudible]. >> : Well, but all that is throwing away data at an instrument level. What I'm thinking of is at a science-grade level. >> Mark Stalzer: It may get to the point where it's actually cheaper to re-compute things than to pay the cost of moving the data. >> : Well, for simulations, but... >> Mark Stalzer: Yeah. >> : ...for the real-life measurements [inaudible]... >> Mark Stalzer: Right. Of course. >> : ...gone, right? >> Mark Stalzer: Right. >> : I think good data can outlast a lot of bad computations. >> Mark Stalzer: It does often, right. >> : We're not able to afford to analyze the data again and again. You can only do it once. >> : We're going to do both. I mean, we're already doing both. I mean, you take the data stream, you extract something from it in a pipeline style, and you put it away in case somebody else has a clever idea later. Yes. >> : What I'm worried about is that we may not be able to afford that second step, with the exponential growth as we see it. >> : Well, we're already -- I mean, George, you certainly know this. We're already at that stage, not so much yet for astronomy, but Earth science and to some extent planetary science... >> : We're there in astronomy now. >> : We're there in astronomy. >> : I was thinking certainly by the time of, say, [inaudible] we could be at that stage. >> : That's right. >> : Well, I mean, LOFAR is already generating more data than LSST. >> : Yeah, but that's... >> Alexander Szalay: [Inaudible]. There are a lot of [inaudible] simulations that we run on our supercomputer. We are already there, because we store many fewer snapshots than what would be ideal for our science. >> : And the climate models too. >> : So with [inaudible] it will take about 12 hours to read 12 hours of observed data off the disk. You never, ever want to do that. >> Mark Stalzer: Right. >> : And so [inaudible] off the disk.
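The audience question about recasting analysis as real-time stream mining, where the data is seen only once, has a standard flavor of answer: keep only small sufficient statistics that are updated online. A minimal sketch using Welford's running mean and variance to flag outliers in a synthetic stream follows; the data source and the five-sigma threshold are assumptions made for illustration.

```python
# One-pass sketch of "mine the stream and never see the same data again": maintain a
# running mean/variance and flag outliers on the fly, storing only a few counters.
import math
import random

def stream_outliers(samples, threshold=5.0):
    """Yield (index, value) for samples far from the running mean, in a single pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for i, x in enumerate(samples):
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)             # Welford's online variance update
        if n > 30:                            # wait for the estimate to settle
            sigma = math.sqrt(m2 / (n - 1))
            if sigma > 0 and abs(x - mean) > threshold * sigma:
                yield i, x

if __name__ == "__main__":
    rng = random.Random(0)
    # synthetic stream: Gaussian noise with an injected spike every 10,000 samples
    fake_stream = (rng.gauss(0, 1) + (50 if i and i % 10_000 == 0 else 0)
                   for i in range(100_000))
    for i, x in stream_outliers(fake_stream):
        print(f"transient candidate at sample {i}: {x:.1f}")
```

Nothing about the raw stream is retained, which is exactly the trade-off discussed here: the analysis you did not anticipate in advance can never be rerun on data that was never stored.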
In which case you don't store it, obviously. And so we're planning and [inaudible]. If one day you do get this better algorithm, you're going to need a hell of a computer. This disk actually is [inaudible]. >> : Three minutes. Oh, I have the perfect question for three minutes. [ Audience laughter ] >> : Does anybody dare to speculate about quantum computing? [ Multiple inaudible audience responses ] >> Ian Foster: Maybe yes, maybe no. [Inaudible]. >> Alexander Szalay: Can I... >> Yan Xu: Or we can lighten the question and just give him some really good remarks on what can be done with commercial and inexpensive hardware -- any comments on disruptive technologies like quantum computing or optical computing? >> Alexander Szalay: I would say memristors. They are actually much closer to reality than, I think, quantum computing will be in our lifetimes, or in my lifetime. Memristors will rewrite computer science if they work, if they become practical, because every memory element will also be able to do arithmetic operations. So we can throw out all of our algorithms if that works. It will be an interesting world if that comes to --. >> Yan Xu: Okay. Thank you very much, and let's clap for the panelists.... [ Audience applause ]