>> Daron Green: So I first met Ian many years ago when he was trying to sell me a story. The story he was trying to sell me then was parallel Prolog. Goes back some years, Ian, right? And then I think it was Fortran M, which was the formal methods in Fortran, that complicated mix of things. Ian has done many things, and the latest thing he's been doing is looking at infrastructure for collaboration, grid middleware and things like that. But he has many more things to his expertise, and he's always been involved in more projects than I could believe possible. So I'm sure Ian's going to tell us some of the things he's learned, and his topic today certainly chimes very much with what we're trying to do in External Research with the scientific community. So Ian, delighted to have you in Redmond.

>> Ian Foster: Okay. Do I need to hold my hand up, too? I met Tony when he was at Southampton, and he tried to recruit me there, arguing that my mother-in-law lived just across the (inaudible) and that would be a positive reason. He didn't know my mother-in-law. So anyway, what I want to do is talk to you about some ideas we are pursuing at the University of Chicago and Argonne in the context of the Computation Institute. This is something I took on a couple of years ago, running the Computation Institute. I think I have a slide about it. The goal is to further investigations of challenging problems in the sciences, often system-level problems that require the integration of expertise from many different disciplines. We have about 100 fellows, 60 staff, and a fair range of projects, and one of the things this work has led me to do is to become increasingly interested in the problem of how we accelerate the data analysis process, the research process, and the tools that we might need to achieve that. So we'll start off with a little historical slide just to remind us how life used to be. Back in the 1600s, if we were interested in something like this, we would go through a fairly simple process by which we got some money and built an observatory. This is Tycho Brahe, by the way. And then we'd recruit a young upcoming fellow like Kepler, he'd analyze our data, and eventually we'd learn something of interest. Along the way he might, some say, have poisoned his advisor in order to get access to the data, so even back then access to data was a challenge. But clearly the amount of data that people were dealing with back then was not a tremendous problem. So let's fast-forward 400 years and see how things have changed. Clearly, automation of the data collection process has resulted in probably nine orders of magnitude increase in data rates. The amount of data that we have access to just in astronomy is around about a petabyte this year or next year. A major change, of course, is that the number of investigators has changed dramatically, from around about one back in 1600 to certainly tens of thousands nowadays in astronomy, and maybe a million or more if you count amateur astronomers. With computational speeds and the amount of literature, things are very different. And of course this results in new problems for the investigator. So let's look at another field that is also of interest, biomedical research. At the same time, if you wanted to perform some research, your main problem was finding a body to cut up, which was quite a challenge back then. But nowadays, of course, things have similarly changed in a dramatic way.
This is a slide put together by John Wooley, who tries to quantify the number of entities that one needs to deal with if one is investigating either biomedical research or basic health care research. And the associated numbers, of course, are growing on very dramatic exponentials, and in every field from astronomy to biomedicine we see people struggling with the fact that what used to fit in their notebook, then on their floppy disk, then on their hard drive, no longer seems quite so easy to deal with. So the next slide is a cultural reference that you may -- actually, no, it's the slide after that is the cultural reference. Here's a slide that quantifies one effect of that. This is a log plot on the Y axis; it's the number of gene sequences that have been collected in the last decade or so. And the blue line is the number of annotations, for example annotations asserting knowledge or belief about the function of particular genes. You can see that's increasing far less rapidly, and that is a sign that people aren't able to keep up with the amount of data that's being produced, in terms of the analysis that's being performed. So this is the slide with the subtle cultural reference. Does anyone recognize this? The Black Knight from Monty Python and the Holy Grail. He's saying it's just a flesh wound, and his head is up. And of course many scientists will simply say we just need a bigger disk and a workstation and we'll be okay. But I think increasingly people are realizing that that is no longer the case. So instead, what I believe we see a need for is something we might call an open analytics environment. Let me explain what that is. I've adopted a cooking metaphor; for some reason that's a supercomputer down on the bottom. What is an open analytics environment? This is something into which many people -- certainly individual investigators, but teams also -- can pour their data, whether scientific measurements, information about the network structure of their data or their domain, or the scientific literature; we'll come back to that in a second. Into which we can also pour programs, service definitions, stored procedures and rules, and out of which we can then extract results. We'll look in a bit more detail at what is involved in each of these steps in a second. And we want this environment to have, at least virtually, the property that we no longer think of what we can do with our data in terms of what is possible on a single computer, but think, at least ideally, in terms of what we could do if we had unlimited storage, unlimited computing; if we weren't constrained by the fact that different pieces of data are in different formats; if we didn't have to worry about the fact that the analysis routine we got from our friend is written in MATLAB and we're running on a platform that doesn't support it, et cetera, et cetera. And also, if these tools supported all of the collaborative processes that really form part of the scientific problem-solving process: we could version things, we could document their provenance, we could collaborate with others in analyzing them, we could annotate them with arbitrary annotations. If we had such a system, which is very far from what most researchers have available to them today, I think we could really transform the way that research happens in a wide variety of fields.
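To make the open analytics environment concrete, here is a minimal sketch of the kind of interface such a system might expose: data and programs go in, results come out, and every object carries provenance and annotations. All class and method names are hypothetical illustrations, not any existing system's API.

```python
# Minimal sketch of an "open analytics environment" interface; every
# name here is invented for illustration.
import uuid
from dataclasses import dataclass, field

@dataclass
class Item:
    """Anything poured into the environment: data, a program, or a result."""
    content: object
    kind: str                                        # "data", "program", "result"
    provenance: list = field(default_factory=list)   # how this item was made
    annotations: list = field(default_factory=list)  # tags, beliefs, notes
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

class AnalyticsEnvironment:
    def __init__(self):
        self.store = {}                              # id -> Item

    def put(self, content, kind, provenance=None):
        item = Item(content, kind, provenance or [])
        self.store[item.id] = item
        return item.id

    def annotate(self, item_id, note):
        self.store[item_id].annotations.append(note)

    def run(self, program_id, *input_ids):
        """Apply a stored program to stored data; the result records which
        program and which inputs produced it -- its provenance."""
        program = self.store[program_id].content
        inputs = [self.store[i].content for i in input_ids]
        return self.put(program(*inputs), "result",
                        provenance=[("derived-by", program_id),
                                    ("inputs", list(input_ids))])

env = AnalyticsEnvironment()
d = env.put([3.1, 4.1, 5.9], "data")
p = env.put(lambda xs: sum(xs) / len(xs), "program")
r = env.run(p, d)                                    # result carries provenance
env.annotate(r, "mean of pilot measurements; provisional")
```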
So let's look at -- oh, first of all, I know the word open is sometimes one that isn't popular here at Microsoft, so I thought I'd put up a -- I got a dictionary out to see what it meant. And of course it means many things besides the ones that Richard Stallman likes to -- I guess he doesn't use open, he uses other things. But "having the interior immediately accessible," "generous, liberal or bounteous," and then "not constipated" is the one I like the best. Okay. So what goes in? I'll talk about some of the applications that we're working with, and I'm sure you've got your own applications that you're familiar with, at least some of you. But as I indicated before, scientific data, which comes in many different forms: text, tabular data, image data -- this is functional MRI data. In the biological sciences, of course, we've got many, many instruments, each of which is capable of generating data at extremely high rates. Network data, the scientific literature -- there's a few others there, I'm sure -- different instrument data, clinical data in the case of the biomedical sciences. If you're a political scientist, online news feeds. These are all things that, if we can put them into a common place, will allow us to pursue interesting connections, and I'll mention some of those connections in a second. We want to be able to then establish and assert connections between things and tag data with hypotheses, you know, putative connections, perhaps putative beliefs regarding the reliability of the data or its meaning. We also then want to be able to throw in programs, and programs can be workflows -- this is a Taverna workflow -- they can be parallel programs -- this is a Swift program; Swift is the system developed at Chicago by Yong Zhao, one of your staff here, when he was a Ph.D. student there -- or rules, perhaps, or stored procedures. Programs written using things like MapReduce or Dryad, SQL procedures, et cetera, et cetera -- programs written in many different programming notations. Optimization methods and so on and so forth. And of course, ideally, we want to be able to associate these programs with the data, so when data comes in, it can easily be operated on by any of these procedures and transformed into a form that may then be operated on by some other procedure. It's amazing to see the tremendous enthusiasm that MapReduce, which purports to do that for a very simple class of problems, has generated; I mean, it's almost laughably simple, but people see it as a somewhat liberating tool for data analysis. And then I couldn't think of how to illustrate this part of it with beautiful pictures, so let's just talk. What do we want to do to make this happen? We need to be able to run any program and store any data without regard for format, without regard for issues of scaling. And of course scaling issues always arise in practice, but we do want to be able to operate on terabytes and perhaps petabytes of data and perform computations that involve the comparison of many elements within different terabyte and even petabyte data sets. So that means substantial scale. It also means the ability to do away with the platform differences that currently hinder so much interoperation. So, you know, the tremendous popularity of platforms like Amazon EC2 clearly has something to tell us there. There should be built-in indexing so we can keep track of what's in there and what can operate on what.
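Since MapReduce is described above as almost laughably simple, a few lines of plain Python (no Hadoop involved; the word-count-style example and its inputs are invented for illustration) show essentially the whole programming model being referred to:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """The whole MapReduce programming model, minus the distribution: map
    each record to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical use: count gene symbols mentioned in annotation strings.
annotations = ["BRCA1 repairs DNA", "TP53 suppresses tumors", "BRCA1 binds BARD1"]
counts = map_reduce(
    annotations,
    mapper=lambda line: [(w, 1) for w in line.split() if w.isupper()],
    reducer=lambda key, values: sum(values),
)
print(counts)  # {'BRCA1': 2, 'DNA': 1, 'TP53': 1, 'BARD1': 1}
```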
And then, under the covers, provisioning mechanisms so that we can allocate resources to these different, potentially very large demands fairly easily. And then, what comes out? So I was particularly proud of this graphic. What I'm trying to illustrate here is that as data comes out of this thing, it will be rich data that has provenance information associated with it. So, you know, this data was generated by this Taverna workflow, one of its inputs was this network that came from somewhere else, and maybe these are the people that have expressed some opinions on the validity of that workflow. And over here we've got a parallel computation that generated this piece of data, which operated on this MRI data using an algorithm described in this piece of scientific literature. I think you get the picture. Okay. And then finally, the notion that analysis is not just a one-off procedure, it's a process -- we need to bear that in mind. Over time we add data, we transform it, we annotate it, we search for it, we add to it, we tag it. We may visualize it. Others may come in and discover it, extend it. We may group things together in different ways. We may share it or not with others. These are things that need to be supported in this sort of environment. Okay. So I guess I'll skip this slide. So far I've talked in a very abstract sense. An obvious question that you might ask is, is this some huge central -- what's the word nowadays -- cloud, or is it some distributed environment that links many different subclouds, if you like? I think the answer has got to be both. Many of the algorithms that we need to be able to execute require strong locality of reference and so have to be performed on data that's centrally located, but ultimately data will end up being distributed, and to match those two worlds, things need to be able to move from one place to another. Okay. So now I want to talk a bit about some of the things that we're doing at Chicago in an effort to realize some of these capabilities. And of course we realize that this is a vast endeavor and we're not going to be able to undertake it by ourselves, and that's one of the reasons why I was interested in coming out and visiting here. So first of all, here is a list of some of the applications that we're working with. There are quite a few others, but these are all areas in which we have particular expertise, either at the university or at Argonne National Lab. Astrophysics: so this is an exploding Type Ia supernova. Using supercomputer simulations, a group at Chicago has shown a putative mechanism that seems, for the first time, believable in explaining why supernovae happen. This turns out to be a data problem, because what they produce is tens of terabytes of data that then needs to be made available to the community and compared with observational data, so it becomes a data resource. Cognitive science: huge amounts of work on different aspects of MRI studies, for example. East Asian studies: you might not have thought of that as being data intensive, but it turns out this group has been going to China and taking extensive photographic surveys of various temples, and thanks to help from (inaudible) we've had the Photosynth group helping them to visualize some of their data. Economics is, in my view, a discipline that is about to explode as a consumer and producer of data.
I'll say a bit more about that in a minute. But this is a fairly detailed map of occupational choice and wealth in Thailand, something that turns out to be of great interest computationally. Environmental science, of course, is familiar to us. Epidemiology. Also genomic medicine -- we're doing a lot of work in that area, although I won't say too much about it here. Neuroscience: so this is a little chip thing you can stick in, typically, a monkey's brain -- but you could stick it in your own, if you wanted -- and get large amounts of data. Political science: I have a friend at Google New York, and thanks to him we're getting a feed of about two million online news articles a day, sticking them into a database, and then performing data-intensive analyses on that database to look at things like political slant in newspaper reporting. Sociology, physics, so on and so forth. So there's a huge number of people, each of whom has got their own data-intensive science to perform, and each of whom is currently limited in their ability to perform that science. So for example, these guys performed what was the world's largest turbulence simulation last week. It took them a week, 11 million CPU hours; it then took them, I think, three weeks to move the data back to Chicago and about three or four months to perform the analysis, because of the limitations of the tools they have available. The economics guys have spent the last year working out how to transform their data to make it available to their analysis routines. The political science people, we started helping them sooner, but they're still really struggling with how you reliably download feeds of millions of articles a day and then process the data. But what I find really exciting is the connections that start to appear when you start thinking of this data as being in a single place. So here are a few. Looking at the economics of development in Thailand -- well, that's in many ways tightly related to issues of environmental modeling and climate change. Those in turn are related to issues of epidemiology: if you're trying to plan a malaria eradication program, then you're going to want the information that these guys have got on access to transportation, wealth, distribution of hospitals and so forth. And you may even be interested in how articles about various eradication schemes are reported in the press, so the political science news feeds start to become useful. We have a big economics project that we're getting underway, part of it relating to climate change, and one of the issues that people were interested in is how you persuade people to make environmentally sound choices -- so then the cognitive science work perhaps comes in as well (inaudible), but that's taking things just a little bit further. The second thing we've been doing over the last few years is putting in place some very substantial hardware infrastructure. I'll just say a few words about this. So this is a fun computer. Has anyone heard of a SiCortex system before? It's based on a MIPS processor; it's got 1,000 chips, 6,000 cores, and does about six teraflops. But the neat thing is it only consumes 15 kilowatts. So it's a very small, low-power system. This is a somewhat larger, medium-power system, the second fastest civilian supercomputer in the world, the IBM BG/P.

>>: (Inaudible).

>> Ian Foster: This is at Argonne, also. We're also getting one for the Computation Institute at Chicago. We're using both of these for data-intensive computing.
We've got access to remote systems like the TeraGrid. And then we're just in the midst of acquiring our own so-called Petascale Active Data Store, which is our first attempt at building a system that can service this cauldron, if you like -- the framework for investigating some of these issues. Of course it could be more substantial, but this is what you get for a couple million dollars. It's got about 500 terabytes of reliable storage; a compute cluster attached with very high I/O bandwidth and some built-in GPUs; and a multicore system for high-performance analysis. And then the goal is to put in place the mechanisms that will allow for dynamic provisioning of both data and computing, parallel analysis routines, remote access mechanisms, real-time and staged data ingest mechanisms, and perhaps to start looking at how we might offload to remote data centers if they were available.

>>: (Inaudible).

>> Ian Foster: Yes. So that was a request from some of the biologists, who are very concerned about having to take the (inaudible). Well, I mean, this is RAID, but of course if that particular device is destroyed, then you've lost your data, so they want to be able to move things to a remote location.

>>: I was just questioning (inaudible).

>> Ian Foster: Yeah, that's a good question. There's a couple of reasons for it.

>>: (Inaudible).

>> Ian Foster: No, I know.

>>: (Inaudible).

>> Ian Foster: Yeah. Tape is always about to become the technology of the past.

>>: IBM's system?

>> Ian Foster: Yes. That's interesting. So we've just got the money from NSF, and IBM is our current vendor, so we haven't actually signed any contracts, but that's one. And this is a DDN system -- DDN, DataDirect Networks; they're the vendor here. And this is, you know, a PC cluster, and currently IBM is the system integrator.

>>: One more question about that: (inaudible) data centers, I mean, are we physically moving tapes or are we sending it over a network?

>> Ian Foster: Yeah, that's a good question. Whenever possible we move data over networks because it's easier, but of course one option is also to move tapes, and that clearly has very high bandwidth.

>>: Ian, in earlier projects you said that they generate enough data that it took three weeks to move it back over to the U.S.?

>> Ian Foster: Yes, they were moving it from Livermore to Chicago, and it was only moving at 20 megabits per second.

>>: Out of curiosity, (inaudible) capacity, why wouldn't you put this stuff on tape or disk and ship it over next day?

>> Ian Foster: Yeah, that's a good question. Why didn't they do it? Probably no one there was ready to mount it onto the disk; maybe they didn't have the capabilities to do that.

>>: Okay.

>> Ian Foster: But certainly, in that case, it would have been a perfectly reasonable solution. I'm no network bigot, as it were. Anyway, let's now go on and say a few words about some of the methods we've been developing, in various projects, that address various of the problems described at the beginning of this talk. Now, none of these methods, either individually or in their entirety, is a complete solution, but they do provide us some tools that have allowed us to move forward in some useful way.
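As a back-of-the-envelope check on that 20 megabits per second anecdote, a small helper shows why wide-area copies of terabyte-scale data take weeks; the dataset sizes below are illustrative, not the actual simulation's:

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move the given volume of data over a link of the
    given speed, assuming the link is fully and continuously utilized."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86400

for tb in (1, 5, 50):
    print(f"{tb:>3} TB at 20 Mbit/s: {transfer_days(tb, 20):6.1f} days")
# 1 TB ~ 4.6 days, 5 TB ~ 23 days, 50 TB ~ 231 days: shipping disks or
# tapes quickly becomes competitive with a slow wide-area link.
```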
So we've got work on HPC system software at Argonne, some of which Microsoft has supported -- the work on MPICH, the Parallel Virtual File System; work I'll say a bit more about on collaborative data tagging; data integration work, which addresses this challenge of separating the representation and semantics of data; high-performance data analytics and visualization -- various tools; we work with Hadoop, and we have our own system called Swift, which we think is better for expressing what you might call loosely coupled parallelism, particularly in data-intensive applications; dynamic provisioning mechanisms; service authoring tools, many of them developed in collaboration with the cancer biology community; provenance recording and query. We have done a lot of work in service composition and workflow. We're working a lot now with the Taverna system -- I guess that was developed within the UK e-Science program?

>>: (Inaudible).

>> Ian Foster: Yes, it's great. We have got the cancer Biomedical Informatics Grid to adopt that as their technology of choice. Virtualization management and distributed data management. And I'll say a little bit about some of these things, not too much.

>>: (Inaudible) in Taverna -- we spent a little time in the UK doing software engineering on the research tool Taverna, and when I went to see (inaudible) they complained it was unfair.

>> Ian Foster: It gave them an unfair advantage, yes.

>>: It was documented work.

>> Ian Foster: Yeah. The caBIG people -- this is the National Cancer Institute's cancer Biomedical Informatics Grid -- are incredibly careful and thorough in their evaluations, so I felt very pleased that they chose to use our Globus software for a lot of their work. But they also adopted Taverna after a many-month evaluation process, so that was very pleasing as well. Let me say a few words about this. This is work that Zack Onesteroff (phonetic) here has really been leading, so I could even ask him, but maybe I'll talk about it. I put up this map of Thailand because this is some of the data from one of the Thai surveys, and if anyone has ever worked with social science data, this is very typical documentation. You've got a very rudimentary description of the data set. These are some of the variables. They have helpful names like RR7B, or HC3, and if you go off and read the manual you can find out that this is a representation of the name of the household owner, I believe. So a typical problem with social science data, and I think a lot of scientific data, is that it's relatively impenetrable to anybody but the expert. So we believe that collaborative tagging tools can be used to allow individuals to start to tag and document the paths that they follow through the data: perhaps identify the fields that they use in particular studies, maybe annotate data or procedures that can be used to compute new data products. And then, if you allow these tags to be shared, you get the collaborative tagging or social networking part of the system, where one individual working on a data set can learn from the experience of others.

>>: This is different from microformats?

>> Ian Foster: I don't have a deep understanding of microformats, but yes, I think you might use microformats to structure these tags.

>>: And you choose not to use (inaudible) and things like that?

>> Ian Foster: Right.
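To illustrate the collaborative tagging idea -- cryptic variables like RR7B annotated by many users, with the tags shared so one analyst can learn from another -- here is a small sketch; the class, dataset name, and tags are all hypothetical:

```python
from collections import defaultdict

class TagStore:
    """Shared tags over dataset fields: (dataset, field) -> list of
    (user, tag) pairs. A crude stand-in for a collaborative tagging service."""
    def __init__(self):
        self.tags = defaultdict(list)

    def tag(self, dataset, field, user, text):
        self.tags[(dataset, field)].append((user, text))

    def lookup(self, dataset, field):
        return self.tags[(dataset, field)]

store = TagStore()
store.tag("thai_survey_1997", "RR7B", "alice", "name of household owner")
store.tag("thai_survey_1997", "RR7B", "bob", "used in 2008 wealth study")
print(store.lookup("thai_survey_1997", "RR7B"))
# A newcomer hitting the opaque field RR7B now sees what others learned.
```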
So as you probably know, there's a big argument that rages between the believers in structured and unstructured metadata, or tags. I think what we're doing is actually neutral with respect to that, and in fact we have people at the university who want to start defining ontologies that could be used to structure these tags, or alternatively, to infer ontologies from the tags that are applied. I think that's sort of an independent issue. Tagging and social networking are of course very familiar concepts; they are often applied to photos and Web pages. Here we're trying to apply them to structured scientific data. I think it could lead to very exciting results. Certainly the scientists we describe it to find it very, very attractive. A second thing I'll mention addresses a problem I was talking about with, I think, Jose just before the talk: the problem that so much scientific data actually has this sort of structure. This is typical -- if you look at a functional MRI data set, this is what you might see. Yong Zhao struggled with this problem for quite a while in his Ph.D. thesis. If you have a bit of inside knowledge, you know that these file names and directory names encode information about the structure of the data, and that the real structure looks like this: it's a nice logical, hierarchically organized collection of MRI images, with runs that are composed of volumes, which when aggregated describe a subject, and subjects form parts of groups, and groups form parts of studies and so forth. So what we've been doing with the XML Dataset Typing and Mapping project is developing tools that allow you to map smoothly between these two views, and then use an XML representation of the logical view to describe your analyses, with the appropriate operations happening on the physical data as needed. So you end up with these nice -- well, there's an XML representation, but this is the logical structure that you'll use to write your programs, and under the covers we're going to be reaching in and accessing data which may be stored in file systems, in databases, in other forms. I'll come back to that in a second, when we talk about one of the tools that we've developed to perform data analysis. A common problem that we've encountered -- and the Dryad group here, I think, is addressing similar concerns -- is that people, who could be individuals, groups, et cetera, come in and want to perform analyses on these logically structured, hierarchically organized data sets. For example, they may want to take one of these studies and compare it with a second study, or they may want to scale the members of one study in order to adapt it to a reference image and then see how individuals' brain function changes over time -- for example, following recovery from a stroke, which is one study that we're currently involved in. And that can involve sometimes literally millions of fairly complex computational tasks. One wants to be able to express that concisely, map it onto the appropriate computing resources and -- because we're focused on process -- feed the results back into our open analytics environment for further analysis. And so the Swift system that we developed allows us to express these sorts of analyses using a nice high-level functional programming syntax. So here, for example, we have a procedure reorientRun, which we apply to a run, and for each volume in the run we're going to apply a reorient function.
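The logical-over-physical mapping described here can be suggested in a few lines: a hypothetical mapper presents an fMRI directory tree as a nested structure, so analysis code iterates over the volumes in a run rather than over file names. The directory conventions and the reorient stand-in are invented for illustration, not the actual XDTM or Swift machinery:

```python
from pathlib import Path
from collections import defaultdict

def map_study(root):
    """Hypothetical mapper: files laid out like sub01/run02/vol0003.img
    are presented as study[subject][run] -> sorted list of volume paths."""
    study = defaultdict(lambda: defaultdict(list))
    for vol in Path(root).glob("sub*/run*/vol*.img"):
        study[vol.parts[-3]][vol.parts[-2]].append(vol)
    return {s: {r: sorted(vs) for r, vs in runs.items()}
            for s, runs in study.items()}

def reorient(volume_path):
    """Stand-in for the real per-volume program (e.g. an external tool)."""
    return f"reoriented({volume_path.name})"

# The Swift-style "foreach volume in run" becomes an ordinary loop over
# the logical view; a real system would dispatch each call in parallel.
study = map_study("/data/stroke_study")
results = {(sub, run): [reorient(v) for v in vols]
           for sub, runs in study.items() for run, vols in runs.items()}
```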
Pretty straightforward stuff, but relative to what's actually happening under the covers -- people operating on files and complex directory structures -- it's quite amazingly simple. So this is the sort of thing that we're using in our own implementations of this open analytics environment.

>>: (Inaudible) magic that happens.

>> Ian Foster: Yes.

>>: Because mapping from that into what's actually taking place (inaudible) is the key thing.

>> Ian Foster: Right.

>>: So could you just say a little bit more about --

>> Ian Foster: Yeah, so the magic, much of which was implemented by Yong Zhao, by the way, is fairly straightforward. So here you're calling this reorientRun function, and you're passing in a pointer to a run -- well, there are two runs being passed in, I guess. Now, this foreach -- for each volume in that run -- involves an operation to look at the underlying physical construct, which in this case is a directory, find out how many files are in the directory, and return an XML structure representing the files in that directory. You're then going to invoke this program, and it turns out this program, via another little interface definition, is, I think, probably a MATLAB program in this case, so there will be some logic to dispatch that computation to the appropriate place for the computation to occur -- and this could be on a machine such as the SiCortex machine or the PADS cluster. And then eventually the results are either left there for further computation or gathered back to some archival storage to be stored. And so on and so forth. And the result is something like this, where each of these blue dots is a potentially complex computation. We're tracking provenance using this simple provenance data model: you've got procedures that are called, multiple procedure calls linked together into workflows, workflows operate on data sets, and data sets may have annotations, and so on and so forth. There's been quite a bit of work in building community consensus on issues of how to record provenance. Roger Barga has been at various of these meetings, the International Provenance and Annotation Workshops, which we've been holding jointly with, again, the UK people. I'll say just a few words about multi-level scheduling. This is sort of interesting. In our view, it's not enough to simply make data available for people to query; we need to allow people to compute on data. And if you want to be able to compute on data, then you'll sometimes have very large amounts of computation to perform, particularly when you're performing a comparison. So we need to be able to dynamically allocate resources for that computation to occur. And we've done a fair bit of work on this -- you can ignore this faded-out part, but this is sort of the runtime system for the Swift system. We dynamically allocate virtual execution nodes on a variety of different computational platforms: Amazon EC2, Open Science Grid, TeraGrid, and more recently our big local parallel computing platforms.
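The simple provenance data model sketched above -- procedure calls linked into workflows, operating on data sets that carry annotations -- might be rendered in miniature like this; the classes and fields are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    annotations: list = field(default_factory=list)

@dataclass
class ProcedureCall:
    procedure: str        # e.g. "reorient"
    inputs: list          # Datasets consumed
    outputs: list         # Datasets produced

@dataclass
class Workflow:
    calls: list           # ProcedureCalls, in execution order

def derivation_of(dataset, workflow):
    """Answer the basic provenance query: which call produced this
    dataset, and from which inputs?"""
    for call in workflow.calls:
        if dataset in call.outputs:
            return call.procedure, [d.name for d in call.inputs]
    return None

raw = Dataset("raw_volume_0003")
fixed = Dataset("reoriented_0003", annotations=["orientation corrected"])
wf = Workflow([ProcedureCall("reorient", [raw], [fixed])])
print(derivation_of(fixed, wf))   # ('reorient', ['raw_volume_0003'])
```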
>>: (Inaudible).

>> Ian Foster: Yes?

>>: So for the kind of applications that you're considering, how much of, sort of, the smarts of what you need to do involves managing I/O versus managing CPU?

>> Ian Foster: Of course the answer is it depends. I'd say in the early ones we've worked with, the smarts have focused on CPU, and increasingly now we're addressing problems where the smarts involve I/O and we need to be either caching data or moving computation to the data.

>>: Did that change happen because you sort of feel like you (inaudible) that are necessary on the computation, or do you see a trend that's sort of pushing things in the direction of placing higher (inaudible) consideration by (inaudible)?

>> Ian Foster: No, I think it's more that we feel we know how to address the computation, so now we can address a wider range of problems. This shows what happens when you run -- this is a problem where I/O is not a big concern. Each of these little dots is a task running on this 6,000-core SiCortex machine. So this is a large parameter study using a molecular dynamics code. I'll say a few words also about the issues of distributed data management. So this is a typical use case with which we work, where data is being produced at a scientific observatory, in this case the Gravitational Wave Observatory, and I guess they are not big believers in sophisticated data management, so they simply replicate all of their data to all of the sites participating in this experiment, which of course makes the data locality issues for analysis fairly straightforward. That means they are replicating about a terabyte of data per day to eight sites, and using some of the software that we've developed, they are able to do this with a lag time of typically not more than an hour from when the data is initially produced at the observatory. Now, clearly, data replication is only a partial solution, but it's often a useful solution for making sure that the data you need to analyze is available where you need to analyze it. So now I wanted to take a few minutes and talk about a couple of applications before we wrap up here, and show how we're applying these methods in various contexts. The first one is a system called the Social Informatics Data Grid, developed by Bennett Bertenthal and his colleagues at Chicago. He's now actually at Indiana University, but the development continues at Chicago and UIC. As you may guess from this picture, their interest is in enabling the multimodal analysis of cognitive science data. So they've got people doing things, and they're filming them and videoing them, and sometimes they'll be taking various other sensor data, and of course they then want to compare them with other people, other experiments, and also analyze the data in various ways. And historically this community has been incredibly bad at sharing data. So what the SIDGrid system has done -- despite the name, it's really basically an open analytics environment -- it's a system into which many people can pour their experimental data in different forms, into which we can register analysis programs, mostly Swift scripts, with which we can associate metadata, and then, using a variety of platforms, many of them Web based, others rich clients, we can browse data, search it, preview content, transform it from one format to another, download it, analyze it and so forth. So in a sense, in a very discipline-specific way, it's an instantiation of an open analytics environment. And this has apparently been very successful; they're starting to get a lot of data loaded up into it, and you can look at it using Web portal tools that let you search things or download things.
ELAN is a system from -- I'm not sure where from, maybe Germany -- it's a rich client tool. This is a system from the UK called, what is it, the Digital Replay System, which you can use to search and then grab data from this repository. I thought I might have had an example of a Swift script running, but as far as the average user is concerned, they have their nice traditional client tools, but they're reaching out over the network and accessing these very rich databases, which themselves are backed up by the power of high-performance parallel computers as needed. Let me mention one or two more examples. So this is a project that we're getting underway with some startup funding from Chicago and Argonne, and we're also talking to the MacArthur Foundation about it. The goal here is to address the challenges of socioeconomic and environmental modeling. We have a very grand title, the Community Integrated Model for Economic and Resource Trajectories for Humankind, or (inaudible). We have economists, environmental scientists, numerical methods people, geographers, et cetera, all involved in addressing these questions. You want to ask questions like: okay, there's a shortage of food in Africa; is that due to biofuel policies in the U.S.? It's easy to say yes, but it's not obvious one way or the other. If you want to be able to answer that question, you need to be able to gather data from many sources; incorporate models of agriculture, transport, (inaudible) policy, et cetera; and put in place state-of-the-art economics methods that address dynamics, foresight and uncertainty and have high resolution -- pull these together in order to, well, answer that question. What we believe the community is ready for is a sort of open platform in which different people can contribute different component models and data in order to ask these questions. And I would say this is also potentially applicable to problems in areas like epidemiology and development economics. In fact, one of the early applications we're working with is in development economics. So this is a picture I showed earlier. This is actually a model showing predicted levels of entrepreneurship, i.e., non-subsistence farming, in various parts of Thailand, based on a very simple model that assumes that development is only based on your access to wealth, where wealth is determined by your family structure and previous occupation. The red shows where we're predicting that entrepreneurship is occurring when it's not, so we're overpredicting in this northeast part of the country. If we then build a model that takes into account distance to major cities, then you can see we (inaudible) there; green shows that we're getting a pretty much perfect match. So this emphasizes that in order to study these economic and development issues, we need access to really large amounts of geocoded data and then the computational models that can take this data and perform the computation. This is running on a modest-scale parallel computer, using data that's obtained from Thailand and manually transcoded, but we think we can automate a lot more of those processes.
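The Thailand example -- predicting entrepreneurship from wealth alone, then improving the fit by adding distance to major cities -- amounts to adding a covariate to a simple model. A toy sketch with invented coefficients and data (not the project's actual model) shows how the wealth-only variant overpredicts in remote areas:

```python
import math

def predict_entrepreneurship(wealth, km_to_city, use_distance=True):
    """Toy logistic model: probability a household runs a non-subsistence
    business. All coefficients are invented for illustration only."""
    score = -2.0 + 1.5 * wealth
    if use_distance:
        score -= 0.01 * km_to_city   # remoteness suppresses entrepreneurship
    return 1 / (1 + math.exp(-score))

# Same wealth, different remoteness: the wealth-only model predicts the
# same probability everywhere, overshooting in remote regions -- the
# "red" areas in the talk's map.
for km in (10, 400):
    print(km,
          round(predict_entrepreneurship(1.5, km, use_distance=False), 2),
          round(predict_entrepreneurship(1.5, km, use_distance=True), 2))
```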
I think I've got one last application. This is maybe of interest to some of you; it relates to text mining. So this is a typical biologist: there are literally hundreds of thousands of biological articles produced every year, and clearly no one reads them all. So how do we make sense of what is still, in many cases, the primary means of communicating scientific data? Ideally, of course, people would encode their data in semantic Web representations, but in practice they don't; they write a paper instead, because that gets them tenure or promotion. So there's a group in the Computation Institute, led by Andrey Rzhetsky, that's been building tools to basically automate some of the knowledge extraction process from this data. He has a system called GeneWays, initially developed at Columbia before we hired him away from there. It takes biomedical articles -- he's gone through about half a million at this point, I think -- looks for statements of the form, you know, this enzyme catalyzes this reaction, or this gene signals for this, and then builds up putative networks, cellular networks, which basically capture the information, or at least a subset of the information, contained in those articles. And once you've done that, you can do really wonderful things. One thing you can do is combine this with information about observed phenotypes. For example, you might have one of these families that people have studied, where you find out that certain people have certain diseases. Now, if you're looking at anything other than very simple single-gene diseases, there isn't enough data to make reasonable assignments of potential cause and effect to particular genes, but if you add in this reaction network information, you can start to look for genes that affect things in the same area of the network and use that. And they are getting some good results with, for example, finding potential multi-gene causes for breast cancer. You can also do other things. You can take that data, these vast collections of assertions of either belief or fact, depending on how you view them, and then combine them with other information, for example citation network information -- this is some citation network data obtained by another guy -- and use that perhaps as another source of evidence regarding the strength of belief you may put in various statements. If five people make one statement and one person makes another, contradictory statement, you might believe that the first statement is probably true and the second is not; but maybe you'll find out that the first five people all come from the same lab, in which case you put less belief in that statement. So there's lots of interesting things like that that we're working to do.

>>: Is there a relationship between the algorithms that you've shown in the consecutive slides and Andrey's (inaudible)?

>> Ian Foster: No, quite different. But Andrey and James and I are working now to combine some of this information with that information in useful ways. Okay. So I just wanted to show a bit more detail about three applications, all of which involve lots of complex data and complicated computation. Our belief is that all three of these have a common need for this open analytics environment -- something that will basically reduce the barriers to putting data in, putting programs and rules in, mixing those together, matching things together, and then taking results out in a way that allows you to make sense of what happened, while addressing these other concerns.
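The extraction step described here -- scanning articles for statements like "X catalyzes Y" and weighting repeated claims by the independence of their sources -- could look roughly like the following; the patterns, names, and lab-discount rule are invented for illustration, and GeneWays itself is far more sophisticated:

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"(\w+) (catalyzes|inhibits|signals for) (\w+)")

def extract(article_text):
    """Pull (subject, relation, object) assertions out of free text."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in PATTERN.finditer(article_text)]

def belief(assertions_with_labs):
    """Count support per assertion, but count each lab only once, so five
    papers from one lab lend less weight than five independent ones."""
    labs = defaultdict(set)
    for assertion, lab in assertions_with_labs:
        labs[assertion].add(lab)
    return {a: len(ls) for a, ls in labs.items()}

papers = [
    ("HK1 catalyzes glucose_phosphorylation", "lab_A"),
    ("HK1 catalyzes glucose_phosphorylation", "lab_A"),
    ("HK1 catalyzes glucose_phosphorylation", "lab_B"),
    ("TP53 inhibits MDM2_activity", "lab_C"),
]
evidence = [(a, lab) for text, lab in papers for a in extract(text)]
print(belief(evidence))
# {('HK1', 'catalyzes', 'glucose_phosphorylation'): 2,
#  ('TP53', 'inhibits', 'MDM2_activity'): 1}
```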
Of course, the challenge we face is that we've got various building blocks -- we can buy hardware, but the hardware is far from adequate, and the building blocks we've got are certainly far from adequate -- so we'll be eager to work with people here on doing something like this on a much larger scale. And I'd say my belief is that lots of people are sort of sniffing around this problem of how you create community repositories for data that are more than just (inaudible), and Google has announced, or will announce, plans to create a huge place where all the world's scientific data can be placed. There are sites like Swivel that do that on a smaller scale, but I think there is an opportunity to do something that overtakes the competition in that way and really has a big impact. Okay. So thank you very much for your attention.

(Applause)

>> Daron Green: Any questions for Ian?

>>: Ian, maybe it was the (inaudible) slide again. Is the provenance (inaudible) tightly coupled with Swift -- does Swift naturally carry some of the metadata that allows you to track provenance?

>> Ian Foster: Yes, that's the idea. And I say that with a slight hesitation just because in an earlier version it was completely integrated; I think we've just finished reintegrating it into the latest version of the system. But the data model is fairly neutral with respect to the system that you're using to perform the computation. The idea is -- and the reality is, just recently -- that we now do generate such assertions as computations are performed. And that's actually got all sorts of interesting consequences, as probably Roger could also tell you. You could start to reason about what computations have been performed and which have not. You can talk about how efficiently people are using the system, which data is popular and which is not, and so forth. Yes?

>>: So (inaudible) the dream of sharing all this data -- is the way to do that through federation, or are we actually just putting this stuff out so that everybody can access it and not really worrying about, you know, retaining access control, or basically security privileges and stuff like that for different users?

>> Ian Foster: There's quite a few questions mixed up there together. I know the Google guys' plan is that you upload the data and it's got to be on a Creative Commons license so that anyone can use it. And of course there's a lot of data that has that property, but also a lot of data that does not. So I believe that access control does have to be addressed and is important. Secondly, I think ultimately we have to be federating, because of course there'll be more than one data center in the world. At the same time, an awful lot of people don't want to maintain their own data. I think this cloud notion is very compelling to many people -- certainly to many it is not -- but if I look around the University of Chicago and Argonne, there are literally thousands of little databases, mostly write-only essentially, you know, Excel files, floppy disks, tapes of various sorts, and I think also many people for whom the exponential growth has sort of crossed some threshold in terms of their ability to manage it. And so I think centralization is a very powerful and quite attractive concept for them.

>>: In some areas, like the National Library of Medicine, David Lipman believes that it's better to have it all under his control.
I think that's fine for a subset of the literature, but I don't think that's a scalable model.

>> Ian Foster: Yeah. And even there -- well, the literature, you're right. So even for the various --

>>: (Inaudible).

>> Ian Foster: Even for those databases, there are other people who grab that data and perform their own curation of it and their own annotations -- perhaps automatically generated annotations regarding, you know, inferred function -- or who perhaps have different versions of the same gene, and so I think there is potentially a big distributed truth-maintenance problem trying to find out what's going on. That's an area also of interest to us.

>>: At the cyberinfrastructure advisory committee a couple weeks ago, Bruce Takay (phonetic) and I got them to agree that they would provide the (inaudible) they should have an open data publication policy.

>> Ian Foster: Absolutely.

>>: (Inaudible). And I don't know what will happen to that suggestion, but it's in the minutes and it's official.

>> Ian Foster: I mean, it's a great idea. I think at the same time, as long as there's a significant cost associated with publication --

>>: That's the problem.

>> Ian Foster: -- then people are going to be reluctant to do it. So we need to find ways of streamlining the process.

>>: The other thing is also the cost of curating and updating data. I mean, you need (inaudible) -- for example, I'm told it's over 100 full-time curators.

>> Ian Foster: I do wonder how much of that -- perhaps I'm naive here -- but how much of that we can automate. I'm not sure. Yes?

>>: I think we have to --

>>: I'm not very familiar with Swift, but to what degree has (inaudible) basically unleash the power of (inaudible)? I mean, it seems like even though it looks like a lot of functions, (inaudible) every function can be very (inaudible).

>> Ian Foster: Yes, so it --

>>: (Inaudible).

>> Ian Foster: It's not a -- I mean, it's really parallel logic programming in another guise, but it's addressing, at least currently, something very simplistic really: allowing people to coordinate the execution of many individual sequential programs. They can be parallel programs that are coordinated -- in some cases they are. But it turns out that's how many people do actually think of the work that they need to do. Certainly many problems don't fit into that framework, but especially in the social sciences and biological sciences, a lot seem to have that property.

>>: So how do I know whether my (inaudible)?

>> Ian Foster: I think we'd find out in a few minutes if we sat down and talked about it.

>>: I mean, is it a more general model --

>> Ian Foster: Yes, it is, because Hadoop has basically, of course, map and reduce. Here we have map operations but also more complex forms of interaction. Yeah?

>>: (Inaudible).

>> Ian Foster: Yes. Yes, I think -- I can't remember if we tried running it on Windows, but it's all written in Java, so it would run, yeah.

>>: So the (inaudible) environment that you present -- this big, it's a huge super mainframe -- are there people looking at ways to extract subsets of that to work on --

>> Ian Foster: Right. I think that's obviously got to be part of the model.
You know, for example, the economics people that we're working with are very interested in allowing -- I mean, in a sense the current mode of working is that your data is stored remotely, and when you want to operate on it, you grab a subset. And of course that's something that people want to do, and they'll continue to want to do. But to the extent we can start to push analysis into the cloud, I think we will empower users, because the sheer quantities of data that people want to operate on are too large for people to bring them locally. Another large project we are involved in is called the Earth System Grid, which provides access to all of the Intergovernmental Panel on Climate Change data -- I think 100 terabytes or more. And the current figure of merit that we have there is the amount of data that people download to their workstations for analysis, and that's running at a terabyte of data a day or so. But in my view, ultimately the goal should be that that number goes down, not up, because clearly, as climate model data sets go from tens of terabytes to petabytes in size, it is no longer possible for people to download them.

>>: (Inaudible) analysis of data. So when (inaudible) you are talking about terabytes, you know, terabytes and more. But then as people start --

>> Ian Foster: Yes.

>>: -- finding facts and (inaudible) and looking and aggregating the data, that data is reduced and (inaudible) increases.

>> Ian Foster: Absolutely. So given --

>>: (Inaudible).

(Brief talking over)

>> Ian Foster: And I think ultimately you are interacting with these systems from your desktop, right, and that should be a rich interface, not a simple interface.

>>: (Inaudible) commingling of economic issues -- how do you pay for this -- along with computation issues. When you say data curation, you just mean I'm going to store it somewhere and I'm going to make sure --

>>: That's not what I mean by --

(Brief talking over)

>>: But I think part of the point we should consider is making sure there are compute resources --

>> Ian Foster: Absolutely.

>>: -- so that you can do function shipping, basically, rather than data shipping, and it becomes more of a (inaudible) optimization, you know.

>> Ian Foster: Because otherwise the data is just put away somewhere to die and it's never accessed. And I think many people still think of data analysis as a person looking at data, but increasingly it's got to be programs looking at data. As Alex Szalay says, the amount of data is increasing exponentially and the number of analysts is constant, more or less, so clearly each analyst is going to have to deal with an exponentially growing amount of data, or else you are missing opportunities.

>>: But the people aspect is actually pretty (inaudible), because the question is to what degree, you know, how much in terms of compute resources, relative to the actual storage of data, comes (inaudible) free, you know, along with the data, versus, you know, I'm willing to pay -- so you either make sure there are compute resources close to the data, or I pay for you to give me tapes and I'll do it on my own (inaudible).

>> Ian Foster: We need to wrap up.
But another potential opportunity is that people ask the same questions repeatedly, and -- certainly with environmental data, you know, climate data -- what they do is go through and compute some predefined set of commonly asked questions, you know, seasonal means of sea-surface temperature. So if you want a seasonal mean, that's really easy to get, but if you want something slightly different, then you'd better download the whole data set. But maybe many people would like something different.

>> Daron Green: Okay. I think we'd better wrap up.

>> Ian Foster: Yes. Thank you.

>> Daron Green: And those of you who want to talk to him afterwards, I'm sure he's willing to stick around. So thanks very much indeed.

(Applause)