>> Kenji Takeda: Okay, so I hope you enjoyed the morning session. We've got a very
interesting session this afternoon. The one here is on Communities in the Cloud. So I think one
of the key areas where cloud computing can really help is to allow researchers from around the
world to work together, so we have four talks in this session, and across a variety of different
areas, actually. And Steven Roberts from the University of Washington is going to kick off for
us. He's actually in the School of Aquatic and Fishery Sciences, and he used to transfer gels and
I guess do wet science, but now he says he spends most of his time transferring files. And so
hopefully we'll hear about how the cloud might be able to help him, so please welcome Steven.
>> Steven Roberts: They showed me the keyboard. Where was it? Okay, yes, thanks for
inviting me here today. As Kenji said, I'm a biologist, and what I thought I would do today, and
based on some of the feedback I got from Kenji, was talk about how we use the cloud from a
domain scientist perspective, and I'm focusing a lot on our challenges. So I'm going to kind of
show you what I normally do every day and places where we are thinking a lot about where it
could be a little bit better and a little bit easier on us. And I've never been set up so well for a
talk before, because I won't have to talk about any of the genomics, but that's essentially what
we do. That was great. So this is a slide I put up in all my talks. It's a
little bit meta here, because I will actually talk about open notebook science at the end of it and
the cloud and how we deal with our science, but this is just an indication. You're free to share
these slides, and more information about how to contact me, but I'll get back to the open
notebook science near the end of the talk. So this is really not what I look at -- I normally look at
a computer screen every day -- but to give you a little bit of background about my science, we
really study shellfish. I'm by training a physiologist, a comparative physiologist, and I try to
understand how the environment affects oysters. This is an oyster farm in Puget Sound. This is
an oyster very close up. What we do is we get a lot of the data that we heard of before in terms
of genomics. We look at variation. We look at gene expression patterns, we look at protein data.
We look at epigenetic data, and we try to answer questions about biology and physiology as it
relates to both environmental biology and aquaculture, too. And so, as was mentioned before,
we are not in need of data. We can get data very readily, and I'm just going to show you, this is
kind of my schematic of how we deal with it. So remember that sequencer before and different
type of sequencer, that's kind of represented by that pink arrow. So a ton of it's kind of flowing
into our lab. Normally, we house it in some kind of a network attached storage device, and what
we're doing to do the analysis, similar to what you saw before, is use the cloud. So we're using
things such as -- let me see if this is a pointer or not. We're doing such things as those
assemblies, gene expression analysis, finding SNPs. We're doing it locally. That's what that's
representing, so a lot of computers. We're going back and forth to our NAS, locally, and then
we're using the cloud. So a lot of that primary analysis that we're doing is done in the cloud and
things like Galaxy and iPlant, which I think I'll show you a little bit more about, and I'll talk a
little bit about SQLShare. Hyak is just our in-house kind of supercomputing system, but that's
all command line. These are more GUI interfaces, those middle ones. And this is just to
represent, too, while we're doing that sequence analysis, we have to rely on these external
sources of data, so these are just gene databases, annotations, what a gene does and so on. I am
going to give you a little bit more context, so this is just one of our studies and to get an idea of
some of the questions we might ask and what kind of data that we need, we look at DNA
methylation in the oyster genome. So this is just one gene. We want to know how the black
lollipops or the white lollipops affect the assembly within a gene, so that's representation of
DNA methylation. So we need to know from the sequencer how they are being arranged. We're
actually interested in gene expression levels. We're also actually interested in SNPs and
variations on top of all that. So we're trying to aggregate a lot of data together. And when we
think of data, this is kind of what it looks like. So this might be the very raw. It's got a little bit
of quality information in there and the As and the Cs and the Ts and the Gs. We spend a lot of
time mapping it back to reference genomes, and then we're making tables and then we're joining
it to other kind of aggregate data and different data sets. And we saw some numbers before, but
just in, like, one-seventh of a sequencing machine, some of the raw data would be on the order of
70 gigabytes. Once you map it to a genome, it doesn't get too much smaller. And then we're
making all kinds of tables and we're going back and doing it again. It gives you an idea of the size.
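For readers who haven't seen this kind of raw sequence data, here is a minimal sketch of what one record looks like and how you might count reads and bases. The record is invented for illustration; real 70-gigabyte files would go through standard tools rather than this toy parser.

```python
# A toy look at raw sequencing data: each FASTQ record is four lines -- an ID,
# the bases (A/C/G/T), a separator, and a per-base quality string. The record
# below is invented; real files would be handled with standard tools, not this.
from io import StringIO

fastq = StringIO("@read_0001\nACGTTGCAACGT\n+\nIIIIHHHGGFFE\n")

reads, bases = 0, 0
while True:
    header = fastq.readline()
    if not header:
        break
    seq = fastq.readline().strip()
    fastq.readline()            # the "+" separator line
    fastq.readline()            # the per-base quality string
    reads += 1
    bases += len(seq)

print(reads, "reads,", bases, "bases")
```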
So we use the cloud for a lot of these, because it's big and it's big. So what I mean -- it's
just not a typo. It's big in the sense that it's large in size, so we leverage the cloud. It's also big kind
of in genomics, particularly in what we do, there are a lot of questions that people can ask about
our data that, let's say, we don't really care about, or it's not really our focus, so we share
the data -- there's a lot of big questions. It's compute intensive, so we can take some of those
computations, both CPUs and RAM in the cloud, and the cloud is great for education purposes.
So I also teach, and it's a wonderful tool for not having everybody have to
configure their laptop for analyses. So some of that primary analysis, like what we heard in that
other talk, like just doing those assemblies, RNA-seq analysis, finding SNPs, finding DNA
methylation, is done in these two platforms, which is what we rely on heavily in addition to our
own small clusters and high-end computers, and we go back and forth. I don't really
want to talk about these, but they're similar to what you can imagine what we saw before. Most
of us have to deal with the command line. These are some nice tools that allow the GUI
dropdown interface, high-speed computing resources. Once we put all that together, if you add
up the time, we actually spend most of our time in SQLShare, which is developed at
the eScience Institute at the University of Washington with Bill Howe here. We work a lot with
him and Dan Halprin. And so we do all this analysis, we do it in iPlant, we do it in Galaxy, we
do it on our own computers, and we're left with all these tables. But they're in disparate locations.
So what we use SQLShare for is to bring together these tables, which are just tab-delimited or comma-separated files, and do simple SQL statements, SQL statements that even a biologist can
remember and know how to do, to answer questions like what genes are expressed during high
growth, how many exons does a gene have, any kind of thing like that. This is just the interface,
freely available on the Microsoft cloud. I'll show you a couple of examples of how we actually
use it. So this is an example. In the very background, you can kind of see -- when I log on, you
can see all of my tables, and you can just upload data. It's pretty straightforward. This is an
example. This is kind of a long statement. This is some of our work. One of my graduate
students wants to understand the effects of ocean acidification on the proteome of oysters,
so she has all of these mass spec tables, tons of them, tons of individuals, and she can kind of
aggregate them and sum them in what I would call a complex statement. This is something more
common of what we might do. So this is just a simple join. So we have the oyster genome, join
it to the actual sequences, so each gene has its own sequence, and then we can just write a simple
statement where the term has the word methyl in it or has the word histone in it. So we can
easily go in there and grab all the genes that have, in the name, either methyl
or histone, and then we can study them. So that's more of a realistic way of how we use it on an everyday basis.
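To make that concrete, here is a sketch of the kind of join-plus-keyword query being described. The table and column names are hypothetical, and SQLite is used only so the example runs on its own; in practice the same statement would be typed into the SQLShare interface.

```python
# A sketch of the join-plus-keyword query described above. Table and column
# names are hypothetical; SQLite stands in for SQLShare so the example is
# self-contained, but the SQL statement itself is the point.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE oyster_genes (gene_id TEXT, annotation TEXT);
CREATE TABLE oyster_seqs  (gene_id TEXT, sequence TEXT);
INSERT INTO oyster_genes VALUES ('g1', 'DNA methyltransferase 1'),
                                ('g2', 'histone H3'),
                                ('g3', 'actin');
INSERT INTO oyster_seqs  VALUES ('g1', 'ATGGCT'), ('g2', 'ATGCGT'), ('g3', 'ATGAAA');
""")

rows = db.execute("""
    SELECT g.gene_id, g.annotation, s.sequence
    FROM oyster_genes AS g
    JOIN oyster_seqs  AS s ON g.gene_id = s.gene_id
    WHERE g.annotation LIKE '%methyl%' OR g.annotation LIKE '%histone%'
""").fetchall()

for row in rows:
    print(row)
```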
And with Bill and Dan's help and a little bit of seed money from
the USDA, we're actually building this out for the community, so all these data tables can be
public or private, and they're quite large, and a lot of these are universal, so annotating the
human genome or every species there is and let's say Swiss-Prot or NCBI, we can get all these
tables in there, and then we can -- we've set up, done some very preliminary tutorials about how
other researchers could use this. In this case, we're doing oysters, so they can do oysters, but
they can see how it can be used for other species. And so this in the background here is just a
little bit -- all of these kind of cloud computing resources, SQLShare, Galaxy and iPlant, are very
good about being able to share. You can click a button and all the data's public. Galaxy,
actually, you can make the workflows public, so I think that this is probably an RNA-seq and
differential gene expression analysis, so you might take two libraries, you might clean them up,
you might align them, you might measure what's up and down. You can actually save your
workflow and make it public, which is great. Same thing with iPlant. You can make your data
public or you can make your workflows public. And so you can kind of think about that as
collaboration, but I would say it's not really ideal collaboration in the sense that we would like
it. It's kind of what I heard earlier this morning: we don't really have the ability, or haven't
figured out how, to go in there and actually work together on the same data set, or, like I heard
this morning, leave a sticky note or modify the data and have some kind of tracking system on
that, so that's why I have that asterisk there, and that's kind of one of the challenges we see in
that. And then I'm going to kind of shift a little bit in the same vein, but talk a little bit more
about this open notebook science and some more challenges in what we do, with a little bit of
cloud in there, too. But this is what we do. You can imagine, it's kind of messy. Actually, this
might look pretty, but it's very messy. So we go back and forth, back and forth. We might
change a little variable, and we'll go back and forth, and it's very hard to keep track of it,
especially with these large data sets, and they're so big, we hold them locally. And it's kind of
related, but we -- everybody should keep track of what they're doing. We happen to use open
notebook science, so our notebooks are digital, mainly because, as was said at the beginning,
we've gone from running gels to actually sitting behind a computer all day, so it's a natural
progression. This is a quote by Jean-Claude Bradley about kind of a definition of what open
notebook science is. He's a chemist who, years ago, kind of led the way on this. We actually use
three major platforms for this type of documenting what we do, trying to keep track of our
processes and our data: a wiki format, something like Evernote, and then I use mainly IPython,
and that's kind of where we're going. And I like IPython in a sense, because this is an example
of a homemade workflow. It's mainly a bunch of just shell scripts, but it allows us to document
the work, but also, for example, we can use the Python client for SQLShare to go in there and
run those queries. We don't really have to use the user interface, though it's great, and when we
do it the first time, we want to go in there and look at the data and see how it changes. But once
we know it, we can actually automate it in something like IPython and use a lot of different programs.
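Here is a rough sketch of what one of those homemade workflow cells can look like: a documented shell step plus a query pushed out to SQLShare. The command, file names and SQL are invented, and the SQLShare call is left as a stub because the real Python client's interface is not reproduced here.

```python
# Rough sketch of a notebook cell in a homemade workflow: run a documented
# command-line step, then send a summary query to SQLShare. The command, files
# and SQL are invented, and run_sqlshare_query is only a stub standing in for
# the real SQLShare Python client.
import subprocess

# Step 1: a command-line analysis step, captured alongside its exact parameters.
params = ["my-aligner", "--reference", "oyster_genome.fa",
          "--reads", "sample_01.fastq", "--out", "sample_01.sam"]   # hypothetical
subprocess.run(params, check=True)

# Step 2: push a summary question out to SQLShare instead of eyeballing tables.
def run_sqlshare_query(sql):
    """Stub: in the real notebook this would call the SQLShare Python client."""
    print("Would send to SQLShare:\n", sql)

run_sqlshare_query("""
    SELECT gene_id, COUNT(*) AS reads
    FROM [sample_01_mapped]          -- hypothetical uploaded table
    GROUP BY gene_id
    ORDER BY reads DESC
""")
```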
So this is kind of one of our biggest challenges, and what we spend most of our time
dealing with is the reproducible nature of this and trying to document it. When I say we, what
I'm thinking about is my graduate students, and being able to understand and remember what
they did, and also me, too -- to go back to work on a few projects next week and wonder, how in
the world did I do that? So I'm at fault, too. We haven't found any perfect
solution, but I want to just point out some of the positive aspects of each of these platforms that
we're using. Wikis are nice. They are the one place where you can collaborate, so I can go in
there and make a comment in somebody's notebook and ask them a question, or I can edit it, and
it will keep track of it. So it has that built-in version control, and it's easy to search. I can go in
there if I know something was done on a clam on Tuesday, I can just search it and find it, and it's
very easy for us to publish, so that's one aspect of what we do -- we publish it online. Evernote,
something like that, it's very simple. It's cut and paste, multi-platform. IPython is great because
you can actually run the programs or the algorithms inside the notebook. You can do versioning.
The asterisk there is that I have it hooked up to GitHub so that I can kind of keep track of any
kind of modifications in my notebook. It does have a lot of downfalls, though. It's very hard, at
least for us, to collaborate, to work on it at the same time. And -- if it's not in this line, that
means it's a negative -- it's very hard to search, and it's quite challenging to publish, too, so
that's kind of what we're dealing with, and we haven't really found a great solution to keep track
of what we're doing, although we imagine there could be one. And this is kind of just a
summary of when I'm thinking about the challenges of where we are now. If you were
a biologist, you'd get this, because this is a male oyster and this is a female oyster, and so, at the
beginning, this is how we reproduce. We just put them in a jar and we make baby oysters. But,
anyway, one of the big things is talking to each other. That's what that white arrow means: we
really would love it if we could integrate everything together, if these things would talk to each
other in terms of documentation and make it almost error proof, idiot proof, okay? Any time we start
cutting and pasting, copying, we make mistakes, and that goes back to the reproducibility,
actually. Versioning, we can kind of deal with that with our notebooks, but we have a very hard
time with data. These large data sets, medium size, however you might think about them, 40
gigs, we might tweak them, change them, three or four or five, six, 10, 20 times, and how to keep
track of what one person did to that data set, what another one did -- we have no idea how to deal
with that. I guess that goes along with the problem of what happened to the data
along its trajectory from raw through those many, many, many steps. Collaboration, we
haven't really found a good solution. I can go in and see what somebody did in terms of a
workflow but not really like tweak it and put a sticky note on it, something like that. That would
be wonderful. Simple sharing, this is more along the lines of probably the notebooks side of it.
Some of the platforms are easy, but something like IPython, we haven't found a very streamlined
solution for that. And discoverability. Remember, not everybody is able to, or interested in, the open
science aspect of it, but we happen to be, and we would like it, if we make it publicly accessible,
somebody to be able to find it, too, and so we haven't really tackled that. So those are kind of the
challenges, and we think about these things a lot, so I think about these things a lot, and these are
the kind of things that kind of drive what we're doing. We want to make all of our science
reproducible, and then we want to make it open in terms of kind of solutions. So with that, I just
wanted to make a few acknowledgements. As I mentioned before, Bill and Dan Halprin have
been great in terms of the SQLShare and kind of working with us, as just biologists and how to
aggregate data, answer biological questions. These are the people in the lab and the major
funding sources that support my newer work. So with that, I thank you very much.
>> Kenji Takeda: That was fantastic, Steven, so I think we've got time for some questions. So
Dennis has a question.
>> Dennis Gannon: Have you looked at NBViewer, the notebook viewer for IPython?
>> Steven Roberts: That's how I do it now.
>> Dennis Gannon: But that's still not enough?
>> Steven Roberts: Still not enough, because it's one page, and I want to go to a page where you
can search everything I've done, but that is a great tool.
>> Kenji Takeda: I think one thing I would say, I was actually in Oxford. There was a
workshop around reproducible research a few weeks ago, and a lot of these issues came up, and
we have a few of the other projects on the Azure for Research Program, and I think that later
today, on the Microsoft Research Connections website, there will be a blog post which highlights
some of these projects. There's one in France called Xanadu, which is actually collaborative
IPython notebooks with publication channels for different research groups, and there'll be a link
to that on the blog later today, so it's a very topical subject. Question in the back.
>>: Question for you on publishing results. Do you do anything with tiered access to them? I
work with scientists who want to be able to have everybody see it, some people see it and only
themselves. Are there any tools you've found that are particularly useful for that?
>> Steven Roberts: Well, I know you have the capability in things such as iPlant to do that, and
Galaxy. There's definitely tiered access in that, and that's -- yes. And I should mention, I didn't
really talk about it, but iPlant is not just for plants. I don't work on plants, but it's a great
resource. It's a great tool.
>> Kenji Takeda: Okay, any more questions? Okay. Thanks, Steven. And so Evelyne Viegas
from our Microsoft Research Connections Team is going to talk to us about collaboration and
around machine learning that we heard about this morning, so just as we switch laptops. So
Evelyne is our Director of Semantic Computing and is interested in computational intelligence,
and she is hopefully -- there we go -- going to tell us about one of our new projects called
CodaLab that we've been working on for quite a while now, and so over to Evelyne.
>> Evelyne Viegas: Yes, thank you, Kenji. So let me first thank the organizers and the chairs,
because the way they have set up the talks, it's like half of my talk was part of the previous one,
so I'll be able to spend more time and answer questions. The reason I'm saying that is because in
the previous talk, we just heard about issues around data sharing, reproducibility and
collaboration. As you can see, that's exactly it -- I'm going to address part of it with CodaLab. So
in my current role at Microsoft Research, I'm working on data-driven research, and what I'm
focusing on is really trying to accelerate the pace of innovation by creating a community, an
online community, around sharing data, but also code, being able to execute code and collaborate in
a more effective way, by really trying hard to lower the barrier for researchers to be
more productive when they do experimentation. I'm focusing, as Kenji mentioned, on this
project. We started looking at the research process when we do machine learning. As you may
all know, it's a data-driven world. I think the audience here, we all know that we're bathed in
data, an overflow of data, and we're using machine learning as an enabler of decision-making. But
there are still a lot of issues, and some of them, when I show them here, are going to
resonate, like deja vu of the previous talk. So first of all, there is a lot of duplication of
effort. So when we talk about that, when somebody starts with an idea, what happens? Well,
you have to start looking for some data. You have to spend time preprocessing the data so that
you can use it, and because we are all scientists, then you also need to compare what you've done
to what other people have done. So this is about, then, finding the methods, other algorithms,
potentially code, and implementing those competing methods so that you can compare them and
do science. You need to run all your experiments, and then finally, you create tables, because as
we know, as humans, it's easier to look at data which has been visualized rather than just that
HTML or XML. And then we summarize the results in papers. So that's the first one, the issue
of duplication of effort. Now, assuming that we have access to all that, as we just heard earlier,
there is the lack of reproducibility, which is still a very big issue. And the reason, some of the
reasons why it's so difficult to reproduce is because, first, you need to find the data, have access
to the data, but even when you find the data, it's rarely made available in a way which is readily
usable, so that's another issue here. Finally, when we talk about the absence of comparable
baselines, most of the time, we end up comparing apples and oranges. And the reason for that is
that even, actually, if we can find the data, if we can find it in a way which is usable and that we
can reproduce, well, often, it's maybe a different version of the data. We don't know which
version, something we just heard in the previous talk. So are we really comparing apples with
apples, right? Another issue. And finally, the last one, which I don't think I've heard addressed
yet here -- in one way, it doesn't relate completely to the first three issues, but what we're trying to
do in CodaLab still relates to some issue which is going to resonate for you, and specifically
for people who talk about sharing data -- is that your data, and I will add your code, is my data, my
code. But my data and my code are my data and my code, right? So one way of being able to
focus the community on working on the specific data sets, trying to share some code, is by doing
some challenges or some competitions. But often, I will argue that a lot of those challenges are
just kind of wasted. If we just do a challenge, a competition, we have some winners, and then
that's it. What happened to the data? What happened to the code? We go back to the first three
issues that we were having there. So, having said that, just to give you an example, I think
probably everybody here is very familiar with the research process. There are various
forms of those tables and processes, but this has: find the data, clean the data, format, e-mail the
authors to find the data, compile the code, reimplement -- or rather, yes, reimplement -- run
experiments, etc. And so basically what happens is that even when we do that, what I was
talking about earlier, the reason why we cannot really do reproducible research, is because we
cannot have, first, exhaustive comparisons. It's very difficult to do. Even if we work on some
data sets, the same data, assume it's the same version, maybe, when we start comparing, we
compare the same measure. In this case, we select accuracy, but often the comparisons are
just uncontrollable, right? If I look at the previous method, so we took -- the measure is the
same, right? We're talking about accuracy, but actually, we may have been using different
sampling. Some might have been using optimization. We use different cross-validations for
comparisons, and the list goes on and goes on, and this is assuming that the software is bug free.
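As a small illustration of what an apples-to-apples comparison looks like in code, here is a sketch where two methods are scored with the same metric, the same cross-validation folds and the same random seed. scikit-learn and its bundled iris data are stand-ins here, not anything used in the talk.

```python
# Minimal sketch of a controlled comparison: both models are evaluated with the
# same metric (accuracy), the same folds and the same seed, so differences come
# from the methods rather than the protocol. scikit-learn's iris data is a
# stand-in dataset, not one from the talk.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # shared protocol

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(name, round(scores.mean(), 3))
```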
So what we need is a way to start reducing the time it takes to prepare all these
data; to enable true reproducibility, which means taking into account the issues that
Steven was talking about previously in terms of versioning and provenance, and how we can
document all that with metadata; and to establish benchmarks, common measures, so that now we
can compare apples with apples and not apples and oranges. And the last one, which I won't
touch on too much here, is evolving from just competitions -- where we run a competition and then
we have some results and then what's next -- to the idea of a competition which is kind of a live
competition, so to speak. Actually, out of curiosity, how many people here in the room
participate in challenges or competitions, over the one year, during the year? Okay, I see two
timid hands. Okay, so I won't be talking about competitions too much here. So the answer is
software that we are developing, open-source software, working with the external research
community, called CodaLab, which is an open-source platform to empower
communities to create and explore experiments on one side, to do that together, to collaborate,
and also be able to create competitions which are not completely separated from the
experimentation side, so that people are able to come back to the results of the competition and
use that in their further experimentation. Our community leads are Percy Liang from Stanford
University, and he's a machine learning and natural language processing professor, and Isabelle
Guyon, who's been working on challenges in machine learning for decades. The main principles
of CodaLab, and by the way, I'll be doing some demos for those of you who are interested, later
in the day, if you want to see what it looks like, but there are three principles, underlying
principles, for CodaLab. One is modularity. We are talking about artificial intelligence
problems, which require the effort of the entire community, and we cannot all be
experts in everything. It's very, very difficult, right? So some people may be experts in some
areas, but together they can participate to solve some of those hard AI problems. So it's about
modularity. The other principle is about immutability, and the main idea there is the idea of Git
version control, so that now I have data sets, and if somebody is going to modify one, it's another
version of the data set -- so how do we do that, right? So the main idea is all programs, like I said,
are just run and written once, and then we capture metadata. Enable collaboration without chaos, so
still lots of work to be done there, how do we enable that? And then finally, capture the research
process in a way which is truly reproducible. And that means understanding the data, the version of the data, the provenance, and the same thing for the code.
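To make the immutability and provenance idea concrete, here is a minimal sketch, not CodaLab's actual implementation, of what "run once and capture metadata" can look like: hash the input and output files, record the exact command, and append an immutable provenance record. The file names and the command are hypothetical.

```python
# A minimal sketch of "run once and capture metadata": hash the inputs and
# outputs, record the exact command, and append an immutable provenance record.
# This only illustrates the idea, not CodaLab's implementation; the file names
# and the command are hypothetical.
import hashlib, json, subprocess, time

def sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

command = ["sort", "reads.txt", "-o", "reads.sorted.txt"]            # hypothetical step
record = {"command": command,
          "started": time.time(),
          "inputs": {"reads.txt": sha1("reads.txt")}}
subprocess.run(command, check=True)
record["outputs"] = {"reads.sorted.txt": sha1("reads.sorted.txt")}

with open("provenance.jsonl", "a") as log:                           # append-only log
    log.write(json.dumps(record) + "\n")
```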
And, finally, the last principle is the principle of literacy, so this is very much in line and it may loop a little bit. This
comes from the idea of the IPython, the IPython notebook, with the idea of being able to mix text
and code. And basically, when I do an experiment, I may as well capture it as I do it, be able
to put my code in, and share directly with other people the experiments
I'm doing. So with that, I have -- I'm not going to do a demo here. It's a short video of two
minutes, which shows two aspects of the platform. Actually, this is the competition aspect of the
platform. This is the one we started working on. This is a challenge in machine learning, and as
you can see, you can just put your competition, describe your competition, with describing the
data, the tasks. You can also -- you also can provide the education on the screen for people to
use. So this is just like text, right, like explaining a little bit also the terms of use of the
competitions and what you're going to do. In a second, we should be able to see the little box.
So then the run details are really the definition of the competition. Then, as a participant, I can
participate once the competition organizers have published the competition, and here are some
examples that we have already there. Now, the next thing we're going to look into is the
worksheets. It looks like -- it looks stable on my machine. You know what? It's a bit loose, but
I don't know what I can do. So that's an example. This is an example of the experimentation,
where we're using some libraries, in this case from Stanford, to do some parsing. I'm sorry. I
don't know, it's really -- you'll see it at the demo later, because it's looking fine on my machine,
but I'm not sure. And so the main idea is you can actually use a command line to do agile
exploration on the data, and then you can publish it. So here is just the output we're seeing of a
program which has parsed a little poem that I had there. And then on that, we want to visualize.
So as you can see, there is a mix of code, of text, with some metadata which has been captured.
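As a stand-in for that parsing demo, which used Stanford NLP libraries inside a CodaLab worksheet, here is a sketch with NLTK instead, just to illustrate the kind of text-to-part-of-speech output being mixed with prose in a worksheet. The short poem-like line is arbitrary.

```python
# Stand-in for the parsing demo: NLTK instead of the Stanford libraries used in
# the talk, just to show the kind of part-of-speech output mixed into a worksheet.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

poem = "The fog comes on little cat feet."      # any short poem-like text
tokens = nltk.word_tokenize(poem)
print(nltk.pos_tag(tokens))                     # e.g. [('The', 'DT'), ('fog', 'NN'), ...]
```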
This is really bad. I'm going to stop, sorry, but here it looks -- yes, I think that's -- okay, that's
the end, anyway, and some visualization which is going to show the parts of speech. If you're
interested, just come by -- oops. That was not it. All right. So to summarize, what CodaLab is
about is really about building this community and making it easy -- it's not about building. It's
actually building the software so that it makes it easier for a community who is doing data-driven
research to participate and share with the rest of the community. And it's kind of a cycle, right?
You can participate in CodaLab if you want to organize a competition, share some data, but you
can also just go there as an individual or with a research team to do some experiments and then
share it with the rest of your community and, more broadly, with the world. So in conclusion,
let's be part of the solution here. In CodaLab, as I mentioned, this is an open-source project. It's
all on GitHub, which means the community can add more features if the community wants to.
So if you have a competition to create, you can just use CodaLab, and then if you want to start
looking at some examples of data-driven research -- currently, with what we're starting to put on
CodaLab, we're working, as I mentioned, with the machine learning and natural language
processing communities, so we don't have much on the experimentation side in bio, but that could
be another area -- just try to reproduce some of the experiments in CodaLab. And with that,
I believe that's the end. Thank you.
>> Kenji Takeda: All right. Thank you, Evelyne. So we've got time for a few questions, if
people have some questions. So Evelyne is going to be sharing this live later on in our demo
fest, so she can have a deep dive there with any of you there.
>>: No jumping in the street.
>> Evelyne Viegas: I don't know. If you enjoyed that, I can try to reproduce that, too. Yes.
>> Tanya Berger-Wolf: So right now CodaLab is primarily designed for running scientific
competitions, but presumably, one can also turn it into a tool for citizen science and open it for
public competitions for data collection, for bird counts, for things like that. How easy would that
be?
>> Evelyne Viegas: You could. If what you want to do -- so turning the competition into citizen
science I don't know exactly, but if you're talking about like, being able to share data, so
CodaLab is -- think of it to a certain extent, kind of a repository, so to speak, for data. You can
have links to data, so that would be one of the -- and then the code to process that data. So, I
mean, in terms of data sharing, execution of code on the data and sharing modules, that's what
CodaLab is for. And I would say right now we are focusing -- we start with the machine
learning community, specifically on the context of the experimentation, but really, it's all about
data-driven research, so in theory, that could be possible. And probably in practice, too.
>> Kenji Takeda: Do we have any more questions? No? Okay. Thank you, Evelyne. So
Tanya, who asked the last question, is our next speaker. So Tanya Berger-Wolf is from the
University of Illinois at Chicago, and she's a computer scientist but is very passionate about
ecology, and we had the pleasure of hosting her in the Microsoft Research Lab in Cambridge for
several months, where she was working with some of our scientists in the Computational
Ecology and Environmental Science Group, in fact, the group who created FetchClimate. And
Tanya's -- you may have seen some of her work, where she was in Kenya in January on
fieldwork, running a class there, and she created, actually, I say sort of nature's ultimate barcode
scanner, so she's basically created some software that will identify zebras based on their stripes,
which is basically a barcode scanner. And so over to Tanya.
>> Tanya Berger-Wolf: Thank you, Kenji. So actually, that barcode scanner is an integral part
of the Ecological Information System, which is what I'm going to talk about. So if you look up -- if you just use any search engine and do ecological field data collection or ecological fieldwork,
this is pretty much what you're going to get. This is just a screenshot of the search results, and it
goes on and on, and notice, most of it is a few humans looking at some natural world data and
writing down things on paper, mostly. All right, so clearly this is not big data. This is -- and it is
from plant and nutrient cycles in ecology all the way to behavioral ecology and ecosystem kind
of data collection. A lot of it is about one, two, three, five humans looking at data. This is my
postdoc currently. I mostly work with ecologists who are interested in social behavior of
animals, so they do this version of that data collection process. They look at, let's say gelada,
who are relatives of baboons, and they write down which baboon is interacting with which other
baboon. That doesn't scale, but what's coming is a lot more of very different kind of that data, of
data on who is interacting with whom, how do animals behave, and in general, we get much
more than just who is interacting with whom. So the pretty early source of data in that field is
GPS collars or any other type of tracking collars, radio collars and so on, so this is not a stripe.
This is a solar panel on a GPS tracking collar, part of the ZebraNet project a long time ago, in
2003, with Margaret Martonosi from Princeton; she was the one. But Microsoft's own
collaboration Technology for Nature is producing cheap GPS trackers that you can put on zebras
or whatever your animal of choice, and they can give very high resolution information about which
animal is where. We're collaborating with the University of Illinois at Urbana-Champaign, Robin
Kravets, to design proximity sensors. So you can put also -- we've put this on baboons. Some of
you have seen these data yesterday, so [indiscernible], which is part of Technology for Nature,
but high-resolution GPS collars that tracked a population of baboons every second for 30 days, so
there is much more data. This is 1.4 million data points per day. This is over 20 million points
of just GPS locations and all the other associated metadata that goes into the millions for one
month. There they are, those baboons. So there's also one source of data that's coming up that is
extremely cheap, abundant and high coverage and available. That's images coming from camera
traps, those stationary cameras, motion activated, from tourists going everywhere on safari rides,
parks, and just uploading their Flickr streams or their albums online or just taking pictures and
leaving that on their cameras. And what you get from there is many, many -- this is from camera
traps -- many, many, many pictures. And so in 2010, we asked Microsoft, could we leverage
those data for information about the interactions of animals or the natural world in general? And
that's where the idea of, well, if you are going to do that, you would need to be able to identify
each individual animal. Zebra, we started with zebras. They're easy. It's less of a barcode
scanner, although everybody likes to call it that -- it's more of a fingerprint. And now we're onto
the second generation of that software. I'll show a little bit. But now, if you can leverage images,
you can actually get a lot more information out of them, and in fact, there are many, many other
sources of images, so you can fly drones. This is a drone in Kenya. Light airplanes with camera
mounted. You can put GoPros on vehicles. You can put little cameras to track your ants, if you
want to, if you're interested in insects, not zebras. You can put autonomous little tiny ones,
underwater vehicles with tiny cameras to scan -- I was talking just about the coral reef. There is
tons and tons of image data that are now coming in, so now we're in the realm of big data. So
what do we do with it? How do we deal in fact -- there is also all these citizen science platforms
for processing or collecting image data, including iNaturalist and Instant Wild, which is also part
of Microsoft Research collaboration through Lucas Joppa in Cambridge. So what we need -- so
this is coming from before this avalanche of data, before we can even take advantage of it, we
need a system that will be able to deal with it. So my talk is probably the only one that is in the
stages of design. We're deploying the first version of it in July, so I can't show you demos of all
of this working on the cloud yet. Hopefully in a year, but so all this data coming from images,
and we're building an Image-Based Ecological Information System that takes all of it, processes it,
puts it in a database, and then you can pose scientific queries. And clearly, there are many
related systems out there. Don't do that. There are many related systems out there that vary in
flavor and nature, from a similar project like Andes, which uses citizen science to collect data about
the natural world of the Andes, but without the image processing and focusing more on
the citizen science aspect of it; there is the animal tracking data repository, Movebank. There is
Zooniverse, which is image based and other meta information based data collection. There are
data standards projects, DataONE and NEON. There is the Ocean Biogeographic Information System,
focused on organizing data about the ocean. There is the Citizen Science Research Center. You
name it, there is probably a flavor and version of it right now coming into existence. All of these
are very, very new. The oldest ones are probably just a couple of years old. So the
field is maturing. So here is our version of it, and so you have all these images coming in.
Eventually, we see them coming in from Flickr streams and Google+ and wherever else the
images are coming and all my albums, and the cloud clearly being the source of data, as well as
the repository of data. But we start small. We start with tourists' cameras. And the thing is,
when you start with data that is in the field, let's say in Kenya, just for the sake of example, you
don't have connectivity. You don't have the wireless, typically. You don't have reliable
electricity, so connection to the cloud, before you can put the data on the cloud, you actually
have to jump through some hoops. And so that is very different in field data collection, which is
not like genomic data, because there is a barrier between getting data and getting it on the cloud
and sharing it through the cloud. They call it the truck Internet, because you have to deliver the
data quite often to the server by a truck. And that's what we're going to do at the beginning. So
you prefilter the data. You do a lot of work locally on the server, and part of building
this pipeline, this system, the initial deployment, is to figure out how much we have to do locally
versus how much we can do globally and what is the protocol for sharing, because there are three
nature conservancies to begin with. There is going to be the whole world at the end of it. To
give you an idea about the numbers, one nature conservancy, one tour company, 30,000 images
per day. If we go back to these systems that I showed you, they talk about 500 images in the first
three months uploaded. We do 500 images in a couple of minutes. So we have to build a
system, from the beginning, that within the constraints of unreliable Internet and electricity and
field conditions and constrained resources is still able to leverage the abilities of both the
software and hardware infrastructure and the cloud to share the data
and make it available and make it useful to scientists who are in the US, who are in Australia, but
studying those zebras in Kenya, tourists who have been on safari but went back to Russia and
uploaded their album later, right, so we can get their data as well. And so this is our sort of
architecture. This is the main point where the information, the data, gets on the cloud, and
after initial processing of mundane things like timestamp and location -- because you forget to
change the time zone on your camera; you're still in Moscow -- it goes through
our battery of image algorithms, and on this part, there is also collaboration here within
Microsoft Research on image search and object detection to identify all the images that contain
zebras, giraffes and so on and so forth. And so here, it goes through our own -- that second
version of the zebra barcode scanner, called HotSpotter now, which can identify any animal
that's striped or spotted. So not only to say that this is a leopard or an elephant or a nautilus or a
zebra, but this is Joe the leopard, this is Cathy the elephant and this is nautilus number 126. So
we can now get down to an individual animal on anything that's striped or spotted, even things you don't think of as striped or spotted, like elephants, wrinkled elephants.
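As a rough illustration of the individual-identification idea -- this is not HotSpotter itself, whose matching is more sophisticated -- here is a toy sketch that extracts local features from two photos and counts good matches as a similarity score. OpenCV's ORB features stand in for the real pipeline, and the image paths are hypothetical.

```python
# Toy illustration of photo-ID by local-feature matching; NOT HotSpotter itself.
# ORB keypoints stand in for the real algorithm, and file names are hypothetical.
import cv2

def match_score(path_a, path_b):
    orb = cv2.ORB_create(nfeatures=1000)
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    good = [m for m in matcher.match(des_a, des_b) if m.distance < 40]
    return len(good)   # higher score -> more likely the same striped/spotted individual

# Compare a new sighting against a small gallery of known animals (hypothetical paths).
gallery = {"zebra_017": "gallery/zebra_017.jpg", "zebra_052": "gallery/zebra_052.jpg"}
scores = {name: match_score("sighting.jpg", path) for name, path in gallery.items()}
print(max(scores, key=scores.get), scores)
```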
So then, we also extract -- the thing is, for a lot of these data, the statistics are not developed. So if you have all of these data combined, how do you do ecological queries on these data? What concerns us is
what's the unit of identification? Is it one photograph? Is it an animal in one photograph? Is it a
whole series of photographs around one animal that are close in time? And so we're developing
a language, also, of how to process that information, because when you want a sequence from
GenBank, there is a protocol, and the object of a sequence is well defined. What's an object of
an encounter? What's an ecological unit of analysis here? So this is what we spend a lot of our
time on right now, figuring it out. And then connecting to all of the other -- through all the other
useful resources such as FetchClimate, such as Movebank, tracking data, tracking animals, such
as all the satellite imagery and so on and so forth. And so at the end, all of it goes into a
database, a cloud instance of a database, which is a version of Wildbook. The nonprofit Wild Me
started with sharks and is now scaling it up to do it for any animal, and it works with the standards of
ecological data. We work with the organizations that maintain standards of ecological and
biological data collection. So here is what we think about an ecological unit. Here is, for
example, a series of photographs. You can put them together into a habitat unit or a collection of
animals, or maybe individual animals. This is using Image Composite Editor. This is through
our own HotSpotter identification, so we can string the images together and say here is all the
images around one time, one animal. We can annotate -- we can start annotating these images
and put together a story. And for the types of queries that we will be able to do, hopefully by the
end of the month, we can now use this unit of data from images, from a collection of
images, to answer queries such as population count. So through sight-resight or mark-recapture
techniques, we can now ask, from the images, how many animals are there, what's the population size?
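To make the mark-recapture idea concrete, here is a minimal sketch of the classic Lincoln-Petersen estimator in its bias-corrected Chapman form, where a "capture" is simply an individual identified in a photograph. The counts are invented for illustration, and this is not the project's actual statistical engine, which, as noted below, still has to be developed.

```python
# Minimal sketch of photo-ID mark-recapture (Chapman's bias-corrected
# Lincoln-Petersen estimator). A "capture" here is an individual identified in a
# photo. The IDs and counts are invented; this is not IBEIS's statistics.
def chapman_estimate(n1, n2, m):
    """n1: individuals identified on day 1, n2: on day 2, m: seen on both days."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

day1 = {"zebra_003", "zebra_017", "zebra_052", "zebra_101"}                       # hypothetical IDs
day2 = {"zebra_017", "zebra_052", "zebra_200", "zebra_214", "zebra_221"}
resighted = len(day1 & day2)

print(round(chapman_estimate(len(day1), len(day2), resighted)))                   # rough population size
```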
And it has been used now in various contexts. For example, BBC did a little show; they used this
program to maintain the counts of harbor seals as an indicator of harbor health in the UK. It's
being used to estimate population sizes of snow leopards in Nepal. Or we did it for zebras in Kenya, and it turns out,
unfortunately, that there are about half the number of zebras that they thought there were, and
they're severely endangered, so that's not good news. We can look at population dynamics. We
can ask questions, death, birth, with uncertainty, but we can estimate that from observations and
photographs. Habitat use, which species use particular habitats. The movement tracking of
animals through photographs instead of putting GPS collars on them, and finally, my favorite,
social network analysis. We can push it out -- that's the social network analysis. We can push it
out to citizen science, back to citizen science and education and outreach through -- so this is an
app based on the same idea, connecting to data now collected not from the safaris but from the
backyard of students at a school, using camera traps that they put out in the school's
backyard. That's in Chicago. When you go to the zoo in two years, maybe if you come to
Brookfield Zoo, instead of the 19th century technology that you have about explanations, printed
explanations about the species you're looking at, the zebras that you are seeing in front of you,
you will have an LCD screen that will show you what these zebras are doing in the wild right
now in Kenya, where they are. But all of it also comes back to several issues: the workflow --
people have been talking about all the cloud aspects of it -- and also the scaling, how long
it will take to run this query. The privacy, data provenance and data security -- we
would love to share the data about the snow leopards. We cannot. They are a severely
endangered species, highly protected, and you don't want to give the poachers information of
where they are and when they are there. So we have to design different levels of information
sharing. You also don't want to expose the data that the students are collecting in their backyard
and the thoughts that they are having, but you do want to connect them to other data from the
real world. So resolving all these data security and data access issues is going to be a big deal in a
system like this, and this is my own social network of collaborators that are helping -- that
we're working with in various forms and shapes -- and funding agencies. Thank you.
>> Kenji Takeda: Fantastic. So has anybody got questions? Yes.
>>: Yes. The images collected from, you say, citizen scientists? I have a question about how
you would handle the bias -- possible bias from the citizen scientists, because they might
prefer the touristic [indiscernible] touristic sites, or you want to study the ones that are usually
hard to observe, and/or how much of a limit does it pose on your finding accuracy?
>> Tanya Berger-Wolf: Right, so this is a great question, and this is one of the reasons --
>>: Could you repeat the question?
>> Tanya Berger-Wolf: Yes, the question was about how do we handle data biases, and data
biases, there are many, many levels of data bias that are in a system like this. There is the data
biases that are coming from the people who are collecting the data and the data biases that are
coming from the objects on which the data are collected. So the animals themselves, versus the
people who are collecting the data. And it's a great question. One of the reasons we're deploying
the system in a very, very light version, IBEIS light, in July is to estimate data biases -- there is no -- right now, there is no data on data biases of a system like this. So what we're going to do, we're
deploying. We have GoPros, GoPro cameras, on each vehicle, four on each one, and a GPS
tracker. So we know what the tourists and everybody else could have been photographing versus
what they are photographing. So we're going to estimate the -- there are known things like camera
fatigue, like species fatigue. The first time they see a zebra, they take three, five, 100 pictures of
it. Twenty minutes later, when they realize there are thousands of them, you don't see that many
pictures of zebras anymore. Then there are also the biases -- they say that there are four filters
when people post pictures. There's which pictures they take, which pictures remain on camera,
which pictures they upload on the cloud or on the computer, which pictures they decide to share.
So all of this is -- that's why we're working with citizen science experts. But there is the other
side of it: which animals -- they're photographing these animals because they're on this nature
preserve, and they're being taken to this location, but what about all the animals that are not
there? This is why we have UAVs, drones and camera traps and all the other, and so we can
compare the biases from different systems. And we also ideally would like to use estimates of
the query estimates -- say, population at site or the home ranges that are coming from this data,
and with uncertainty, so we need to build the whole statistical engine that comes from the data to
estimates, statistical estimates, with uncertainties, and then to go back and say, oh, we really
would like to have those data. So one of the projects that I've been doing during this year,
working on with Piero Visconti and Lucas Joppa from Microsoft Research Cambridge
UK, is active crowdsourcing. So when you estimate what data you would like to have, given the
data that you know that you have in cryptic species, species that are hard to get to, locations that
are hard to get to or people that are taking -- have their own biases in data collection -- how do
you ask citizen science people then to go and get the necessary data to get better estimates about
your ecological factors?
>> Kenji Takeda: Curtis?
>>: What tools are you using to visualize mobility data, and what are they not doing that will
prevent them from doing better science?
>> Kenji Takeda: So the question is, what tools are you using for visualization, and what do you
want from the tools, I guess?
>> Tanya Berger-Wolf: Oh, I want a lot. So we're starting to -- one of the side projects -- is Rob
here? Hi, Rob. We're using Layerscape actually to visualize a lot of our mobility data. Rob
demoed yesterday the baboon visualization, and one of the first things we asked is can we make
this interactive? Google Earth can visualization as well our GPS tracks of baboons, but the
problem is, they can't then click and say, oh, I need to label this. I need to label this as activity. I
need to label this as this baboon, oh, it's doing something weird. Let me annotate this. I want to
grab this group of baboons and circle around them and say, oh, focus on this, or even for
machine-learning data, I need to have activity and timeline labeling. So for -- and we want to put
it on the landscape right now, and so what's missing is a very high-resolution, accurate
landscape. We don't have the 3D very well right now. What's missing is the ability to do it on
my laptop, but it is -- I recognize the limitation of the computational constraints that it's a highly
intensive process, but I would still like to be able to do it on my laptop.
>> Kenji Takeda: Jeffrey.
>>: I had a technical comment about your truck Internet. So I think in both experimental and
observational projects like this, it gets to the generalization of the idea of streaming. Maybe
you stream data in blocks, months' worth of data. And the same, for instance, is true in seismic
exploration. I think they take 20 petabytes of data every month, and then they bring that to the
analysis system. So as we look at streaming data and streaming algorithms, we should look at
block streaming algorithms, as well.
>> Tanya Berger-Wolf: So the comment was that the notion of streaming is different in different
science domains. When we talk about streaming, sometimes we mean every second, when you
talk about financial data, but in this scientific domain, quite often it's block streaming. So in
seismic, it was pointed out, it's a month's worth of data uploaded in chunks. In this case, it's going
to be a couple of days' worth of data uploaded in chunks, probably using the truck Internet. But
when you scale it up to more than one nature preserve, it's still going to be burst-y, but it's not
going to be block streaming, necessarily. It's going to be constant trickling with bursts of blocks.
>> Kenji Takeda: I think actually the session after the break, we're going to be talking a bit
about streaming data. Any final questions? Yes, I have just one more at the back. Thank you.
>>: So I understand that you are in the process of developing a querying language for this, and
so how dynamic is the data, how frequently is it changing or evolving, and do you envision
traditional database query processing techniques being useful here to expedite the query
processing challenges that you are seeing?
>> Tanya Berger-Wolf: So the question was about developing the language of ecological
queries, how dynamic are the queries, how much can we take from traditional query processing
and how much do we need to do from scratch? So the answer is both. We are taking full
advantage of what's out there already, so we're relying on the Wildbook infrastructure -- Wildbook
is a database instantiation, and I can talk a lot more about it -- with a fully developed data schema using
Darwin Core, which is a data standard for animal data collection and sites. And we're
implementing it, so they already handle things, on a small scale, like mark-recapture queries.
They connect to some population genetic queries. They can connect to those. But the problem
is, while we have the language of this, we have this data schema of these queries. We have the
databases that we can build on and the data schema that we can build on, how do you do a
statistically accurate query of sight-resight or mark-recapture-based population count from these
kinds of data, from image-based biased data? So we don't know, and the problem is nobody
knows, so this is why we're deploying the pilot first, to get estimates, to see what are the
baselines. So Evelyne talked about the baselines. We need to collect the baseline first. We need
to have those GoPros to see what people are collecting. So how does this all relate also to
traditional data collection techniques? What worries me and the biologists is that we'll find that the
answers vary so widely between these kinds of data and traditional data collection techniques --
which one are we going to trust? The high-resolution coverage of the population, which is
photographed every few minutes or the once-a-day sighting of that population?
>> Kenji Takeda: Thank you, Tanya. I think we should probably move on now, so Chaitan's
going to come up and set up. So Chaitan Baru is from the San Diego Supercomputer Center, so
we're delighted to have him here. He wears several hats, so he's Associate Director of Data
Initiatives at SDSC, Director of the Center for Large-scale Data Systems Research and also
leading the Institute for Data Science and Engineering. So as said, he works across many
different disciplines. Today, he's going to talk about NSF EarthCube and a new cloud initiative
as part of that.
>> Chaitan Baru: Thank you. And one more hat, Tanya and I were just talking. So actually,
I've been involved in a project for the last six years or so, funded by the Moore Foundation,
called the Tropical Ecology Assessment and Monitoring Network, which has camera traps all
around the tropics, so we're going to connect on that. Okay, so what I'm going to talk about now,
it actually is a new initiative, just started I guess a few weeks ago, so it's opportunity for all of
you to jump in. I can't even say that the ideas are half-baked, because we're just kneading the
dough. So you can bring the yeast and everything. And actually, I do think it's an opportunity to
influence some of the thinking in the community and probably some of the thinking at NSF
about how some of these things should be done. At least we can try. And Wenming Ye from
Microsoft has been extremely helpful with this. In fact, he has already gone and started building
something using Azure, and that's the page I'm showing you. If you go to
eccearthcube.cloudapp.net, you'll see it, but we'll talk a little bit more about all that. So let's get
started. So I'm going to talk about this thing called the EarthCube Cloud Commons Working
Group, and the outline is let me tell you a little bit about EarthCube, because I'm not sure if
everyone knows or how many folks know about that, and then within that, what is this
EarthCube Cloud Commons Working Group, and then what are the next steps? Actually,
Wenming was at SDSC last Friday, and gave us a tutorial on Azure, and we already started
making some plans for next steps, and as I say, this is really an open community activity, so
everybody is welcome to participate. So what is EarthCube? It's a vision that started a few years
ago in NSF to create this national data infrastructure for Earth systems science, so encompassing
all areas of geoscience, Earth, atmosphere, ocean, and it was collaborative within NSF between
the Geoscience Directorate and this division called Advanced Cyberinfrastructure, which is part
of the Computer Science Directorate. The first meeting was back in November of 2011. It was
the first community meeting. It was called a charrette. It's the first time I've gone to something
called a charrette. But one of the outcomes of that was formation of a set of interest groups,
what are the kind of issues -- and none of this will be very surprising. This community is pretty
sophisticated. You've seen these kinds of issues, but clearly, folks said, well, we have the data
issue to worry about, governance. I think there is some notion that this should be a coordinated,
maybe not so much top-down but bottom up, but still coordinated, so there has to be some
governance involved with it. In this presentation, we don't want to get too caught up in whether -- how EarthCube, per se, is proceeding. I think the opportunity here is the cloud computing
opportunity, for us to bring it into this community. So there's a governance group. Of course,
semantics was a big issue, what does the data mean? And then workflows, because all of science
gets done through these kind of processes. And then there was a second meeting in June of
2012, so things progressed along the way. All these sort of community groups in that one year,
made some progress on some directions, and then there were about 50 of what they call
end user workshops. These are the actual scientists across all sorts of domains of geoscience
meeting and saying what do they think they might want from such a national infrastructure? And
in the end, September 2013, there were actually a bunch of projects that were funded, about
$14.5 million across a number of things, one of which was called the Test Enterprise
Governance, which is I guess a pre-alpha for governance. So they're trying to figure out what
kind of governance mechanism there should be, so they funded an activity to think about that.
But one of the things that the governance group did is to come up with -- they call these
assembly groups, but these are groups in big areas that need to be investigated, one of which was
an assembly group involving industry and free, open-source software. So it's the whole issue of
software in this community, how does it get maintained and what do we do about it? And that's
actually the meeting where we were, where this idea came out. So it was maybe about 40 people
in the room. We had multiple different discussions and four or five different activities came out
of it, and a few of us got together to talk about the cloud idea.
>>: Just a quick question. This community is used to paying for services like Esri, which are
not open source, so I'm just curious about that.
>> Chaitan Baru: Well, but they also have a lot of open-source software. So it's both, and that's
exactly why they wanted to have the commercial folks there. I'm trying to remember -- I'm sure
somebody from Esri was invited, may have been there. I can't recollect. But there were others.
Okay. So yes, and they used to pay a lot of money for Esri, actually. So let's talk a little bit
about this particular activity. So we called it the EarthCube Cloud Commons, also informally
called the GeoCloud. It's an outcome of the discussions at that workshop, and we've set it up as
the charter being to evaluate, encourage, facilitate and provide education on adoption of cloud
computing by the geosciences community, so it's a big charter, right? So how do we take the
geoscience community and just push them into the cloud? So those of us who were at the
meeting call ourselves the leadership team. This is pretty open, so I don't think we want the
leadership team to have 50 people in there, but if some more want to join, that's fine. So that
includes me, Emily Law from JPL, Charles Nguyen from Minnesota, Judy Pechmann from Utah
and Wenming from Microsoft. So we had to submit actually a document back to the EarthCube
folks about what this group would do, so a statement we had was this was formed based on the
recognition that creation of private, what you might call on-premise, IT infrastructure by
individual researchers and/or research groups in the NSF community may not only be
unsustainable but may, in fact, be detrimental to creating synergies in the community and thus an
impairment to collaborative research. So you can see we are being a little edgy with the
language. But I think what we are saying here is the current model, where everybody puts a
budget item in their proposals and gets equipment, puts stuff in their closet, there is a limit to
how far you can go with all that. And you all know that with data, you can create islands of data
and data that's lost, etc. Plus, the flipside of it, if you're involved in a common environment, it
actually already creates a situation where you could have synergies and sharing of data, if it's in
some kind of a cloud environment. So some of the issues that we talked about and that we would
like to inspect -- we have about a year, by the way. So the idea is that this group is supposed to
meet and do a few things, which I'll mention here, in terms of action items over the next year.
We'll get some money from EarthCube to actually run another workshop, so again, that's another
opportunity to meet, and then see how we are making progress on some of these kind of things.
So the first idea was the community may not know exactly what cloud computing is, how it can
be used, so there's a role as a broker, an honest broker, for a group like this within EarthCube to
help the scientists say, okay, what is the thing you are trying to solve? What's your problem?
And then look at that and say, maybe you can use a cloud, or maybe you shouldn't use a cloud.
Maybe you need to go to a supercomputer, whatever it is, right? And there are many cloud
providers, as well, so which one would you use? EarthCube itself, as I mentioned, there was
$14.5 million of funding that went to various projects, some of which are creating software
and infrastructure, and one of our thoughts was, maybe the EarthCube, the services that the
EarthCube projects themselves are creating, maybe they should reside in the cloud right away.
You should start with the presumption that maybe they should be cloud services and then see why
they shouldn't, right? The other one is to really evaluate costs and business models, and that's
partly your question of where are we spending the money today and where would we be
spending the money in the future, if we went to the cloud, and it's all the money, not just the
hardware. It's the people, it's the fact that you might be using grad students to run systems. Is
that a good idea, a bad idea? All of those kind of things. And like I say, I think the answer is
going to be different for whether you should use cloud computing or not based on whether your
project is a small, medium or large one, and also small, medium and large in terms of your
computational and data needs. Project management and sustainability, so what is the access, cost
and benefit of implementing these long-term resources? We talked about long-term resources;
examples of those are, say, IRIS, which is actually right here in Seattle, the Seismologic Archive,
or the Geodetic Archive that's [indiscernible]. NCAR, for example, we talked about is too big.
You can't say let's start running NCAR and do it all in the cloud, but there may be others that you
could say, well, what would happen if you run it in the cloud? And we thought one of the
interesting side effects of that could be that if a facility was running in the cloud, actually, the
management of that facility may be easier to transition from one group to the other. So you don't
have a lockdown situation where a group owns a bunch of physical resources and that's the
reason why they should continue to run that resource, right? Another idea we talked about was
equipment reuse. It's a side issue, but there's actually a lot of discussion about it. So there's a lot
of money and effort spent on acquiring huge systems which, after four or five years, are basically
sold for scrap metal, in most cases. But could you take systems like that, reconfigure them
and maybe use them as a cloud environment that could be on premises or an intermediate cloud
that maybe NSF runs? And there are many reasons why you want to do it. This could be an on
ramp into a public cloud. It could be for doing certain kinds of computation that maybe could
not be in a public cloud, etc. And finally, we also talked about a group like this of experts could
help scientists think through what we might call the pre-award and post-award situations, so
when you're writing a proposal, how should you think about what should you budget for?
Should you really use the cloud? Do you have all the cost things in there? And once you get
funded, how would you then actually go about implementing these things? So these are all the
issues that we talked about. What is the plan of action we came up with? So we thought, okay,
what we should do -- what we can envisage happening is that there is some kind of a repository
of VMs. There are geoscience-related VMs. So if a scientist says I've got some seismic data, I
want to use GMT and this and that other software package, and if it's already there, there's a VM
that has all of that capability, they could just spin it up and they're off and running, right? So we
thought it would be interesting to create a depot or a repository of sample VMs with sample data
sets already there. Then we need infrastructure. So it would be good to find partners who can do
it, and this is where, right on the spot in the meeting, thanks to Microsoft, we got a $50,000
contribution towards resources in Azure, so we are going to run with Azure first. Education,
training and collaboration, so there are other groups. These are the other assembly groups in
EarthCube that talk about education, that are talking about cataloging resources, so all the
metadata issues. As I mentioned, there are existing data facilities in the geosciences, so there's
an assembly group of those folks. So we figured that our group should be talking with all of
them on different issues. And also, there are other existing fairly big community groups like
ESIP and the Earth Science Working Group, who have already thought through actually some of
these. So some of the folks who were at the workshop, I'm not personally involved with, but
others in our group were actually already involved in some of this kind of thinking that's
happening in other communities, maybe the NASA-oriented communities and other agency
communities and international. And so clearly, we need to connect those. All right, so with that,
we should create an interface or a portal that would have this kind of community interface to
cloud resources that they can get ahold of. We also talked about this notion that, if you knew
what sort of software you wanted and what data you wanted, maybe you could come to a portal like
this and say, I need this, this and that, and then we could create the virtual set of resources for
you, and then there's the equipment reuse question. So here's what -- I'm down to the next steps now.
So here's what we thought we would do in terms of some concrete things we can try. It's going
to be with Azure, because that's our initial resources we have freely available to us and the
expertise. Wenming has also provided us some programming support for this. So there is a
project I am familiar with, since I am the PI on this. It's one of the data facilities funded as a
facility by the Geosciences Directorate. It's called Open Topography, and it basically collects
topographic data from the community. Anybody who's flying a campaign can contribute data
into this, and our job is simply to serve it to the community. We don't collect the data. We don't
actually process it and do science. We only provide -- we are just a data resource. That's what
this is funded for. So you can come, go to opentopography.org, and you'll come to a portal.
Behind that, we run a bunch of servers. We have data-hosting facilities, so that's already here.
And so we could take that as an example and say, okay, how would we -- it turns out that even in
our project right now, we're thinking about how would we burst into the cloud, because these are
physically restricted resources we have. We're already right now looking at bursting into the
supercomputer, for example. We could go into Gordon or one of these resources that NSF has,
but we are also looking at how do we burst into the cloud. So it kind of is a natural fit, so the
notion would be could we create an OpenTopo VM that's readily available with some data. So
you put some data in the cloud. That's the read-only data. That's your input data sets, and the
scientists could come and spin up a VM, do their work and create some work product. So there's
a work space. So what we will do is we'll take some sample data sets, so talking literally in
terms of next steps -- we have heard of San Andreas, so we have a San Andreas data set, imaging
of the entire fault, so we could make a copy of that. So in general, what will happen is the data
sets of interest will have to be copied and made available. And then, we're not going to do this in
the first time around, but in general, what could happen is you do work in the cloud, and at some
point, you might decide to publish it. You might say, this is a good result, I need to persist this.
And maybe at that point, it goes off into some other resource, maybe a library somewhere or
whatever, which has the job of being a much more long-term, persistent resource. So that's the
idea there. So here's what we are thinking of doing. We'll have an ECCC, EarthCube Cloud
Commons portal, which you would come to. And that's the first screenshot I showed you, is I
guess the pre-alpha version or something like that. What you might see there is a VM depot.
You see a bunch of VMs, so initially, what we talked about is we can take what we do in
OpenTopo, strip down all this stuff, create a sort of core VM out of it, and you might see some
data sets. So you might see some disks that say, here's the San Andreas data set. So you could
spin up a VM, attach a disk to it, and you're off and running. You can attach multiple disks to it if
you wanted. And other examples in the group, Charles Newman from Minnesota works with the
Polar Project, so they have a lot of -- huge amounts of imagery and data from the polar region, so
there could be maybe a polar imagery VM. There could be a seismic VM, etc. So there's just
examples of the kinds of things that could be there. And then you have the user community
coming into that and using it. So that's really it, and the last set of things to think about is, so as
we started thinking more about this, when Wenming was in San Diego last Friday, what I
showed you here in a sense is sort of infrastructure as a service. I mean, there's a VM, there's a
disk. So as a scientist, I come in knowing that I'm actually asking for a machine. I'm
asking for a disk. For some scientists, this may not be that natural a way of interacting, so the other
option is maybe more as a platform. That is, the entire -- in this particular example, it would be
like saying, take the entire Open Topography system and just move it into the cloud. And when
you come there, you see the interface you see today, which I have to say, is a fairly user-friendly
interface. Our users like it a lot, where you just click on stuff and things happen behind the
scenes. It's kind of interesting, and this is where I have no idea how we should proceed, but if
you did it that way, in a sense, it's a vertical lab. Open Topography is a vertical lab for Lidar,
and it doesn't necessarily have things built in there that allow you to spread and connect to other
data. The whole idea of EarthCube is to connect to lots of data. So there is some systems design
thing involved that we should think about. It doesn't have to be vertical that way. Maybe there
is a much broader catalog and OpenTopo is just one of these things that goes and refers to that
catalog and says, oh, I only deal with Lidar data, and this broad ocean of data -- that's some term
that's used, but it's pools and lakes that are used by other companies, so it would just be an
EarthCube pool, EarthCube lake of data. So, anyway, there is that issue. There's this other issue
of -- so therefore, if you try to do this as a VM depot kind of thing, one concern is are you going
to confuse the end user? So if they're scientists and they're coming and seeing a lot of gadgets
and widgets and stuff, there's a concern there, though we do know that many of the science folks
are also quite power users. So how do we balance that? And as I said, the second bullet is what
I just mentioned. That is, if I go too vertical on the apps, then I might be missing out on
opportunities in terms of connecting all these data sets together in a horizontal way. So how do
we do all that? So how do we provide interfaces? I think there's a big issue of not just the
conceptual level at which you want to create these resources, but what should the interfaces look
like? I mean, I really love the Azure interface, which I got introduced to last week, but then I
thought, yes, that's me as a computer science geek, but I like it. But a geoscientist might say,
what the heck is all this? A cube, and what do I do with this? But I think you won't have -- so I
think you will want to think very carefully about what's the right level of abstraction, or maybe it's like
layers in an onion. Some people get very high-level abstraction. Others, if you're brave enough
to press that button, then you go down into that world. Because the point being, I think at the
same time, you do want to provide some flexibility. I certainly know geoscientists who can do
this stuff. They just say, give me the machine, I'll do this. So that's where we are.
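To make the VM depot and attach-a-disk idea above a little more concrete, here is a minimal, purely hypothetical sketch of the bookkeeping such a portal might do: a small catalog that maps a requested software stack and named data sets to a provisioning request. Every identifier below (image names, data set names, sizes) is invented for illustration; this is not EarthCube Cloud Commons code or any real Azure API, just the shape of the idea.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entries -- none of these identifiers are real.
VM_IMAGES = {
    "opentopo-core": "depot/opentopo-core-v0.1",   # stripped-down OpenTopo stack
    "polar-imagery": "depot/polar-imagery-v0.1",
    "seismic": "depot/seismic-v0.1",
}

DATA_DISKS = {
    "san-andreas-lidar": {"size_gb": 500, "read_only": True},
    "srtm-dem": {"size_gb": 200, "read_only": True},
}


@dataclass
class ProvisioningRequest:
    """What the portal would hand off to whichever cloud provider is used."""
    image: str
    data_disks: list = field(default_factory=list)
    scratch_gb: int = 100          # read-write workspace for work products


def compose_request(software, datasets):
    """Turn a scientist's 'I need this, this and that' into a VM spec."""
    if software not in VM_IMAGES:
        raise ValueError("no depot image for %r" % software)
    missing = [d for d in datasets if d not in DATA_DISKS]
    if missing:
        raise ValueError("unknown data sets: %r" % missing)
    return ProvisioningRequest(
        image=VM_IMAGES[software],
        data_disks=[dict(DATA_DISKS[d], name=d) for d in datasets],
    )


# A scientist who wants the stripped-down OpenTopo stack plus the San
# Andreas imagery attached as a read-only disk:
print(compose_request("opentopo-core", ["san-andreas-lidar"]))
```

In principle the same request object could be exposed directly (the infrastructure-as-a-service style described above) or hidden behind a more vertical, platform-style interface; keeping the data sets in a shared catalog, rather than per-project silos, is what would allow the horizontal connections EarthCube is after.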
>> Kenji Takeda: Fantastic. Thank you, Chaitan. So that's fantastic. So we've got lots of
questions. Jeff.
>>: So I really like this. I think one thing that you need in terms of your final remarks is there's
a difference between providing data and providing data sets, and so one of the things that I think
you had -- one of the models to be thought about here is what the USGS has, and FetchClimate, also:
the idea that it's sort of a seamless data search, that I can go to a map and I can draw or specify
a bounding box, and then from whatever source is available, I can get the elevation data from
that bounding box. Whereas in the data system, what you've got is a set of tiles, probably. And
you've got files for each tile, and they'll be in different formats, depending on the source of the
data. And being able to have a seamless source of topographic data, specified by the sort,
whether it's Lidar or SRTM or something like that, I think is a really cool idea. And that really
would make it useful to a very wide spectrum of sophisticated -- even a sophisticated user would
like it, and the naive user would really like it.
>> Chaitan Baru: In fact, to that, I would say right now in Open Topography, we do see the 80-20 rule. So if you go there, point clouds are what we serve, and that's what we all get excited
with, because they're complicated data sets, but we also have precomputed DEMs, including
SRTM. And 80% of the traffic is to the precomputed DEMs. People just want to come
there, draw a box, clip the DEM they want and go away, and they trust that the facility knows
what it's doing. The 20% are the power users. That's where this flexibility thing comes in, and
interestingly, in the power users, increasingly, people are beginning to say -- and there are not
too many, but there are certainly more than two scientists who have told us, don't give me just
the point cloud data. Give me the stuff that went into making the point cloud data, because I
want to make it --
>>: That's Lidar data that you've got.
>> Chaitan Baru: Right. So I think it would be very cool to be able to support that whole chain,
up and down.
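As a rough sketch of what sits behind the "draw a bounding box, clip the DEM" workflow being described here, the core step is picking the tiles that intersect the user's box, regardless of where they came from. The tile index below is invented, and a real service would also have to reproject, resample and blend across sources in different formats; this only illustrates the selection step.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    name: str
    source: str       # e.g. "lidar" or "srtm"; must appear in the prefer list
    bounds: tuple     # (min_lon, min_lat, max_lon, max_lat)


def intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


def select_tiles(index, bbox, prefer=("lidar", "srtm")):
    """Pick tiles covering bbox, preferring higher-resolution sources first."""
    hits = [t for t in index if intersects(t.bounds, bbox)]
    return sorted(hits, key=lambda t: prefer.index(t.source))


# Invented tile index, purely for illustration.
index = [
    Tile("sa_lidar_001", "lidar", (-122.0, 37.0, -121.5, 37.5)),
    Tile("srtm_n37w122", "srtm", (-122.0, 37.0, -121.0, 38.0)),
]
print([t.name for t in select_tiles(index, (-121.8, 37.2, -121.6, 37.4))])
```

The 80-20 split Chaitan describes maps onto this directly: most users only ever want the mosaicked result of a query like this, while the power users want the underlying tiles, point clouds, or even the raw inputs behind them.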
>>: Is this going to be integrated with things like the IEDA data sets, the [indiscernible],
sedimentary? All that stuff seems like they've got a pretty good handle on how to do some
mapping and integration across those different kinds of data sets.
>> Chaitan Baru: Yes, so you're asking the EarthCube, the cosmic EarthCube question. So the
IEDA is on the geology, geochemistry and stratigraphy and that kind of stuff. I think that's what
-- we are not planning. You saw what we are planning to do. Well, EarthCube's idea is to bring
all of that together in some nice, seamless sort of way, but bring the geology stuff with the
geophysics stuff and the ocean stuff, all of that. Yes. And IEDA is a big player in EarthCube.
>>: How is it linked with the OOI, the observatory infrastructure? They're undoubtedly building
streaming data infrastructure.
>> Chaitan Baru: Well, that's a good question. So this is one more of those things where I think
the community can help NSF, and it's hard.
>>: But it comes from a different part of NSF money.
>> Chaitan Baru: Correct. That's the problem.
>>: All right.
>> Chaitan Baru: But I wish -- well, since you gave me the opening, I've got to walk in. Both
NEON and OOI, which I'm very familiar with -- NEON, I was on the planning committee. I was
just on the original team. I think there's big opportunities there. Actually, I think they can open up
the tap and at least let the computer science community at the data. It's unfortunate, I think
they're working on very sort of old protocols of when data will be made available. I know OOI
tried to use the Amazon approach, but I'm actually not sure where they are right now. But they
originally tried to do this thing of streaming the data into Amazon, and some of OOI is here, at UW.
Yes.
>> Kenji Takeda: Excellent. Just a comment, really, that the session I think has been fantastic,
because we've seen lots of different angles, and this project is exactly like what's happening at the
British Library, where they've got virtual machines for a million images of scanned 17th, 18th
and 19th century books, and they started with 20 terabytes of data in virtual machines, and
they're now building it out in this PaaS with Azure blob storage. And I think it shows how the
cloud cuts across different disciplines with common challenges, but as Marty said, common
research opportunities. So with that, I just wanted to close the session and thank all of the
speakers. Thank you.