>> Kenji Takeda: Okay, so I hope you enjoyed the morning session. We've got a very
interesting session this afternoon. The one here is on Communities in the Cloud. So I think one
of the key areas where cloud computing can really help is to allow researchers from around the
world to work together, so we have four talks in this session, and across a variety of different
areas, actually. And Steven Roberts from the University of Washington is going to kick off for
us. He's actually in the School of Aquatic and Fishery Sciences, and he used to transfer gels and
I guess do wet science, but now he says he spends most of his time transferring files. And so
hopefully we'll hear about how the cloud might be able to help him, so please welcome Steven.
>> Steven Roberts: They showed me the keyboard. Where was it? Okay, yes, thanks for
inviting me here today. As Kenji said, I'm a biologist, and what I thought I would do today, and
based on some of the feedback I got from Kenji, was talk about how we use the cloud from a
domain scientist perspective, and I'm focusing a lot on our challenges. So I'm going to kind of
show you what I normally do every day and places where we are thinking a lot about where it
could be a little bit better and a little bit easier on us. And I've never been set up so well for a
talk before, because I won't have to talk about any of the genomics, but that's essentially what
we do. That was great. So this is a slide I put up in all my talks. It's a
little bit meta here, because I will actually talk about open notebook science at the end of it and
the cloud and how we deal with our science, but this is just an indication. You're free to share
these slides, and more information about how to contact me, but I'll get back to the open
notebook science near the end of the talk. So this is really not what I look at -- I normally look at
a computer screen every day -- but to give you a little bit of background about my science, we
really study shellfish. I'm by training a physiologist, a comparative physiologist, and I try to
understand how the environment affects oysters. This is an oyster farm in Puget Sound. This is
an oyster very close up. What we do is we get a lot of the data that we heard of before in terms
of genomics. We look at variation. We look at gene expression patterns, we look at protein data.
We look at epigenetic data, and we try to answer questions about biology and physiology as it
relates to both environmental biology and aquaculture, too. And so, as was mentioned before,
we are not in need of data. We can get data very readily, and I'm just going to show you, this is
kind of my schematic of how we deal with it. So remember that sequencer before and different
type of sequencer, that's kind of represented by that pink arrow. So a ton of it's kind of flowing
into our lab. Normally, we house it in some kind of a network attached storage device, and what
we're doing to do the analysis, similar to what you saw before, is use the cloud. So we're using
things such as -- let me see if this is a pointer or not. We're doing such things as those
assemblies, gene expression analysis, finding SNPs. We're doing it locally. That's what that's
representing, so a lot of computers. We're going back and forth to our NAS, locally, and then
we're using the cloud. So a lot of that primary analysis that we're doing is done in the cloud and
things like Galaxy and iPlant, which I think I'll show you a little bit more about, and I'll talk a
little bit about SQLShare. Hyak is just our in-house kind of supercomputing system, but that's
all command line. These are more GUI interfaces, those middle ones. And this is just to
represent, too, while we're doing that sequence analysis, we have to rely on these external
sources of data, so these are just gene databases, annotations, what a gene does and so on. I am
going to give you a little bit more context, so this is just one of our studies and to get an idea of
some of the questions we might ask and what kind of data that we need, we look at DNA
methylation in the oyster genome. So this is just one gene. We want to know how the black
lollipops or the white lollipops affect the assembly within a gene, so that's representation of
DNA methylation. So we need to know from the sequencer how they are being arranged. We're
actually interested in gene expression levels. We're also actually interested in SNPs and
variations on top of all that. So we're trying to aggregate a lot of data together. And when we
think of data, this is kind of what it looks like. So this might be the very raw. It's got a little bit
of quality information in there and the As and the Cs and the Ts and the Gs. We spend a lot of
time mapping it back to reference genomes, and then we're making tables and then we're joining
it to other kind of aggregate data and different data sets. And we saw some numbers before, but
just in, like, one-seventh of a sequencing machine, some of the raw data would be on the order of
70 gigabytes. Once you map it to a genome, it doesn't get too much smaller. And then we're
making all kinds of tables and we're going back and doing it again. It gives you an idea of the size.
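For readers who haven't seen this kind of raw sequence data, here is a minimal sketch of what one record looks like and how you might count reads and bases. The record is invented for illustration; real 70-gigabyte files would go through standard tools rather than this toy parser.

```python
# A toy look at raw sequencing data: each FASTQ record is four lines -- an ID,
# the bases (A/C/G/T), a separator, and a per-base quality string. The record
# below is invented; real files would be handled with standard tools, not this.
from io import StringIO

fastq = StringIO("@read_0001\nACGTTGCAACGT\n+\nIIIIHHHGGFFE\n")

reads, bases = 0, 0
while True:
    header = fastq.readline()
    if not header:
        break
    seq = fastq.readline().strip()
    fastq.readline()            # the "+" separator line
    fastq.readline()            # the per-base quality string
    reads += 1
    bases += len(seq)

print(reads, "reads,", bases, "bases")
```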
So we use the cloud for a lot of these, because it's big and it's big. So what I mean -- it's
just not a typo. It's big in the sense that it's large in size, so we leverage the cloud. It's also big kind
of in genomics, particularly in what we do, there are a lot of questions that people can ask about
our data that, let's say, we don't really care about, or it's not really our focus, so we share
the data -- there's a lot of big questions. It's compute intensive, so we can take some of those
computations, both CPUs and RAM in the cloud, and the cloud is great for education purposes.
So I also teach, and it's a wonderful tool for not having everybody have to
configure their laptop for analyses. So some of that primary analysis, like what we heard in that
other talk, like just doing those assemblies, RNA-seq analysis, finding SNPs, finding DNA
methylation, is done in these two platforms, which is what we rely on heavily in addition to our
own small clusters and high-end computers, and we go back and forth. I don't really
want to talk about these, but they're similar to what you can imagine what we saw before. Most
of us have to deal with the command line. These are some nice tools that allow the GUI
dropdown interface, high-speed computing resources. Once we put all that together, if you add
up the time, we actually spend most of our time in SQLShare, which is developed at
the eScience Institute at the University of Washington with Bill Howe here. We work a lot with
him and Dan Halprin. And so we do all this analysis, we do it in iPlant, we do it in Galaxy, we
do it on our own computers, and we're left with all these tables. But they're in disparate locations.
So what we use SQLShare for is to bring together these tables, which are just tab-delimited or comma-separated files, and do simple SQL statements, SQL statements that even a biologist can
remember and know how to do, to answer questions like what genes are expressed during high
growth, how many exons does a gene have, any kind of thing like that. This is just the interface,
freely available on the Microsoft cloud. I'll show you a couple of examples of how we actually
use it. So this is an example. In the very background, you can kind of see -- when I log on, you
can see all of my tables, and you can just upload data. It's pretty straightforward. This is an
example. This is kind of a long statement. This is some of our work. One of my graduate
students wants to understand the effects of ocean acidification on the proteome of oysters,
so she has all of these mass spec tables, tons of them, tons of individuals, and she can kind of
aggregate them and sum them in what I would call a complex statement. This is something more
common of what we might do. So this is just a simple join. So we have the oyster genome, join
it to the actual sequences, so each gene has its own sequence, and then we can just write a simple
statement where the term has the word methyl in it or has the word histone in it. So we can
easily go in there and grab all the genes that have, in the name, either methyl
or histone, and then we can study them. So that's more of a realistic way of how we use it on an everyday basis.
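To make that concrete, here is a sketch of the kind of join-plus-keyword query being described. The table and column names are hypothetical, and SQLite is used only so the example runs on its own; in practice the same statement would be typed into the SQLShare interface.

```python
# A sketch of the join-plus-keyword query described above. Table and column
# names are hypothetical; SQLite stands in for SQLShare so the example is
# self-contained, but the SQL statement itself is the point.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE oyster_genes (gene_id TEXT, annotation TEXT);
CREATE TABLE oyster_seqs  (gene_id TEXT, sequence TEXT);
INSERT INTO oyster_genes VALUES ('g1', 'DNA methyltransferase 1'),
                                ('g2', 'histone H3'),
                                ('g3', 'actin');
INSERT INTO oyster_seqs  VALUES ('g1', 'ATGGCT'), ('g2', 'ATGCGT'), ('g3', 'ATGAAA');
""")

rows = db.execute("""
    SELECT g.gene_id, g.annotation, s.sequence
    FROM oyster_genes AS g
    JOIN oyster_seqs  AS s ON g.gene_id = s.gene_id
    WHERE g.annotation LIKE '%methyl%' OR g.annotation LIKE '%histone%'
""").fetchall()

for row in rows:
    print(row)
```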
And with Bill and Dan's help and a little bit of seed money from
the USDA, we're actually building this out for the community, so all these data tables can be
public or private, and they're quite large, and a lot of these are universal, so annotating the
human genome or every species there is and let's say Swiss-Prot or NCBI, we can get all these
tables in there, and then we can -- we've set up, done some very preliminary tutorials about how
other researchers could use this. In this case, we're doing oysters, so they can do oysters, but
they can see how it can be used for other species. And so this in the background here is just a
little bit -- all of these kind of cloud computing resources, SQLShare, Galaxy and iPlant, are very
good about being able to share. You can click a button and all the data's public. Galaxy,
actually, you can make the workflows public, so I think that this is probably an RNA-seq and
differential gene expression analysis, so you might take two libraries, you might clean them up,
you might align them, you might measure what's up and down. You can actually save your
workflow and make it public, which is great. Same thing with iPlant. You can make your data
public or you can make your workflows public. And so you can kind of think about that as
collaboration, but I would say it's not really ideal collaboration in the sense that we would like
it. It's kind of what I heard earlier this morning: we don't really have the ability, or haven't
figured out how, to go in there and actually work together on the same data set, or, like I heard
this morning, leave a sticky note or modify the data and have some kind of tracking system on
that, so that's why I have that asterisk there, and that's kind of one of the challenges we see in
that. And then I'm going to kind of shift a little bit in the same vein, but talk a little bit more
about this open notebook science and some more challenges in what we do, with a little bit of
cloud in there, too. But this is what we do. You can imagine, it's kind of messy. Actually, this
might look pretty, but it's very messy. So we go back and forth, back and forth. We might
change a little variable, and we'll go back and forth, and it's very hard to keep track of it,
especially with these large data sets, and they're so big, we hold them locally. And it's kind of
related, but we -- everybody should keep track of what they're doing. We happen to use open
notebook science, so our notebooks are digital, mainly because, as was said at the beginning,
we've gone from running gels to actually sitting behind a computer all day, so it's a natural
progression. This is a quote by Jean-Claude Bradley about kind of a definition of what open
notebook science is. He's a chemist who, years ago, kind of led the way on this. We actually use
three major platforms for this type of documenting what we do, trying to keep track of our
processes and our data: a wiki format, something like Evernote, and then I use mainly IPython,
and that's kind of where we're going. And I like IPython in a sense, because this is an example
of a homemade workflow. It's mainly a bunch of just shell scripts, but it allows us to document
the work, but also, for example, we can use the Python client for SQLShare to go in there and
run those queries. We don't really have to use the user interface, though it's great, and when we
do it the first time, we want to go in there and look at the data and see how it changes. But once
we know it, we can actually automate it in something like IPython and use a lot of different programs.
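Here is a rough sketch of what one of those homemade workflow cells can look like: a documented shell step plus a query pushed out to SQLShare. The command, file names and SQL are invented, and the SQLShare call is left as a stub because the real Python client's interface is not reproduced here.

```python
# Rough sketch of a notebook cell in a homemade workflow: run a documented
# command-line step, then send a summary query to SQLShare. The command, files
# and SQL are invented, and run_sqlshare_query is only a stub standing in for
# the real SQLShare Python client.
import subprocess

# Step 1: a command-line analysis step, captured alongside its exact parameters.
params = ["my-aligner", "--reference", "oyster_genome.fa",
          "--reads", "sample_01.fastq", "--out", "sample_01.sam"]   # hypothetical
subprocess.run(params, check=True)

# Step 2: push a summary question out to SQLShare instead of eyeballing tables.
def run_sqlshare_query(sql):
    """Stub: in the real notebook this would call the SQLShare Python client."""
    print("Would send to SQLShare:\n", sql)

run_sqlshare_query("""
    SELECT gene_id, COUNT(*) AS reads
    FROM [sample_01_mapped]          -- hypothetical uploaded table
    GROUP BY gene_id
    ORDER BY reads DESC
""")
```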
So this is kind of one of our biggest challenges, and what we spend most of our time
dealing with is the reproducible nature of this and trying to document it. When I say we, what
I'm thinking about is my graduate students, and being able to understand and remember what
they did, and also me, too -- to go back to work on a few projects next week and wonder, how in
the world did I do that? So I'm at fault, too. We haven't found any perfect
solution, but I want to just point out some of the positive aspects of each of these platforms that
we're using. Wikis are nice. They are the one place where you can collaborate, so I can go in
there and make a comment in somebody's notebook and ask them a question, or I can edit it, and
it will keep track of it. So it has that built-in version control, and it's easy to search. I can go in
there if I know something was done on a clam on Tuesday, I can just search it and find it, and it's
very easy for us to publish, so that's one aspect of what we do -- we publish it online. Evernote,
something like that, it's very simple. It's cut and paste, multi-platform. IPython is great because
you can actually run the programs or the algorithms inside the notebook. You can do versioning.
The asterisk there is that I have it hooked up to GitHub so that I can kind of keep track of any
kind of modifications in my notebook. It does have a lot of downfalls, though. It's very hard, at
least for us, to collaborate, to work on it at the same time. And -- if it's not in this line, that
means it's a negative -- it's very hard to search, and it's quite challenging to publish, too, so
that's kind of what we're dealing with, and we haven't really found a great solution to keep track
of what we're doing, although we imagine there could be one. And this is kind of just a
summary of when I'm thinking about the challenges of where we are now. If you were
a biologist, you'd get this, because this is a male oyster and this is a female oyster, and so, at the
beginning, this is how we reproduce. We just put them in a jar and we make baby oysters. But,
anyway, one of the big things is talking to each other. That's what that white arrow means: we
really would love it if we could integrate everything together, if these things would talk to each
other in terms of documentation and make it almost error proof, idiot proof, okay? Any time we start
cutting and pasting, copying, we make mistakes, and that goes back to the reproducibility,
actually. Versioning, we can kind of deal with that with our notebooks, but we have a very hard
time with data. These large data sets, medium size, however you might think about them, 40
gigs, we might tweak them, change them, three or four or five, six, 10, 20 times, and how to keep
track of what one person did to that data set, what another one did -- we have no idea how to deal
with that. I guess that goes along with the problem of what happened to the data
along its trajectory from raw through those many, many, many steps. Collaboration, we
haven't really found a good solution. I can go in and see what somebody did in terms of a
workflow but not really like tweak it and put a sticky note on it, something like that. That would
be wonderful. Simple sharing, this is more along the lines of probably the notebooks side of it.
Some of the platforms are easy, but something like IPython, we haven't found a very streamlined
solution for that. And discoverability. Remember, not everybody is able to, or interested in, the open
science aspect of it, but we happen to be, and we would like it, if we make it publicly accessible,
somebody to be able to find it, too, and so we haven't really tackled that. So those are kind of the
challenges, and we think about these things a lot, so I think about these things a lot, and these are
the kind of things that kind of drive what we're doing. We want to make all of our science
reproducible, and then we want to make it open in terms of kind of solutions. So with that, I just
wanted to make a few acknowledgements. As I mentioned before, Bill and Dan Halprin have
been great in terms of the SQLShare and kind of working with us, as just biologists and how to
aggregate data, answer biological questions. These are the people in the lab and the major
funding sources that support my newer work. So with that, I thank you very much.
>> Kenji Takeda: That was fantastic, Steven, so I think we've got time for some questions. So
Dennis has a question.
>> Dennis Gannon: Have you looked at NBViewer, the notebook viewer for IPython?
>> Steven Roberts: That's how I do it now.
>> Dennis Gannon: But that's still not enough?
>> Steven Roberts: Still not enough, because it's one page, and I want to go to a page where you
can search everything I've done, but that is a great tool.
>> Kenji Takeda: I think one thing I would say, I was actually in Oxford. There was a
workshop around reproducible research a few weeks ago, and a lot of these issues came up, and
we have a few of the other projects on the Azure for Research Program, and I think that later
today, on the Microsoft Research Connections website, there will be a blog post which highlights
some of these projects. There's one in France called Xanadu, which is actually collaborative
IPython notebooks with publication channels for different research groups, and there'll be a link
to that on the blog later today, so it's a very topical subject. Question in the back.
>>: Question for you on publishing results. Do you do anything with tiered access to them? I
work with scientists who want to be able to have everybody see it, some people see it and only
themselves. Are there any tools you've found that are particularly useful for that?
>> Steven Roberts: Well, I know you have the capability in things such as iPlant to do that, and
Galaxy. There's definitely tiered access in that, and that's -- yes. And I should mention, I didn't
really talk about it, but iPlant is not just for plants. I don't work on plants, but it's a great
resource. It's a great tool.
>> Kenji Takeda: Okay, any more questions? Okay. Thanks, Steven. And so Evelyne Viegas
from our Microsoft Research Connections Team is going to talk to us about collaboration and
around machine learning that we heard about this morning, so just as we switch laptops. So
Evelyne is our Director of Semantic Computing and is interested in computational intelligence,
and she is hopefully -- there we go -- going to tell us about one of our new projects called
CodaLab that we've been working on for quite a while now, and so over to Evelyne.
>> Evelyne Viegas: Yes, thank you, Kenji. So let me first thank the organizers and the chairs,
because the way they have set up the talks, it's like half of my talk was part of the previous one,
so I'll be able to spend more time and answer questions. The reason I'm saying that is because in
the previous talk, we just heard about issues around data sharing, reproducibility and
collaboration. As you can see, that's exactly it -- I'm going to address part of it with CodaLab. So
in my current role at Microsoft Research, I'm working on data-driven research, and what I'm
focusing on is really trying to accelerate the pace of innovation by creating a community, an
online community, around sharing data, but also code, being able to execute code and collaborate in
a more effective way, by really trying hard to lower the barrier for researchers to be
more productive when they do experimentation. I'm focusing, as Kenji mentioned, on this
project. We started looking at the research process when we do machine learning. As you may
all know, it's a data-driven world. I think the audience here, we all know that we're bathed in
data, an overflow of data, and we're using machine learning as an enabler of decision-making. But
there are still a lot of issues, and some of them, when I show them here, are going to
resonate, like deja vu of the previous talk. So first of all, there is a lot of duplication of
effort. So when we talk about that, when somebody starts with an idea, what happens? Well,
you have to start looking for some data. You have to spend time preprocessing the data so that
you can use it, and because we are all scientists, then you also need to compare what you've done
to what other people have done. So this is about, then, finding the methods, other algorithms,
potentially code, and implementing those competing methods so that you can compare them and
do science. You need to run all your experiments, and then finally, you create tables, because as
we know, as humans, it's easier to look at data which has been visualized rather than just that
HTML or XML. And then we summarize the results in papers. So that's the first one, the issue
of duplication of effort. Now, assuming that we have access to all that, as we just heard earlier,
there is the lack of reproducibility, which is still a very big issue. And the reason, some of the
reasons why it's so difficult to reproduce is because, first, you need to find the data, have access
to the data, but even when you find the data, it's rarely made available in a way which is readily
usable, so that's another issue here. Finally, when we talk about the absence of comparable
baselines, most of the time, we end up comparing apples and oranges. And the reason for that is
that even, actually, if we can find the data, if we can find it in a way which is usable and that we
can reproduce, well, often, it's maybe a different version of the data. We don't know which
version, something we just heard in the previous talk. So are we really comparing apples with
apples, right? Another issue. And finally, the last one, which I don't think I've heard addressed
yet here -- in one way, it doesn't relate completely to the first three issues, but what we're trying to
do in CodaLab still relates to some issue which is going to resonate for you, and specifically
for people who talk about sharing data -- is that your data, and I will add your code, is my data, my
code. But my data and my code are my data and my code, right? So one way of being able to
focus the community on working on the specific data sets, trying to share some code, is by doing
some challenges or some competitions. But often, I will argue that a lot of those challenges are
just kind of wasted. If we just do a challenge, a competition, we have some winners, and then
that's it. What happened to the data? What happened to the code? We go back to the first three
issues that we were having there. So, having said that, just to give you an example, I think
probably everybody here is very familiar with the research process. There are various
forms of those tables and processes, but this has: find the data, clean the data, format, e-mail the
authors to find the data, compile the code, reimplement -- or rather, yes, reimplement -- run
experiments, etc. And so basically what happens is that even when we do that, what I was
talking about earlier, the reason why we cannot really do reproducible research, is because we
cannot have, first, exhaustive comparisons. It's very difficult to do. Even if we work on some
data sets, the same data, assume it's the same version, maybe, when we start comparing, we
compare the same measure. In this case, we select accuracy, but often the comparisons are
just uncontrollable, right? If I look at the previous method, so we took -- the measure is the
same, right? We're talking about accuracy, but actually, we may have been using different
sampling. Some might have been using optimization. We use different cross-validations for
comparisons, and the list goes on and goes on, and this is assuming that the software is bug free.
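As a small illustration of what an apples-to-apples comparison looks like in code, here is a sketch where two methods are scored with the same metric, the same cross-validation folds and the same random seed. scikit-learn and its bundled iris data are stand-ins here, not anything used in the talk.

```python
# Minimal sketch of a controlled comparison: both models are evaluated with the
# same metric (accuracy), the same folds and the same seed, so differences come
# from the methods rather than the protocol. scikit-learn's iris data is a
# stand-in dataset, not one from the talk.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # shared protocol

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(name, round(scores.mean(), 3))
```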
So what we need is a way to start reducing the time it takes to prepare all these
data; to enable true reproducibility, which means taking into account the issues that
Steven was talking about previously in terms of versioning and provenance, and how we can
document all that with metadata; and to establish benchmarks, common measures, so that now we
can compare apples with apples and not apples and oranges. And the last one, which I won't
touch on too much here, is evolving from just competitions -- where we run a competition and then
we have some results and then what's next -- to the idea of a competition which is kind of a live
competition, so to speak. Actually, out of curiosity, how many people here in the room
participate in challenges or competitions, over the one year, during the year? Okay, I see two
timid hands. Okay, so I won't be talking about competitions too much here. So the answer is
software that we are developing, open-source software, working with the external research
community, called CodaLab, which is an open-source platform to empower
communities to create and explore experiments on one side, to do that together, to collaborate,
and also be able to create competitions which are not completely separated from the
experimentation side, so that people are able to come back to the results of the competition and
use that in their further experimentation. Our community leads are Percy Liang from Stanford
University, and he's a machine learning and natural language processing professor, and Isabelle
Guyon, who's been working on challenges in machine learning for decades. The main principles
of CodaLab, and by the way, I'll be doing some demos for those of you who are interested, later
in the day, if you want to see what it looks like, but there are three principles, underlying
principles, for CodaLab. One is modularity. We are talking about artificial intelligence
problems, which require the effort of the entire community, and we cannot all be
experts in everything. It's very, very difficult, right? So some people may be experts in some
areas, but together they can participate to solve some of those hard AI problems. So it's about
modularity. The other principle is about immutability, and the main idea there is the idea of Git
version control, so that now I have data sets, and if somebody is going to modify one, it's another
version of the data set -- so how do we do that, right? So the main idea is all programs, like I said,
are just run and written once, and then we capture metadata. Enable collaboration without chaos, so
still lots of work to be done there, how do we enable that? And then finally, capture the research
process in a way which is truly reproducible. And that means understanding the data, the version of the data, the provenance, and the same thing for the code.
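To make the immutability and provenance idea concrete, here is a minimal sketch, not CodaLab's actual implementation, of what "run once and capture metadata" can look like: hash the input and output files, record the exact command, and append an immutable provenance record. The file names and the command are hypothetical.

```python
# A minimal sketch of "run once and capture metadata": hash the inputs and
# outputs, record the exact command, and append an immutable provenance record.
# This only illustrates the idea, not CodaLab's implementation; the file names
# and the command are hypothetical.
import hashlib, json, subprocess, time

def sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

command = ["sort", "reads.txt", "-o", "reads.sorted.txt"]            # hypothetical step
record = {"command": command,
          "started": time.time(),
          "inputs": {"reads.txt": sha1("reads.txt")}}
subprocess.run(command, check=True)
record["outputs"] = {"reads.sorted.txt": sha1("reads.sorted.txt")}

with open("provenance.jsonl", "a") as log:                           # append-only log
    log.write(json.dumps(record) + "\n")
```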
And, finally, the last principle is the principle of literacy, so this is very much in line and it may loop a little bit. This
comes from the idea of the IPython, the IPython notebook, with the idea of being able to mix text
and code. And basically, when I do an experiment, I may as well capture it as I do it, be able
to put my code in, and share directly with other people the experiments
I'm doing. So with that, I have -- I'm not going to do a demo here. It's a short video of two
minutes, which shows two aspects of the platform. Actually, this is the competition aspect of the
platform. This is the one we started working on. This is a challenge in machine learning, and as
you can see, you can just put your competition, describe your competition, with describing the
data, the tasks. You can also -- you also can provide the education on the screen for people to
use. So this is just like text, right, like explaining a little bit also the terms of use of the
competitions and what you're going to do. In a second, we should be able to see the little box.
So then the run details are really the definition of the competition. Then, as a participant, I can
participate once the competition organizers have published the competition, and here are some
examples that we have already there. Now, the next thing we're going to look into is the
worksheets. It looks like -- it looks stable on my machine. You know what? It's a bit loose, but
I don't know what I can do. So that's an example. This is an example of the experimentation,
where we're using some libraries, in this case from Stanford, to do some parsing. I'm sorry. I
don't know, it's really -- you'll see it at the demo later, because it's looking fine on my machine,
but I'm not sure. And so the main idea is you can actually use a command line to do agile
exploration on the data, and then you can publish it. So here is just the output we're seeing of a
program which has parsed a little poem that I had there. And then on that, we want to visualize.
So as you can see, there is a mix of code, of text, with some metadata which has been captured.
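As a stand-in for that parsing demo, which used Stanford NLP libraries inside a CodaLab worksheet, here is a sketch with NLTK instead, just to illustrate the kind of text-to-part-of-speech output being mixed with prose in a worksheet. The short poem-like line is arbitrary.

```python
# Stand-in for the parsing demo: NLTK instead of the Stanford libraries used in
# the talk, just to show the kind of part-of-speech output mixed into a worksheet.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

poem = "The fog comes on little cat feet."      # any short poem-like text
tokens = nltk.word_tokenize(poem)
print(nltk.pos_tag(tokens))                     # e.g. [('The', 'DT'), ('fog', 'NN'), ...]
```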
This is really bad. I'm going to stop, sorry, but here it looks -- yes, I think that's -- okay, that's
the end, anyway, and some visualization which is going to show the parts of speech. If you're
interested, just come by -- oops. That was not it. All right. So to summarize, what CodaLab is
about is really about building this community and making it easy -- it's not about building. It's
actually building the software so that it makes it easier for a community who is doing data-driven
research to participate and share with the rest of the community. And it's kind of a cycle, right?
You can participate in CodaLab if you want to organize a competition, share some data, but you
can also just go there as an individual or with a research team to do some experiments and then
share it with the rest of your community and, more broadly, with the world. So in conclusion,
let's be part of the solution here. In CodaLab, as I mentioned, this is an open-source project. It's
all on GitHub, which means the community can add more features if the community wants to.
So if you have a competition to create, you can just use CodaLab, and then if you want to start
looking at some examples of data-driven research -- currently, with what we're starting to put on
CodaLab, we're working, as I mentioned, with the machine learning and natural language
processing communities, so we don't have much on the experimentation side in bio, but that could
be another area -- just try to reproduce some of the experiments in CodaLab. And with that,
I believe that's the end. Thank you.
>> Kenji Takeda: All right. Thank you, Evelyne. So we've got time for a few questions, if
people have some questions. So Evelyne is going to be sharing this live later on in our demo
fest, so she can have a deep dive there with any of you there.
>>: No jumping in the street.
>> Evelyne Viegas: I don't know. If you enjoyed that, I can try to reproduce that, too. Yes.
>> Tanya Berger-Wolf: So right now CodaLab is primarily designed for running scientific
competitions, but presumably, one can also turn it into a tool for citizen science and open it for
public competitions for data collection, for bird counts, for things like that. How easy would that
be?
>> Evelyne Viegas: You could. If what you want to do -- so turning the competition into citizen
science I don't know exactly, but if you're talking about like, being able to share data, so
CodaLab is -- think of it to a certain extent, kind of a repository, so to speak, for data. You can
have links to data, so that would be one of the -- and then the code to process that data. So, I
mean, in terms of data sharing, execution of code on the data and sharing modules, that's what
CodaLab is for. And I would say right now we are focusing -- we start with the machine
learning community, specifically on the context of the experimentation, but really, it's all about
data-driven research, so in theory, that could be possible. And probably in practice, too.
>> Kenji Takeda: Do we have any more questions? No? Okay. Thank you, Evelyne. So
Tanya, who asked the last question, is our next speaker. So Tanya Berger-Wolf is from the
University of Illinois at Chicago, and she's a computer scientist but is very passionate about
ecology, and we had the pleasure of hosting her in the Microsoft Research Lab in Cambridge for
several months, where she was working with some of our scientists in the Computational
Ecology and Environmental Science Group, in fact, the group who created FetchClimate. And
Tanya's -- you may have seen some of her work, where she was in Kenya in January on
fieldwork, running a class there, and she created, actually, I say sort of nature's ultimate barcode
scanner, so she's basically created some software that will identify zebras based on their stripes,
which is basically a barcode scanner. And so over to Tanya.
>> Tanya Berger-Wolf: Thank you, Kenji. So actually, that barcode scanner is an integral part
of the Ecological Information System, which is what I'm going to talk about. So if you look up -- if you just use any search engine and do ecological field data collection or ecological fieldwork,
this is pretty much what you're going to get. This is just a screenshot of the search results, and it
goes on and on, and notice, most of it is a few humans looking at some natural world data and
writing down things on paper, mostly. All right, so clearly this is not big data. This is -- and it is
from plant and nutrient cycles in ecology all the way to behavioral ecology and ecosystem kind
of data collection. A lot of it is about one, two, three, five humans looking at data. This is my
postdoc currently. I mostly work with ecologists who are interested in social behavior of
animals, so they do this version of that data collection process. They look at, let's say gelada,
who are relatives of baboons, and they write down which baboon is interacting with which other
baboon. That doesn't scale, but what's coming is a lot more of very different kind of that data, of
data on who is interacting with whom, how do animals behave, and in general, we get much
more than just who is interacting with whom. So the pretty early source of data in that field is
GPS collars or any other type of tracking collars, radio collars and so on, so this is not a stripe.
This is a solar panel on a GPS tracking collar, part of the ZebraNet project a long time ago, in
2003, with Margaret Martonosi from Princeton; she was the one. But Microsoft's own
collaboration Technology for Nature is producing cheap GPS trackers that you can put on zebras
or whatever your animal of choice, and they can give very high resolution information about which
animal is where. We're collaborating with the University of Illinois at Urbana-Champaign, Robin
Kravets, to design proximity sensors. So you can put also -- we've put this on baboons. Some of
you have seen these data yesterday, so [indiscernible], which is part of Technology for Nature,
but high-resolution GPS collars that tracked a population of baboons every second for 30 days, so
there is much more data. This is 1.4 million data points per day. This is over 20 million points
of just GPS locations and all the other associated metadata that goes into the millions for one
month. There they are, those baboons. So there's also one source of data that's coming up that is
extremely cheap, abundant and high coverage and available. That's images coming from camera
traps, those stationary cameras, motion activated, from tourists going everywhere on safari rides,
parks, and just uploading their Flickr streams or their albums online or just taking pictures and
leaving that on their cameras. And what you get from there is many, many -- this is from camera
traps -- many, many, many pictures. And so in 2010, we asked Microsoft, could we leverage
those data for information about the interactions of animals or the natural world in general? And
that's where the idea of, well, if you are going to do that, you would need to be able to identify
each individual animal. Zebra, we started with zebras. They're easy. It's less of a barcode
scanner, although everybody likes to call it that -- it's more of a fingerprint. And now we're onto
the second generation of that software. I'll show a little bit. But now, if you can leverage images,
you can actually get a lot more information out of them, and in fact, there are many, many other
sources of images, so you can fly drones. This is a drone in Kenya. Light airplanes with camera
mounted. You can put GoPros on vehicles. You can put little cameras to track your ants, if you
want to, if you're interested in insects, not zebras. You can put autonomous little tiny ones,
underwater vehicles with tiny cameras to scan -- I was talking just about the coral reef. There is
tons and tons of image data that are now coming in, so now we're in the realm of big data. So
what do we do with it? How do we deal in fact -- there is also all these citizen science platforms
for processing or collecting image data, including iNaturalist and Instant Wild, which is also part
of Microsoft Research collaboration through Lucas Joppa in Cambridge. So what we need -- so
this is coming from before this avalanche of data, before we can even take advantage of it, we
need a system that will be able to deal with it. So my talk is probably the only one that is in the
stages of design. We're deploying the first version of it in July, so I can't show you demos of all
of this working on the cloud yet. Hopefully in a year, but so all this data coming from images,
and we're building an Image-Based Ecological Information System that takes all of it, processes it,
puts it in a database, and then you can pose scientific queries. And clearly, there are many
related systems out there. Don't do that. There are many related systems out there that vary in
flavor and nature, from a similar project like Andes, which uses citizen science to collect data about
the natural world of the Andes, but without the image processing and focusing more on
the citizen science aspect of it; there is the animal tracking data repository, Movebank. There is
Zooniverse, which is image based and other meta information based data collection. There are
data standards projects, DataONE and NEON. There is the Ocean Biogeographic Information System,
focused on organizing data about the ocean. There is the Citizen Science Research Center. You
name it, there is probably a flavor and version of it right now coming into existence. All of these
are very, very new. The oldest ones are probably just a couple of years old. So the
field is maturing. So here is our version of it, and so you have all these images coming in.
Eventually, we see them coming in from Flickr streams and Google+ and wherever else the
images are coming and all my albums, and the cloud clearly being the source of data, as well as
the repository of data. But we start small. We start with tourists' cameras. And the thing is,
when you start with data that is in the field, let's say in Kenya, just for the sake of example, you
don't have connectivity. You don't have the wireless, typically. You don't have reliable
electricity, so connection to the cloud, before you can put the data on the cloud, you actually
have to jump through some hoops. And so that is very different in field data collection, which is
not like genomic data, because there is a barrier between getting data and getting it on the cloud
and sharing it through the cloud. They call it the truck Internet, because you have to deliver the
data quite often to the server by a truck. And that's what we're going to do at the beginning. So
you prefilter the data. You do a lot of work locally on the server, and part of building
this pipeline, this system, the initial deployment, is to figure out how much we have to do locally
versus how much we can do globally and what is the protocol for sharing, because there are three
nature conservancies to begin with. There is going to be the whole world at the end of it. To
give you an idea about the numbers, one nature conservancy, one tour company, 30,000 images
per day. If we go back to these systems that I showed you, they talk about 500 images in the first
three months uploaded. We do 500 images in a couple of minutes. So we have to build a
system, from the beginning, that within the constraints of unreliable Internet and electricity and
field conditions and constrained resources is still able to leverage the abilities of both the
software and hardware infrastructure and the cloud to share the data
and make it available and make it useful to scientists who are in the US, who are in Australia, but
studying those zebras in Kenya, tourists who have been on safari but went back to Russia and
uploaded their album later, right, so we can get their data as well. And so this is our sort of
architecture. This is the main point where the information, the data, gets on the cloud, and
after initial processing of mundane things like timestamp and location -- because you forget to
change the time zone on your camera; you're still in Moscow -- it goes through
our battery of image algorithms, and on this part, there is also collaboration here within
Microsoft Research on image search and object detection to identify all the images that contain
zebras, giraffes and so on and so forth. And so here, it goes through our own -- that second
version of the zebra barcode scanner, called HotSpotter now, which can identify any animal
that's striped or spotted. So not only to say that this is a leopard or an elephant or a nautilus or a
zebra, but this is Joe the leopard, this is Cathy the elephant and this is nautilus number 126. So
we can now get down to an individual animal on anything that's striped or spotted, even things you don't think of as striped or spotted, like elephants, wrinkled elephants.
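As a rough illustration of the individual-identification idea -- this is not HotSpotter itself, whose matching is more sophisticated -- here is a toy sketch that extracts local features from two photos and counts good matches as a similarity score. OpenCV's ORB features stand in for the real pipeline, and the image paths are hypothetical.

```python
# Toy illustration of photo-ID by local-feature matching; NOT HotSpotter itself.
# ORB keypoints stand in for the real algorithm, and file names are hypothetical.
import cv2

def match_score(path_a, path_b):
    orb = cv2.ORB_create(nfeatures=1000)
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    good = [m for m in matcher.match(des_a, des_b) if m.distance < 40]
    return len(good)   # higher score -> more likely the same striped/spotted individual

# Compare a new sighting against a small gallery of known animals (hypothetical paths).
gallery = {"zebra_017": "gallery/zebra_017.jpg", "zebra_052": "gallery/zebra_052.jpg"}
scores = {name: match_score("sighting.jpg", path) for name, path in gallery.items()}
print(max(scores, key=scores.get), scores)
```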
So then, we also extract -- the thing is, for a lot of these data, the statistics are not developed. So if you have all of these data combined, how do you do ecological queries on these data? What concerns us is
what's the unit of identification? Is it one photograph? Is it an animal in one photograph? Is it a
whole series of photographs around one animal that are close in time? And so we're developing
a language, also, of how to process that information, because when you want a sequence from
GenBank, there is a protocol, and the object of a sequence is well defined. What's an object of
an encounter? What's an ecological unit of analysis here? So this is what we spend a lot of our
time on right now, figuring it out. And then connecting to all of the other -- through all the other
useful resources such as FetchClimate, such as Movebank, tracking data, tracking animals, such
as all the satellite imagery and so on and so forth. And so at the end, all of it goes into a
database, a cloud instance of a database, which is a version of Wildbook. The nonprofit Wild Me
started with sharks and is now scaling it up to do it for any animal, and it works with the standards of
ecological data. We work with the organizations that maintain standards of ecological and
biological data collection. So here is what we think about an ecological unit. Here is, for
example, a series of photographs. You can put them together into a habitat unit or a collection of
animals, or maybe individual animals. This is using Image Composite Editor. This is through
our own HotSpotter identification, so we can string the images together and say here is all the
images around one time, one animal. We can annotate -- we can start annotating these images
and put together a story. And for the types of queries that we will be able to do, hopefully by the
end of the month, we can now use this unit of data from images, from a collection of
images, to answer queries such as population count. So through sight-resight or mark-recapture
techniques, we can now ask, from the images, how many animals are there, what's the population size?
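To make the mark-recapture idea concrete, here is a minimal sketch of the classic Lincoln-Petersen estimator in its bias-corrected Chapman form, where a "capture" is simply an individual identified in a photograph. The counts are invented for illustration, and this is not the project's actual statistical engine, which, as noted below, still has to be developed.

```python
# Minimal sketch of photo-ID mark-recapture (Chapman's bias-corrected
# Lincoln-Petersen estimator). A "capture" here is an individual identified in a
# photo. The IDs and counts are invented; this is not IBEIS's statistics.
def chapman_estimate(n1, n2, m):
    """n1: individuals identified on day 1, n2: on day 2, m: seen on both days."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

day1 = {"zebra_003", "zebra_017", "zebra_052", "zebra_101"}                       # hypothetical IDs
day2 = {"zebra_017", "zebra_052", "zebra_200", "zebra_214", "zebra_221"}
resighted = len(day1 & day2)

print(round(chapman_estimate(len(day1), len(day2), resighted)))                   # rough population size
```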
And it has been used now in various contexts. For example, BBC did a little show; they used this
program to maintain the counts of harbor seals as an indicator of harbor health in the UK. It's
being used to estimate population sizes of snow leopards in Nepal. Or we did it for zebras in Kenya, and it turns out,
unfortunately, that there are about half the number of zebras that they thought there were, and
they're severely endangered, so that's not good news. We can look at population dynamics. We
can ask questions, death, birth, with uncertainty, but we can estimate that from observations and
photographs. Habitat use, which species use particular habitats. The movement tracking of
animals through photographs instead of putting GPS collars on them, and finally, my favorite,
social network analysis. We can push it out -- that's the social network analysis. We can push it
out to citizen science, back to citizen science and education and outreach through -- so this is an
app based on the same idea, connecting to data now collected not from the safaris but from the
backyard of students at a school, using camera traps that they put out in the school's
backyard. That's in Chicago. When you go to the zoo in two years, maybe if you come to
Brookfield Zoo, instead of the 19th century technology that you have about explanations, printed
explanations about the species you're looking at, the zebras that you are seeing in front of you,
you will have an LCD screen that will show you what these zebras are doing in the wild right
now in Kenya, where they are. But all of it also comes back to several issues: the workflow --
people have been talking about all the cloud aspects of it -- and also the scaling, how long
it will take to run this query. The privacy, data provenance and data security -- we
would love to share the data about the snow leopards. We cannot. They are a severely
endangered species, highly protected, and you don't want to give the poachers information of
where they are and when they are there. So we have to design different levels of information
sharing. You also don't want to expose the data that the students are collecting in their backyard
and the thoughts that they are having, but you do want to connect them to other data from the
real world. So resolving all these data security and data access issues is going to be a big deal in a
system like this, and this is my own social network of collaborators that are helping -- that
we're working with in various forms and shapes -- and funding agencies. Thank you.
>> Kenji Takeda: Fantastic. So has anybody got questions? Yes.
>>: Yes. The images collected from, you say, citizen scientists? I have a question about how
you would handle the bias -- possible bias from the citizen scientists, because they might
prefer the touristic [indiscernible] touristic sites, or you want to study the ones that are usually
hard to observe, and/or how much of a limit does it pose on your finding accuracy?
>> Tanya Berger-Wolf: Right, so this is a great question, and this is one of the reasons --
>>: Could you repeat the question?
>> Tanya Berger-Wolf: Yes, the question was about how do we handle data biases, and data
biases, there are many, many levels of data bias that are in a system like this. There is the data
biases that are coming from the people who are collecting the data and the data biases that are
coming from the objects on which the data are collected. So the animals themselves, versus the
people who are collecting the data. And it's a great question. One of the reasons we're deploying
the system in a very, very light version, IBEIS light, in July is to estimate data biases -- there is no -- right now, there is no data on data biases of a system like this. So what we're going to do, we're
deploying. We have GoPros, GoPro cameras, on each vehicle, four on each one, and a GPS
tracker. So we know what the tourists and everybody else could have been photographing versus
what they are photographing. So we're going to estimate the -- there are known things like camera
fatigue, like species fatigue. The first time they see a zebra, they take three, five, 100 pictures of
it. Twenty minutes later, when they realize there are thousands of them, you don't see that many
pictures of zebras anymore. Then there are also the biases -- they say that there are four filters
when people post pictures. There's which pictures they take, which pictures remain on camera,
which pictures they upload on the cloud or on the computer, which pictures they decide to share.
So all of this is -- that's why we're working with citizen science experts. But there is the other
side of it: which animals -- they're photographing these animals because they're on this nature
preserve, and they're being taken to this location, but what about all the animals that are not
there? This is why we have UAVs, drones and camera traps and all the other, and so we can
compare the biases from different systems. And we also ideally would like to use estimates of
the query estimates -- say, population at site or the home ranges that are coming from this data,
and with uncertainty, so we need to build the whole statistical engine that comes from the data to
estimates, statistical estimates, with uncertainties, and then to go back and say, oh, we really
would like to have those data. So one of the projects that I've been doing during this year,
working on with Piero Visconti and Lucas Joppa from Microsoft Research Cambridge
UK, is active crowdsourcing. So when you estimate what data you would like to have, given the
data that you know that you have in cryptic species, species that are hard to get to, locations that
are hard to get to or people that are taking -- have their own biases in data collection -- how do
you ask citizen science people then to go and get the necessary data to get better estimates about
your ecological factors?
>> Kenji Takeda: Curtis?
>>: What tools are you using to visualize mobility data, and what are they not doing that will
prevent them from doing better science?
>> Kenji Takeda: So the question is, what tools are you using for visualization, and what do you
want from the tools, I guess?
>> Tanya Berger-Wolf: Oh, I want a lot. So we're starting to -- one of the side projects -- is Rob
here? Hi, Rob. We're using Layerscape actually to visualize a lot of our mobility data. Rob
demoed yesterday the baboon visualization, and one of the first things we asked is can we make
this interactive? Google Earth can visualization as well our GPS tracks of baboons, but the
problem is, they can't then click and say, oh, I need to label this. I need to label this as activity. I
need to label this as this baboon, oh, it's doing something weird. Let me annotate this. I want to
grab this group of baboons and circle around them and say, oh, focus on this, or even for
machine-learning data, I need to have activity and timeline labeling. So for -- and we want to put
it on the landscape right now, and so what's missing is a very high-resolution, accurate
landscape. We don't have the 3D very well right now. What's missing is the ability to do it on
my laptop, but it is -- I recognize the limitation of the computational constraints that it's a highly
intensive process, but I would still like to be able to do it on my laptop.
>> Kenji Takeda: Jeffrey.
>>: I had a technical comment about your truck Internet. So I think in both experimental and
observational projects like this, it gets to the generalization of the idea of streaming. Maybe
you stream data in blocks, months' worth of data. And the same, for instance, is true in seismic
exploration. I think they take 20 petabytes of data every month, and then they bring that to the
analysis system. So as we look at streaming data and streaming algorithms, we should look at
block streaming algorithms, as well.
>> Tanya Berger-Wolf: So the comment was that the notion of streaming is different in different
science domains. When we talk about streaming, sometimes we mean every second, when you
talk about financial data, but in this scientific domain, quite often it's block streaming. So in
seismic, it was pointed out, it's a month's worth of data uploaded in chunks. In this case, it's going
to be a couple of days' worth of data uploaded in chunks, probably using the truck Internet. But
when you scale it up to more than one nature preserve, it's still going to be burst-y, but it's not
going to be block streaming, necessarily. It's going to be constant trickling with bursts of blocks.
>> Kenji Takeda: I think actually the session after the break, we're going to be talking a bit
about streaming data. Any final questions? Yes, I have just one more at the back. Thank you.
>>: So I understand that you are in the process of developing a querying language for this, and
so how dynamic is the data, how frequently is it changing or evolving, and do you envision
traditional database query processing techniques being useful here to expedite the query
processing challenges that you are seeing?
>> Tanya Berger-Wolf: So the question was about developing the language of ecological
queries, how dynamic are the queries, how much can we take from traditional query processing
and how much do we need to do from scratch? So the answer is both. We are taking full
advantage of what's out there already, so we're relying on the Wildbook infrastructure -- Wildbook
is a database instantiation, and I can talk a lot more about it -- with a fully developed data schema using
Darwin Core, which is a data standard for animal data collection and sites. And we're
implementing it, so they already handle things, on a small scale, like mark-recapture queries.
They connect to some population genetic queries. They can connect to those. But the problem
is, while we have the language of this, we have this data schema of these queries. We have the
databases that we can build on and the data schema that we can build on, how do you do a
statistically accurate query of sight-resight or mark-recapture-based population count from these
kinds of data, from image-based biased data? So we don't know, and the problem is nobody
knows, so this is why we're deploying the pilot first, to get estimates, to see what are the
baselines. So Evelyne talked about the baselines. We need to collect the baseline first. We need
to have those GoPros to see what people are collecting. So how does this all relate also to
traditional data collection techniques? What worries me and the biologists is that we'll find that the
answers vary so widely between these kinds of data and traditional data collection techniques --
which one are we going to trust? The high-resolution coverage of the population, which is
photographed every few minutes or the once-a-day sighting of that population?
>> Kenji Takeda: Thank you, Tanya. I think we should probably move on now, so Chaitan's
going to come up and set up. So Chaitan Baru is from the San Diego Supercomputer Center, so
we're delighted to have him here. He wears several hats, so he's Associate Director of Data
Initiatives at SDSC, Director of the Center for Large-scale Data Systems Research and also
leading the Institute for Data Science and Engineering. So as said, he works across many
different disciplines. Today, he's going to talk about NSF EarthCube and a new cloud initiative
as part of that.
>> Chaitan Baru: Thank you. And one more hat, Tanya and I were just talking. So actually,
I've been involved in a project for the last six years or so, funded by the Moore Foundation,
called the Tropical Ecology Assessment and Monitoring Network, which has camera traps all
around the tropics, so we're going to connect on that. Okay, so what I'm going to talk about now,
it actually is a new initiative, just started I guess a few weeks ago, so it's opportunity for all of
you to jump in. I can't even say that the ideas are half-baked, because we're just kneading the
dough. So you can bring the yeast and everything. And actually, I do think it's an opportunity to
influence some of the thinking in the community and probably some of the thinking at NSF
about how some of these things should be done. At least we can try. And Wenming Ye from
Microsoft has been extremely helpful with this. In fact, he has already gone and started building
something using Azure, and that's the page I'm showing you. If you go to
eccearthcube.cloudapp.net, you'll see it, but we'll talk a little bit more about all that. So let's get
started. So I'm going to talk about this thing called the EarthCube Cloud Commons Working
Group, and the outline is let me tell you a little bit about EarthCube, because I'm not sure if
everyone knows or how many folks know about that, and then within that, what is this
EarthCube Cloud Commons Working Group, and then what are the next steps? Actually,
Wenming was at SDSC last Friday, and gave us a tutorial on Azure, and we already started
making some plans for next steps, and as I say, this is really an open community activity, so
everybody is welcome to participate. So what is EarthCube? It's a vision that started a few years
ago in NSF to create this national data infrastructure for Earth systems science, so encompassing
all areas of geoscience, Earth, atmosphere, ocean, and it was collaborative within NSF between
the Geoscience Directorate and this division called Advanced Cyberinfrastructure, which is part
of the Computer Science Directorate. The first meeting was back in November of 2011. It was
the first community meeting. It was called a charrette. It's the first time I've gone to something
called a charrette. But one of the outcomes of that was formation of a set of interest groups,
what are the kind of issues -- and none of this will be very surprising. This community is pretty
sophisticated. You've seen these kinds of issues, but clearly, folks said, well, we have the data
issue to worry about, governance. I think there is some notion that this should be a coordinated,
maybe not so much top-down but bottom up, but still coordinated, so there has to be some
governance involved with it. In this presentation, we don't want to get too caught up in whether -- how EarthCube, per se, is proceeding. I think the opportunity here is the cloud computing
opportunity, for us to bring it into this community. So there's a governance group. Of course,
semantics was a big issue, what does the data mean? And then workflows, because all of science
gets done through these kind of processes. And then there was a second meeting in June of
2012, so things progressed along the way. All these sort of community groups in that one year,
made some progress on some directions, and then there were about 50 of what they call
end user workshops. These are the actual scientists across all sorts of domains of geoscience
meeting and saying what do they think they might want from such a national infrastructure? And
in the end, September 2013, there were actually a bunch of projects that were funded, about
$14.5 million across a number of things, one of which was called the Test Enterprise
Governance, which is I guess a pre-alpha for governance. So they're trying to figure out what
kind of governance mechanism there should be, so they funded an activity to think about that.
But one of the things that the governance group did is to come up with -- they call these
assembly groups, but these are groups in big areas that need to be investigated, one of which was
an assembly group involving industry and free, open-source software. So it's the whole issue of
software in this community, how does it get maintained and what do we do about it? And that's
actually the meeting where we were, where this idea came out. So it was maybe about 40 people
in the room. We had multiple different discussions and four or five different activities came out
of it, and a few of us got together to talk about the cloud idea.
>>: Just a quick question. This community is used to paying for services like Esri, which are
not open source, so I'm just curious about that.
>> Chaitan Baru: Well, but they also have a lot of open-source software. So it's both, and that's
exactly why they wanted to have the commercial folks there. I'm trying to remember -- I'm sure
somebody from Esri was invited, may have been there. I can't recollect. But there were others.
Okay. So yes, and they used to pay a lot of money for Esri, actually. So let's talk a little bit
about this particular activity. So we called it the EarthCube Cloud Commons, also informally
called the GeoCloud. It's an outcome of the discussions at that workshop, and we've set it up as
the charter being to evaluate, encourage, facilitate and provide education on adoption of cloud
computing by the geosciences community, so it's a big charter, right? So how do we take the
geoscience community and just push them into the cloud? So those of us who were at the
meeting call ourselves the leadership team. This is pretty open, so I don't think we want the
leadership team to have 50 people in there, but if some more want to join, that's fine. So that
includes me, Emily Law from JPL, Charles Nguyen from Minnesota, Judy Pechmann from Utah
and Wenming from Microsoft. So we had to submit actually a document back to the EarthCube
folks about what this group would do, so a statement we had was this was formed based on the
recognition that creation of private, what you might call on-premise, IT infrastructure by
individual researchers and/or research groups in the NSF community may not only be
unsustainable but may, in fact, be detrimental to creating synergies in the community and thus an
impairment to collaborative research. So you can see we are being a little edgy with the
language. But I think what we are saying here is the current model, where everybody puts a
budget item in their proposals and gets equipment, puts stuff in their closet, there is a limit to
how far you can go with all that. And you all know that with data, you can create islands of data
and data that's lost, etc. Plus, the flipside of it, if you're involved in a common environment, it
actually already creates a situation where you could have synergies and sharing of data, if it's in
some kind of a cloud environment. So some of the issues that we talked about and that we would
like to inspect -- we have about a year, by the way. So the idea is that this group is supposed to
meet and do a few things, which I'll mention here, in terms of action items over the next year.
We'll get some money from EarthCube to actually run another workshop, so again, that's another
opportunity to meet, and then see how we are making progress on some of these kind of things.
So the first idea was the community may not know exactly what cloud computing is, how it can
be used, so there's a role as a broker, an honest broker, for a group like this within EarthCube to
help the scientists say, okay, what is the thing you are trying to solve? What's your problem?
And then look at that and say, maybe you can use a cloud, or maybe you shouldn't use a cloud.
Maybe you need to go to a supercomputer, whatever it is, right? And there are many cloud
providers, as well, so which one would you use? EarthCube itself, as I mentioned, there was
$14.5 million of funding that went to various projects, some of which are creating software
and infrastructure, and one of our thoughts was, maybe the EarthCube, the services that the
EarthCube projects themselves are creating, maybe they should reside in the cloud right away.
You should start with the presumption that maybe they should be cloud services and then see why
they shouldn't, right? The other one is to really evaluate costs and business models, and that's
partly your question of where are we spending the money today and where would we be
spending the money in the future, if we went to the cloud, and it's all the money, not just the
hardware. It's the people, it's the fact that you might be using grad students to run systems. Is
that a good idea, a bad idea? All of those kind of things. And like I say, I think the answer is
going to be different for whether you should use cloud computing or not based on whether your
project is a small, medium or large one, and also small, medium and large in terms of your
computational and data needs. Project management and sustainability, so what is the access, cost
and benefit of implementing these long-term resources? We talked about long-term resources;
examples of those are, say, IRIS, which is actually right here in Seattle, the Seismologic Archive,
or the Geodetic Archive that's [indiscernible]. NCAR, for example, we talked about is too big.
You can't say let's start running NCAR and do it all in the cloud, but there may be others that you
could say, well, what would happen if you run it in the cloud? And we thought one of the
interesting side effects of that could be that if a facility was running in the cloud, actually, the
management of that facility may be easier to transition from one group to the other. So you don't
have a lockdown situation where a group owns a bunch of physical resources and that's the
reason why they should continue to run that resource, right? Another idea we talked about was
equipment reuse. It's a side issue, but there's actually a lot of discussion about it. So there's a lot
of money and effort spent on acquiring huge systems which, after four or five years, are basically
sold for scrap metal, in most cases. But could you take systems like that, reconfigure them
and maybe use them as a cloud environment that could be on premises or an intermediate cloud
that maybe NSF runs? And there are many reasons why you want to do it. This could be an on
ramp into a public cloud. It could be for doing certain kinds of computation that maybe could
not be in a public cloud, etc. And finally, we also talked about a group like this of experts could
help scientists think through what we might call the pre-award and post-award situations, so
when you're writing a proposal, how should you think about what should you budget for?
Should you really use the cloud? Do you have all the cost things in there? And once you get
funded, how would you then actually go about implementing these things? So these are all the
issues that we talked about. What is the plan of action we came up with? So we thought, okay,
what we should do -- what we can envisage happening is that there is some kind of a repository
of VMs. There are geoscience-related VMs. So if a scientist says I've got some seismic data, I
want to use GMT and this and that other software package, and if it's already there, there's a VM
that has all of that capability, they could just spin it up and they're off and running, right? So we
thought it would be interesting to create a depot or a repository of sample VMs with sample data
sets already there. Then we need infrastructure. So it would be good to find partners who can do
it, and this is where, right on the spot in the meeting, thanks to Microsoft, we got a $50,000
contribution towards resources in Azure, so we are going to run with Azure first. Education,
training and collaboration, so there are other groups. These are the other assembly groups in
EarthCube that talk about education, that are talking about cataloging resources, so all the
metadata issues. As I mentioned, there are existing data facilities in the geosciences, so there's
an assembly group of those folks. So we figured that our group should be talking with all of
them on different issues. And also, there are other existing fairly big community groups like
ESIP and the Earth Science Working Group, who have already thought through actually some of
these. So some of the folks who were at the workshop, I'm not personally involved with, but
others in our group were actually already involved in some of this kind of thinking that's
happening in other communities, maybe the NASA-oriented communities and other agency
communities and international. And so clearly, we need to connect those. All right, so with that,
we should create an interface or a portal that would have this kind of community interface to
cloud resources that they can get ahold of. We also talked about this notion that, if you knew
what sort of software you wanted and what data you wanted, maybe you could come to a portal like
this and say, I need this, this and that, and then we could create the virtual set of resources for
you, and then there's the equipment reuse question. So here's what -- I'm down to the next steps now.
So here's what we thought we would do in terms of some concrete things we can try. It's going
to be with Azure, because that's our initial resources we have freely available to us and the
expertise. Wenming has also provided us some programming support for this. So there is a
project I am familiar with, since I am the PI on this. It's one of the data facilities funded as a
facility by the Geosciences Directorate. It's called Open Topography, and it basically collects
topographic data from the community. Anybody who's flying a campaign can contribute data
into this, and our job is simply to serve it to the community. We don't collect the data. We don't
actually process it and do science. We only provide -- we are just a data resource. That's what
this is funded for. So you can come, go to opentopography.org, and you'll come to a portal.
Behind that, we run a bunch of servers. We have data-hosting facilities, so that's already here.
And so we could take that as an example and say, okay, how would we -- it turns out that even in
our project right now, we're thinking about how would we burst into the cloud, because these are
physically restricted resources we have. We're already right now looking at bursting into the
supercomputer, for example. We could go into Gordon or one of these resources that NSF has,
but we are also looking at how do we burst into the cloud. So it kind of is a natural fit, so the
notion would be could we create an OpenTopo VM that's readily available with some data. So
you put some data in the cloud. That's the read-only data. That's your input data sets, and the
scientists could come and spin up a VM, do their work and create some work product. So there's
a work space. So what we will do is we'll take some sample data sets, so talking literally in
terms of next steps -- we have heard of San Andreas, so we have a San Andreas data set, imaging
of the entire fault, so we could make a copy of that. So in general, what will happen is the data
sets of interest will have to be copied and made available. And then, we're not going to do this in
the first time around, but in general, what could happen is you do work in the cloud, and at some
point, you might decide to publish it. You might say, this is a good result, I need to persist this.
And maybe at that point, it goes off into some other resource, maybe a library somewhere or
whatever, which has the job of being a much more long-term, persistent resource. So that's the
idea there. So here's what we are thinking of doing. We'll have an ECCC, EarthCube Cloud
Commons portal, which you would come to. And that's the first screenshot I showed you, is I
guess the pre-alpha version or something like that. What you might see there is a VM depot.
You see a bunch of VMs, so initially, what we talked about is we can take what we do in
OpenTopo, strip down all this stuff, create a sort of core VM out of it, and you might see some
data sets. So you might see some disks that say, here's the San Andreas data set. So you could
spin up a VM, attach a disk to it, and you're off and running. You can attach multiple disks to it if
you wanted. And other examples in the group, Charles Newman from Minnesota works with the
Polar Project, so they have a lot of -- huge amounts of imagery and data from the polar region, so
there could be maybe a polar imagery VM. There could be a seismic VM, etc. So there's just
examples of the kinds of things that could be there. And then you have the user community
coming into that and using it. So that's really it, and the last set of things to think about is, so as
we started thinking more about this, when Wenming was in San Diego last Friday, what I
showed you here in a sense is sort of infrastructure as a service. I mean, there's a VM, there's a
disk. So as a scientist, I come in knowing that I'm actually asking for a machine. I'm
asking for a disk. For some scientists, this may not be that natural a way of interacting, so the other
option is maybe more as a platform. That is, the entire -- in this particular example, it would be
like saying, take the entire Open Topography system and just move it into the cloud. And when
you come there, you see the interface you see today, which I have to say, is a fairly user-friendly
interface. Our users like it a lot, where you just click on stuff and things happen behind the
scenes. It's kind of interesting, and this is where I have no idea how we should proceed, but if
you did it that way, in a sense, it's a vertical lab. Open Topography is a vertical lab for Lidar,
and it doesn't necessarily have things built in there that allow you to spread and connect to other
data. The whole idea of EarthCube is to connect to lots of data. So there is some systems design
thing involved that we should think about. It doesn't have to be vertical that way. Maybe there
is a much broader catalog and OpenTopo is just one of these things that goes and refers to that
catalog and says, oh, I only deal with Lidar data, and this broad ocean of data -- that's some term
that's used, but it's pools and lakes that are used by other companies, so it would just be an
EarthCube pool, EarthCube lake of data. So, anyway, there is that issue. There's this other issue
of -- so therefore, if you try to do this as a VM depot kind of thing, one concern is are you going
to confuse the end user? So if they're scientists and they're coming and seeing a lot of gadgets
and widgets and stuff, there's a concern there, though we do know that many of the science folks
are also quite power users. So how do we balance that? And as I said, the second bullet is what
I just mentioned. That is, if I go too vertical on the apps, then I might be missing out on
opportunities in terms of connecting all these data sets together in a horizontal way. So how do
we do all that? So how do we provide interfaces? I think there's a big issue of not just the
conceptual level at which you want to create these resources, but what should the interfaces look
like? I mean, I really love the Azure interface, which I got introduced to last week, but then I
thought, yes, that's me as a computer science geek, but I like it. But a geoscientist might say,
what the heck is all this? A cube, and what do I do with this? But I think you won't have -- so I
think you will want to think very carefully about what's the right level of abstraction, or maybe it's like
layers in an onion. Some people get very high-level abstraction. Others, if you're brave enough
to press that button, then you go down into that world. Because the point being, I think at the
same time, you do want to provide some flexibility. I certainly know geoscientists who can do
this stuff. They just say, give me the machine, I'll do this. So that's where we are.
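To make the VM depot and attach-a-disk idea above a little more concrete, here is a minimal, purely hypothetical sketch of the bookkeeping such a portal might do: a small catalog that maps a requested software stack and named data sets to a provisioning request. Every identifier below (image names, data set names, sizes) is invented for illustration; this is not EarthCube Cloud Commons code or any real Azure API, just the shape of the idea.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entries -- none of these identifiers are real.
VM_IMAGES = {
    "opentopo-core": "depot/opentopo-core-v0.1",   # stripped-down OpenTopo stack
    "polar-imagery": "depot/polar-imagery-v0.1",
    "seismic": "depot/seismic-v0.1",
}

DATA_DISKS = {
    "san-andreas-lidar": {"size_gb": 500, "read_only": True},
    "srtm-dem": {"size_gb": 200, "read_only": True},
}


@dataclass
class ProvisioningRequest:
    """What the portal would hand off to whichever cloud provider is used."""
    image: str
    data_disks: list = field(default_factory=list)
    scratch_gb: int = 100          # read-write workspace for work products


def compose_request(software, datasets):
    """Turn a scientist's 'I need this, this and that' into a VM spec."""
    if software not in VM_IMAGES:
        raise ValueError("no depot image for %r" % software)
    missing = [d for d in datasets if d not in DATA_DISKS]
    if missing:
        raise ValueError("unknown data sets: %r" % missing)
    return ProvisioningRequest(
        image=VM_IMAGES[software],
        data_disks=[dict(DATA_DISKS[d], name=d) for d in datasets],
    )


# A scientist who wants the stripped-down OpenTopo stack plus the San
# Andreas imagery attached as a read-only disk:
print(compose_request("opentopo-core", ["san-andreas-lidar"]))
```

In principle the same request object could be exposed directly (the infrastructure-as-a-service style described above) or hidden behind a more vertical, platform-style interface; keeping the data sets in a shared catalog, rather than per-project silos, is what would allow the horizontal connections EarthCube is after.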
>> Kenji Takeda: Fantastic. Thank you, Chaitan. So that's fantastic. So we've got lots of
questions. Jeff.
>>: So I really like this. I think one thing that you need in terms of your final remarks is there's
a difference between providing data and providing data sets, and so one of the things that I think
you had -- one of the models to be thought about here is what the USGS has, and FetchClimate, also:
the idea that it's sort of a seamless data search, that I can go to a map and I can draw or specify
a bounding box, and then from whatever source is available, I can get the elevation data from
that bounding box. Whereas in the data system, what you've got is a set of tiles, probably. And
you've got files for each tile, and they'll be in different formats, depending on the source of the
data. And being able to have a seamless source of topographic data, specified by the sort,
whether it's Lidar or SRTM or something like that, I think is a really cool idea. And that really
would make it useful to a very wide spectrum of sophisticated -- even a sophisticated user would
like it, and the naive user would really like it.
>> Chaitan Baru: In fact, to that, I would say right now in Open Topography, we do see the 80-20 rule. So if you go there, point clouds are what we serve, and that's what we all get excited
with, because they're complicated data sets, but we also have precomputed DEMs, including
SRTM. And 80% of the traffic is to the precomputed DEMs. People just want to come
there, draw a box, clip the DEM they want and go away, and they trust that the facility knows
what it's doing. The 20% are the power users. That's where this flexibility thing comes in, and
interestingly, in the power users, increasingly, people are beginning to say -- and there are not
too many, but there are certainly more than two scientists who have told us, don't give me just
the point cloud data. Give me the stuff that went into making the point cloud data, because I
want to make it --
>>: That's Lidar data that you've got.
>> Chaitan Baru: Right. So I think it would be very cool to be able to support that whole chain,
up and down.
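As a rough sketch of what sits behind the "draw a bounding box, clip the DEM" workflow being described here, the core step is picking the tiles that intersect the user's box, regardless of where they came from. The tile index below is invented, and a real service would also have to reproject, resample and blend across sources in different formats; this only illustrates the selection step.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    name: str
    source: str       # e.g. "lidar" or "srtm"; must appear in the prefer list
    bounds: tuple     # (min_lon, min_lat, max_lon, max_lat)


def intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])


def select_tiles(index, bbox, prefer=("lidar", "srtm")):
    """Pick tiles covering bbox, preferring higher-resolution sources first."""
    hits = [t for t in index if intersects(t.bounds, bbox)]
    return sorted(hits, key=lambda t: prefer.index(t.source))


# Invented tile index, purely for illustration.
index = [
    Tile("sa_lidar_001", "lidar", (-122.0, 37.0, -121.5, 37.5)),
    Tile("srtm_n37w122", "srtm", (-122.0, 37.0, -121.0, 38.0)),
]
print([t.name for t in select_tiles(index, (-121.8, 37.2, -121.6, 37.4))])
```

The 80-20 split Chaitan describes maps onto this directly: most users only ever want the mosaicked result of a query like this, while the power users want the underlying tiles, point clouds, or even the raw inputs behind them.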
>>: Is this going to be integrated with things like the IEDA data sets, the [indiscernible],
sedimentary? All that stuff seems like they've got a pretty good handle on how to do some
mapping and integration across those different kinds of data sets.
>> Chaitan Baru: Yes, so you're asking the EarthCube, the cosmic EarthCube question. So the
IEDA is on the geology, geochemistry and stratigraphy and that kind of stuff. I think that's what
-- we are not planning. You saw what we are planning to do. Well, EarthCube's idea is to bring
all of that together in some nice, seamless sort of way, but bring the geology stuff with the
geophysics stuff and the ocean stuff, all of that. Yes. And IEDA is a big player in EarthCube.
>>: How is it linked with the OOI, the observatory infrastructure? They're undoubtedly building
streaming data infrastructure.
>> Chaitan Baru: Well, that's a good question. So this is one more of those things where I think
the community can help NSF, and it's hard.
>>: But it comes from a different part of NSF money.
>> Chaitan Baru: Correct. That's the problem.
>>: All right.
>> Chaitan Baru: But I wish -- well, since you gave me the opening, I've got to walk in. Both
NEON and OOI, which I'm very familiar with -- NEON, I was on the planning committee. I was
just on the original team. I think there's big opportunities there. Actually, I think they can open up
the tap and at least let the computer science community at the data. It's unfortunate, I think
they're working on very sort of old protocols of when data will be made available. I know OOI
tried to use the Amazon approach, but I'm actually not sure where they are right now. But they
originally tried to do this thing of streaming the data into Amazon, and some of OOI is here, at UW.
Yes.
>> Kenji Takeda: Excellent. Just a comment, really, that the session I think has been fantastic,
because we've seen lots of different angles, and this project is exactly like what's happening at the
British Library, where they've got virtual machines for a million images of scanned 17th, 18th
and 19th century books, and they started with 20 terabytes of data in virtual machines, and
they're now building it out in this PaaS with Azure blob storage. And I think it shows how the
cloud cuts across different disciplines with common challenges, but as Marty said, common
research opportunities. So with that, I just wanted to close the session and thank all of the
speakers. Thank you.