Dennis Gannon: Good morning. Glad to see we still have a reasonably strong number who show up at
9:00 on the second day. That's usually a good sign for a good day. As I mentioned yesterday, at any
given time at least 90 people were watching the live streaming. And I heard we actually had hundreds of
people involved. So not everybody was on at the same time, but there were many hundreds; I can't
remember the number. But actually it was quite interesting. There was a Tweet stream. You people don't
Tweet a lot. There were only three Tweets yesterday. Not a lot of people watching.
>>: [Inaudible].
Dennis Gannon: Oh, it's not on the website. It was actually -- oh, you want to Tweet? I don't want to
take away from your time here.
Once again, we have -- this is one of the sessions where we have four presentations. So this will be a
little bit shorter than others. But I want to get started first of all with Mercè Crosas from Harvard. And
this is not her title. It's my mistake. It's close. It's an approximation.
So now, let's see; we have to do something here.
Mercè Crosas: Okay. Okay, I'll be talking about a little bit more than just Dataverse and Consilience.
First I'll introduce the data science team at the Institute for Quantitative Social Science (IQSS) at
Harvard University and [indiscernible]. We just put together a website, which will go up probably at the
end of this week, to describe all the research tools that we're working on now. As you will see, some of
them we have been working on for years and some are new from the last month.
What we do with the data science group at IQSS -- we combine, well, mainly three groups. Some are
researchers that have a focus on statistical and computational methods and analytics. Then we have a
software engineering team, a professional team; many of the people there I actually brought from
industry, where I worked in the past before I joined this group at Harvard. And then we also have a
data curation and stewardship team that helps with data archiving, preserving data, and cataloging and
understanding data. And we think that all three of these aspects are essential to build a data science
team that can provide research tools for data-driven science. The team is 20 people, and you see all the
different groups, including, well, the statistical analytics, software development, the curation and
archiving, usability and UI, because we also do a lot of usability testing for our tools, and QA, which
we also find to be a critical part of our group.
So let's talk about the applications first. I'm keeping that a little bit short because it's a short
presentation and then we'll have time for discussion. So feel free to ask us any questions at the end.
So two of our main research tools or frameworks that we've been building over the last six to eight
years are Zelig and Dataverse, the Dataverse Network. Zelig is a framework -- not a framework, it's a
common interface that allows you to access a large number of statistical models [indiscernible] by
different contributors. Often the problem with [indiscernible] is that there are a lot of computational
methods that are hard to use because each one has been contributed by somebody with different
documentation and a different interface. So Zelig brings this common interface to all these models, and
it makes them easier to use and also easier to understand, because it has common documentation for all
of them. It's used by hundreds of thousands of researchers. It's been used very, very heavily in
quantitative social science research, but we're also expanding into vital statistics and other fields.
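To make the "common interface" idea concrete, here is a rough sketch in Python rather than R; the zelig() function and model names below are hypothetical stand-ins, not Zelig's actual interface, and only show how one entry point can dispatch to many differently-written estimators:

```python
# Sketch of a common interface over several statistical models.
# The zelig() function and model registry are illustrative assumptions,
# not the real Zelig (R) API.
import numpy as np
import statsmodels.api as sm

MODELS = {
    "ls": lambda y, X: sm.OLS(y, X).fit(),       # least squares
    "logit": lambda y, X: sm.Logit(y, X).fit(),  # logistic regression
}

def zelig(model, y, X):
    """Fit any registered model through one uniform call."""
    X = sm.add_constant(X)          # same preprocessing for every model
    return MODELS[model](y, X)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = X @ [1.5, -2.0] + rng.normal(size=200)
    print(zelig("ls", y, X).summary())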
Dataverse is also used widely -- we have a Dataverse [indiscernible] at Harvard that is free and open to
all researchers in the world, and it provides a data publishing framework. So a lot of researchers go
there and deposit a dataset, and it generates a citation -- a core part of Dataverse -- so that they have
a data citation that can be used to reference the data from a publication, from an article or from a book.
That way the researcher who provided the data gets credited for it; they have a way to be cited and
recognized for data that might have been hard to collect, put together, and clean up. And they keep
control of the data; they can set different permissions on how to access the data. We encourage open
access to data, but sometimes there is a need for restrictions. And the framework helps you archive it
and preserve it, and also, as you'll see, it links with Zelig to provide analysis.
More recently, in the last months, we've been working on a couple of projects. SolaFide is an interface
to the Zelig statistical models, and it also interfaces with Dataverse. Normally researchers are well
familiar with the statistical models, but any user, even without a good understanding of the statistics
behind them, can be guided in what statistical model to use if they understand the data. This is going to
be released in June; it's now in development and testing.
And DataTags is also a new application that is a large collaboration with the computer science
department at Harvard, the Berkman Center, the Harvard Law School, and the Data Privacy Lab of Latanya
Sweeney, who is also part of IQSS. It provides a framework to tag a dataset with a level of sensitivity:
based on whether it has, say, health data or student data, and on the degree of how private that data is,
it gives a different tag. And that tag is actionable -- well, it has a format so that it can be carried
with the dataset. So it's a policy for that dataset, to be able to understand how to handle it, how to
store it, whether it needs to be encrypted or double encrypted, how it can be accessed. So we try to
maximize sharing of the data, or making part of the data open, while still restricting the parts that
have sensitive data.
Two other software applications that we're working on are Consilience and RBuild. Consilience is a text
analysis application that uses about a hundred different clustering methods available in the literature,
and combines them to interactively assist you in discovering new clusterings, new ways of organizing
your documents -- not based only on the existing methods, but based on new clusterings that live in the
clustering space. I'll show a little bit more about that.
And then we're just starting to work on the RBuild application, which helps researchers who are
developing their own statistical methods to build them and bring them to CRAN, which is the repository
for R packages.
So if we put them all together, basically the research tools we're building cover the entire research
cycle, from the point of developing computational methods, to analyzing the data -- whether quantitative
datasets or unstructured text -- then publishing that data, making sure that the data are cited from
journals and from any published results, being able to also share sensitive data, and allowing others to
explore, reanalyze or validate the analysis of the datasets published in Dataverse through SolaFide.
So let's talk about the Cloud and the [indiscernible] of more interest here. First, we've been working
with Dennis and Christopher -- well, we're using Azure for Consilience and PowerForce, and for Dataverse
and SolaFide also. So, on the development of Consilience: this is part of the user interface, which has a
clustering space where you can go over that space and browse all the possible clustering solutions for
your document set. This is in development, and we'll be releasing it in June to a few researchers to give
us feedback and continue building based on that feedback. But here is how it is set up, or what the
system workflow is basically, and you'll see how this maps onto how it is set up in Azure.
You load a document set that can be from a hundred to ten thousand documents -- at some point we want to
extend to millions of documents, but it's tens of thousands for now. You run a clustering map, which
involves calculating a term-document matrix and running about a hundred clustering methods, all of them
ones that exist in the literature. And this is the part that can be done in parallel. We then project the
results from those clustering methods into a two-dimensional space, and from there we calculate new
clustering solutions over the entire space, building a large grid over the two-dimensional space of the
useful clustering solutions for that document set.
There is a paper by [indiscernible] and Justin Grimmer, who was a grad student and is now a professor at
Stanford, that explains part of the methodology here.
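As a rough illustration of that workflow (not the actual Consilience code, and on a toy corpus rather than thousands of documents), the pipeline might be sketched in Python like this:

```python
# Illustrative sketch of the Consilience-style pipeline described above:
# term-document matrix -> run several clustering methods -> project the
# clusterings into a 2-D "clustering space". The real system runs ~100
# methods in parallel on Azure; the corpus and methods here are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.manifold import MDS
import numpy as np

docs = ["budget deficit vote", "senate vote on the budget", "soccer match result",
        "championship match tonight", "deficit talks stall", "the team wins the match"]

X = TfidfVectorizer().fit_transform(docs)            # term-document matrix

# Step 1: run a battery of existing clustering methods.
labelings = [
    KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray()),
    AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X.toarray()),
]

# Step 2: measure how different the resulting clusterings are from one another
# (1 - adjusted Rand index as a rough dissimilarity between two labelings).
n = len(labelings)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = 1.0 - adjusted_rand_score(labelings[i], labelings[j])

# Step 3: project the methods into a 2-D clustering space that a user can
# browse; new clusterings would then be computed on a grid over this space.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(coords)
```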
So once we run all these, users can explore the clustering map and make discoveries. I just want to
point out that this is totally different from other clustering tools, because it doesn't give you just one
solution; it also gives you a space to explore all the possible solutions, and you choose the one that
fits best what you're looking for in your documents. And then it allows you to obtain the documents.
So it's set up using multiple nodes, and we've been testing expanding the supercomputing part using
[indiscernible]. Part of the application is written in Scala and Java, and the Scala component follows
the Akka distributed workers pattern -- I don't know how many people are familiar with it here, but there
is additional information if you follow the link. With it you can easily distribute the work: use as many
nodes as needed to run all the clustering methods, based on when the nodes are ready to do that and are
available.
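The real implementation uses Akka's distributed workers pattern in Scala; purely as an illustration of the pull model it describes -- idle workers take the next clustering job whenever they become available -- here is a much-simplified, single-machine analogue in Python:

```python
# Simplified analogue of the pull-based worker pattern (not Akka): a pool of
# workers pulls clustering jobs from a shared queue as each worker frees up.
from multiprocessing import Pool

def run_clustering_job(job):
    method_name, params = job
    # ... here a real worker would fit one of the ~100 clustering methods ...
    return (method_name, params, "labels-for-%s" % method_name)

if __name__ == "__main__":
    jobs = [("kmeans", {"k": k}) for k in range(2, 12)] + \
           [("hierarchical", {"linkage": l}) for l in ("ward", "average")]
    with Pool(processes=4) as pool:              # 4 stand-in "worker nodes"
        for result in pool.imap_unordered(run_clustering_job, jobs):
            print("finished:", result[0], result[1])
```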
It's actually -- I have to say, it's been a very nice experience. I'm not the developer doing that, but
the developers on that project, even though they didn't have a chance to do the training -- they'll be
happy to do it when it's available -- were able to catch up very quickly, set that up easily, and make it
run very well. So that's been a great experience.
The other part is Dataverse in the Cloud. We've been setting up a Dataverse instance in Azure because we
wanted to do two things with it. One is that we're testing CDL with Patricia [indiscernible], so the data
app integration with Dataverse; and the other part is that we're setting up an image for [indiscernible]
to provide an easy way of setting up a Dataverse instance in the Cloud. We're doing that for Dataverse
4.0.
That's why it's not there yet. It's going to be released at the end of June, and it comes with many big
changes; we wanted to make sure that we would provide the latest when we put this in VM Depot. It has a
new faceted search, new data file handling -- well, a lot of enhancements to the data files
[indiscernible] -- and support for more domains in terms of providing rich metadata for datasets, not
only in natural and social science but in astronomy, where we've been collaborating with the astronomy
group at Harvard for a while, and also medical science, where we're collaborating with different groups
at Harvard Medical School and hospitals and other groups worldwide to define a common set of metadata
for biomedical datasets.
Another part of Dataverse in the Cloud is SolaFide, which, as I said before, is a web application that
integrates with Zelig to be able to run the statistical models that are -- well, originally written in R
-- and apply them to datasets in Dataverse, so it integrates the two systems.
And let me see if I can run this here -- this is a very short demo movie showing how you have all the
variables for a quantitative dataset, and you can see all the summary statistics. All these are generated
automatically during the ingest, or upload, of those quantitative datasets. Then the user, who
[indiscernible] more about the dataset, can decide what the explanatory variables and the dependent
variable are, and then -- well, choose here from only a few of the models that are now available, out of
the hundreds of models we have. And once you set up the subset that you want to analyze, you can run that
model for the X and the dependent variable that you set up, and you get the results here that you can
take and print out. This is in testing and development, but it's a way to integrate sophisticated
statistical models -- not only linear regression, but quite sophisticated ones -- with datasets in
Dataverse.
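As a hedged sketch of that flow (the file name, column names, and model choice are invented for the example; the real system goes through Dataverse, Zelig and R), the user-facing steps look roughly like this in Python:

```python
# Sketch of the SolaFide-style flow: inspect summary statistics for a tabular
# dataset, pick explanatory variables and a dependent variable, run a model.
# "study_data.csv" and the column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("study_data.csv")   # e.g. a file obtained from Dataverse
print(data.describe())                  # the summary statistics shown in the UI

# The user picks "income" and "education" as explanatory variables and
# "turnout" as the dependent variable, then runs (say) least squares.
model = smf.ols("turnout ~ income + education", data=data).fit()
print(model.summary())
```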
So we have all this set up in the Cloud. There is a web app that has the main Dataverse business logic
and middle layer -- well, the UI and middle layer. Then we have a database node and storage for the data
files, and then the ingest: when quantitative datasets are uploaded to Dataverse they are processed in R,
so it has an R [indiscernible] to do that, and when they are analyzed using SolaFide and Zelig, it uses
rApache. We haven't set these up as distributed nodes yet, but the two [indiscernible] of ingest and of
exploration and analysis can be expanded to multiple nodes. The idea is to provide one image that
includes everything for quick testing -- I mean, the nice thing about Dataverse and all of Zelig is that
you can run it all on your laptop as a testing installation, and you can quickly set it up there with
[indiscernible] -- but then you can also set it up as a production solution with several nodes and be
able to expand to more nodes for the analysis. I don't know how much time I have.
But let me just quickly introduce DataTags. This is not in the Cloud yet, but we've been working on its
development. As I said, DataTags helps you to share sensitive data while following -- well, the legal
regulations that you need to follow: HIPAA, FERPA and others. I think there are a couple of thousand laws
on privacy in the United States, but you can group them all into about thirty types of regulations.
We've been working with the Data Privacy Lab and also with the Berkman Center, the Law School, and the
School of Engineering at Harvard. So this gives an idea of the different levels of sensitivity that you
would apply to a dataset, and you would apply them based on an interview that has maybe ten to fifteen
questions about your data. This interview is actually mapped to a set of rules that are set up by the
lawyers based on the regulations. So for HIPAA, we had someone HIPAA-certified setting up the rules here,
and the rules define a set of questions, and you're asked those questions. When you finish the interview
you finally get a tag for your dataset. That tag lives with your dataset and tells you how it needs to be
transferred, stored and used afterwards.
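A toy illustration of the idea, with invented questions, tag names and requirements rather than the real DataTags rule base, might look like this:

```python
# Toy rule-based tagging: interview answers -> tag level -> handling policy.
# Questions, levels, and requirements are invented for illustration only.
TAG_REQUIREMENTS = {
    "blue":    {"storage": "clear", "transfer": "clear"},
    "yellow":  {"storage": "clear", "transfer": "encrypted"},
    "red":     {"storage": "encrypted", "transfer": "encrypted"},
    "crimson": {"storage": "double-encrypted", "transfer": "encrypted, approval needed"},
}

def tag_dataset(answers):
    """Map interview answers to a tag, roughly from least to most sensitive."""
    if answers.get("identifiable_health_records"):   # e.g. HIPAA territory
        return "crimson"
    if answers.get("identifiable_student_records"):  # e.g. FERPA territory
        return "red"
    if answers.get("contains_personal_information"):
        return "yellow"
    return "blue"

interview = {"contains_personal_information": True,
             "identifiable_student_records": True,
             "identifiable_health_records": False}
tag = tag_dataset(interview)
print(tag, TAG_REQUIREMENTS[tag])    # the tag travels with the dataset
```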
So that's it and thank you everybody. [applause]
Dennis Gannon: We have time for a quick question if somebody wants to. Anybody?
Yes?
>>: [inaudible] available? These tools are available for researchers to --
Mercè Crosas: Yes, well, we have -- almost all of our research tools are Open Source. For Consilience,
we're still working out how it will be distributed. Just Zelig and Dataverse are available now, because
we've been developing them for a while, and they're available to all researchers. We have a Dataverse
repository at Harvard that is open and free to all researchers, but also many institutions around the
world have set up a Dataverse network for their research data. Zelig is also available from CRAN and
available to other researchers. All the other ones are going to be released during the summer, so
they're going to be available to researchers then.
Yes.
Dennis Gannon: And at some point there will be a version, as you say, there will be --
Mercè Crosas: Yeah, in June, yes.
>>: I was wondering whether you have any infrastructure sitting behind DataTags that identifies that you
need [indiscernible] authentication or whatever the tag says -- in the infrastructure, to help the
researcher kind of expose the data with those restrictions. Or is it just, these are the considerations
you need to --
Mercè Crosas: No, it is better than that, because that's why we are integrating it with Dataverse, so that
actually, once you deposit the data -- I mean, it could be [indiscernible] with other types of repository
infrastructures for holding data -- but once you set that tag, it goes into -- it becomes part of the
policy that the Dataverse follows, and the researcher really doesn't need to do anything. After that it's
all handled properly, because it just follows the instructions in the tag, basically. And we're reviewing
this with different groups, with [indiscernible] and other people in the computer science department at
Harvard, to see that the data -- well, that it actually follows all the security levels it needs to
follow for the regulations. So it's not complete yet.
>>: How big are the datasets that you have in Dataverse? You have said you have --
Mercè Crosas: Yes, so the datasets -- a dataset, well, has the metadata that describes the data it holds,
and it can have as many files as you want. Each file has a maximum of about 2 to 4 gigabytes. The only
thing is that now, for example, we have one new dataset that has thousands of files of MRI data, and each
file is less than a gigabyte, but altogether it is a large dataset. One of the next things -- I didn't
talk about all the future projects on the website that we're releasing this week -- one of the next
things we're looking at is to extend Dataverse to support much larger datasets more easily. Normally a
dataset is a sum of multiple files, so we're building an adaptive data storage behind Dataverse to be
able to quickly query datasets of [indiscernible] terabytes, if we get to extend it to that. We're
working with the Connectome group to do that for the imaging -- nanoscale microscopy imaging of the
brain. When you put all these together, those are terabytes. And also with an astronomy group that has
terabytes of data.
[applause]
Dennis Gannon: Let me introduce Bill Howe and let him get started. Bill, as everybody knows in the
Northwest, is one of our leading data scientists. And he will tell us about what's going on in the
Northwest.
Bill Howe: Thanks. So I want to talk about some of the activities that we're doing at the eScience
Institute, which has recently been reenergized with a grant from the Gordon and Betty Moore Foundation
and the Alfred P. Sloan Foundation, although our activities in the [indiscernible] predate this award.
This has really been able to give us the kind of scale that we've been looking for for a while. So this
is a joint partnership with NYU, Berkeley, and the University of Washington for a five-year program to
create a data science environment.
So just to give you kind of the scope of this: we're not going to give you a spiel about data science in
general, or eScience in general, because I think the audience is cohesive enough in this area that we've
all seen these pitches before.
But just to give you a sense: you know, we put out a call for posters for our rollout event and we had
137 from departments all around campus. And some of the areas, such as political science, actually
produced some of the most interesting posters that we found. So there's really a pervasive need that we
all sort of recognized, but it was nice to see it in practice. So there's a pretty substantial
groundswell of interest in these topics, in eScience, and in the context of this award we're trying to
be the umbrella organization to manage it.
And just to throw these covers up that we've all sort of seen, this is a fairly old slide here, but what I
like is Roger was here yesterday and of course here at Microsoft, and I use this quote from him, it's a
great time to be a data geek. But the ugly side of this is this other quote I like to give a lot is this, and
I'll let you read it for a second. [laughter] This is perhaps the ugly side of data science. And what
we're trying to do, and I think most people in the room here are trying to do, is to take this term data
science and some of the technology and techniques and skills around it and really kind of focus it on
astronomy and oceanography, the physical, life, and social sciences, as opposed to only industry needs.
That being said, we really see a lot of interplay between the two sides of it, in that
industry can invest deeply in information technology that science sometimes can't; so we can borrow
tools and techniques. And going the other way we see there's kind of a culture of statistical rigor
coming out of sciences that I think businesses are only sort of just starting to acquire. So the line
between them is blurring. And we like to sort of take advantage of that when we can.
Okay. So the way we set this up is to establish a virtuous cycle where, you know, new advances in data
science methodologies and techniques and technologies enable new discoveries, and new discoveries in
turn spur requirements back in the methodology fields. And to mediate this, we have working groups
across six areas, and in this talk I just want to focus on three
this, we have these working groups across these six areas, and in this talk I just want to focus on three
of them and give you some highlights from education, software tools, and this idea of working spaces
in culture where we have this data science studio that we're building up that I'll end on. And the other
one just so you see them -- do I have a pointer here somewhere? No, I don't do that. Maybe I don't
have a pointer. No, there's no pointer on here. That's why it doesn't work. That's too complicated. Oh,
here it is. It's quite nice. Oh, gosh, yeah. Burn somebody with this.
So reproducibility I'll point out. The connection between reproducibility and the Cloud is great, and I
have some slides on that but I'm not going to show them. I'll be happy to talk to you offline. We have
this kind of meta analysis of our own success that we bring up as a first-class citizen. And career paths
is a big deal too that I'll be happy to talk to you about offline about how we're trying to change the
kinds of people we attract and try to keep them around focused on science.
So these three I want to focus on in this talk. So in education we have kind of a broad portfolio of
programs in data science. There's a certificate program that Roger mentioned yesterday very briefly.
He's been involved in helping to design it through the University of Washington educational outreach
group. There's a massively open online course that I ran last spring that I'm going to run again this
summer, I hope, that I'll talk about in the next few slides.
There's an NSF grant under the IGERT program, if you're familiar with this -- Integrative Graduate
Education and Research Traineeship -- and this is focused on eventually creating a Ph.D. program in big
data, but through the stepping-stone of a Ph.D. track.
We have new computer science courses focused on big data and data science, and all sorts of boot camps
and workshops; we hosted the Azure two-day workshop last summer, for example. There is also a CS course
where we're trying to open up a different path into computer science topics for people from all majors.
Rather than just focusing on games and puzzles, we focus on data analysis as a first-class activity,
which we think has the capacity to open things up to larger cohorts that aren't typically interested in
programming and computer science but would be if we focus on data. We're getting started on a data
science master's degree.
And I'll end with this. We have this incubator program that we run out of the studio that's less about a
formal education but kind of a hands-on training flavor.
So a few slides about this MOOC. We ran this through the Coursera platform. And the challenge here,
one of the things I was interested in, was seeing if we could cherry pick topics from introductory
classes that are typically separate; statistics, machine learning, databases, and visualization, and
combine them together in one course, which is, you know, maybe a bit of a magic trick. You run the
risk of being so superficial that nobody learns anything, or sort of having the intersection of
prerequisites be so small that there's only about three people in the world that would actually benefit
from it.
So I'm reasonably happy with the outcome. The idea was to be broad, sort of superficial, but have deep
dives in particular topics that we think are key ideas that everyone needs to learn. This was the
syllabus. We sort of focused on what is data science, given that the term is sort of new and people have
different interpretations of it. We talked about databases and MapReduce and NoSQL, and all of this
under this umbrella of data manipulation at scale; statistics and machine learning, where we really can't
give a rigorous introduction to the mathematical concepts in a couple of weeks, but what we can do is sort
of teach key algorithms that everyone should be familiar with and key concepts that -- I tried to pick
ones that aren't typically taught in a Statistics 101 course. I didn't want to do kind of a compressed
Statistics 101 because, A, I'm not the right guy to teach that, and, B, it would be somewhat boring.
Visualization. And then this week we might swap out with other kinds of special topics. But the first
time we did it, we did graph analytics that seemed to be popular and we had guest lecturers from
various companies.
So the participation numbers are fairly fun to give, but some of the bigger ones are kind of
meaningless, and I'll explain why. So the number of people registered is a ridiculously large number,
120,000, but the bar to register for a course is essentially zero, right. You log into Coursera and you
click one button and that's it, right. There's obviously no money being paid, there's not even any kind
of sign-up process or anything. So this includes people that had no intention of going through the
course; they were just exploring things, who were going to watch one video and pick what they want to
take and so on.
So a slightly more realistic number is people who clicked play on at least one video in the first two
weeks, which is still quite large, but this is people that are still kind of feeling things out.
Now we're getting down to the people that are actually taking the course. 10,000 people actually
turned in the first homework. And after that the attrition, I'm reasonably happy with. 10,000 turned in
the first homework and about 9,000 completed all the homework, which is pretty good. This is
partially gamed by me, because the first homework was a bit of a doozy. So we sort of, you know -- if
people got through that, they had a pretty high likelihood of being able to get through the rest of it.
And this overall attrition rate for a MOOC is pretty standard, because the bar is so low. So a lot of
times when people are trying to bash these MOOCs, you know, they like to point out the fact that other
[indiscernible] online courses in the past have had a much better success rate of keeping people around.
But never before has the bar been so low to sign up for these. So it's not a very convincing argument to
show this and argue that the MOOCs are unsuccessful. Although I'm not necessarily a MOOC
cheerleader. This is wholly an experiment on our part. We're not totally sure where this whole thing is
going.
Okay. And the number of people that passed was about 7,000. Other kinds of numbers I thought were
interesting: the discussion forum was a really critical piece of this. The students who were paying
attention and were on top of things could answer questions faster than we and the TAs could, and that's
really the only reason this works at this kind of scale: you do have this kind of cross-talk in
the forums. So there's quite a bit of activity there. And this is a dataset we can actually do some analysis
on. It's something we're looking into now to detect when people are learning and how people are
learning.
So all this is pretty consistent across, quote, hard courses that have some kind of a technical or
programming component to them.
Just to illustrate the idea of attrition, this is videos watched: a lot of people watched the first one
and then it went down. And again this is pretty standard. This is the same thing with the assignments,
which are a little flatter, which is nice. And this is one assignment and two assignments with different
parts. You can see that as the parts get harder, there's some dropoff in this sort of Twitter
sentiment analysis assignment, and the same thing with the databases.
We had coverage from all over the world. This is actually not a map that I drew or a survey that I did.
This is a student who posted on the forum: hey, where are you from? Mark it on the map. And I
stole that and used it in talks because it was sort of nice; I didn't actually have to do any of this work.
There's some concentration in the U.S., but it seems to be pretty much spread all over.
This was a little bit disappointing. So one of the stories motivating MOOCs is that there are people
who don't traditionally have access to higher education materials who are signing up for these things.
And some of the evidence coming out, which has been pointed out before, is that it's not really clear
that that's who's taking these things. A lot of times it's people who already have degrees who are
professionals and working and trying to augment their skill sets as opposed to people who lack access
to higher education materials. And that sort of came out here too: the biggest number here is -- we
attracted a lot of professional software engineers as opposed to undergrads in developing nations
or something.
>>: But don't you agree those people also have a lack of access to the same materials? It seems like it
does fit your --
Bill Howe: Possibly, but I think they're probably doing pretty well. And they're trying to -- they're
interested in this -- in rounding out their skill sets as opposed to -- I mean basically they have a degree
and they're successful and they have a job and so on. And sometimes there's a disconnect between the
reality and how the advantages of the MOOCs are pitched, which is, oh, we're giving opportunities to
people that don't necessarily have the degree or something. It's not all bad. This isn't terrible. It's just
that this is not quite the same story that you hear from the purveyors of the MOOCs. But this is pretty
standard. A lot of people pointed this out before that a lot of these courses are being taken by
professionals.
>>: I would agree with Chris, I think this is good news.
Bill Howe: Okay. Well, good. This is great news.
>>: [inaudible] as, you know, being someone that fits that category, I didn't sign up for your class, I'm
sorry, but being somebody that fits that category, I've taken MOOCs for that very reason, because I
don't have the ability to get to a university to take a class. It means I do not have availability or access
to it. Just because I'm working doesn't mean I don't have availability.
>>: But this room is full of people who have been retrained several times.
Bill Howe: That's true. That's true.
>>: Okay. So if I'm a barista, what category am I going to land in there?
Bill Howe: I don't know. There might be --
>>: You don't seem to have something like working stiff. [laughter]
>>: I mean everybody's a professional --
Bill Howe: Working professional, non-tech, is what I tried to do to capture that.
>>: Okay. So let's say that there's 1760 people that are not in tech, some of those people might be
making minimum wage. You don't know.
Bill Howe: Yeah. Yeah. Yeah. Absolutely.
>>: You give them an option to say minimum wage employee.
Bill Howe: Yeah, if we redo the survey, there's lots of ways to improve the survey. That's not the most --
>>: Yeah, I guess --
Bill Howe: That's not the biggest thing I regret about the way I designed the survey, actually.
>>: Yeah, I'm just thinking, yeah, you really can't tell who you're serving in terms of -- let's say,
underrepresented groups or people who are trying to climb the socioeconomic ladder.
Bill Howe: Absolutely. And I was a little bit nervous about doing too much demography in the survey.
Because I'm not totally sure about the legality of some of these things. It's a little bit fuzzy. Because in some
sense, I'm not officially representing the University of Washington in these capacities because
otherwise we're into FERPA requirements. However, it's not totally obvious that there wouldn't be
some flack coming back from the University of Washington if we break some rules.
But anyway, I would like a much richer dataset of who we're helping. And there might be a way
around that. But I was being fairly conservative in this case to make sure I wasn't running afoul of
anybody or making anybody feel uncomfortable trying to answer a question and then getting a bunch
of flack. At this scale everything becomes a problem: if somebody can complain, you will get emails.
But you're right; bringing in somebody who is better at designing these instruments would
be smart. This was sort of, whatever, flying the plane as it's being built, whatever the metaphor is.
>>: I don't mean to quibble with your main query, which is basically half of them are already software
engineers.
Bill Howe: Which wasn't necessarily what I expected, I guess.
And the other thing about this scale is you get these great little anecdotes. If you have 10,000 people,
the odds of something awesome happening are pretty high [laughter] -- though of course I get to cherry
pick the examples. But there's actually somebody who ran a company -- it turned out he was a grad and I
didn't even know that at the time -- who was working on assignments over on his honeymoon, and he
ended up selling his company and decided to take a job, as a data scientist, with one of the companies
that gave a guest lecture in my course, and sort of completely changed his career, and other things like
this. So it's sort of fun to do this, because you do have the impact at this scale.
But okay, so I wanted to mention --
>>: [inaudible] my wife's amusement [laughter] my wife's annoyance.
Bill Howe: I was wondering if you paused over the word amusement. Like was that for my benefit or
was that the reality.
Okay. So real quick I want to mention this. That's one activity that we're happy with and we're going
to keep running it. This is Magda, who is the lead PI on the grant -- and it's a broad partnership, but
she's really been the prime mover through this. If you know Magda Balazinska, she's database
faculty at the University of Washington. And the idea here is we stole this from Alex, who
claims that he stole it from somebody else, but the provenance is unclear. It's this idea of pi-shaped
people -- people have heard this analogy before? Anybody know what I'm talking about? So if you're
a T-shaped person, you have kind of the thin veneer of interdisciplinarity that we all have and kind of a
deep leg in one area. And a Greek letter pi would have, you know, two deep legs, right. Maybe you're
sort of machine learning and biology, which is some [indiscernible] people or something like that. And
we see the creation and support and incentivizing of, you know, career paths for these pi-shaped
people as kind of one part of our mission. We want to create these people and reward them when they
do this, as opposed to perhaps what the status quo is that if you end up -- if you're a biologist and you
get a little too much into computer science, well, now you're neither fish nor fowl, right; you can't get a
job in a biology department because you do too much of that programming stuff that we don't like and
you're not going to be a computer scientist because you didn't write enough papers in SIGMOD or
something like that. And yet these people are, in some contexts, kind of driving science forward. So
we're interested in supporting that. So that's the cute analogy here.
The reality is the NSF program -- IGERT is a broad program; they do it every year. The particular
year we had the award, they were particularly looking for this data-enabled science and engineering, kind
of data science, track. So all the proposals that came in were about big data Ph.D. programs of
various forms, and ours was one of the ones that was funded. The key tenet here is that we try to
have students in domain science and students in methodology sciences working together as a cohort,
whereby the last chapter of the thesis of a Ph.D. student in computer science would be about how their
techniques were actually applied in domain science and the actual science that came out of it. And the
last chapter in the oceanography student's thesis would be about the new tools and the
software they developed.
And part of this is not just convincing the student to write that chapter, but also convincing the thesis
committees to value that work in both directions. So we want them to sort of sit together, work
together, come up with projects that sort of go together. And I won't go into the details here but I'm
happy to talk to you about it offline.
And then actual cyberinfrastructure development -- where we're deploying this stuff and seeing it used --
is another key theme here: not just throwing out proofs of concept, but actually getting real,
sustainable tools built. And we have participating organizations -- departments across
campus are participating in this: astronomy and genome sciences as well as computer science and
statistics, for example. All right. So that's a couple of points about education.
In the software area -- I'll try to go faster here; I think I'm going to run a little long. We see that
data science can be thought of as this three-step workflow. And this isn't our idea; you see this sort of
all over the web: there's a first step of preparing data, a second step of actually running
the model, and a third step of interpreting and communicating the results. And what we hear from
people, and we agree, and we heard a little bit of this yesterday from various people, is that
this first step is really the hard one. This is 80 percent of the work, so to speak. And then the joke is
that this is the other 80 percent of the work.
But this No. 2 is not really what's keeping people up at night. It's not really about selecting the
methods. And there's a couple of reasons for this. One is that simple methods actually go a really long
way. Going from we have no predictive analysis capability to we have some predictive analysis
capability is a big step. Getting a two percent improvement by using someone's sophisticated
state-of-the-art method is not really the difficult part. Also, simple methods are the ones that actually
scale to large data, typically, as opposed to the really intricate ones, and so on.
The bad news here is that this step, I claim, is getting the least research attention actually. This one is
very fun and easy to sort of -- well, not easy, it's very fun to find -- come up with new algorithms and
get them published, while this is kind of seen as, oh, this is some stuff you have to go through, some
pain. We'll hire a post-doc who will write a bunch of code and they'll handle that part, and then we'll
get to the cool part, which is the machine learning to cure cancer.
And then similarly there needs to be a little bit more emphasis on this as well. This is visualization.
You know, it can't be sort of a black box decision. You can't spit out a number and expect stakeholders
to make a decision.
Dennis is standing up, so I'm definitely running long here.
You know, if you're going to sort of build software around this stuff, and you're going to focus on this
step one, you have to answer the question of what are the abstractions of data science. And the fact
that we're using terms like data jujitsu and data wrangling and data munging, and this is what you see
in sort of on the web and so forth, suggests to me that we don't really have any idea what we're talking
about; right? This is bad.
But the good news is that I think we do sort of know the abstractions of data science; they just
haven't always been recognized as such. Some of the candidates are, you know, matrices and linear
algebra, relations and relational algebra, maybe everything-is-objects-and-methods like in Java or C#, or
files and scripts as in computational science. I would claim that only the first two are really viable as
potential fundamental abstractions to manage data science workflows. And I would further argue
that relations and relational algebra are under-recognized as a fundamental concept here. So some of our
software here is really trying to jail-break relational algebra out of databases
and get it used in more of a data science analytics sort of context.
And SQLShare you heard mentioned yesterday. We tried to simplify databases down to just this simple
three-step workflow: you upload data and don't worry about a schema; you write your queries; and you
share the results with people, and that's it. And you sort of organize it in a web-based interface
where you can write queries and share them and so on.
And we've had a lot of uptake in science for this where previously they never would have used
databases. We've got people that don't write any program code in any language whatsoever -- not R, not
Python or anything -- but they write these sort of 40-line hairy SQL queries that do, like, interval
analysis on genomic data or whatever. So this is a pretty good sign that maybe there's something here --
you know, this whole declarative language thing actually has some weight behind it.
We see R scripts being translated into SQL queries, and they sort of ostensibly don't get more
complicated when you do that. They sort of have the same expressive power. And yet the one on the
right scales sort of infinitely. We know how to scale the SQL program up; we don't know how to scale the
R program up -- "we" being everybody. You know, Stephen Roberts gave a talk yesterday: Galaxy
workflows that have this kind of complicated multi-step thing going on get kind
of compressed down to a handful of queries, and so on.
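To illustrate the point about declarative queries with a small, self-contained example (the table and columns are invented), the kind of per-group aggregation one might write as an explicit loop in a script can be stated as a single SQL query, and that query form is what a parallel engine knows how to scale:

```python
# Tiny illustration of "declarative instead of a loop": the query states the
# result; the engine decides how to compute (and, in a parallel system, scale) it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, depth REAL, oxygen REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("A", 5.0, 8.1), ("A", 10.0, 7.4), ("B", 5.0, 6.9), ("B", 10.0, 6.2),
])

# Instead of looping over rows and accumulating per-site statistics by hand:
query = """
    SELECT site, COUNT(*) AS n, AVG(oxygen) AS mean_oxygen
    FROM readings
    WHERE depth <= 10.0
    GROUP BY site
    ORDER BY site
"""
for row in conn.execute(query):
    print(row)
```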
We're teaching people SQL in workshops and so on. But the problem here, and I'll wrap up with this, is
that that's just SQL, that's just relational algebra. There's a certain class of analytics tasks
that you can't do in SQL. And we're trying to see if we can take what we like about relational algebra
and add just enough richness to capture a lot more of those steps. So instead of query as a service we
think about this as kind of relational analytics as a service.
There's a big team, a bunch of students working on this. I'll skip the architecture. We have a little
relational database on each node and it's a big parallel sort of engine. The other weakness of SQLShare,
perhaps, is that it wasn't really focused on hundreds of terabytes, and Myria is. And the two big things
about Myria are iterative queries -- so you can do loops, which means you can express things that
converge, right; you can express analytics tasks and clustering tasks and so on -- and then it also
scales a lot. It scales as well as we know how to at the state of the art.
So we have this notion of relational algorithmics. We want you to do algorithm design, but you're using
relational algebra right there in your program, kind of like LINQ in C#. And this is k-means in our
little language, where everything here is relational algebra, but it's expressing k-means. I'll skip
this, I guess.
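This is not Myria's actual language, but the same idea can be sketched in Python/pandas: every step of k-means is expressed with relational-style operators (a cross join, a per-group argmin, a group-by aggregate), and the surrounding loop supplies the iteration that one-shot SQL lacks:

```python
# k-means where each step is a relational-style operation; the loop provides
# the convergence/iteration that Myria's iterative queries make first-class.
import pandas as pd

points = pd.DataFrame({"pid": range(6),
                       "x": [1.0, 1.2, 0.8, 8.0, 8.3, 7.9],
                       "y": [1.0, 0.9, 1.1, 8.0, 7.8, 8.2]})
centers = pd.DataFrame({"cid": [0, 1], "cx": [0.0, 10.0], "cy": [0.0, 10.0]})

for _ in range(10):                                   # the iterative "query"
    pairs = points.merge(centers, how="cross")        # points x centers
    pairs["dist"] = (pairs.x - pairs.cx) ** 2 + (pairs.y - pairs.cy) ** 2
    assign = pairs.loc[pairs.groupby("pid")["dist"].idxmin(),
                       ["pid", "x", "y", "cid"]]      # argmin per point
    centers = (assign.groupby("cid", as_index=False)
                     .agg(cx=("x", "mean"), cy=("y", "mean")))  # new centers

print(centers)
```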
This is manually optimizing a program: you go from a simple relational algebra version of it to an
increasingly complex one where it gets faster, but there's no hope that the database optimizer is going to
magically produce this program from the original one. So you do need to have the programmer
working in this space. So it's not just query optimization, it's actually algorithm design, and yet every
single line is relational algebra. And we want query debugging and visual analytics of the running of
these things to be kind of a first-class citizen as opposed to something only DBAs do. We want the
actual analysts and the scientists to be doing this stuff. So we have some visualization work in this
space.
And then let me just wrap up with this. We're conscious of the need for physical proximity -- or we
believe, our hypothesis is, that physical proximity between the computer scientists and statisticians and
methodology people and the domain scientists is sort of important. So we're trying to set up a data
science studio here on campus where they can sit and work together, and we run programs through
this. And one of the programs we're running is this incubation program where people submit one-page
proposals. There's no money exchanged. They get to come and sit elbow to elbow with our staff, our
data science staff, and thereby hopefully turn sort of a two-week problem of trying to figure out how to
use GitHub and Hadoop into kind of a two-hour problem, because we have expertise right there in the
room to do it, and really focus on these sort of short-term, small, high-payoff kinds of projects as
opposed to the big sprawling two-year "let's write an NSF grant together" kinds of things. And this
seems to be working pretty well.
We're doing a pilot this quarter and we have projects spanning the social sciences. We have one here
where there's some manifold learning -- you know, finding these embedded manifolds, but they need to
scale the method up. This is some work by Ian Kelly, right here in the room, who's looking at cell phone
call logs and correlating them with events such as bomb blasts. And we're doing a systematic analysis of
different kinds of tools for this job, you know, on the Cloud and off the Cloud, and so on.
So this has been pretty successful.
And then if you want to get involved, there's various ways. You can take the MOOC, you can talk
about these incubator projects. We'd love to have people come by to work on either side of the desk:
either to come and work on their own project, or to help other people with their projects. We have a
notion of reverse internships where we hope to attract people from industry to come and physically
hang out with us for a quarter. And also we're pretty committed to coming up with reasonable materials
around all these programs so that if other people at other institutions want to try these programs out,
like run their own incubator, they can borrow from us, if that's somehow useful.
And I'll stop there since I'm over, I think, by a lot.
Dennis Gannon: So while Gabriel is setting up, we should -- we can take a few questions.
Yes?
>>: Kind of related, in terms of the education piece, I guess. So I think it's fair to say that in
academia you don't get promoted for writing software. So in this data science environment, are you
trying to tackle that, or are you actually creating a problem?
Bill Howe: No, that's exactly right. I didn't talk about it, but one of the working groups is Career
Paths and it's exactly that. One of the observations is that we're producing people with
Ph.D.s who, in some cases, are not necessarily motivated by the traditional academic
career path. We lose 100 percent of these people to industry, to a first approximation, right. There's
really no appropriate job for them on campus. Sometimes there's a staff programmer job where you're
beholden to some particular investigator in some particular department, and you live or die by their
interests. But that's not a very competitive, attractive job compared to what they can get in industry,
even ignoring salary, which we can't compete on.
So we're trying to say: what if we came up with this data science studio where they had some autonomy,
they had some prestige, they have some control over what they work on -- we probably still
can't compete on salary, but we can pay a decent salary. And we hope to attract these people back
through that and actually give them a career path where they can grow and get promoted in various
ways for doing this kind of work. And we'll see if it -- you know, ask me in five years whether it
worked or not, but we have some early success in that we have people who have been thinking about
what to do next, deciding to come and work with us, who are incredibly good. Dan Halperin from
UW CSE, who has a Ph.D. in computer science, got excited about this new science stuff and came back
to work with us. Jake Vanderplas, a Ph.D. in astronomy who's a machine learning expert, decided to
come back and work with us. So we have some success stories already.
Dennis Gannon: Thanks a lot, Bill. This is really terrific stuff. I'm delighted. [applause] You had
enough good stuff in there to come back again and give a two-hour talk.
All right; Gabriel has been working on a project that we have been helping with at Inria for the last
three years and he's going to give us a quick update on it.
Gabriel Antoniu: Right. Thank you very much. First of all I'd like to thank you for the excellent
opportunity to be here. So I greatly appreciate that. It's very exciting. It's a very exciting workshop.
So the projects I'm going to talk about are about data-intensive computing on clouds -- obviously on
Azure clouds -- for science applications.
So these are projects that have been carried out in the framework of the Microsoft Research-Inria Joint
Centre, which has a very nice location in Paris, but it's actually not a centralized center; it involves
teams from several research centers in Inria. So Inria is the main research institute in computer science
in France. It has eight research centers. And in these projects we have teams involved from three
centers. I'm personally in charge of one of the research groups, which is called KerData, focusing on
scalable data storage and processing for large-scale distributed infrastructures.
So the first project I'm going to talk about is called A-Brain. It's basically the first Cloud-oriented
project within this joint Inria-Microsoft Research Center. It was started as part of the Cloud
Research Engagement Initiative directed by Dennis, so it was one of the first French projects in this
framework.
At the infrastructure level, one of the goals was to assess the benefits -- the potential benefits -- of
the Azure infrastructure for science projects. In this particular project, we focused on large-scale
joint genetics and imaging analysis. Two Inria teams were involved here: my team, focusing on storage and
data-related issues, and another team called Parietal, located in [indiscernible] closer to the joint lab,
focusing on neuroimaging. And of course we were in close connection with Microsoft, with the Azure
teams, and also with people from ATL Europe.
So in this project we have basically three kinds of data: genetics data, the SNPs -- those were presented
yesterday, so this term has already been introduced. We also have lots of neuroimaging data, so
basically brain images, and behavioral data. That's smaller data that basically corresponds to the
presence of some brain diseases, for instance. And the goal here is to find correlations between
neuroimaging data and genetic data.
Previous studies done by experts of the area established some links between the genetics data and the
behavioral data; that is, for some specific genetic configurations there are higher risks of developing
some brain disease, for instance. And the idea here was to find correlations between the genetics data
and the neuroimaging data and to use the brain images to predict these kinds of risks for the presence of
a disease even before the symptoms appear. So this is, let's say, the science challenge.
So now concerning the data science challenge: there are millions of variables that represent the
neuroimaging data -- the voxels that make up the images -- and there are millions of variables in the
genetics data too -- the SNPs -- and there can be hundreds or thousands of subjects to be studied. So
there are a lot of potential correlations to be checked between each individual brain image and all
pieces of genetic data.
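As a toy version of that computational pattern (tiny random stand-in data; the real problem has millions of variables on each side, which is what makes a MapReduce decomposition worthwhile):

```python
# Correlate every imaging variable (voxel) with every genetic variable (SNP)
# across subjects. Sizes and data here are invented; each map task would
# handle a block of this correlation matrix in the distributed setting.
import numpy as np

n_subjects, n_voxels, n_snps = 100, 50, 80
rng = np.random.default_rng(0)
voxels = rng.normal(size=(n_subjects, n_voxels))
snps = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)  # allele counts

# Standardize columns; then all pairwise correlations come from one matrix product.
vz = (voxels - voxels.mean(0)) / voxels.std(0)
sz = (snps - snps.mean(0)) / snps.std(0)
corr = vz.T @ sz / n_subjects            # shape (n_voxels, n_snps)

# A reduce step might keep, say, the strongest association per voxel.
best_snp = np.abs(corr).argmax(axis=1)
print(corr.shape, best_snp[:10])
```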
So in order to do this, we started in a quite, let's say, simple way. We thought it would be interesting to
assess a MapReduce processing approach. Now, remember, at this time there was no Hadoop on Azure,
so we somehow started from scratch with the basic blob storage abstractions and the web roles and the
worker roles. So we basically started to build everything up from that level.
So there were several challenges for us at the infrastructure level. One of them was related to high
latency when the VMs accessed the public storage. The simplest solution was to use those
roles, deploy them, and then use public storage and have the VMs access the public storage.
Well, when you have highly concurrent accesses to shared data, that doesn't work very well. So we had
to try to do something better in order to enable efficient MapReduce processing. We had some previous
experience with Hadoop and we realized that it was not really the best we could get, but we said, okay,
let's try with that first; and later on, for instance, we had a challenge related to the optimization of
the reduce operation, because the basic way we applied standard MapReduce to this application was not
optimal.
So what did we achieve on the infrastructure side? Well, we tried to address all these goals. For the
first challenge, instead of using public storage -- Azure Blobs, for instance -- we said, okay, maybe we
should better leverage locality. For every virtual machine, there is a way to use -- to leverage -- the
virtual disk attached to it. So basically that's what we did: we gathered together the virtual disks of
the virtual machines into a uniform storage space and we built a software infrastructure for that. So
now the VMs had a kind of distributed file system deployed within the VMs.
Now, we wanted a system that is optimized for concurrent accesses, for heavy concurrent accesses to data.
In order to do that, we didn't just take some existing [indiscernible] system. We worked on our own, and
we arrived at versioning techniques to support high throughput under heavy concurrency. And in order to
do that, we had done some previous work in the team on a software called BlobSeer, which was designed
specifically for that purpose. I will not have the time to develop that; I will just say that it
basically combines a series of techniques like distributed metadata and lock-free concurrent writes,
thanks to versioning, in order to allow concurrent writes and concurrent reads to shared pieces of
information. And that's what we tried to push on Azure, or build the Azure version of, for TomusBlobs.
So this is the final result, let's say. I'm skipping forward towards the end of the project, say three
years later. I didn't mention that the project actually finished a few months ago.
And what we could achieve is actually shown here: we could reduce the execution time of the
[indiscernible] application, which is the application that we were talking about in these projects for
doing the joint genetics and neuroimaging analysis. We could reduce the time from this level here, which
corresponds to what would happen if we used just Azure Blobs, to this level here, which corresponds to
the usage of TomusBlobs. So that's a gain of around 45 percent in time.
We also were able to run this across several data centers up to 1,000 cores. There is a demo available
on that. So in order to reach this goal we had to go step by step. And there were some smaller
challenges, some sub-challenges we had to address. For instance, first of all we implemented the naive
MapReduce and then realized, while doing the performance measurements, that the reduce phase was
really not optimal. Actually there was a bottleneck at the level of the reduce phase, because we needed a
single result and we needed to gather the results from all the mappers to produce something that makes
sense. So we basically worked on a refined version of MapReduce, which is called
MapIterativeReduce, which allows basically the reduce phase to happen in several stages with
intermediate data, and it works as long as you have associative operations. So we could further
improve the execution time this way. So in this picture here you have the time again corresponding to
the usage of AzureBlobs in blue. And then that was in red is what happens when you use TomusBlobs
so we leverage locality. And what we gain further, when we use this MapIterativeReduce approach. So
it was worth doing it.
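A minimal Python sketch of that multi-stage reduce idea. It is not the project's MapIterativeReduce implementation; it just shows why the scheme only matches a single global reduce when the combining operation is associative (summation here).

from functools import reduce

def iterative_reduce(combine, partials, fan_in=4):
    """Reduce the partial results in rounds of at most `fan_in` inputs each,
    instead of sending every map output to a single reducer."""
    while len(partials) > 1:
        partials = [
            reduce(combine, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]

map_outputs = list(range(1, 101))    # stand-in for per-mapper results
total = iterative_reduce(lambda a, b: a + b, map_outputs)
assert total == sum(range(1, 101))   # same answer as one global reduce
print(total)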
We had the time to go further. So the next point was to go beyond the frontier of a single datacenter.
There is a limit on the number of VMs that you can deploy in a single datacenter -- a somewhat
arbitrary limit. It is, anyway, interesting to see what happens if you want to deploy a large number of
virtual machines, for instance by leveraging the resources available in different data centers.
So how to enable scalable MapReduce processing across several datacenters. That was one of the
challenges. So of course this corresponds to a [indiscernible] genetic scenario. It's not specific to our
applications.
Now, the big problem in such a configuration is the latency that you have between the datacenters. So
the goal is to minimize the transfers across the datacenters. And we came up with a solution for that.
It's not necessarily the optimal solution; it's something that we started to do and that we will continue
to optimize further. Basically we used TomusBlobs within each datacenter and public storage for data
management across datacenters. And this is how we could actually obtain those measurements I
showed before on 1,000 cores, because we needed three datacenters for that.
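A rough Python sketch of that two-level policy, just to make the idea concrete. The class names and dict-backed stores are stand-ins invented here, not the project's actual interfaces; the point is only that intra-datacenter exchanges go through the local, TomusBlobs-like store, while only data that must cross datacenters goes through public storage.

class DictStore(dict):
    """Trivial stand-in for a real storage service."""
    def put(self, key, data): self[key] = data
    def get(self, key): return self[key]

class TwoLevelDataManager:
    def __init__(self, local_store, public_store, my_datacenter):
        self.local = local_store      # TomusBlobs-like store inside this DC
        self.public = public_store    # public storage reachable from all DCs
        self.dc = my_datacenter

    def put(self, key, data, consumer_datacenter):
        if consumer_datacenter == self.dc:
            self.local.put(key, data)    # cheap, low-latency local path
        else:
            self.public.put(key, data)   # only cross-DC data pays the latency

    def get(self, key, producer_datacenter):
        store = self.local if producer_datacenter == self.dc else self.public
        return store.get(key)

mgr = TwoLevelDataManager(DictStore(), DictStore(), my_datacenter="west-eu")
mgr.put("partial-42", b"...", consumer_datacenter="west-eu")     # stays local
mgr.put("final-result", b"...", consumer_datacenter="north-eu")  # goes public
print(mgr.get("partial-42", producer_datacenter="west-eu"))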
Okay. So now you may wonder what this was useful for. What did the application team do with it? We
worked closely, as I said, with another team focusing on neuroimaging and genetics. Basically they
used the infrastructure to run their algorithms and improve them, and they came up with a new
algorithm for brain imaging genetics studies. Basically it's an algorithm that helps in detecting
correlations between areas of the brain and genetics data. This approach is called RPBI. I'm not an
expert on the [indiscernible], so I'm not going to focus on that. But what we see in this picture is a
comparison between the results obtained with this new approach and with the state-of-the-art approach:
you see better highlighted areas here, which are more colorful than what you see here. So it's a quality
improvement in this detection process. That was nice; it was obtained on the Azure Cloud, and we're
very happy about it.
But one thing you might wonder is: what nice science result did it obtain? Did this lead to any science
discovery? Actually it did. We had time after this study to do something else. Rather than just
assessing the impact of one specific SNP on the brain, the idea was to study the joint effect of all SNPs
on the brain variables. And this comes to the notion of heritability. A method was developed for that
by our partner team, and in the end we could experiment with it on real data from a reference dataset,
which is called Imager. It comes from an [indiscernible] project; it's real data obtained from very high
resolution images. The result that was obtained basically establishes the heritability of some specific
brain activation signals -- what I would call the stop phase of the brain activation signal. These signals
were shown to be more heritable than chance and other regions. So it was an interesting result that we
submitted to the Frontiers in Neuroinformatics journal, and it was accepted a few weeks ago. We are
quite happy about this. We think it's quite a representative result for data science. It's probably the
kind of thing that we want to show: infrastructure teams working with science teams that don't know
how to use the parallel infrastructure, [indiscernible] infrastructure, and we succeeded in doing that. So
we are quite happy about these results.
So the application team learned how to use the Cloud and learned that there could be advantages to not
having to manage a cluster and [indiscernible], and to just rent resources. On our side, as an
infrastructure team, it was interesting to do large-scale experiments on thousands of nodes with
multi-site processing, so we could experiment with our BlobSeer software on this new infrastructure
and refine everything. We could run experiments for up to two weeks. So it was quite interesting --
more than $300,000 worth of computation used, for instance.
If we want to sum up this project: on the infrastructure side, we achieved the results I mentioned -- we
could scale up to 1,000 cores and cut the execution time by up to 20 percent -- sorry, 50 percent. And
on the science side we established this heritability result.
Okay, plenty of publications; I'm going to skip through them. These are the other people involved, and
especially [indiscernible] here, [indiscernible], the main person involved in this project, through whom
all this technical work was achieved.
So the question is: what's next? Once this was done, a few months ago as I said, we started to think
about something else, going beyond what we achieved -- for instance, handling not only MapReduce
workflows but any kind of workflow. We could see several relevant examples yesterday. And also
focusing on this multi-site infrastructure and trying to understand what the challenges related to that
are. Dealing with workflows -- I'm not going to explain that; you already know what it is, and several
examples were given.
We will focus on what is difficult in enabling this kind of workflow on multi-datacenter clouds,
assuming quite natural things: the fact that each Cloud site has its own data and some programs, that
some data shouldn't leave some sites, and that the processing must happen in different places. So it
matches a real situation.
So there are several challenges related to that: how to efficiently transfer data, how to group tasks and
datasets together in a smart way, and how to balance the load. I'm just going to focus on the first
challenge, data transfer, with two examples. For instance, one approach we started to explore is a way
to leverage network parallelism by enabling multi-path transfers within the sites. Basically it consists
in allocating dedicated VMs here just for the transfers. So we run additional VMs to increase the level
of parallelism.
We can do even better than that. We can also allocate VMs in separate datacenters in order to further
increase the level of parallelism. Of course it can increase the parallelism, but obviously it has some
cost, so it's interesting to also assess the tradeoffs that might appear, because you increase the
[indiscernible] but you pay for that. So the right question here is: how much are you willing to pay to
have better performance, and actually how much is it worth paying? Because at some point what you
get extra is not worth it any more. So these kinds of challenges are the ones that we're going to focus
on in the next project.
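A hedged Python sketch of the kind of cost/performance question raised here. Every number and formula below is invented for illustration (per-VM throughput, diminishing returns, prices); it only shows the shape of the analysis: keep adding transfer VMs while the marginal gain is still worth the marginal cost.

def throughput_mb_s(n_vms, per_vm=40.0, efficiency=0.85):
    # Illustrative diminishing returns: each extra VM contributes a bit less.
    return sum(per_vm * (efficiency ** i) for i in range(n_vms))

def worth_adding(n_vms, price_per_vm_hour=0.12, value_per_mb_s=0.004):
    # Marginal throughput of the next VM, valued against its hourly price.
    gain = throughput_mb_s(n_vms + 1) - throughput_mb_s(n_vms)
    return gain * value_per_mb_s >= price_per_vm_hour

n = 1
while worth_adding(n):
    n += 1
print(f"With these made-up numbers, stop at {n} dedicated transfer VMs.")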
Streaming was also mentioned in some applications, in some presentations yesterday. So we are also
going to look not only at batch transfers, but also at streaming in all these multi-site settings.
So basically -- and this is actually my last slide -- in this project we're going to look at several things:
how to adapt workflow processing, since what we use right now is mainly targeted at single-site Cloud
infrastructures, and we will also work on the expressivity level. In this second project our partner team
is another team from Inria called [indiscernible], which is focusing on scientific data management and
on higher-level aspects than us -- we are more system level, so to say. So they are interested in this
aspect.
We're going to build a Cloud data management framework addressing the challenges that I mentioned
before and also work on scheduling issues.
And I guess this is it. Thank you. [applause]
Dennis Gannon: Thank you. We have time for questions and Alex will speak.
>>: So if I launch multiple VMs and I start transferring from all of them, it might be network parallelism,
but it might be other kinds of [inaudible]. Is it true they're all taking different paths, or do you have
such a [inaudible] that you [inaudible]?
Gabriel Antoniu: Well, in fact, sending data -- there might be a limitation related to the one-to-one
communication between two VMs situated in two different datacenters. But by allocating new VMs in
the datacenter at the sender site, that communication might be very, very fast actually. And then there
can be multiple sources or senders [indiscernible] on the receiver side, and that can improve things. I
mean, we did experimental studies for that and we could measure that it was worth [indiscernible].
>>: Yeah, yeah, yeah.
>>: So your TomusBlobs provide better performance than using something like S3 or AzureBlobs. Can
you briefly say how it would compare to just [indiscernible] on my local, just raw disks using HDFS?
Gabriel Antoniu: Basically it performs better, actually, because prior to these studies we experimented
with the BlobSeer framework, which implements those principles of leveraging versioning and so on,
within Hadoop. So basically we replaced HDFS with a BlobSeer-based version of Hadoop that is
optimized for concurrent access: it allows concurrent writes, for instance reads concurrent with writes,
with a distributed metadata [indiscernible] scheme. In the initial evaluations we could get up to 30
percent improvement just by keeping the standard configuration of Hadoop. So that was kind of a
starting point for this project: let's try to see how it works when we're on the Cloud. If I were to start it
now, I would probably do things differently, because we have Hadoop on Azure, so we would probably
explore what's happening in that framework.
>>: It would be interesting to -- have you tried -- excuse me, comparing Hadoop on Azure to
[inaudible]?
Gabriel Antoniu: We haven't tried that, but -- well, there's some previous experiments [indiscernible].
If you want to say --
Dennis Gannon: Actually, why don't we save that for the break and we can talk then.
Gabriel Antoniu: Okay. Thank you very much. Thank you. [applause]
Dennis Gannon: So I'm going to introduce my colleague, Alex Wade.
Alex Wade: If I live long enough.
Dennis Gannon: I'm thinking you will.
Alex Wade: Please do.
Dennis Gannon: And he can go ahead.
Alex Wade: My name is Alex Wade. I am on the MSR Science Outreach Team. And I was going to
try to set the record for the fastest-talking presentation, but after Roger went yesterday, I decided I
couldn't compete with that. But I do have a lot of slides here so I'm going to be clicking furiously here
to get you to the break on time.
I want to go back to some of the things that Mercè was talking about earlier, and I'm sure everybody
pulled out their dog-eared The Fourth Paradigm before they came to this and refreshed. But a lot of
things that we're talking about here with data science and data-intensive research have a lot of
implications for the less sexy parts of the care and feeding of the research data themselves, and
implications for institutions, labs, and librarians in taking care of that. Specifically, one of the bits that
I think Jim Gray evangelized was this notion of having all aspects of the data available and
online, having the raw data, having the workflows, having the derived and recombined data available to
facilitate new discoveries and to facilitate reproducibility.
And so Mercè didn't mention this, but she was co-author on a publication that was just released in
PLOS Computational Biology last week, a couple weeks ago, that talked about ten simple rules for the
care and feeding of scientific data. I don't want you to use this as an excuse not to go and read the
article. It's an open-access article on PLOS. Do that. But the shortcut is: these are the ten simple rules
right here.
And I want to build on some of the things that Mercè was talking about with respect to the tools that
Harvard is building out and the Dataverse Network, but I want to focus specifically on these four right
here about sharing your data online, linking to your data from your publications, publishing the code
that goes along with that data, and then fostering and using data repositories.
So for the first item, repositories: there are a number of different software packages available right now
for building out a data repository. The general push that's been going on over the past couple of years,
with efforts like DataCite, is to promote the discovery of datasets as a continuation of the publication.
In other words, you publish your research, somebody reads your research, and then they say, I want to
go and get the data. The primary avenue for finding the data is via the publication, via assigning a DOI
to a dataset along with the publication itself.
But I want to build on that a little bit and talk about some next steps we can take to facilitate the access
and reusability of those datasets, and then finish with a little bit more forward-facing look at how we
can simplify the discovery of those.
Mercè mentioned some work that we did with the California Digital Library in building out the DataUp
platform, and Chris Toll and CB Abrams from CDL are also here if you have any questions at the break
about this. DataUp is intended to be a browser-based, Open Source front end for data repositories that
allows individual repositories to plug into it and provides end users an easy way to integrate with their
own personal cloud storage. So we plugged this into OneDrive storage. It then applies a simple set of
rules for describing the metadata around your datasets and having them get published into a data
repository. This is available for folks to work with. We're looking for other people to get involved in
building on-ramps between the DataUp software and their repositories. So do talk to us more about
that if you're interested.
But if you're setting out right now and you're saying, I'm a lab, I'm an institution, and I need to build out
a data repository, there's a large spectrum of software being developed to do this right now. There are
some traditional software packages that have been used mostly for publications, things like DSpace and
EPrints, that are either being modified, or have white papers being written, on how to use those
repositories to manage data files, more or less.
There's the Dataverse Network that Mercè talked about, and I'm going to talk briefly about the CKAN
software from the Open Knowledge Foundation. This is generally being used in the UK, and globally
now there's a lot of use in open government data initiative types of use cases. But then there are also
hosted repositories, things like figshare as software as a service, a place where you can go and upload
your research data into figshare and get a DOI for it. Microsoft has a service as part of Windows Azure
called the Windows Azure DataMarket, and I'll talk briefly about that.
So really quick, with the CKAN software, I'm going to jump in here and do just -- I'm not going to give
you a demo, but if you go to CKAN.org, you can walk through this yourself. There's a live demo link
right here. And you get dropped into a data repository which has a whole bunch of things that people
have dropped into it. But the general idea is that you have a browse-and-search-based way of drilling
down into these datasets here. So you can filter by organization, you can filter by group, you can filter
by tags, by the actual file format and by the licenses. There's a search engine in it. And if you pick a
dataset -- let me see if I can find a good one here. Road closures -- you get access to the files
themselves, but then also the ability to drill into the data and preview it before you download it. That's
a very intense dataset with two whole entries in there. But play around with that just so you can sort of
see what the CKAN
software gives you.
If you wanted to create an instance of this, this is something that's supported right now on Windows
Azure. And I'm going to walk through just a couple -- well, a lot of screenshots very quickly, just to
give you a sense of how quickly this can be done. Actually, it took me longer to make the screenshots
than it took me to create the repository. But the general idea is that I'm going to copy two disk images
out of the VM Depot into my own blob storage, then I'm going to create a virtual machine from each
one of those disk images, and then I'm going to link them together. Those are the three steps.
So when you log into Windows Azure, if I go to virtual machines, you can see that I have no virtual
machine instances here, and if I click over on the next step there on images, I have no images either.
So I've got an empty shell here. And this browse VM Depot is your entry point then into this set of
community-contributed virtual machine images. There's a huge number of them. I don't know how
many we have these days. But if you browse in there, you can see there's two CKAN disk images
there. There's one for the CKAN database and one for the CKAN web front end. And I can grab one
of those. This is the database. I'm going to tell it which image I want to copy it into, give it a name,
and it starts copying. I go back in there and now grab the web, put it in the same region, give it a name,
and that starts copying over. So I now have two disk images here in my site. I register them; they're
available. So now from those images I can create my virtual machines. I go in, create a virtual
machine, come down here to my images, I see the two that I just copied over. I give the VM an image,
pick a size, and boom, that's starting up now as a VM. Do the same thing with the web. And I've got
my two VMs now up and running.
Last thing I need to do then is go in and link them together. So the web front end is talking to the
database, and I click the link resources, pick the storage account, and now it's up and running and I
have now an empty data repository that I can log into, I can start creating my accounts, I can start
populating that. Okay. So that's the model for taking a disk image off of the VM Depot and building
your own data repository.
As Mercè said, the Dataverse Network is working on this as their 4.0 release comes out in June. This
will also be available via VM Depot.
So I want to build on that then and talk a little bit more about evolving beyond the notion of a data
repository as a place where there's a thing that you're going to go out and get. And one of the nice
things that I like about the CKAN repository is that they start to think about the ways that you can now
start interacting with this data beyond simply grabbing a CSV file and downloading it. So they have
over here this link for the data API. So you can look at the data, you can sample it, you can apply some
filters. They even have some graphing and mapping capabilities so that while you're in your browser
you can play around with it and say is this the data that I want to interact with, and once you say yeah,
this is what I want, then rather than downloading the dataset, the data API here gives you how to
update it and how to query it -- direct access to the data themselves. So you can write an application that is
querying it out of the data repository rather than creating your own instance of that dataset.
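As a concrete illustration of that query-instead-of-download pattern, here is a small Python example against CKAN's documented DataStore search endpoint. The host below is CKAN's public demo site and the resource_id is a placeholder you would swap for a real resource's identifier.

import requests

CKAN_SITE = "https://demo.ckan.org"                    # example CKAN instance
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"   # placeholder resource id

# Ask the repository for matching rows instead of downloading the whole file.
resp = requests.get(
    f"{CKAN_SITE}/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "q": "road closure", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json()["result"]["records"]:
    print(record)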
And similarly the Azure DataMarket does this as well. One of the areas where this becomes particularly
important -- in the interest of time here, so I don't eat into your break time -- is that when you're dealing
with a very large dataset, you don't want to download it. So the example I was going to show here: if
you go to the Azure DataMarket, we've taken a large amount of data out of the Microsoft Academic
Search service that we've been building up. So it's something on the order of 50
million papers, 20 million authors there, a large number of relationships. You can actually see the
individual SQL tables in the model there. And you don't want to download that entire dataset if you're
interested in dealing with a subset of it. So like the CKAN repository, the Azure DataMarket has a
RESTful API where you can go in and query that data and either bring down the subset of data that you
want to store locally or just have that be the backend database to the application or the analytics that
you want to be doing there.
So you go to the Azure DataMarket, you'll see then, just like CKAN, there's a number of different ways
that you can browse into the datasets that are available there. You can browse by domains, you can
browse by the free ones versus the ones that have a fee associated with them.
And somebody asked me yesterday: how can we take our research data and get it into the Cloud in a
way that we don't need to pay for it? One of the things that the Azure DataMarket supports now is a
cost-recovery model where you can make your data available, charging the consumers of that data only
as much as you need to cover the tenancy bills -- the hosting bills -- of putting that data into Azure. So
it's something to consider. It's not quite wide open enough for the
world to start coming and putting their data into it. So the Azure team, I think, is really focused on the
high-value datasets. And you'll see there's only a couple hundred datasets in there right now. But that's
an approach to think about for a sustainability model.
In the case of Azure, the way you get your data back out is via the OData protocol. If people aren't
familiar with it, go check out OData.org. It's now under the umbrella of OASIS, being standardized, but
it's built on top of HTTP and on top of REST, and you then have the option of getting the payload back
in JSON format or ATOM format. It's a very simple way of interacting with very large datasets. One
of the nice things about it, if you're interested in this aspect, is that it's also a read/write protocol: if you
have a dataset that you want to be writing back into, there are write operations.
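A hedged Python sketch of pulling a filtered slice over OData rather than downloading everything. The service root URL and entity set name are placeholders invented for this example; the $filter, $top, and $format query options are standard OData conventions, and the exact JSON envelope varies with the OData version.

import requests

SERVICE_ROOT = "https://example.org/odata"   # placeholder OData service root
ENTITY_SET = "Papers"                        # hypothetical entity set name

resp = requests.get(
    f"{SERVICE_ROOT}/{ENTITY_SET}",
    params={
        "$filter": "Year ge 2010",   # only bring down the slice you need
        "$top": "20",                # one page, not the whole collection
        "$format": "json",           # ATOM is the other common payload
    },
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
# Newer OData services wrap results in "value"; older ones use a "d" envelope.
for item in payload.get("value") or payload.get("d", {}).get("results", []):
    print(item)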
One of the other rules mentioned in the article by Mercè, Alyssa Goodman, and others is: publish your
code. There's a project that was just started up through a collaboration between figshare, the Mozilla
Science Lab, and GitHub which makes it really easy to plug your GitHub repository into figshare. The
general idea is that if you want to snapshot your code and get a DOI for it, so you can say this is the
instance of my software that I used to produce that dataset, you can plug it in -- I think they've got a
Firefox plug-in -- you can plug your code repository into figshare, snapshot it, it will bring all the code
over to figshare and give you back a DOI for it. So it's a really great idea.
[Indiscernible] is also involved in a workshop at Oxford a couple of weeks ago with the Software
Sustainability Institute. Is that who ran it?
>>: It went live yesterday.
Alex Wade: So there's a great blog post that he just published last night on our Connections blog.
Check that out; it talks about software as a part of scientific reproducibility.
And the last thing I'll leave you with, maybe even finish in time for the break, is this notion of moving
beyond linking to your data as the sole way of discovering research data. This afternoon somebody's
going to come over and talk about some of the tools that are being built into the current generation of
Excel. And I think we had a demo last night out in the lobby. But just sort of
thinking forward a little bit on how research data could be as pervasive as web pages on the internet.
So I didn't go to the Spitfire Grill last night with the rest of you. I had to go home and pick up my
boys. But I got home in time to see the last ten minutes, I think, of the hockey game where the Flyers
killed the New York Rangers. And instead of opening up my web browser and looking at the web of web
pages, I opened up my data browser, Excel. And I don't have any data in Excel right now, but I was
interested in learning a little bit more about the Flyers. I haven't really been following them this
season. So I clicked online search. I got a search box over here and I typed in -- can't really read that,
Philadelphia Flyers Roster. And I got a set of datasets back. So this is Bing indexing data from the
web. Most of the results here are actually tables that they scraped out of Wikipedia. But you'll see
other datasets in that set. And imagine that this was scientific data now. I had a need for some data, I
typed that need into a search box, just like you would into a web search engine and you get back a set
of data. First one says "current roster, Philadelphia Flyers." That sounds good to me, and I bring the
data into my spreadsheet. And this is where you'll start seeing some new tools this afternoon. But
there's a very simple one that Excel 2013 has, where you can just take the dataset and say: here are
some things that are roughly places in the world; cast those onto a Bing map. And within about
90 seconds I went from having an empty Excel spreadsheet to having a map of the birthplaces of all of
the Philadelphia Flyers. So I just wanted to play around with that data.
So I just want to leave you with that notion of thinking about data discovery, thinking about research
data, getting it in the hands of users for new experiments in sort of this paradigm.
I think I'll finish with that right at 10:30, and I'll leave one minute for questions. [applause]
Dennis Gannon: Questions?
Alex Wade: Question?
>>: Yeah, did I miss something? [laughter]
Dennis Gannon: I think everybody's ready for a break.
>>: [inaudible]
Alex Wade: Questioners?
>>: I have a question but I cannot discuss it over the [inaudible].
Dennis Gannon: Why don't we go ahead and take a break.
Let's thank the panel. [applause]