Dennis Gannon: Good morning. Glad to see we still have a reasonably strong number who show up at
9:00 on the second day. That's usually a good sign for a good day. As I mentioned yesterday, at any
given time at least 90 people were watching the live streaming. And I heard we actually had hundreds of
people involved. So not everybody was on at the same time, but there were many hundreds; I can't
remember the number. But actually it was quite interesting. There was a Tweet stream. You people don't
Tweet a lot. There were only three Tweets yesterday. Not a lot of people watching.
>>: [Inaudible].
Dennis Gannon: Oh, it's not on the website. It was actually -- oh, you want to Tweet? I don't want to
take away from your time here.
Once again, we have -- this is one of the sessions where we have four presentations. So this will be a
little bit shorter than others. But I want to get started first of all with Mercè Crosas from Harvard. And
this is not her title. It's my mistake. It's close. It's an approximation.
So now, let's see; we have to do something here.
Mercè Crosas: Okay. Okay, I'll be talking about a little bit more than just Dataverse and Consilience.
First I'll introduce the data science team at the Institute for Quantitative Social Science (IQSS) at
Harvard University and [indiscernible]. We just put together a website, which will go up probably at the
end of this week, to describe all the research tools that we're working on now. As you will see, some of
them we have been working on for years and some are new from the last month.
What we do with the data science group at IQSS -- we combine, well, mainly three groups. Some are
researchers that have a focus on statistical and computational methods and analytics. Then we have a
software engineering team, a professional team; many of the people there I actually brought from
industry, where I worked in the past before I joined this group at Harvard. And then we also have a
data curation and stewardship team that helps with data archiving, preserving data, and cataloging and
understanding data. And we think that all three of these aspects are essential to build a data science
team that can provide research tools for data-driven science. The team is 20 people, and you see all the
different groups, including, well, the statistical analytics, software development, the curation and
archiving, usability and UI, because we also do a lot of usability testing for our tools, and QA, which
we also find to be a critical part of our group.
So let's talk about the applications first. I'm keeping that a little bit short because it's a short
presentation and then we'll have time for discussion. So feel free to ask us any questions at the end.
So two of our main research tools or frameworks that we've been building over the last six to eight
years are Zelig and Dataverse, the Dataverse Network. Zelig is a framework -- not a framework, it's a
common interface that allows you to access a large number of statistical models [indiscernible] by
different contributors. Often the problem with [indiscernible] is that there are a lot of computational
methods that are hard to use because each one has been contributed by somebody with different
documentation and a different interface. So Zelig brings this common interface to all these models, and
it makes them easier to use and also easier to understand, because it has common documentation for all
of them. It's used by hundreds of thousands of researchers. It's been used very, very heavily in
quantitative social science research, but we're also expanding into vital statistics and other fields.
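To make the "common interface" idea concrete, here is a rough sketch in Python rather than R; the zelig() function and model names below are hypothetical stand-ins, not Zelig's actual interface, and only show how one entry point can dispatch to many differently-written estimators:

```python
# Sketch of a common interface over several statistical models.
# The zelig() function and model registry are illustrative assumptions,
# not the real Zelig (R) API.
import numpy as np
import statsmodels.api as sm

MODELS = {
    "ls": lambda y, X: sm.OLS(y, X).fit(),       # least squares
    "logit": lambda y, X: sm.Logit(y, X).fit(),  # logistic regression
}

def zelig(model, y, X):
    """Fit any registered model through one uniform call."""
    X = sm.add_constant(X)          # same preprocessing for every model
    return MODELS[model](y, X)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = X @ [1.5, -2.0] + rng.normal(size=200)
    print(zelig("ls", y, X).summary())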
Dataverse is also used widely -- we have a Dataverse [indiscernible] at Harvard that is free and open to
all researchers in the world, and it provides a data publishing framework. So a lot of researchers go
there and deposit a dataset, and it generates a citation -- a core part of Dataverse -- so that they have
a data citation that can be used to reference the data from a publication, from an article or from a book.
That way the researcher who provided the data gets credited for it; they have a way to be cited and
recognized for data that might have been hard to collect, put together, and clean up. And they keep
control of the data; they can set different permissions on how to access the data. We encourage open
access to data, but sometimes there is a need for restrictions. And the framework helps you archive it
and preserve it, and also, as you'll see, it links with Zelig to provide analysis.
More recently, in the last months, we've been working on a couple of projects. SolaFide is an interface
to the Zelig statistical models, and it also interfaces with Dataverse. Normally researchers are well
familiar with the statistical models, but any user, even without a good understanding of the statistics
behind them, can be guided in what statistical model to use if they understand the data. This is going to
be released in June; it's now in development and testing.
And DataTags is also a new application that is a large collaboration with the computer science
department at Harvard, the Berkman Center, the Harvard Law School, and the Data Privacy Lab of Latanya
Sweeney, who is also part of IQSS. It provides a framework to tag a dataset with a level of sensitivity:
based on whether it has, say, health data or student data, and on the degree of how private that data is,
it gives a different tag. And that tag is actionable -- well, it has a format so that it can be carried
with the dataset. So it's a policy for that dataset, to be able to understand how to handle it, how to
store it, whether it needs to be encrypted or double encrypted, how it can be accessed. So we try to
maximize sharing of the data, or making part of the data open, while still restricting the parts that
have sensitive data.
Two other software applications that we're working on are Consilience and RBuild. Consilience is a text
analysis application that uses about a hundred different clustering methods available in the literature,
and combines them to interactively assist you in discovering new clusterings, new ways of organizing
your documents -- not based only on the existing methods, but based on new clusterings that live in the
clustering space. I'll show a little bit more about that.
And then we're just starting to work on the RBuild application, which helps researchers who are
developing their own statistical methods to build them and bring them to CRAN, which is the repository
for R packages.
So if we put them all together, basically the research tools we're building cover the entire research
cycle, from the point of developing computational methods, to analyzing the data -- whether quantitative
datasets or unstructured text -- then publishing that data, making sure that the data are cited from
journals and from any published results, being able to also share sensitive data, and allowing others to
explore, reanalyze or validate the analysis of the datasets published in Dataverse through SolaFide.
So let's talk about the Cloud and the [indiscernible] of more interest here. First, we've been working
with Dennis and Christopher -- well, we're using Azure for Consilience and PowerForce, and for Dataverse
and SolaFide also. So, on the development of Consilience: this is part of the user interface, which has a
clustering space where you can go over that space and browse all the possible clustering solutions for
your document set. This is in development, and we'll be releasing it in June to a few researchers to give
us feedback and continue building based on that feedback. But here is how it is set up, or what the
system workflow is basically, and you'll see how this maps onto how it is set up in Azure.
You load a document set that can be from a hundred to ten thousand documents -- at some point we want to
extend to millions of documents, but it's tens of thousands for now. You run a clustering map, which
involves calculating a term-document matrix and running about a hundred clustering methods, all of them
ones that exist in the literature. And this is the part that can be done in parallel. We then project the
results from those clustering methods into a two-dimensional space, and from there we calculate new
clustering solutions over the entire space, building a large grid over the two-dimensional space of the
useful clustering solutions for that document set.
There is a paper by [indiscernible] and Justin Grimmer, who was a grad student and is now a professor at
Stanford, that explains part of the methodology here.
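As a rough illustration of that workflow (not the actual Consilience code, and on a toy corpus rather than thousands of documents), the pipeline might be sketched in Python like this:

```python
# Illustrative sketch of the Consilience-style pipeline described above:
# term-document matrix -> run several clustering methods -> project the
# clusterings into a 2-D "clustering space". The real system runs ~100
# methods in parallel on Azure; the corpus and methods here are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.manifold import MDS
import numpy as np

docs = ["budget deficit vote", "senate vote on the budget", "soccer match result",
        "championship match tonight", "deficit talks stall", "the team wins the match"]

X = TfidfVectorizer().fit_transform(docs)            # term-document matrix

# Step 1: run a battery of existing clustering methods.
labelings = [
    KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray()),
    AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X.toarray()),
]

# Step 2: measure how different the resulting clusterings are from one another
# (1 - adjusted Rand index as a rough dissimilarity between two labelings).
n = len(labelings)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = 1.0 - adjusted_rand_score(labelings[i], labelings[j])

# Step 3: project the methods into a 2-D clustering space that a user can
# browse; new clusterings would then be computed on a grid over this space.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(coords)
```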
So once we run all these, users can explore the clustering map and make discoveries. I just want to
point out that this is totally different from other clustering tools, because it doesn't give you just one
solution; it also gives you a space to explore all the possible solutions, and you choose the one that
fits best what you're looking for in your documents. And then it allows you to obtain the documents.
So it's set up using multiple nodes, and we've been testing expanding the supercomputing part using
[indiscernible]. Part of the application is written in Scala and Java, and the Scala component follows
the Akka distributed workers pattern -- I don't know how many people are familiar with it here, but there
is additional information if you follow the link. With it you can easily distribute the work: use as many
nodes as needed to run all the clustering methods, based on when the nodes are ready to do that and are
available.
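The real implementation uses Akka's distributed workers pattern in Scala; purely as an illustration of the pull model it describes -- idle workers take the next clustering job whenever they become available -- here is a much-simplified, single-machine analogue in Python:

```python
# Simplified analogue of the pull-based worker pattern (not Akka): a pool of
# workers pulls clustering jobs from a shared queue as each worker frees up.
from multiprocessing import Pool

def run_clustering_job(job):
    method_name, params = job
    # ... here a real worker would fit one of the ~100 clustering methods ...
    return (method_name, params, "labels-for-%s" % method_name)

if __name__ == "__main__":
    jobs = [("kmeans", {"k": k}) for k in range(2, 12)] + \
           [("hierarchical", {"linkage": l}) for l in ("ward", "average")]
    with Pool(processes=4) as pool:              # 4 stand-in "worker nodes"
        for result in pool.imap_unordered(run_clustering_job, jobs):
            print("finished:", result[0], result[1])
```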
It's actually -- I have to say, it's been a very nice experience. I'm not the developer doing that, but
the developers on that project, even though they didn't have a chance to do the training -- they'll be
happy to do it when it's available -- were able to catch up very quickly, set that up easily, and make it
run very well. So that's been a great experience.
The other part is Dataverse in the Cloud. We've been setting up a Dataverse instance in Azure because we
wanted to do two things with it. One is that we're testing CDL with Patricia [indiscernible], so the data
app integration with Dataverse; and the other part is that we're setting up an image for [indiscernible]
to provide an easy way of setting up a Dataverse instance in the Cloud. We're doing that for Dataverse
4.0.
That's why it's not there yet. It's going to be released at the end of June, and it comes with many big
changes; we wanted to make sure that we would provide the latest when we put this in VM Depot. It has a
new faceted search, new data file handling -- well, a lot of enhancements to the data files
[indiscernible] -- and support for more domains in terms of providing rich metadata for datasets, not
only in natural and social science but in astronomy, where we've been collaborating with the astronomy
group at Harvard for a while, and also medical science, where we're collaborating with different groups
at Harvard Medical School and hospitals and other groups worldwide to define a common set of metadata
for biomedical datasets.
Another part of Dataverse in the Cloud is SolaFide, which, as I said before, is a web application that
integrates with Zelig to be able to run the statistical models that are -- well, originally written in R
-- and apply them to datasets in Dataverse, so it integrates the two systems.
And let me see if I can run this here -- this is a very short demo movie showing how you have all the
variables for a quantitative dataset, and you can see all the summary statistics. All these are generated
automatically during the ingest, or upload, of those quantitative datasets. Then the user, who
[indiscernible] more about the dataset, can decide what the explanatory variables and the dependent
variable are, and then -- well, choose here from only a few of the models that are now available, out of
the hundreds of models we have. And once you set up the subset that you want to analyze, you can run that
model for the X and the dependent variable that you set up, and you get the results here that you can
take and print out. This is in testing and development, but it's a way to integrate sophisticated
statistical models -- not only linear regression, but quite sophisticated ones -- with datasets in
Dataverse.
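As a hedged sketch of that flow (the file name, column names, and model choice are invented for the example; the real system goes through Dataverse, Zelig and R), the user-facing steps look roughly like this in Python:

```python
# Sketch of the SolaFide-style flow: inspect summary statistics for a tabular
# dataset, pick explanatory variables and a dependent variable, run a model.
# "study_data.csv" and the column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("study_data.csv")   # e.g. a file obtained from Dataverse
print(data.describe())                  # the summary statistics shown in the UI

# The user picks "income" and "education" as explanatory variables and
# "turnout" as the dependent variable, then runs (say) least squares.
model = smf.ols("turnout ~ income + education", data=data).fit()
print(model.summary())
```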
So we have all this set up in the Cloud. There is a web app that has the main Dataverse business logic
and middle layer -- well, the UI and middle layer. Then we have a database node and storage for the data
files, and then the ingest: when quantitative datasets are uploaded to Dataverse they are processed in R,
so it has an R [indiscernible] to do that, and when they are analyzed using SolaFide and Zelig, it uses
rApache. We haven't set these up as distributed nodes yet, but the two [indiscernible] of ingest and of
exploration and analysis can be expanded to multiple nodes. The idea is to provide one image that
includes everything for quick testing -- I mean, the nice thing about Dataverse and all of Zelig is that
you can run it all on your laptop as a testing installation, and you can quickly set it up there with
[indiscernible] -- but then you can also set it up as a production solution with several nodes and be
able to expand to more nodes for the analysis. I don't know how much time I have.
But let me just quickly introduce DataTags. This is not in the Cloud yet, but we've been working on its
development. As I said, DataTags helps you to share sensitive data while following -- well, the legal
regulations that you need to follow: HIPAA, FERPA and others. I think there are a couple of thousand laws
on privacy in the United States, but you can group them all into about thirty types of regulations.
We've been working with the Data Privacy Lab and also with the Berkman Center, the Law School, and the
School of Engineering at Harvard. So this gives an idea of the different levels of sensitivity that you
would apply to a dataset, and you would apply them based on an interview that has maybe ten to fifteen
questions about your data. This interview is actually mapped to a set of rules that are set up by the
lawyers based on the regulations. So for HIPAA, we had someone HIPAA-certified setting up the rules here,
and the rules define a set of questions, and you're asked those questions. When you finish the interview
you finally get a tag for your dataset. That tag lives with your dataset and tells you how it needs to be
transferred, stored and used afterwards.
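A toy illustration of the idea, with invented questions, tag names and requirements rather than the real DataTags rule base, might look like this:

```python
# Toy rule-based tagging: interview answers -> tag level -> handling policy.
# Questions, levels, and requirements are invented for illustration only.
TAG_REQUIREMENTS = {
    "blue":    {"storage": "clear", "transfer": "clear"},
    "yellow":  {"storage": "clear", "transfer": "encrypted"},
    "red":     {"storage": "encrypted", "transfer": "encrypted"},
    "crimson": {"storage": "double-encrypted", "transfer": "encrypted, approval needed"},
}

def tag_dataset(answers):
    """Map interview answers to a tag, roughly from least to most sensitive."""
    if answers.get("identifiable_health_records"):   # e.g. HIPAA territory
        return "crimson"
    if answers.get("identifiable_student_records"):  # e.g. FERPA territory
        return "red"
    if answers.get("contains_personal_information"):
        return "yellow"
    return "blue"

interview = {"contains_personal_information": True,
             "identifiable_student_records": True,
             "identifiable_health_records": False}
tag = tag_dataset(interview)
print(tag, TAG_REQUIREMENTS[tag])    # the tag travels with the dataset
```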
So that's it and thank you everybody. [applause]
Dennis Gannon: We have time for a quick question if somebody wants to. Anybody?
Yes?
>>: [inaudible] available? These tools are available for researchers to --
Mercè Crosas: Yes, well, we have -- almost all of our research tools are Open Source. For Consilience,
we're still working out how it will be distributed. Just Zelig and Dataverse are available now, because
we've been developing them for a while, and they're available to all researchers. We have a Dataverse
repository at Harvard that is open and free to all researchers, but also many institutions around the
world have set up a Dataverse network for their research data. Zelig is also available from CRAN and
available to other researchers. All the other ones are going to be released during the summer, so
they're going to be available to researchers then.
Yes.
Dennis Gannon: And at some point there will be a version, as you say, there will be --
Mercè Crosas: Yeah, in June, yes.
>>: I was wondering whether you have any infrastructure sitting behind DataTags that identifies that you
need [indiscernible] authentication or whatever the tag says -- in the infrastructure, to help the
researcher kind of expose the data with those restrictions. Or is it just, these are the considerations
you need to --
Mercè Crosas: No, it is better than that, because that's why we are integrating it with Dataverse, so that
actually, once you deposit the data -- I mean, it could be [indiscernible] with other types of repository
infrastructures for holding data -- but once you set that tag, it goes into -- it becomes part of the
policy that the Dataverse follows, and the researcher really doesn't need to do anything. After that it's
all handled properly, because it just follows the instructions in the tag, basically. And we're reviewing
this with different groups, with [indiscernible] and other people in the computer science department at
Harvard, to see that the data -- well, that it actually follows all the security levels it needs to
follow for the regulations. So it's not complete yet.
>>: How big are the datasets that you have in Dataverse? You have said you have --
Mercè Crosas: Yes, so the datasets -- a dataset, well, has the metadata that describes the data it holds,
and it can have as many files as you want. Each file has a maximum of about 2 to 4 gigabytes. The only
thing is that now, for example, we have one new dataset that has thousands of files of MRI data, and each
file is less than a gigabyte, but altogether it is a large dataset. One of the next things -- I didn't
talk about all the future projects on the website that we're releasing this week -- one of the next
things we're looking at is to extend Dataverse to support much larger datasets more easily. Normally a
dataset is a sum of multiple files, so we're building an adaptive data storage behind Dataverse to be
able to quickly query datasets of [indiscernible] terabytes, if we get to extend it to that. We're
working with the Connectome group to do that for the imaging -- nanoscale microscopy imaging of the
brain. When you put all these together, those are terabytes. And also with an astronomy group that has
terabytes of data.
[applause]
Dennis Gannon: Let me introduce Bill Howe and let him get started. Bill, as everybody knows in the
Northwest, is one of our leading data scientists. And he will tell us about what's going on in the
Northwest.
Bill Howe: Thanks. So I want to talk about some of the activities that we're doing at the eScience
Institute, which has recently been reenergized with a grant from the Gordon and Betty Moore Foundation
and the Alfred P. Sloan Foundation, although our activities in the [indiscernible] predate this award.
This has really been able to give us the kind of scale that we've been looking for for a while. So this
is a joint partnership with NYU, Berkeley, and the University of Washington for a five-year program to
create a data science environment.
So just to give you kind of the scope of this: we're not going to give you a spiel about data science in
general, or eScience in general, because I think the audience is cohesive enough in this area that we've
all seen these pitches before.
But just to give you a sense: you know, we put out a call for posters for our rollout event and we had
137 from departments all around campus. And some of the areas, such as political science, actually
produced some of the most interesting posters that we found. So there's really a pervasive need that we
all sort of recognized, but it was nice to see it in practice. So there's a pretty substantial
groundswell of interest in these topics, in eScience, and in the context of this award we're trying to
be the umbrella organization to manage it.
And just to throw these covers up that we've all sort of seen, this is a fairly old slide here, but what I
like is Roger was here yesterday and of course here at Microsoft, and I use this quote from him, it's a
great time to be a data geek. But the ugly side of this is this other quote I like to give a lot is this, and
I'll let you read it for a second. [laughter] This is perhaps the ugly side of data science. And what
we're trying to do, and I think most people in the room here are trying to do, is to take this term data
science and some of the technology and techniques and skills around it and really kind of focus it on
astronomy and oceanography, the physical, life, and social sciences, as opposed to only industry needs.
That being said, we really see a lot of interplay between the two sides of it, in that
industry can invest deeply in information technology that science sometimes can't; so we can borrow
tools and techniques. And going the other way we see there's kind of a culture of statistical rigor
coming out of sciences that I think businesses are only sort of just starting to acquire. So the line
between them is blurring. And we like to sort of take advantage of that when we can.
Okay. So the way we set this up is to establish a virtuous cycle where, you know, new advances in data
science methodologies and techniques and technologies enable new discoveries, and new discoveries in
turn spur requirements back in the methodology fields. And to mediate this, we have working groups
across six areas, and in this talk I just want to focus on three
this, we have these working groups across these six areas, and in this talk I just want to focus on three
of them and give you some highlights from education, software tools, and this idea of working spaces
in culture where we have this data science studio that we're building up that I'll end on. And the other
one just so you see them -- do I have a pointer here somewhere? No, I don't do that. Maybe I don't
have a pointer. No, there's no pointer on here. That's why it doesn't work. That's too complicated. Oh,
here it is. It's quite nice. Oh, gosh, yeah. Burn somebody with this.
So reproducibility I'll point out. The connection between reproducibility and the Cloud is great, and I
have some slides on that but I'm not going to show them. I'll be happy to talk to you offline. We have
this kind of meta analysis of our own success that we bring up as a first-class citizen. And career paths
is a big deal too that I'll be happy to talk to you about offline about how we're trying to change the
kinds of people we attract and try to keep them around focused on science.
So these three I want to focus on in this talk. So in education we have kind of a broad portfolio of
programs in data science. There's a certificate program that Roger mentioned yesterday very briefly.
He's been involved in helping to design it through the University of Washington educational outreach
group. There's a massively open online course that I ran last spring that I'm going to run again this
summer, I hope, that I'll talk about in the next few slides.
There's an NSF grant under the IGERT program, if you're familiar with this -- Integrative Graduate
Education and Research Traineeship -- and this is focused on eventually creating a Ph.D. program in big
data, but through the stepping-stone of a Ph.D. track.
We have new computer science courses focused on big data and data science, and all sorts of boot camps
and workshops; we hosted the Azure two-day workshop last summer, for example. There is also a CS course
where we're trying to open up a different path into computer science topics for people from all majors.
Rather than just focusing on games and puzzles, we focus on data analysis as a first-class activity,
which we think has the capacity to open things up to larger cohorts that aren't typically interested in
programming and computer science but would be if we focus on data. We're getting started on a data
science master's degree.
And I'll end with this. We have this incubator program that we run out of the studio that's less about a
formal education but kind of a hands-on training flavor.
So a few slides about this MOOC. We ran this through the Coursera platform. And the challenge here,
one of the things I was interested in, was seeing if we could cherry pick topics from introductory
classes that are typically separate; statistics, machine learning, databases, and visualization, and
combine them together in one course, which is, you know, maybe a bit of a magic trick. You run the
risk of being so superficial that nobody learns anything, or sort of having the intersection of
prerequisites be so small that there's only about three people in the world that would actually benefit
from it.
So I'm reasonably happy with the outcome. The idea was to be broad, sort of superficial, but have deep
dives in particular topics that we think are key ideas that everyone needs to learn. This was the
syllabus. We sort of focused on what is data science, given that the term is sort of new and people have
different interpretations of it. We talked about databases and MapReduce and NoSQL, and all of this
under this umbrella of data manipulation at scale; statistics and machine learning, where we really can't
give a rigorous introduction to the mathematical concepts in a couple of weeks, but what we can do is sort
of teach key algorithms that everyone should be familiar with and key concepts that -- I tried to pick
ones that aren't typically taught in a Statistics 101 course. I didn't want to do kind of a compressed
Statistics 101 because, A, I'm not the right guy to teach that, and, B, it would be somewhat boring.
Visualization. And then this week we might swap out with other kinds of special topics. But the first
time we did it, we did graph analytics that seemed to be popular and we had guest lecturers from
various companies.
So the participation numbers are fairly fun to give, but some of the bigger ones are kind of
meaningless, and I'll explain why. So the number of people registered is a ridiculously large number,
120,000, but the bar to register for a course is essentially zero, right. You log into Coursera and you
click one button and that's it, right. There's obviously no money being paid, there's not even any kind
of sign-up process or anything. So this includes people that had no intention of going through the
course; they were just exploring things, who were going to watch one video and pick what they want to
take and so on.
So a slightly more realistic number is people who clicked play on at least one video in the first two
weeks, which is still quite large, but this is people that are still kind of feeling things out.
Now we're getting down to the people that are actually taking the course. 10,000 people actually
turned in the first homework. And after that the attrition, I'm reasonably happy with. 10,000 turned in
the first homework and about 9,000 completed all the homework, which is pretty good. This is
partially gamed by me, because the first homework was a bit of a doozy. So we sort of, you know -- if
people got through that, they had a pretty high likelihood of being able to get through the rest of it.
And this overall attrition rate for a MOOC is pretty standard, because the bar is so low. So a lot of
times when people are trying to bash these MOOCs, you know, they like to point out the fact that other
[indiscernible] online courses in the past have had a much better success rate of keeping people around.
But never before has the bar been so low to sign up for these. So it's not a very convincing argument to
show this and argue that the MOOCs are unsuccessful. Although I'm not necessarily a MOOC
cheerleader. This is wholly an experiment on our part. We're not totally sure where this whole thing is
going.
Okay. And the number of people that passed was about 7,000. Other kinds of numbers I thought were
interesting: the discussion forum was a really critical piece of this. The students who were paying
attention and were on top of things could answer questions faster than we and the TAs could, and that's
really the only reason this works at this kind of scale: you do have this kind of cross-talk in
the forums. So there's quite a bit of activity there. And this is a dataset we can actually do some analysis
on. It's something we're looking into now to detect when people are learning and how people are
learning.
So all this is pretty consistent across, quote, hard courses that have some kind of a technical or
programming component to them.
Just to illustrate the idea of attrition, this is videos watched: a lot of people watched the first one
and then it went down. And again this is pretty standard. This is the same thing with the assignments,
which are a little flatter, which is nice. And this is one assignment and two assignments with different
parts. You can see that as the parts get harder, there's some dropoff in this sort of Twitter
sentiment analysis assignment, and the same thing with the databases.
We had coverage from all over the world. This is actually not a map that I drew or a survey that I did.
This is a student who posted on the forum: hey, where are you from? Mark it on the map. And I
stole that and used it in talks because it was sort of nice; I didn't actually have to do any of this work.
There's some concentration in the U.S., but it seems to be pretty much spread all over.
This was a little bit disappointing. So one of the stories motivating MOOCs is that there are people
who don't traditionally have access to higher education materials who are signing up for these things.
And some of the evidence coming out, which has been pointed out before, is that it's not really clear
that that's who's taking these things. A lot of times it's people who already have degrees who are
professionals and working and trying to augment their skill sets as opposed to people who lack access
to higher education materials. And that sort of came out here too: the biggest number here is -- we
attracted a lot of professional software engineers as opposed to undergrads in developing nations
or something.
>>: But don't you agree those people also have a lack of access to the same materials? It seems like it
does fit your --
Bill Howe: Possibly, but I think they're probably doing pretty well. And they're trying to -- they're
interested in this -- in rounding out their skill sets as opposed to -- I mean basically they have a degree
and they're successful and they have a job and so on. And sometimes there's a disconnect between the
reality and how the advantages of the MOOCs are pitched, which is, oh, we're giving opportunities to
people that don't necessarily have the degree or something. It's not all bad. This isn't terrible. It's just
that this is not quite the same story that you hear from the purveyors of the MOOCs. But this is pretty
standard. A lot of people pointed this out before that a lot of these courses are being taken by
professionals.
>>: I would agree with Chris, I think this is good news.
Bill Howe: Okay. Well, good. This is great news.
>>: [inaudible] as, you know, being someone that fits that category, I didn't sign up for your class, I'm
sorry, but being somebody that fits that category, I've taken MOOCs for that very reason, because I
don't have the ability to get to a university to take a class. It means I do not have availability or access
to it. Just because I'm working doesn't mean I don't have availability.
>>: But this room is full of people who have been retrained several times.
Bill Howe: That's true. That's true.
>>: Okay. So if I'm a barista, what category am I going to land in there?
Bill Howe: I don't know. There might be --
>>: You don't seem to have something like working stiff. [laughter]
>>: I mean everybody's a professional --
Bill Howe: Working professional, non-tech, is what I tried to do to capture that.
>>: Okay. So let's say that there's 1760 people that are not in tech, some of those people might be
making minimum wage. You don't know.
Bill Howe: Yeah. Yeah. Yeah. Absolutely.
>>: You give them an option to say minimum wage employee.
Bill Howe: Yeah, if we redo the survey, there's lots of ways to improve the survey. That's not the most --
>>: Yeah, I guess --
Bill Howe: That's not the biggest thing I regret about the way I designed the survey, actually.
>>: Yeah, I'm just thinking, yeah, you really can't tell who you're serving in terms of -- let's say,
underrepresented groups or people who are trying to climb the socioeconomic ladder.
Bill Howe: Absolutely. And I was a little bit nervous about doing too much demography in the survey.
Because I'm not totally sure about the legality of some of these things. It's a little bit fuzzy. Because in some
sense, I'm not officially representing the University of Washington in these capacities because
otherwise we're into FERPA requirements. However, it's not totally obvious that there wouldn't be
some flack coming back from the University of Washington if we break some rules.
But anyway, I would like a much richer dataset of who we're helping. And there might be a way
around that. But I was being fairly conservative in this case to make sure I wasn't running afoul of
anybody or making anybody feel uncomfortable trying to answer a question and then getting a bunch
of flack. At this scale everything becomes a problem: if somebody can complain, you will get emails.
But you're right; bringing in somebody who is better at designing these instruments would
be smart. This was sort of, whatever, flying the plane as it's being built, whatever the metaphor is.
>>: I don't mean to quibble with your main query, which is basically half of them are already software
engineers.
Bill Howe: Which wasn't necessarily what I expected, I guess.
And the other thing about this scale is you get these great little anecdotes. If you have 10,000 people,
the odds of something awesome happening are pretty high [laughter] -- though of course I get to cherry
pick the examples. But there's actually somebody who ran a company -- it turned out he was a grad and I
didn't even know that at the time -- who was working on assignments over on his honeymoon, and he
ended up selling his company and decided to take a job, as a data scientist, with one of the companies
that gave a guest lecture in my course, and sort of completely changed his career, and other things like
this. So it's sort of fun to do this, because you do have the impact at this scale.
But okay, so I wanted to mention --
>>: [inaudible] my wife's amusement [laughter] my wife's annoyance.
Bill Howe: I was wondering if you paused over the word amusement. Like was that for my benefit or
was that the reality.
Okay. So real quick I want to mention this. That's one activity that we're happy with and we're going
to keep running it. This is Magda, who is the lead PI on the grant -- and it's a broad partnership, but
she's really been the prime mover through this. If you know Magda Balazinska, she's database
faculty at the University of Washington. And the idea here is we stole this from Alex, who
claims that he stole it from somebody else, but the provenance is unclear. It's this idea of pi-shaped
people -- people have heard this analogy before? Anybody know what I'm talking about? So if you're
a T-shaped person, you have kind of the thin veneer of interdisciplinarity that we all have and kind of a
deep leg in one area. And a Greek letter pi would have, you know, two deep legs, right. Maybe you're
sort of machine learning and biology, which is some [indiscernible] people or something like that. And
we see the creation and support and incentivizing of, you know, career paths for these pi-shaped
people as kind of one part of our mission. We want to create these people and reward them when they
do this, as opposed to perhaps what the status quo is that if you end up -- if you're a biologist and you
get a little too much into computer science, well, now you're neither fish nor fowl, right; you can't get a
job in a biology department because you do too much of that programming stuff that we don't like and
you're not going to be a computer scientist because you didn't write enough papers in SIGMOD or
something like that. And yet these people are, in some contexts, kind of driving science forward. So
we're interested in supporting that. So that's the cute analogy here.
The reality is the NSF program -- IGERT is a broad program; they do it every year. The particular
year we had the award, they were particularly looking for this data-enabled science and engineering, kind
of data science, track. So all the proposals that came in were about big data Ph.D. programs of
various forms, and ours was one of the ones that was funded. The key tenet here is that we try to
have students in domain science and students in methodology sciences working together as a cohort,
whereby the last chapter of the thesis of a Ph.D. student in computer science would be about how their
techniques were actually applied in domain science and the actual science that came out of it. And the
last chapter in the oceanography student's thesis would be about the new tools and the
software they developed.
And part of this is not just convincing the student to write that chapter, but also convincing the thesis
committees to value that work in both directions. So we want them to sort of sit together, work
together, come up with projects that sort of go together. And I won't go into the details here but I'm
happy to talk to you about it offline.
And then actual cyberinfrastructure development -- where we're deploying this stuff and seeing it used --
is another key theme here: not just throwing out proofs of concept, but actually getting real,
sustainable tools built. And we have participating organizations -- departments across
campus are participating in this: astronomy and genome sciences as well as computer science and
statistics, for example. All right. So that's a couple of points about education.
In the software area -- I'll try to go faster here; I think I'm going to run a little long. We see that
data science can be thought of as this three-step workflow. And this isn't our idea; you see this sort of
all over the web: there's a first step of preparing data, a second step of actually running
the model, and a third step of interpreting and communicating the results. And what we hear from
people, and we agree, and we heard a little bit of this yesterday from various people, is that
this first step is really the hard one. This is 80 percent of the work, so to speak. And then the joke is
that this is the other 80 percent of the work.
But this No. 2 is not really what's keeping people up at night. It's not really about selecting the
methods. And there's a couple of reasons for this. One is that simple methods actually go a really long
way. Going from we have no predictive analysis capability to we have some predictive analysis
capability is a big step. Getting a two percent improvement by using someone's sophisticated
state-of-the-art method is not really the difficult part. Also, simple methods are the ones that actually
scale to large data, typically, as opposed to the really intricate ones, and so on.
The bad news here is that this step, I claim, is getting the least research attention actually. This one is
very fun and easy to sort of -- well, not easy, it's very fun to find -- come up with new algorithms and
get them published, while this is kind of seen as, oh, this is some stuff you have to go through, some
pain. We'll hire a post-doc who will write a bunch of code and they'll handle that part, and then we'll
get to the cool part, which is the machine learning to cure cancer.
And then similarly there needs to be a little bit more emphasis on this as well. This is visualization.
You know, it can't be sort of a black box decision. You can't spit out a number and expect stakeholders
to make a decision.
Dennis is standing up, so I'm definitely running long here.
You know, if you're going to sort of build software around this stuff, and you're going to focus on this
step one, you have to answer the question of what are the abstractions of data science. And the fact
that we're using terms like data jujitsu and data wrangling and data munging, and this is what you see
in sort of on the web and so forth, suggests to me that we don't really have any idea what we're talking
about; right? This is bad.
But the good news is that I think we do sort of know the abstractions of data science; they just
haven't always been recognized as such. Some of the candidates are, you know, matrices and linear
algebra, relations and relational algebra, maybe everything-is-objects-and-methods like in Java or C#, or
files and scripts as in computational science. I would claim that only the first two are really viable as
potential fundamental abstractions to manage data science workflows. And I would further argue
that relations and relational algebra are under-recognized as a fundamental concept here. So some of our
software here is really trying to jail-break relational algebra out of databases
and get it used in more of a data science analytics sort of context.
And SQLShare you heard mentioned yesterday. We tried to simplify databases down to just this simple
three-step workflow: you upload data and don't worry about a schema; you write your queries; and you
share the results with people, and that's it. And you sort of organize it in a web-based interface
where you can write queries and share them and so on.
And we've had a lot of uptake in science for this where previously they never would have used
databases. We've got people that don't write any program code in any language whatsoever -- not R, not
Python or anything -- but they write these sort of 40-line hairy SQL queries that do, like, interval
analysis on genomic data or whatever. So this is a pretty good sign that maybe there's something here --
you know, this whole declarative language thing actually has some weight behind it.
We see R scripts being translated into SQL queries, and they sort of ostensibly don't get more
complicated when you do that. They sort of have the same expressive power. And yet the one on the
right scales sort of infinitely. We know how to scale the SQL program up; we don't know how to scale the
R program up -- "we" being everybody. You know, Stephen Roberts gave a talk yesterday: Galaxy
workflows that have this kind of complicated multi-step thing going on get kind
of compressed down to a handful of queries, and so on.
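To illustrate the point about declarative queries with a small, self-contained example (the table and columns are invented), the kind of per-group aggregation one might write as an explicit loop in a script can be stated as a single SQL query, and that query form is what a parallel engine knows how to scale:

```python
# Tiny illustration of "declarative instead of a loop": the query states the
# result; the engine decides how to compute (and, in a parallel system, scale) it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, depth REAL, oxygen REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("A", 5.0, 8.1), ("A", 10.0, 7.4), ("B", 5.0, 6.9), ("B", 10.0, 6.2),
])

# Instead of looping over rows and accumulating per-site statistics by hand:
query = """
    SELECT site, COUNT(*) AS n, AVG(oxygen) AS mean_oxygen
    FROM readings
    WHERE depth <= 10.0
    GROUP BY site
    ORDER BY site
"""
for row in conn.execute(query):
    print(row)
```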
We're teaching people SQL in workshops and so on. But the problem here, and I'll wrap up with this, is
that that's just SQL, that's just relational algebra. There's a certain class of analytics tasks
that you can't do in SQL. And we're trying to see if we can take what we like about relational algebra
and add just enough richness to capture a lot more of those steps. So instead of query as a service we
think about this as kind of relational analytics as a service.
There's a big team, a bunch of students working on this. I'll skip the architecture. We have a little
relational database on each node and it's a big parallel sort of engine. The other weakness of SQLShare,
perhaps, is that it wasn't really focused on hundreds of terabytes, and Myria is. And the two big things
about Myria are iterative queries -- so you can do loops, which means you can express things that
converge, right; you can express analytics tasks and clustering tasks and so on -- and then it also
scales a lot. It scales as well as we know how to at the state of the art.
So we have this notion of relational algorithmics. We want you to do algorithm design, but you're using
relational algebra right there in your program, kind of like LINQ in C#. And this is k-means in our
little language, where everything here is relational algebra, but it's expressing k-means. I'll skip
this, I guess.
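This is not Myria's actual language, but the same idea can be sketched in Python/pandas: every step of k-means is expressed with relational-style operators (a cross join, a per-group argmin, a group-by aggregate), and the surrounding loop supplies the iteration that one-shot SQL lacks:

```python
# k-means where each step is a relational-style operation; the loop provides
# the convergence/iteration that Myria's iterative queries make first-class.
import pandas as pd

points = pd.DataFrame({"pid": range(6),
                       "x": [1.0, 1.2, 0.8, 8.0, 8.3, 7.9],
                       "y": [1.0, 0.9, 1.1, 8.0, 7.8, 8.2]})
centers = pd.DataFrame({"cid": [0, 1], "cx": [0.0, 10.0], "cy": [0.0, 10.0]})

for _ in range(10):                                   # the iterative "query"
    pairs = points.merge(centers, how="cross")        # points x centers
    pairs["dist"] = (pairs.x - pairs.cx) ** 2 + (pairs.y - pairs.cy) ** 2
    assign = pairs.loc[pairs.groupby("pid")["dist"].idxmin(),
                       ["pid", "x", "y", "cid"]]      # argmin per point
    centers = (assign.groupby("cid", as_index=False)
                     .agg(cx=("x", "mean"), cy=("y", "mean")))  # new centers

print(centers)
```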
This is manually optimizing a program: you go from a simple relational algebra version of it to an
increasingly complex one where it gets faster, but there's no hope that the database optimizer is going to
magically produce this program from the original one. So you do need to have the programmer
working in this space. So it's not just query optimization, it's actually algorithm design, and yet every
single line is relational algebra. And we want query debugging and visual analytics of the running of
these things to be kind of a first-class citizen as opposed to something only DBAs do. We want the
actual analysts and the scientists to be doing this stuff. So we have some visualization work in this
space.
And then let me just wrap up with this. We're conscious of the need for physical proximity -- or we
believe, our hypothesis is, that physical proximity between the computer scientists and statisticians and
methodology people and the domain scientists is sort of important. So we're trying to set up a data
science studio here on campus where they can sit and work together, and we run programs through
this. And one of the programs we're running is this incubation program where people submit one-page
proposals. There's no money exchanged. They get to come and sit elbow to elbow with our staff, our
data science staff, and thereby hopefully turn sort of a two-week problem of trying to figure out how to
use GitHub and Hadoop into kind of a two-hour problem, because we have expertise right there in the
room to do it, and really focus on these sort of short-term, small, high-payoff kinds of projects as
opposed to the big sprawling two-year "let's write an NSF grant together" kinds of things. And this
seems to be working pretty well.
We're doing a pilot this quarter and we have projects spanning the social sciences. We have one here
where there's some manifold learning -- you know, finding these embedded manifolds, but they need to
scale the method up. This is some work by Ian Kelly, right here in the room, who's looking at cell phone
call logs and correlating them with events such as bomb blasts. And we're doing a systematic analysis of
different kinds of tools for this job, you know, on the Cloud and off the Cloud, and so on.
So this has been pretty successful.
And then if you want to get involved, there's various ways. You can take the MOOC, you can talk
about these incubator projects. We'd love to have people come by to work on either side of the desk:
either to come and work on their own project, or to help other people with their projects. We have a
notion of reverse internships where we hope to attract people from industry to come and physically
hang out with us for a quarter. And also we're pretty committed to coming up with reasonable materials
around all these programs so that if other people at other institutions want to try these programs out,
like run their own incubator, they can borrow from us, if that's somehow useful.
And I'll stop there since I'm over, I think, by a lot.
Dennis Gannon: So while Gabriel is setting up, we should -- we can take a few questions.
Yes?
>>: Kind of related, in terms of the education piece, I guess. So I think it's fair to say that in
academia you don't get promoted for writing software. So in this data science environment, are you
trying to tackle that, or are you actually creating a problem?
Bill Howe: No, that's exactly right. I didn't talk about it, but one of the working groups is Career
Paths and it's exactly that. One of the observations is that we're producing people with
Ph.D.s who, in some cases, are not necessarily motivated by the traditional academic
career path. We lose 100 percent of these people to industry, to a first approximation, right. There's
really no appropriate job for them on campus. Sometimes there's a staff programmer job where you're
beholden to some particular investigator in some particular department, and you live or die by their
interests. But that's not a very competitive, attractive job compared to what they can get in industry,
even ignoring salary, which we can't compete on.
So we're trying to say: what if we came up with this data science studio where they had some autonomy,
they had some prestige, they have some control over what they work on -- we probably still
can't compete on salary, but we can pay a decent salary. And we hope to attract these people back
through that and actually give them a career path where they can grow and get promoted in various
ways for doing this kind of work. And we'll see if it -- you know, ask me in five years whether it
worked or not, but we have some early success in that we have people who have been thinking about
what to do next, deciding to come and work with us, who are incredibly good. Dan Halperin from
UW CSE, who has a Ph.D. in computer science, got excited about this new science stuff and came back
to work with us. Jake Vanderplas, a Ph.D. in astronomy who's a machine learning expert, decided to
come back and work with us. So we have some success stories already.
Dennis Gannon: Thanks a lot, Bill. This is really terrific stuff. I'm delighted. [applause] You had
enough good stuff in there to come back again and give a two-hour talk.
All right; Gabriel has been working on a project that we have been helping with at Inria for the last
three years and he's going to give us a quick update on it.
Gabriel Antoniu: Right. Thank you very much. First of all I'd like to thank you for the excellent
opportunity to be here. So I greatly appreciate that. It's very exciting. It's a very exciting workshop.
So the projects I'm going to talk about are about data-intensive computing on clouds -- obviously on
Azure clouds -- for science applications.
So these are projects that have been carried out in the framework of the Microsoft Research-Inria Joint
Centre, which has a very nice location in Paris, but it's actually not a centralized center; it involves
teams from several research centers in Inria. So Inria is the main research institute in computer science
in France. It has eight research centers. And in these projects we have teams involved from three
centers. I'm personally in charge of one of the research groups, which is called KerData, focusing on
scalable data storage and processing for large-scale distributed infrastructures.
So the first project I'm going to talk about is called A-Brain. It's basically the first Cloud-oriented
project within this joint Inria-Microsoft Research Center. It was started as part of the Cloud
Research Engagement Initiative directed by Dennis, so it was one of the first French projects in this
framework.
At the infrastructure level, one of the goals was to assess the benefits -- the potential benefits -- of
the Azure infrastructure for science projects. In this particular project, we focused on large-scale
joint genetics and imaging analysis. Two Inria teams were involved here: my team, focusing on storage and
data-related issues, and another team called Parietal, located in [indiscernible] closer to the joint lab,
focusing on neuroimaging. And of course we were in close connection with Microsoft, with the Azure
teams, and also with people from ATL Europe.
So in this project we have basically three kinds of data: genetics data, the SNPs -- those were presented
yesterday, so this term has already been introduced. We also have lots of neuroimaging data, so
basically brain images, and behavioral data. That's smaller data that basically corresponds to the
presence of some brain diseases, for instance. And the goal here is to find correlations between
neuroimaging data and genetic data.
Previous studies done by experts of the area established some links between the genetics data and the
behavioral data; that is, for some specific genetic configurations there are higher risks of developing
some brain disease, for instance. And the idea here was to find correlations between the genetics data
and the neuroimaging data and to use the brain images to predict these kinds of risks for the presence of
a disease even before the symptoms appear. So this is, let's say, the science challenge.
So now concerning the data science challenge: there are millions of variables that represent the
neuroimaging data -- the voxels that make up the images -- and there are millions of variables in the
genetics data too -- the SNPs -- and there can be hundreds or thousands of subjects to be studied. So
there are a lot of potential correlations to be checked between each individual brain image and all
pieces of genetic data.
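As a toy version of that computational pattern (tiny random stand-in data; the real problem has millions of variables on each side, which is what makes a MapReduce decomposition worthwhile):

```python
# Correlate every imaging variable (voxel) with every genetic variable (SNP)
# across subjects. Sizes and data here are invented; each map task would
# handle a block of this correlation matrix in the distributed setting.
import numpy as np

n_subjects, n_voxels, n_snps = 100, 50, 80
rng = np.random.default_rng(0)
voxels = rng.normal(size=(n_subjects, n_voxels))
snps = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)  # allele counts

# Standardize columns; then all pairwise correlations come from one matrix product.
vz = (voxels - voxels.mean(0)) / voxels.std(0)
sz = (snps - snps.mean(0)) / snps.std(0)
corr = vz.T @ sz / n_subjects            # shape (n_voxels, n_snps)

# A reduce step might keep, say, the strongest association per voxel.
best_snp = np.abs(corr).argmax(axis=1)
print(corr.shape, best_snp[:10])
```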
So in order to do this, we started in a quite, let's say, simple way. We thought it would be interesting to
assess a MapReduce processing approach. Now, remember, at this time there was no Hadoop on Azure,
so we somehow started from scratch with the basic blob storage abstractions and the web roles and the
worker roles. So we basically started to build everything up from that level.
So there were several challenges for us at the infrastructure level. One of them was related to high
latency when the VMs accessed the public storage. The simplest solution was to use those
roles, deploy them, and then use public storage and have the VMs access the public storage.
Well, when you have highly concurrent accesses to shared data, that doesn't work very well. So we had
to try to do something better in order to enable efficient MapReduce processing. We had some previous
experience with Hadoop and we realized that it was not really the best we could get, but we said, okay,
let's try with that first; and later on, for instance, we had a challenge related to the optimization of
the reduce operation, because the basic way we applied standard MapReduce to this application was not
optimal.
So what did we achieve on the infrastructure side? Well, we tried to address all these goals. For the
first challenge, instead of using public storage -- Azure Blobs, for instance -- we said, okay, maybe we
should better leverage locality. For every virtual machine, there is a way to use -- to leverage -- the
virtual disk attached to it. So basically that's what we did: we gathered together the virtual disks of
the virtual machines into a uniform storage space and we built a software infrastructure for that. So
now the VMs had a kind of distributed file system deployed within the VMs.
Now, we wanted a system that is optimized for concurrent accesses, for heavy concurrent accesses to data.
In order to do that, we didn't just take some existing [indiscernible] system. We worked on our own, and
we arrived at versioning techniques to support high throughput under heavy concurrency. And in order to
do that, we had done some previous work in the team on a software called BlobSeer, which was designed
specifically for that purpose. I will not have the time to develop that; I will just say that it
basically combines a series of techniques like distributed metadata and lock-free concurrent writes,
thanks to versioning, in order to allow concurrent writes and concurrent reads to shared pieces of
information. And that's what we tried to push on Azure, or build the Azure version of, for TomusBlobs.
So this is the final result, let's say. I'm skipping forward towards the end of the project, say three
years later. I didn't mention that the project actually finished a few months ago.
And what we could achieve is actually shown here: we could reduce the execution time of the
[indiscernible] application, which is the application that we were talking about in these projects for
doing the joint genetics and neuroimaging analysis. We could reduce the time from this level here, which
corresponds to what would happen if we used just Azure Blobs, to this level here, which corresponds to
the usage of TomusBlobs. So that's a gain of around 45 percent in time.
We also were able to run this across several data centers up to 1,000 cores. There is a demo available
on that. So in order to reach this goal we had to go step by step. And there were some smaller
challenges, some sub-challenges we had to address. For instance, first of all we implemented the naive
MapReduce and then realized, while doing the performance measurements, that the reduce phase was
really not optimal. Actually there was a bottleneck at the level of the reduce phase, because we needed a
single result and we needed to gather the results from all the mappers to produce something that makes
sense. So we basically worked on a refined version of MapReduce, which is called
MapIterativeReduce, which allows basically the reduce phase to happen in several stages with
intermediate data, and it works as long as you have associative operations. So we could further
improve the execution time this way. So in this picture here you have the time again corresponding to
the usage of AzureBlobs in blue. And then that was in red is what happens when you use TomusBlobs
so we leverage locality. And what we gain further, when we use this MapIterativeReduce approach. So
it was worth doing it.
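A minimal Python sketch of that multi-stage reduce idea. It is not the project's MapIterativeReduce implementation; it just shows why the scheme only matches a single global reduce when the combining operation is associative (summation here).

from functools import reduce

def iterative_reduce(combine, partials, fan_in=4):
    """Reduce the partial results in rounds of at most `fan_in` inputs each,
    instead of sending every map output to a single reducer."""
    while len(partials) > 1:
        partials = [
            reduce(combine, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]

map_outputs = list(range(1, 101))    # stand-in for per-mapper results
total = iterative_reduce(lambda a, b: a + b, map_outputs)
assert total == sum(range(1, 101))   # same answer as one global reduce
print(total)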
We had the time to go further. So the next point was to go beyond the frontier of a single datacenter.
There is a limit on the number of VMs that you can deploy in a single datacenter -- a somewhat
arbitrary limit. It is, anyway, interesting to see what happens if you want to deploy a large number of
virtual machines, for instance by leveraging the resources available in different data centers.
So how to enable scalable MapReduce processing across several datacenters. That was one of the
challenges. So of course this corresponds to a [indiscernible] genetic scenario. It's not specific to our
applications.
Now, the big problem in such a configuration is the latency that you have between the datacenters. So
the goal is to minimize the transfers across the datacenters. And we came up with a solution for that.
It's not necessarily the optimal solution; it's something that we started to do and that we will continue
to optimize further. Basically we used TomusBlobs within each datacenter and public storage for data
management across datacenters. And this is how we could actually obtain those measurements I
showed before on 1,000 cores, because we needed three datacenters for that.
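A rough Python sketch of that two-level policy, just to make the idea concrete. The class names and dict-backed stores are stand-ins invented here, not the project's actual interfaces; the point is only that intra-datacenter exchanges go through the local, TomusBlobs-like store, while only data that must cross datacenters goes through public storage.

class DictStore(dict):
    """Trivial stand-in for a real storage service."""
    def put(self, key, data): self[key] = data
    def get(self, key): return self[key]

class TwoLevelDataManager:
    def __init__(self, local_store, public_store, my_datacenter):
        self.local = local_store      # TomusBlobs-like store inside this DC
        self.public = public_store    # public storage reachable from all DCs
        self.dc = my_datacenter

    def put(self, key, data, consumer_datacenter):
        if consumer_datacenter == self.dc:
            self.local.put(key, data)    # cheap, low-latency local path
        else:
            self.public.put(key, data)   # only cross-DC data pays the latency

    def get(self, key, producer_datacenter):
        store = self.local if producer_datacenter == self.dc else self.public
        return store.get(key)

mgr = TwoLevelDataManager(DictStore(), DictStore(), my_datacenter="west-eu")
mgr.put("partial-42", b"...", consumer_datacenter="west-eu")     # stays local
mgr.put("final-result", b"...", consumer_datacenter="north-eu")  # goes public
print(mgr.get("partial-42", producer_datacenter="west-eu"))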
Okay. So now you may wonder what this was useful for. What did the application team do with it? We
worked closely, as I said, with another team focusing on neuroimaging and genetics. Basically they
used the infrastructure to run their algorithms and improve them, and they came up with a new
algorithm for brain imaging genetics studies. Basically it's an algorithm that helps in detecting
correlations between areas of the brain and genetics data. This approach is called RPBI. I'm not an
expert on the [indiscernible], so I'm not going to focus on that. But what we see in this picture is a
comparison between the results obtained with this new approach and with the state-of-the-art approach:
you see better highlighted areas here, which are more colorful than what you see here. So it's a quality
improvement in this detection process. That was nice; it was obtained on the Azure Cloud, and we're
very happy about it.
But one thing you might wonder is: what nice science result did it obtain? Did this lead to any science
discovery? Actually it did. We had time after this study to do something else. Rather than just
assessing the impact of one specific SNP on the brain, the idea was to study the joint effect of all SNPs
on the brain variables. And this comes to the notion of heritability. A method was developed for that
by our partner team, and in the end we could experiment with it on real data from a reference dataset,
which is called Imager. It comes from an [indiscernible] project; it's real data obtained from very high
resolution images. The result that was obtained basically establishes the heritability of some specific
brain activation signals -- what I would call the stop phase of the brain activation signal. These signals
were shown to be more heritable than chance and other regions. So it was an interesting result that we
submitted to the Frontiers in Neuroinformatics journal, and it was accepted a few weeks ago. We are
quite happy about this. We think it's quite a representative result for data science. It's probably the
kind of thing that we want to show: infrastructure teams working with science teams that don't know
how to use the parallel infrastructure, [indiscernible] infrastructure, and we succeeded in doing that. So
we are quite happy about these results.
So the application team learned how to use the Cloud and learned that there could be advantages to not
having to manage a cluster and [indiscernible], and to just rent resources. On our side, as an
infrastructure team, it was interesting to do large-scale experiments on thousands of nodes with
multi-site processing, so we could experiment with our BlobSeer software on this new infrastructure
and refine everything. We could run experiments for up to two weeks. So it was quite interesting --
more than $300,000 worth of computation used, for instance.
If we want to sum up this project: on the infrastructure side, we achieved the results I mentioned -- we
could scale up to 1,000 cores and cut the execution time by up to 20 percent -- sorry, 50 percent. And
on the science side we established this heritability result.
Okay, plenty of publications; I'm going to skip through them. These are the other people involved, and
especially [indiscernible] here, [indiscernible], the main person involved in this project, through whom
all this technical work was achieved.
So the question is: what's next? Once this was done, a few months ago as I said, we started to think
about something else, going beyond what we achieved -- for instance, handling not only MapReduce
workflows but any kind of workflow. We could see several relevant examples yesterday. And also
focusing on this multi-site infrastructure and trying to understand what the challenges related to that
are. Dealing with workflows -- I'm not going to explain that; you already know what it is, and several
examples were given.
We will focus on what is difficult in enabling this kind of workflow on multi-datacenter clouds,
assuming quite natural things: the fact that each Cloud site has its own data and some programs, that
some data shouldn't leave some sites, and that the processing must happen in different places. So it
matches a real situation.
So there are several challenges related to that: how to efficiently transfer data, how to group tasks and
datasets together in a smart way, and how to balance the load. I'm just going to focus on the first
challenge, data transfer, with two examples. For instance, one approach we started to explore is a way
to leverage network parallelism by enabling multi-path transfers within the sites. Basically it consists
in allocating dedicated VMs here just for the transfers. So we run additional VMs to increase the level
of parallelism.
We can do even better than that. We can also allocate VMs in separate datacenters in order to further
increase the level of parallelism. Of course it can increase the parallelism, but obviously it has some
cost, so it's interesting to also assess the tradeoffs that might appear, because you increase the
[indiscernible] but you pay for that. So the right question here is: how much are you willing to pay to
have better performance, and actually how much is it worth paying? Because at some point what you
get extra is not worth it any more. So these kinds of challenges are the ones that we're going to focus
on in the next project.
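A hedged Python sketch of the kind of cost/performance question raised here. Every number and formula below is invented for illustration (per-VM throughput, diminishing returns, prices); it only shows the shape of the analysis: keep adding transfer VMs while the marginal gain is still worth the marginal cost.

def throughput_mb_s(n_vms, per_vm=40.0, efficiency=0.85):
    # Illustrative diminishing returns: each extra VM contributes a bit less.
    return sum(per_vm * (efficiency ** i) for i in range(n_vms))

def worth_adding(n_vms, price_per_vm_hour=0.12, value_per_mb_s=0.004):
    # Marginal throughput of the next VM, valued against its hourly price.
    gain = throughput_mb_s(n_vms + 1) - throughput_mb_s(n_vms)
    return gain * value_per_mb_s >= price_per_vm_hour

n = 1
while worth_adding(n):
    n += 1
print(f"With these made-up numbers, stop at {n} dedicated transfer VMs.")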
Streaming was also mentioned in some applications, in some presentations yesterday. So we are also
going to look not only at batch transfers, but also at streaming in all these multi-site settings.
So basically -- and this is actually my last slide -- in this project we're going to look at several things:
how to adapt workflow processing, since what we use right now is mainly targeted at single-site Cloud
infrastructures, and we will also work on the expressivity level. In this second project our partner team
is another team from Inria called [indiscernible], which is focusing on scientific data management and
on higher-level aspects than us -- we are more system level, so to say. So they are interested in this
aspect.
We're going to build a Cloud data management framework addressing the challenges that I mentioned
before and also work on scheduling issues.
And I guess this is it. Thank you. [applause]
Dennis Gannon: Thank you. We have time for questions and Alex will speak.
>>: So if I launch multiple VMs and I start transferring from all of them, it might be network parallelism,
but it might be other kinds of [inaudible]. Is it true they're all taking different paths, or do you have
such a [inaudible] that you [inaudible]?
Gabriel Antoniu: Well, in fact, sending data -- there might be a limitation related to the one-to-one
communication between two VMs situated in two different datacenters. But by allocating new VMs in
the datacenter at the sender site, that communication might be very, very fast actually. And then there
can be multiple sources or senders [indiscernible] on the receiver side, and that can improve things. I
mean, we did experimental studies for that and we could measure that it was worth [indiscernible].
>>: Yeah, yeah, yeah.
>>: So your TomusBlobs provide better performance than using something like S3 or AzureBlobs. Can
you briefly say how it would compare to just [indiscernible] on my local, just raw disks using HDFS?
Gabriel Antoniu: Basically it performs better, actually, because prior to these studies we experimented
with the BlobSeer framework, which implements those principles of leveraging versioning and so on,
within Hadoop. So basically we replaced HDFS with a BlobSeer-based version of Hadoop that is
optimized for concurrent access: it allows concurrent writes, for instance reads concurrent with writes,
with a distributed metadata [indiscernible] scheme. In the initial evaluations we could get up to 30
percent improvement just by keeping the standard configuration of Hadoop. So that was kind of a
starting point for this project: let's try to see how it works when we're on the Cloud. If I were to start it
now, I would probably do things differently, because we have Hadoop on Azure, so we would probably
explore what's happening in that framework.
>>: It would be interesting to -- have you tried -- excuse me, comparing Hadoop on Azure to
[inaudible]?
Gabriel Antoniu: We haven't tried that, but -- well, there's some previous experiments [indiscernible].
If you want to say --
Dennis Gannon: Actually, why don't we save that for the break and we can talk then.
Gabriel Antoniu: Okay. Thank you very much. Thank you. [applause]
Dennis Gannon: So I'm going to introduce my colleague, Alex Wade.
Alex Wade: If I live long enough.
Dennis Gannon: I'm thinking you will.
Alex Wade: Please do.
Dennis Gannon: And he can go ahead.
Alex Wade: My name is Alex Wade. I am on the MSR Science Outreach Team. And I was going to
try to set the record for the fastest-talking presentation, but after Roger went yesterday, I decided I
couldn't compete with that. But I do have a lot of slides here so I'm going to be clicking furiously here
to get you to the break on time.
I want to go back to some of the things that Mercè was talking about earlier, and I'm sure everybody
pulled out their dog-eared The Fourth Paradigm before they came to this and refreshed. But a lot of
things that we're talking about here with data science and data-intensive research have a lot of
implications for the less sexy parts of the care and feeding of the research data themselves, and
implications for institutions, labs, and librarians in taking care of that. Specifically, one of the bits that
I think Jim Gray evangelized was this notion of having all aspects of the data available and
online, having the raw data, having the workflows, having the derived and recombined data available to
facilitate new discoveries and to facilitate reproducibility.
And so Mercè didn't mention this, but she was co-author on a publication that was just released in
PLOS Computational Biology last week, a couple weeks ago, that talked about ten simple rules for the
care and feeding of scientific data. I don't want you to use this as an excuse not to go and read the
article. It's an open-access article on PLOS. Do that. But the shortcut is: these are the ten simple rules
right here.
And I want to build on some of the things that Mercè was talking about with respect to the tools that
Harvard is building out and the Dataverse Network, but I want to focus specifically on these four right
here about sharing your data online, linking to your data from your publications, publishing the code
that goes along with that data, and then fostering and using data repositories.
So for the first item, repositories: there are a number of different software packages available right now
for building out a data repository. The general push that's been going on over the past couple of years,
with efforts like DataCite, is to promote the discovery of datasets as a continuation of the publication.
In other words, you publish your research, somebody reads your research, and then they say, I want to
go and get the data. The primary avenue for finding the data is via the publication, via assigning a DOI
to a dataset along with the publication itself.
But I want to build on that a little bit and talk about some next steps we can take to facilitate the access
and reusability of those datasets, and then finish with a little bit more forward-facing look at how we
can simplify the discovery of those.
Mercè mentioned some work that we did with the California Digital Library in building out the DataUp
platform, and Chris Toll and CB Abrams from CDL are also here if you have any questions at the break
about this. DataUp is intended to be a browser-based, Open Source front end for data repositories that
allows individual repositories to plug into it and provides end users an easy way to integrate with their
own personal cloud storage. So we plugged this into OneDrive storage. It then applies a simple set of
rules for describing the metadata around your datasets and having them get published into a data
repository. This is available for folks to work with. We're looking for other people to get involved in
building on-ramps between the DataUp software and their repositories. So do talk to us more about
that if you're interested.
But if you're setting out right now and you're saying, I'm a lab, I'm an institution, and I need to build out
a data repository, there's a large spectrum of software being developed to do this right now. There are
some traditional software packages that have been used mostly for publications, things like DSpace and
EPrints, that are either being modified, or have white papers being written, on how to use those
repositories to manage data files, more or less.
There's the Dataverse Network that Mercè talked about, and I'm going to talk briefly about the CKAN
software from the Open Knowledge Foundation. This is generally being used in the UK, and globally
now there's a lot of use in open government data initiative types of use cases. But then there are also
hosted repositories, things like figshare as software as a service, a place where you can go and upload
your research data into figshare and get a DOI for it. Microsoft has a service as part of Windows Azure
called the Windows Azure DataMarket, and I'll talk briefly about that.
So really quick, with the CKAN software, I'm going to jump in here and do just -- I'm not going to give
you a demo, but if you go to CKAN.org, you can walk through this yourself. There's a live demo link
right here. And you get dropped into a data repository which has a whole bunch of things that people
have dropped into it. But the general idea is that you have a browse-and-search-based way of drilling
down into these datasets here. So you can filter by organization, you can filter by group, you can filter
by tags, by the actual file format and by the licenses. There's a search engine in it. And if you pick a
dataset -- let me see if I can find a good one here. Road closures -- you get access to the files
themselves, but then also the ability to drill into the data and preview it before you download it. That's
a very intense dataset with two whole entries in there. But play around with that just so you can sort of
see what the CKAN
software gives you.
If you wanted to create an instance of this, this is something that's supported right now on Windows
Azure. And I'm going to walk through just a couple -- well, a lot of screenshots very quickly, just to
give you a sense of how quickly this can be done. Actually, it took me longer to make the screenshots
than it took me to create the repository. But the general idea is that I'm going to copy two disk images
out of the VM Depot into my own blob storage, then I'm going to create a virtual machine from each
one of those disk images, and then I'm going to link them together. Those are the three steps.
So when you log into Windows Azure, if I go to virtual machines, you can see that I have no virtual
machine instances here, and if I click over on the next step there on images, I have no images either.
So I've got an empty shell here. And this browse VM Depot is your entry point then into this set of
community-contributed virtual machine images. There's a huge number of them. I don't know how
many we have these days. But if you browse in there, you can see there's two CKAN disk images
there. There's one for the CKAN database and one for the CKAN web front end. And I can grab one
of those. This is the database. I'm going to tell it which image I want to copy it into, give it a name,
and it starts copying. I go back in there and now grab the web, put it in the same region, give it a name,
and that starts copying over. So I now have two disk images here in my site. I register them; they're
available. So now from those images I can create my virtual machines. I go in, create a virtual
machine, come down here to my images, I see the two that I just copied over. I give the VM an image,
pick a size, and boom, that's starting up now as a VM. Do the same thing with the web. And I've got
my two VMs now up and running.
Last thing I need to do then is go in and link them together. So the web front end is talking to the
database, and I click the link resources, pick the storage account, and now it's up and running and I
have now an empty data repository that I can log into, I can start creating my accounts, I can start
populating that. Okay. So that's the model for taking a disk image off of the VM Depot and building
your own data repository.
As Mercè said, the Dataverse Network is working on this as their 4.0 release comes out in June. This
will also be available via VM Depot.
So I want to build on that then and talk a little bit more about evolving beyond the notion of a data
repository as a place where there's a thing that you're going to go out and get. And one of the nice
things that I like about the CKAN repository is that they start to think about the ways that you can now
start interacting with this data beyond simply grabbing a CSV file and downloading it. So they have
over here this link for the data API. So you can look at the data, you can sample it, you can apply some
filters. They even have some graphing and mapping capabilities so that while you're in your browser
you can play around with it and say is this the data that I want to interact with, and once you say yeah,
this is what I want, then rather than downloading the dataset, the data API here gives you how to
update it and how to query it -- direct access to the data themselves. So you can write an application that is
querying it out of the data repository rather than creating your own instance of that dataset.
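As a concrete illustration of that query-instead-of-download pattern, here is a small Python example against CKAN's documented DataStore search endpoint. The host below is CKAN's public demo site and the resource_id is a placeholder you would swap for a real resource's identifier.

import requests

CKAN_SITE = "https://demo.ckan.org"                    # example CKAN instance
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"   # placeholder resource id

# Ask the repository for matching rows instead of downloading the whole file.
resp = requests.get(
    f"{CKAN_SITE}/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "q": "road closure", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json()["result"]["records"]:
    print(record)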
And similarly the Azure DataMarket does this as well. One of the areas where this becomes particularly
important -- in the interest of time here, so I don't eat into your break time -- is that when you're dealing
with a very large dataset, you don't want to download it. So the example I was going to show here: if
you go to the Azure DataMarket, we've taken a large amount of data out of the Microsoft Academic
Search service that we've been building up. So it's something on the order of 50
million papers, 20 million authors there, a large number of relationships. You can actually see the
individual SQL tables in the model there. And you don't want to download that entire dataset if you're
interested in dealing with a subset of it. So like the CKAN repository, the Azure DataMarket has a
RESTful API where you can go in and query that data and either bring down the subset of data that you
want to store locally or just have that be the backend database to the application or the analytics that
you want to be doing there.
So you go to the Azure DataMarket, you'll see then, just like CKAN, there's a number of different ways
that you can browse into the datasets that are available there. You can browse by domains, you can
browse by the free ones versus the ones that have a fee associated with them.
And somebody asked me yesterday: how can we take our research data and get it into the Cloud in a
way that we don't need to pay for it? One of the things that the Azure DataMarket supports now is a
cost-recovery model where you can make your data available, charging the consumers of that data only
as much as you need to cover the tenancy bills -- the hosting bills -- of putting that data into Azure. So
it's something to consider. It's not quite wide open enough for the
world to start coming and putting their data into it. So the Azure team, I think, is really focused on the
high-value datasets. And you'll see there's only a couple hundred datasets in there right now. But that's
an approach to think about for a sustainability model.
In the case of Azure, the way you get your data back out is via the OData protocol. If people aren't
familiar with it, go check out OData.org. It's now under the umbrella of OASIS, being standardized, but
it's built on top of HTTP and on top of REST, and you then have the option of getting the payload back
in JSON format or ATOM format. It's a very simple way of interacting with very large datasets. One
of the nice things about it, if you're interested in this aspect, is that it's also a read/write protocol: if you
have a dataset that you want to be writing back into, there are write operations.
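A hedged Python sketch of pulling a filtered slice over OData rather than downloading everything. The service root URL and entity set name are placeholders invented for this example; the $filter, $top, and $format query options are standard OData conventions, and the exact JSON envelope varies with the OData version.

import requests

SERVICE_ROOT = "https://example.org/odata"   # placeholder OData service root
ENTITY_SET = "Papers"                        # hypothetical entity set name

resp = requests.get(
    f"{SERVICE_ROOT}/{ENTITY_SET}",
    params={
        "$filter": "Year ge 2010",   # only bring down the slice you need
        "$top": "20",                # one page, not the whole collection
        "$format": "json",           # ATOM is the other common payload
    },
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
# Newer OData services wrap results in "value"; older ones use a "d" envelope.
for item in payload.get("value") or payload.get("d", {}).get("results", []):
    print(item)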
One of the other rules mentioned in the article by Mercè, Alyssa Goodman, and others is: publish your
code. There's a project that was just started up through a collaboration between figshare, the Mozilla
Science Lab, and GitHub which makes it really easy to plug your GitHub repository into figshare. The
general idea is that if you want to snapshot your code and get a DOI for it, so you can say this is the
instance of my software that I used to produce that dataset, you can plug it in -- I think they've got a
Firefox plug-in -- you can plug your code repository into figshare, snapshot it, it will bring all the code
over to figshare and give you back a DOI for it. So it's a really great idea.
[Indiscernible] is also involved in a workshop at Oxford a couple of weeks ago with the Software
Sustainability Institute. Is that who ran it?
>>: It went live yesterday.
Alex Wade: So there's a great blog post that he just published last night on our Connections blog.
Check that out; it talks about software as a part of scientific reproducibility.
And the last thing I'll leave you with, maybe even finish in time for the break, is this notion of moving
beyond linking to your data as the sole way of discovering research data. This afternoon somebody's
going to come over and talk about some of the tools that are being built into the current generation of
Excel. And I think we had a demo last night out in the lobby. But just sort of
thinking forward a little bit on how research data could be as pervasive as web pages on the internet.
So I didn't go to the Spitfire Grill last night with the rest of you. I had to go home and pick up my
boys. But I got home in time to see the last ten minutes, I think, of the hockey game where the Flyers
killed the New York Rangers. And instead of opening up my web browser and looking at the web of web
pages, I opened up my data browser, Excel. And I don't have any data in Excel right now, but I was
interested in learning a little bit more about the Flyers. I haven't really been following them this
season. So I clicked online search. I got a search box over here and I typed in -- can't really read that,
Philadelphia Flyers Roster. And I got a set of datasets back. So this is Bing indexing data from the
web. Most of the results here are actually tables that they scraped out of Wikipedia. But you'll see
other datasets in that set. And imagine that this was scientific data now. I had a need for some data, I
typed that need into a search box, just like you would into a web search engine and you get back a set
of data. First one says "current roster, Philadelphia Flyers." That sounds good to me, and I bring the
data into my spreadsheet. And this is where you'll start seeing some new tools this afternoon. But
there's a very simple one that Excel 2013 has, where you can just take the dataset and say: here are
some things that are roughly places in the world; cast those onto a Bing map. And within about
90 seconds I went from having an empty Excel spreadsheet to having a map of the birthplaces of all of
the Philadelphia Flyers. So I just wanted to play around with that data.
So I just want to leave you with that notion of thinking about data discovery, thinking about research
data, getting it in the hands of users for new experiments in sort of this paradigm.
I think I'll finish with that right at 10:30, and I'll leave one minute for questions. [applause]
Dennis Gannon: Questions?
Alex Wade: Question?
>>: Yeah, did I miss something? [laughter]
Dennis Gannon: I think everybody's ready for a break.
>>: [inaudible]
Alex Wade: Questioners?
>>: I have a question but I cannot discuss it over the [inaudible].
Dennis Gannon: Why don't we go ahead and take a break.
Let's thank the panel. [applause]