Document 17828056

>> Rob Fatland: Welcome back for day two. I will try to keep my voice from overwhelming the speakers
here. So again, I am Rob Fatland, and I do have business cards today, finally backed by the remaining
epiphytes; get them while they're hot. Let's see, so yesterday we had this fantastic day of presentations
kind of all across the spectrum of data production and use, and so today our program as I understand it
has sort of two parts. There is this initial section that I will kind of be directing and that will feature a
presentation by Shoshanna and some stuff for me, and then I think we have a coffee break and then we
have sort of ongoing conversations and then Jan and some of the other I guess Jan's folks will be
moderating that.
So what I wanted to start off with today: here you are seeing a WorldWide Telescope Layerscape tour running in the background. The main point here is, first of all, if you're standing up and you watch
this, you could accidentally fall over [laughter], at least I often do. So this is data from U.S. Geological
Survey, I think. It is high resolution imagery of the Nisqually Delta down in the south sound. And I just
popped it upside down to make it a little bit easier to see. You can also put it down to its correct depth, and the depth is exaggerated and so forth, and then it's going to start over again. And I will leave that
running while I talk for a couple of minutes on some of my themes and then we will turn it over to
Shoshanna.
So one of the things that we are interested in as I mentioned our group is doing environmental science
informatics, the arc of data from the conception of a research project all the way through to let's say
publications. And I have written down for my own reference here some of the topics that I can sort of
gas on about for a while. I am really in my element when I am talking about how to do nuts-and-bolts, detailed things with software and with, let's say, you know, types of data, and I have a research project
that involves biogeochemistry so I'm really happy and enthusiastic to talk about, for example, the things
that you can do with a mass spec in relationship to fluorometry and so forth. But where I get out of
my depth is when I start talking about generalized problems, but everybody kind of likes to do that so
my list of generalized problems has sort of two broad categories. There is a process question, how do
you get to where you want to go, and then there are questions that are surrounding data. So I will talk
more about that later, but when all is said and done, eventually you have to start building a tool, and that, like I say, is when I'm happiest: when I have something that I definitely need to build and I can get working with engineers and software people and designers and so forth.
I worked with a guy at Johns Hopkins and he made this observation that right now if you want to make a
phone call, you don't need a software engineer. You don't need help. You just go to the store and get
the phone and it works because there are 7 billion of us who need cell phones more or less, or however
many, and that is the consumer market, but if you look at that technology and what it is actually doing
and then you have a research problem, like why can't I use that technology in my research, because it
would be so useful to be able to have that research equivalent of a cell phone; there's no market for it.
There are not a lot of people trying to make money off of that, so we have this kind of carry the
technology across conundrum, and my friend at Johns Hopkins points out that today if you want to go out and build a sensor network and install it in some trees or in some soil or something, you kind of need
17 technologists for one ecologist is the ratio [laughter] and it would be nice to reduce that to 5 to 1 or 1
to 1 or .5 to 1 or an undergraduate to 1 who is just doing this stuff in their spare time. So making solutions that are software-based and engineering-based and that can be replicated is, like I said, one of our important themes.
This particular solution, if you like, takes advantage of consumer electronics because everybody has a
graphics card in their laptop that is capable of rendering 1 million polygons or something like that, so
this is a data set of something like 2.4 million data points, and I've only got 1.8 million data points into WorldWide Telescope before it said no, I would really rather not have any more data points, thank you.
But you know I pushed it there and it just took a little bit of work, but half a million is sort of the range.
And then the fact that WorldWide Telescope knows about time and you can set up your data with time
tags and say okay, go and play through the time of the data and when you get to the end start over
again. And that is happening in WorldWide Telescope completely independent of what you are
looking at, so you can fly to the other side of the world, but you can fly around your data and look at it
as it is evolving in time, so these are the two strong points of the visualization piece of WorldWide
Telescope, and as I say we built an ecosystem around that called Layerscape. It's a website and it is an
add-in for Excel. What is an add-in for Excel? Well, Excel is becoming a programming environment. You can write your own ribbon in Excel; it is called an add-in. And the model for WorldWide Telescope is
that you are a domain expert. You've got some data. You know how it works; you know what analysis
you want to do on it. Don't try to do that with WorldWide Telescope. WorldWide Telescope is not
Matlab. It is not some programming environment. It is a visualization environment, so you abstract
your thinking about your data into a separate data app and you connect that to WorldWide Telescope
and you put your data in and you look at it. And then being able to be inside WorldWide Telescope and
ask questions back out is sort of where we are right now. That is where we would like to build next. So
the utility of WorldWide Telescope, therefore is to handle a lot of data, handle complex data and to be
able to use your GPU to render this stuff to fly around it, to see it play out in time, to investigate it and
the goal here is to get past the eye candy stage, the isn't that cool, to the point where you are getting
real insight. So that is what our little Microsoft Research project with Layerscape is about. But we have
lots of other stuff going on. And in particular, Microsoft has lots of other stuff going on and it is really
fun and extremely interesting to watch it evolve. So at this point I would like to talk about the data
system, or the data resource, called Eye on Earth, and we will have Shoshanna come up and talk to us about this. And the basic idea is the European Environment Agency was tasked with providing a push-pull kind of
system where the users could actually push data in and get it back out. We got to build it and we got to
attach it to something called DataMarket. And there we are.
>> Shoshanna Budzianowski: So I am actually from a group in our server and tool business, so we're
very practical engineers and we create lots of products. We are not scientists in any sense even though
we do have some smart people in the group and so we took a look at kind of this emerging world of data
and decided that there were some real problems that needed to be solved. Now when you are working with data sets, some of the typical operations that you do are to first discover the data, if it wasn't data that you've generated yourself; you have to understand or schematize the data; you need to extract semantics from it; and then you're going to start being able to get to a place where you can be doing
some kind of intelligent analysis on it. But, you know, there are problems before you even get to that
point, and this is what we have been looking at, so we know for a fact that there is more data generated
in a month right now than there was in the entire past century; nobody can keep up with this information.
Not only that, but the data is not always reliable. The data is dirty. As a matter of fact, this is so funny. I
just realized that I went back to my whole career history and my first job out of college was actually
working at a company called Foxboro Company that essentially monitored things that flow through
pipes. And my first job out of college was to spend my time with awk and Perl and sed and anything you can imagine, and all I did was cleanse data for like two years. I did other things too but, you know, at a certain point I realized that I was a pretty expensive data cleanser, even coming out of college,
so this problem is only getting worse and worse over time, and we realize, you know, there are other systemic problems. Even if you can get access to the data, that data is often siloed behind very complex walls, at least for normal people and potentially even for academics, whether they are contractual walls or walls that are put up because of some of the constraints around academia and citations, et cetera. Getting access to that data securely, in a way that the publisher trusts the consumer, is extremely difficult as well.
So what we actually did is we said, let's take, you know, every problem is a big problem, every problem is a hard problem, but let's just start with a microcosm and start creating a solution that allows people who
own data to let people who want to consume the data know about it. And so we actually created a
system called Azure DataMarket. How many people here have actually heard of Azure DataMarket?
Hey, that is actually pretty good. All right. It's getting better than typical. And so we like to think of
Azure DataMarket as a place where you can go to get curated data sets from publishers who have
written and signed a contract with us that they will keep that data set available for a minimum of three
years, and I think that is important. Now, you know, a lot of times, if you have a static data set, you're going to pull down that static data set maybe once, or go back and update it, you know, maybe three or four times a year, but a lot of what we are seeing now is data that is a continuous data set that needs continuous refresh. Because these providers have signed contracts with us, it means that if you want some of this data for your solutions or research, you can at least be guaranteed that the data is not going to disappear and you are not going to be left hanging. Now
the other thing that we do in the Azure Marketplace is we actually work with these content providers across the world to provide a consistent interface on the data. Now, we use a standard called the Open Data Protocol, or OData. I think you've heard of that before. It essentially is a stylized protocol for making queries against the data, and then it returns the results to you in JSON form or Atom form or whatever you want, and the reason that that is important is because it means that tools and applications
can bet on the data. For instance, these data sets integrate with the full Microsoft BI stack and I will
show that to you in a second. The other interesting thing about unifying over a protocol is that it means
that it is actually easier for you to mash up data from different data sets. So, you know, me being the engineer, having lived in the world where every single developer creates their own API because it is a creative thing to do, it actually means that it is impossible to use two APIs from two
different providers together and get anything meaningful out of it. And so this open data protocol is
one of the key elements about making that data available and accessible too. So enough talk. We will
go in and take a look at the Azure DataMarket for a second. And while I'm bringing up the DataMarket,
so first thing, let me ask you: of anything I said about some of the problems that you are solving with data, how much of that actually resonates with you, or are
there problems that I am just kind of glossing over? Resonates? Resonates? Yeah, good. You have
other problems too? Okay. Good.
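[The OData pattern described above, one stylized query syntax across every provider, can be sketched in a few lines. This is a minimal illustration with a hypothetical service root and entity set; the real names come from each data set's documentation, while `$filter`, `$top`, and `$format` are standard OData query options.]

```python
from urllib.parse import urlencode

def odata_query(service_root, entity_set, filter_expr=None, top=None, fmt=None):
    """Build an OData query URL; service_root and entity_set are assumptions."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr   # stylized query predicate
    if top is not None:
        params["$top"] = str(top)         # limit the number of rows returned
    if fmt:
        params["$format"] = fmt           # "json" for JSON; default is Atom XML
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if params:
        url += "?" + urlencode(params)    # percent-encodes '$', spaces, quotes
    return url

# Hypothetical feed: first 10 rows where CountryCode is 'PL', returned as JSON
url = odata_query("https://api.example.com/data/v1", "Observations",
                  filter_expr="CountryCode eq 'PL'", top=10, fmt="json")
print(url)
```

[Because every provider answers the same query shape, the same client code works across data sets, which is the mash-up benefit mentioned above.]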
So one problem I actually hear in the scientific community a lot is sharing data that cannot be made
public and we will talk about that a little bit later. So this is actually Azure DataMarket. It is as I said a
Marketplace for data. Some of the data is available for subscription price. Some of the data is available
for free. For instance, we have all of the World Health Organization and a good portion of the UN data
sets on here. Now, one thing I would love to get from you, and I want to cruise through a little: if you see a data set that is not on here, especially like from the UN, and you need it to be on there. They've got a bunch of geospatial data sets that we just haven't gotten because we
haven't gotten a huge amount of demand for it, let me know. There are other data sets that are works in progress, and the place where we are actually spending time now is on environmental data sets. It turns out there is a lot of sensitivity to publishing environmental data sets to the world in a way that they can be consumed without having been interpreted by somebody who is part of an agency or by some
government regulation, so, you know, things like getting all of the water quality data sets from the
United States up here is something that we are absolutely actively looking at and, you know, it is kind of
stuck in bureaucracy and policy right now. We see that theme consistently across the world. So let's
just go take a look at our UN data sets. I am just going to cruise here a little bit. Actually let me go here
so you can see the whole list. Did I spell it wrong? There we go. So we have got food and agriculture.
We've got international labor and there are actually other data sets that I am not finding right now. So
we've got the world health--over here we've got the full set of data sets. We've got energy statistics, all
of the AIDS data, commodity statistics, the food one for me is just so telling about the world because it
really helps describe the food issues that are occurring in Third World countries throughout the world,
and so when I read that on that one tends to get pretty sad for me including how much land is planted,
how much is arid. It's pretty interesting. Development goals are all in there, international labor, which
we saw, and then tourism which I always thought was fun. Gender info and other really fascinating data
set, world population prospects et cetera. So how many of you actually use the UN data sets? Okay, is
that important to your work? Okay. So are there data sets that you would expect to see from the UN on
here that you haven't, in that brief thing that you haven't seen yet?
>>: The geospatial one.
>> Shoshanna Budzianowski: The geospatial one, I know.
>>: Of course, the ones you don't have up there.
>> Shoshanna Budzianowski: Well, I will talk about geospatial, you know, I am going to be really open
with you in this crowd. So there are emerging REST-based geospatial standards, but there is not one that we were willing to take a bet on yet. And so I'm going to tell you something that we are going to do, and you can just barf all over it and tell me that it is completely wrong. I think this is on tape, so I shouldn't have said that. [laughter]. Part of what we have been doing with the European Environment Agency is creating geospatial solutions, showing data over maps. We actually partnered with a company
called Esri. I think that you are probably all familiar with Esri. Now they actually published a GeoREST
specification and it is a great spec, from what I have seen. But it is a spec that has been, you know,
submitted to be an open spec. I can't remember which committee they submitted it to, but it is
potentially now an open specification. Now, one thing we could do is say we love that standard and kind of opt in to that standard, or we can say that we have to go with something that is more OGC but, you
know, won't have as much tie in to real engineering systems in the background, and so these are the
kind of like decisions that we need to make to determine, you know, how open we can make the data
and how standardized we can make the data. So, feedback: should we support the GeoREST specification from Esri, or try to go with something that is even more open?
>>: This is not my field [inaudible] you guys, there is an open GIS Consortium, are you familiar with
that?
>> Shoshanna Budzianowski: Yeah.
>>: Why wouldn't you, why wouldn't Microsoft want to go with something like that?
>> Shoshanna Budzianowski: Well that, this is where it gets to like, this is where the rubber hits the
road, so there is theory and there is implementation, and so yes, we could go with that and then not have the data that we need fully accessible through the DataMarket. We could go with where the
biggest data sets are and use that standard. And so, you know, the hardest part for us is to make that
balance right.
>>: And just to follow-up, we are working with OGC to [inaudible] the old data [inaudible] layer on that
is easy.
>> Shoshanna Budzianowski: And that is absolutely perfect. Right, yes?
>>: You said that the REST specification has been submitted to OGC for, to get [inaudible], but I guess the process has been slowed down because the [inaudible] is currently multiple specs in one, and the recommendation of the working group, the OGC working group, is to break it into its
component specs so that it is easier for people to adopt a portion of it so that they can comply with the
visualization portion, or the data access portion. So if you were to support that modularization I think
that that would be the best…
>> Shoshanna Budzianowski: Okay. That is interesting. So they've got a bunch of feature services that
are about calculating things and they've got a bunch of like mapping and tiling and visualization services;
you're right.
>>: [inaudible] when it's all bundled it actually makes it harder for people to be compliant with it, but if
you break it apart into its modules, you don't actually change the function but you make it possible to be
compliant with the pieces that people want to be compliant with, or need to be compliant with.
>> Shoshanna Budzianowski: Fascinating. See that's exactly, that's perfect, okay. So inasmuch as Esri is
an independent company that is feedback that I can provide them that would essentially make it easier
for us and for you to adopt that specification for geospatial data. There is one other, since we're talking about geospatial, just a second, we also have one other problem with at least that specification that they submitted, and with geospatial data currently, which is that they are a flat-earth company. And we are now in a round-earth world. And so a lot of the data that is being collected now at the poles can't adequately be represented in that specification, and so I think there is some other pressure that needs to happen not only for Esri but across everybody who thinks that the world is still flat. Okay, so [laughter]. It is
easier to think about the world that way. Okay. And so one thing I am going to do is I am going to do a
search on environment, just so you can see the environmental data sets that we have available. Let me
get rid of the United Nations and search for environment.
>>: [inaudible] comment, I am very surprised, I guess it's because they've been collecting through time,
but those data sets that you put up there from the UN, I am shocked that they are not using spatial
databases.
>> Shoshanna Budzianowski: Right. Well, you know, that's really great feedback, so when we, you know
Microsoft is a company that has, we have a lot of business customers. We have a lot of commercial
customers, and what we've done is we've begun to focus on portfolios of information that are available
that, and are necessary. So we first focused on business and financial data because, guess what? That's
where it is. And you know, the UN and WHO are pretty, you know, open in where they want their data to go. They actually want their data to go everywhere, and so they kind of pushed it to us and said, well, that is really interesting: when you start to look at some economic scenarios, you need some of that population demographic data, and that is how we are getting into it. Now that we are actually creating, you
know, a larger perspective on the environment at Microsoft, now we are going back and we are kind of
retrofitting in some of that environmental and geospatial data, and so it's really a matter of time more than
anything. So when I go look at, I just did a quick search on environment in the Azure Marketplace and I
came up with 65 hits, and we also have some scientific data. I just want to show you one that I
absolutely love. And this is probably a company you have never heard of. Have any of you ever heard of
Environmental Hazard Rink?
Okay. I know that we are mostly here talking about water, Northwest waters. They have done
something amazing on the land, not on the water and I think what you want is this company to start
looking outside of, you know, the land boundaries. So what they do is they calculate for any zip code in
the United States how much money it is going to take to actually clean up all the hazardous waste in
that zip code. And so, they had published this data set, and I went and talked to their engineer and I said, well, what do these numbers mean? And he said, well, they are actually numbers, and you should probably add a billion to the end of the number. The number was so large and so scary that they couldn't, for public consumption, actually publish the exact dollar amount that it would take to clean up the environmental contamination. So, interestingly, we have
an interactive map here and this 9 and 10 is where I live. So the scale goes from 1 to 10. It turns out
that we have communities that are built on things that used to be industrial areas. And the industrial
waste was never cleaned up and I happen to live in a very nice community right by Lake Washington but
it is also very close to where the Navy shipyard used to be and it is toxic. So, you know, a little
information like this is important. Now you can also get more graph information out of it. You can get
more data information, so I am just going to skip back and I want to go look at another data set and, well
you can see European greenhouse gas emissions et cetera so, I am actually going to go back and look at
the data set that we have from…
>>: May I ask another quick question here on this?
>> Shoshanna Budzianowski: Yeah.
>>: Because we are a bunch of science types, we are always concerned about the quality and the vetting of data, and that is not always obvious. That is not always on the front end when these
DataMarkets are out there, but I am assuming that the metadata and the standards behind that are
available so that if you are interested in a data set, you can do your due diligence on that.
>> Shoshanna Budzianowski: I think that is a really good question, so one of the things that we are
trying to do is to make sure that these providers are trusted providers. Now we do have some providers
on here, specifically around business data, and I think their data is just awful, but I think it would take
about 10 minutes for anybody to take a look at that: just go look at the data, pull it down and go look for outliers and the standard deviation over some of the values that you are getting, and you can tell if they are good or not. You know, when you start getting data sets that have a lot of null values or just really inconsistent values, you can throw them out. Now, we at Microsoft have a little bit of a privacy kind of
policy and a non-compete policy and this is something interesting, so we like to think of this Azure
DataMarket as iTunes for data, but Apple has a very different relationship with their providers. They
essentially provide their providers with very strict standards and they review all content.
It means though that Apple is actually liable for the content that is produced by their provider. So when
you go to the Apple, you know, App Store, right, they have reviewed and certified every single
application that is on that App Store, right? It means that they are taking liability for those applications.
Now I don't know how many of you have ever read the agreement that you sign for the App Store when
you take an application. It is literally 26 pages long. I think I am the only person in the world that has
ever read it except from some lawyers and I did it at a soccer game that I was really bored at one day.
But [laughter].
>>: And you still signed it?
>> Shoshanna Budzianowski: What?
>>: And you still signed it?
>> Shoshanna Budzianowski: Of course I did, but I don't have an iPhone anymore; I actually have a
Windows Phone. This is a problem that the whole search world has, even Google: as soon as you understand that content, you are liable for it. So you keep hearing things like the government
trying to make search providers liable for the content that they are displaying. In the same way, this is a
nascent business, this whole data business at Microsoft; we weren't willing at this point to close the
pipe, because if we actually said, I am going to certify every one of these data sets as the data being
reliable and accurate, then you wouldn't have any data, because we would still be on our first data set,
literally that is how hard it is. And I kind of briefly talked about those problems with data; I live in data
every single day and let me tell you how hard it is even to get data into that consistent OData form that
everyone can use.
So these typically are multi-month to multiyear negotiations with companies that hold the data, so there
is only so much that we want to do to stifle that pipe. Going back to some of the questions that you
asked about whether there is metadata for the data, the answer is, well, we love the OData spec, because you can just take a URL, put dollar-sign metadata ($metadata) at the end, and get all the metadata back, which is
great. Or you can just actually go look at this page. This is a data set that is published by a company
called Weather Trends, and what they have done for us is give us a very small sampling of their data. This is data from, I think, about 2001 to 2012, and we have weather data for about 10,000 stations across the world. It is just a sampling of the data set that
they have. These are actual weather readings from across the world. Now what this company does in
reality is they actually have weather prediction models and so the thing that they haven't given me yet
are any of their prediction models that have predicted weather across the world. So here is the schema
for them. So you just put in end date, start date and a station ID, which isn't in this database right now
but it is easy to get, and essentially you can get the maximum and minimum temperature, precipitation.
Here are all the data types that come along with that. This is essentially the station ID, and you get
the latitude and longitude back from them, and so this data set is one of those data sets that is available
for purchase. There is a preview or a trial data set that is available for free. And for $12 or $36 or $120 a
month you can continue to read this data set, because it is weather you might actually want to check
the weather every day. One thing that I did with this data set and let's actually just go in and use the
data. Okay. I think I have already purchased it, but I will just put purchase again. Okay . The one thing
that you will learn as you are watching me is that I am not so coordinated, so you will just have to
forgive me.
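[The two mechanics just described, appending `$metadata` to a service root to get the schema, and querying by start date, end date, and station ID, can be sketched as below. The service root, entity set ("Readings"), and property names ("StationId", "Date") are hypothetical stand-ins for whatever the real `$metadata` document declares.]

```python
from urllib.parse import urlencode

# Hypothetical service root; the real one comes from the DataMarket listing.
SERVICE_ROOT = "https://api.datamarket.azure.com/WeatherTrends/Sample"

def metadata_url(service_root):
    # $metadata returns the EDMX schema describing entity types and properties.
    return service_root.rstrip("/") + "/$metadata"

def station_readings_url(service_root, station_id, start_date, end_date):
    # "Readings", "StationId", and "Date" are assumed names for illustration.
    flt = (f"StationId eq {station_id} and Date ge datetime'{start_date}' "
           f"and Date le datetime'{end_date}'")
    return f"{service_root.rstrip('/')}/Readings?" + urlencode({"$filter": flt})

meta = metadata_url(SERVICE_ROOT)
query = station_readings_url(SERVICE_ROOT, 12375, "2010-01-01", "2010-01-31")
print(meta)
print(query)
```

[A client would first fetch the `$metadata` document to learn the actual entity and property names, then build filters like the one above.]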
My family knows this. They know better than to go anywhere near me especially when I am eating.
Okay. Now we're going to go explore this data set. Now you may not have noticed, but I have a Polish
last name [laughter], so I am going to go interactively build the query for Warsaw, Poland, and I have already pre-cached the station ID for Warsaw, Poland. And
what I am going to do now is just run the query, and what this lets you do, without having to download and pull data into tools, is quickly go and take a look at the actual values that are in the data set. So I find this extremely helpful in just kind of doing a simple evaluation of whether or not this
data is even meaningful for me, and they did give me this data set in Fahrenheit not Celsius, so it's
probably not very useful for everyone, but that is what we have on here because this is more of a
consumer option and a lot of the times what you see here is that, you know, this is a contact point to
begin to go get the data. So you may not have even known that this company existed, so this company
was founded by a bunch of weather climatologists who have, you know, advanced models on weather.
Now the founder of this company actually just left and is now working for NASA on their climate models,
so the data is valuable and relevant.
Now what I had done is after I had done this query, I went to export and export gives you the option of
pulling this data down locally onto your client either in XML or CSV form or an Excel PowerPivot. How
many of you are familiar with Excel PowerPivot? Okay, so I think I heard Rob say earlier that one of the
things that we have noticed is that Excel is still the programmable interface that is used by a lot of
people. So Excel PowerPivot, which I am going to show you now, is an augmentation to Excel that
essentially supports multidimensional data sets using the full SQL Server BI stack, and so what I have
done here is, well, this is a very flat, single-dimensional model, but what I can actually do is very
quickly build up a multidimensional model by importing data sets, different tables, you know, name your
favorite data. You can even import data from data warehouses and this is an in memory columnar
database that is lightning fast. You can start doing pivots, filters, functions on data sets that are as big as
10 billion rows almost instantaneously. I do not have a 10 billion row data set for you, but this is one of
the premier products that we have that turns Excel into something that is actually valuable for large
data sets. So what I did in that weather trends data set is I pulled it down…
>>: [inaudible].
>> Shoshanna Budzianowski: Oh. PowerPivot?
>>: Yeah.
>> Shoshanna Budzianowski: Like, I pulled it down for free. So it's on…
>>: Is it on the regular Excel release or does it have something…
>> Shoshanna Budzianowski: Well, just go to Microsoft.com or actually if you're on the DataMarket site,
we will tell you how to install PowerPivot. We will actually give you the link right there and you just go
install it for free, and it is just an add-in into Excel, so it is really super simple to use.
>>: You said that with a B or an M?
>> Shoshanna Budzianowski: B. I said B.
>>: [inaudible].
>> Shoshanna Budzianowski: Yeah. As a matter of fact we have a distinguished engineer--actually I
think he's a technical fellow now in the Microsoft terminology, which is our, you know, top-level engineer at Microsoft, called Amir Netz [phonetic], and I know there is a really great video of him
demonstrating this on data that came off of sensors from cars, so billions of rows.
>>: Can you hook it to a remote MySQL database?
>> Shoshanna Budzianowski: Well, anything that supports an OData feed, it can be connected to that.
They support like about 10, 15 different protocols for connecting to data, including all of the standard
ADO.NET, ODBC connectors, so if MySQL supports ODBC, I think you are fine. I don't spend a lot of time with MySQL. So what I did here was, I think, there are about 4,000 rows in here. This isn't huge. It is
only for one station, exactly one station in Europe and I said they've got about 10,000 stations. If I had
actually downloaded the data set I would be working with something that would be much larger, and
what I did quickly was I went through and added some formulas to this data set. Just to show you how
really simple it is, I think I should go and add another formula. What I did is this data set comes in date
and time stamped and I want to actually look at weather trends, so I want to say is January hotter or
colder over the last 10 years? So for me to do that I couldn't use a date stamp, I had to just filter this
data by year and month and day so that I could compare month over month. So that was a super simple
formula and just to show you how easy it is, I am going to go add a column to here, and I am going to say
equal, year, and I am going to go back and choose my date column right here. And scroll back over, and it has actually already populated the entire column with that formula and calculated it automatically. So it is
really just super fun and super interesting. So once I do that, then I get to use the full power of Excel,
right?
And so let me go and pull up what I did with Excel, right here. I went and created PowerPivot table and
chart, super simple. Are you guys PowerPivot table experts, chart experts? Okay. Let me go select the field list for this. This is how you operate: go pull up that field list too. Oh well. Okay. This is essentially how you operate a PowerPivot. You have this ability to create slicers, and what I did is I sliced this data set by year, and then I just decide what my X and Y axes are
and what value I am going to use, you know, in the data set and for this one I am actually summing, let
me go. Okay, well, I am actually summing the maximum temperature per day, and you can see that in the chart. Let me go get rid of this thing. And I am looking at, for 2010, here are 31 days: what was the
maximum temperature across that? And so you see the normal bell curve that you would expect in the
weather. The other thing that I can do though is I can look at any year, specifically. And then I can
actually look at the aggregate of all of the years for which I think I have good data, and you will
notice a normal bell curve. Now this just goes to show you how quickly you can do this. This is not a
very advanced formula. I am not a data scientist. I would actually expect that you would do something
different like have multiple charts and overlay those values from multiple charts and then go look at
actually, you know, whether or not there is a variance in the data, you know, between the different
months, so it was just something that was really fun and super quick to pull up and it shows you how
you can very quickly get data out of the system using that OData protocol into the standard BI tools at
Microsoft.
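Since the OData protocol mentioned above is just HTTP plus a conventional payload shape, pulling rows out of a feed programmatically is straightforward. Below is a minimal sketch assuming the JSON "value"-array form that OData services return; the station name and fields here are hypothetical, and a real DataMarket request would also carry your account key over HTTPS:

```python
import json

# Hypothetical OData-style JSON response, as a service might return it.
raw = """
{
  "value": [
    {"Station": "EU-0001", "Date": "2010-01-15", "MaxTemp": -1.0},
    {"Station": "EU-0001", "Date": "2010-01-16", "MaxTemp": 0.5}
  ]
}
"""

payload = json.loads(raw)

# Each entry in the "value" array is one row; tools like PowerPivot and
# Excel do this same flattening when pointed at an OData feed URL.
rows = payload["value"]
temps = [row["MaxTemp"] for row in rows]
print(len(rows), max(temps))
```

This is all the client-side machinery the "standard BI tools" hide: fetch, parse, flatten rows, then analyze.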
So enough on Azure DataMarket. Let's just go pull that down, and go back to the slides. One thing I did want to talk about before I actually start going into what you really wanted to see, which is Eye on Earth, is, you know, one thing that we recognized is that all data is never going to be public, just
because data is dangerous. I discovered lately that the fish hauls that are coming off of ships all over the
world are the most dangerous data sets. Although I know that we collect, you know, that data, NOAA
collects that data, it is never public because it is considered to be competitive information and yet to do
science you need the data, and so we are kind of in this Catch-22, which is how do I get access to that data. I would love to get access, but I have been told to cease-and-desist already from even asking
the questions. So what we are doing in the Azure Marketplace is, we recognize that the functionality that we
provided here for providing access to information, discovery of information including the discovery of
the schemas that we talked about earlier, it is all great, but some people want to keep their data private
so we are offering that platform, that public Marketplace platform to organizations that want to
implement the same functionality but keep it private.
And so we call this our private Marketplace. This is not all that secret: in the next couple of months we will have a lab available of that private Marketplace. What
it does is it allows people who kind of own that Marketplace to invite people in. They can decide who
gets to publish data. I love to think about this kind of virtuous cycle where data gets better because
people are allowed to improve the data and republish it. So part of this concept of private Marketplace
is to give people permission to republish data sets, improve data sets back to this closed community.
Now, yes, this is good. This is what I have been hearing. Now for organizations that want to charge for
their data, that is fine. For instance, some of the most interesting social data feeds right now do not
come from Twitter; they do not come from Facebook; they come from purchases made on Visa. And
Visa has one of the largest data sets in the world and if you can get access to it, you know, good luck to
you. Actually I met a woman from MIT who has been mining the Visa data set with permission from
them on their site, having signed her life away in blood, and is just finding really amazing things about
human behavior from that. I mean you can actually predict when families are beginning to go into crisis
mode by some of the patterns that you notice in terms of their charging or their lack of charging
behavior, and so one of the things that she discovered is that you can predict when somebody is going
to get laid off four months before they get laid off. It's not that the company came and told them that
they were getting laid off; it's because there is tension in the system and as soon as there is tension in
the system their patterns start changing, so they get a little more erratic in their purchasing patterns. All
of a sudden all those goody things like trips and movies go away. The other thing that she said that you
can pull out is, it is so easy to tell, you can obviously tell when somebody is about to lose their
mortgage, when they are going to stop making mortgage payments. You can also tell when people are
starting to get into crisis situations with their family, when health becomes an issue for them, when the
ability just to provide normal sustainable kinds of things like food becomes an issue, when food and shelter become a major issue for the family. And this is all coming off of a single data set, a data set that you
probably wouldn't think about for human behavior, but how does Visa give access to that data? This is
the kind of environment that we believe is safe and secure for them and still provides access for people to get to it in a way that they have never been able to before. So here is some information; I
will let you…
>>: [inaudible] access to that data?
>> Shoshanna Budzianowski: Well, they have to decide. See, in the public Marketplace, you know, did you hear Microsoft actually has some ads about Google's privacy stuff that went into the New York Times? I
haven't seen them. One thing about the public Marketplace is the users who consume the data are
anonymous to the providers. We do not provide your information to them unless you have an issue or
you agree to provide the information because you want to have a relationship. So that is very private,
now in the Visa case it is not private at all. They could make you sign some sort of a licensing or royalty agreement based on the derivative products that you create before they give you access to the site. So it is completely up to them to manage their
own privacy policy in that case. But, you know, the fact that the DataMarket is really public and
anonymous is very difficult for those providers, by the way. Everyone in the world wants to know
exactly who their contacts are, exactly how much every contact has ever spent because they all want to
create a 360° profile of you to sell you more stuff, right?
So they don't get that option through the Azure Marketplace, which means that what you often see in the Azure Marketplace, like that weather trends data set that I showed you, they actually have
weather prediction data, min/max temperature for a year out, that is what they are modeling, for every
single point on earth, so if there is actually not a weather station in that point, they will model the
weather for that point and then they will go back and kind of reverify the model on that point. There
are places in Africa where you are never going to get accurate, measured weather information from, but they didn't give me that data set, because they are going to retain that
information for the people that they are going to have a relationship with. So I like to think that a lot of
the data sets that we have are kind of loss leaders for those companies, or the data sets are perfectly wonderful for what I want to call the long tail, the 99% rather than the 1%, who are just satisfied to get access to the data because they never knew it existed. So here is my contact information, and here is a little bit more information about Azure Marketplace. I spent no time on the Facebook page for Azure DataMarket; I wouldn't necessarily go there for information, but we think that it is a great thing. There is a blog, but the best place to get the information is on the DataMarket site itself.
I see some people are writing. I will give you another minute and then we're going to switch. And Rob
you need to tell me when I am going to run out of time.
>> Rob Fatland: You are doing great.
>> Shoshanna Budzianowski: Yeah.
>>: What's next?
>> Shoshanna Budzianowski: Yeah, I think there's a question back here.
>>: One question on the data quality [inaudible] QA and QC. One mechanism might be how people comment on the data that you use, and [inaudible].
>> Shoshanna Budzianowski: Lovely.
>>: But do you actually address some other things, like, oh man, you can't comment on it because of so-and-so? Have you guys thought about that? Even now, everything you buy online has a whole mess of reviews, and one of the things, from the data user side, is I know a lot more about the data sets I use, because I get into them, than the provider often does, so I actually have information that I could feed back that would help other users plus the provider go back and [inaudible]. So have you thought about that?
>> Shoshanna Budzianowski: Well, we have actually thought about that. Did everybody get the
question? It's like how do you allow other users to comment on the data set so you can understand the
quality of it? And the answer is we do think about that a lot, but given where we are in this Marketplace
some other things have higher priority for us and one of the things that we've done is we've now taken
the Marketplace international, so we have the ability to essentially aggregate all of these data sets from countries across the world, from about 36 different countries; we've localized the data sets and we now accept payment for those data sets worldwide. So some of the geospatial support and some of the
world reach kind of trumped our ability or our desire to put in those crowd sourcing commentary kinds
of features and, you know, I think when you are in it like--for that private Marketplace when you are in a
trusted community, I would trust your feedback. I barely trust TripAdvisor feedback, and now we know that the TripAdvisor feedback in the UK was cooked. It wasn't all independent travelers. So I
am not as excited about that kind of model. I think when I showed you the noise meter model at the
beginning of the session and you all downloaded it of course by now I am sure. I would just put it up
there since it is our feed into Eye on Earth. So this kind of stuff gets me excited when we actually have
this ability to take fairly accurate I'd say, you know, readings off of sensors and devices. To me this is
commentary. The stuff that human beings type in, not so much unless you actually have a large enough
sample group to say that their feedback is statistically significant, but like feedback from one or two
users isn't really helping us that much. Okay, I am actually going to start. We are taking a reading right
now. Yeah?
>>: To follow up on from that point, I think it might be interesting, you know, we often do our literature
research when we found out publications that have been cited or cited by. It would be interesting for
data sets to say here are the peer reviewed publications or papers that have used this data set and then
you can drill back. Then the level of trust is much higher.
>> Shoshanna Budzianowski: I love that. That is a great comment. So it turns out that it is quieter here
than it is in my office, substantially, so this data point will be updated to that worldwide database in
about an hour. So let me go and switch now to Eye on Earth. And I'm going to show you something that
I am actually pretty proud of. This is called the Eye on Earth network and under the Eye on Earth
network are a number of environmental watches. Now the first three watches, the water, noise and air
watches were produced by the European Environmental Agency and it is part of their responsibility back
to the European Union to provide this information to citizens. Now, the water watch was actually first
developed about 3 1/2 years ago. Did I click the button? I have no idea. Okay, so it was developed
about 3 1/2 years ago in response to something really simple. People were going to the beaches all over
Europe and they would show up with their family and find out that the beaches were completely
polluted and they started to get mad and so they actually pushed back on their governments and
legislators and asked that in fact somebody give them some information about whether or not they
should bother to spend their time at the beach. So there are 22,000 stations across Europe that are now
monitoring water quality data. Now this is where politics kind of gets involved. So they monitor water
quality data on a daily basis and they've got quite a bit of data about that water quality, but what they
provide back to the citizens has been parliamentary-approved, scrubbed data. But the fact that the data is up there is pretty arresting. So these are sites across Europe. I'm going to go to one site and the
reason I am going to go to it is because I want to go visit it, because who would not want to go to Italy?
So Torre del Greco, it is actually a site on the coast of Italy near Mount Vesuvius. Has anybody been
there? Oh, gosh, so everything I have read about it is fabulous. It turns out its major industry is creating
jewelry out of coral in the water. At least that's what their major industry used to be. So here is Mount
Vesuvius, lovely right here; and I know that there is a lot of Roman history that goes around Mount Vesuvius; some of it's pretty bloody, but I won't go into that. Here is the coastline for Torre del Greco, and these little boxes that you see are water quality stations that are providing water quality readings
out to citizens. So I am going to select this one right in the heart of Torre del Greco and the reason I
select this one is because the water quality station is saying that the water quality there is very good.
And yet, the citizens have a little different information about that. In fact, what the citizens are saying is
yeah, I know what the science is saying, but the accountability back to the citizens says it is not clean. It
is actually polluted. Now I can go back and look at--get rid of that. Oops, wrong one. I did tell you guys
that I am not so good at operating things, right? So let me go back to that spot and see if I can get back.
So let me go back here and see the scientific reading for that, and what I see here is, yeah, there have been water quality problems at that site for many, many years, so the data here goes back to when we
first started this water quality watch back in 2007. So one of the things that is interesting is that
because I am a participant in my environment, I also have the ability to go in there and add my own
assessment and rating of the water if I happen to be at that site. So if I could go say moderate and then
I could go say the water is dirty, et cetera and rate it and that is my data coming back as a citizen back
into that database. So this is a live database. I am not going to submit this. The other thing you could
also do though is to go share it with your friends on your social media just so you can warn them. And
people read Twitter all day long to make decisions, so this tends to be another powerful mechanism for
sharing environmental information with the world. Now whether or not you think it is scientifically valid
is a different question. So the other thing is that this European Environment Agency has watches for air. It has watches for water, and it has watches for noise. So, like, why does this exist? The solution exists because governments around the world are actually getting pressure from their citizens to be open and transparent and to allow their citizens to participate in their environment. The passive perspective that we've had in the past, where citizens would just take what government wants, and if they wanted to put a garbage dump next to your house, only a few people were going to complain about it, that's not happening anymore. What we are seeing in a world of limited resources is people becoming activists in their environment, and this is one of the approaches being used to become activist. So why is Eye on Earth important? Because what I am showing you here is,
and there is tons more stuff that I can show you about this, but I'm going to go back to that first page
back there, on here. What Eye on Earth is not is just a water watch or an air watch or a noise watch or a
marine diversity watch or a water surface temperature watch; this is a network that we have created
that allows communities, scientific communities around the world to contribute their information in a
way that can be shared with other scientists and with citizens. And so we call it the Eye on Earth
network because each one of these solutions that you see down here is now being submitted by
scientific organizations. Now the European Environment Agency, Esri, and Microsoft did a pretty nice thing, which is we bought SQL Azure storage. We bought Azure Compute, and through the Eye on Earth network we are offering it to citizen scientist groups across the world virtually for free, so scientist
groups who are kind of authorized by the European Environment Agency can create their solutions and
have them published here and available to the world essentially for free.
Now there are other organizations, like, I will say, Russia, which isn't necessarily a small NGO and which has its own prurient interests; they have a paid account. There is some information around forestry that
they will be providing into the system and that we will host for them for free, because we actually think
that their forestry information is fascinating. So I will tell you a little bit of a story that really blew me
away. So the head of the European Environment Agency is Professor Jacqueline McGlade; there is an interview of her. Have any of you ever heard of or met her? Okay. She is a pretty amazing woman. She has a PhD in marine biology, and she taught computer science as part of what she was doing in academia, and then became the director of the European Environment Agency. So
she said, well, I was in Russia and I went to meet Putin, and we are at one of his, you know, winter dachas, and it is the middle of the winter and he's got his shirt off and he is there barbecuing, bare, because he is a really masculine guy. And she says to Putin, Putin, you know, I really need you to go
measure your forests. I need to know what the carbon footprint of the forest is. You need to tell me
what the distribution of the different types of trees are, how big your forests are; this information is
critical to be able to collect carbon information for the world. And so she said so Putin are you
interested? And he said yeah, I am very interested. I can get this done right away for you. I will just
take 22,000 people and send them out to the forest now. Like, okay. So it goes to show you that, you
know, connections mean everything in the world. And in fact, the one thing that the European Environment Agency is doing right now is taking this platform and the solution to the world to evangelize world governments and NGOs sharing their information with citizens and scientists across the world. And so, although this solution has been around for three years, we have re-launched it as the Eye
on Earth network at COP17 in Durban, South Africa. It was launched with a new food watch that my
team actually built at the Eye on Earth Summit in Abu Dhabi in December. We will be taking this
solution to the UNEP conference; I guess that is like March. And then we will be at Rio +20 with about
10 other NGOs besides the ones that we have launched with here. Now, I am just going to wrap up really quickly. I'm going to tell you one thing that is really interesting that is happening right now in the trunk of my car. One of the things that we discovered in South Africa is
we met a woman called Sarah Collins. Sarah Collins is the CEO of a company called Natural Balance. It
doesn't matter. She actually created a very simple intervention to help people across the world, and
especially in suffering nations, reduce the amount of carbon that they are using while they are cooking.
So you know how we have been across the world trying to solve the cooking problem because we know
that that particulate matter that kind of gets dissipated into the air when people cook inside their
houses is poisoning them, number one. Number two, there are just simple issues about access to fuel, whether it's kerosene, whether it's wood, whether it's natural gas. Not only is it difficult and dangerous to collect, especially wood, in some parts of the world, but for those that have to pay for it, it costs
a lot of money. It can cost up to 60% of a poor family’s income just to have fuel to cook their food. So
what she did is invent the Wonder Bag. And in the trunk of my car right now I am cooking a stew in the
Wonder Bag. Now I could've brought it in here but it would never make it to my team meeting later,
right? So the Wonder Bag is a simple quilted bag about this size that is filled with polystyrene foam
balls, all of that recycled, you know all of that polystyrene foam stuff that we can never get rid of? She
actually put it in a quilted bag with a top. It is big enough to fit a pot. So now instead of essentially
taking a stew and especially if you are cooking the stew with beans or with some meat that is not so
well, it takes hours to cook, so instead of having to use fuel to keep that food cooking for 5, 6, 7 hours,
what you do is you use your normal cooking pot, your normal cooking methods, your normal fuel. You
heat the food, boil it, get it cooked, fry your onions. You put the pot in the Wonder Bag, close the
Wonder Bag and it continues to cook for up to six hours. And so I now have one; I don't think I showed it to you, but she actually delivered a Wonder Bag to me last week
and I am now cooking a stew, you know, I assembled some beans and put some raw rice in it and I put
some fried onions in it and some tomatoes in it and it is actually cooking in the trunk of my car and so
we will see that later. You won't see it, but my team will see it later. So what I did, and you can say like
well, can I afford to cook, you know, like a pot of stew, and the answer is yes. But instead of running my stove for five hours, I am running my stove for 45 minutes or less. Now why is this important?
This all sounds great, but Sarah Collins is doing something else. She is actually contributing to the
carbon credit economy in the world. She is working with the UN Environment Agency to essentially
collect carbon credits for every use of the Wonder Bag worldwide, which is amazing. And so that is how
she is going to support her business. The data that she is collecting--you know, I've got to say, getting
data off of Wonder Bag is not as easy as getting data off of a smart phone; the way she is collecting the
data is she is creating an economy around it. She is actually working with a company in South Africa that
is sending agents out to the field into homes to validate that the bags are being used and because she's
got the validation, she can get the credits back. And what is she going to do with this data? Well she is
going to put it up on DataMarket. And so now DataMarket is kind of an ecosystem for accountability
back to the world and the environment. And so with that I will leave it to questions. Yes?
>>: So you mentioned [inaudible] which is a private company in European [inaudible] what is the
partnership with the federal agencies here and the state agencies and those sorts of things?
>> Shoshanna Budzianowski: Well, we have, you know, me and my engineering group; I don't have
relationships with everyone in the world. We actually have an environment sustainability group here at
Microsoft and there is a gentleman named Rob Bernard who is our chief environmental officer for
Microsoft, and he and his team have relationships with agencies throughout the United States and
throughout the world. And so they have been a really great partner for us for creating contacts with
those agencies. In fact when we were first envisioning this solution, I think we must've talked to
virtually every agency in the United States. We approached them and said please make your data publicly available on DataMarket, and now we are back in the protracted conversations, which take a while. I can't remember which agency in the United States handles water, but we
have actually been in negotiation with them about using Azure Marketplace for about six months now.
There's a guy on our team who is handling that. It just takes a long time to--everybody can publish data.
And this is what you see, this open data, everybody publishes data, right? We can get it. You can get it
in different formats, KML or shapefile if it's geospatial, or, like, they will put up their random REST interface or
they will put up an Excel spreadsheet, but they put header information in the Excel spreadsheet. There
is so much process built into these organizations around getting the data out, but the data doesn't get
out in a form that is great. So as soon as we start talking to an agency, it ends up being a protracted
conversation about their own internal processes, and that's why it takes so long.
>>: Do you expect to see some [inaudible] NASA data Azure market or geo-data [inaudible] at some
point?
>> Shoshanna Budzianowski: Yes, at some point you will see it. NASA data, absolutely we are working
with them. As a matter of fact we had NASA data up here. But they kind of changed their focus from
outward looking, interplanetary, back to kind of inward looking and so they are making a change on the
data that they want to make available as well. Yes?
>>: You intimated repeatedly in your talk the focus on the social and economic value of the data
products, and at the same time you've indicated that there is some [inaudible] of environmental,
especially oceanographic data. I am curious as you speculate where do you think the economic market
is for oceanographic data?
>> Shoshanna Budzianowski: Yeah, that is a really good question because, you know, the people that
need that data, like, deep oil and the energy industry, not only do they have the data but they don't
necessarily want you to have the data. And so as I said, a lot of the data that we are getting here is the
long tail data. There is not yet a healthy marketplace for data in this anonymous kind of model and I
don't see this as being something that is going to happen immediately. I think this is years, maybe a
decade before we actually have a really healthy model. We are very, very early in kind of discovering
the power, the economic power of data. Sad to say, the data business worldwide is approximately a
hundred billion dollar business, but most of it is going to Bloomberg and Reuters and AP, and people are holding onto data that people actually need to buy. Yes?
>>: You have a very global point of view and [inaudible] is a regional Association, are you kind of, is your
point of view such that you aren't interested in regional problems and regional data sets and regional
uses of…
>> Shoshanna Budzianowski: No, we are absolutely interested in regional problems and regional data
sets. Finding curators for those data regionally who are willing to make the commitment is more
difficult. So we, I want to say we are cheap. We will talk to practically anybody who has a data set that
we want to put on Marketplace and we will help them and as a matter of fact the Azure Marketplace
now has a self-publishing portal. They don't even have to call us and talk to us to put their data up
there, so it is more a matter of education that this facility exists, that it is available and that people can
use it. Now I did talk to some people at Northwest Fisheries just recently and they said, oh, we didn't know what this Azure Marketplace is. We never heard of it. So I would say we still have some work to
do on our side to make sure that people know that this outlet exists. Yes?
>>: I'm just wondering what efforts you've made towards involving volunteer [inaudible] and their data sets? I love that you are interested in the citizens’ feedback, but especially in the United States there is a huge network of [inaudible] monitoring projects [inaudible] reliable [inaudible].
>> Shoshanna Budzianowski: So the question is--there is a blower happening out there--have we thought about voluntary monitoring projects coming in from, and I am going to use the word, citizen
scientist. The answer is yes, we have thought about that and this Eye on Earth network platform is the
platform that can be available, not only to NGOs but for citizen scientists as well, as long as they can be
semi-certified that, you know, their data is reliable then they can come in here and create a watch. So
you are right, there are science teachers in high schools who will go out and monitor streams every single month for 10 years, and that is critical information. And they now do have an outlet for publishing that
information to Azure Marketplace and for creating these watches here. It is really a matter of
evangelism around the solution at this point.
>>: You mentioned that you had some methods for semi-certifying data. Can you talk a little bit more
about the steps you've taken, how do you go about doing that?
>> Shoshanna Budzianowski: We certify the providers, not the data itself. So I am pretty careful to say that, you know, we are not doing the analysis of the quality of that data. What we do is certify the provider. They sign contracts with Microsoft both to assert that they are the owners of that data and to assert that the data will be available to you for the period of, you know, the contract. They, by the way--I need to tell you this--provide their own terms of use, terms of service, for their data sets. This is difficult. Most of the providers do not offer commercial terms. I don't know if that is concerning to you. The data is typically available for industries to use for their own internal research or for scientific research, but each one of these providers will have their own contract, and they have a contract with Microsoft, and that contract also includes a service-level agreement on availability for the data. So they sign a contract that any request to their data set must return a response in 5 seconds or less. I know that doesn't sound great, but, you know, it's not three hours. Yes? Another question?
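The 5-second service-level window described above can also be enforced from the consumer side. A minimal sketch in Python, using only the standard library (the endpoint URL in any real call would be the provider's dataset URL; nothing here is a real Marketplace API):

```python
import socket
import urllib.error
import urllib.request

SLA_SECONDS = 5  # the per-request response window described in the contract

def fetch_within_sla(url):
    """Fetch a dataset URL, giving up once the provider's SLA window has passed.

    Returns (http_status, body) on success, or (None, error_text) on
    timeout or connection failure.
    """
    try:
        with urllib.request.urlopen(url, timeout=SLA_SECONDS) as resp:
            return resp.status, resp.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        return None, str(exc)
```

A client that times out after the same window the provider contractually promises never waits longer than the SLA allows; a miss shows up as a fast failure instead of a hung request.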
>>: So Eye on Earth exists as an application, and if, let's say, [inaudible] wanted to make a specific application of Eye on Earth for a specific way of engaging with, let's say, citizen monitors, is this something that is possible at this stage now?
>> Shoshanna Budzianowski: Uh-huh, now. This platform is, you know, I mentioned Esri before and the
reason this platform is in the commercial and available state that it is is because we built on the ArcGIS
system and that is a system that, you know, Esri has been essentially supporting and selling for, you
know, I don't know 10 or 15 years. The company is older than that. This solution is actually running on
Azure so it is a solution that runs worldwide. It is not hosted in anybody's office park. It is actually
hosted on Microsoft servers across the world. The ability to go and create your own content exists and
then we now have a mechanism in place for promoting your content and asking the European
Environmental Agency to make it part of this Eye on Earth network.
>>: But it doesn't have to be European? It could be [inaudible]?
>> Shoshanna Budzianowski: Actually a lot of solutions are coming from places across the world. Yes, so the food watch that we did, which isn't on this list yet--and I will tell you why it isn't on this list: because the data set isn't live yet; I don't have a live stream on the data set--it was created in a matter of--I've got to tell you, it took us three days to create that watch, because we are using templates that are very powerful here, and it only took three days because we needed to get access to the data and we needed to understand the data types. We needed to create the schemas, and that was the hardest part for us, but it is literally available now.
>>: You were talking about publishing data using the Azure Marketplace. If I publish data, where is the data? Is it in a database somewhere that is stored in a cloud? [inaudible] private server, because you said that you need to provide access to the data in 5 seconds.
>> Shoshanna Budzianowski: I am just taking you through a couple of links so you can do it yourself. As you go to the DataMarketplace there is a Learn tab. Here is the Learn tab. See where it says submit apps and data? Click that, and then there is a wizard down here that takes you through how to submit your data. Now, most of this asks you for information about you and your company, et cetera. Your data can reside in multiple places. So if your data is up on SQL Azure it could be anyplace in the world, but as long as it is hosted in SQL Azure, it is very easy for us to create an OData interface on top of it. We do that automatically for you, and we actually take responsibility for the service-level agreement, because guess what, Microsoft is responsible for responding to users who make queries through Azure. You could decide for yourself, though, that you want to keep your data on premises, and what you would do then, instead of telling us the schema for the SQL Azure database or pointing us to it, is essentially give us a REST API to your data that resides on your database. That means you will give us a set of queries that you want us to perform on it, and you will give us examples of the responses that you will provide to those queries, and then we will go and proxy your service with an OData interface. So we will still put an OData interface on top of your data, but that lets you keep your data on your own servers. That also means that you are going to have to have somebody on board 24 hours, or have an operations group, to make sure that the data is always available to people when they query it. Yes?
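The OData interface Shoshanna describes is, at query time, essentially a URL convention: system query options such as $filter and $top appended to an entity set under a service root. A minimal sketch of building such a query URL; the service root and entity set names here are hypothetical, not real Marketplace endpoints:

```python
from urllib.parse import quote

def odata_query_url(service_root, entity_set, filter_expr=None, top=None):
    """Build an OData query URL of the kind an OData proxy exposes.

    System query options ($filter, $top) go in the query string; only the
    option values are percent-encoded, so the $-prefixed names stay literal.
    """
    options = []
    if filter_expr is not None:
        options.append("$filter=" + quote(filter_expr, safe=""))
    if top is not None:
        options.append("$top=" + str(top))
    url = service_root.rstrip("/") + "/" + entity_set
    if options:
        url += "?" + "&".join(options)
    return url

# Hypothetical service root and entity set, for illustration only
url = odata_query_url(
    "https://api.datamarket.example/v1",
    "WaterQualityReadings",
    filter_expr="StationId eq 'NIS-01'",
    top=10,
)
print(url)
# https://api.datamarket.example/v1/WaterQualityReadings?$filter=StationId%20eq%20%27NIS-01%27&$top=10
```

Whether the data lives in SQL Azure or behind your own REST API, the consumer sees the same shape of URL, which is what makes the proxying she describes transparent.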
>>: Two questions, one a detail. So you didn't really change the waterfall on the beach in Italy from a
place in Washington right?
>> Shoshanna Budzianowski: No [laughter]. Well, the thing is, with this system, you know, everything is really interesting, but I could've hit that button. But that is why I do not trust, you know, me. I keep pulling up the noise watch application because I really don't trust citizens in any manner, and nobody trusts that data, right? It's mostly informational. It is the same kind of stuff you could've done on Twitter, which is complain about something that you actually have no insight into at all. I do trust the water quality readings, though--that histogram that we saw that came out of that water quality assessment station. So in a place where humans add data, not so valuable--maybe it is in aggregate, if you've got a billion humans adding the data--but sensor data is much more accurate. We actually have a nature watch that is being announced pretty soon, and the nature watch is kind of a mixture between those two. So there will be a mobile application that lets you take a picture of fauna, because we have lat-long location now on our phones that we share. We will have a picture of the fauna, a lat-long location, and then users will have a pictorial database where they can go and try themselves to identify that fauna, to help with some of the processing on it. That I trust. It's not citizen commentary. The citizen commentary I don't trust as much.
>>: That's good. All of the beaches in North Carolina are closed [laughter]. [inaudible].
>> Shoshanna Budzianowski: I know people are honest, but I'm not honest.
>>: The other question is, this is a group here in which we collect a lot of disparate things, and one of the things I want to be able to do is come up with integrated products and sort of be able to address natural language type questions. This is probably not that good an example, but: what is the environmental variable that controls when you can go out on the beach? Is it waves, is it water quality? And you don't really know what data is there; you just have this question for one of these places in the Northeast or whatever. We were talking about it. We know the data sets are there; we can all drill in and find stuff, but have you looked at sort of the higher-level kind of integration of really disparate data?
>> Shoshanna Budzianowski: Yes, yes, yes, we have. It is not part of the Marketplace right now. Before, we actually spent a lot of time looking at linked data and RDF and semantic graphs and, you know, how to navigate and query across those semantic graphs. I would say that's more in MSR, in the research phase, at this point, and what we haven't done yet is turn that into an actionable product that the rest of the world can get. Yeah. But it is very interesting. It's also very dangerous, by the way, for other reasons. Okay. Good? All right. Thank you very much everyone for your time. I really enjoyed this.
[applause].
>> Rob Fatland: Okay. From my point of view, I could not have done anything better than to let Shoshanna talk like that. That was sort of the most relevant thing that I am aware of that we can communicate to this group. What I want to do now is just sort of wrap up the Microsoft-focused, or Microsoft-centric I guess, perspective on the presentation here. I want to mention that my colleague [inaudible] was just here but she left. She probably had something else to go to, but she is also in my group, and she runs a workshop called Open Data for Open Science. This is April 4, 5 and 6 this year, and if you have a technology problem around Microsoft technology that involves, you know, how do I get this thing done, the purpose of that workshop is to bring together our experts with our collaborators, basically with people who are interested in solving those problems, in a sort of let's-get-down-to-it kind of environment for three days. So you can get in touch with us about that if you are interested in taking advantage of it, and that is April 4, 5 and 6. The Azure DataMarketplace, as we just saw, is commercially built up; it's part of the commercial segment of the company, and I personally think of that commercial segment as a lot of people--a lot of time and energy and a lot of resources go into it, because it has to work, and obviously we've all had PowerPoint hang on us or whatever. So even when it has to work, sometimes it doesn't work. Computers are complicated things.
In contrast, however, in Microsoft Research our charter is a little bit different. We go in and investigate stuff. We tend to build mockups. We build tools. We don't build commercial applications. There is this very clear division in the company. And so what we tend to do is come up with cool ideas, with only one or two people working on each, and you sort of carry it through as far as you are interested in carrying it, and then you have to figure out what your exit strategy is. So what I wanted to do is talk for a minute about SciScope, and
SciScope is a Microsoft Research project. It is not Azure DataMarket, but as you'll see, it carries some
common features. So I wanted to start off by talking about process here. We kind of got into how you trust data, and maybe you use social network mechanisms to see ratings for data, and if you had this sort of world where everybody was publishing data like crazy, you would have to sort through that somehow and decide which you could trust, and read the caveats [inaudible], and at some point you are leaping off a cliff and hoping that what you are basing your decision on is good; and if it's not good, then you're going to lose money, or your paper is going to be refuted in the next publication, or whatever. That being said, I am going to show you two things to finish up. The first one will be SciScope, and the second one will be back to the Layerscape stuff, where we are really trying to anticipate some of these issues about data value and trust and so forth. So to get into this idea, I
wanted to sort of come up with an image, so I came up with the guys-walking-into-a-bar joke. There isn't a joke actually; there is just: two guys walk into a bar. One of them is a technologist and one of them is a scientist, and they are working together to come up with a design for a system. And the technologist is leading. The technologist, we will say, is a programmer in particular, and the technologist will say, I have this tool and I have this tool, and, you know, I work at Microsoft so I use Visual Studio and I write in C#, and they will tend to talk in terms of the tools of the trade. And then the domain specialist, if you like--the marine scientist or whatever--will be thinking, I really need to understand where all of the salmon are in the middle of the summertime, because I have no idea where out in the ocean they are. And so you have this process of having conversations between the technologist and the scientist, and
it has to be a partnership, because if you let the technologist lead, you will tend to go down into what they find cool and interesting, because programmers--and I am a programmer--like to build stuff that they know how to do and find really cool. And the problem with that, my boss likes to say, is that you shouldn't let what you are capable of doing design your user interface. Your user interface should be driven out of what your end user needs. In our group, I mentioned, we have this book called The Fourth Paradigm, and a piece of that Fourth Paradigm business is that one of our mentors, who is unfortunately no longer with us, was Jim Gray, and he talked about this 20 questions process.
And Shoshanna used the phrase queries that you want to perform on it; it being some kind of data
system. And the idea of 20 questions is worth mentioning because I have found it a really useful process
to go through. So I am a technologist. I go to my collaborator, and let's say they are doing biogeochemistry. They want to make a cloud library: they go off and generate an absorbance spectrum, and they want to be able to put it into this library, be able to see it later, and have somebody else put their absorbance spectrum into the library and find out, you know, what does my absorbance spectrum look like? Does it look like the Congo River, or does it look like the [inaudible], or does it look like the middle of the Pacific? So allowing the technologist to drive it, you'll end up with one sort of interface; but if you allow the scientist to drive it, how do you do that? How do you do that process?
And I just think this is really interesting. So the answer is, the technologist says to the domain scientist,
the marine biologist what are the 20 questions that you want to ask of your data? You have to sit down
and you have to write them out. You have to come up with that list, that description of how you think
about it. Or let's go back to the papers you have published and look at the 20 typical graphs that you
have generated. If you start to go down to that level of detail--what are you really trying to pull out of your data--that is one way of approaching the design process. So the design process and the technology have to fit together, and it is a really hard thing to do, and if you see a system, let's say a data portal, where it isn't doing what you want it to do, there is a good bet that in the design process at the beginning there wasn't enough thought given to how people are going to use the system. And again, I
am totally guilty on that technology side because I really want to get my fingers going on a keyboard
writing the code to do my next idea.
It is very painful to have to back up and think about the next five years of using this system. One typical example in this space is exemplified by SciScope, and unfortunately SciScope is not up right now or I would show it to you, but I have this tutorial here that I can scroll through, and I will just describe to you that there is a map interface. It is hosted on a server at the Berkeley Water Center, and the idea is that SciScope has its own vocabulary, a bunch of keywords. So you would go to this site and you would use this interface over here to specify your parameters. You would give it a time range--that is, let me see if I can highlight it; these two calendars here are for that. You'd use these location choosers either to draw a polygon or choose a preselected, precut polygon, and then you would go down and you would select a keyword, and the keyword might be discharge, like in a stream gauge, or it might be diazinon. In fact, I go down here and--it's a little slow, huh? Well, if it's going to be this slow…Keywords, there we go. Okay, so there is a partial list of the ontology, if you like, or the vocabulary of SciScope. You type in two letters and it tells you what you've matched so far, and you can sort of sub-pick. So you give it that for your search, and then SciScope comes back with a
search hit, or a set of hits I should say. And I am working up to a sort of philosophical point. For some reason it is just really slow. Well, we will just take that to be our hits. So I have chosen an area down by [inaudible], and I do a search and get those two orange dots back, and it's just like a search engine, where I have gotten hits where the data is available. So that is my search query result. But the important point here, and my whole reason for starting down this road, is that SciScope doesn't have data. It doesn't know what the data is behind those two orange points. What SciScope does is go off and talk to the USGS system. So it's got custom code--here is this custom web REST interface, whatever the interface is; somebody wrote the code to go talk to USGS--and it does it on Sunday night, and it says, tell me about the data you have, and the USGS says, well, we have stream gauges here and here, and we have data in this, this, and this time range, and SciScope scribbles that down in its own database. So when you do a SciScope query, you are actually querying a metadata catalog about what and when and where the data is, and you are getting these hits back. And this brings up this idea that if you had a whole bunch of SciScopes built out there and they all knew about each other, then they could each service a different type of request, and if you gave one a keyword that that particular SciScope didn't know, then it could hand the query to another SciScope that did know it, and so you are confederating data systems, and this is all behind the scenes. As a user, your experience is: I come to the map, I set up my search, I say go. SciScope does everything else invisibly, and that is why we call it the one-stop-shopping idea, because I could've searched for diazinon, which is at the EPA STAR website. So the idea here is that SciScope is a metadata catalog. It doesn't know what the data is; it just knows where it is and when it is and what the keyword is. So far, so good, but I actually want this data. So I click on one of these icons, and now SciScope says, oh, you are serious, okay. So finally SciScope goes over and talks to a data repository that does have the data, and it gets it back, and it hands it to you in some form.
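The harvest-then-query pattern described here--the catalog stores only what, where, and when data exists, and the data itself stays at the source--can be sketched in a few lines. The records and URLs below are made up for illustration; this is the shape of the idea, not SciScope's actual code:

```python
# A toy metadata catalog in the SciScope style: harvest records that
# describe WHERE/WHEN/WHAT data exists, without copying the data itself.

catalog = []  # each entry: (keyword, lat, lon, start_year, end_year, source_url)

def harvest(records):
    """Simulate the Sunday-night harvest: store metadata only, not data."""
    catalog.extend(records)

def search(keyword, bbox, year):
    """Answer a map query from the catalog alone; hits point back to the source."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return [r for r in catalog
            if r[0] == keyword
            and lat_min <= r[1] <= lat_max
            and lon_min <= r[2] <= lon_max
            and r[3] <= year <= r[4]]

# Hypothetical harvested records, in the spirit of the USGS stream gauges
harvest([
    ("discharge", 47.1, -122.7, 1990, 2011, "https://example.usgs.gov/gauge/1"),
    ("diazinon", 35.2, -80.8, 2001, 2009, "https://example.epa.gov/star/42"),
])

hits = search("discharge", (46.0, 48.0, -123.0, -122.0), 2005)
print(len(hits))  # 1 hit; fetching the actual data is a separate call to source_url
```

Only when the user clicks a hit does the system follow `source_url` back to the repository that really holds the data, which is exactly the two-step flow described above.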
Okay. So the form that we chose is Excel spreadsheets, and there is a page of data and then a page of metadata which describes what agency gathered the data and other parameters about it. So that is where you would put things like error bars and so forth. SciScope was stood up and was running, and we had a great time with it, and it's actually a follow-on to an earlier version that's part of the--what's the water consortium named, the organization, I can't remember at the moment. But then what is our exit strategy, because we are Microsoft Research. We can't sit here and maintain this system in perpetuity, because we have to move on to the next thing. So our exit strategy was to simply publish SciScope and keep it standing up at the Berkeley Water Center, except it occasionally goes down, and then to publish the code. So the code is out there on CodePlex; it is published essentially as open source. And the idea is, if we communicate about this and people think this is a really good idea, they can go get the bits and they can stand up their own SciScope. Now, it is cool when that happens, when people sort of get excited about it and start to adopt it. We didn't think much of it; we kind of walked away from it and we just let it sit for a year or two, but now the Open Geospatial Consortium is starting to get interested in designs like this, and so that is kind of where it is. So the exit strategy might be that we stand it up for a while and maintain it and see what kind of response we get, or we can just put it out there as open source and move on.
And then, so that is my little spiel about SciScope, and I am really not trying to promote it for use; I am just trying to promote it as a set of ideas: you don't always have to think about data repositories. You can index into them using your technological tools. I will mention in passing another project that we have going on, one that is more along the lines of an ongoing, we're-going-to-keep-doing-this kind of deal, and it has to do with publications and authors and so forth. It is called Academic Search, and you can use it to search through however much content we have indexed. But it also has a co-author graph, so you can put Jan Newton up here and you can find what her connection is to other authors. I will just try to type one in and see if it works. [laughter].
>>: Direct connection.
>> Rob Fatland: So Jan is connected to her co-authors. Let's see what happens if it's [laughter]. Ah,
good. So Jan apparently wrote a paper with David Kirchman who wrote a paper with Rudolf Amann who
wrote a paper with Max Planck who wrote a paper with Einstein, so your Einstein number is four.
>>: Wow.
[laughter].
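The co-author chain just demonstrated is a shortest path in an undirected co-authorship graph; the "Einstein number" is the hop count. A breadth-first-search sketch over the chain named in the talk--the edges are illustrative, taken from the spoken example rather than real publication data:

```python
from collections import deque

# Co-authorship edges mirroring the chain named in the demo (illustrative only)
edges = [
    ("Jan Newton", "David Kirchman"),
    ("David Kirchman", "Rudolf Amann"),
    ("Rudolf Amann", "Max Planck"),
    ("Max Planck", "Albert Einstein"),
]

# Build an undirected adjacency map
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def coauthor_distance(start, goal):
    """Length of the shortest co-author chain, or -1 if the authors are unconnected."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        if author == goal:
            return dist
        for nxt in graph.get(author, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

print(coauthor_distance("Jan Newton", "Albert Einstein"))  # 4
```

Breadth-first search guarantees the first time the goal author is dequeued, the path length is minimal, which is why the demo can report "your Einstein number is four" rather than some longer chain.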
>> Rob Fatland: So this is a project we are taking very seriously. The exit strategy here is no exit strategy. We are just going to keep doing this. [laughter]. And actually the real exit strategy is that it will hopefully become adopted in perpetuity by other parts of the company that are better at maintaining things like this, but right now we're just having a really good time going to different organizations. We go to different publishers now and we say, can we index your stuff? And in the Academic Search part you can probably pull up an abstract, but you can't necessarily pull up the full PDF of the paper that you find, unless the publisher has made that available. So there you go, Academic Search, and that was en route to my last point here, which is getting back to Layerscape. So Layerscape again is a portal site, and I was going to digress into portals and how portals are dangerous things, but I think I will just leave that alone and try to wrestle this thing to the ground. The important thing about the Layerscape website: right now it's called communities.worldwidetelescope.org, and in a couple of weeks it will just become Layerscape.org and we will do our gold release. But when you get here, you can see some featured content over here, and if I click on that, hopefully it will take me to the page of that featured content, and you can get an idea of what WorldWide Telescope is capable of doing by hitting play on the video that is there--but not all of the content has a video associated with it.
[video begins].
>> Rob Fatland: This is just amazingly slow. Let me just try something here. That's much better. So I can play this tour as a video here, or I can actually say View Tour, and for that I have to install WorldWide Telescope on my PC. But you will notice that there are ratings built in here, and there is a publish button up there. So if I want to publish my own content--I have some new content that I think is worth putting in here--then publishing it can be as simple as just browsing to the file that you want to publish. It can be a WorldWide Telescope tour. It can be an Excel spreadsheet. It can be a JPEG. It can be any kind of file; there is no restriction. And then I can click the publish button, but if I want to I can also add some descriptive information, a little thumbnail for it, and then I can give a citation. And that is sort of the last thing that I want to mention: by citing data in this way, or citing a source, I am accruing value to the person who provided it. So if you give me some data because you want to see what it looks like in WorldWide Telescope, I will definitely be putting your name, or whatever you like, in that citation box. And so it is a step towards this thing that is important to me, which is incentivizing people to publish data. If I say to you, hey, I need you to publish data, and you agree with me, that's great; then your next question is, how hard is it to do? And I say, well, you have to go through the long process of filling out multiple forms of metadata, you have to describe your experimental procedure, the papers you published--and pretty soon you have decided not to do it, right? So I just have this interesting thought: wouldn't it be nice if we could make the publication process such that you get to choose how difficult it is. So the easiest possible thing would be like it is here, where I just have to upload the file and click publish, and for that there is no metadata--so how trustworthy is it? There is not much to it; the data is not really usable if I just click a spot on the map. But if you make the easiest barrier
to publishing really small, then maybe some people will start stepping across that and you sort of try to
go for that accumulation of momentum, and pretty soon people are saying, well, I have to publish my data set--there goes the next two hours--but they realize they are going to get something out of that. Let's say it ends up getting used; then that will be something they can carry to their tenure committee, or something they get value back on from their community, in the case of an operational data set. So I just wanted to throw that out there: we think about all different aspects of the data process, and for me an important one is incentivizing people to actually publish data sets, because I think we've all had the conversation about how much data we have sitting back in our careers, on our shelves and on floppy disks and so forth, and wouldn't it be nice to get access to all of that? So that is one of our topics to consider. With that, I will just say that it would be great if anybody wants to go and experiment with Layerscape and find out what is there. Hopefully you would be not too averse to installing WorldWide Telescope and seeing it in its native environment. Send me e-mails if you have questions or comments. We love to get feedback, and that is, I think, everything we will do there. So again, thanks for coming here. We will take a coffee break and then I think we will roll over to the discussion phase that will be moderated by Jan, and that's it, so thanks.
[applause].