Document 17828056

>> Rob Fatland: Welcome back for day two. I will try to keep my voice from overwhelming the speakers
here. So again, I am Rob Fatland, and I do have business cards today, finally backed by the remaining
epiphytes; get them while they're hot. Let's see, so yesterday we had this fantastic day of presentations
kind of all across the spectrum of data production and use, and so today our program as I understand it
has sort of two parts. There is this initial section that I will kind of be directing and that will feature a
presentation by Shoshanna and some stuff for me, and then I think we have a coffee break and then we
have sort of ongoing conversations and then Jan and some of the other I guess Jan's folks will be
moderating that.
So what I wanted to start off with today: here you are seeing a WorldWide Telescope Layerscape tour running in the background. The main point here is, first of all, if you're standing up and you watch
this, you could accidentally fall over [laughter], at least I often do. So this is data from U.S. Geological
Survey, I think. It is high resolution imagery of the Nisqually Delta down in the south sound. And I just
popped it upside down to make it a little bit easier to see. You can also put it down to its correct depth, and the depth is exaggerated and so forth, and then it's going to start over again. And I will leave that
running while I talk for a couple of minutes on some of my themes and then we will turn it over to
Shoshanna.
So one of the things that we are interested in as I mentioned our group is doing environmental science
informatics, the arc of data from the conception of a research project all the way through to let's say
publications. And I have written down for my own reference here some of the topics that I can sort of
gas on about for a while. I am really in my element when I am talking about how to do nuts-and-bolts, detailed things with software and with, let's say, you know, types of data, and I have a research project
that involves biogeochemistry so I'm really happy and enthusiastic to talk about, for example, the things
that you can do with a mass spec in relationship to fluorometry and so forth. But where I get out of
my depth is when I start talking about generalized problems, but everybody kind of likes to do that so
my list of generalized problems has sort of two broad categories. There is a process question, how do
you get to where you want to go, and then there are questions that are surrounding data. So I will talk
more about that later, but when all is said and done, eventually you have to start building a tool, and that, like I say, is when I'm happiest: when I have something that I definitely need to build and I can get working with engineers and software people and designers and so forth.
I worked with a guy at Johns Hopkins and he made this observation that right now if you want to make a
phone call, you don't need a software engineer. You don't need help. You just go to the store and get
the phone and it works because there are 7 billion of us who need cell phones more or less, or however
many, and that is the consumer market, but if you look at that technology and what it is actually doing
and then you have a research problem, like why can't I use that technology in my research, because it
would be so useful to be able to have that research equivalent of a cell phone; there's no market for it.
There are not a lot of people trying to make money off of that, so we have this kind of carry the
technology across conundrum, and my friend at Johns Hopkins points out that today if you want to go out and build a sensor network and install it in some trees or in some soil or something, you kind of need
17 technologists for one ecologist is the ratio [laughter] and it would be nice to reduce that to 5 to 1 or 1
to 1 or .5 to 1 or an undergraduate to 1 who is just doing this stuff in their spare time. So making solutions that are software-based and engineering-based and that can be replicated is, like I said, one of our important themes.
This particular solution, if you like, takes advantage of consumer electronics because everybody has a
graphics card in their laptop that is capable of rendering 1 million polygons or something like that, so
this is a data set of something like 2.4 million data points, and I've only got 1.8 million data points into WorldWide Telescope before it said no, I would really rather not have any more data points, thank you.
But you know I pushed it there and it just took a little bit of work, but half a million is sort of the range.
And then the fact that WorldWide Telescope knows about time and you can set up your data with time
tags and say okay, go and play through the time of the data and when you get to the end start over
again. And that is happening in WorldWide Telescope completely independent of what you are
looking at, so you can fly to the other side of the world, but you can fly around your data and look at it
as it is evolving in time, so these are the two strong points of the visualization piece of WorldWide
Telescope, and as I say we built an ecosystem around that called Layerscape. It's a website and it is an
add-in for Excel. What is an add-in for Excel? Well, Excel is becoming a programming environment. You can write your own ribbon in Excel; it is called an add-in. And the model for WorldWide Telescope is
that you are a domain expert. You've got some data. You know how it works; you know what analysis
you want to do on it. Don't try to do that with WorldWide Telescope. WorldWide Telescope is not
Matlab. It is not some programming environment. It is a visualization environment, so you abstract
your thinking about your data into a separate data app and you connect that to WorldWide Telescope
and you put your data in and you look at it. And then being able to be inside WorldWide Telescope and
ask questions back out is sort of where we are right now. That is where we would like to build next. So
the utility of WorldWide Telescope, therefore is to handle a lot of data, handle complex data and to be
able to use your GPU to render this stuff to fly around it, to see it play out in time, to investigate it and
the goal here is to get past the eye candy stage, the isn't that cool, to the point where you are getting
real insight. So that is what our little Microsoft Research project with Layerscape is about. But we have
lots of other stuff going on. And in particular, Microsoft has lots of other stuff going on and it is really
fun and extremely interesting to watch it evolve. So at this point I would like to talk about the data
system, or the data resource, called Eye on Earth, and we will have Shoshanna come up and talk to us about this. And the basic idea is the European Environment Agency was tasked with providing a push-pull kind of
system where the users could actually push data in and get it back out. We got to build it and we got to
attach it to something called DataMarket. And there we are.
>> Shoshanna Budzianowski: So I am actually from a group in our server and tool business, so we're
very practical engineers and we create lots of products. We are not scientists in any sense even though
we do have some smart people in the group and so we took a look at kind of this emerging world of data
and decided that there were some real problems that needed to be solved. Now when you are working with data sets, some of the typical operations that you do are to first discover the data, if it wasn't data that you've generated yourself; you have to understand or schematize the data; you need to extract semantics from it; and then you're going to start being able to get to a place where you can be doing
some kind of intelligent analysis on it. But, you know, there are problems before you even get to that
point, and this is what we have been looking at, so we know for a fact that there is more data generated
in a month right now than there was in the entire past century; nobody can keep up with this information.
Not only that, but the data is not always reliable. The data is dirty. As a matter of fact, this is so funny. I
just realized that I went back to my whole career history and my first job out of college was actually
working at a company called Foxboro Company that essentially monitored things that flow through
pipes. And my first job out of college was to spend my time with awk and Perl and sed and anything you can imagine, and all I did was cleanse data for like two years. I did other things too but, you know, at a certain point I realized that I was a pretty expensive data cleanser, even coming out of college,
so this problem is only getting worse and worse over time, and we realize, you know, there are other systemic problems. Even if you can get access to the data, that data is often siloed behind very complex walls, at least for normal people and potentially even for academics, whether they are contractual walls or walls that are put up because of some of the constraints around academia and citations, et cetera. Getting access to that data securely, in a way that the publisher trusts the consumer, is extremely difficult as well.
So what we actually did is we said, let's take, you know, every problem is a big problem, every problem is a hard problem, but let's just start with a microcosm and start creating a solution that allows people who
own data to let people who want to consume the data know about it. And so we actually created a
system called Azure DataMarket. How many people here have actually heard of Azure DataMarket?
Hey, that is actually pretty good. All right. It's getting better than typical. And so we like to think of
Azure DataMarket as a place where you can go to get curated data sets from publishers who have
written and signed a contract with us that they will keep that data set available for a minimum of three
years, and I think that is important. Now, you know, a lot of times, if you have a static data set, you're going to pull down that static data set maybe once, or go back and update it, you know, maybe three or four times a year, but a lot of what we are seeing now is data that is a continuous data set that needs continuous refresh. Because these providers have signed contracts with us, it means that if you want some of this data for your solutions or research, you can at least be guaranteed that the data is not going to disappear and you are not going to be left hanging. Now
the other thing that we do in the Azure Marketplace is we actually work with these content providers across the world to provide a consistent interface on the data. Now, we use a standard called the Open Data Protocol, or OData. I think you've heard of that before. It essentially is a stylized protocol for making queries against the data, and then it returns the results to you in JSON form or Atom form or whatever you want, and the reason that that is important is because it means that tools and applications
can bet on the data. For instance, these data sets integrate with the full Microsoft BI stack and I will
show that to you in a second. The other interesting thing about unifying over a protocol is that it means
that it is actually easier for you to mash up data from different data sets. So, you know, me being the engineer, having lived in the world where every single developer creates their own API because it is a creative thing to do, it actually means that it is impossible to use two APIs from two
different providers together and get anything meaningful out of it. And so this open data protocol is
one of the key elements about making that data available and accessible too. So enough talk. We will
go in and take a look at the Azure DataMarket for a second. And while I'm bringing up the DataMarket,
so first thing, let me ask you: of anything I said about some of the problems that you are solving with data, how much of that actually resonates with you, or are
there problems that I am just kind of glossing over? Resonates? Resonates? Yeah, good. You have
other problems too? Okay. Good.
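[The OData pattern described above, one stylized query syntax across every provider, can be sketched in a few lines. This is a minimal illustration with a hypothetical service root and entity set; the real names come from each data set's documentation, while `$filter`, `$top`, and `$format` are standard OData query options.]

```python
from urllib.parse import urlencode

def odata_query(service_root, entity_set, filter_expr=None, top=None, fmt=None):
    """Build an OData query URL; service_root and entity_set are assumptions."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr   # stylized query predicate
    if top is not None:
        params["$top"] = str(top)         # limit the number of rows returned
    if fmt:
        params["$format"] = fmt           # "json" for JSON; default is Atom XML
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if params:
        url += "?" + urlencode(params)    # percent-encodes '$', spaces, quotes
    return url

# Hypothetical feed: first 10 rows where CountryCode is 'PL', returned as JSON
url = odata_query("https://api.example.com/data/v1", "Observations",
                  filter_expr="CountryCode eq 'PL'", top=10, fmt="json")
print(url)
```

[Because every provider answers the same query shape, the same client code works across data sets, which is the mash-up benefit mentioned above.]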
So one problem I actually hear in the scientific community a lot is sharing data that cannot be made
public and we will talk about that a little bit later. So this is actually Azure DataMarket. It is as I said a
Marketplace for data. Some of the data is available for subscription price. Some of the data is available
for free. For instance, we have all of the World Health Organization and a good portion of the UN data
sets on here. Now, one thing I would love to get from you, and I want to cruise through a little: if you see a data set that is not on here, especially like from the UN, and you need it to be on there. They've got a bunch of geospatial data sets that we just haven't gotten because we
haven't gotten a huge amount of demand for it, let me know. There are other data sets that are works in progress, and the place where we are actually spending time now is on environmental data sets. It turns out there is a lot of sensitivity to publishing environmental data sets to the world in a way that they can be consumed without having been interpreted by somebody who is part of an agency or by some
government regulation, so, you know, things like getting all of the water quality data sets from the
United States up here is something that we are absolutely actively looking at and, you know, it is kind of
stuck in bureaucracy and policy right now. We see that theme consistently across the world. So let's
just go take a look at our UN data sets. I am just going to cruise here a little bit. Actually let me go here
so you can see the whole list. Did I spell it wrong? There we go. So we have got food and agriculture.
We've got international labor and there are actually other data sets that I am not finding right now. So
we've got the world health--over here we've got the full set of data sets. We've got energy statistics, all
of the AIDS data, commodity statistics, the food one for me is just so telling about the world because it
really helps describe the food issues that are occurring in Third World countries throughout the world,
and so when I read that on that one tends to get pretty sad for me including how much land is planted,
how much is arid. It's pretty interesting. Development goals are all in there, international labor, which
we saw, and then tourism which I always thought was fun. Gender info and other really fascinating data
set, world population prospects et cetera. So how many of you actually use the UN data sets? Okay, is
that important to your work? Okay. So are there data sets that you would expect to see from the UN on
here that you haven't, in that brief thing that you haven't seen yet?
>>: The geospatial one.
>> Shoshanna Budzianowski: The geospatial one, I know.
>>: Of course, the ones you don't have up there.
>> Shoshanna Budzianowski: Well, I will talk about geospatial, you know, I am going to be really open
with you in this crowd. So there are emerging REST-based geospatial standards, but there is not one that we were willing to take a bet on yet. And so I'm going to tell you something that we are going to do, and you can just barf all over it and tell me that it is completely wrong. I think this is on tape, so I shouldn't have said that. [laughter]. Part of what we have been doing with the European Environment Agency is creating geospatial solutions, showing data over maps. We actually partnered with a company
called Esri. I think that you are probably all familiar with Esri. Now they actually published a GeoREST
specification and it is a great spec, from what I have seen. But it is a spec that has been, you know,
submitted to be an open spec. I can't remember which committee they submitted it to, but it is
potentially now an open specification. Now, one thing we could do is say we love that standard and kind of opt in to that standard, or we can say that we have to go with something that is more OGC but, you
know, won't have as much tie in to real engineering systems in the background, and so these are the
kind of like decisions that we need to make to determine, you know, how open we can make the data
and how standardized we can make the data. So, feedback: should we support the GeoREST specification from Esri, or try to go with something that is even more open?
>>: This is not my field [inaudible] you guys, there is an open GIS Consortium, are you familiar with
that?
>> Shoshanna Budzianowski: Yeah.
>>: Why wouldn't you, why wouldn't Microsoft want to go with something like that?
>> Shoshanna Budzianowski: Well that, this is where it gets to like, this is where the rubber hits the
road, so there is theory and there is implementation, and so yes, we could go with that and then not have the data that we need fully accessible through the DataMarket. We could go with where the
biggest data sets are and use that standard. And so, you know, the hardest part for us is to make that
balance right.
>>: And just to follow-up, we are working with OGC to [inaudible] the old data [inaudible] layer on that
is easy.
>> Shoshanna Budzianowski: And that is absolutely perfect. Right, yes?
>>: You said that the REST specification has been submitted to OGC for, to get [inaudible], but I guess the process has been slowed down because the [inaudible] is currently multiple specs in one, and the recommendation of the working group, the OGC working group, is to break it into its
component specs so that it is easier for people to adopt a portion of it so that they can comply with the
visualization portion, or the data access portion. So if you were to support that modularization I think
that that would be the best…
>> Shoshanna Budzianowski: Okay. That is interesting. So they've got a bunch of feature services that
are about calculating things and they've got a bunch of like mapping and tiling and visualization services;
you're right.
>>: [inaudible] when it's all bundled it actually makes it harder for people to be compliant with it, but if
you break it apart into its modules, you don't actually change the function but you make it possible to be
compliant with the pieces that people want to be compliant with, or need to be compliant with.
>> Shoshanna Budzianowski: Fascinating. See that's exactly, that's perfect, okay. So inasmuch as Esri is
an independent company that is feedback that I can provide them that would essentially make it easier
for us and for you to adopt that specification for geospatial data. There is one other, since we're talking about geospatial, just a second, we also have one other problem with at least that specification that they submitted, and with geospatial data currently, which is that they are a flat-earth company. And we are now in a round-earth world. And so a lot of the data that is being collected now at the poles can't adequately be represented in that specification, and so I think there is some other pressure that needs to happen not only for Esri but across everybody who thinks that the world is still flat. Okay, so [laughter]. It is
easier to think about the world that way. Okay. And so one thing I am going to do is I am going to do a
search on environment, just so you can see the environmental data sets that we have available. Let me
get rid of the United Nations and search for environment.
>>: [inaudible] comment, I am very surprised, I guess it's because they've been collecting through time,
but those data sets that you put up there from the UN, I am shocked that they are not using spatial
databases.
>> Shoshanna Budzianowski: Right. Well, you know, that's really great feedback, so when we, you know
Microsoft is a company that has, we have a lot of business customers. We have a lot of commercial
customers, and what we've done is we've begun to focus on portfolios of information that are available
that, and are necessary. So we first focused on business and financial data because, guess what? That's
where it is. And you know, the UN and WHO are pretty, you know, open in where they want their data to go. They actually want their data to go everywhere, and so they kind of pushed it to us and said, well, that is really interesting: when you start to look at some economic scenarios, you need some of that population demographic data, and that is how we are getting into it. Now that we are actually creating, you
know, a larger perspective on the environment at Microsoft, now we are going back and we are kind of
retrofitting in some of that environmental and geospatial data, and so it's really a matter of time more than
anything. So when I go look at, I just did a quick search on environment in the Azure Marketplace and I
came up with 65 hits, and we also have some scientific data. I just want to show you one that I
absolutely love. And this is probably a company you have never heard of. Have any of you ever heard of
Environmental Hazard Rink?
Okay. I know that we are mostly here talking about water, Northwest waters. They have done
something amazing on the land, not on the water and I think what you want is this company to start
looking outside of, you know, the land boundaries. So what they do is they calculate for any zip code in
the United States how much money it is going to take to actually clean up all the hazardous waste in
that zip code. And so, they had published this data set, and I went and talked to their engineer and I said, well, what do these numbers mean? And he said, well, they are actually numbers, and you should probably add a billion to the end of the number. The number was so large and so scary that they couldn't, for public consumption, actually publish the exact dollar amount that it would take to clean up the environmental contamination. So, interestingly, we have
an interactive map here and this 9 and 10 is where I live. So the scale goes from 1 to 10. It turns out
that we have communities that are built on things that used to be industrial areas. And the industrial
waste was never cleaned up and I happen to live in a very nice community right by Lake Washington but
it is also very close to where the Navy shipyard used to be and it is toxic. So, you know, a little
information like this is important. Now you can also get more graph information out of it. You can get
more data information, so I am just going to skip back and I want to go look at another data set and, well
you can see European greenhouse gas emissions et cetera so, I am actually going to go back and look at
the data set that we have from…
>>: May I ask another quick question here on this?
>> Shoshanna Budzianowski: Yeah.
>>: Because we are a bunch of science types, we are always concerned about the quality and the vetting of data, and that is not always obvious. That is not always on the front end when these
DataMarkets are out there, but I am assuming that the metadata and the standards behind that are
available so that if you are interested in a data set, you can do your due diligence on that.
>> Shoshanna Budzianowski: I think that is a really good question, so one of the things that we are
trying to do is to make sure that these providers are trusted providers. Now we do have some providers
on here, specifically around business data, and I think their data is just awful, but I think it would take
about 10 minutes for anybody to take a look at that: just go look at the data, pull it down and go look for outliers and the standard deviation over some of the values that you are getting, and you can tell if they are good or not. You know, when you start getting data sets that have a lot of null values or just really inconsistent values, you can throw them out. Now, we at Microsoft have a little bit of a privacy kind of
policy and a non-compete policy and this is something interesting, so we like to think of this Azure
DataMarket as iTunes for data, but Apple has a very different relationship with their providers. They
essentially provide their providers with very strict standards and they review all content.
It means though that Apple is actually liable for the content that is produced by their provider. So when
you go to the Apple, you know, App Store, right, they have reviewed and certified every single
application that is on that App Store, right? It means that they are taking liability for those applications.
Now I don't know how many of you have ever read the agreement that you sign for the App Store when
you take an application. It is literally 26 pages long. I think I am the only person in the world that has
ever read it except from some lawyers and I did it at a soccer game that I was really bored at one day.
But [laughter].
>>: And you still signed it?
>> Shoshanna Budzianowski: What?
>>: And you still signed it?
>> Shoshanna Budzianowski: Of course I did, but I don't have an iPhone anymore; I actually have a
Windows Phone. This is a problem that the whole search world has, even Google: as soon as you understand that content, you are liable for it. So you keep hearing things like the government
trying to make search providers liable for the content that they are displaying. In the same way, this is a
nascent business, this whole data business at Microsoft; we weren't willing at this point to close the
pipe, because if we actually said, I am going to certify every one of these data sets as the data being
reliable and accurate, then you wouldn't have any data, because we would still be on our first data set,
literally that is how hard it is. And I kind of briefly talked about those problems with data; I live in data
every single day and let me tell you how hard it is even to get data into that consistent OData form that
everyone can use.
So these typically are multi-month to multiyear negotiations with companies that hold the data, so there
is only so much that we want to do to stifle that pipe. Going back to some of the questions that you
asked about whether there is metadata for the data, the answer is, well, we love the OData spec, because you can just take a URL, put dollar-sign metadata ($metadata) at the end, and get all the metadata back, which is
great. Or you can just actually go look at this page. This is a data set that is published by a company
called Weather Trends, and what they have done for us is give us a very small sampling of their data. This is data from, I think, about 2001 to 2012, and we have weather data for about 10,000 stations across the world. It is just a sampling of the data set that
they have. These are actual weather readings from across the world. Now what this company does in
reality is they actually have weather prediction models and so the thing that they haven't given me yet
are any of their prediction models that have predicted weather across the world. So here is the schema
for them. So you just put in end date, start date and a station ID, which isn't in this database right now
but it is easy to get, and essentially you can get the maximum and minimum temperature, precipitation.
Here are all the data types that come along with that. This is essentially the station ID, and you get
the latitude and longitude back from them, and so this data set is one of those data sets that is available
for purchase. There is a preview or a trial data set that is available for free. And for $12 or $36 or $120 a
month you can continue to read this data set, because it is weather you might actually want to check
the weather every day. One thing that I did with this data set and let's actually just go in and use the
data. Okay. I think I have already purchased it, but I will just put purchase again. Okay . The one thing
that you will learn as you are watching me is that I am not so coordinated, so you will just have to
forgive me.
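[The two mechanics just described, appending `$metadata` to a service root to get the schema, and querying by start date, end date, and station ID, can be sketched as below. The service root, entity set ("Readings"), and property names ("StationId", "Date") are hypothetical stand-ins for whatever the real `$metadata` document declares.]

```python
from urllib.parse import urlencode

# Hypothetical service root; the real one comes from the DataMarket listing.
SERVICE_ROOT = "https://api.datamarket.azure.com/WeatherTrends/Sample"

def metadata_url(service_root):
    # $metadata returns the EDMX schema describing entity types and properties.
    return service_root.rstrip("/") + "/$metadata"

def station_readings_url(service_root, station_id, start_date, end_date):
    # "Readings", "StationId", and "Date" are assumed names for illustration.
    flt = (f"StationId eq {station_id} and Date ge datetime'{start_date}' "
           f"and Date le datetime'{end_date}'")
    return f"{service_root.rstrip('/')}/Readings?" + urlencode({"$filter": flt})

meta = metadata_url(SERVICE_ROOT)
query = station_readings_url(SERVICE_ROOT, 12375, "2010-01-01", "2010-01-31")
print(meta)
print(query)
```

[A client would first fetch the `$metadata` document to learn the actual entity and property names, then build filters like the one above.]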
My family knows this. They know better than to go anywhere near me especially when I am eating.
Okay. Now we're going to go explore this data set. Now you may not have noticed, but I have a Polish
last name [laughter], so I am going to go interactively build the query for Warsaw, Poland, and I have already pre-cached the station ID for Warsaw, Poland. And
what I am going to do now is just run the query, and what this lets you do, without having to download and pull data into tools, is quickly go and take a look at the actual values that are in the data set. So I find this extremely helpful in just kind of doing a simple evaluation of whether or not this
data is even meaningful for me, and they did give me this data set in Fahrenheit not Celsius, so it's
probably not very useful for everyone, but that is what we have on here because this is more of a
consumer option and a lot of the times what you see here is that, you know, this is a contact point to
begin to go get the data. So you may not have even known that this company existed, so this company
was founded by a bunch of weather climatologists who have, you know, advanced models on weather.
Now the founder of this company actually just left and is now working for NASA on their climate models,
so the data is valuable and relevant.
Now what I had done is after I had done this query, I went to export and export gives you the option of
pulling this data down locally onto your client either in XML or CSV form or an Excel PowerPivot. How
many of you are familiar with Excel PowerPivot? Okay, so I think I heard Rob say earlier that one of the
things that we have noticed is that Excel is still the programmable interface that is used by a lot of
people. So Excel PowerPivot, which I am going to show you now, is an augmentation to Excel that
essentially supports multidimensional data sets using the full SQL Server BI stack, and so what I have
done here is, well, this is a very flat, single-dimensional model, but what I can actually do is very
quickly build up a multidimensional model by importing data sets, different tables, you know, name your
favorite data. You can even import data from data warehouses and this is an in memory columnar
database that is lightning fast. You can start doing pivots, filters, functions on data sets that are as big as
10 billion rows almost instantaneously. I do not have a 10 billion row data set for you, but this is one of
the premier products that we have that turns Excel into something that is actually valuable for large
data sets. So what I did in that weather trends data set is I pulled it down…
>>: [inaudible].
>> Shoshanna Budzianowski: Oh. PowerPivot?
>>: Yeah.
>> Shoshanna Budzianowski: Like, I pulled it down for free. So it's on…
>>: Is it on the regular Excel release or does it have something…
>> Shoshanna Budzianowski: Well, just go to Microsoft.com or actually if you're on the DataMarket site,
we will tell you how to install PowerPivot. We will actually give you the link right there and you just go
install it for free, and it is just an add-in into Excel, so it is really super simple to use.
>>: You said that with a B or an M?
>> Shoshanna Budzianowski: B. I said B.
>>: [inaudible].
>> Shoshanna Budzianowski: Yeah. As a matter of fact we have a distinguished engineer--actually I
think he's a technical fellow now in the Microsoft terminology, which is our, you know, top-level engineer at Microsoft, called Amir Netz [phonetic], and I know there is a really great video of him
demonstrating this on data that came off of sensors from cars, so billions of rows.
>>: Can you hook it to a remote MySQL database?
>> Shoshanna Budzianowski: Well, anything that supports an OData feed, it can be connected to that.
They support like about 10, 15 different protocols for connecting to data, including all of the standard
ADO.NET, ODBC connectors, so if MySQL supports ODBC, I think you are fine. I don't spend a lot of time with MySQL. So what I did here was, I think, there are about 4,000 rows in here. This isn't huge. It is
only for one station, exactly one station in Europe and I said they've got about 10,000 stations. If I had
actually downloaded the data set I would be working with something that would be much larger, and
what I did quickly was I went through and added some formulas to this data set. Just to show you how
really simple it is, I think I should go and add another formula. What I did is this data set comes in date
and time stamped and I want to actually look at weather trends, so I want to say is January hotter or
colder over the last 10 years? So for me to do that I couldn't use a date stamp, I had to just filter this
data by year and month and day so that I could compare month over month. So that was a super simple
formula and just to show you how easy it is, I am going to go add a column to here, and I am going to say
equal, year, and I am going to go back and choose my date column right here. And scroll back over, and it has actually already populated the entire column with that formula and calculated it automatically. So it is
really just super fun and super interesting. So once I do that, then I get to use the full power of Excel,
right?
And so let me go and pull up what I did with Excel, right here. I went and created PowerPivot table and
chart, super simple. Are you guys PowerPivot table experts, chart experts? Okay. Let me go select the field list for this. This is how you operate: go pull up that field list too. Oh well. Okay. This is essentially how you operate a PowerPivot. You have this ability to create slicers, and what I did is I sliced this data set by year, and then I just decide what my X and Y axes are
and what value I am going to use, you know, in the data set and for this one I am actually summing, let
me go. Okay, well, I am actually summing the maximum temperature per day, and you can see that in the chart. Let me go get rid of this thing. And I am looking at, for 2010, here are 31 days: what was the
maximum temperature across that? And so you see the normal bell curve that you would expect in the
weather. The other thing that I can do though is I can look at any year, specifically. And then I can
actually look at the aggregate of all of the years for which I think I have good data, and you will
notice a normal bell curve. Now this just goes to show you how quickly you can do this. This is not a
very advanced formula. I am not a data scientist. I would actually expect that you would do something
different like have multiple charts and overlay those values from multiple charts and then go look at
actually, you know, whether or not there is a variance in the data, you know, between the different
months, so it was just something that was really fun and super quick to pull up and it shows you how
you can very quickly get data out of the system using that OData protocol into the standard BI tools at
Microsoft.
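Since the OData protocol mentioned above is just HTTP plus a conventional payload shape, pulling rows out of a feed programmatically is straightforward. Below is a minimal sketch assuming the JSON "value"-array form that OData services return; the station name and fields here are hypothetical, and a real DataMarket request would also carry your account key over HTTPS:

```python
import json

# Hypothetical OData-style JSON response, as a service might return it.
raw = """
{
  "value": [
    {"Station": "EU-0001", "Date": "2010-01-15", "MaxTemp": -1.0},
    {"Station": "EU-0001", "Date": "2010-01-16", "MaxTemp": 0.5}
  ]
}
"""

payload = json.loads(raw)

# Each entry in the "value" array is one row; tools like PowerPivot and
# Excel do this same flattening when pointed at an OData feed URL.
rows = payload["value"]
temps = [row["MaxTemp"] for row in rows]
print(len(rows), max(temps))
```

This is all the client-side machinery the "standard BI tools" hide: fetch, parse, flatten rows, then analyze.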
So enough on Azure DataMarket. Let's just go pull that down, and go back to the slides. One thing I did want to talk about before I actually start going into what you really wanted to see, which is Eye on Earth, is, you know, one thing that we recognized is that all data is never going to be public, just
because data is dangerous. I discovered lately that the fish hauls that are coming off of ships all over the
world are the most dangerous data sets. Although I know that we collect, you know, that data, NOAA
collects that data, it is never public because it is considered to be competitive information and yet to do
science you need the data, and so we are kind of in this Catch-22, which is how do I get access to that data. I would love to get access, but I have been told to cease-and-desist already from even asking
the questions. So what we are doing in the Azure Marketplace is, we recognize that the functionality that we
provided here for providing access to information, discovery of information including the discovery of
the schemas that we talked about earlier, it is all great, but some people want to keep their data private
so we are offering that platform, that public Marketplace platform to organizations that want to
implement the same functionality but keep it private.
And so we call this our private Marketplace. This is not all that secret: in the next couple of months we will have a lab available of that private Marketplace. What
it does is it allows people who kind of own that Marketplace to invite people in. They can decide who
gets to publish data. I love to think about this kind of virtuous cycle where data gets better because
people are allowed to improve the data and republish it. So part of this concept of private Marketplace
is to give people permission to republish data sets, improve data sets back to this closed community.
Now, yes, this is good. This is what I have been hearing. Now for organizations that want to charge for
their data, that is fine. For instance, some of the most interesting social data feeds right now do not
come from Twitter; they do not come from Facebook; they come from purchases made on Visa. And
Visa has one of the largest data sets in the world and if you can get access to it, you know, good luck to
you. Actually I met a woman from MIT who has been mining the Visa data set with permission from
them on their site, having signed her life away in blood, and is just finding really amazing things about
human behavior from that. I mean you can actually predict when families are beginning to go into crisis
mode by some of the patterns that you notice in terms of their charging or their lack of charging
behavior, and so one of the things that she discovered is that you can predict when somebody is going
to get laid off four months before they get laid off. It's not that the company came and told them that
they were getting laid off; it's because there is tension in the system and as soon as there is tension in
the system their patterns start changing, so they get a little more erratic in their purchasing patterns. All
of a sudden all those goody things like trips and movies go away. The other thing that she said that you
can pull out is, it is so easy to tell, you can obviously tell when somebody is about to lose their
mortgage, when they are going to stop making mortgage payments. You can also tell when people are
starting to get into crisis situations with their family, when health becomes an issue for them, when the
ability just to provide normal sustainable kinds of things like food becomes an issue, when food and shelter become a major issue for the family. And this is all coming off of a single data set, a data set that you
probably wouldn't think about for human behavior, but how does Visa give access to that data? This is
the kind of environment that we believe is safe and secure for them and still provides access for people to get to it in a way that they have never been able to before. So here is some information; I
will let you…
>>: [inaudible] access to that data?
>> Shoshanna Budzianowski: Well, they have to decide. See, in the public Marketplace, you know, did you hear Microsoft actually has some ads about Google's privacy stuff that went into the New York Times? I
haven't seen them. One thing about the public Marketplace is the users who consume the data are
anonymous to the providers. We do not provide your information to them unless you have an issue or
you agree to provide the information because you want to have a relationship. So that is very private,
now in the Visa case it is not private at all. They could make you sign some sort of a licensing or royalty agreement based on the derivative products that you create before they give you access to the site. So it is completely up to them to manage their
own privacy policy in that case. But, you know, the fact that the DataMarket is really public and
anonymous is very difficult for those providers, by the way. Everyone in the world wants to know
exactly who their contacts are, exactly how much every contact has ever spent because they all want to
create a 360° profile of you to sell you more stuff, right?
So they don't get that option through the Azure Marketplace, which means that what you often see in the Azure Marketplace, like that weather trends data set that I showed you, they actually have
weather prediction data, min/max temperature for a year out, that is what they are modeling, for every
single point on earth, so if there is actually not a weather station in that point, they will model the
weather for that point and then they will go back and kind of reverify the model on that point. There
are places in Africa where you are never going to get accurate, measured weather information from, but they didn't give me that data set, because they are going to retain that
information for the people that they are going to have a relationship with. So I like to think that a lot of
the data sets that we have are kind of loss leaders for those companies, or the data sets are perfectly wonderful for what I want to call the long tail, the 99% rather than the 1%, who are just satisfied to get access to the data because they never knew it existed. So here is my contact information, and here is a little bit more information about Azure Marketplace. I spent no time on the Facebook page for Azure DataMarket; I wouldn't necessarily go there for information, but we think that it is a great thing. There is a blog, but the best place to get the information is on the DataMarket site itself.
I see some people are writing. I will give you another minute and then we're going to switch. And Rob
you need to tell me when I am going to run out of time.
>> Rob Fatland: You are doing great.
>> Shoshanna Budzianowski: Yeah.
>>: What's next?
>> Shoshanna Budzianowski: Yeah, I think there's a question back here.
>>: One question on the data quality [inaudible] QA and QC. One mechanism might be how people comment on the data that you use, and [inaudible].
>> Shoshanna Budzianowski: Lovely.
>>: But do you actually address some other things, like, oh man, you can't comment on it because of so-and-so? Have you guys thought about that? Even now, everything you buy online has a whole mess of reviews, and one of the things, from the data user side, is I know a lot more about the data sets I use, because I get into them, than the provider often does, so I actually have information that I could feed back that would help other users plus the provider go back and [inaudible]. So have you thought about that?
>> Shoshanna Budzianowski: Well, we have actually thought about that. Did everybody get the
question? It's like how do you allow other users to comment on the data set so you can understand the
quality of it? And the answer is we do think about that a lot, but given where we are in this Marketplace
some other things have higher priority for us and one of the things that we've done is we've now taken
the Marketplace international, so we have the ability to essentially aggregate all of these data sets from countries across the world, from about 36 different countries; we've localized the data sets and we now accept payment for those data sets worldwide. So some of the geospatial support and some of the
world reach kind of trumped our ability or our desire to put in those crowd sourcing commentary kinds
of features and, you know, I think when you are in it like--for that private Marketplace when you are in a
trusted community, I would trust your feedback. I barely trust TripAdvisor feedback, and now we know that the TripAdvisor feedback in the UK was cooked. It wasn't all independent travelers. So I
am not as excited about that kind of model. I think when I showed you the noise meter model at the
beginning of the session and you all downloaded it of course by now I am sure. I would just put it up
there since it is our feed into Eye on Earth. So this kind of stuff gets me excited when we actually have
this ability to take fairly accurate I'd say, you know, readings off of sensors and devices. To me this is
commentary. The stuff that human beings type in, not so much unless you actually have a large enough
sample group to say that their feedback is statistically significant, but like feedback from one or two
users isn't really helping us that much. Okay, I am actually going to start. We are taking a reading right
now. Yeah?
>>: To follow up on from that point, I think it might be interesting, you know, we often do our literature
research when we found out publications that have been cited or cited by. It would be interesting for
data sets to say here are the peer reviewed publications or papers that have used this data set and then
you can drill back. Then the level of trust is much higher.
>> Shoshanna Budzianowski: I love that. That is a great comment. So it turns out that it is quieter here
than it is in my office, substantially, so this data point will be updated to that worldwide database in
about an hour. So let me go and switch now to Eye on Earth. And I'm going to show you something that
I am actually pretty proud of. This is called the Eye on Earth network and under the Eye on Earth
network are a number of environmental watches. Now the first three watches, the water, noise and air
watches were produced by the European Environmental Agency and it is part of their responsibility back
to the European Union to provide this information to citizens. Now, the water watch was actually first
developed about 3 1/2 years ago. Did I click the button? I have no idea. Okay, so it was developed
about 3 1/2 years ago in response to something really simple. People were going to the beaches all over
Europe and they would show up with their family and find out that the beaches were completely
polluted and they started to get mad and so they actually pushed back on their governments and
legislators and asked that in fact somebody give them some information about whether or not they
should bother to spend their time at the beach. So there are 22,000 stations across Europe that are now
monitoring water quality data. Now this is where politics kind of gets involved. So they monitor water
quality data on a daily basis and they've got quite a bit of data about that water quality, but what they
provide back to the citizens has been parliamentary-approved, scrubbed data. But the fact that the data is up there is pretty arresting. So these are sites across Europe. I'm going to go to one site and the
reason I am going to go to it is because I want to go visit it, because who would not want to go to Italy?
So Torre del Greco, it is actually a site on the coast of Italy near Mount Vesuvius. Has anybody been
there? Oh, gosh, so everything I have read about it is fabulous. It turns out its major industry is creating
jewelry out of coral in the water. At least that's what their major industry used to be. So here is Mount
Vesuvius, lovely right here; and I know that there is a lot of Roman history that goes around Mount Vesuvius; some of it's pretty bloody, but I won't go into that. Here is the coastline for Torre del Greco, and these little boxes that you see are water quality stations that are providing water quality readings
out to citizens. So I am going to select this one right in the heart of Torre del Greco and the reason I
select this one is because the water quality station is saying that the water quality there is very good.
And yet, the citizens have a little different information about that. In fact, what the citizens are saying is
yeah, I know what the science is saying, but the accountability back to the citizens says it is not clean. It
is actually polluted. Now I can go back and look at--get rid of that. Oops, wrong one. I did tell you guys
that I am not so good at operating things, right? So let me go back to that spot and see if I can get back.
So let me go back here and see the scientific reading for that, and what I see here is, yeah, there have been water quality problems at that site for many, many years, so the data here goes back to when we
first started this water quality watch back in 2007. So one of the things that is interesting is that
because I am a participant in my environment, I also have the ability to go in there and add my own
assessment and rating of the water if I happen to be at that site. So if I could go say moderate and then
I could go say the water is dirty, et cetera and rate it and that is my data coming back as a citizen back
into that database. So this is a live database. I am not going to submit this. The other thing you could
also do though is to go share it with your friends on your social media just so you can warn them. And
people read Twitter all day long to make decisions, so this tends to be another powerful mechanism for
sharing environmental information with the world. Now whether or not you think it is scientifically valid
is a different question. So the other thing is that this European Environment Agency has watches for air. It has watches for water, and it has watches for noise. So, like, why does this exist? The solution exists because governments around the world are actually getting pressure from their citizens to be open and transparent and to allow their citizens to participate in their environment. The passive perspective that we've had in the past, where citizens would just take what government wants, and if they wanted to put a garbage dump next to your house, only a few people were going to complain about it, that's not happening anymore. What we are seeing in a world of limited resources is people becoming activists in their environment, and this is one of the approaches being used to become activist. So why is Eye on Earth important? Because what I am showing you here is,
and there is tons more stuff that I can show you about this, but I'm going to go back to that first page
back there, on here. What Eye on Earth is not is just a water watch or an air watch or a noise watch or a
marine diversity watch or a water surface temperature watch; this is a network that we have created
that allows communities, scientific communities around the world to contribute their information in a
way that can be shared with other scientists and with citizens. And so we call it the Eye on Earth
network because each one of these solutions that you see down here is now being submitted by
scientific organizations. Now the European Environment Agency, Esri, and Microsoft did a pretty nice thing, which is we bought SQL Azure storage. We bought Azure Compute, and through the Eye on Earth network we are offering it to citizen scientist groups across the world virtually for free, so scientist
groups who are kind of authorized by the European Environment Agency can create their solutions and
have them published here and available to the world essentially for free.
Now there are other organizations, like, I will say, Russia, which isn't necessarily a small NGO and which has its own prurient interests; they have a paid account. There is some information around forestry that
they will be providing into the system and that we will host for them for free, because we actually think
that their forestry information is fascinating. So I will tell you a little bit of a story that really blew me
away. So the head of the European Environment Agency is Professor Jacqueline McGlade; there is an interview of her. Have any of you ever heard of or met her? Okay. She is a pretty amazing woman. She has a PhD in marine biology, and she taught computer science as part of what she was doing in academia, and then became the director of the European Environment Agency. So
she said, well, I was in Russia and I went to meet Putin, and we are at one of his, you know, winter dachas, and it is the middle of the winter and he's got his shirt off and he is there barbecuing, bare, because he is a really masculine guy. And she says to Putin, Putin, you know, I really need you to go
measure your forests. I need to know what the carbon footprint of the forest is. You need to tell me
what the distribution of the different types of trees are, how big your forests are; this information is
critical to be able to collect carbon information for the world. And so she said so Putin are you
interested? And he said yeah, I am very interested. I can get this done right away for you. I will just
take 22,000 people and send them out to the forest now. Like, okay. So it goes to show you that, you
know, connections mean everything in the world. And in fact, the one thing that the European Environment Agency is doing right now is taking this platform and the solution to the world to evangelize world governments and NGOs sharing their information with citizens and scientists across the world. And so, although this solution has been around for three years, we have re-launched it as the Eye
on Earth network at COP17 in Durban, South Africa. It was launched with a new food watch that my
team actually built at the Eye on Earth Summit in Abu Dhabi in December. We will be taking this
solution to the UNEP conference; I guess that is like March. And then we will be at Rio +20 with about
10 other NGOs besides the ones that we have launched with here. Now, I am just going to wrap up really quickly. I'm going to tell you one thing that is really interesting that is happening right now in the trunk of my car. One of the things that we discovered in South Africa is
we met a woman called Sarah Collins. Sarah Collins is the CEO of a company called Natural Balance. It
doesn't matter. She actually created a very simple intervention to help people across the world, and
especially in suffering nations, reduce the amount of carbon that they are using while they are cooking.
So you know how we have been across the world trying to solve the cooking problem because we know
that that particulate matter that kind of gets dissipated into the air when people cook inside their
houses is poisoning them, number one. Number two, there are just simple issues about access to fuel, whether it's kerosene, whether it's wood, whether it's natural gas. Not only is it difficult and dangerous to collect, especially wood, in some parts of the world, but for those that have to pay for it, it costs
a lot of money. It can cost up to 60% of a poor family’s income just to have fuel to cook their food. So
what she did is invent the Wonder Bag. And in the trunk of my car right now I am cooking a stew in the
Wonder Bag. Now I could've brought it in here but it would never make it to my team meeting later,
right? So the Wonder Bag is a simple quilted bag about this size that is filled with polystyrene foam
balls, all of that recycled, you know all of that polystyrene foam stuff that we can never get rid of? She
actually put it in a quilted bag with a top. It is big enough to fit a pot. So now instead of essentially
taking a stew and especially if you are cooking the stew with beans or with some meat that is not so
well, it takes hours to cook, so instead of having to use fuel to keep that food cooking for 5, 6, 7 hours,
what you do is you use your normal cooking pot, your normal cooking methods, your normal fuel. You
heat the food, boil it, get it cooked, fry your onions. You put the pot in the Wonder Bag, close the
Wonder Bag and it continues to cook for up to six hours. And so I now have one; I don't think I showed it to you, but she actually delivered a Wonder Bag to me last week
and I am now cooking a stew, you know, I assembled some beans and put some raw rice in it and I put
some fried onions in it and some tomatoes in it and it is actually cooking in the trunk of my car and so
we will see that later. You won't see it, but my team will see it later. So what I did, and you can say like
well, can I afford to cook, you know, like a pot of stew, and the answer is yes. But instead of running my stove for five hours, I am running my stove for 45 minutes or less. Now why is this important?
This all sounds great, but Sarah Collins is doing something else. She is actually contributing to the
carbon credit economy in the world. She is working with the UN Environment Agency to essentially
collect carbon credits for every use of the Wonder Bag worldwide, which is amazing. And so that is how
she is going to support her business. The data that she is collecting--you know, I've got to say, getting
data off of Wonder Bag is not as easy as getting data off of a smart phone; the way she is collecting the
data is she is creating an economy around it. She is actually working with a company in South Africa that
is sending agents out to the field into homes to validate that the bags are being used and because she's
got the validation, she can get the credits back. And what is she going to do with this data? Well she is
going to put it up on DataMarket. And so now DataMarket is kind of an ecosystem for accountability
back to the world and the environment. And so with that I will leave it to questions. Yes?
>>: So you mentioned [inaudible] which is a private company in European [inaudible] what is the
partnership with the federal agencies here and the state agencies and those sorts of things?
>> Shoshanna Budzianowski: Well, we have, you know, me and my engineering group; I don't have
relationships with everyone in the world. We actually have an environment sustainability group here at
Microsoft and there is a gentleman named Rob Bernard who is our chief environmental officer for
Microsoft, and he and his team have relationships with agencies throughout the United States and
throughout the world. And so they have been a really great partner for us for creating contacts with
those agencies. In fact when we were first envisioning this solution, I think we must've talked to
virtually every agency in the United States. We approached them and said please make your data publicly available on DataMarket, and now we are back in the protracted conversations, which take a while. I can't remember which agency in the United States handles water, but we
have actually been in negotiation with them about using Azure Marketplace for about six months now.
There's a guy on our team who is handling that. It just takes a long time to--everybody can publish data.
And this is what you see, this open data, everybody publishes data, right? We can get it. You can get it
in different formats, KML or shapefile if it's geospatial, or, like, they will put up their random REST interface or
they will put up an Excel spreadsheet, but they put header information in the Excel spreadsheet. There
is so much process built into these organizations around getting the data out, but the data doesn't get
out in a form that is great. So as soon as we start talking to an agency, it ends up being a protracted
conversation about their own internal processes, and that's why it takes so long.
>>: Do you expect to see some [inaudible] NASA data Azure market or geo-data [inaudible] at some
point?
>> Shoshanna Budzianowski: Yes, at some point you will see it. NASA data, absolutely we are working
with them. As a matter of fact we had NASA data up here. But they kind of changed their focus from
outward looking, interplanetary, back to kind of inward looking and so they are making a change on the
data that they want to make available as well. Yes?
>>: You intimated repeatedly in your talk the focus on the social and economic value of the data
products, and at the same time you've indicated that there is some [inaudible] of environmental,
especially oceanographic data. I am curious as you speculate where do you think the economic market
is for oceanographic data?
>> Shoshanna Budzianowski: Yeah, that is a really good question because, you know, the people that
need that data, like, deep oil and the energy industry, not only do they have the data but they don't
necessarily want you to have the data. And so as I said, a lot of the data that we are getting here is the
long tail data. There is not yet a healthy marketplace for data in this anonymous kind of model and I
don't see this as being something that is going to happen immediately. I think this is years, maybe a
decade before we actually have a really healthy model. We are very, very early in kind of discovering
the power, the economic power of data. Sad to say, the data business worldwide is approximately a
hundred billion dollar business, but most of it is going to Bloomberg and Reuters and AP, and people are holding onto data that people actually need to buy. Yes?
>>: You have a very global point of view and [inaudible] is a regional Association, are you kind of, is your
point of view such that you aren't interested in regional problems and regional data sets and regional
uses of…
>> Shoshanna Budzianowski: No, we are absolutely interested in regional problems and regional data
sets. Finding curators for those data regionally who are willing to make the commitment is more
difficult. So we, I want to say we are cheap. We will talk to practically anybody who has a data set that
we want to put on Marketplace and we will help them and as a matter of fact the Azure Marketplace
now has a self-publishing portal. They don't even have to call us and talk to us to put their data up
there, so it is more a matter of education that this facility exists, that it is available and that people can
use it. Now I did talk to some people at Northwest Fisheries just recently and they said, oh, we didn't know what this Azure Marketplace is. We never heard of it. So I would say we still have some work to
do on our side to make sure that people know that this outlet exists. Yes?
>>: I'm just wondering what efforts you've made towards involving volunteer [inaudible] and their data sets? I love that you are interested in the citizens’ feedback, but especially in the United States there is a huge network of [inaudible] monitoring projects [inaudible] reliable [inaudible].
>> Shoshanna Budzianowski: So the question is--there is a blower happening out there--have we thought about voluntary monitoring projects coming in from, and I am going to use the word, citizen
scientist. The answer is yes, we have thought about that and this Eye on Earth network platform is the
platform that can be available, not only to NGOs but for citizen scientists as well, as long as they can be
semi-certified that, you know, their data is reliable then they can come in here and create a watch. So
you are right, there are science teachers in high schools who will go out and monitor streams every single month for 10 years, and that is critical information. And they now do have an outlet for publishing that
information to Azure Marketplace and for creating these watches here. It is really a matter of
evangelism around the solution at this point.
>>: You mentioned that you had some methods for semi-certifying data. Can you talk a little bit more
about the steps you've taken, how do you go about doing that?
>> Shoshanna Budzianowski: We certify the providers, not the data itself. So I am pretty careful to say that, you know, we are not doing the analysis of the quality of that data. What we do is certify the provider. They sign contracts with Microsoft both to assert that they are the owners of that data and to assert that the data will be available to you for the period of, you know, the contract. They, by the way--I need to tell you this--provide their own terms of use, terms of service, for their data sets. This is difficult. Most of the providers do not offer commercial terms. I don't know if that is concerning to you. The data is typically available for industries to use for their own internal research or for scientific research, but each one of these providers will have their own contract, and they have a contract with Microsoft, and that contract also includes a service-level agreement on availability for the data. So they sign a contract that any request to their data set must return a response in 5 seconds or less. I know that doesn't sound great, but, you know, it's not three hours. Yes? Another question?
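The 5-second service-level window described above can also be enforced from the consumer side. A minimal sketch in Python, using only the standard library (the endpoint URL in any real call would be the provider's dataset URL; nothing here is a real Marketplace API):

```python
import socket
import urllib.error
import urllib.request

SLA_SECONDS = 5  # the per-request response window described in the contract

def fetch_within_sla(url):
    """Fetch a dataset URL, giving up once the provider's SLA window has passed.

    Returns (http_status, body) on success, or (None, error_text) on
    timeout or connection failure.
    """
    try:
        with urllib.request.urlopen(url, timeout=SLA_SECONDS) as resp:
            return resp.status, resp.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        return None, str(exc)
```

A client that times out after the same window the provider contractually promises never waits longer than the SLA allows; a miss shows up as a fast failure instead of a hung request.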
>>: So Eye on Earth exists as an application, and if, let's say, [inaudible] wanted to make a specific application of Eye on Earth for a specific way of engaging with, let's say, citizen monitors, is this something that is possible at this stage now?
>> Shoshanna Budzianowski: Uh-huh, now. This platform is, you know, I mentioned Esri before and the
reason this platform is in the commercial and available state that it is is because we built on the ArcGIS
system and that is a system that, you know, Esri has been essentially supporting and selling for, you
know, I don't know 10 or 15 years. The company is older than that. This solution is actually running on
Azure so it is a solution that runs worldwide. It is not hosted in anybody's office park. It is actually
hosted on Microsoft servers across the world. The ability to go and create your own content exists and
then we now have a mechanism in place for promoting your content and asking the European
Environmental Agency to make it part of this Eye on Earth network.
>>: But it doesn't have to be European? It could be [inaudible]?
>> Shoshanna Budzianowski: Actually a lot of solutions are coming from places across the world. Yes, so the food watch that we did, which isn't on this list yet--and I will tell you why it isn't on this list: because the data set isn't live yet; I don't have a live stream on the data set--it was created in a matter of--I've got to tell you, it took us three days to create that watch, because we are using templates that are very powerful here, and it only took three days because we needed to get access to the data and we needed to understand the data types. We needed to create the schemas, and that was the hardest part for us, but it is literally available now.
>>: You were talking about publishing data using the Azure Marketplace. If I publish data, where is the data? Is it in a database somewhere that is stored in a cloud? [inaudible] private server, because you said that you need to provide access to the data in 5 seconds.
>> Shoshanna Budzianowski: I am just taking you through a couple of links so you can do it yourself. As you go to the DataMarketplace there is a Learn tab. Here is the Learn tab. See where it says submit apps and data? Click that, and then there is a wizard down here that takes you through how to submit your data. Now, most of this asks you for information about you and your company, et cetera. Your data can reside in multiple places. So if your data is up on SQL Azure it could be anyplace in the world, but as long as it is hosted in SQL Azure, it is very easy for us to create an OData interface on top of it. We do that automatically for you, and we actually take responsibility for the service-level agreement, because guess what, Microsoft is responsible for responding to users who make queries through Azure. You could decide for yourself, though, that you want to keep your data on premises, and what you would do then, instead of telling us the schema for the SQL Azure database or pointing us to it, is essentially give us a REST API to your data that resides on your database. That means you will give us a set of queries that you want us to perform on it, and you will give us examples of the responses that you will provide to those queries, and then we will go and proxy your service with an OData interface. So we will still put an OData interface on top of your data, but that lets you keep your data on your own servers. That also means that you are going to have to have somebody on board 24 hours, or have an operations group, to make sure that the data is always available to people when they query it. Yes?
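The OData interface Shoshanna describes is, at query time, essentially a URL convention: system query options such as $filter and $top appended to an entity set under a service root. A minimal sketch of building such a query URL; the service root and entity set names here are hypothetical, not real Marketplace endpoints:

```python
from urllib.parse import quote

def odata_query_url(service_root, entity_set, filter_expr=None, top=None):
    """Build an OData query URL of the kind an OData proxy exposes.

    System query options ($filter, $top) go in the query string; only the
    option values are percent-encoded, so the $-prefixed names stay literal.
    """
    options = []
    if filter_expr is not None:
        options.append("$filter=" + quote(filter_expr, safe=""))
    if top is not None:
        options.append("$top=" + str(top))
    url = service_root.rstrip("/") + "/" + entity_set
    if options:
        url += "?" + "&".join(options)
    return url

# Hypothetical service root and entity set, for illustration only
url = odata_query_url(
    "https://api.datamarket.example/v1",
    "WaterQualityReadings",
    filter_expr="StationId eq 'NIS-01'",
    top=10,
)
print(url)
# https://api.datamarket.example/v1/WaterQualityReadings?$filter=StationId%20eq%20%27NIS-01%27&$top=10
```

Whether the data lives in SQL Azure or behind your own REST API, the consumer sees the same shape of URL, which is what makes the proxying she describes transparent.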
>>: Two questions, one a detail. So you didn't really change the waterfall on the beach in Italy from a
place in Washington right?
>> Shoshanna Budzianowski: No [laughter]. Well, the thing is, with this system, you know, everything is really interesting, but I could've hit that button. But that is why I do not trust, you know, me. I keep pulling up the noise watch application because I really don't trust citizens in any manner, and nobody trusts that data, right? It's mostly informational. It is the same kind of stuff you could've done on Twitter, which is complain about something that you actually have no insight into at all. I do trust the water quality readings, though--that histogram that we saw that came out of that water quality assessment station. So in a place where humans add data, not so valuable--maybe it is in aggregate, if you've got a billion humans adding the data--but sensor data is much more accurate. We actually have a nature watch that is being announced pretty soon, and the nature watch is kind of a mixture between those two. So there will be a mobile application that lets you take a picture of fauna, because we have lat-long location now on our phones that we share. We will have a picture of the fauna, a lat-long location, and then users will have a pictorial database where they can go and try themselves to identify that fauna, to help with some of the processing on it. That I trust. It's not citizen commentary. The citizen commentary I don't trust as much.
>>: That's good. All of the beaches in North Carolina are closed [laughter]. [inaudible].
>> Shoshanna Budzianowski: I know people are honest, but I'm not honest.
>>: The other question is, this is a group here in which we collect a lot of disparate things, and one of the things I want to be able to do is come up with integrated products and sort of be able to address natural language type questions. This is probably not that good an example, but: what is the environmental variable that controls when you can go out on the beach? Is it waves, is it water quality? And you don't really know what data is there; you just have this question for one of these places in the Northeast or whatever. We were talking about it. We know the data sets are there; we can all drill in and find stuff, but have you looked at sort of the higher-level kind of integration of really disparate data?
>> Shoshanna Budzianowski: Yes, yes, yes, we have. It is not part of the Marketplace right now. Before, we actually spent a lot of time looking at linked data and RDF and semantic graphs and, you know, how to navigate and query across those semantic graphs. I would say that's more in MSR, in the research phase, at this point, and what we haven't done yet is turn that into an actionable product that the rest of the world can get. Yeah. But it is very interesting. It's also very dangerous, by the way, for other reasons. Okay. Good? All right. Thank you very much everyone for your time. I really enjoyed this.
[applause].
>> Rob Fatland: Okay. From my point of view, I could not have done anything better than to let Shoshanna talk like that. That was sort of the most relevant thing that I am aware of that we can communicate to this group. What I want to do now is just sort of wrap up the Microsoft-focused, or Microsoft-centric I guess, perspective on the presentation here. I want to mention that my colleague [inaudible] was just here but she left. She probably had something else to go to, but she is also in my group, and she runs a workshop called Open Data for Open Science. This is April 4, 5 and 6 this year, and if you have a technology problem around Microsoft technology that involves, you know, how do I get this thing done, the purpose of that workshop is to bring together our experts with our collaborators, basically with people who are interested in solving those problems, in a sort of let's-get-down-to-it kind of environment for three days. So you can get in touch with us about that if you are interested in taking advantage of it, and that is April 4, 5 and 6. The Azure DataMarketplace, as we just saw, is commercially built up; it's part of the commercial segment of the company, and I personally think of that commercial segment as a lot of people--a lot of time and energy and a lot of resources go into it, because it has to work, and obviously we've all had PowerPoint hang on us or whatever. So even when it has to work, sometimes it doesn't work. Computers are complicated things.
In contrast, however, in Microsoft Research our charter is a little bit different. We go in and investigate stuff. We tend to build mockups. We build tools. We don't build commercial applications. There is this very clear division in the company. And so what we tend to do is come up with cool ideas, with only one or two people working on each, and you sort of carry it through as far as you are interested in carrying it, and then you have to figure out what your exit strategy is. So what I wanted to do is talk for a minute about SciScope, and
SciScope is a Microsoft Research project. It is not Azure DataMarket, but as you'll see, it carries some
common features. So I wanted to start off by talking about process here. We kind of got into how you trust data, and maybe you use social network mechanisms to see ratings for data, and if you had this sort of world where everybody was publishing data like crazy, you would have to sort through that somehow and decide which you could trust, and read the caveats [inaudible], and at some point you are leaping off a cliff and hoping that what you are basing your decision on is good; and if it's not good, then you're going to lose money, or your paper is going to be refuted in the next publication, or whatever. That being said, I am going to show you two things to finish up. The first one will be SciScope, and the second one will be back to the Layerscape stuff, where we are really trying to anticipate some of these issues about data value and trust and so forth. So to get into this idea, I
wanted to sort of come up with an image, so I came up with the guys-walking-into-a-bar joke. There isn't a joke actually; there is just: two guys walk into a bar. One of them is a technologist and one of them is a scientist, and they are working together to come up with a design for a system. And the technologist is leading. The technologist, we will say, is a programmer in particular, and the technologist will say, I have this tool and I have this tool, and, you know, I work at Microsoft so I use Visual Studio and I write in C#, and they will tend to talk in terms of the tools of the trade. And then the domain specialist, if you like--the marine scientist or whatever--will be thinking, I really need to understand where all of the salmon are in the middle of the summertime, because I have no idea where out in the ocean they are. And so you have this process of having conversations between the technologist and the scientist, and
it has to be a partnership, because if you let the technologist lead, you will tend to go down into what they find cool and interesting, because programmers--and I am a programmer--like to build stuff that they know how to do and find really cool. And the problem with that, my boss likes to say, is that you shouldn't let what you are capable of doing design your user interface. Your user interface should be driven out of what your end user needs. In our group, I mentioned, we have this book called The Fourth Paradigm, and a piece of that Fourth Paradigm business is that one of our mentors, who is unfortunately no longer with us, was Jim Gray, and he talked about this 20 questions process.
And Shoshanna used the phrase queries that you want to perform on it; it being some kind of data
system. And the idea of 20 questions is worth mentioning because I have found it a really useful process
to go through. So I am a technologist. I go to my collaborator, and let's say they are doing biogeochemistry. They want to make a cloud library: they go off and generate an absorbance spectrum, and they want to be able to put it into this library, be able to see it later, and have somebody else put their absorbance spectrum into the library and find out, you know, what does my absorbance spectrum look like? Does it look like the Congo River, or does it look like the [inaudible], or does it look like the middle of the Pacific? So allowing the technologist to drive it, you'll end up with one sort of interface; but if you allow the scientist to drive it, how do you do that? How do you do that process?
And I just think this is really interesting. So the answer is, the technologist says to the domain scientist,
the marine biologist what are the 20 questions that you want to ask of your data? You have to sit down
and you have to write them out. You have to come up with that list, that description of how you think
about it. Or let's go back to the papers you have published and look at the 20 typical graphs that you
have generated. If you start to go down to that level of detail--what are you really trying to pull out of your data--that is one way of approaching the design process. So the design process and the technology have to fit together, and it is a really hard thing to do, and if you see a system, let's say a data portal, where it isn't doing what you want it to do, there is a good bet that in the design process at the beginning there wasn't enough thought given to how people are going to use the system. And again, I
am totally guilty on that technology side because I really want to get my fingers going on a keyboard
writing the code to do my next idea.
It is very painful to have to back up and think about the next five years of using this system. One typical example in this space is exemplified by SciScope, and unfortunately SciScope is not up right now or I would show it to you, but I have this tutorial here that I can scroll through, and I will just describe to you that there is a map interface. It is hosted on a server at the Berkeley Water Center, and the idea is that SciScope has its own vocabulary, a bunch of keywords. So you would go to this site and you would use this interface over here to specify your parameters. You would give it a time range--that is, let me see if I can highlight it; these two calendars here are for that. You'd use these location choosers either to draw a polygon or choose a preselected, precut polygon, and then you would go down and you would select a keyword, and the keyword might be discharge, like in a stream gauge, or it might be diazinon. In fact, I go down here and--it's a little slow, huh? Well, if it's going to be this slow…Keywords, there we go. Okay, so there is a partial list of the ontology, if you like, or the vocabulary of SciScope. You type in two letters and it tells you what you've matched so far, and you can sort of sub-pick. So you give it that for your search, and then SciScope comes back with a
search hit, or a set of hits I should say. And I am working up to a sort of philosophical point. For some reason it is just really slow. Well, we will just take that to be our hits. So I have chosen an area down by [inaudible], and I do a search and get those two orange dots back, and it's just like a search engine, where I have gotten hits where the data is available. So that is my search query result. But the important point here, and my whole reason for starting down this road, is that SciScope doesn't have data. It doesn't know what the data is behind those two orange points. What SciScope does is go off and talk to the USGS system. So it's got custom code--here is this custom web REST interface, whatever the interface is; somebody wrote the code to go talk to USGS--and it does it on Sunday night, and it says, tell me about the data you have, and the USGS says, well, we have stream gauges here and here, and we have data in this, this, and this time range, and SciScope scribbles that down in its own database. So when you do a SciScope query, you are actually querying a metadata catalog about what and when and where the data is, and you are getting these hits back. And this brings up this idea that if you had a whole bunch of SciScopes built out there and they all knew about each other, then they could each service a different type of request, and if you gave one a keyword that that particular SciScope didn't know, then it could hand the query to another SciScope that did know it, and so you are confederating data systems, and this is all behind the scenes. As a user, your experience is: I come to the map, I set up my search, I say go. SciScope does everything else invisibly, and that is why we call it the one-stop-shopping idea, because I could've searched for diazinon, which is at the EPA STAR website. So the idea here is that SciScope is a metadata catalog. It doesn't know what the data is; it just knows where it is and when it is and what the keyword is. So far, so good, but I actually want this data. So I click on one of these icons, and now SciScope says, oh, you are serious, okay. So finally SciScope goes over and talks to a data repository that does have the data, and it gets it back, and it hands it to you in some form.
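The harvest-then-query pattern described here--the catalog stores only what, where, and when data exists, and the data itself stays at the source--can be sketched in a few lines. The records and URLs below are made up for illustration; this is the shape of the idea, not SciScope's actual code:

```python
# A toy metadata catalog in the SciScope style: harvest records that
# describe WHERE/WHEN/WHAT data exists, without copying the data itself.

catalog = []  # each entry: (keyword, lat, lon, start_year, end_year, source_url)

def harvest(records):
    """Simulate the Sunday-night harvest: store metadata only, not data."""
    catalog.extend(records)

def search(keyword, bbox, year):
    """Answer a map query from the catalog alone; hits point back to the source."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return [r for r in catalog
            if r[0] == keyword
            and lat_min <= r[1] <= lat_max
            and lon_min <= r[2] <= lon_max
            and r[3] <= year <= r[4]]

# Hypothetical harvested records, in the spirit of the USGS stream gauges
harvest([
    ("discharge", 47.1, -122.7, 1990, 2011, "https://example.usgs.gov/gauge/1"),
    ("diazinon", 35.2, -80.8, 2001, 2009, "https://example.epa.gov/star/42"),
])

hits = search("discharge", (46.0, 48.0, -123.0, -122.0), 2005)
print(len(hits))  # 1 hit; fetching the actual data is a separate call to source_url
```

Only when the user clicks a hit does the system follow `source_url` back to the repository that really holds the data, which is exactly the two-step flow described above.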
Okay. So the form that we chose is Excel spreadsheets, and there is a page of data and then a page of metadata which describes what agency gathered the data and other parameters about it. So that is where you would put things like error bars and so forth. SciScope was stood up and was running, and we had a great time with it, and it's actually a follow-on to an earlier version that's part of the--what's the water consortium named, the organization, I can't remember at the moment. But then what is our exit strategy, because we are Microsoft Research. We can't sit here and maintain this system in perpetuity, because we have to move on to the next thing. So our exit strategy was to simply publish SciScope and keep it standing up at the Berkeley Water Center, except it occasionally goes down, and then to publish the code. So the code is out there on CodePlex; it is published essentially as open source. And the idea is, if we communicate about this and people think this is a really good idea, they can go get the bits and they can stand up their own SciScope. Now, it is cool when that happens, when people sort of get excited about it and start to adopt it. We didn't think much of it; we kind of walked away from it and we just let it sit for a year or two, but now the Open Geospatial Consortium is starting to get interested in designs like this, and so that is kind of where it is. So the exit strategy might be that we stand it up for a while and maintain it and see what kind of response we get, or we can just put it out there as open source and move on.
And then, so that is my little spiel about SciScope, and I am really not trying to promote it for use; I am just trying to promote it as a set of ideas: you don't always have to think about data repositories. You can index into them using your technological tools. I will mention in passing another project that we have going on, one that is more along the lines of an ongoing, we're-going-to-keep-doing-this kind of deal, and it has to do with publications and authors and so forth. It is called Academic Search, and you can use it to search through however much content we have indexed. But it also has a co-author graph, so you can put Jan Newton up here and you can find what her connection is to other authors. I will just try to type one in and see if it works. [laughter].
>>: Direct connection.
>> Rob Fatland: So Jan is connected to her co-authors. Let's see what happens if it's [laughter]. Ah,
good. So Jan apparently wrote a paper with David Kirchman who wrote a paper with Rudolf Amann who
wrote a paper with Max Planck who wrote a paper with Einstein, so your Einstein number is four.
>>: Wow.
[laughter].
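The co-author chain just demonstrated is a shortest path in an undirected co-authorship graph; the "Einstein number" is the hop count. A breadth-first-search sketch over the chain named in the talk--the edges are illustrative, taken from the spoken example rather than real publication data:

```python
from collections import deque

# Co-authorship edges mirroring the chain named in the demo (illustrative only)
edges = [
    ("Jan Newton", "David Kirchman"),
    ("David Kirchman", "Rudolf Amann"),
    ("Rudolf Amann", "Max Planck"),
    ("Max Planck", "Albert Einstein"),
]

# Build an undirected adjacency map
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def coauthor_distance(start, goal):
    """Length of the shortest co-author chain, or -1 if the authors are unconnected."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        if author == goal:
            return dist
        for nxt in graph.get(author, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

print(coauthor_distance("Jan Newton", "Albert Einstein"))  # 4
```

Breadth-first search guarantees the first time the goal author is dequeued, the path length is minimal, which is why the demo can report "your Einstein number is four" rather than some longer chain.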
>> Rob Fatland: So this is a project we are taking very seriously. The exit strategy here is no exit strategy. We are just going to keep doing this. [laughter]. And actually the real exit strategy is that it will hopefully become adopted in perpetuity by other parts of the company that are better at maintaining things like this, but right now we're just having a really good time going to different organizations. We go to different publishers now and we say, can we index your stuff? And in the Academic Search part you can probably pull up an abstract, but you can't necessarily pull up the full PDF of the paper that you find, unless the publisher has made that available. So there you go, Academic Search, and that was en route to my last point here, which is getting back to Layerscape. So Layerscape again is a portal site, and I was going to digress into portals and how portals are dangerous things, but I think I will just leave that alone and try to wrestle this thing to the ground. The important thing about the Layerscape website: right now it's called communities.worldwidetelescope.org, and in a couple of weeks it will just become Layerscape.org and we will do our gold release. But when you get here, you can see some featured content over here, and if I click on that, hopefully it will take me to the page of that featured content, and you can get an idea of what WorldWide Telescope is capable of doing by hitting play on the video that is there--but not all of the content has a video associated with it.
[video begins].
>> Rob Fatland: This is just amazingly slow. Let me just try something here. That's much better. So I can play this tour as a video here, or I can actually say View Tour, and for that I have to install WorldWide Telescope on my PC. But you will notice that there are ratings built in here, and there is a publish button up there. So if I want to publish my own content--I have some new content that I think is worth putting in here--then publishing it can be as simple as just browsing to the file that you want to publish. It can be a WorldWide Telescope tour. It can be an Excel spreadsheet. It can be a JPEG. It can be any kind of file; there is no restriction. And then I can click the publish button, but if I want to I can also add some descriptive information, a little thumbnail for it, and then I can give a citation. And that is sort of the last thing that I want to mention: by citing data in this way, or citing a source, I am accruing value to the person who provided it. So if you give me some data because you want to see what it looks like in WorldWide Telescope, I will definitely be putting your name, or whatever you like, in that citation box. And so it is a step towards this thing that is important to me, which is incentivizing people to publish data. If I say to you, hey, I need you to publish data, and you agree with me, that's great; then your next question is, how hard is it to do? And I say, well, you have to go through the long process of filling out multiple forms of metadata, you have to describe your experimental procedure, the papers you published--and pretty soon you have decided not to do it, right? So I just have this interesting thought: wouldn't it be nice if we could make the publication process such that you get to choose how difficult it is. So the easiest possible thing would be like it is here, where I just have to upload the file and click publish, and for that there is no metadata--so how trustworthy is it? There is not much to it; the data is not really usable if I just click a spot on the map. But if you make the easiest barrier
to publishing really small, then maybe some people will start stepping across that and you sort of try to
go for that accumulation of momentum, and pretty soon people are saying, well, I have to publish my data set--there goes the next two hours--but they realize they are going to get something out of that. Let's say it ends up getting used; then that will be something they can carry to their tenure committee, or something they get value back on from their community, in the case of an operational data set. So I just wanted to throw that out there: we think about all different aspects of the data process, and for me an important one is incentivizing people to actually publish data sets, because I think we've all had the conversation about how much data we have sitting back in our careers, on our shelves and on floppy disks and so forth, and wouldn't it be nice to get access to all of that? So that is one of our topics to consider. With that, I will just say that it would be great if anybody wants to go and experiment with Layerscape and find out what is there. Hopefully you would be not too averse to installing WorldWide Telescope and seeing it in its native environment. Send me e-mails if you have questions or comments. We love to get feedback, and that is, I think, everything we will do there. So again, thanks for coming here. We will take a coffee break and then I think we will roll over to the discussion phase that will be moderated by Jan, and that's it, so thanks.
[applause].