>> Eric Horvitz: So it's an honor to have David Karger with us today. David is a
professor of computer science at MIT in the CSAIL laboratory. He's a fellow of the
ACM. He's on the Scientific Advisory Board of Web Science Research Institute.
He received a 2003 National Academy of Science award for Initiatives in
Research. Sounds like a great thing to get, and sounds like you, David, taking
initiatives in research.
His work has spanned a variety of areas, including algorithms, information
retrieval, networking, peer-to-peer systems, machine learning, communication
coding. I can go on. I'll just say that when I look at people that are out there in the
world, I always think that David Karger's breadth of interest sort of reminds me
of myself. And one of the best compliments I've ever gotten was when Jaime Teevan,
one of your students who is now on our team, said you remind me of David
Karger. I said oh, really? That was a big compliment for me.
His recent work is focused on developing tools to let users and groups gather,
manage, visualize and share information. And today he'll be talking about
helping regular people communicate on the Web through rich interactive data
visualizations. David.
>> David Karger: Thanks, Eric. It's a pleasure to be here. Some of you know,
some of you may not, that I've actually been working for Microsoft for a year -- well,
I spent a sabbatical here and I've been working here half time for the past six
months or so pushing forward some of these ideas. Oh, maybe we'll get the KAI
announcements on the projector. [laughter].
So this reflects the work that I sort of started doing at MIT but which brought me
into Microsoft and which I'm still trying to sort of advance here as well as back at
MIT. The title's a bit of a riff on all of the massive data work that everybody is
talking about these days. Everybody talks about these huge datasets and all of
the challenges that we face in trying to process them and so on.
I want to bring a somewhat alternative message that there's a ton of stuff for us
to do with small datasets that can be very valuable. And I'd like to see us
advance in that direction. Okay. So I always like to start with the conclusion in
case people have to leave early.
And so here it is. What I'm going to be arguing today is that structure really
enhances the value of information over sort of unstructured information like text,
if you can get it, in that it allows all kinds of rich visualization and interactions and
makes it easier for you to take that information and repurpose it, combine it with
other information, so on and so forth.
And I'm also going to argue that working with structured information can and
should be done by end users. And that there's a big gap between what they can
do and what we are allowing them to do with today's tools. They can author
structured information, they can author the interactions of that information. They
can combine and repurpose the information. And they can do all of this with very
simple editing tools, as opposed to learning to become database engineers. And
I'll show you some of those tools that you can use right now that we've got out in
the -- out in the wild.
The key to all of this is a real focus on sort of a data-centric architecture. Instead
of thinking about complex applications, I think the right perspective for this class
of users is to think of sort of lightweight skins of style over datasets and helping
regular people author these themes or these styles through direct manipulation
without any -- using any kind of fancy APIs.
Now, I also wanted to sort of give an early alert. There was some debate about
which talk I should give and Eric said well, you've got an hour and a half, so give
both. And so what I'm going to do is I'm going to give this talk that I've just sort of
given the conclusion for and then, after giving anybody who wants to a chance to
leave, I'm going to spend a few minutes sort of surveying a bunch of the other
projects that my research group is involved with that I'd be more than happy to
talk to everybody about -- to talk to people about or, you know, find some
interesting collaborations and such. Okay?
So get set for a marathon or to leave early. So I want to talk about these five problems
that I think we see a lot on the Web today. The main one being that individuals
cannot communicate their ideas as effectively as large, wealthy, well-skilled
organizations that can hire teams of developers to do fancier things.
On the flip side, there's the problem that a lot of the information that people would
like to find or work with on the Web they can't find it because either it's never
been published or because it's been published in a way that they can't effectively
make use of it.
As funny examples of this that I find particularly interesting, we've had
this whole data.gov effort, this effort by the United States government to put all
sorts of datasets up online. And there's tons and tons of datasets. But somehow
they're not really -- they haven't had as much impact as some of us imagined that
they would have when the government started putting all this data out. They just
kind of sit there.
Conversely, scientists have lots and lots of data which they never actually
release. It stays inside of their -- they hug it close and just publish little -- you
know, little line charts about their data, but the data itself doesn't come out.
So I want to tackle all these problems with one hammer. You know, every
academic has to have their hammer. And that is to make gathering and
publishing rich interactive data visualizations as easy as publishing text.
So to motivate this, let's look at -- back at the Web in the early days. We had
these boring but exciting at the time static Web pages, right? We could -- you
know, anybody who wanted to could come along and author a little bit of HTML
and put it on their website, and all of a sudden they were one of these early
adopters of the Web.
In fact, most people didn't even author a webpage. Right? All that they would do
is copy somebody else's and make a couple of changes to make it into their own.
Yes. So this sort of monkey see, monkey do, was a really important part of the
spread of the early Web. Nobody learned anything. It was just copying what
other people do.
>>: [inaudible].
>> David Karger: What? Okay. There were also plenty of typos in the early
Web. All right. So now things have gotten a little bit easier. People don't really
have to author source code anymore, instead they put up blogs or post in forums
or collaboratively edit wikis that are out on the Web.
So now, you know, just about anybody can put stuff out on the Web if they want
to.
And so the Web created this whole society of authors. All of a sudden huge
numbers of people began to publish information on the Web and consume
information that other people had published. But it's worth asking. What really
changed with the Web? Was it the ability to fetch remote content from a server
anywhere in the world? Well, no, we could always do that, right? You just start
up your FTP client and, you know, somebody else has authored a document
and put it on their FTP server and you type in a few arcane commands and there
you've got the document and you can look at it in your word processor or
whatever other tool's appropriate. Right?
So no real change. We could always do this. But the Web introduced these very
minor workflow changes, right, it created these URLs that packaged up this
whole fetching process in one string, right? And the click as a way to access that
without typing in all of those arcane commands and then it created this browser
as a place where you would just stay as you were accessing all of this remote
content, you wouldn't have to keep launching new applications in order to see
everything that you were looking at.
And on top of this, we've got this copy, paste, tweak ecology that I've already
mentioned that let people author without learning anything complicated.
So the Web really didn't make anything new possible, it just made certain
stuff simpler, and that was -- that caused the revolution. It's just making things a
little bit easier to do so that you didn't have to think about them.
So there was this real virtuous cycle created of Web authoring where it was really
easy for people to find stuff on the Web, and that created an incentive for other
people to create stuff, because they knew that people were looking for it. And
they would get kudos from having put it up there.
There was also on the flip side the fact that it was really easy to create this
content. You didn't have to host anything -- you didn't have to build or install
anything yourself, you just put this file on to the Web server and you were all
done. So that was great. But now things have changed a little bit. Instead of
these 1990s Web pages, we've got fancy Web pages with all sorts of rich
interaction, right? You can filter and search and sort the information you're
looking at. It's presented in all sorts of fancy templates and just looks a lot more
interesting and is a lot more useful for navigation.
You've got rich visualizations like maps and timelines and such. And if you look,
I think that there's been a real split now, that people who have the resources, the
money or the skills, are able to create these incredible rich visualizations and
powerful interactive exploration and navigation. But plain user websites haven't
changed, okay?
These professional websites can afford to implement a rich data model where
they put all the information into databases and extract it using complex queries
and then feed it into templating Web servers. And on top of that they can exploit the
structure to do rich information interactions with filtering and sorting and all of
these fancy views. Plain authors don't know how to install a database, don't know
how to define all of the data that goes into the database, don't know how to write
the queries, don't know how to write [inaudible] to these rich visualizations, so
they're still basically limited to text pages. Okay. Whether it be wikis or forums
or blogs, it's basically text.
And so they have less power to communicate effectively and therefore less
incentive to publish the information that they could communicate.
So this is actually a modern plain user webpage. Somebody's really interested in
breakfast cereal. Right? And so he's made this webpage. But it's just a static
webpage, right. You can click to navigate to a particular brand and inside of the
particular brand you can see lists of breakfast cereal characters but there's no
filtering or sorting or any of the other stuff that we have grown to expect on a
fancy website.
There are content carriers like Flickr and Amazon and Epicurious that are hosting
sites for a particular kind of content that a user can contribute. Okay? So you
can post a book review to Amazon or you can post a recipe on a recipe site or a
photo on Flickr.
And these sites are -- do have rich interactions driven by their structured data
that the -- the site owners manage. Plain users can contribute to these
repositories and benefit from the structure when they explore or consume the
data. But there are real limits, right? You have to publish exactly the way that
the content carrier has chosen to arrange the information. You can't say I don't
like your schema, I don't like your organization. Okay?
There are plenty of book sites out there, but my wife actually keeps her books at
home sorted by the public -- by the birth date of the author. Okay?
Now, is that something that your typical book site is going to support? No. It's
something that works for her, and she might want to share that. But she can't
change the site to do that.
I maintain a folk dance video collection. So it's a video collection, but there's all
sorts of bizarre metadata for folk dances that is not part of the YouTube
metadata. So nobody can use that to navigate through my folk dance video
collection, okay? If a scientist wants to change paradigms their stuff is not going
to fit into the usual scientific databases. And if you get into really weird stuff
you're completely doomed. There isn't even a starting point for the kind of
information that you want to share.
It gets even worse between sites, of course, because each of these sites is managing
its own kind of information and if you want to combine the information from one
place with the information from another place then neither of the places is a good
starting point. You have to create a third place that integrates the information
from these two sources and create your own visualizations of that information.
And this is the whole sort of mash-up scene that is popular now. But again, in
order to create a mash-up you have to be a Web developer. You can't be just a
regular person taking information from two places and putting it together.
And of course the result is just another vertical website that again can't be
changed by anybody. Now, ideally, we want to democratize all of this. Anybody
should be able to do what the big content creators are doing, create interesting
data or find it on multiple sites, put it together, create rich visualizations of it with
interaction and make that available to anybody on the Web without knowing how
to program or install a database or even what a database schema is.
And if we can do this, then we're going to take this whole long tail of the Web, the
small people, and instead of having them only working on text, they'll become
sort of full fledged contributors to all of this rich information that we can find on
the Web today. So that's the motivation. What about the how? How do we
actually do all of this? Well, most of the Web, if you look at it, is CRUD, okay,
which means that there is a process of people creating information, reading that
information, occasionally updating it and even more rarely, deleting it. There isn't
necessarily a lot of computation over that information. Basically, the Web is a
big storage bank and you're looking at what has been stored.
And this is true even on these professional websites. And so if we can just
democratize that much, this process of creating and showing, we don't have to
worry about the computational aspects of data.
So I'm going to outline an approach which starts by observing that of course
publishing data is very easy. Anybody can put a spreadsheet on their Web
server. So the only challenging part is the visualization side. And so what we're
going to do is we're going to identify the key elements of the interactive
visualizations that we see on the Web, and we're just going to add them to the
HTML document vocabulary that everybody is already familiar with, okay. We're
going to make up some new tags that talk about data the same way as we talk
about images or video in Web documents today.
And we are going to configure those data visualizations by binding them to the
data that's in the page, the same way as you attach a chart to a spreadsheet in
Excel. Okay?
So let's look at a typical webpage. Here's that Epicurious page with its
searching, sorting, and filtering. And let's look at the elements on this page. So
some of them are very familiar, right? We've got images. Okay? We're all
familiar with images and how you create an image on a webpage by putting an
image tag into the webpage. But let's look at the data -- the interesting data part
of the webpage. What do we have here?
Well, we've got recipes, okay? So there's a bunch of items listed on this
webpage. And each of those items has a bunch of properties, a title, a source
magazine, a publication date, a rating, okay? And these are all presented to you in
this template that is being used over and over again to render the items on the
webpage. Okay?
At a higher level, we have a view, okay? So the view is this actually -- I don't
know why my boxes are breaking, but this whole big thing here which has some
summary information about a set of items and here a sorted list, what -- I guess I
only have 59 minutes to finish my talk. So there's here a list which can be sorted
by a variety of properties like best match and a template that's used to present
every item in the list.
Up here we've got what are called facets. Faceted navigation has become
typical on the Web. You get various categories that you can use to filter your
items. So here we're filtering according to a main ingredient. But it also offers
the possibility of filtering by course or dish or season. All of these are ways to
filter -- to narrow down the set of information that you're looking at. And of
course there's also a text search.
So these in fact are the basic keys -- the key elements of basically all the interactive
Web pages that we see on the Web, right? There's data, there's templates for
rendering individual data items that tell you basically how should the properties of
each item be laid out. There are views which are ways of looking at collections
of items. Are we going to have to do this over and over again? Okay. Let's just
go hide that over there. And let's -- as I said, 59 minutes from now, we will see
what happens.
And these -- so there are these views, like lists or thumbnail collections on Flickr
or maps or scatter plots, whatever you want to -- whatever you want to do to
display a collection of items. And again, these are connected to data by
specifying which properties of the information determine the layout or the position
of the individual items in the view. So for a map, you need a latitude and
longitude for every item in order to plot it. And then these facets for doing the
filtering.
So what we're going to do is make it possible for people to author all of these.
And, in fact, they already can. Okay? So for example, if you think about data,
well people use spreadsheets to author those all the time. Is it closing my -- didn't it say that I had 59 minutes?
>>: [inaudible].
>> David Karger: I guess we're going to take a small break now while the
computer reboots.
>>: [inaudible].
>> David Karger: Yes. Go ahead. I can certainly take questions.
>>: So you have data, templates, views and facets [inaudible] and building blocks.
Do you also have grouping? Is that part of views maybe?
>> David Karger: Yes. Grouping is one of the things that you might want to do
within a view, sort of when you're looking at a collection, right, and so, in fact, the
list -- now it's not even -- there we go. And so in fact the kind of list views that I
will show you once we get that started up again do support grouping as well as
sorting of the items in the list.
>>: [inaudible].
>> David Karger: But of course [inaudible] spreadsheets isn't going to work and
that's sort of the argument that I want to make because they're boring, right?
Who wants to look at a table full of -- you know, at a bunch of columns of data on
the Web, right, when we've got all these beautiful --
>>: And moreover, there's [inaudible].
>> David Karger: Yes, there is indeed.
>>: And [inaudible].
>> David Karger: Yes. And so actually this comes back to -- this comes back to
what I was just saying to Eric. Because there's so much -- it just made me
change my password. And now we have to see if I can remember what it is.
There's a ton of text on the Web and so there's a tremendous amount of work
going into figure out how to -- figuring out how to extract structured information
from where it got put into the text on the Web. Okay? And this is obviously
necessary work, right, that, you know, we're not -- this text isn't going anywhere,
there's a ton of value in it, let's get it out, okay.
But what I think is being neglected is the question of how do we get people to
author the information as structured data in the first place so that we don't have
to figure out sophisticated algorithms for extracting structured information from
text?
>>: But that requires [inaudible].
>> David Karger: Well [inaudible].
>>: [inaudible].
>> David Karger: It doesn't necessarily require planning ahead if -- and this is
the point of my talk, if we can give them an incentive to do it right away. If we
can say you'll actually have a better time, it will be better for you to publish this
data as structured data than it would be for you to publish it as text.
>>: [inaudible].
>> David Karger: What was that?
>>: And as easy. That's right. We have to create incentives and remove
disincentives.
>>: [inaudible] and the Epicurious people certainly somewhere [inaudible]
they've decided not to publish it.
>> David Karger: Correct. And there's a great -- but there's a great discussion
about incentives for sharing structured data. And disincentives for sharing
structured data which parallels but is -- sorry, is close to but not identical to the
discussions about sharing information on the Web in general. And I've got some
slides about that towards the end because I think it's a very -- it's a very
important question. Okay. Where were we?
So actually it was a reasonable time to stop because we had just gotten through
the sort of motivation story and I was going to tell you how to do everything.
What's that doing -- no, that's not the right place. Okay. Let's charge forward.
Okay. Communicating with data. Okay. So we identified these elements: views,
facets, templates and the data. Can people author them? Well, they're authoring
data in spreadsheets. They create views all the time inside of those
spreadsheets by making charts, okay, by specifying which columns in the
spreadsheet go to which chart -- where in which chart.
Facets are a lot like views. You need to specify which column of the data you
want to filter on. And again, that's something that spreadsheets make available.
And templates, well, templates are everywhere. We have document templates in
Microsoft Word for example and people are comfortable working with those. But
they aren't doing it so much on the Web yet. Okay?
So we created a proof of concept implementation of this idea, of creating new
tags that will make things as easy to author in a webpage as text and images. It's
designed to let somebody publish an interactive data visualization by putting two
files on their website. One of the files is a data file, which can be a spreadsheet
or a CSV or a variety of other formats, and the other is an HTML document with
these -- with these added tags like lens tags and view tags and facet tags to
specify the different data interactive elements.
We have a JavaScript library that interprets these tags and makes the right
things happen, so the user doesn't see -- doesn't program anything and doesn't
know how it happens. If they just put this magic JavaScript file -- link into their
header, everything just works. And it all runs in the visitor's browser. So there's
nothing to install on the server side.
So let's walk through some demos of that. Hello. Anybody see a browser?
There's a browser. Okay. So here is an exhibit that was created using our
framework. I'm afraid I have a lot of pages to load, so there's going to have to be
some waiting here. Good. So this is an exhibit about the presidents of the
United States. We have a timeline showing when they were in office. We have a
map showing where they were born.
On each of them you've got an icon showing them, and you've got some
additional information that you can pull up in a bubble. Over on the left we've got
-- no, I don't want to install more updates. We've got filters so you can filter on
the religion of the president, for example and see the -- both of the views update
with that. There -- it's a sort of a combination filter. You can add in multiple
values. And you can also intersect with other restrictions. So here I'm looking
only at the Democratic Presbyterians and, you know, the view shows me how
interesting. They all lived on the East Coast. Not clear why.
Okay. There's text search if I want to home in on a particular -- on a particular
president. There are also alternative visualizations. So besides this map of
where people were born, here we've got maps of death -- of where people died,
okay? Again, we've still got the same ability to pull up information about any one
of them. We also have here color coding according to party. We've got a detail
view which is your more typical tabular view. You can filter on different
properties. And basically continue to use the facets to filter this information. So
now we're getting only the rows of the Episcopalians.
So that's an -- that's a typical rich information visualization that's got the things
that you're used to, views and facets and text search and templates for individual
elements. How is it created? Well, let's look at the HTML. As I said, we've got a
JavaScript library that interprets these new HTML tags that we've created. And
down here inside of the perfectly normal HTML document are the tags
themselves. So we start with a data file. We should take a look at that. Which in
this case is represented as JSON but as I say could just as easily be a
spreadsheet. And it's basically a collection of items, each of which has some
properties like their name and what terms they were in and whether they died in
office and their date of birth and some values for each of those.
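A minimal sketch of the kind of data file described here: a flat collection of items, each carrying a handful of named properties. The property names (`label`, `inDate`, `outDate`, `diedInOffice` and so on) and the two sample records are assumptions chosen for illustration, not the contents of the actual exhibit's file, and the helper functions are hypothetical.

```python
import json

# Hypothetical data file: a collection of items, each a bag of properties.
# Property names and records are illustrative, not the exhibit's real file.
DATA_FILE = """
{
  "items": [
    {"label": "George Washington", "party": "None",
     "religion": "Episcopalian", "inDate": "1789-04-30",
     "outDate": "1797-03-04", "diedInOffice": false},
    {"label": "William Henry Harrison", "party": "Whig",
     "religion": "Episcopalian", "inDate": "1841-03-04",
     "outDate": "1841-04-04", "diedInOffice": true}
  ]
}
"""


def load_items(text):
    """Parse the data file and return its list of item dictionaries."""
    return json.loads(text)["items"]


def property_names(items):
    """Collect every property name that appears on at least one item."""
    names = set()
    for item in items:
        names.update(item.keys())
    return sorted(names)


items = load_items(DATA_FILE)
print(property_names(items))
```

Nothing here needs a schema declared up front: whatever properties the items carry are the properties that views, facets and templates can later refer to by name.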
On top of that, we create these special tags. So the reason that there's a facet
on the left that lets you filter by religion is that we put a facet tag into the
document, okay? And all it says is make a facet and use the religion property as
the thing that you filter on.
Similarly for the party and whether or not they died in office, okay? They're just
simple tags like any other.
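The filtering behavior shown in the demo (several values can be added within one facet, and separate facets intersect) can be sketched as follows. This is an illustration of the idea only, under the assumption that each item has a single value per property; the function names and sample items are hypothetical and this is not the library's actual code.

```python
def apply_facets(items, selections):
    """Return the items that satisfy every facet.

    `selections` maps a property name to the set of values chosen in
    that facet. Values inside one facet are OR-ed together, separate
    facets are AND-ed, and a facet with nothing chosen filters nothing.
    """
    kept = []
    for item in items:
        if all(not chosen or item.get(prop) in chosen
               for prop, chosen in selections.items()):
            kept.append(item)
    return kept


def facet_counts(items, prop):
    """Count how many items carry each value of `prop` (shown beside a facet)."""
    counts = {}
    for item in items:
        value = item.get(prop)
        counts[value] = counts.get(value, 0) + 1
    return counts


# Hypothetical items, for illustration only.
sample = [
    {"label": "A", "party": "Democratic", "religion": "Presbyterian"},
    {"label": "B", "party": "Republican", "religion": "Presbyterian"},
    {"label": "C", "party": "Democratic", "religion": "Episcopalian"},
    {"label": "D", "party": "Whig", "religion": "Unitarian"},
]

picked = apply_facets(sample, {
    "religion": {"Presbyterian", "Episcopalian"},  # two values in one facet
    "party": {"Democratic"},                       # intersected with another
})
print([item["label"] for item in picked])
```

Because the whole dataset is small and already in memory, a linear scan like this is enough to make every view update immediately when a facet changes.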
Lower down, much like inserting a -- an image, you use one HTML tag to make a
timeline, okay? And so actually the only required properties for this timeline are
the in date and out date which are the things that specify the start -- or just the in
date to specify a start date for the timeline and then you can optionally specify an
end for each element of the timeline using another property. So those are start
and end. Everything else here is optional.
But by specifying a color key, you can tell it to color code the lines on the timeline
according to a different property. And then there are things for setting the
dimensions and so on and so forth. Similarly, further down we have a map view.
For the map view, the minimum that you need to specify is what property in the
data contains the latitudes and longitudes that should be used to plot items on
the map.
Here we've specified as well what property of the data contains a URL of an icon
or an image that should be put into the point that's plotted on to the map. And
then there are again the usual, you know, center the map here and make it this
big and so on and so forth.
Here's the death places map, which is done in much the same way, but we're
using a different property now to plot points on the map. So this is again very
similar to charting in Excel, right, you specify some columns and the roles that
they play in the chart.
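The binding described here, where a view is configured by naming which property plays which role, can be sketched as below. The assumptions are loud ones: dates are ISO strings, coordinates are stored as a single "lat,lng" string, and the property names (`inDate`, `outDate`, `birthLatLng`, `deathLatLng`) and items are made up for the example.

```python
from datetime import date


def timeline_spans(items, start, end=None):
    """Build (label, start, end) spans; only the start property is required."""
    spans = []
    for item in items:
        if not item.get(start):
            continue  # an item without a start date cannot be placed
        begin = date.fromisoformat(item[start])
        finish = date.fromisoformat(item[end]) if end and item.get(end) else None
        spans.append((item["label"], begin, finish))
    return sorted(spans, key=lambda span: span[1])


def map_points(items, latlng):
    """Build (label, lat, lng) points from the property named by `latlng`."""
    points = []
    for item in items:
        raw = item.get(latlng)
        if not raw:
            continue  # items lacking the property are simply not plotted
        lat, lng = (float(part) for part in raw.split(","))
        points.append((item["label"], lat, lng))
    return points


# Hypothetical items, for illustration only.
people = [
    {"label": "Item A", "inDate": "1801-03-04", "outDate": "1809-03-04",
     "birthLatLng": "38.0,-78.5", "deathLatLng": "38.0,-78.4"},
    {"label": "Item B", "inDate": "1797-03-04", "birthLatLng": "42.2,-71.0"},
]

print(timeline_spans(people, start="inDate", end="outDate"))
print(map_points(people, latlng="birthLatLng"))   # birthplace map
print(map_points(people, latlng="deathLatLng"))   # same view, other property
```

Switching from the birthplace map to the death-place map changes only the property name passed in, which is the same move as pointing a spreadsheet chart at a different column.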
The tabular view is equally easy. You just specify a list of the properties that
should be placed in columns of the tabular view. The last part is the templates
and these are implemented using tags that we call lens tags. And so here is a
lens, okay. And all it is is a fragment of HTML representing how the individual
items should look. And inside of that HTML, there -- it's like Mad Libs, you just
specify how different blanks in the template should be filled in using the
properties that are drawn from the specific item. Okay? So that's the extended
HTML vocabulary for specifying these kinds of visualizations.
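The "Mad Libs" idea behind a lens, an HTML fragment with blanks filled from one item's properties, can be approximated with Python's standard `string.Template`. This is a stand-in to show the concept, not the lens tag syntax itself; the markup, placeholder names and sample item are assumptions.

```python
from html import escape
from string import Template

# Hypothetical lens: an HTML fragment whose $blanks name item properties.
LENS = Template(
    '<div class="item"><img src="$imageURL">'
    '<b>$label</b> ($party), born $birth</div>'
)


def render_item(lens, item):
    """Fill the lens's blanks from one item, escaping values for HTML."""
    safe = {key: escape(str(value)) for key, value in item.items()}
    # safe_substitute leaves a blank untouched if the item lacks the property
    return lens.safe_substitute(safe)


# Hypothetical item, for illustration only.
print(render_item(LENS, {
    "label": "A & B",
    "party": "Whig",
    "birth": "1800",
    "imageURL": "a.png",
}))
```

The same fragment is applied to every item in a view, so the author designs the look of one item and the whole collection follows.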
So we put up a couple of visualizations. Here's one, our department directory.
This is using a thumbnail view. And you'll notice -- you asked about grouping so
this actually supports grouping. So if we for example go by floor it groups on this
category but then you can also sort by other things within that grouped category.
Okay?
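Grouping on one property and then sorting within each group, as in the directory view, can be sketched with `itertools.groupby`. The property names (`floor`, `label`) and the people listed are hypothetical, and this only illustrates the behavior rather than how the list view is actually implemented.

```python
from itertools import groupby


def grouped_view(items, group_by, sort_by):
    """Group items on one property and sort by another inside each group."""
    # groupby only merges adjacent items, so sort on the group key first
    ordered = sorted(items, key=lambda item: (item[group_by], item[sort_by]))
    return [
        (key, [item["label"] for item in members])
        for key, members in groupby(ordered, key=lambda item: item[group_by])
    ]


# Hypothetical directory entries, for illustration only.
directory = [
    {"label": "Dana", "floor": 9},
    {"label": "Avi", "floor": 8},
    {"label": "Chen", "floor": 9},
    {"label": "Bea", "floor": 8},
]

for floor, names in grouped_view(directory, group_by="floor", sort_by="label"):
    print(floor, names)
```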
We've got the obvious facets and each of these thumbnails is one of these
lenses, a template for a particular item. Here's another one we made more of a
chart kind of visualization. Okay. So this is a -- this is a line -- a line chart view
of some tab -- of some structured information. Again, you can facet in order to
filter certain parts of the -- of the dataset. Okay? By team or by year, so on and
so forth.
Once we put up a few and sort of announced it, other people started making
some, which was nice to prove that it wasn't only the designers of the tool that
could -- that could build visualizations. So let me load up a few since the network
--
>>: [inaudible]. So in the long run will [inaudible] simple queries to large
databases, other interesting issues [inaudible].
>> David Karger: Absolutely. But of course one of the arguments that I made
right at the beginning was that we shouldn't jump right ahead to thinking about
large databases because there are a ton of really small databases. And just the
challenge of working effectively with those if we could do that, I think we would
have tremendous progress.
>>: I heard that. But I was asking because we're actually facing something like
this in our healthcare area, and question would be, you know, what would -- in
the long-term and maybe wait until the end of your talk, what do you see coming
in terms of large scale data mining that's interactive with Web tools to help you
do that kind of thing.
>> David Karger: Right.
>>: In regular, kind of normal consumer oriented Web [inaudible].
>> David Karger: Yes. Well, so I think there's actually two very -- an important
differentiation in that data mining is one stage, someone investigating the data,
trying to figure out what's going on. But then there tends to be a communication
step. Somebody has figured something out and wants to share what it is that
they know. Okay? And you need -- the tools that you need for that are different.
And the datasets that you're working with are not going to be as large or as
confusing because you've already sort of homed in on what you want to convey.
>>: Like where the consumer model might be very simple but -- and but
like trends of various kinds that [inaudible].
>> David Karger: Right.
>>: But the back end is complicated.
>> David Karger: Yes. Yes. I mean again, that sort of trend computation, that's
part of this computation aspect which I sort of explicitly pushed to the side and
said let's concentrate on just authoring and seeing the results of a computation.
Okay? I think that's a lower bar than asking end users to actually be able to carry
out complex computations. Okay?
So let's run through a few other examples. Here's a map of ozone
concentrations around the world that was generated from actually a dataset on
data.gov. Here's some local newspaper using exhibits to show locations for their
Fringe Festival in Minneapolis.
Sorry. These take a while to load. But they interact very quickly because all of
the information is actually on the client, right? There's no -- there's no interaction
with the server to do the visualization.
So here, Gina Trapani made a nice little map of all of the places that she's
been, all the Broadway shows that she's been to and where they are in New York
City, musical versus play, what theater. Okay?
Here are vegetarian restaurants in Glasgow. I was going to take a little more
time on these, but I'm going to try to rush through them to make up what we lost.
Here's a nice visualization of sort of the history of classical music. Somebody
made a relatively complex template where inside of the template you actually
have a video, you can watch a performance of the music by that composer.
Okay. You can filter on periods. He's color coded the timeline by periods.
You can switch to alternative visualizations. Here's a map of Europe that is
showing -- you know, it's very dangerous to be a classical composer in Europe.
Because you -- because they all die there.
Here my -- here's my own publications page. Obviously every academic is
interested in making their publications as accessible as possible. So here you
can get a nice filter according to the different areas that I've worked in or the
different types of publication, the venues.
You can filter by different co-authors. Up here you can see a tag cloud -- sort of
a cloud view of my publications by year so that you can see I became more and
more productive until I got tenure and then it sort of drained off after that. Okay?
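The faceted filtering described above can be sketched in a few lines. This is a minimal illustration and not Exhibit's own code: the records, the field names (`area`, `year`, `coauthors`) and the helpers `apply_facets` and `facet_counts` are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical publication records; the field names are assumptions.
PUBLICATIONS = [
    {"title": "Paper A", "area": "algorithms", "year": 1999, "coauthors": ["Lee"]},
    {"title": "Paper B", "area": "information retrieval", "year": 2003,
     "coauthors": ["Lee", "Kim"]},
    {"title": "Paper C", "area": "algorithms", "year": 2003, "coauthors": []},
]


def _values(item, field):
    """Return the field's value(s) as a list so single and multi-valued fields look alike."""
    value = item.get(field)
    return value if isinstance(value, list) else [value]


def apply_facets(items, selections):
    """Keep items that match every facet; within one facet, any selected value matches."""
    return [
        item for item in items
        if all(
            any(v in wanted for v in _values(item, field))
            for field, wanted in selections.items()
        )
    ]


def facet_counts(items, field):
    """Count how many items carry each value of a facet, as shown next to facet entries."""
    counts = Counter()
    for item in items:
        counts.update(_values(item, field))
    return counts


if __name__ == "__main__":
    shown = apply_facets(PUBLICATIONS, {"area": {"algorithms"}, "year": {2003}})
    print([p["title"] for p in shown])
    print(facet_counts(PUBLICATIONS, "year"))
```

All of the filtering runs over an in-memory list, which is in the same spirit as keeping the data on the client: for a few hundred items there is nothing to ask a server for.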
Here's somebody showing microloans throughout the world, drawing off of a
spreadsheet that they maintained somewhere else. European Court of Human
Rights cases. All of this slowness is page loads from the net. The framework
itself is nice and fast.
Showing, you know, what's their status within the court, what's their type of
violation, who -- what the location of the court case is or where the violation took
place. The law library at Columbia uses it in order to help you sort of find your
way through the different available law resources. Somebody really interested in
soccer is using it to show the history of the World Cup.
>>: [inaudible].
>> David Karger: Well, then they're all going to be fighting each other for the
network bandwidth.
>>: [inaudible] authors become aware of this work?
>> David Karger: Word of mouth, okay? Or our publication. Or they ran into
another exhibit. Although actually this is one of the real problems with this
undertaking which I was going to talk about a little bit later is how do people
become aware of it. So you come and you visit -- you visit this webpage and you
say cool, it's got a timeline and interactive filtering and bubbles and so on and so
forth. I don't think it occurs to the regular user that they could have done it
themselves. They figure oh, somebody installed the database and wrote a
templating engine and so on and so forth. And I'll come back to that later on.
But sort of making people realize what they can do is I think one of the biggest
challenges here.
Somebody decided to use this for -- for genealogy and they actually authored a
new view, a family tree view, to let you look at the information that way. It's
integrated with the rest of the exhibits so that you can see the usual, you know,
timeline of when people were alive or a map of where they lived and so on. And
again, this is I think important for the ecology that now somebody instead of
having to write a whole new application for genealogy just has to address this
one particular need of a particular visualization, throw it into the framework and
they get everything else for free. So it's a finer grained approach to the
development of information and visualization than writing whole new applications.
Some newspapers have picked up on exhibits. So this is the St. Petersburg
Chronicle showing double dippers, those public officials who retire and then draw
big salaries as well as pensions with filtering by agency. Here we've got -- let's
see, where did my other newspapers go? Oh, yes, foreclosures in San
Francisco.
>>: [inaudible].
>> David Karger: Yes. Yes. Absolutely.
>>: [inaudible].
>> David Karger: Yes. That's a really, really good point. And -- so I'm a really
big fan of Many Eyes and for the beauty of the visualizations that we can create.
But, you know, I look and I see how often people want to have something that's
theirs, right? They own it. It's on their webpage. It's not -- and so even these
tools that sort of let you go build a visualization somewhere else and then embed
it in your webpage, it somehow, it's not yours. And I really want to -- to address
that demand for ownership that people have.
So there's another interesting thing about this one, which is a map of
foreclosures in the Bay Area, which is that all it is is a map, right? The Google
Maps API is perfectly capable of rendering this map. Why did they use the
exhibit framework? Well, because they didn't have to learn an API in order to
use the exhibit framework, they just had to make a dataset of points and point the
map thing out. Yeah?
>>: [inaudible] exhibit API.
>> David Karger: Well, they had to learn those HTML tags, but they never had to
write any JavaScript, right? And if it was somebody who didn't know JavaScript,
they didn't have to learn JavaScript in order to be able to write to the API.
>>: [inaudible] I mean the template for each is almost exactly the same and is
almost exactly the same number of lines.
>> David Karger: Yeah, but I -- okay. I believe that this is simpler in people's
minds, but I can't prove it. So I will sort of leave it as a debatable point.
>>: [inaudible].
>> David Karger: Right. That's just it. Authoring HTML is something that people
have done since the '90s.
>>: [inaudible].
>> David Karger: Yes, yes, I'll get to that. So here is the Star Tribune in
Minneapolis showing schools failing to meet standards. And they also showed
bridges, failing bridges. I'll let these load up while I'm talking.
It's also been used in some interesting scientific context. So here's somebody
who was studying language acquisition and this is all of their data for their PhD.
So this shows, you know, what interview questions were asked and unfortunately
I can't tell you too much about this exhibit because I don't know Japanese but
they provide a way to sort of navigate through their different acquisition subjects
and look at that information.
Here we've got some sort of brain gene expression data which again I can't
tell you anything about. I don't know what this means. But that's a good thing,
right, that somebody was able to create this without calling me and saying okay, I
need a computer scientist to help me deal with this brain expression -- this gene
expression data? Because what do I know about gene expression? They should
be able to do it themselves. Okay.
So here we have a pretty fancy view that they've created and here we've got their
-- the facets that they decided were important.
More biology stuff. Here is gene mutations. And again, I don't know what -- I
don't know what AA position and SNP SV and SIFT score are, but they seem to
matter to biologists so they put them in.
Here a biologist actually took our timeline and turned it into a gene viewer. So
the gene is a nice long sequence and so you can scroll over that sequence the
same way as you scroll over a timeline. And you can filter it on things like
protease and MEROPS families, whatever those are.
Okay. All right. So enough with the different visualizations. I basically wanted to
throw a lot at you in order to argue that lots of different visualizations can be
created using this framework without writing any new code. Now, it has some
scalability limitations because it is JavaScript, so Eric asked about this. It's nice
and interactive if you have less than 500 items. Somebody made an exhibit of all
the Lego sets that were ever sold, and there are 2,733 of those. And it still
works, it just slows down linearly with the number of items.
The problem actually is in manipulation of the DOM of the webpage to insert all
of these items that are being shown. It isn't a limitation per se, because as I've
argued, there are tons of small datasets out there.
What I would like to see happen is for this ability to template and render data to
become part of the browser layer. We're already seeing this, the latest releases
of browsers now have this database storage API in them. What they don't
yet have is a set of HTML tags that people can use to access the data that's in
that store without being JavaScript programmers. And that's the kind of thing
that I'm trying to demonstrate with the exhibit.
Okay. At that point, I think we would easily scale to 50,000 data items. Because
that's less than the amount of data that's in a typical webpage today, the sort of
two megabyte Web pages that you pull down with all of their style files and
things.
All right. I kept these in case something crashed, but these are just other
visualizations. Here's the Lego sets, all 2,700 of them.
So the argument here is that these kinds of pages can show people a reason to
publish structured data, right, if you publish structured data then you can have all
this fancy rich interactive visualization. Great.
It lets you communicate better. It's also actually easier to maintain, right? If you
have a map of all of the places that you've -- I'm sorry. Map's a bad example.
But if I've got a publications page, okay, then every time I create a new
publication I have to go and edit some new HTML and make sure that it looks
right with all of the other HTML. If instead I just have a spreadsheet where I keep
all of my publications, I just write a new row into that spreadsheet, I don't have to
do the formatting again, the page takes care of it for me. So this is the same sort
of motivation as CSS, right, the style sheets that there's a certain part of your
webpage that's just not going to change, and you should just concentrate on
changing the part that does change.
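A minimal sketch of that separation, assuming the publications live in a spreadsheet exported as CSV: the template is written once and the page is regenerated from whatever rows exist. The column names, the sample rows and `render_publications` are hypothetical, and this is ordinary Python templating rather than how Exhibit itself renders.

```python
import csv
import html
import io
from string import Template

# Stand-in for a spreadsheet export; the columns are assumptions for illustration.
SPREADSHEET_CSV = """title,venue,year
Paper A,Conf X,1999
Paper B,Journal Y,2003
"""

# The part of the page that does not change: one template applied to every row.
ITEM_TEMPLATE = Template("<li><b>$title</b>, <i>$venue</i> ($year)</li>")


def render_publications(csv_text):
    """Render every spreadsheet row through the same template into an HTML list."""
    rows = csv.DictReader(io.StringIO(csv_text))
    items = [
        ITEM_TEMPLATE.substitute({key: html.escape(value) for key, value in row.items()})
        for row in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(render_publications(SPREADSHEET_CSV))
    # Adding a publication means adding a row; the template is untouched.
    print(render_publications(SPREADSHEET_CSV + "Paper C,Workshop Z,2010\n"))
```

The design point is the same one made about CSS: the formatting is stated once, so the only thing an author edits over time is the data.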
Now, an interesting side effect of all of this is that we're convincing people to
author the structured data. Well, that structured data is now exposed. Other
people can access it and use it for other visualizations or critique it. So the
selfish incentive to communicate better is leading to this social benefit of making
data available.
And we tried to make this explicit. I can show you on any of the visualizations
that we've created where somebody hasn't turned it off. We produce a data copy
button. So if there's data on this page, you click that button and you can take the
data out in whatever format is suitable for your needs, okay? So we're trying to
create the same sort of copy-paste ecology as we saw in the earlier days of the
Web. You see something and you like it, you copy it and you change it. So you
pull it out, and you put it in.
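As a rough sketch of what a data copy button has to do, the same in-memory items can be serialized into whichever format the reader asks for. The item fields, the `{"items": [...]}` wrapper and the `export_items` helper are assumptions for this example, not a description of the actual button.

```python
import csv
import io
import json

# Hypothetical items behind a map view; the fields are assumptions.
ITEMS = [
    {"label": "Site A", "lat": 44.98, "lng": -93.27},
    {"label": "Site B", "lat": 44.95, "lng": -93.09},
]


def export_items(items, fmt):
    """Serialize the page's items as JSON or CSV so a reader can take the data away."""
    if fmt == "json":
        return json.dumps({"items": items}, indent=2)
    if fmt == "csv":
        # Use the union of all keys so items with extra properties still export.
        fields = sorted({key for item in items for key in item})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(items)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    print(export_items(ITEMS, "json"))
    print(export_items(ITEMS, "csv"))
```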
Now, we've seen signs of this. So actually here is an exhibit that we found on the
Web in the early days. And you can see there's a small issue with it, right? So
they started with an exhibit that we created of MIT Nobel Prize winners, and they
said that looks about like what I want. I just want to put some different data
there.
So they downloaded the visualization, which after all is just an HTML document,
and they changed the data file. And they just forgot to update the title of the
HTML document. So this is nice documented proof.
So there's two kinds of copying that can go on, right? You can copy down the
data or you can copy down the visualization because both of them are just static
files. There's nothing to install or program.
We've created this small set of views. There are many others. But as I showed
you, we -- our framework makes it possible for people to actually instantiate
some new kind of view like the genealogy view as part of the framework. And I
think that -- this is sort of the future that I wish on Web designers or on Web
developers. Instead of developing whole applications, they can develop
information visualizations that can be incorporated as part of entire visual --
entire pages that are authored by the end user.
Let's see. Let me skip over the -- well, okay. So another thing that we did with
exhibit was we replaced the MIT course catalog, okay? So I called in 4
undergraduates and said make me a new course catalog that's better than this
sort of big page of course listings that they have right now. Took them two days
to write the UI and two days to reformat the data. After it took six months to get
the data that we needed. Okay? So we contacted the registrar and we said we'd
like to make a better visualization of the MIT course catalog. And they said why
would you want to do that? Ours is perfectly good. And then they said, well,
wait, if you do that and there are any mistakes in what you do then people are
going to come see it and they're going to register for the wrong courses and it's
going to be our fault that we gave you the data.
So we had to argue all sorts of issues to convince them that opening up their
data would be a good thing. Once we put the course catalog up and they looked
at it, they said oh, that's why you wanted the data? That's really nice. And then
sort of a couple weeks later they gave us a direct line into their database to be
able to pull out the data whenever we needed it, because now there was a
reason for that data to be open and so they made it open. So again, I think the
visualization drives the creation -- or drives the opening up of data, and I think
that that's very valuable.
All right. Now, I left a big problem on the floor -- on the table here which is who
edits HTML source code these days? Right? Only -- only people like us, right?
Geeks. So we need to -- I mean so I believe that the framework, the idea of
these visualization tags is correct, but we have to -- to get to everybody, we have
to give them appropriate authoring tools for dealing with those tags. Okay? And
so we've built three different tools that show the kind -- that show the way that a
regular person can actually author.
So instead of being in the Web of the 1990s with source code authoring, you're in
the modern Web where people work with things like wikis or WYSIWYG editable
documents or blogs. And what we've done is we've said, well, since we're just
extending the HTML vocabulary, we should be able to go with the flow, use the
tools that people currently use to author different kinds of HTML and just
incorporate data authoring as part of those tools.
And so we've done this three different ways. We've done it in a Wiki, we've done
it through an editable stand alone document with a WYSIWYG editor, and we've
done it as part of the blog publication process.
So let me show you two of them briefly and then I'll spend a lot of time on data
blogging.
So one of the things we did was we added data visualization to Mediawiki, to the
software that runs Wikipedia. Not on that site, but to the underlying software.
We started with something called semantic Mediawiki which was a preexisting
extension for Wikipedia that lets you -- for Mediawiki that lets you put structured
data into the Wiki. You might think there's already structured data in the Wiki
because all of those pages have info boxes in them, you know. If you go to the
page for a particular President of the United States, there's this nice table on the
right-hand side which shows when they were born and when they went into office
and what was their party and so on and so forth. But that's not actually
structured data. Because the Mediawiki treats it as text. But somebody wrote an
extension to take all the information that you're putting into those templates and
put it into an actual database so that you can query that database and get back
the information -- and get back structured information from what people are
putting into the templates.
We simply enriched that with our exhibit framework. So the underlying extension
just gives you back a table of results. We shoved that table of results into all of
the rich information visualizations that we've created. You author them the same
way as you author any Wiki page. So if I go over here to my beer page, beer.
So I had a German student come out and do this. And he was very interested in
beer. And he made this visualization of different varieties of beer throughout the
world. And it's an exhibit like many of the other ones I've already shown you.
Okay. This is the list view where you can sort by type or what country it was
brewed in. We have a tabular view, and we also have a map showing where
everything was -- is created and little bubbles showing what kind of things
happen, showing the brand of the beer.
Now, if you look at the source of this page, this is the whole thing. So this is the
text extension where you can query the data inside of the Wiki and get back
columns of structured information. So you write in that query. And this was
already available as part of the Mediawiki extension.
What we added was the ability to say format the results as an exhibit and show
me a list view, a tabular view, and a map view of those results and put in a
brewed in country facet for filtering them. So anybody who's comfortable authoring
Wiki text can author a visualization like this.
You can also use the Wiki framework to edit the individual pages. And here
again, you'll see typical Wiki text in the Wiki, but as you edit it, it goes into the
database and then the view gets updated to reflect the modified visualization. So
that was our attempt to fit into the Wiki workflow. And you can play with that right
here, projects.csail.mit.edu/wibit. And it's world writable, so you can spam it if
you want.
The next thing that we worked with was documents. So everybody edits
documents but the documents that we edit right now have mainly text. Let's
introduce structured data into the documents. So here again -- here we go. So
here's a structured data document. And it's an exhibit but it's an editable one.
So I can edit the information right here. And it immediately updates in the
visualization. I can add new items, I can delete items, I can do all of this while I
am interacting with the data.
I can also interact with the visualization itself. So for example I can grab my
WYSIWYG editor and say I would like to add a -- sorry. My screen is a little
small. A facet to filter on the discipline within which these Nobel Prize winners
won their prize. And I would also like to go over here and add a timeline view
based on the Nobel year.
And when I'm done with all this editing, I now have the ability to filter on discipline
or look at a different visualization that I've just thrown in. And this is just a file.
So once I finished with editing the data and editing the visualization I click save,
and it's saved. And now I can e-mail this file to somebody else, I can put it into a
version control repository, I can upload it to SharePoint, I can put it on the Web
and let people interact with it that way. It's just a file, and all of the things that
you do with a file you can do with this document.
So that also -- and again, this works because we're just editing tags. So I grabbed
an open source HTML editor and just -- just told it about our tags as something
else that it should be able to edit and so it was very quick to build this -- to build
this prototype.
Now, the last thing that I want to talk about -- oh, but I have to say. Just this
morning, as I was putting these slides together, there was an announcement of
the Executable Paper Grand Challenge from Elsevier, in which they want a way for
scientists to be able to publish their papers, including the data and the ability for
somebody who's reading the paper to interact with the data.
>>: [inaudible].
>> David Karger: Yes. But it also sounds very DIDOish. DIDO is the tool that I
just showed you, right, a data integrated active document is exactly what these
scientists need. And again, I think that the opportunity to publish not just a boring
old line chart but some data that a reader can actually read can provide the kind
of incentive that we're looking for for the scientists to put their data out in their
publications.
Last thing I want to talk about is blogging of data. So we built a tool to integrate
these sort of rich visualizations into -- into WordPress. Before we started we
thought we would use this to test whether there is actually a need for data
publication. So we grabbed a bunch of the blogs off of Technorati, which tracks
lots of blogs, and looked through the articles that were being posted on those
blogs.
And here's what we found. 21 percent of the articles and 81 percent of the blogs
in their postings would enumerate the properties of a structured item. Okay? So,
you know, when somebody posts a review, they tend to throw lots of properties
of the item being reviewed into their text. Okay?
Also, 30 percent of the articles and 86 percent of the blogs did sort of data
comparison. So here we were -- this was election season and so people were
talking about polls and different results for different candidates.
91 percent of the articles actually referenced some data that were somewhere
else. 32 percent of them did explicitly and 59 percent of them just did it implicitly
by sort of referring to some data set that you couldn't actually link to.
What was -- but invariably this data was conveyed either as text or at best as an
HTML table, or maybe as a picture of a chart that they made with some other
tool. And I have to tell -- so Eric, were you involved with CPOF? So I heard this
great story about the command post of the future which was this very fancy
system for the military to let them --
>>: [inaudible].
>> David Karger: Okay. So they --
>>: [inaudible].
>> David Karger: They built this big fancy tool that would let the military create
these very rich information visualizations in this tool of, you know, all of the
units and the resources and the combatants and so on and so forth interact with
it.
The way this was used was that people with one instance of this command post
of the future would create this really rich visualization. Then they would
take a screenshot of it and send that over to another installation where they
would put that screenshot into the command post of the future installation that
they had.
So instead of carrying all of this rich data from one application to another, all that
you got was an image of the output, and you lost all of the richness, even though
it was the same tool on both sides. Okay? And we get some of the same thing
in data, in blogs, where people will create a rich visualization using some other
tool and put a picture of it in their blog posting.
So we made Datapress, which is a WordPress plugin. And it uses the standard
WordPress workflow. You upload or link to some data, and then you WYSIWYG
your visualization of that data using the regular WordPress blog post editor. So
here is sort of a worked example for somebody writing a blog post at a conference.
We gave a paper about this at the Semantic Web Conference last month.
They're writing their blog post. They go to the editor, and they notice that above
the toolbar there's some new stuff, namely a -- an upload for data and a
visualization button.
So you click on the visualization button and you point it at a spreadsheet, say,
where there is some data. Okay? And you type in that URL, okay? We've got a
little wizard so you say here's the data that I want to visualize. Then you can go
through and specify some visualizations that should be created over that data.
So here it's creating a table and you're specifying which columns from the
dataset should be included in that table.
Similarly you go through and you add some facets and you configure whether
you want it to be light boxed or part of the blog post or so on and so forth. Okay?
And then you have the ability to configure templates for the items, but
unfortunately since we said it was advanced nobody ever actually used that. So
we should have left that off.
So we put this out, and we actually studied -- we studied the way this tool got
used. So here we have some honest usage reporting. So if I go over here,
here's one -- here's a dataset, and here is a website called QuantNet that's
one of the users of our tools.
So this is a preexisting blog, which is still loading. And they made this nice
visualization of programs for different degrees in quantitative finance that you can
get. You can filter by the type of degree you want, where they're -- you can look
at a map of where they're located. Sort of usual exhibity stuff. But they did this
right in their blog.
We also had somebody use the blog to -- use Datapress to manage their
publications. And the third sort of case that we looked at closely was somebody
who used it to maintain a blog about the music scene in Portland, Maine.
>>: [inaudible].
>> David Karger: So we've created a plugin for WordPress and you download
that plugin from the WordPress site. You install it the same way you install any
WordPress plugin, and there you go. Okay?
>>: And could you also go a slightly different direction which is to be able to use
exhibit to manage the publications in the blog itself?
>> David Karger: Yes. So I haven't yet connected sort of the data editing tool
with the blog tool. Although I have to say people seem pretty happy, you know,
having their dataset somewhere else and just using -- so here's music in
Portland, okay? Now, talking a little bit about these -- that wasn't what I wanted.
Hang on.
So, yes, we have this quantitative finance guy. Here are some other
visualizations that were created using Datapress. Some data dump from
data.gov. Publications. The Semantic Web Conference itself. Here is a bar
graph of where people -- of where the people at this conference came from.
Here's one that I thought was quite interesting. This was a blog post that
somebody made. And notice this part. So this guy basically says, you know, this
is a thing that lets me use exhibit to put something into my blog. I thought about
visualizing this data using exhibit, but I didn't have time or was too lazy to
program and so Datapress saved me. It was sort of easy enough to do that I
finally went and did it.
And again, I think this is the key is if you can make it simple enough, then all of
this restrained creativity will burst forth and you'll see lots of this stuff being
created.
The data can come from a spreadsheet or from -- that is uploaded to your blog if
you really want it to be yours or else you can just link to a data source that's
somewhere else. So for example a Google spreadsheet.
You can't link to a Windows Live spreadsheet because they haven't provided a
data output for their spreadsheets. Or you can link to the data that's stored in a
Wiki somewhere -- in our Wiki extension in wibits somewhere. Or you can link to
data that's on somebody else's blog because obviously a lot of what people like
to do in their blog postings is talk about other people's blog posting. And so this
way you can visualize the way somebody else visualized the data and put up
your own visualization of that data to compete with it.
You can also create sort of a blog data feed where for example if you are writing
a blog of book reviews, then each time you put in a book review you can also put
in the structured data for that book. And all of those individual data items
become a single feed of data that you can incorporate into a sort of an aggregate
posting.
So here in the blog post editor you can insert a data item and say what data item
you're talking about. So here's a set of templates you can use and you specify
what kind of thing that you want to add and fill in the fields for it. And that goes
into the blog's database.
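The blog data feed idea can be sketched as a simple aggregation: each post carries zero or more structured items, and the feed is the concatenation of those items across posts. The post structure, the `data_items` and `source_post` keys, and `build_feed` are hypothetical names for this illustration and are not taken from the Datapress plugin.

```python
# Hypothetical blog posts; "data_items" holds the structured data attached to each post.
POSTS = [
    {"title": "Review: Book One",
     "data_items": [{"type": "Book", "label": "Book One", "rating": 4}]},
    {"title": "Conference notes", "data_items": []},
    {"title": "Review: Book Two",
     "data_items": [{"type": "Book", "label": "Book Two", "rating": 5}]},
]


def build_feed(posts, item_type=None):
    """Collect the structured items from every post into one feed, in post order.

    Each item is copied and tagged with the post it came from, so an aggregate
    view can link back to the original review.
    """
    feed = []
    for post in posts:
        for item in post.get("data_items", []):
            if item_type is None or item.get("type") == item_type:
                feed.append({**item, "source_post": post["title"]})
    return feed


if __name__ == "__main__":
    for entry in build_feed(POSTS, item_type="Book"):
        print(entry["label"], entry["rating"], "from", entry["source_post"])
```

An aggregate posting, such as a table of all reviewed books, would then be an ordinary visualization over this one feed.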
So we studied this. We got about 120 downloads at the time we were writing the
paper. 36 people participated in the study, created 94 visualizations. We got
75,000 page views and only one bug report. So that suggests that we're pretty
robust. Because the exhibit framework's been around for quite a few years and
had its bug fix -- bugs fixed. And then we have these 3 in-depth interviews on
the sites that I showed you, the publications, the quantitative finance network and
the factory Portland music website.
So here they are again. This is the botanist with their publications, the music
lover and the pro blogger about quantitative finance.
So here are some observations about their experiences. Our subjects, they
found the Datapress plugin after months of looking. Okay? Which means that in
those months of looking, they had a need and it wasn't being met by the tools
they were able to find. So I think this provides some evidence that there is a
need and that we don't have the tools for it.
The botanist actually went so far as to install WordPress just to be able to use the
Datapress plugin. The others had preexisting WordPress sites. None of them
wanted to be hacking HTML source code. Okay? And what was very interesting
to me was their limited ambitions. If you look at it, two of the three sites, all they
-- all they created using Datapress was a table. And HTML already has tables.
Okay? But they wanted the filtering and they wanted the ease of maintenance
that instead of having to go in and hack that table, they could just edit their
spreadsheet with additional data. Okay?
Now, what was also interesting was that we asked them why they had only made
the tables, and two hours later the finance network had a map and a timeline of
the stuff. They just hadn't again realized -- thought about what they could do with
the tool. And so they didn't try. Okay?
Again, I made this point before, but the ability to visualize drives the structuring of
data. Right? The botanist had an HTML publications page, but once we told her
that she could do all this visualization stuff, she turned it into a Google
spreadsheet. And that added structure to the Web. The musician had a
hand-edited table. But they moved their data into a data file in order to create the
visualization that they wanted.
Two of these three people actually asked for a collaborative -- they wanted their
users to be able to collaborate in the management of the data. They didn't know
about our wibit project, but that would have been the natural way for them to do
that.
Somebody asked about sharing back at the beginning. And there were a lot of --
there was a very interesting spread of perspectives about what is -- about the
sharing. You're making this data available. Does that mean anybody should be
able to take it and use it for something else? Well, one person was perfectly
happy to share everything. One thought that it was fine to reuse the stuff if there
was proper attribution, okay. Another, and this is a little odd, thought
it was fine for somebody to copy the visualization or to copy the data but not both
because that would be copying.
The musician was the most interesting of all. He actually sort of thought about
trying to go out and convince other music oriented people in Portland to turn -- to
expose their data so that he could aggregate it into his website and so on. But
he also went and learned just enough CSS to hide the data copy button that we
provided. Because he didn't want other people to take his data.
>>: [inaudible].
>> David Karger: Okay? So making this data -- so, you know, looking at data
like this doesn't magically solve all of the copying problems that we -- that we
already have with text. In fact, it makes them worse. Because with text you have
copyright. With data you can't copyright data. There are court cases, right?
If you have facts, you're out of luck. If they're available, anybody else can copy
them. And so this -- this -- this does pose a challenge. But what can we do? It's
something that we'll have to address the same way as we address other forms of
copying on the Web.
So I'm going to finish with a couple of perspectives on all of this. First, this is the
first webpage from CERN back in 1994 or something like that. This is the CERN
webpage today, okay? It's gotten a lot fancier. The big difference is that, you
know, there's a lot of CSS, there are guidelines for how you should style things
so that they look the way they should. Well, I think that if we moved in 15 years to
splitting content from styling, the next step is to split data from content from styling,
that every webpage should actually have three parts. Or not every, but a data
carrying webpage should explicitly have a data portion and it should have a
content part that explains how that data should be formed for visualization and
then a style part much in the same way as we're using CSS now.
And I think that the common case will be that just like the CSS, the tags for
visualization will be static. And it's only the data that will be changing on the
page. And this may dramatically improve the ability of people to maintain their
information easily, or use it for multiple datasets.
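[Editor's illustration, not part of the talk.] The three-part split described above can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the records, the template string, the style rule, and the function name render_page are all hypothetical, and it does not reproduce how Exhibit itself works. It only shows a static content template and a static style being reused while the data portion changes.

```python
import html
import json
from string import Template

# Part 1: the data portion (hypothetical records, invented for illustration).
DATA_JSON = """
[
  {"label": "Open mic night", "venue": "Cafe A", "date": "2010-05-01"},
  {"label": "Jazz trio", "venue": "Club B", "date": "2010-05-03"}
]
"""

# Part 2: the content portion, a static template saying how one record is shown.
ITEM_TEMPLATE = Template('<li class="event"><b>$label</b> at $venue on $date</li>')

# Part 3: the style portion, kept separate in the same way CSS is today.
STYLE = ".event { font-family: sans-serif; }"


def render_page(data_json: str, item_template: Template, style: str) -> str:
    """Combine the three parts into one HTML fragment."""
    records = json.loads(data_json)
    rendered_items = []
    for record in records:
        # Escape every value so that data can never inject markup.
        safe = {key: html.escape(str(value)) for key, value in record.items()}
        rendered_items.append(item_template.substitute(safe))
    body = "\n".join(rendered_items)
    return f"<style>{style}</style>\n<ul>\n{body}\n</ul>"


if __name__ == "__main__":
    print(render_page(DATA_JSON, ITEM_TEMPLATE, STYLE))
```

Passing a different JSON string to render_page with the same template and style yields a new page, which is the "only the data changes" case described in the talk.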
All of what we did was client side. Okay? So do we need servers at all? Well, of
course. Once the data gets big enough, you're not going to be able to do these
kinds of client side computations that we're running, you're going to want
support from a server. And in fact we have a -- we've just gotten a small grant
from the Library of Congress to work on this because they have -- they want to
use Exhibit but their datasets are too large. So they want some further
development to provide server side support for the kinds of data interactions that
Exhibit lets you author.
But even as we push down that direction I continue to argue that there's tons of
valuable data that comes in small packages where the real challenge is not
performance and scaling but simply ease of authoring the visualizations.
Authoring data is not complicated and you don't need a computer scientist to do it.
Okay? Everybody is happy putting images and videos into their Web pages and
inserting data should be no harder.
A lot of data is in fact available on the Web. This comes back to the argument I
was having with Daniel at the beginning. All sorts of companies offer data APIs
and say we're good citizens, we're offering data APIs. Well, this is great for
programmers, but what about everybody else? I would much rather see or I
would like to see us add on a Web ecology where it becomes normal for people
to have data copy buttons on their Web pages, okay? We see a little bit of this in
microformat -- in microformat land. Okay? Where you should just be able to
say, okay, I have navigated around on this website, I've gone through its query
process and so on and so forth, now there's a -- this page is showing a bunch of
data that I like. Well, let me copy out that data and do something with it. Ideally
copy it out as a feed, an updatable URL that I can use if the data ever changes
so that I can keep my visualization up to date.
I won't even talk about the Semantic Web vision. It's not important for this talk
right now. But I'll just conclude by going back to these five problems that I listed
at the beginning, right, that people can't communicate effectively, that as a result
people don't put data on the Web, and other people can't find it. So I'd like -- I
hope -- I believe that the approach that we're taking of making easy data
visualization part of the Web ecology would actually help us to address these
problems.
We do so by separating data from presentation, thinking of data visualization as
an authoring process rather than a programming process. And if we did so,
anyone would now be able to create interesting data and visualizations which
would motivate them to do so and put that data out where other people could
access it.
And so coming back to what I said right at the beginning to Eric, this is not about
creating sophisticated information tools, it's about creating simple tools that let
people do the sophisticated work. Okay?
These are all of the students who have worked on different pieces of the tools
that I've described. I had to asterisk David Huynh because he was the one who
really kind of led us off in this direction of simple Web authoring of information
visualizations by creating the initial Exhibit framework. Okay?
So you can play with all of the tools that I've shown you by going to these
various websites. And of course you can e-mail me if you want to -- if you want
to discuss any of them.
>> Eric Horvitz: Thank you very much.
[applause].
>>: You said [inaudible] about what you were doing at Microsoft and [inaudible].
>> David Karger: Sure. Sure. So I gave a 20-minute version of this, without the
reboot in the middle, to Harry Shum [phonetic] about a year ago, and he thought
it was pretty exciting, and we thought that Bing might be an environment where
this might happen. And I've got a white paper about this that I'm happy to
circulate around.
The idea is to provide an environment somewhat like Many Eyes but different in
important ways where people can -- are supported in their authoring of data and
visualizations of that data.
The idea would be that you would go over to a site that's part of Bing and you
would there find tools. So suppose, for example, that you have a small
storefront and you want to sell a bunch of products or something. And you want to create,
you know, a -- some product catalog that people can navigate.
Well, Bing would give you the tools for creating that product catalog without doing
all sorts of fancy software installation and so on. As a first step the products that
you want to sell are probably already in Bing. So we would give you like a little
shopping cart to go through Bing's structured data repository and pluck out the
items that are part of your catalog.
Now you have a dataset. Well, you would perhaps want to augment that
dataset, for example, with your comments about the products and the
prices of the products that you want to sell. Now you want to create a
visualization on top of that. Well, that's a Web authoring task. So you -- you
know, you use something like DIDO that I just showed to sort of WYSIWYG edit
up the way you want your product page to look.
When you finish, you pull down that product page and you put it on your website,
okay? And the advantage to Bing is that by going through this process, you've
actually authored additional structured information, right? Your prices, your
comments. And because it's structured and Bing knows how it was authored,
Bing has an easy time gathering that additional structured information and
enriching its own structured data repository.
>>: [inaudible].
>> David Karger: Exactly. Yeah. I've got a whole sort of virtuous cycle picture in
my white paper to sort of show how it helps everybody along the way.
>>: [inaudible].
>> David Karger: What was that?
>>: How has that gone?
>> David Karger: How has that gone. I've gotten quite a lot of enthusiasm and
not enough development resources. So we're still trying, is the basic story, to
figure out how to make -- how to move it forward.
>>: [inaudible] doing the same thing, trying to bring the [inaudible] SharePoint.
>> David Karger: Yeah. Well, you know, I -- and so again, lots of enthusiasm,
many presentations to different people on the SharePoint team. And sort of
invitations to talk to this person and that person. But actually just as I was
coming out this week, I sent an e-mail to ask them if they'd like to meet again and
they said oh, we've gotten distracted with something else very big and don't have
time to talk about it right now. So that may not be going anywhere. We'll see.
>>: [inaudible].
>> David Karger: Okay. So that's the story. So I fear -- so like I said, you know,
I do have -- I do have a few words about my other projects, but I think that the
time may have been eaten in the reboot. So --
>> Eric Horvitz: We could arrange another seminar while you're here if you
would like to do that.
>> David Karger: I could do that if you want.
[applause]