>> Moderator: Okay. It's my great pleasure to welcome Craig Knoblock to
come here to give us a talk. Craig is a senior project leader at ISI and he's also a
research professor at USC. He's also chief scientist of Fetch Technologies and
Geosemble Technologies, both of which are spinoffs from USC. And he's my fellow
colleague and alumnus and he has published many books and articles and served on
senior program committees for many AI conferences. And he's co-chair of the
2008 AAAI special track on AI and the Web and he's conference chair of IJCAI 2011. He's
currently president of the International Conference on Automated Planning and
Scheduling, a trustee of the International Joint Conference on AI and a fellow of AAAI.
Okay, without further ado, (indiscernible) Craig.
>> Craig Knoblock: Thank you very much. Okay. So today I'm going to be
talking about work we've been doing at USC on building Mashups by example. I
want to acknowledge my collaborators. It's Rattapoom Tuchinda, who's a PhD
student. He's working on his PhD and has really been the driving force behind a
lot of this work and I certainly give him credit. And then Pedro Szekely, who's
also at the University of Southern California. We've been jointly advising
Rattapoom on this work, so...
So I'm going to delve down and talk more about the work we've been doing on
this topic. So I'm sure most of you are familiar with Mashups and you've seen all
kinds of interesting Mashups these days. But just for review, a
Mashup is basically some kind of application that combines the content from one
or more sources to provide some kind of unified or integrated experience.
There's lots and lots of examples of Mashups out there. Some notable ones are
things like taking crime data and putting it on a map. Those are the simplest kind
of Mashups, where you are simply taking some kind of data that's been published
in one place and saying, okay, I'm going to combine it with this other application.
The Google Maps thing seemed to launch the Mashup craze, where everyone was
just taking things and sticking them on top of Google Maps. But there's lots of other
examples of these now. You know, Zillow is another example; it's not strictly
speaking a Mashup, but it has the same flavor, where they're essentially taking all
of this very interesting information about property from a whole variety of
sources, putting it together, sticking it on top of a map and then providing sort of
a unified experience there.
One of my favorite ones, the one on the right here, SkiBonk, is one where they've
basically taken all the information about locations of ski resorts, combined it with
weather data and then stuck it on a map so that at a glance you can decide
whether or not and where you want to go skiing that day.
None of those websites by themselves actually provides all the information in
one site. There will be one website that has a listing of all the ski resorts.
Another one that's got the recent weather conditions. And so to put it all together
in one place often provides something that an end user might want. And this is
sort of the point, which is what an end user wants depends on the user. Right?
Every user is sort of looking for something different.
And for whatever Mashups that are out there, there's always some new Mashup
that someone wants. Right? There's some new user that says, hey, if I can take
data from here and put it over here that would be really great, then I could do
this. Today you largely have to wait for somebody to build you that Mashup.
Typically the combined data gives you new insight and provides some new data
or service that's not there. It's not there in any existing web source.
Now if you look at sort of the trend for Mashups, this is just a screen shot from
the Programmable Web and it shows you the new Mashups that were constructed
in the last six months. You can see that people are building these all the time.
But these are largely built by programmers. Right? People are either
programming them from scratch or using some kind of programmatic tool to
actually create these things.
But you can see that the interest in these things in general is quite large. Right?
This is the total number that have just been posted on the Programmable Web, which is,
I don't know, almost 3,200. And at the bottom they're divided into different
categories here, so -- and these are just to give you a sense for the different
kinds of things.
Mapping tends to be the biggest one, right? That's the largest part of the pie
here, but there's also photo, shopping, search, video, travel and so on as different
kinds of Mashups that people are creating.
So the focus of this particular work is really on things related to the first four
categories that we often see: mapping, photo, shopping and search. And those
accounted for roughly 47% of the most popular ones that are out there. There's
no reason it couldn't be expanded to the other ones, but different types of
Mashups have sort of different properties and different kind of user interface
requirements.
Okay. So let me talk now about sort of the different Mashup building issues.
Okay. And there's sort of a set of general problems that have to be solved in
order to create Mashups.
The first one is of course the data retrieval. Alright. And we've been working on
sort of web-based extraction for many years now and there's been lots and lots of
work. The point is that somehow you have to get the data to build the Mashup.
Right? And maybe it comes from some nice API where you can just retrieve the
information, but more often than not it's stuck on some HTML page that you've got
to navigate to and pull the information out. So that's the first problem.
The typical approach today is to have some kind of wrapper technology, which
basically goes from a site that looks like this to something that's in a more
structured format from which you can actually get the data.
The second issue is sort of the calibration and this consists of two pieces. One is
what we call source modeling. So you pull the information off. You typically
need to know something about what this information is actually providing. You
know, is it a date? Is it the name of a restaurant? I mean, so there's a whole
variety of things that you might actually be pulling off of that web page and you
want to understand what this source is actually providing. And that becomes
very important when you want to do some kind of integration with other
sources.
The second one is data cleaning because what you typically find when you
integrate data across sources is that there's all kinds of minor variations that sort
of get in the way of actually presenting the information in the form you want or
combining it with another source. And I'll describe some examples of these, but
these might be simple things where in one case the name is abbreviated and in the
other case it's not, or it's been rearranged in some way; or if it's a date format,
there are a huge number of variations on date formats that sites use, and if it's
different from another format or the one that you want, then those kinds of issues
can be a problem.
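The kinds of format variation mentioned here, date formats especially, are easy to picture with a small sketch in Python (the formats and values are hypothetical examples):

```python
from datetime import datetime

# A handful of date formats one might see across sources; before two
# sources can be combined, dates must be normalized to a single form.
FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y", "%d %b %Y"]

def normalize_date(raw: str) -> str:
    """Try each known format and return an ISO-style date string."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format didn't match; try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")

print(normalize_date("03/15/2008"))      # from one source
print(normalize_date("March 15, 2008"))  # same date, different source
```

Both calls map to the same canonical string, which is what makes the later integration step possible.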
The next issue is the integration. So what happens in a Mashup typically is that
you want to actually combine the data. Right? The whole idea of the Mashup is
to put two or more applications together. Maybe you're just going to one source
and you're restructuring it some way, but typically you're really trying to combine
the information in some way. And the challenging part here is to actually specify
in what way you want to actually combine the information.
Then finally there is the display issue. How do you actually want -- once you
have decided which sources you're going to get the data from and you've cleaned up
the sources and you've modeled them and you've integrated them, then there is
the presentation issue. For the most part in this talk I'm not really going to talk
about presentation, that's sort of a whole thesis project in itself, which we haven't
done yet. So we're going to focus on the other issues here.
I should mention more generally that for these types of Mashups, these are the
general information integration issues that people have been working on
for the last 20 or 30 years. What's different here is, you know, the real
focus of our project is enabling the users to actually do these things. Right?
We're not trying to just automate these different tasks and put them together and
magically create the Mashup for the users, but really to come up with a
framework in which we can actually support the user that wants to create the
Mashup to actually solve these problems.
Okay. So I'm going to go through just a few different types of Mashups and you
will see in our experiments that we actually use these different types to do
experiments and so on. So very simple type of Mashup might be one where we
simply do the retrieval, do some modeling and cleaning of the data and then just
display the information on a map for example. That's sort of the simplest kind of
Mashup and that would correspond really to the Google maps that are really
popular.
The second type is some kind of a union and these are quite common, too. So
maybe I have two different restaurant review sites that I go to. I say, well,
wouldn't it be great if I had all that information in one place and it was put on
top of a map so I could sort of see the information. So now you get into a little bit
more challenging kinds of issues because now we need to solve the problem of
extracting the information from the sources and then deal with the sort of the
calibration issues here and then specifying how you're going to combine it.
Unions are usually pretty straightforward, right? You are just combining the
sources, the data, and just appending all the data together to have the combined
set of information.
Although this can potentially get into more complicated problems if you want to
deal with things like duplication if you had overlap between the sources
themselves. You will see that many Mashups don't deal with
that at all. They just do a very simple thing of pulling the information together and
displaying it.
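The union with optional de-duplication described above could be sketched like this in Python (the restaurant rows are made-up examples):

```python
# Two hypothetical restaurant-review listings to be unioned into one table.
site_a = [{"name": "Japan Bistro", "rating": 4.0},
          {"name": "Sushi Roku", "rating": 4.5}]
site_b = [{"name": "Sushi Roku", "rating": 4.3},
          {"name": "Pizzeria Mozza", "rating": 4.7}]

def union(*sources, dedup_key=None):
    """Append all rows from all sources; optionally drop duplicates by key."""
    combined, seen = [], set()
    for source in sources:
        for row in source:
            key = row.get(dedup_key) if dedup_key else id(row)
            if key in seen:
                continue  # skip duplicates across overlapping sources
            seen.add(key)
            combined.append(row)
    return combined

print([r["name"] for r in union(site_a, site_b, dedup_key="name")])
```

Without `dedup_key` this is the "very simple thing" of appending everything; with it, overlap between sources is collapsed.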
And then finally you want to create some kind of integrated interface;
something like this is just showing a map kind of interface. We have taken the
data and displayed the points on the map.
The third type is very closely related to the first one except that now, instead of
just doing straightforward extraction from the web page, now you are doing some
kind of integration -- interaction with a web form. There is a whole lot of data
you can get that is behind a web form, where you go when you are interested in
information and provide some data. And then you go through the same kinds of
things where you then do the cleaning and display.
And then the last type here is really some kind of join. Where before we had a
union, which was where you're just taking the data and combining it into one large
table, now we want to take the data and actually combine it where you're
essentially saying, well, I might have a set of information about restaurants and I
want to put all that information together and I want to relate the information
across the different restaurants. And this typically is harder to specify because
usually what you're doing is you have to specify, well, which restaurant in one
source corresponds to which restaurant in another source, for example, doing
those combinations. So somehow you have to be able to show to the system
how are you going to combine the information.
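The join type described above can be sketched the same way: relate rows across two sources through a shared key (here the restaurant name; the data and field names are hypothetical):

```python
# Reviews from one source, health inspections from another (made-up data).
reviews = [{"name": "Japan Bistro", "reviews": 31},
           {"name": "Sushi Roku", "reviews": 120}]
health = [{"name": "Sushi Roku", "score": 92, "inspected": "2008-01-10"},
          {"name": "Japan Bistro", "score": 88, "inspected": "2008-02-03"}]

def join(left, right, key):
    """Merge each left row with the right row that shares the same key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

for row in join(reviews, health, key="name"):
    print(row["name"], row["reviews"], row["score"])
```

The hard part the talk points at is choosing the key: the user has to show which restaurant in one source corresponds to which restaurant in the other.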
All right. And then -- well, the last type that we're not really going to talk about is
kind of this customized display. How do you actually deal with that? And that
really sort of expands to all of these. Right? Almost every kind of Mashup has
some type of display because usually the user wants to present it in some way.
I'm not going to spend time talking about how to solve that problem today.
Okay. So existing approaches. I mean basically the goal of this work is to create
Mashups without programming. And the problem today with the existing
approaches, it turns out, is that you have to have some basic knowledge about
programming in terms of the way the existing systems actually work.
So here's a screen shot from Yahoo Pipes. For those who haven't seen it,
Yahoo Pipes is basically a tool for creating Mashups across web
services. Right? So you have these service-based interfaces and you're really
describing how it is you're going to integrate the data across these different
services. And what you see is it's based on this widget paradigm where you
have these -- you basically have these different widgets which you place on the
screen that you're -- where you're building the Mashup and they perform
operations on the data. You can specify ways of extracting information from
different kinds of feeds. They could specify different ways of putting information
together and stuff.
You know, it's not, you don't have to write Java code, but at some level you're still
in this sort of paradigm where I have these operations on the data and I'm going
to put operations down and I'm going to specify how I'm going to connect the
operations up. This type of approach is also used in a system called Microsoft
Popfly. A similar kind of thing, except they have even more widgets.
Typically what happens is that the users have to spend a lot of time essentially
locating and learning how to basically customize these widgets. They're typically
quite powerful. You can write regular expressions and you can do all kinds of
interesting things with them. They are really great for programmers because we
are used to that kind of paradigm.
They focus -- I mean, typically the existing systems focus on some of the issues
that I just went through and they ignore other ones depending on what the focus
of the project is.
So our goal then is to come up with a framework that addresses all of these
issues while still making the Mashup building process easy for the end user and
our target here was not programmers, but people that maybe haven't had a lot of
experience programming but want to create their own Mashups.
Okay. So here is sort of the key contribution in terms of what we've done. We've
developed this programming by demonstration approach that uses a single table as
the basis for building the Mashup. The table in some sense is the unifying framework for
the user in (inaudible). It then provides this integrated approach that actually
combines these different pieces. So in a lot of systems today the different
components that I describe, the data extraction, the modeling, cleaning and
integration are each sort of their own piece in the system. You go and solve this
problem, and then once that's solved you go to the next problem. We really try
to treat this more uniformly in the sense that it is all embedded in the same
paradigm using the same user interface, so it's a little more natural for the user.
And we allow the system to actually build fairly sophisticated kinds of queries, but
do it using the same paradigm. So using the (inaudible) demonstration idea,
you can actually write what under the hood are fairly complicated queries, but the
user doesn't see them as necessarily complicated queries because they never
write a query.
So some of the key ideas here then are we focus on the data and not on the
operations. Right? The users are mostly familiar with the data. In some sense
they know what the data is that they want to extract. They know in some sense
how they want to put it together. The idea is that the user then manipulates the
data instead of manipulating the operations.
Another key idea here is to basically leverage the existing data and the idea here
is that over time if I'm building, if I'm interested in a particular topic and I'm
building Mashups on a topic, then I've built up a repository of sort of
previous data sources and things that I created in the past. This can help you a
lot in terms of doing the modeling where I'm describing what the source is
actually providing, cleaning the data so you know the same types of operations
may have been performed in the past, and then doing the integration and so on.
And the other thing here is that, as opposed to most problems in computer
science where we want to do divide and conquer, we really took the approach
that by pulling these pieces together and solving them as one integrated piece, it's
more natural to the user, and solving one issue can also help in solving the other
issues. The different components here that we're talking about are often very
closely interconnected, so often it's very hard to actually completely separate
things. And we do this by interacting with this sort of single table or spreadsheet.
Okay. So we built this system called Karma and this is just a screen shot from
the system. So what you see here is on the left you have essentially an
embedded browser, because the focus has largely been on extracting from web
pages, and so we've got a browser built right into the interface here. And then the
idea is that you can then do sort of simple kinds of cutting and pasting into the
table.
So on the right here we have the table. So this is the actual table that the user's
actually interacting with, and it's very close to sort of the spreadsheet model that a lot
of users are familiar with; even non-(inaudible) often work with spreadsheets
themselves, where you just have this sort of table of data that you're
manipulating.
And then here in the bottom are sort of the interaction modes and this is the way
for the user to sort of specify what mode of the system they want to be in. You
may not be able to see it very well, but it has the different tabs here to identify
what step in the process they're currently working on.
Okay. And so let me just go through sort of a motivating example. So let's say
that you have a source of data on restaurants and so you want to pull in the
restaurant name, address, phone number, review from some restaurant review
site, which is shown on the left here. This is the starting place.
Then on the right what we have is The L.A. Department of Public Health where
every restaurant in L.A. County actually gets rated on a regular basis by this
department. You'd like to know, okay, I want to know when this restaurant was last
inspected and what the score was, because I don't want to eat at
substandard restaurants. I want to basically do something very simple, which is
pull these two sources together, clean up the data and then display it on the map.
What you see here is sort of the extraction step here where I'm going to pull off
the basic information, name, address, phone number, review.
Over here I'm going to pull out again the name, address, but here I have the date
of inspection and the score. And I want to clean up the data and then combine it,
do some kind of integration here across the two sources and then stick it on
a map.
Okay. Now for the purposes, for the example I'm going to use in the rest of the
talk I'm going to assume just to simplify the talk that essentially I've already done
the shaded part here and I've stuck it into a database. So I have a database with
the same set of information, so it's basically the same task, but I won't have to go
through both parts of it.
Okay. And you know, in this database this is sort of just the general database
that you have as part of the system which contains all the past sort of Mashups
that you built. As we go along let's see if these can be useful in actually creating
new examples.
All right. So let's start with the data, the data retrieval task. There has been a lot
of work on extraction. There's a variety of tools and stuff. We took a relatively
simple, but sort of very easy to use type of approach where you're essentially
taking, you're looking at a web page. We have got the web page here on the left
side and what we're doing is basically copying the information into the table to
essentially show the user what information we want to extract. And what's
happening under the hood is that we're using the very common sort of approach,
which is to exploit the document object model underlying the page. By doing that
you can very quickly generalize on the page to figure out what the information is.
Right?
So what happens then is the system basically builds the XPath expression that
describes the information you're extracting and then does some generalization
over that. You can see well, he only pulled in Japan Bistro, but really his
intention is to say, well, what I want are all the things essentially at that level in
the document.
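The generalization idea described here, recording the path to the one example the user dragged in and then dropping the positional index so it matches every sibling, can be illustrated with a toy document (Python's ElementTree stands in for the page's DOM; the page content is invented):

```python
import xml.etree.ElementTree as ET

# A tiny XML stand-in for the restaurant listing page's DOM.
page = ET.fromstring("""
<body><ul>
  <li><b>Japan Bistro</b></li>
  <li><b>Sushi Roku</b></li>
  <li><b>Pizzeria Mozza</b></li>
</ul></body>""")

# Path recorded from the single example the user dragged into the table:
example_path = "./ul/li[1]/b"
# Generalized path: strip the positional index so it matches every <li>.
general_path = "./ul/li/b"

print([node.text for node in page.findall(general_path)])
```

From the one example ("Japan Bistro"), the generalized path recovers the complete list of restaurants on the page, which is exactly the behavior described.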
So the model is shown -- the document model here is shown on the left. So the
user pulled in this, but the system can fairly easily generalize over this type of
model and say, okay, it looks like such and such
corresponds to this expression here, and then with one example, which is very
nice, you can get the complete list of all the restaurants on the page. Yes?
>> Question: Did the user (inaudible) that or --
>> Craig Knoblock: Yes. Yeah. So it's a little hard to show that on this slide,
but the idea is that the user basically copied and pasted or dragged that over from
the actual web page.
Okay. So we've got this first level that we get from this and there's a number of
tools out there that are using this kind of model. I will mention, since some
people have asked about this, that one of the issues with exploiting the object
model is that it works where it works, and where it doesn't work you are kind
of stuck. But for quite a few web pages it works quite nicely.
One of the things that happens is that lots of times it is not enough just to get the
information that you want from a single page, so typically you have to do some
kind of navigation. Right? So on this particular page when I click on Japan
Bistro, then what happens is it brings up this whole other page here, which is
shown here, which now contains maybe the address and the phone number for
the restaurant, the picture. I think in this case it even has a movie you can
watch. But it also has the reviews, or at least the number of reviews on this
page. So you may want to extract additional information. So instead of just the
name I may want to get out the address, maybe some kind of description
that's there, the number of reviews and so on.
So let's take the case where here I want to extract the reviews. Now I have a
problem that, well, the reviews weren't shown on the original page. Right? So
the original top-level page didn't show the reviews. So somehow I have to
connect up this detailed page with the page that's above it. So what's really
going on here is that in the original document object model I had the name, the
address and then some brief description about the restaurant.
And what I'm going to do then is use that information. Well, use the name which
is on this page in the URL, sort of the underlying URL that's linked to these
names to then link to the detail pages so that I can get all of these additional, the
reviews or at least the number of reviews off the next page.
So what's really happening there is that we're basically building this XPath
expression that's basically going to traverse these simple kinds of URL links, so
that when you have a page and it links to a detail page, which is very common, it
really is just going to connect up the pages across these things using those
kinds of URL links. So then we can basically fill out the table.
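The link-following behavior described here can be sketched without any real network access, simulating the detail pages with a dict (all URLs and field names are invented for illustration):

```python
# Simulated detail pages: URL -> fields that only appear on that page.
detail_pages = {
    "/r/japan-bistro": {"reviews": 31},
    "/r/sushi-roku": {"reviews": 120},
}
# The top-level page: each restaurant name paired with its detail-page link.
top_page = [("Japan Bistro", "/r/japan-bistro"),
            ("Sushi Roku", "/r/sushi-roku")]

def extract_with_navigation(listing, fetch):
    """Follow each row's link to its detail page to fill in extra columns."""
    rows = []
    for name, href in listing:
        detail = fetch(href)  # follow the per-row URL link
        rows.append({"name": name, "reviews": detail["reviews"]})
    return rows

rows = extract_with_navigation(top_page, detail_pages.get)
print(rows)
```

The `fetch` parameter stands in for an HTTP request; in the real system this traversal is encoded in the generalized XPath expression rather than written by the user.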
So all the user is doing is essentially navigating to the next page, saying, I want
the number of reviews here, copying this in, and the system figures out the
navigation path.
Yes?
>> Question: (inaudible) -- users to register in the system to participate in the
(inaudible)?
>> Craig Knoblock: You mean for the underlying website? So the question is
does the user have to register to actually use these pages. I mean, typically
depends -- what's that?
>> Question: To rank the pages.
>> Craig Knoblock: Oh, to rank the pages. No, so in our system all the user's
doing here is going to the pages and pulling out the information they want. There
wasn't necessarily any kind of registration process. They're simply going through
and saying that the task the user is trying to do is to aggregate the data about
what are the restaurants listed here and then how many reviews were actually
available for each of the restaurants. Yes?
>> Question: (inaudible) -- rank the page and only have the (inaudible) of the
user. You know, I think (inaudible) ranking of the page in this rating because a
user can create a page.
>> Craig Knoblock: Uh-huh.
>> Question: And he himself go for the page.
>> Craig Knoblock: Well, the task we're trying to solve is that there is a set of
websites that are available out there. And it's up to the user to decide, you know,
what it is they want to do, I mean what sources they want to combine, which
sources they actually trust and how they want to use the data. So we're not
really taking a position on that. Usually people decide for themselves which
sources they want to trust. What we are trying to do is provide the tools that
allow the end users to say, hey, I want to take this information and combine it over
here.
>> Question: Do users extract information from databases that organizations
link to your Mashup system or just from web pages?
>> Craig Knoblock: Well, there's no way to link to our Mashup system, so
we're really -- we're --
>> Question: Or (inaudible) so how do they store the information about
restaurants? In their databases or in some web pages?
>> Craig Knoblock: Yeah, we don't know and it doesn't matter to us. Right?
It's on the web. All the data, we're basically assuming all the data is on the web
and we are not taking a position how they store it or how they access it. Our
assumption is that we have access to it through a set of web pages (inaudible).
>> Question: The web pages not important to the system?
>> Craig Knoblock: Oh, it's very important. But the key here is that we are
trying to support the user. I mean, a lot of people look at information integration,
myself included, as a process where I specify a query and the system goes out
and magically pulls information together, integrates it and presents it to the user,
and the user is not involved in the process. The goal with this project is very
much the user is deeply involved in the process. Right? What we're trying to do
is allow the builder to build their Mashup. Right? So you say, I want to actually
build this Mashup where I combine the restaurant data with the help grading data
and stick on the map. I've already spent the time to evaluate the sources
available, how I want to combine the information, which information I want to use
and which information (inaudible) and I'm basically creating the tool.
>> Question: (inaudible) that a lot of sites really customize their pages to the
user. So like for example you say browser (inaudible).
>> Craig Knoblock: Uh-huh.
>> Question: Like if you go to Avalon, I would get a very different page from
you. How would you account for that?
>> Craig Knoblock: That's a really good question. We don't account for that
today. One of the things I should emphasize here is that we've taken sort of an
initial approach at the different subproblems, and our contribution is not
necessarily specific to the extraction piece or the source modeling piece or the
integration piece. It's really the unification across the different pieces.
So for any one of the issues -- and I have a related work slide at the end and I will talk
about what people have done -- for any one of these issues we could spend a lot
of time delving into all kinds of issues like that, and my position on that is, well,
they're all sort of problems that are solvable in some way, but really what I'm
trying to do is look at the big picture here: how you put these different (inaudible)
together, and then you can delve in.
There's lots of tricks you can play with those kinds of things. You can either
model what the cookies are actually representing or you can actually clear the
cookies every time so they are not there, there's a bunch of ways to deal with
that. In the system I'm talking about today, we ignore them. Okay.
Okay. All right. So basically we've finished the extraction piece; you know, we've
basically exploited the document object model here. We can get -- we can
navigate through the page, pull out the information and build the initial set of data
that the user wants to work with. So the next step is for the system to build some
kind of model of what this information contains. And this is useful for naming, for
one thing, just so you have reasonable names for the kinds of information
you're displaying. But it also turns out to be very useful with
respect to the actual integration step.
So we pull out some initial data. Let's say here we have a list of some set of
names. And we may already have some data in the system. So we may already
have extracted the L.A. health ratings. If you remember, I assumed we had
already done that and extracted it into the database. You might also have
some other sources of data. Maybe I've gotten a list, some information about
artists. Maybe another source, which is the Zagat site, which also has
restaurants. The idea is that you can use information you previously
modeled.
In some sense these have tags associated with them. These are restaurant names
and Zagat ratings, and here we have artist names. And so what happens in this
step is the system uses essentially previous extracted information to try to
identify things that it is going to identify from having extracted them previously.
Here we find that maybe of this set we've extracted here,
three of those occur under restaurant name in previous sources. One of
them occurs under artist name. So we're going to say, well, we're going to take
(inaudible) propose that, and in fact I think that in the current version of the
system we actually propose all the possible ones that we've seen in the past and
the user can select one or say no, this is a new one I've entered.
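The tag-matching step described here, proposing labels for a new column by overlap with previously modeled sources, might look roughly like this (all names and labels are hypothetical):

```python
# Values previously extracted and tagged in earlier Mashups.
previous = {
    "restaurant_name": {"Japan Bistro", "Sushi Roku", "Pizzeria Mozza"},
    "artist_name": {"Miles Davis", "Sushi Roku"},  # overlap is possible
}

def propose_labels(new_values):
    """Rank known labels by how many of the new values they've seen before."""
    scores = {label: len(known & set(new_values))
              for label, known in previous.items()}
    # Return every label with at least one match, best overlap first,
    # so the user can pick one or declare a new label instead.
    return sorted((l for l, s in scores.items() if s > 0),
                  key=lambda l: -scores[l])

print(propose_labels(["Japan Bistro", "Sushi Roku", "Chez Nous"]))
```

Returning all candidate labels rather than a single guess matches the behavior described: the system proposes, the user decides.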
The next step here is the actual cleaning process. So the problem here is that
I've got my extracted information and you know here I've decided these are
restaurant names. And I have all the other information previously that I've
extracted from other sources. So maybe I want to look at the restaurant names.
And one issue that frequently comes up, you get minor differences across
sources. Right? So you get problems where maybe there's a restaurant name that's
misspelled, or maybe there's some systematic way in which these have
been changed -- probably not for restaurant names, but you may just want to be
able to identify, okay, potentially there are some problems here.
So here you see that Sushi Roka here is Sushi Roku there, and those kinds of
things are a real problem, right? Especially if what I'm trying to do is actually
combine the information from the new source with the existing data, because then it is
not going to realize that they're actually the same.
And the way we deal with this is we really allow the user -- there are really two
kinds of capabilities in the data cleaning. The first one is we allow the
user to basically manually clean up the data if they want. Sometimes that is what
it takes. Right? I want to combine the data, and in places I am going to have to fix
things. And so this is a simple sort of table interface where, you know, I've got the
set of data here, and then it opens up three new tabs. One is sort of a suggestion of
what it thinks this is supposed to be. The next one is what the user sort of
provides as an example of what they want, and then the final thing here
allows the user to say, well, in some cases the suggestions may be right, but
maybe I want to override it.
So in this case the user might open this tab and say, well, what I want to do is
take this sort of number of reviews and I want to enter, what I really want to do is
get rid of the reviews. Right? Because later maybe I want to do comparison and
find the ones with the most reviews or 10 reviews or something like that. So the
user enters 31 here. What happens under the hood is the system goes through
and basically tries to find the transformation that would get the system from this
to this.
Okay. And we have a set of predefined rule classes in the system, such as
substring, or take the first token or the last token, those kinds of things. So you
have a set of predefined rules. The system then computes, based on the
examples, and says, well, can any of the rules calculate that? If not it won't
produce anything. But if it can it will actually generate the suggestion which this
table doesn't fill this out, but it would generate all the suggested ones based on
this rule and then the user can essentially confirm this and say, okay, yeah this,
is the transformation that I want to apply here.
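The learn-from-one-example idea described above can be sketched in a few lines. This is a minimal illustration, not Karma's actual code: the rule classes (first token, last token, substring) and the example strings are just the ones mentioned in the talk.

```python
# Sketch: learn a cleaning transformation from a single (input, output) example
# by testing a small set of predefined rule classes, as described in the talk.

def first_token(s):
    return s.split()[0]

def last_token(s):
    return s.split()[-1]

def make_substring_rules(source, target):
    """If target occurs inside source, yield a rule extracting that fixed span."""
    i = source.find(target)
    if i >= 0:
        yield lambda s, i=i, n=len(target): s[i:i + n]

def learn_rule(source, target):
    """Return the first predefined rule consistent with the example pair."""
    candidates = [first_token, last_token, *make_substring_rules(source, target)]
    for rule in candidates:
        try:
            if rule(source) == target:
                return rule
        except IndexError:
            pass
    return None  # no predefined rule explains the example

# The user typed "31" next to the extracted value "31 reviews":
rule = learn_rule("31 reviews", "31")
print(rule("12 reviews"))  # the learned rule applies to the rest of the column -> "12"
```

Once a consistent rule is found, it is applied to every remaining cell as a suggestion, which the user can then confirm or override, matching the two-level cleaning the talk describes.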
So there are really two types of cleaning that go on. One is that the user
essentially tries to find -- well, specialize a set of generic rules to this case here
so you can apply them to the cleaning. And the second one is to essentially
allow the user to then override, so if there is noise in the data they can clean it
up. And that is a big problem with that --
>> Question: (inaudible) -- successively refine the rules so that you might -- that
first rule might work for five and you get to number six, well, this is kind of an
exception, so can you apply something that applies to everything?
>> Craig Knoblock: That would be a great thing to have. Right now basically
the system tries to come up with one rule that will apply across all the data, but
then allow the user to then do some final edits on it. Yeah. You could then go
back and maybe you missed one or two and can clean this up mainly.
>> Question: (inaudible) -- best case where a user would want to (inaudible) or
should be some code in there?
>> Craig Knoblock: We don't support that because it doesn't really fit in our
model. You could argue that maybe you should allow them to use whatever
tools they want, but we really tried to keep with the original goal here of sort of
trying to match, okay, what is an end user going to do? We know how to write
(inaudible) questions, but (inaudible) not allowed to tackle them.
>> Question: I would imagine the user would want something even simpler
where they initially just selected the number 31 and copied that to the table rather
than 31 reviews and then figure out how to further --
>> Craig Knoblock: Oh, well, they could have done that. No, they could have
done that in the original extraction. But in fact that is one issue with using the
document object model approach: the easy things to extract in the document
object model are going to be those things that are at the level of the document
(inaudible). Right? And so then all of a sudden, once you are down within it, it
becomes harder to pull out exactly that piece of information.
>> Question: (inaudible) -- well, as the rule at the same time.
>> Craig Knoblock: One of the things we're doing now, I actually think the
(inaudible) approach is somewhat limited. One of the things we're doing now is
actually integrating in a more sophisticated learning approach, which then they
could have gotten 31 to begin with. So you could see other kinds of examples
that aren't just extracting a substring. For example, let's say I have last name,
comma, first name, and now I want to switch them around. That's a very simple
kind of transformation to show the system. The system could easily have seen
that type of transformation before, so it could say, oh, okay, I see what the user
is trying to do there. But it wouldn't be easy to extract from the original page.
Unless you extract -- I guess you could extract the last name separately, but
then put them together.
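The name-reordering transformation mentioned here is exactly the kind of non-substring rule a richer learner could recognize from one example. A minimal sketch of that one rule (the example name is invented):

```python
# Sketch of the "Last, First" -> "First Last" transformation the talk mentions,
# as one rule a more sophisticated learner might match against an example.

def swap_comma_name(s):
    last, first = (part.strip() for part in s.split(",", 1))
    return f"{first} {last}"

print(swap_comma_name("Knoblock, Craig"))  # -> "Craig Knoblock"
```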
Any questions? Okay.
Now we're to the data integration piece. And what happens here is okay, I'm
looking at a table of information that I have and the integration is well, I want to
integrate some data with the table of data I'm currently looking at. And what
happens down here is I'm in the integration tab and what's available to the user
at this point is to essentially put data in any of these positions.
So I can actually -- I can basically -- if I add information here, then really I'm
trying to specify some kind of union of the existing table with the information. If I
want to specify information over here, then I'm really trying to do some kind of
join, where I'm expanding out the set of information that I want to combine with
the data I'm already looking at.
Up here, the way it works is that the system may provide a set of options, which
are things it can see how it could join with the existing data set. Okay, these are
additional attributes that may be in the system. So what happens is that the user
goes into the integration phase and essentially hits the fill button. And what gets
generated then -- if this is the original table and these are sort of sources in my
repository, then what the system does is look for ways that it could actually add
additional information. You can see here the L.A. health rating; there's the
actual health ratings for the restaurants. Right? And for Zagat, here's the Zagat
rating, which is maybe the food rating from Zagat.
Again, it can basically compute the possible potential joins across the different
sources to sort of say, okay, this particular box -- well, this data here could
potentially be joined with this information over here, because there is overlap in
the restaurant name and address. Likewise, this information here could be joined
across just the restaurant name and no address information.
And what happens then is we essentially compute these steps. So what
happens in the user interface is that we have this set of information up here,
which these are the possible attributes that the user could select from some
drop-down list. And then down here we have the actual values. So the system
actually computes potential values based on the database. So the user may
know what the data looks like that they want to actually add here and select that.
So all of that gets computed based on the databases that are available. And
then the user essentially then -- the system just expands out the information. So
if I were to select health rating at the top column, then it's just going to basically
combine the information across different sources. It's easy for the user to do
that.
So let me go through an example though in terms of -- the more general example
in terms of the types of queries the user can specify. Here is a single column of
information. So what is really happening here is that we can compute -- at the
top level, this is essentially going to be the top of the table, so this is going to be
the attribute level. So we can compute the attributes that can fit into a particular
column. And then here you have the cell level, which is the possible values you
can basically put in. The attribute level is going to be the set intersection across
all the different values of all the different rows, and the values are going to be the
values that are consistent with that particular attribute.
So let's say here I stick Los Angeles into this table. And I've got a source here
that has information -- the source has the attribute city, and it has a set of
information about (inaudible). I may also have a song name that matches Los
Angeles, and it has essentially got information about pop music.
And so what's going to happen in the other cells is I can actually put in values
here that correspond to other possible values that might correspond to both of
those things, both the city and a song name. Then if I type in "Honolulu," then it
may find that okay the -- this corresponds to city and same source that I had here
and may also correspond to some other source here. But that -- so then the
question is can you use this to help you determine sort of the plausible attributes
for the table? And the idea is Christmas, you are basically narrowing it down
essentially based on the information that you have seen there. And so you see
once you have a few examples, okay, that is likely to be city. We actually keep a
drop-down list here, so that it turns out the system conjecture is wrong then you
can change it.
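The narrowing idea just described is essentially a set intersection over the repository. A tiny sketch (the repository contents here are invented to mirror the city/song-name example):

```python
# Sketch: each example value the user types prunes the candidate attributes
# to those whose known value sets contain every example seen so far.

repository = {  # invented: attribute -> values seen in past sources
    "city":      {"Los Angeles", "Honolulu", "Seattle"},
    "song_name": {"Los Angeles", "Yesterday"},
}

def candidate_attributes(values, repo):
    """Attributes consistent with all of the user's example values."""
    return {attr for attr, known in repo.items()
            if all(v in known for v in values)}

print(candidate_attributes(["Los Angeles"], repository))              # still ambiguous
print(candidate_attributes(["Los Angeles", "Honolulu"], repository))  # narrowed to city
```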
The final step is the actual map generation. So the idea here is that we
essentially apply sort of general set of heuristics today to essentially look at the
data that is in any of the columns and say, do I actually know how to place any of
these kinds of things on a map. Right?
So is it a street address? In which case I can run a geocoder on it. Or does it
correspond to some other type, maybe a city name? In that case I could place it
on my map. Or is it a state name, for example? So there is just a general set of
rules that allow it to take the data and stick it on a map. If it's amenable to
sticking it on a map, then you just click a button and say, okay, show me the
data on the map. Otherwise -- today we haven't really spent time on the user
interface part of it or the display part of it -- so otherwise today you're just stuck
either looking at the data as a table or sticking it on the map.
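The heuristic dispatch described here can be sketched as a simple classifier over column values. Everything in this sketch is illustrative: the regular expressions, the strategy names, and the abbreviated state list are assumptions, not Karma's actual rules.

```python
# Sketch: decide whether (and how) a value can be placed on a map, using
# simple heuristics like the ones described in the talk.
import re

US_STATES = {"California", "Hawaii", "Washington"}  # truncated for the sketch

def map_strategy(value):
    """Return a mapping strategy name for one value, or None if not mappable."""
    if re.match(r"^\d+\s+\w+", value):       # looks like "8445 W 3rd St"
        return "geocode_street_address"      # run a geocoder on it
    if value in US_STATES:
        return "state_outline"               # place the state on the map
    if re.match(r"^[A-Z][A-Za-z. ]+$", value):
        return "city_centroid"               # fall back to a city lookup
    return None                              # leave the data as a table
```

A column would be mappable when most of its values get a non-`None` strategy; otherwise the data stays in the table view, as the talk notes.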
Okay. So let me spend a few minutes now and talk about sort of the evaluation
of this. We use as a baseline a combination of a system called Dapper, which is
you know a Mashup tool that really focuses primarily on the extraction piece. It
works quite nicely. It's very easy; there's a web-based interface you can go to
and build these extraction things in Dapper. And we looked at that combined
with a system called Yahoo Pipes, which I mentioned earlier is sort of a widget-
based paradigm, where you are basically using widgets to collect information up.
In the Mashup community it is actually quite common for people to combine
different Mashup tools.
So people in fact oftentimes combine Dapper and Pipes, so it seemed like a
natural way to do this. These are popular and widely used tools. So we thought,
okay, this is a reasonable combination of tools to look at to see how it compares
to our approach here.
And what we wanted to do was look at a couple different issues. One was,
could we actually show that users with no programming experience could
actually build the different Mashup types that we have here? The second one
was, could we show that Karma actually takes less time to complete each of the
subtasks and scales better as the tasks get harder? And finally, overall, does
the user take less time to build the same Mashup in Karma as compared to
Dapper and Pipes?
And for the users in this experiment, we had 20 programmers. These were
essentially students in a course that I teach. We gave them extra credit if they
would run the experiment for us, and so we got a whole bunch of students. But
these are basically master's students in computer science that have probably a
(inaudible) amount of programming experience.
And then we took three nonprogrammers. These are essentially administrative
assistants that typically are familiar with Excel and Word, but really haven't had
any experience in programming.
Okay. And so in terms of the claims themselves, the general set of users is
used to support the second two claims here -- that it takes less time for the
subtasks and that overall we could do it faster -- and the nonprogrammers here
are used for the first kind of (inaudible), which is that someone with no previous
experience could actually build the four Mashup types.
Okay. So the setup was the following. For the people that were programmers,
since they took this class we actually had them do assignments in Dapper and
Yahoo Pipes, so they were already familiar with these tools. They actually had
to do a couple different assignments for this, and each of them used the tools in
projects.
But in addition we gave them a review package prior to the experiment -- we
sent it to them and said, okay, do the following tasks so you remember how to
use the tools, since that was early in the semester. Then right before the
experiment we gave them a 30-minute tutorial going through the different
systems and how they work, so everyone was familiar. It wasn't like a new tool
to them.
And the tutorial covered -- the tutorial, I should say, was about 15 minutes on
Dapper and Pipes and then 15 minutes basically teaching them how to actually
use Karma. Then, since they hadn't previously used Karma, we let them practice
on two to three tasks using Karma -- just simple tasks that were different from
the (inaudible) -- so they had some familiarity with Karma as well.
Then for the test we gave them three tasks. The programmers basically
alternated between Karma and Dapper and Pipes for each of the tasks. The
nonprogrammers only used Karma, simply because we didn't think it was
practical for them to use Dapper and Pipes.
And then everything was basically recorded using video capture software and
that just allowed us to go back and measure the amount of time after the fact so
we could see how much time they spent on the different subtasks.
So here is an example of the results. So this is one task here. The different
subject numbers and then we broke out the time based on extraction modeling,
cleaning and integration and then the total time. And notice here we have these
five, we basically had a five-minute cutoff time for the users. And that is simply
because many of the users eventually fail on some of the tasks. Then we
measured these things for both Dapper and Pipes and for (inaudible). Okay.
Then the variation across tasks: we had three tasks that sort of correspond to
the original tasks I talked about. So type 1 was a simple extraction task --
extract information and put it on the map -- and here we have the different
pieces: the data extraction, the source modeling, the cleaning and the
integration. And we rated -- this is sort of our projected analysis about how hard
these different tasks were -- and you can see the second task really combined
both dealing with a form kind of interface and then doing some union over the
data; and again for this one they had to do the union, a simple cleaning task,
and a simple model.
Then the third task was a little more complicated. They had to join two sources
together. The data extraction and modeling were pretty simple, but doing the
join process was more complicated, and there was no cleaning for that. And
these were real tests, real sources.
Okay, so let's take a look at the claims one by one. Claim one was that users
with no programming experience can build all four Mashup types. Claim two
was that Karma takes less time to complete the subtasks and scales better as
the tasks get harder -- you can see here that we have different difficulties on
some of the tasks that we compare. And then claim three was that overall the
user takes less time to build the same Mashup with Karma as compared to
Dapper and Pipes, so we look at the overall time to do the end-to-end task.
Okay. So let's look at the first claim here. Here are the results for the
nonprogrammers. There are only three users here, which is somewhat limited;
really we are just trying to show a proof of concept -- that we can take people
not trained as programmers and in fact they could actually complete the tasks.
you can see that these are the three subjects color coded here. We have the
time on the axis here and then the task one, two and three. For the
nonprogrammer subjects we used a time limit of 10 minutes and you can see that
they all completed within the time and you know they did pretty well. They did
pretty well in the tasks overall. Clearly the red subject took more time than the
other two, but overall they completed the task.
The second claim was that Karma takes less time to complete the subtasks.
So I'll just go quickly through each of the subtasks here. First we have the
extraction work. You see the same setup here: these are ordered by difficulty,
from simple to hard -- the simple, moderate and difficult tasks, which in this
ordering are task three, task one and task two. On the X axis we group the time
into different intervals, and on the Y axis we have the number of subjects.
So you can see the Karma shown in green and Dapper Pipes shown in blue.
You can see that on the simple task Karma did better. Then as we get to the
moderate and hard tasks it continues the same pattern, which is that in general
Karma is outperforming Dapper and Pipes across these extraction tasks --
which -- I don't have one with me, unfortunately. I should have brought one.
Part of it is that we are in the midst of a complete reengineering of the system,
so (inaudible) on Monday, so it is kind of in flux at the moment.
Okay. So here we have the --
>> Question: (inaudible) --
>> Craig Knoblock: No, we haven't made it available. You know it really has
very much been a research project. But we have a movie of it and I'll put the
movie up on the website and you can take a look at that movie. You know, it's
very much of an active research project, where we're looking at improving each
of the different components. Now that we have the system working we're
working on different components and I'll talk more about that in terms of where
we're going next with that. But it's pretty much work now where we're doing
some additional work on sort of expanding generality so it will work on more
websites and do more stuff. But eventually we'd like to make it available.
Okay. So one of the things I want to point out here is that not only is Karma
performing better in terms of the time, but with Dapper and Pipes you can see
there is a certain number of people failing on these tasks. For the simplest one,
everyone completes it. But for the moderate and hard tasks, these are
programmer users and some of them are actually failing to even complete the
task, which is somewhat surprising given they've all been trained to use the
systems.
Source modeling is a little different. So what happens in source modeling: we
have Dapper and Pipes, and you can see now it is outperforming Karma, at least
on the first two tasks. That is simply because there is not so much of a need to
do modeling. Since they are not supporting integration at the same level, any
modeling that has to get done in Dapper and Pipes is actually simpler to do.
In this case here, Karma was able to do the modeling automatically, because it
actually recognized the particular thing that had to get modeled and generated
the right type; it didn't require any time from the user at all. But you see this
time is quite small. So even though Karma performs worse on tasks one and
two, there's a relatively small difference -- like 30 seconds -- and you will see
this savings gets realized in the integration step: when we do the integration it
actually makes life easier for the user.
Then we get to the data cleaning up here. Again, there are only two tasks
where we had data cleaning. You can see Karma is doing better than Dapper
and Pipes here, and as the task gets harder more subjects are failing. So for
this task we are getting 35% failing, and here we are getting 83% that are
actually failing. Yeah?
>> Question: (inaudible) -- extra credit?
>> Craig Knoblock: Absolutely. Or they would never forgive you. They were
all motivated. I think everyone really wanted to complete the task. They didn't
know how it was going to be scored.
Then here we have the data integration step, and there were only two tasks
that had to do this. So you can see here that for the union task there is almost
no work for the user -- it happens almost immediately. There is more work to do
this in Yahoo Pipes, and then for this task you can see most of the users failed
here, and this really had to do with the model for doing the joins in Pipes. It's
quite tedious in that the user has to go to this menu and kind of select which
thing you want to join with and stuff, and it's quite time consuming to do. So
either they ran out of time or they just gave up because it was quite (inaudible).
There were one or two users that were able to complete that.
So here we have again about 30% fail on the union task and about 95% fail on
the join task. Okay. And then overall -- these are really just the numbers
aggregated together. You can see here for task one -- the green here is Karma
-- it performed much better. Task two, again it performed much better; you can
see these times are quite large. And on task three, overall, Karma is again
doing better on these tests.
Okay. If we look at the ratios, these are really just the speedup or slowdown: on
extraction Karma is about 2.2 times faster, source modeling is a little slower,
cleaning was about four times faster and integration six times faster. Overall we
get about three.
Okay. You know, we did a significance test, and Karma is significantly faster
when we do the averages, except for on the source modeling task, where the
Dapper and Pipes system is faster. Yes?
>> Question: Do the programmers do the tasks in the same order like they
always do with Karma first?
>> Craig Knoblock: No, we switched up the orders. So they would do -- some
of the students would do --
>> Question: (inaudible).
>> Craig Knoblock: No, there's not. And in fact we switched them. There are
three tasks, so which task they started first we would also vary.
Okay. So now, there's been a fair amount of related work. I don't really have
time to talk about all of the related work specifically, but let me describe it in
general. There are a number of tools out there: MIT had a system called Simile,
there's Dapper and Pipes, Microsoft has a system called PopFly, there's
Marmite, Intel Mash Maker -- there's a bunch of interesting programs out there
that have all tried to do some piece of this kind of Mashup work.
If you look at sort of the types of tasks supported -- we are classifying from our
view of the world in terms of the things we were doing here -- very few of them
really try to do all the different pieces of the integration. Right? They do
different pieces of this, you know, either the unions or joins or both. Intel Mash
Maker is the only one that covered the same classes of tasks that we did here,
and if you look at how the system works it requires an expert. Once you get into
those kinds of tasks -- if you read the paper and stuff -- to get to the integration
tasks where you are doing joins and stuff, there is basically an expert writing the
integration for you.
So in some sense, yeah, they support it, but with an expert; they have different
levels of users, and the base level of user really just sort of invokes a page and
it applies the Mashup.
Simile: this is sort of early work using the document model. It was focused on
extraction in the web (inaudible). It was an interesting system, but it supported
very limited kinds of work. The follow-on system here was Potluck, which also
creates RDF for (inaudible) stuff, where the user manually specifies the data
integration tasks.
Dapper mainly focuses on extraction, and really could only do linear types of
information: it could take this information within this and this other source. Then
all these systems are kind of grouped together; these are all doing a widget-
based approach. Microsoft PopFly has a much fancier UI and more widgets.
Marmite uses this kind of workflow-based approach, and in their user study they
found that the users really got confused by sort of the whole workflow approach.
Google My Maps, you know, takes points from some source and puts them on a
map. Then this is work we did at USC -- this is really our system that preceded
this, called Agent Wizard. This is more of a question-and-answer type of
approach: the idea was that you essentially go through and ask the user a
series of questions to create the Mashup. And the problem is it didn't scale well,
in the sense that the questions get quite tedious as the tasks get complicated
and there are more and more questions -- you never know when you will get to
the end of the questions. It is sort of modeled on a tax program: question and
answer, and you have your Mashup. But it wasn't very user friendly.
This is another system, from Microsoft -- sort of more of a tuple-based approach
where everything is mapped into tuples and things are linked between them.
Finally, Karma is on the bottom.
So a lot of these systems address some piece of the problem, and in some
sense most of them require sort of more expertise to get to the same level.
Then there is a whole set of work on each of the subproblems; I'm not going to
go through it all. There is quite a bit of interesting work on, for example,
information extraction. We've done some of this work. We sort of fit into this
category here of exploiting the document object model, but in fact we are
moving to this model here: we have a new version of the system where we have
actually integrated work we have done previously on machine learning types of
approaches to do the extraction, and this is really just to get more generality for
the types of pages we can apply this to.
Then there is work on source modeling; Bill here has been doing work himself
on this. Really what we're doing here is just leveraging very simple kinds of
techniques in the current system, and we want to integrate more sophisticated
kinds of (inaudible).
Data cleaning: there has been a lot of work on data cleaning and providing
some generic model into which many of these techniques in fact are integrated.
And then for the integration piece, I think the most closely related work is other
work on programming by demonstration. Lau, for example, has done work at
Washington on this piece. And you know, we're finding similar techniques but
doing it in a new framework and sort of a novel way.
Okay. So let me just wrap up. Clearly Mashups are here to stay. There's a lot
of interest in this whole idea of being able to take existing sources and put them
together in novel ways, and the need here is to find some nice way that really
allows end users -- the web users out there that use browsers -- to do the same
thing and build their own Mashups. Our contributions here are this programming
by demonstration approach, where we use the single table as the unifying
paradigm to hold the information together. We solved really these four pieces
we view as central to Mashup construction: the extraction, modeling, cleaning
and integration. And then there is the sort of query formulation technique that
really allows the user to specify the integration -- say, I want to integrate this
source into this source in this way -- which I think is fairly natural for the user but
maps to fairly complicated queries.
Then finally we evaluated, or demonstrated, this approach by showing real
users could actually complete these tasks and get some significant improvement
over existing approaches.
We're very interested in future work, and we're already working on some of
these areas. One area where we've done almost no work is customizing
displays. I think this is a really interesting topic: one of the things you see if you
look at all the different Mashup tools out there is that there's a huge number of
different kinds of clever interfaces and displays people use for Mashups. We
ignored that and just said, put it on maps. I think there's a really interesting
opportunity there for allowing the end user to customize it their way.
Another area we're interested in pushing on is learning and generalizing over
the tasks. We really want to be able to store the integration plans and reexecute
them on new data as the information is changing on the web and stuff.
This is really a natural extension of what we have here. It is not (inaudible).
The next one I mentioned is work on machine learning for the extraction tasks;
in fact we are doing work in collaboration with Fetch Technologies, because
they have been building new tools for automatically wrapping form data and
automatically extracting the data on the result pages from the forms, and so
we're looking now at sort of integrating that technology into this so we can get
more coverage for pages that aren't handled well by the document (inaudible).
Then we have other work we've been doing in the past on automatic source
modeling. Right now we're using some fairly simple techniques in Karma, and
we want to integrate that piece so we actually do more sophisticated kinds of
models, where you actually learn from the data you have seen in the past: I
recognize this data format, and once you know the date format you can
normalize it from that standpoint.
I think I'll just mention a couple papers. These are up on my web page. One
appeared in the 2008 IUI conference and the other in the 2007 IUI conference,
so if you are interested in the work, the papers are there. So that's it. Thank
you.
(Applause)
Questions? You guys exhaust all your questions?
>> Question: Yeah. (inaudible) -- end user interact available --
>> Craig Knoblock: (inaudible) -- that is a longer process, because to get to the
point where you can put up a tool people use requires you get to a particular
level with the software and it's not, you know --
>> Question: (inaudible) -- tests?
>> Craig Knoblock: Yeah, we do, but then putting it up on the web is sort of
more complicated, right? So you either have a piece of software you let people
download and install, or you have to have some kind of web-based interface
that allows them to use it. We had users test it: we said, here is a computer; we
put the computer in front of them and let them run the software. It's just a
matter of sort of resources and cycles to support the (inaudible). But I will put
up the video, because you know I really should have brought the video to show
it now. It is nice to see what it looks like.
>> Question: So for machine learning instruction part, so are you I guess I'm
kind of curious how you plan on doing that with the user? The user has to
interact to set the boundary with the extraction rate or?
>> Craig Knoblock: The way that works is we have the interaction now -- we
have an actual version where right now you copy and paste the data from the
web page into the form. We are using that data as labeled training data to do
the extraction. In some sense, what is happening under the hood is that every
time a user copies and pastes into the table, it treats that as a learning task: it
goes off and tries to learn the extraction, generates the data it would produce,
and maybe it is what the user wants, maybe it's not. The user can refine it by
bringing in more examples. So it takes a stab based on the first example, and
then if it doesn't get it right you copy more examples and bring them in. But it is
not as closely tied to the document object; that is the real advantage. The
disadvantage is it may require more interaction, and we're talking about maybe
some hybrid approach where you can get it to use (inaudible) and do that, and
when you can't, then you can use this data approach.
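The copy-paste-as-labeled-example idea can be sketched with the simplest possible learner: record the context around the pasted value and reuse it to extract the rest. This is a toy illustration, not the system's actual learner; the page snippet and the fixed-width context are assumptions.

```python
# Sketch: treat one copy-pasted value as a labeled example, learn the
# surrounding context in the page, and extract the remaining values.

def learn_context(page, example, width=10):
    """Learn fixed prefix/suffix strings around the first occurrence."""
    i = page.find(example)
    return page[max(0, i - width):i], page[i + len(example):i + len(example) + width]

def extract_all(page, prefix, suffix):
    """Extract every string bracketed by the learned context."""
    results, start = [], 0
    while (i := page.find(prefix, start)) != -1:
        j = page.find(suffix, i + len(prefix))
        if j == -1:
            break
        results.append(page[i + len(prefix):j])
        start = j
    return results

page = "<li>Sushi Roku</li><li>The Ivy</li><li>Spago</li>"  # invented snippet
prefix, suffix = learn_context(page, "Sushi Roku", width=4)
print(extract_all(page, prefix, suffix))  # -> ['Sushi Roku', 'The Ivy', 'Spago']
```

A second pasted example would let the learner verify or refine the context, which mirrors the refine-by-more-examples loop described in the answer.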
>> Question: (inaudible) -- instructed is that integrating the restaurant
(inaudible) matching on the page or --
>> Craig Knoblock: No, it's (inaudible). It's a good idea but there is another
path to go. People have been doing more work on just sort of automatic
extraction from pages themselves, where another model we're exploring is that
you basically automatically extract the data off a page -- you're essentially trying
to segment the data into different pieces of information -- and then, given an
example from the user, the example tells you which information the user is
looking for on that page, and then you can retrieve -- pull that information out. I
think that is probably the most promising approach, because that will probably
minimize the amount of training data.
The issue with previous work on machine learning for building wrappers and
doing extraction is simply the amount of training data it takes to really get it
right. Users have limited patience for providing training data in the real world.
Ideally they want to get it with the first example, and maybe they're willing to
give a second example. After that they throw up their hands and say, it's a
stupid system.
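The segmentation-based model described above — automatically split the page into candidate pieces of information, then let a single user example select the field of interest — might look roughly like this. Again a minimal sketch with made-up helpers, not the system from the talk; real segmentation would be far more sophisticated than splitting on tags and matching a coarse character "shape."

```python
import re

def segment_page(page):
    """Naively segment a page into text chunks (text between HTML tags),
    a stand-in for real automatic extraction/segmentation."""
    return [t.strip() for t in re.split(r"<[^>]+>", page) if t.strip()]

def select_field(chunks, user_example):
    """Given one example value from the user, keep only the chunks that
    look like it -- here, those sharing the same coarse character shape
    (runs of letters -> 'a', runs of digits -> '9')."""
    def shape(s):
        s = re.sub(r"[A-Za-z]+", "a", s)
        s = re.sub(r"\d+", "9", s)
        return s
    target = shape(user_example)
    return [c for c in chunks if shape(c) == target]

# Toy page with two records: restaurant name and health rating.
page = ("<tr><td>Spago</td><td>90</td></tr>"
        "<tr><td>Lucques</td><td>85</td></tr>")
chunks = segment_page(page)          # ['Spago', '90', 'Lucques', '85']
print(select_field(chunks, "90"))    # one example picks the rating column
```

The point the answer makes is visible here: the heavy lifting (segmentation) happens with no labels at all, so a single example can be enough to pick out the right field.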
Other questions? Yeah.
>> Question: (inaudible) services -- basically just (inaudible)?
>> Craig Knoblock: Actually, we've more recently looked at that. We have a
version of the system now where, when we do the extraction, we can do the
extraction from Excel or just plain text files. As for web services, there is no
reason we couldn't add those as well; that would be quite natural. Yeah, we
have been trying to generalize the set of sources, since not everything is in
web pages. Yes?
>> Question: We have a project called (inaudible) for (inaudible) organizations
(inaudible), where organizations can link their database sets to the Wiki pages,
and users can go into the databases, extract information, and do some kind of
data analysis and integration. The problem we are facing is that experts are
somehow reluctant to contribute to the Wiki because, for example, they're not
sure whether other users will come to their pages and collect data, and they
would like to lock their pages.
>> Craig Knoblock: Right.
>> Question: And they are also concerned about the credibility of the sources
of information that other users --
>> Craig Knoblock: Uh-huh.
>> Question: -- link to the Wiki.
>> Craig Knoblock: Right.
>> Question: And they are looking for some way to filter the reliable data from
the unreliable data, in order to save it in their (inaudible).
>> Craig Knoblock: Right.
>> Question: So now my question is, do you have a (inaudible) for defining
access for different groups of users? Because I'm sure you will have some
experts among your system's users from (inaudible).
>> Craig Knoblock: One aspect of the system is that this is a tool that a user
would download and use, or use on the web. And so we don't really have a
need for access control. They are really just going out to public sources, or
publicly available sources, pulling them together, and building their integrated
tools. We're not making any data available ourselves, so we don't really have
to provide any kind of access control.
It's a slightly different problem, right? You're in the position where you're trying to
manage your own users' data and who has access to it and what they are going
to do with it.
>> Question: (inaudible) -- to sampling from the others (inaudible).
>> Craig Knoblock: Uh-huh.
>> Question: (inaudible) one or two really would like to share data and
information from the Wiki with other people.
>> Craig Knoblock: Right.
>> Question: So it is not necessarily just -- I mean, we cannot assume that the
data is just from the database, so people can (inaudible) and gather some
information (inaudible).
>> Craig Knoblock: Right. There's a larger issue here in general, which is, you
know, what's the overall business model for the people providing the data,
right? I'm a data provider. Maybe it's a publicly available data source like the
L.A. County Department of Health; that's just a public service your tax dollars
pay for to make the data available. But you could imagine that with health
ratings and those kinds of things there is an issue of how the organizations are
making money, and whether you are circumventing their business model by
extracting during the (inaudible). I haven't taken any position on those issues
here; we're looking at the technology to be able to put things together, but those
kinds of questions are still open issues.
Any more questions? Okay. Thank you. (applause)