>> Moderator: Okay. It's my great pleasure to welcome Craig Knoblock to come here to give us a talk. Craig is a senior project leader at ISI and he's also a research professor at USC. He's also chief scientist at Fetch Technologies and Geosemble Technologies, both of them spinoffs from USC. And he's my fellow colleague and alumnus, and he has published many books and articles and served on the senior program committee for many AI conferences. He was co-chair of the 2008 AAAI track on AI and the web, and he's conference chair of IJCAI 2011. He's currently president of the International Conference on Automated Planning and Scheduling, a trustee of the International Joint Conference on AI, and a fellow of AAAI. Okay, without further ado (indiscernible) Craig. >> Craig Knoblock: Thank you very much. Okay. So today I'm going to be talking about work we've been doing at USC on building Mashups by example. I want to acknowledge my collaborators. It's Rattapoom Tuchinda, who's a PhD student. He's working on his PhD and has really been the driving force behind a lot of this work, and I certainly give him credit. And then Pedro Szekely, who's also at the University of Southern California. We've been jointly advising Rattapoom on this work. So I'm going to delve down and talk more about the work we've been doing on this topic. So I'm sure most of you are familiar with Mashups and you've seen all kinds of interesting Mashups these days. But just for review, a Mashup is basically some kind of application that combines the content from one or more sources to provide some kind of unified or integrated experience. There are lots and lots of examples of Mashups out there. Some notable ones are things like taking crime data and putting it on a map. Those are the simplest kind of Mashups, where you are simply taking some kind of data that's been published in one place and saying, okay, I'm going to combine it with this other application. The Google Maps thing seemed to launch the Mashup craze, where everyone was just taking things and sticking them on top of Google Maps. But there are lots of other examples of these now. You know, Zillow is another example, and it's not strictly speaking a Mashup, but it has the same flavor, where they're essentially taking all of this very interesting information about property from a whole variety of sources, putting it together, sticking it on top of a map and then providing sort of a unified experience there. One of my favorite ones, the one on the right here, SkiBonk, is one where they've basically taken all the information about locations of ski resorts, combined it with weather data and then stuck it on a map so that at a glance you can decide whether and where you want to go skiing that day. None of those websites by themselves actually provides all the information in one site. There will be one website that has a listing of all the ski resorts, another one that's got the recent weather conditions. And so putting it all together in one place often provides something that an end user might want. And this is sort of the point, which is that what an end user wants depends on the user. Right? Every user is looking for something different. And for whatever Mashups are out there, there's always some new Mashup that someone wants. Right? There's some new user that says, hey, if I could take data from here and put it over here that would be really great, then I could do this. Today you largely have to wait for somebody to build you that Mashup.
Typically the combined data gives you new insight and provides some new data or service that's not there in any existing web source. Now if you look at the trend for Mashups, this is just a screen shot from ProgrammableWeb and it shows you the new Mashups that were constructed in the last six months. You can see that people are building these all the time. But these are largely programmers. Right? People that are either programming them from scratch or using some kind of programmatic tool to actually create these things. But you can see that the interest in these things in general is quite large. Right? This is the total number that have been posted on ProgrammableWeb, which is, I don't know, almost 3,200. And at the bottom they're divided into different categories here, just to give you a sense for the different kinds of things. Mapping tends to be the biggest one, right? That's the largest part of the pie here, but there's also photo, shopping, search, video, travel and so on, different kinds of Mashups that people are creating. So the focus of this particular work is really on things related to the first four categories that we often see: mapping, photo, shopping and search. And those accounted for roughly 47% of the most popular ones that are out there. There's no reason it couldn't be expanded to the other ones, but different types of Mashups have different properties and different kinds of user interface requirements. Okay. So let me talk now about the different Mashup building issues. There's a set of general problems that have to be solved in order to create Mashups. The first one is of course the data retrieval. All right. We've been working on web-based extraction for many years now and there's been lots and lots of work. The point is that somehow you have to get the data to build the Mashup. Right? And maybe it comes from some nice API where you can just retrieve the information, but more often than not it's stuck on some HTML page that you've got to navigate to and pull the information out of. So that's the first problem. The typical approach today is to have some kind of wrapper technology, which basically goes from a site that looks like this to something in a more structured format from which you can actually get the data. The second issue is sort of the calibration, and this consists of two pieces. One is what we call source modeling. So you pull the information off. You typically need to know something about what this information is actually providing. You know, is it a date? Is it the name of a restaurant? There's a whole variety of things that you might actually be pulling off of that web page, and you want to understand what this source is actually providing. And that becomes very important when you want to do some kind of integration with other sources. The second one is data cleaning, because what you typically find when you integrate data across sources is that there are all kinds of minor variations that sort of get in the way of actually presenting the information in the form you want or combining it with another source.
And I'll describe some examples of these, but these might be simple things where in one case the name is abbreviated and in the other case it's not, or it's arranged in some different way; or if it's a date, there are a huge number of variations on date formats that sites use, and if it's different from another format or from the one that you want, then those kinds of issues can be a problem. The next issue is the integration. So what happens in a Mashup typically is that you want to actually combine the data. Right? The whole idea of the Mashup is to put two or more applications together. Maybe you're just going to one source and you're restructuring it in some way, but typically you're really trying to combine the information in some way. And the challenging part here is to actually specify in what way you want to combine the information. Then finally there is the display issue. Once you have decided what sources you're going to get the data from and you've cleaned up the sources and you've modeled them and you've integrated them, then there is the presentation issue. For the most part in this talk I'm not really going to talk about presentation; that's sort of a whole thesis project in itself, which we haven't done yet. So we're going to focus on the other issues here. I should mention more generally that these are the general information integration issues that people have been working on for the last 20 or 30 years. What's different here is that the real focus of our project is enabling end users to actually do these things. Right? We're not trying to just automate these different tasks and put them together and magically create the Mashup for the users, but really to come up with a framework in which we can actually support the user that wants to create the Mashup in solving these problems. Okay. So I'm going to go through just a few different types of Mashups, and you will see that in our experiments we actually use these different types. So a very simple type of Mashup might be one where we simply do the retrieval, do some modeling and cleaning of the data, and then just display the information on a map, for example. That's sort of the simplest kind of Mashup, and that would correspond to the Google Maps Mashups that are really popular. The second type is some kind of a union, and these are quite common, too. So maybe I have two different restaurant review sites that I go to. I say, well, wouldn't it be great if I had all that information in one place and it was put on top of a map so I could see the information. So now you get into a little bit more challenging kinds of issues, because now we need to solve the problem of extracting the information from the sources, then deal with the calibration issues, and then specify how you're going to combine it. Unions are usually pretty straightforward, right? You are just combining the data from the sources, appending all the data together to have the combined set of information. Although this can potentially get into more complicated problems if you want to deal with things like duplication, if you have overlap between the sources themselves. You will see that many Mashups don't deal with that at all. They just do a very simple thing of pulling the information together and displaying it.
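To make that union step concrete, here is a minimal sketch in Python of appending two extracted restaurant tables with a naive duplicate check. The field names and sample rows are purely illustrative assumptions, not Karma's actual internals:

    # Union-type Mashup step: append rows from two sources and drop
    # exact duplicates on a chosen key. Field names ("name", "address")
    # are illustrative, not Karma's actual schema.
    def union_sources(rows_a, rows_b, key_fields=("name", "address")):
        seen = set()
        result = []
        for row in rows_a + rows_b:
            key = tuple(row.get(f, "").strip().lower() for f in key_fields)
            if key not in seen:
                seen.add(key)
                result.append(row)
        return result

    reviews = [{"name": "Japan Bistro", "address": "123 Main St"}]
    ratings = [{"name": "Japan Bistro", "address": "123 Main St"},
               {"name": "Sushi Roku", "address": "456 Elm St"}]
    print(union_sources(reviews, ratings))  # two unique restaurants

A union like this, with the duplicate check dropped, is exactly the simple "pull it together and display it" behavior just described.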
And then finally you want to create some kind of integrated interface. For something like this we're just showing a map kind of interface: we have taken the data and displayed the points on the map. The third type is very closely related to the first one, except that now, instead of just doing straightforward extraction from the web page, you are doing some kind of interaction with a web form. There is a whole lot of data you can get that is behind a web form, where you go because you are interested in some information and you provide some data. And then you go through the same kinds of things, where you then do the cleaning and display. And then the last type here is really some kind of join. Where before we had a union, where you're just taking the data and combining it into one large table, now we want to take the data and actually combine it where you're essentially saying, well, I might have a set of information about restaurants and I want to put all that information together and relate the information across the different sources. And this typically is harder to specify, because usually you have to specify, well, which restaurant in one source corresponds to which restaurant in another source, for example, in doing those combinations. So somehow you have to be able to show the system how you are going to combine the information. All right. And then the last type, which we're not really going to talk about, is this kind of customized display. How do you actually deal with that? And that really cuts across all of these. Right? Almost every kind of Mashup has some type of display, because usually the user wants to present it in some way. I'm not going to spend time talking about how to solve that problem today. Okay. So, existing approaches. Basically the goal of this work is to create Mashups without programming. And the problem today with the existing approaches is that it turns out you have to have some basic knowledge about programming, in terms of the way the existing systems actually work. So here's a screen shot from Yahoo Pipes. For those who haven't seen it, Yahoo Pipes is basically a tool for creating Mashups across web services. Right? So you have these service-based interfaces and you're really describing how it is you're going to integrate the data across these different services. And what you see is it's based on this widget paradigm, where you basically have these different widgets which you place on the screen where you're building the Mashup, and they perform operations on the data. You can specify ways of extracting information from different kinds of feeds. You can specify different ways of putting information together and so on. You know, you don't have to write Java code, but at some level you're still in this paradigm where I have these operations on the data and I'm going to put operations down and I'm going to specify how I'm going to connect the operations up. This type of approach is also used in a system called Microsoft Popfly. A similar kind of thing, except they have even more widgets. Typically what happens is the users have to spend a lot of time essentially locating and learning how to customize these widgets. They're typically quite powerful. You can write regular expressions and you can do all kinds of interesting things with them.
They are really great for programmers, because we are used to that kind of paradigm. Typically the existing systems focus on some of the issues that I just went through and ignore other ones, depending on what the focus of the project is. So our goal then is to come up with a framework that addresses all of these issues while still making the Mashup building process easy for the end user, and our target here was not programmers, but people that maybe haven't had a lot of experience programming but want to create their own Mashups. Okay. So here is sort of the key contribution in terms of what we've done. We've developed this programming by demonstration approach that uses a single table as the basic framework for building the Mashup. The table in some sense is the unifying framework for the user in (inaudible). It then provides this integrated approach that actually combines these different pieces. So in a lot of systems today the different components that I described, the data extraction, the modeling, cleaning and integration, are each sort of their own piece in the system. You go and solve this problem, and then once that's solved you go to the next problem. We really try to treat this more uniformly, in the sense that it is all embedded in the same paradigm using the same user interface, so it's a little more natural for the user. And we allow the user to actually build fairly sophisticated kinds of queries, but to do it using the same paradigm. So using the programming by demonstration idea, you can actually write what under the hood are fairly complicated queries, but the user doesn't see them as necessarily complicated queries, because they never write a query. So some of the key ideas here, then: we focus on the data and not on the operations. Right? The users are mostly familiar with the data. In some sense they know what the data is that they want to extract. They know in some sense how they want to put it together. The idea is that the user then manipulates the data instead of manipulating the operations. Another key idea here is to basically leverage the existing data, and the idea here is that over time, if I'm interested in a particular topic and I'm building Mashups on that topic, then I build up a repository of previous data sources and things that I created in the past. This can help you a lot in terms of doing the modeling, where I'm describing what the source is actually providing; cleaning the data, since the same types of operations may have been performed in the past; and then doing the integration. And the other thing here is that, as opposed to most problems in computer science, where we want to do divide and conquer, we really took the approach that pulling these pieces together and solving them as one integrated piece is more natural for the user, and solving one issue can also help in solving the other issues. The different components here that we're talking about are often very closely interconnected, so often it's very hard to actually completely separate things. And we do this by interacting with this single table or spreadsheet. Okay. So we built this system called Karma, and this is just a screen shot from the system. So what you see here is, on the left, you have essentially an embedded browser; the focus has largely been on extracting from web pages, and so we've got a browser built right into the interface here.
And then the idea is that you can do simple kinds of cutting and pasting into the table. So on the right here we have the table. This is the actual table that the user is actually interacting with, and it's very close to the spreadsheet model that a lot of users are familiar with; even non-programmers often work with spreadsheets themselves, where you just have this table of data that you're manipulating. And then here at the bottom are the interaction modes, and this is the way for the user to specify what mode of the system they want to be in. You may not be able to see it very well, but it has the different tabs here to identify what step in the process they're currently working at. Okay. And so let me just go through a motivating example. So let's say that you have a source of data on restaurants, and you want to pull in the restaurant name, address, phone number and review from some restaurant review site, which is shown on the left here. This is the starting place. Then on the right what we have is the L.A. Department of Public Health, where every restaurant in L.A. County actually gets rated on a regular basis by this department. You'd like to know, okay, when was this restaurant last inspected and what was the score, because I don't want to eat at substandard restaurants. I want to do something very simple, which is pull these two sources together, clean up the data and then display it on the map. What you see here is the extraction step, where I'm going to pull off the basic information: name, address, phone number, review. Over here I'm going to pull off again the name and address, but here I have the date of inspection and the score. And I want to clean up the data and then combine it, do some kind of integration here across the two sources, and then stick it on a map. Okay. Now for the example I'm going to use in the rest of the talk, I'm going to assume, just to simplify the talk, that I've already done the shaded part here and stuck it into a database. So I have a database with the same set of information, so it's basically the same task, but I won't have to go through both parts of it. Okay. And this database is sort of just the general database that you have as part of the system, which contains all the past Mashups that you've built. As we go along you'll see how these can be useful in actually creating new examples. All right. So let's start with the data retrieval task. There has been a lot of work on extraction; there's a variety of tools out there. We took a relatively simple, but very easy to use, type of approach where you're essentially looking at a web page -- we have got the web page here on the left side -- and what we're doing is basically copying the information into the table to essentially show the system what information we want to extract. And what's happening under the hood is that we're using the very common approach of exploiting the document object model underlying the page. By doing that you can very quickly generalize on the page to figure out what the information is. Right? So what happens then is the system basically builds the XPath expression that describes the information you're extracting and then does some generalization over that.
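As a rough illustration of that generalization idea, here is a small Python sketch using lxml: from a single example value it builds the XPath to that node, then drops the positional indexes so the expression matches all the sibling records. This is an assumption-laden toy, not Karma's implementation:

    # From one example, build an XPath to its node, then strip the
    # positional indexes so the path matches every sibling record.
    import re
    from lxml import html

    def generalize_from_example(page_html, example_text):
        tree = html.fromstring(page_html)
        node = next(n for n in tree.iter()
                    if n.text and n.text.strip() == example_text)
        path = tree.getroottree().getpath(node)  # e.g. /html/body/ul/li[1]/b
        general = re.sub(r"\[\d+\]", "", path)   # e.g. /html/body/ul/li/b
        return [n.text.strip() for n in tree.getroottree().xpath(general)]

    page = ("<html><body><ul><li><b>Japan Bistro</b></li>"
            "<li><b>Sushi Roku</b></li></ul></body></html>")
    print(generalize_from_example(page, "Japan Bistro"))
    # ['Japan Bistro', 'Sushi Roku']

One example is enough here because the generalized path picks up everything at the same level of the document, which is exactly the behavior described next.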
You can see, well, the user only pulled in Japan Bistro, but really his intention is to say, well, what I want are all the things essentially at that level in the document. So the model is shown -- the document model here is shown on the left. So the user pulled in this, but the system can fairly easily generalize over this type of model and say, okay, it looks like that corresponds to some path, such and such corresponds to this expression here, and then with one example, which is very nice, you can get the complete list of all the restaurants on the page. Yes? >> Question: Did the user (inaudible) that or -- >> Craig Knoblock: Yes. Yeah. So it's a little hard to show on this slide, but the idea is that the user basically copied and pasted or dragged that over from the actual web page. Okay. So we've got this first level that we get from this, and there are a number of tools out there that are using this kind of model. I will mention, since some people have asked about this, that one of the issues with exploiting the object model is that it works where it works, and then where it doesn't work you are kind of stuck. But for quite a few web pages it works quite nicely. One of the things that happens is that lots of times it is not enough just to get the information that you want from a single page, so typically you have to do some kind of navigation. Right? So on this particular page, when I click on Japan Bistro, what happens is it brings up this whole other page, which is shown here, which now contains maybe the address and the phone number for the restaurant, the picture -- I think in this case it even has a movie you can watch. But it also has the reviews, or at least the number of reviews, on this page. So you may want to extract additional information. So instead of just the name, I may want to get out the address, maybe some kind of description that's there, the number of reviews and so on. So let's take the case where I want to extract the reviews. Now I have a problem: the reviews weren't shown on the original page. Right? The original top-level page didn't show the reviews. So somehow I have to connect up this detail page with the page that's above it. So what's really going on here is that in the original document object model I had the name, the address and then some brief description of the restaurant. And what I'm going to do then is use that information -- well, use the name, which is on this page in the URL, sort of the underlying URL that's linked to these names -- to then link to the detail pages, so that I can get all of this additional information, the reviews or at least the number of reviews, off the next page. So what's really happening there is that we're basically building this XPath expression that's going to traverse these simple kinds of URL links, so that when you have a page and it links to a detail page, which is very common, it really is just going to connect up the pages using those kinds of URL links. So then we can basically fill out the table. So all the user is doing is essentially navigating to the next page, saying, I want the number of reviews here, copying this in, and the system figures out the navigation. Yes? >> Question: (inaudible) -- users to register in the system to participate in the (inaudible)? >> Craig Knoblock: You mean for the underlying website? So the question is, does the user have to register to actually use these pages. I mean, it typically depends -- what's that?
>> Question: To rank the pages. >> Craig Knoblock: Oh, to rank the pages. No, in our system all the user's doing here is going to the pages and pulling out the information they want. There isn't necessarily any kind of registration process. The task the user is trying to do is to aggregate the data about what restaurants are listed here and then how many reviews were actually available for each of the restaurants. Yes? >> Question: (inaudible) -- rank the page and only have the (inaudible) of the user. You know, I think (inaudible) ranking of the page in this rating, because a user can create a page. >> Craig Knoblock: Uh-huh. >> Question: And he himself can go for the page. >> Craig Knoblock: Well, the task we're trying to solve is that there is a set of websites that are available out there. And it's up to the user to decide what it is they want to do, I mean, what sources they want to combine, which sources they actually trust and how they want to use the data. So we're not really taking a position on that. Usually people decide for themselves which sources they want to trust. What we are trying to do is provide the tools that allow the end users to say, hey, I want to take this information and combine it over here. >> Question: Do users extract information from databases that organizations link to your Mashup system, or just from web pages? >> Craig Knoblock: Well, there's no way to link to our Mashup system, so we're really -- we're -- >> Question: Or (inaudible) so how do they store the information about restaurants? In their databases or in some web pages? >> Craig Knoblock: Yeah, we don't know and it doesn't matter to us. Right? It's on the web. We're basically assuming all the data is on the web, and we are not taking a position on how they store it or how they access it. Our assumption is that we have access to it through a set of web pages (inaudible). >> Question: The web pages are not important to the system? >> Craig Knoblock: Oh, it's very important. But the key here is that we are trying to support the user. I mean, a lot of people look at information integration, myself included, as a process where I specify a query and the system goes out and magically pulls information together, integrates it and presents it to the user, and the user is not involved in the process. With this project, the user is very much involved in the process. Right? What we're trying to do is allow the builder to build their Mashup. Right? So you say, I want to actually build this Mashup where I combine the restaurant data with the health grading data and stick it on the map. I've already spent the time to evaluate the sources available, how I want to combine the information, which information I want to use and which information (inaudible), and I'm basically creating the tool. >> Question: (inaudible) that a lot of sites really customize their pages to the user. So like for example you say browser (inaudible). >> Craig Knoblock: Uh-huh. >> Question: Like if you go to Amazon, I would get a very different page from you. How would you account for that? >> Craig Knoblock: That's a really good question. We don't account for that today. One of the things I should emphasize here is that we've taken sort of an initial approach at the different subproblems, and our contribution is not necessarily specific to the extraction piece or the source modeling piece or the integration piece.
It's really the unification across the different pieces. For any one of the issues -- and I have a related work slide at the end where I will talk about what people have done -- we could spend a lot of time delving into all kinds of issues like that, and my position is, well, they're all problems that are solvable in some way, but really what I'm trying to do is look at the big picture here, how you put these different (inaudible) together, and then you can delve in. There are lots of tricks you can play with those kinds of things. You can either model what the cookies are actually representing, or you can clear the cookies every time so they are not there; there's a bunch of ways to deal with that. In the system I'm talking about today, we ignore them. Okay. All right. So basically we've finished the extraction piece; we've basically exploited the document object model here. We can navigate through the pages, pull out the information and build the initial set of data that the user wants to work with. So the next step is for the system to build some kind of model of what this information contains. And this is useful for naming, for one thing, just so you have reasonable names for the kinds of information you're displaying. But it also turns out to be very useful with respect to the actual integration step. So we pull out some initial data. Let's say here we have a list of some set of names. And we may already have some data in the system. So we may already have extracted the L.A. health ratings; if you remember, I assumed we had already done that and extracted it into the database. You might also have some other sources of data. Maybe I've gotten some information about artists. Maybe another source, which is the Zagat site, also has restaurants. The idea is that you can use information you previously modeled. In some sense these have tags associated with them: these are restaurant names and Zagat ratings, and here we have artist names. And so what happens in this step is the system uses previously extracted information to try to identify things it can recognize from having extracted them before. Here maybe we discover that, of this set we've extracted, three of those occur under restaurant name in previous sources. One of them occurs under artist name. So we're going to take (inaudible) propose that, and in fact I think that in the current version of the system we actually propose all the possible ones that we've seen in the past, and the user can select one or say, no, this is a new one I've entered. The next step here is the actual cleaning process. So the problem here is that I've got my extracted information, and here I've decided these are restaurant names. And I have all the other information that I've previously extracted from other sources. So maybe I want to look at the restaurant names. And one issue that frequently comes up is that you get minor differences across sources. Right? So maybe there's a restaurant name that's misspelled, or maybe there's some systematic way in which these values have been changed -- probably not for restaurant names, but you may just want to be able to identify that, okay, potentially there are some problems here. So here you see Sushi Roka here is Sushi Roku there, and those kinds of things are a real problem, right?
Especially if what I'm trying to do is actually combine the information from the new source with the existing one, because then it is not going to realize that they're actually the same. And so the way we deal with this -- there are really two kinds of capabilities in the data cleaning. The first one is we allow the user to basically manually clean up the data if they want. Sometimes that is what it takes. Right? I want to combine the data, and in places I am going to have to fix things. And so this is a simple sort of table interface where I've got the set of data here and then it opens up three new tabs. One is a suggestion of what it thinks this is supposed to be. The next one is what the user provides as an example of what they want. And then the final thing here allows the user to say, well, in some cases the suggestions may be right, but maybe I want to override it. So in this case the user might open this tab and say, well, what I want to do is take this number of reviews and -- what I really want to do is get rid of the word "reviews." Right? Because later maybe I want to do a comparison and find the ones with the most reviews or 10 reviews or something like that. So the user enters 31 here. What happens under the hood is the system goes through and basically tries to find the transformation that would get it from this to this. Okay. And we have a set of predefined rule classes in the system, such as substring, or take the first token, or the last token, those kinds of things. So you have a set of predefined rules. The system then computes, based on the examples: well, can any of the rules calculate that? If not, it won't produce anything. But if it can, it will actually generate the suggestion -- this table doesn't show it filled out, but it would generate all the suggested values based on this rule -- and then the user can essentially confirm this and say, okay, yeah, this is the transformation that I want to apply here. So there are really two types of cleaning that go on. One is that the user essentially tries to -- well, specializes a set of generic rules to the case here so you can apply them to the cleaning. And the second one is to essentially allow the user to override, so if there is noise in the data they can clean it up. And that is a big problem with that -- >> Question: (inaudible) -- successively refine the rules, so that the first rule might work for five and you get to number six, well, this is kind of an exception, so can you apply something that applies to everything? >> Craig Knoblock: That would be a great thing to have. Right now the system basically tries to come up with one rule that will apply across all the data, but then allows the user to do some final edits on it. Yeah. You could then go back and, if it missed one or two, clean those up manually. >> Question: (inaudible) -- best case where a user would want to (inaudible) or should be some code in there? >> Craig Knoblock: We don't support that, because it doesn't really fit our model. You could argue that maybe you should allow them to do whatever, and tools... but we really tried to keep with the original goal here of trying to match, okay, what is an end user going to do? We know how to write (inaudible) questions, but (inaudible) not allowed to tackle them.
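To make that rule induction concrete, here is a minimal Python sketch: given the user's single (original, edited) pair, search a small library of predefined rule classes for ones that reproduce the edit, then apply a chosen rule to the whole column. The rule set here is an assumption for illustration; Karma's actual rule classes (substring, first token, last token, and so on) may differ in detail:

    # Cleaning by example: find predefined rules consistent with one
    # user-supplied (original, edited) pair, then apply to the column.
    RULES = {
        "first_token": lambda s: s.split()[0] if s.split() else s,
        "last_token": lambda s: s.split()[-1] if s.split() else s,
        "keep_digits": lambda s: "".join(c for c in s if c.isdigit()),
    }

    def induce_rules(original, edited):
        return [name for name, f in RULES.items() if f(original) == edited]

    def apply_rule(name, column):
        return [RULES[name](v) for v in column]

    column = ["31 reviews", "7 reviews", "102 reviews"]
    for name in induce_rules("31 reviews", "31"):  # the user's one example
        print(name, apply_rule(name, column))
    # first_token ['31', '7', '102']
    # keep_digits ['31', '7', '102']

Note that more than one rule can be consistent with a single example, which is why the system proposes suggestions for the user to confirm or override rather than silently applying one.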
>> Question: I would imagine the user would want something even simpler, where they initially just selected the number 31 and copied that to the table, rather than "31 reviews" and then figuring out how to further -- >> Craig Knoblock: Oh, well, they could have done that. They could have done that in the original extraction. But in fact that is one issue with using the document object model approach, which is that the easy things to extract in the document object model are those things that are at the level of the document (inaudible). Right? And so once you are down within it, all of a sudden it becomes harder to pull out exactly that piece of information. >> Question: (inaudible) -- well, as the rule at the same time. >> Craig Knoblock: One of the things we're doing now -- I actually think the (inaudible) approach is somewhat limited. One of the things we're doing now is integrating in a more sophisticated learning approach, with which they could have gotten 31 to begin with. So you could see other kinds of examples that aren't just extracting a substring. For example, let's say I have last name, comma, first name. And now I want to switch them around. That's a very simple kind of transformation to show the system. The system could easily have seen that type of transformation before. So it could say, oh, okay, I see what the user is trying to do there. But it wouldn't be easy to extract from the original page. Unless -- I guess you could extract the last name separately and then put them together. Any questions? Okay. Now we're on to the data integration piece. And what happens here is, okay, I'm looking at a table of information that I have, and the integration is, well, I want to integrate some data with the table of data I'm currently looking at. And what happens down here is I'm in the integration tab, and what's available to the user at this point is to essentially put data in any of these positions. So if I add information here, then really I'm trying to specify some kind of union of the existing table with the information. If I want to add information over here, then I'm really trying to do some kind of join, where I'm expanding out the set of information that I want to combine with the data I'm already looking at. Up here the way it works is the system may provide a set of options, which are things it can see how to join with the existing data set -- okay, these are additional attributes that may be in the system. So what happens is that the user goes into the integration phase and essentially hits the fill button. And what gets generated then -- if this is the original table and these are the sources in my repository -- is that the system looks for ways that it could actually add additional information. You can see here that for the L.A. health ratings, there are the actual health ratings for the restaurants. Right? And for Zagat, here's the Zagat rating, which is maybe the food rating from Zagat. So it can basically compute the possible potential joins across the different sources and say, okay, this data here could potentially be joined with this information over here, because there is overlap in the restaurant and address. Likewise this information here could be joined across just the restaurant name, with no address information. And what happens then is we essentially compute these possible joins.
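Here is a minimal Python sketch of how those join candidates might be computed: a repository source is proposed if it shares modeled attributes with the current table and the shared columns actually have overlapping values. The attribute names and repository layout are illustrative assumptions, not Karma's internals:

    # Propose joins: a source qualifies if it shares an attribute with
    # the current table and the shared column has overlapping values.
    def join_candidates(table, repository):
        cands = []
        for src_name, src in repository.items():
            shared = set(table["columns"]) & set(src["columns"])
            for attr in shared:
                t_vals = {r[attr] for r in table["rows"]}
                s_vals = {r[attr] for r in src["rows"]}
                if t_vals & s_vals:  # some values actually match
                    extra = sorted(set(src["columns"]) - shared)
                    cands.append((src_name, attr, extra))
        return cands  # (source, join attribute, attributes it could add)

    table = {"columns": ["restaurant", "address"],
             "rows": [{"restaurant": "Sushi Roku", "address": "456 Elm St"}]}
    repo = {"la_health": {
        "columns": ["restaurant", "address", "health rating"],
        "rows": [{"restaurant": "Sushi Roku", "address": "456 Elm St",
                  "health rating": "A"}]}}
    print(join_candidates(table, repo))
    # [('la_health', 'restaurant', ['health rating']),
    #  ('la_health', 'address', ['health rating'])]

In practice you would also want fuzzy matching on the values, which is exactly why the cleaning step matters so much before integration.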
So what happens in the user interface is that we have this set of information up here -- these are the possible attributes that the user could select from a drop-down list. And then down here we have the actual values. So the system actually computes potential values based on the database. The user may know what the data looks like that they want to add here and can select that. So all of that gets computed based on the databases that are available. And then the system just expands out the information. So if I were to select health rating at the top of the column, then it's just going to combine the information across the different sources. It's easy for the user to do that. So let me go through a more general example in terms of the types of queries the user can specify. Here is a single column of information. So what is really happening here is that at the top level -- this is essentially going to be the top of the table, the attribute level -- we can compute the attributes that can fit into a particular column. And then here you have the cell level, which is the possible values you can fill in. The attribute level is essentially going to be the set intersection of the candidate attributes across all the different rows, and the values are going to be essentially the values that are consistent with that particular attribute. So let's say I stick Los Angeles into this table. And I've got a source here that has the attribute city, and it has a set of information about (inaudible). I may also have a song name that matches Los Angeles, in a source that's essentially got information about pop music. And so what's going to happen in the other cells is I can actually put in values here that correspond to other possible values for both of those things, both the city and the song name. Then if I type in "Honolulu," it may find that, okay, this corresponds to city in the same source that I had here, and it may also correspond to some other source here. So then the question is, can you use this to help you determine the plausible attributes for the table? And the idea is that you are basically narrowing it down based on the information that you have seen there. And so you see, once you have a few examples, okay, that is likely to be city. We actually keep a drop-down list here, so that if it turns out the system's conjecture is wrong, then you can change it. The final step is the actual map generation. So the idea here is that we essentially apply a general set of heuristics today to look at the data that is in any of the columns and say, do I actually know how to place any of these kinds of things on a map? Right? So is it a street address? In which case I can run a geocoder on it. Or does it correspond to some other type, maybe a city name? In that case I could place it on the map. Or is it a state name, for example? So there is just a general set of rules that allow it to take the data and stick it on a map. If it's amenable to sticking on a map, then you just click a button and say, okay, show me the data on the map.
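The Los Angeles/Honolulu narrowing above amounts to intersecting candidate attribute sets. Here is a tiny Python sketch, with a hypothetical repository index mapping each previously seen value to the attributes it appeared under:

    # Plausible attributes for a column = intersection of the attribute
    # sets that its filled-in cell values were seen under before.
    def plausible_attributes(cell_values, seen_under):
        sets = [seen_under.get(v, set()) for v in cell_values]
        return set.intersection(*sets) if sets else set()

    seen_under = {  # illustrative repository index, not Karma's
        "Los Angeles": {"city", "song name"},
        "Honolulu": {"city"},
    }
    print(plausible_attributes(["Los Angeles"], seen_under))
    # {'city', 'song name'}  -- still ambiguous after one example
    print(plausible_attributes(["Los Angeles", "Honolulu"], seen_under))
    # {'city'}  -- the second example narrows it down

Each additional example value can only shrink the candidate set, which is why a few cells are usually enough to pin down the attribute, with the drop-down as a fallback when the conjecture is wrong.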
Otherwise -- today we haven't really spent time on the user interface or the display part of it -- you're just stuck either looking at the data as a table or sticking it on the map. Okay. So let me spend a few minutes now and talk about the evaluation of this. We used as a baseline a combination of a system called Dapper, which is a Mashup tool that really focuses primarily on the extraction piece -- it works quite nicely, it's very easy, there's a web-based interface you can go to and build these extraction things in Dapper -- and a system called Yahoo Pipes, which, as I mentioned earlier, is sort of a widget-based paradigm, where you are basically using widgets to connect information up. In the Mashup community it is actually quite common for people to combine different Mashup tools, and people in fact oftentimes combine Dapper and Pipes, so it seemed like a natural way to do this. These are popular and widely used tools. So we thought, okay, this is a reasonable combination of tools to compare our approach to. And we wanted to look at a couple of different issues. One was, could we actually show that users with no programming experience could build the different Mashup types that we have here? The second one was, could we show that Karma actually takes less time to complete each of the subtasks and scales better as the tasks get harder? And finally, overall, does the user take less time to build the same Mashup in Karma as compared to Dapper and Pipes? And for users in this experiment, we had 20 programmers. These were essentially students in the course that I teach. We gave them extra credit if they would run the experiment for us, and so we got a whole bunch of students, but these are basically master's students in computer science that probably have a (inaudible) amount of programming experience. And then we took three nonprogrammers. These are essentially administrative assistants that typically are familiar with Excel and Word but really haven't had any experience in programming. Okay. And so in terms of the claims themselves, the general set of users are used to support the second two claims here -- that it takes less time for the subtasks and that overall we could do it faster -- and the nonprogrammers here are used for the first claim, which is that someone with no previous experience could actually build the four Mashup types. Okay. So the setup was the following. For the people that were programmers, since they took this class, we actually had them do assignments in Dapper and Yahoo Pipes, so they were already familiar with these tools. They actually had to do a couple of different assignments for this, each of them using these tools. But in addition we gave them a review package prior to the experiment -- we sent them something that said, okay, do the following tasks so you remember how to use the tools. This was early in the semester. Then right before the experiment we gave them a 30-minute tutorial going through the different systems and how they work, so everyone was familiar; it wasn't like a new tool to them. The tutorial was about 15 minutes on Dapper and Pipes and then 15 minutes basically teaching them how to actually use Karma. Then, since they hadn't previously used Karma, we let them practice on two to three tasks using Karma, just simple tasks that were different from the (inaudible), so they had some familiarity with Karma as well. Then for the test we gave them three tasks.
And the programmers basically alternated between Karma and Dapper and Pipes for each of the tasks. The nonprogrammers only used Karma, simply because we didn't think it was practical for them to use Dapper and Pipes. And then everything was basically recorded using video capture software, and that just allowed us to go back and measure the amount of time after the fact, so we could see how much time they spent on the different subtasks. So here is an example of the results. This is one task here, the different subject numbers, and then we broke out the time based on extraction, modeling, cleaning and integration, and then the total time. And notice here we have these fives -- we basically had a five-minute cutoff time for the users, and that is simply because many of the users eventually fail on some of the tasks. Then we measured these things for both Dapper and Pipes and for Karma. Okay. Then the evaluation tasks. We had three tasks that sort of correspond to the Mashup types I talked about. So type 1 was the simple extraction task: extract information and put it on the map, and here we have the different pieces -- the data extraction, the source modeling and cleaning, and the integration -- and we rated, this is sort of our projected analysis, how hard these different tasks were. And you can see the second task really combined both dealing with a form kind of interface and then doing some union over the data; for this one they had to do the union, a simple cleaning task, and simple modeling. Then the third task was a little more complicated. They had to join two sources together. The data extraction and modeling were pretty simple, but doing the join was more complicated. And there was no cleaning for that. And these were real tasks with real sources. Okay, so let's take a look at the claims one by one. Claim one was that users with no programming experience can build all four Mashup types. Claim two was that Karma takes less time to complete the subtasks and scales better as the tasks get harder. You can see here that we have different difficulties on some of the tasks that we compare. And then claim three, which was that overall the user takes less time to build the same Mashup with Karma as compared to Dapper and Pipes -- so we look at the overall time to do the end-to-end task. Okay. So let's look at the first claim here. Here are the results for the nonprogrammers. There are only three users here, which is somewhat limited; really we are just trying to show a proof of concept, that we can take people not trained as programmers and in fact they could actually complete the tasks. And you can see that these are the three subjects, color coded here. We have the time on this axis here and then tasks one, two and three. For the nonprogrammer subjects we used a time limit of 10 minutes, and you can see that they all completed within the time, and they did pretty well in the tasks overall. Clearly the red subject took more time than the other two, but overall they completed the tasks. The second claim was that Karma takes less time to complete the subtasks. So I'll just go through quickly each of the subtasks here. So we have the extraction work. You see the same setup here, which is that these are ordered from simple to hard -- so simple, moderate and difficult tasks, which in this ordering are task three, task one, task two.
We order them from easy to hard, and then on the X axis we group the time into different intervals, and on the Y axis we have the number of subjects. So you can see Karma shown in green and Dapper and Pipes shown in blue. You can see on the simple task Karma did better. Then as we get to the moderate and hard tasks it continues the same pattern, which is that in general Karma is outperforming Dapper and Pipes across these extraction tasks. Which -- I don't have a demo with me unfortunately. I should have brought one. All right. Part of it is we are in the midst of a complete reengineering of the system, so (inaudible) on Monday, so it is kind of in flux at the moment. Okay. So here we have the -- >> Question: (inaudible) -- >> Craig Knoblock: No, we haven't made it available. It really has very much been a research project. But we have a movie of it, and I'll put the movie up on the website and you can take a look at that. It's very much an active research project, where we're looking at improving each of the different components. Now that we have the system working, we're working on different components, and I'll talk more about that in terms of where we're going next. But it's pretty much work now where we're doing some additional work on expanding generality so it will work on more websites and do more stuff. But eventually we'd like to make it available. Okay. So one of the things I want to point out here is that not only is Karma performing better in terms of the time, but on Dapper and Pipes you can see there is a certain number of people failing on these tasks. For the simplest one, everyone completes it. But for the moderate and hard tasks -- these are programmer users -- some of them are actually failing to even complete the task, which is somewhat surprising, given they've all been trained to use the systems. Source modeling is a little different. So what happens in source modeling: we have Dapper and Pipes, and you can see now it is outperforming Karma, at least on the first two tasks. That is simply because there is not so much of a need to do modeling -- since they are not supporting integration at the same level, any modeling that has to get done in Dapper and Pipes is actually simpler to do. In this case here, Karma was able to do the modeling automatically, because it actually recognized the particular thing that had to get modeled and generated the right type; it didn't require time from the user at all. But you see the time here is quite small. So even though Karma performs worse on tasks one and two, there's a relatively small difference -- we're talking about 30 seconds -- and you will see this savings gets realized in the integration step: when we do the integration, it actually makes life easier for the user. Then we get to the data cleaning up here. There were only two tasks that had data cleaning. You can see Karma is doing better than Dapper and Pipes here. And as the task gets harder, more subjects are failing the task. So on this task we are getting 35% failing, and here we are getting 83% that are actually failing. Yeah? >> Question: (inaudible) -- extra credit? >> Craig Knoblock: Absolutely. Or they would never forgive you. They were all motivated. I think everyone really wanted to complete the tasks. They didn't know how it was going to be scored. Then here we have the data integration step, and there were only two tasks that had to do this. So you can see here that for the union task, well, there is almost no work for the user here.
It happens almost immediately. There is more work to do this in Yahoo Pipes, and then for this task you can see most of the users failed here, and this really had to do with the model for doing the joins in Pipes. It's quite tedious: the user has to go to this menu and select which thing you want to join with and so on, and it's quite time consuming to do. So either they ran out of time or they just gave up because it was quite (inaudible); there were one or two users that were able to complete that. So here we have again about 30% fail on the union task and about 95% failed on the join task. Okay. And then overall -- these are really just the numbers aggregated together. You can see here for task one, the green here is Karma, which performed much better. Task two, again performed much better; you can see these times are quite large. And task three, overall again Karma is doing better on these tests. Okay. If we look at the ratios -- these are really just the speedup or slowdown -- on extraction Karma is about 2.2 times faster. Source modeling is a little slower. Cleaning was about four times faster, and integration six times faster. Overall we get about three times. We did the significance tests, and Karma is significantly faster when we compare the averages, except on the source modeling task, where the Dapper and Pipes system is faster. Yes. >> Question: Do the programmers do the tasks in the same order, like they always do it with Karma first? >> Craig Knoblock: No, we switched up the orders. So they would do -- some of the students would do -- >> Question: (inaudible). >> Craig Knoblock: No, there's not. And in fact we switched them. There are three tasks, so which task they started first we would also vary. Okay. So now, there's been a fair amount of related work. I don't really have time to talk about all the work specifically, but let me describe it in general. There are a number of tools out there: MIT had a system called Simile, there's Dapper and Pipes, Microsoft has a system called Popfly, there's Marmite, Intel Mash Maker -- there's a bunch of interesting programs out there that have all tried to do some piece of this kind of Mashup work. If you look at the types of tasks supported -- we are classifying these from our view of the world in terms of the things we were doing here -- very few of them really try to do all the different pieces of the integration. Right? They do different pieces of this, you know, either the unions or joins or both. Intel Mash Maker is the only one that covered the same classes of tasks that we did here, and if you look at how the system works, it requires an expert. Once you get into those kinds of tasks -- and you can read the paper -- to get to the integration tasks where you are doing joins and so on, there is basically an expert that is writing the integration for you. So in some sense, yeah, they support it, but with an expert; they have different levels of users, and the base level user really just sort of invokes a page and it applies the Mashup. Simile: this is sort of early work using the document model. It was focused on extraction on the web (inaudible). It was an interesting system, but it supported very limited kinds of work. A follow-on system here was Potluck, which also creates RDF for (inaudible) stuff, where the user manually specifies the data integration task.
Dapper mainly focuses on extraction and really could only do simple linking of information: it could take this information and link it with this other source. Then all these systems are kind of grouped together; these are all taking a widget-based approach. Microsoft Popfly has a much fancier UI and more widgets. Marmite uses this kind of workflow-based approach; in their user study they found that users really got confused by the whole workflow approach. Google My Maps, you know, lets a person place points on a map. This is work we did at USC -- this is really our system that preceded this, called Agent Wizard. This is more of a question-and-answer type of approach. The idea was that you essentially go through and ask the user a series of questions to create the Mashup. And the problem is it didn't scale well, in the sense that the questions get quite tedious as tasks get complicated and there are more and more questions. You never know when you will get to the end of the questions. It was sort of modeled on the tax programs: question and answer, and you have your Mashup. But it wasn't what users wanted. This is another system, from Microsoft, with sort of more of a tuple-based approach, where everything is mapped into tuples and things are linked between them. Finally, Karma on the bottom. So a lot of these systems address some piece of the problem, and in some sense most of them require more expertise to get to the same level. Then there is a whole set of work on each of the subproblems. I'm not going to go through it all. There is quite a bit of interesting work on, for example, information extraction. We've done some of this work. We sort of fit into this category here of -- sorry, of exploiting the document object model, but in fact we are moving to this model here. So in fact we have a new version of the system where we have integrated work we have done previously on machine learning types of approaches to do the extraction, and this is really just to get more generality in the types of pages we can apply this to. Then there's work on source modeling -- Bill here has been doing work himself on this. What we're doing here is just leveraging very simple kinds of techniques in the current system, and we want to integrate more sophisticated kinds of (inaudible). Data cleaning: there has been a lot of work on data cleaning, and we provide some generic model into which many of these techniques can in fact be integrated. And then the integration piece -- I think the most closely related work here is other work on programming by demonstration. Lau, for example, has done work at Washington on this piece. And we're finding similar techniques, but doing it in a new framework and sort of a novel way. Okay. So let me just wrap up. Clearly Mashups are here to stay. There's a lot of interest in this whole idea of being able to take existing sources and put them together in novel ways, and the need here is to find some nice way that really allows end users -- the web users out there that use browsers -- to do the same thing and build their own Mashups. Our contribution here is this programming by demonstration approach, where we use the single table as the unifying paradigm to hold the information together. We address these four pieces we view as central to Mashup construction: the extraction, modeling, cleaning and integration.
Then there's the query formulation technique that really allows the user to specify the integration -- to say, I want to integrate this source with this source in this way -- which I think is fairly natural for the user but maps to fairly complicated queries underneath. Then finally we evaluated, or demonstrated, this approach by showing that real users could actually complete these tasks and get a significant improvement over existing approaches. We're very interested in future work, and we're already working on some of these directions. One area where we've done almost no work is customizing displays. I think this is a really interesting topic: one of the things you see if you look at all the different Mashup tools out there is a huge number of different kinds of clever interfaces and displays people use for Mashups. We ignored that and just said, put it on maps. I think there's a really interesting opportunity there for allowing the end user to customize it their way. Another area we're interested in pushing on is learning and generalizing over the tasks. We really want to be able to store the integration plans and re-execute them on new data as the information changes on the web. This is really a natural extension of what we have here. It is not (inaudible). The next one I mentioned is work on machine learning for the extraction tasks, and in fact we're doing work in collaboration with Fetch Technologies, because they have been building new tools for automatically wrapping form data and automatically extracting the data on the result pages from the forms. So we're looking now at integrating that technology into this, so we can get more coverage for pages that aren't handled well by the document (inaudible). Then there's other work we've been doing in the past on automatic source modeling. Right now we're using some fairly simple techniques in Karma, and we want to integrate that piece so we actually do more sophisticated kinds of modeling, where you actually learn from the data you have seen in the past: I recognize this data format, and once you know the date format you can normalize it from that standpoint. Let me just mention a couple of papers; these are available on my web page. One appeared at the 2008 IUI conference and the other at the 2007 IUI conference, so if you are interested in the work, the details are there. So that's it. Thank you. (Applause) Questions? You guys exhaust all your questions? >> Question: Yeah. (inaudible) -- end user interact available -- >> Craig Knoblock: (inaudible) -- that is a longer process, because to get to the point where you can put up a tool people use requires getting to a particular level with the software, and it's not, you know -- >> Question: (inaudible) -- tests? >> Craig Knoblock: Yeah, we do, but putting it up on the web is more complicated, right? So you either have a piece of software you let people download and install, or you have to have some kind of web-based interface that allows them to use it. We had users test it: we said, here is a computer; we put the computer in front of them and let them run the software. It's just a matter of resources and cycles to support the (inaudible). But I will put up the video, because I really should have brought the video to show now; it is nice to see what it looks like. >> Question: So for the machine learning extraction part, I guess I'm kind of curious how you plan on doing that with the user.
Does the user have to interact to set the boundaries for the extraction, or...? >> Craig Knoblock: The way that works is with the interaction we have now: we have an actual version where you copy and paste the data from the web page into the table, and we use that data as labeled training data to do the extraction. In some sense, what is happening under the hood is that every time a user copies and pastes into the table, it treats that as a learning task: it goes off and tries to learn the extraction, generates the data it would produce, and maybe it is what the user wants, maybe it's not. The user can refine it by bringing in more examples. So it takes a stab based on the first example, and then if it doesn't get it right, you copy more examples and bring them in (a sketch of this idea appears at the end of this transcript). But it is not as closely tied to the document object model; that is the real advantage. The disadvantage is it may require more interaction, and we're talking about maybe some hybrid approach where, when you can get it to use (inaudible), you do that, and when you can't, then you can use this data approach. >> Question: (inaudible) -- instructed is that integrating the restaurant (inaudible) matching on the page or -- >> Craig Knoblock: No, it's (inaudible). It's a good idea, but there is another path to go. People have been doing more work on just automatic extraction from pages themselves, where another model we're exploring is that you automatically extract the data off a page -- you're essentially trying to segment the data into different pieces of information -- and then an example from the user tells you which information the user is looking for on that page, and then you can pull that information out. I think that is probably the most promising approach, because it will probably minimize the amount of training data. The issue with previous work on machine learning for building wrappers and doing extraction is simply the amount of training data it takes to really get it right. Users have limited patience for providing training data in the real world. Ideally they want to get it with the first example, and maybe they're willing to give a second example; after that they throw up their hands and say, it's a stupid system. Other questions? Yeah. >> Question: (inaudible) services -- basically just (inaudible)? >> Craig Knoblock: Actually, we've more recently looked at that. We have a version of the system now where we can do the extraction from Excel -- what happened to web services? -- oh, and just plain text files. But there is no reason we couldn't add web services as well; that would be quite natural. Yeah, we have been trying to generalize the set of sources; not everything is in web pages. Yes? >> Question: There's a project called (inaudible) for (inaudible) organizations (inaudible), where organizations can link their databases to Wiki pages and users can go into the databases and extract information and do some kind of data analysis and integration. The problem we are facing is that experts are somehow reluctant to contribute to the Wiki because, for example, they're not sure whether other users will come to their pages and collect data, and they would like to lock their pages. >> Craig Knoblock: Right. >> Question: And they are also concerned about the credibility of the sources of information that other users -- >> Craig Knoblock: Uh-huh. >> Question: -- link to the Wiki. >> Craig Knoblock: Right.
>> Question: And they are looking for some ways to filter the reliable data from the unreliable data in order to save it in their (inaudible). >> Craig Knoblock: Right. >> Question: So now my question is, do you have a (inaudible) for defining access for different groups of your users? Because I'm sure you will have some experts in your system, users from (inaudible). >> Craig Knoblock: The thing about our system is that it is a tool a user would download and use, or use on the web. And so we don't really have a need for access control. Users are really just going out to public sources, or publicly available sources, pulling them together and building their integrated tools. We're not making any data available ourselves, so we don't really have to provide any kind of access control. It's a slightly different problem, right? You're in the position where you're trying to manage your own users' data, who has access to it, and what they are going to do with it. >> Question: (inaudible) -- to sampling from the others (inaudible). >> Craig Knoblock: Uh-huh. >> Question: (inaudible) one or two really like to share data and information from the Wiki with other people. >> Craig Knoblock: Right. >> Question: So it is not necessarily just -- I mean, we cannot assume that the data is just from the database, so people can (inaudible) and gather some information (inaudible). >> Craig Knoblock: Right. There's a larger issue here in general, which is, what's the overall business model for the people providing the data? Right? Say I'm a data provider. Maybe it's a publicly available data source like the L.A. County Department of Health; that's just a public service your tax dollars pay for, to have the data available. But you could imagine that with health ratings and those kinds of things there is an issue of how the organizations are making money, and whether you are circumventing their business model by extracting the (inaudible). I haven't taken any position on those issues here; we're looking at the technology to be able to put stuff together, but those kinds of things are still open issues. Any more questions? Okay. Thank you. (Applause)
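To make the copy-and-paste learning idea from the discussion above concrete, here is a minimal sketch in Python, assuming a simple prefix/suffix rule representation; the function names and the page fragment are illustrative, not Karma's actual implementation:

    import re

    def learn_rule(page, example, context=5):
        # Induce a (prefix, suffix) rule from one pasted example: take the
        # few characters of page text on either side of where it occurs.
        i = page.find(example)
        if i < 0:
            return None
        prefix = page[max(0, i - context):i]
        suffix = page[i + len(example):i + len(example) + context]
        return prefix, suffix

    def apply_rule(page, rule):
        # Extract everything that appears between the learned prefix and suffix.
        prefix, suffix = rule
        pattern = re.escape(prefix) + "(.*?)" + re.escape(suffix)
        return re.findall(pattern, page)

    # Hypothetical page fragment with a repeated structure.
    page = '<li class="r">Canele</li><li class="r">Spago</li><li class="r">Mozza</li>'

    # The user pastes "Canele" into the table; the system learns a rule from
    # that one example and applies it to recover the remaining values.
    rule = learn_rule(page, "Canele")
    print(apply_rule(page, rule))   # ['Canele', 'Spago', 'Mozza']

If the first guess extracts the wrong set, a second pasted example would let the system adjust the learned context until the output matches what the user expects, which is the refinement loop described in the answer above.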