>> Christian Konig: Good morning. It's my great pleasure to introduce Parag Agrawal from Stanford University. He's been working in a wide range of areas. The main thrust of his work is on probabilistic and uncertain data, but he's also been working on data cleaning, namely indexing for containment of data, but also things like secondary index maintenance and distributed systems, and even something as farfetched as blog ranking. So if you meet with him later, feel free to ask him about any of these areas. And with that I'll hand it over to Parag. >> Parag Agrawal: Thanks, Christian, for the introduction. He already finished off two of my slides, so please forgive the repetition. As Christian said, the title of the talk is Coping with Uncertain Data. I'll talk about some of my thesis work, which is all about uncertain data. And then I'll talk about some work I did here at Microsoft Research with Drago and Arvin [phonetic], which was about indexing [inaudible] containment; one of its challenges comes out of dealing with uncertainty. Okay? As Christian also mentioned, I do a bunch of other things. I'll give you one-slide, one-bullet overviews of some of those to promote talking about them in my one-on-ones. So let's get started with uncertain data. Let me start with an example and tell you what uncertain data is and what uncertain databases are. Let's say I wanted to extract the dates of the 2010 World Series games, and let's say for the sake of PowerPoint that there are three games in the World Series. So suppose I have a Wikipedia article that I want to extract from, and I write a very, very naive data extractor. It's not smart, it's not optimized. It just finds two dates and nothing else; it doesn't tell you which game is when or any such information. So at this point, if I think about it, knowing these two dates could mean that game one and game two are on October 28th and 30th. But it could also mean that game one is on the 28th and game three is on the 30th, or it could mean that games two and three are on the 28th and 30th. Notice that I've used some knowledge here in generating these alternatives or possibilities, which is that game one is before game two, which is before game three, correct? And I've also implicitly assumed that these dates are associated with at least one game and that games are on distinct dates. So based on all this knowledge I've conceptually created what we call possible worlds. We have three possible worlds listed here for a database which was about games and their dates. So I'd argue these three yellow tables are three possible worlds of an uncertain database. Intuitively, an uncertain database consists of possible worlds, and the intuitive meaning is that the real state of the database is one of these possible worlds. Okay? Let's move on. Another way uncertain data can be created is through sensors. Let's assume we have three rooms, and we have a sensor -- say the blue sensor -- which can sense some part of room A and room B. Okay? And the sensing is of the form that if there's an object within the shaded blue area of room A or room B, the sensor is going to read its ID. So suppose this sensor reads an ID. At this point it reads the ID of object 2, so it knows that object 2 is in this blue shaded area. So it would have to be either in room A or in room B.
So it has conceptually created this uncertain database, which is room A or room B. Similarly, suppose there is this other sensor, the yellow sensor, which now also detected object 2. This again reports an uncertain database which says that object 2 is in room B or room C. At this point, knowing what the sensors do, we know the object has to be in this light green area because it's detected by both of the sensors. Also notice that this color scheme will carry through the talk: blue and yellow combine to form green. So at this point, by combining information from both of these sensors -- doing data integration in some sense -- you should be able to conclude that the object is in room B. Okay? So what we've conceptually done for this example is some reasoning with uncertainty, and we've sort of resolved or removed the uncertainty. Okay? And this will be a theme during the first part of the talk. Another kind of uncertainty, which is very different, is that data has non-canonical representations. Here the motivating example is that we have venues and their names, but people refer to these names using different representations. A user could refer to Square as SQ and New York as NY. So due to these non-canonical representations, what we conceptually have is multiple names for the same venue. And there could be a very large number of them if there are more such non-canonical representations -- in fact, an exponentially large number of them. One of the challenges this introduces -- in this case the motivating example is searching these names -- is efficiency. How do we do efficient search in the presence of this exponential explosion? The second part of the talk will deal with this challenge and get into more details there. So with this, the talk outline is: in the first part of the talk we'll talk about data integration, which is about reasoning with uncertainty, as we saw on one of the slides. The second part will deal with efficiency -- efficient fuzzy lookups, which is work I did at Microsoft Research with Drago and Arvin. And then I'll just have one-slide overviews of some work I've done on other topics to promote discussions. So what's an uncertain database? Revisiting that: as I mentioned, an uncertain database is essentially a collection of possible worlds. I'm not talking about uncertain databases with probabilities in this talk; I'm focusing on uncertain databases as sets of possible worlds. Probabilistic databases typically are these possible worlds with a probability distribution over them, and I'm not going to get into that during this talk. So think of uncertain databases as sets of possible worlds. And what are the semantics of queries for uncertain databases? When a query is issued, it is conceptually issued against each possible world, and each answer becomes a possible world of the result. So the result is also an uncertain database, which is a set of possible worlds. So conceptually the query is applied to each possible world and you get an uncertain database as a result. This is the semantics; it is never actually done this way in practice, because the set of possible worlds can be very large. Instead, these are represented using a compact representation. Think of the non-canonical representation slide, where I essentially said that by storing just how a user might refer to Square -- SQ for Square -- we have a compact representation of a large number of possible worlds. So you would have a compact representation for an uncertain database.
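As a toy illustration of the sensor reasoning above (my own sketch, not something from the talk): if each possible world is just a set of (object, room) tuples, combining the two sensors' uncertain databases while keeping only mutually consistent worlds resolves the uncertainty.

```python
# A minimal sketch (not from the talk) of the possible-worlds view of the
# two-sensor example. Each possible world is a frozenset of (object, room)
# tuples; an uncertain database is just a set of such worlds.

def combine(db1, db2):
    """Naive combination: keep the unions of pairs of worlds that agree.

    "Agree" here is the toy assumption that object 2 is in exactly one room,
    so two worlds are compatible only if they place it in the same room.
    """
    result = set()
    for w1 in db1:
        for w2 in db2:
            merged = w1 | w2
            rooms = {room for obj, room in merged if obj == 2}
            if len(rooms) == 1:          # consistent: a single location
                result.add(merged)
    return result

# Blue sensor: object 2 is in room A or room B.
blue = {frozenset({(2, "A")}), frozenset({(2, "B")})}
# Yellow sensor: object 2 is in room B or room C.
yellow = {frozenset({(2, "B")}), frozenset({(2, "C")})}

print(combine(blue, yellow))   # -> {frozenset({(2, 'B')})}: uncertainty resolved
```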
Query processing now involves getting a compact representation of the result without ever expanding out to the possible worlds. So this is the efficiency part: the bottom part is what a typical uncertain database system implements, and the upper part defines the semantics it wants to implement. Okay? So what are applications of uncertain databases? I mentioned sensors, which is a commonly cited application. Extraction is another important application because it generates a lot of uncertain data. Scientific data is another place where a lot of uncertainty occurs. And data integration, because you might not know exact results for entity matching. But one thing you can observe in all these applications is that there's often not only one source of data, the blue source, but also other sources, like the yellow and red sources. For example, you can have multiple sensors which are reporting partially overlapping values. You can have multiple extractors on the same webpage. You can have multiple webpages which are talking about the same information. So you would conceptually like to be able to combine information from all of these, and each source essentially is an uncertain database. Okay? In scientific data as well, you could have multiple experiments or multiple observations supporting the same hypothesis. And data integration by definition has multiple sources. So one common theme in a lot of applications of uncertain data is that you have multiple sources of information, all of which could be uncertain. That's what we'll handle in this talk. I forgot to mention: please stop me at any point to ask questions, clarifications, anything. So let's move on with this motivation to talk about uncertain data integration. Let's say [inaudible] same picture. What would uncertain data integration look like? What we have is a collection of uncertain databases -- the yellow, the blue, and the red and more. In typical data integration fashion, we'd like to have a unified query interface for querying information from all of these sources, which would be a mediated schema. We need a bunch of mappings to associate each database with the mediated schema. This is just standard data integration applied to the uncertain context. And now one question is: when a query is issued to this mediated schema, what should the result be? Okay. So this is one thing we'll talk about during this talk. The second part is how we compactly represent these uncertain databases and how we efficiently do query processing over these compact representations to get a result representation. This is something I will not discuss in this talk, but we have some work in our papers about it. So what I'll do is focus on the upper part: defining what the results of uncertain data integration should be. Okay? So this will be primarily theory and definitions. The second part of the talk will be more applied. Okay? So let's take a step back and see what data integration typically looks like and what the high level objective is. In a very simplistic world I have two certain tables, the blue table and the yellow table, and I'm trying to combine them. Intuitively what I want to do is get their union in some sense. So here what I'm trying to show is that the green part is tuples that occur in both databases, the yellow part is tuples that are only in the yellow database, and similarly for the blue part. So what we've gotten is more data as a result of data integration.
So notice that I've sort of assumed that both of these are in the same schema, with identity mappings and everything. But intuitively we get more data as a result of data integration. In the uncertain world, we're starting with two sets of possible worlds, two uncertain databases, and we're trying to combine them. What I am going to argue is that you're going to get more data but you're also going to get less uncertainty as a result of this combination -- as in the sensor motivating example, where we got a certain result for where the object was. So this will be a theme in the talk: how do we get less uncertainty, how do we make this happen. So again, one intuition that -- sure? >>: When you say it's less uncertain, how do you know [inaudible]. >> Parag Agrawal: So that's a very good question. It's not always less uncertain. For instance, if you have two databases which are totally unrelated to each other -- that is, we cannot correlate the possible worlds of one with the other, there's absolutely no overlap in them -- that means that when we combine them, you're not going to get any less uncertainty. So "more or less uncertain" is sort of ill defined at the moment, right? One way to think about it, in the context of knowing about the same data, is that the fewer possible worlds there are, the more certainty there is. But that is in the context of the same data. If you get information about a hundred more tuples, counting just doesn't do it, okay? So let's now revisit the extraction example. What we had was one extractor which gives two dates, the 28th and the 30th, and from that we created this uncertain database with all the assumptions I spoke about. Suppose you have another extractor which gives two other dates, the 27th and the 28th. This also has an associated uncertain database. So what we'd like to do is combine these two uncertain databases to achieve this result, which at this point is built on assumptions. Because if you reason it out yourself, we now know three dates. We have the assumption that all of these dates are associated with games. We also know that these games are on distinct dates and game one is before game two, which is before game three. So you can reason your way to saying, okay, this should be my result. And the way you get this from our uncertain databases is that two of the possible worlds agree with each other on game two's date, and the other possible worlds are contradicted by information in the other database. Okay? So we essentially want to formalize the intuitive reasoning you did to create this certain result. We'd like to formalize it in terms of data integration over possible worlds. Okay? So let's move on. The setting we'll be thinking about for our data integration problem is the local-as-view setting. What the local-as-view setting means is you have a collection of sources and a mediated schema, which is the unified query interface, and the way the mappings are defined between these sources and the schema is what makes it local-as-view. You can think of each of the sources as being a view over the actual database. So the mediated database is the actual database. Each of these sources is defined by a query, Qi, which is a view over the mediated database, and the constraint is that the mediated database, the actual database, should contain at least all of the information in the view. Hence, each of the sources is a view over this mediated schema.
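To pin down the notation, the local-as-view constraint just described and the certain-answer semantics discussed on the next slides can be written as follows. This is my transcription of what is said, not a formula from the talk, using S_i for the sources, Q_i for the mapping views, and M for the mediated database:

```latex
% Local-as-view: every source is contained in its view of the mediated database M.
S_i \subseteq Q_i(M) \quad \text{for every source } S_i
% Certain answers: what every valid mediated database agrees on for a query Q.
\mathrm{Cert}(Q) \;=\; \bigcap_{M \ \text{valid}} Q(M)
```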
What this implies is that M has to be such that it has enough data that when you query [inaudible] it, you recover your views and maybe more. Okay? So let's see how this works in the certain data case. Again, we have the blue and the yellow databases, and you combine them. These [inaudible] are satisfied by this M, called M1, which is essentially the union as I said before, plus one extra red tuple which is not in the sources. This satisfies the local-as-view mappings, so it is a valid mediated database. Okay? Similarly, there could be others which have some other red tuples. Okay? But among them there is this one, which is exactly the union, which is in some sense the least informative of all valid mediated databases -- it does not have any information which is not implied by one of our sources. Okay? And that's essentially the definition. So I've shown it in terms of the union, but you can apply the same argument over queries on the mediated schema. So essentially the set of [inaudible] as defined in data integration is the least informative answer from a valid mediated database. Yes? >>: So what if I added a possible world in which [inaudible] fourth [inaudible] so that, backed up to the previous slide, you would [inaudible] and now there really isn't going to be a certain answer that I believe is going to be covered by all of them, then what -- does that mean data integration is dead? >> Parag Agrawal: No. So actually, if you had uncertainty of the form that there are four dates and only three real games -- conceptually, the answer at that point, given our stated knowledge, is an uncertain database. And that is the need for doing uncertain data integration. So, for instance, to go to our example -- let me try to rewind that a little bit. Sorry. >>: [inaudible]. >> Parag Agrawal: Okay. Was it clear? I can explain it. Sorry about this. Yes. So suppose these two extractors had reported, let's say, completely different dates -- say the 29th and the 31st. Okay? But we've currently used the reasoning that all of these dates are correct. So we haven't created a possible world saying that the 28th may not be the date for any of the games, in which case having these two sources would in some sense become inconsistent, which will come up later in the talk. But if we had allowed the possibility that maybe these dates don't correspond to any games -- in which case we'd have an empty database here, and databases with just a single tuple saying game one is on the 28th, or game two is on the 28th, or game three is on the 28th -- and we'd created all of those possible worlds, then when you combine them you'll essentially get the intuitive answer that you'd like, which is that there are these four dates and, in some permutation, three of them correspond to games one, two, and three. That's what would happen as a result of what we do. That's a very good question. Thanks. So moving on, in the certain case we said we have this notion of certain answers, which is the least informative of all valid mediated database answers. Yes? >>: [inaudible] is it the same subset of [inaudible] is it Qi of Si a subset of Qi [inaudible]. >> Parag Agrawal: So it is: Si is a subset of Qi of M. So think of it this way. These are constraints on the mediated schema.
It's local-as-view -- what you're thinking of is global-as-view, where you tell us how the global mediated database is built from the sources. Here what we are saying is that we're putting constraints on the mediated database such that it has to have enough data to satisfy these mappings. Okay? >>: [inaudible]. >> Parag Agrawal: And this example essentially tries to show that: anything which has all the blue tuples is allowed. The Qi in this example is the identity. Okay? So this is what the certain database -- yes? >>: [inaudible]. >> Parag Agrawal: Yes. This is the query [inaudible] mediated schema, and we'd like to define what the answer for that is. Okay? Now, conceptually, according to these mappings, any M is valid as long as it contains essentially all your sources, if you think of identity mappings. Okay. So all of M1, M2, and M3 are valid mediated databases. Okay? And now the semantics of the query are the answers which are in all of these. So in the certain case this boils down to saying that the least informative answer over all valid mediated databases is, in concept, the certain answer. Okay? So that's what the notion of certain answers is. And now we are going to try to find the corresponding notion in the uncertain case: define what the result is, what these mappings mean. That's where we're going. Okay? So the key here that I'd like to point out is the definition of containment. We've used containment in defining what these mappings mean, and we've used containment in defining what least informative means. Conceptually, in the certain case we know what the definition of containment is: if all the [inaudible] from database one are contained in the other, it's contained in the other, right? And that same intuitive notion is what we've used in defining what these mappings mean -- what's a more informative mediated database than all your sources -- and which is the least informative among all valid mediated databases. So what we'll essentially try to do is define containment for uncertain databases which captures the notion of one being more informative than another, and then use it to get the corresponding definition of what these mappings mean and the corresponding notion of certain answer. Okay? So the key is the definition of containment. Let me now formalize what I mean by an uncertain database so that I can move on to defining containment for these. An uncertain database consists of two parts. One is the tuple set. Intuitively, the tuple set is the collection of tuples that this database is aware of. A database is under the open world assumption, so it does not know about all the tuples in the world; it knows about some slice of the world. Okay? So that's the tuple set, shown here in light blue. It knows about, say, object 2, rooms A and B, and its possibility of being there. And there's a second part, which is the set of possible worlds. Within these tuples, it knows that here are the possible worlds for these tuples. So each possible world is a subset of the tuple set. In this example, W1 and W2 essentially say that in one possible world the object is in room A, and in the other the object is in room B. Here the information it is giving us is that object 2 is in at least one of these two rooms, because if we did not have that information, it would have added the empty database as a possible world.
It is also telling us that object 2 is not simultaneously in both room A and room B, because if that were the case, it would have added another possible world which has both of these tuples. Okay? So its information comes from not enumerating some possible worlds. Now think of a [inaudible] case where the possible world set is just the power set of the tuple set -- all combinations. In that case this database is really giving us no information. It has no answer for every bit of information you can ask it: I don't know whether this happens, I don't know whether these two happen together, I just don't know anything. So this is the no-information case. Second, again to stress that absence of -- okay. So suppose now the tuple set contained a third tuple, (3, P): is object 3 in room P? Suppose I keep the possible worlds the same while the tuple set has this extra tuple. Now what it's saying is that (3, P) does not occur in any possible world. So the absence of a tuple from the possible worlds is also information, because here it's saying that object 3 cannot be in room P. It has somehow exhaustively looked at room P and determined that object 3 is not in there. So absence of a tuple in a possible world is also information. Okay? With that intuition, let's move forward. Now let's try to define containment in one degenerate case: what if the uncertain database had only one possible world? This sounds very much like a certain database, but with the twist that absence of information is also information -- absence of a tuple is also information. So think of the example. Let's say we have an uncertain database U1 and its tuple set is this light set of tuples. Since there's only one possible world I can show it using this representation: the dark tuples are the ones present in W1, while the tuple set is the entire thing. So in some sense it is saying that the light tuples here are not present in the database. Okay? That's the information here. Suppose we similarly have a second database, the blue one. Yes? >>: [inaudible]. >> Parag Agrawal: So [inaudible] you can think of it this way. >>: [inaudible] some measure of -- >> Parag Agrawal: [inaudible] so typically what will happen is the extractor will enumerate the set of possible worlds for you, and hence any tuple that occurs in any of these possible worlds is definitely in the tuple set -- but there may be more. Think of it this way. In the base data, in the original relations, we may not encounter scenarios where an extractor explicitly gives you negative information, which is of the form that an event just did not happen. But when we start combining these uncertain databases and we get a resulting uncertain database, we were able to eliminate the possibility that object 2 is in room A. We'd now like to capture that, because by combining information we can arrive at negative information even if we started with no negative information. Okay? So to keep our model closed, in terms of incorporating results of combinations of uncertain databases and keeping them within the world of uncertain databases, we need to be able to represent such negative information in our definition of an uncertain database. Okay? But you can also imagine that these sources do tell you about negative results. Think of it this way. Suppose I created a source like [inaudible]: I have a sensor which completely records information for two rooms, okay, and I know it did not detect the object there. Okay?
So it can now assert that object 2 is not in room A and it's not in room B. Okay? This can be information that the uncertain database corresponding to this sensor reports. Okay? So now, to go back to the slide: in the world of single possible world databases, the light tuples are ones this database is asserting are not present in the database, and the dark ones are ones it is asserting are present. We are trying to figure out whether U1 is contained in U2. The definition here is kind of trivial, right: you want to make sure that for everything U1 knows about, which is definitely contained, U2 must know about it too. So in terms of presence of tuples, U2 should have more information than U1, and similarly in terms of absence of tuples, U2 should have more information than U1. Okay? If these things are satisfied, U1 is said to be contained in U2. So this is just a minor extension of the containment definition for regular databases, incorporating this absence-of-tuple information. Okay? Now let's try to generalize this to two sets of possible worlds. Okay? So again we have these two uncertain databases, the red and the blue, and we are trying to figure out whether the blue uncertain database has more information than the red uncertain database. Intuitively, fewer possible worlds means more information. Okay? So to do this, what we want to say is: for each possible world in U2, does there exist a possible world in U1 such that the single possible world containment definition we had applies? Okay? So basically what this is saying is that we don't need every red possible world to be represented in the blue one, because the blue one is allowed to be more certain -- it is allowed to eliminate possibilities from the red database. Right? But everything the blue one asserts should be allowed by the red database; it is not allowed to contradict it. So for every possible world in the blue database, we'd like there to be a corresponding one in the red database such that there's this containment. That says the blue one is more informative in that it is able to eliminate some of the possible worlds of the red database, but each world it keeps contains all the information of the corresponding red world it is allowed by, and possibly more. Okay? So that's the definition of containment for uncertain databases, explained intuitively. For people familiar with power domains, this is a Smyth lifting of the single possible world containment definition. Okay? So the key changes we've made in going from regular databases to uncertain databases: first, we said that absence of tuples is information in the context of uncertain databases, and that intuition was used to create a definition for single possible world databases. Then we said that fewer possible worlds means more information, and hence we used the Smyth lifting to define containment for uncertain databases. Okay? Now let's see how this definition takes us to data integration. Again, we have the same setup, the [inaudible] setup. We have [inaudible] mappings from sources to the mediated schema. For this example we now have two sources, each of which is uncertain, the blue and the yellow.
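As a small illustration of this definition (my own encoding, not the paper's notation): represent an uncertain database as a pair of a tuple set and a list of possible worlds, and check containment with the single-world test lifted Smyth-style.

```python
# A sketch of the containment definition just described. An uncertain database
# is a pair (tuple_set, worlds), where every world is a subset of tuple_set.
# Absent tuples (tuple_set minus a world) carry negative information.

def world_contained(t1, w1, t2, w2):
    """Single-possible-world containment: U2 knows at least as much as U1
    about both presence (w1) and absence (t1 - w1) of tuples."""
    return w1 <= w2 and (t1 - w1) <= (t2 - w2)

def contained(u1, u2):
    """Smyth lifting: for every world of the more informative database U2
    there must be some world of U1 that it extends."""
    t1, worlds1 = u1
    t2, worlds2 = u2
    return all(any(world_contained(t1, w1, t2, w2) for w1 in worlds1)
               for w2 in worlds2)

# Sensor example: U1 says object 2 is in room A or room B;
# U2 has resolved the uncertainty to room B.
t = frozenset({(2, "A"), (2, "B")})
u1 = (t, [frozenset({(2, "A")}), frozenset({(2, "B")})])
u2 = (t, [frozenset({(2, "B")})])
print(contained(u1, u2))   # True: U2 is more informative
print(contained(u2, u1))   # False
```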
A possible mediated database is formed by combining two possible worlds and adding some tuples as before, the red ones. Similarly, there could be another such database, which is [inaudible]. So these arrows correspond to containment: this M2, which is an uncertain database -- again a single possible world uncertain database -- contains the entire source as one, and it also contains the corresponding possible world for S1. Now, you could have many such ones, but you could also have uncertain databases which contain both of these; hence the query result can be uncertain as a result of data integration, as we observed before. Again, there will be a least informative valid mediated database, and we show in the paper a result which says that there is a unique such answer to every query. Given a definition of containment which is a partial order, this may not always be unique; for the definition I have presented, there is a unique minimum, which is what we'll call the strongest correct answer. So notice again that we use the definition of containment in two places: one, we have used it to define what valid mediated databases are, and two, we've used it to define which one of these is the least informative one. And we can show that the least informative one is unique. So that is the query result you would like, and the result [inaudible]. Okay? >>: The least informative one? >> Parag Agrawal: Yes. So think of it this way. The valid mediated databases are the set of databases constrained by the mappings to be more informative than each of your sources. There are a lot of such databases. But some of these databases will have information that is not implied by any of your sources -- these red tuples -- which is just junk in some sense: none of your sources tell you this information, but the mappings still permit it as a valid mediated database, okay? So you have this large set, each of which is more informative than all of your sources. Among these you want the least informative one as your answer, just like the intuition for certain answers. And that's what we call the strongest correct answer in this case. It's the notion corresponding to certain answers for uncertain data integration. >>: Smallest database [inaudible] necessarily [inaudible] you want the smallest database [inaudible]. >> Parag Agrawal: Yes, the smallest database, yes. >>: So [inaudible]. So when you answer, you don't require a single instance of the mediated database, right? [inaudible] constraints. For the given query you want the intersection of the result over all of these, right? >>: Yes. Yes. >>: So you have now [inaudible] saying that there are multiple possible answer databases. And what does it mean to say you [inaudible]. >>: The intersection. >>: [inaudible] intersection? >>: Yes, in some sense. >>: The query result on any one of these databases is an uncertain database? >>: Yes. >>: And the query result on -- and so on. So you have a bunch of uncertain databases as a query [inaudible] what [inaudible]. >> Parag Agrawal: Okay. So think of the intersection operation in the certain world as the operation where you take containment defined according to just set containment of tuples. Okay? The intersection gives you the unique largest set that is contained in all of your answers. That is the intersection operation, right? So the corresponding operation here is that we have a containment definition for uncertain databases.
You have a bunch of answers which are all uncertain databases. The database contained in each of these is the intersection. So you've taken the intersection operation and defined it according to a partial -- >>: [inaudible] there exists a unique -- >> Parag Agrawal: Yes, it's well defined. Exactly. >>: [inaudible] the negative tuples. >> Parag Agrawal: What? >>: Compared to the other case, the main difference is you're accounting for the negative tuples; is that correct? >> Parag Agrawal: No, there are two differences. When you degenerate this to single possible world databases, the difference is the negative information. But the Smyth lifting is the second part. So there are two ingredients: one, knowing that uncertain databases can have information about tuples not existing; two, that fewer possible worlds is more information. And these two combine to form this containment definition, which can form the core of the entire semantics for uncertain data integration. Okay? So -- >>: So -- >> Parag Agrawal: Yes? >>: You said there's a unique database. >> Parag Agrawal: There's a unique uncertain database. >>: There's a unique uncertain database, yes. Which uncertain database -- is that a single world or is that a distribution over possible worlds? >> Parag Agrawal: It is a collection of possible worlds. So in this case -- let's say the mappings are identity and the query was identity for this example -- M4 would be the unique answer we would present, which is an uncertain database. But -- >>: [inaudible]. >> Parag Agrawal: One question which I [inaudible] to earlier in response to a question was that you can get inconsistency. How does that happen? Think of two uncertain databases, the red one and the blue one, and A and B are just tuples. The red database asserts that A and B are mutually exclusive: they do not occur together in any world, because it presumably knows about both of them and does not allow the AB possible world. Similarly, the blue database asserts that A and B always occur together: if A occurs, B has to occur, and if B occurs, A has to occur. These two pieces of information are conceptually inconsistent. There can be no possible world where both of these things are true, and hence no uncertain database, according to our definition, can contain both of these uncertain databases. Hence, this set of sources is inconsistent. So the definition of consistency, which is fairly natural, is: does there exist a mediated database which satisfies all your sources and all your mappings? Okay? That's the definition of consistency, and this is an example where the sources may not be consistent. >>: So these are [inaudible]. >> Parag Agrawal: Yes. [inaudible] sources. >>: And so why couldn't I have a distribution over the possible worlds, so -- what is it, so every possible world in the mediated schema has to contain both of them, is that the problem? >> Parag Agrawal: No, the problem would be -- think of an uncertain database. If it has to contain information from the red database, then for the first possible world, which knows about A and B, it has to have a world where, according to the single possible world definition, A exists and B does not exist. So for all of its worlds it would have to have this information that B does not exist when A exists. Okay? >>: I see.
So every possible world has to be consistent -- >> Parag Agrawal: [inaudible] yes. >>: And there's no possible world that can be -- >> Parag Agrawal: Exactly. >>: Consistent with these two sources? >> Parag Agrawal: Yes. Thanks for that question. So now this naturally leads us to the first question, which is: how do we check whether a collection of sources is consistent? So the problem, completely defined, is: you're given a collection of sources, you're given their mediated schema, and you're given a bunch of mappings. The first question is, without even having a query over the mediated schema, are these sources consistent with each other? It's easy to see that this problem is NP-hard. Another observation you can make is that it's PTIME for a constant number of sources, so the hardness comes from a large number of sources. Okay? But a lot of the applications we talked about involve extraction, and we said that each extractor over each webpage could be an uncertain database. So we'd like to be able to handle a large number of sources, and this hardness result is in some sense bad for us -- the hardness is not in the size of each uncertain database but in the number of uncertain databases. Okay. We also have another PTIME result that applies whenever these sources induce acyclic hypergraphs. I'm not going into how hypergraphs are induced from sources, but I'm happy to explain that when we have more time. In that case we can show consistency checking is in PTIME. Okay. So these are just a collection of theory results to set up the problem. Yes? >>: [inaudible] just wondering why [inaudible] completely extract [inaudible] produce the -- >> Parag Agrawal: Mediated databases. >>: [inaudible]. >> Parag Agrawal: So I think -- okay. So there are two answers to that. One, local-as-view in itself may not be very interesting, but the containment definition is not restricted in its application to local-as-view. The containment definition guides us in doing global-as-view as well; we just haven't [inaudible] there. Okay? Second, local-as-view is interesting when you don't want to materialize the mediated database. And the reason that's interesting, more so in the context of uncertain databases, is that you have a collection of sources, each of which has its own representation. So [inaudible] for uncertain databases, which are sets of possible worlds, this union operation of just putting things together may not be computationally feasible or efficient. It might be computationally helpful not to create this mediated database and to defer the uncertainty reasoning, which is the hard part, to the query answering part, doing it on demand only when required. >>: [inaudible] as far as I understand [inaudible] definition [inaudible]. >> Parag Agrawal: So, yeah, [inaudible] for sure. So monotonic queries, basically. What we're talking about for mappings and for answering queries are monotonic queries. >>: Mappings I can -- yeah, I can assume that that's [inaudible]. >> Parag Agrawal: Right, but these sorts of definitions break when you don't have monotonic queries, because certain answers then become trickier to define. The definitions will break. So that's correct. If there are no more questions, I'll move on. So there's a paper with some interesting theory results that people can look at if they're interested. With this, I'm going to conclude this part of the talk and move on to the second part -- but first I'll just tell you what else is interesting here.
This slide essentially shows how much we understand: the darker things are ones we understand better, the lighter things we don't. So in the talk I always kept identity mappings and very simple things, but we can do multiple tables and monotonic queries; we understand how to do those. We know how to do some things with efficient representations and how to do efficient query answering for these restricted cases. When we start adding probabilities, we have some ideas, and we are writing those up in a new paper. The idea, at a high level, is essentially to think of each uncertain database as a belief function from evidence theory. Belief functions are a generalization of probability functions which let you express ignorance. I'm happy to talk about this stuff later, but it's very important here because each of our uncertain databases is ignorant in many ways: it does not know about certain tuples, it does not have complete information. That's sort of why we went to belief functions in the context of uncertain data integration. And using that, we can do interesting things like represent non-reliability of sources, which can be part of ignorance. Okay? We can capture the fact that consistency of sources is no longer absolute: [inaudible] it's absolute, zero or one, without probabilities, while with probabilities only some of the probability mass is consistent, so you get this partial inconsistency. Then you can also think about the fact that when you have probabilities, these sources may be dependent or independent in the way these probabilities arise, and you need to reason differently. This is something we understand less. We have not gone beyond local-as-view. We have not even thought of an implementation, because this is foundational work to figure out how the uncertain data integration problem can be done. What you would want eventually is a specific application of it rather than a [inaudible]. So I just want to mention quickly that a lot of related work has happened in the uncertain databases world and in the world of data integration which we leveraged, including some work about how uncertainty is introduced during data integration and what you do with it. One interesting insight was that uncertain data integration is useful even if you have certain data and uncertain mappings; our definitions still apply and give the correct answers. That is an observation made in another paper which is just talking about uncertainty in data integration. There's been work on data exchange in the uncertain space. We are now leveraging evidence theory quite deeply, and there's a lot more work that has contributed to what we did. So we've finished one part of the talk, which is the theory part. Yes? >>: Are there extractors that will give you probabilities in their answers, or -- >> Parag Agrawal: They give you scores. You get scores everywhere; you cannot get probabilities. The only place we know you can get probabilities is, for instance, if you are using some sort of a forecaster -- say we have a parameterized model forecasting sales in some region, which says what our sales will be. People use probabilistic models to do this forecasting. So for real-world events the data you get are usually scores.
You can sort of map them into probabilities, or find probabilities any way you like, but for forecasting it's easier to think of the data you're getting as probabilities. >>: So is there value in interpreting scores as pseudo-probabilities -- probabilities derived from scores -- in order to do the kind of inferencing you're doing on uncertain data, and is there reason to believe that will be a useful thing to do? >> Parag Agrawal: A nice question for a philosophical argument, but let me just give you a very short answer on it. If you have exactly one number associated with each event, Cox's Theorem says that treating that as a probability is the only correct way of doing it if you want certain nice properties. This is a very old theorem, Cox's Theorem, which is the justification of all of probability theory: if you are allowed only one number as a score to represent your information about these events and you are to do reasoning on that, probability theory is the only way which has all these properties which are natural in [inaudible]. Okay? So that's sort of the argument for why, whatever you have, if you are going to do anything with it, you should somehow bring it into the world of probabilities and then deal with it. How to do that is the hard part, and no one has really answered it. Okay? So I'm really running low on time, so I'll quickly move on to the second part, which hopefully you'll find interesting as well. I'm going to rush through this part a little bit, because I'm low on time, so please interrupt me when I'm too fast. >>: [inaudible]. >> Christian Konig: So don't worry. We can go up to 12. >> Parag Agrawal: Okay. Great. So fuzzy lookups is the topic of the second part. I skipped through this slide earlier; the challenge here will be dealing with efficiency in the presence of uncertainty. So think of this part as the bottom part of that earlier picture: we'll define the semantics -- the top part -- in terms of possible worlds a priori, and then see how we can represent and efficiently execute those semantics in the lower, compact-representation part. So fuzzy lookups is the topic of this part of the talk, and our motivating applications for fuzzy lookups are record matching and local search. In general, fuzzy lookups are based on a similarity function. You have a large reference relation and you are trying to look up tuples which are similar to a target query. The reason they're fuzzy is that we want to permit error tolerance; the query may not be an exact equality match, which you could do using a simple hash table. Various similarity functions are used to this end -- commonly edit distance, Jaccard similarity, Winkler distance, and so on. The motivating applications we had for this were in the context of record matching, so in a data cleaning platform you might want to do data cleaning and entity resolution. As a first cut, to find candidates for things which might be the same record, you want to do an efficient lookup rather than doing pairwise comparisons across [inaudible]. Another motivating application is local search. In both of these applications we are thinking about conceptually small records: we are not searching documents, we are searching records. Okay? And this will become important later. Another key point is that there are various similarity functions which are useful in various contexts. But if you think of set-based similarity lookups, like say Jaccard similarity,
they can form a primitive for executing multiple such similarity functions. And that's why, in a data cleaning platform or as a generic primitive, such similarity functions are more interesting. This was something identified in previous work. So we are going to use a set-based similarity method called Jaccard containment, and let me try to define what it is. There's a bunch of you here who know this part, so please be patient. Set containment is what regular keyword search can be thought of as: you have a reference relation -- think of the records in the reference relation -- and you have a query. A record is in the result if the entire query is contained in it; so if all the words in the query are in the record, that's a set containment lookup. In this case, we have another word, pasta. Suppose we also have weights. Now Jaccard containment is defined as: what fraction of the query is covered by the record? Okay? In this case the word pasta is not in the record Olive Garden. These weights are not probabilities, although they sum to 1 for convenience; the weights sum to 1 and weight .8 of the query is in the record, so a .8 fraction of the query is covered by this record. Hence the Jaccard containment is .8. Okay? This is an asymmetric version of Jaccard similarity, because we are normalizing only by the size of the query, not by the size of the union. Okay? So we are not penalizing large records in the reference relation in doing our similarity. The fuzzy lookup problem is then to efficiently find all records whose Jaccard containment value is above the threshold. So that's our lookup problem. Okay? Now, let me introduce some uncertainty into it. We know that users sometimes abbreviate Square as SQ and New York as NY, and deviate in countless other ways. So what we conceptually have, let's say, is a collection of transformations like this, which say that when a user says SQ, he might mean Square. One way to think about it is that the name is uncertain, and we'd like to make all of these names -- which are just alternative representations of this name, given the kinds of representations users use -- searchable. So we'd like these four possible worlds, in some sense, to be searchable. And now when we define Jaccard containment, we'd like to define it as the maximum Jaccard containment over all possible worlds. So the Jaccard containment for this venue is the maximum value over all possible worlds -- the max match. Okay? So this is the semantics we're attacking: you would like all records whose max Jaccard containment over all possible worlds exceeds the threshold. Okay? The challenge is efficiently processing this without actually creating this large number of possible worlds, by using the compact representation. Another way to think about this is that the database is certain and the query is uncertain. When the user says Madison SQ NY, he means any one of these four queries: when he says SQ, he could mean SQ or he could mean Square; when he says NY, he could mean NY or he could mean New York. So now the relation is certain but our query is uncertain. Okay? And conceptually, taking the union of all these results is the same as the semantics I defined before. So in general, this notion of transformations is extremely powerful.
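A small sketch of the semantics just described: weighted Jaccard containment maximized over the transformed versions of the query. The weights, transformation table, and token handling here are all made up for illustration; the actual system works over a compact representation rather than expanding the query like this.

```python
# A sketch (hypothetical weights and transformations) of max Jaccard
# containment over all transformed possible worlds of the query.
from itertools import product

WEIGHTS = {"madison": 0.3, "sq": 0.5, "square": 0.5, "ny": 0.2, "new": 0.1, "york": 0.1}
TRANSFORMS = {"sq": ["square"], "ny": ["new york"]}   # token -> possible replacements

def jaccard_containment(query_tokens, record_tokens):
    """Fraction of the query's weight covered by the record."""
    total = sum(WEIGHTS.get(t, 0.1) for t in query_tokens)
    covered = sum(WEIGHTS.get(t, 0.1) for t in query_tokens if t in record_tokens)
    return covered / total if total else 0.0

def transformed_queries(query_tokens):
    """All possible worlds of the query under the transformations (cross product)."""
    options = [[t] + TRANSFORMS.get(t, []) for t in query_tokens]
    for choice in product(*options):
        yield [tok for piece in choice for tok in piece.split()]

def max_containment(query_tokens, record_tokens):
    return max(jaccard_containment(q, record_tokens) for q in transformed_queries(query_tokens))

record = {"madison", "square", "garden", "new", "york"}
print(max_containment(["madison", "sq", "ny"], record))   # matches via SQ->Square, NY->New York
```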
The reason is that it can capture similarity which is not captured by any textual method, for instance -- or it can capture similarities which may not be amenable to the set containment world or the set similarity world. Right? So this essentially lets us program similarity functions by just creating, conceptually, a relation which keeps information about synonyms. Okay? So what we have is two kinds of error tolerance in this problem. Yes? >>: [inaudible] synonyms, then couldn't uncertainty also be in the reference relation? >> Parag Agrawal: Yes. So conceptually -- okay. So we don't treat these as synonyms; these are one-way things. You can conceptually have transformations applied to both the query side and the reference relation side. So, yeah, that is a valid question. In this talk we'll keep things simple and do it only on one side. So now -- >>: [inaudible]. >> Parag Agrawal: Yes? >>: So let's say in the simple case where there are no transformations, just Jaccard containment over, let's say, a large dataset -- is the problem of efficiently running this Jaccard containment lookup well solved, or -- >> Parag Agrawal: So for that part of the problem there is a solution called the prefix filter. Our solution, as you will see, will be more efficient than that solution. So, yeah, even Jaccard containment without transformations is a hard problem, and our solution will essentially improve on that problem while also solving the transformations case. That's a very nice question. The experiments should show that; if not, please ask me again. So now let's do Jaccard containment without transformations for a minute and see one way we might be able to solve it. We have a query, Madison SQ NY, and we'd like to do a Jaccard containment lookup with a threshold of .6. What this says is that at least a .6 fraction of this query should be covered by any record that is in the answer. Hence, one thing we can immediately see is that if we issue these two red queries, Madison SQ and SQ NY -- both of which have weights greater than the threshold -- and take the union of their results, that is the correct result. The reason is that for anything to cover more than .6 of the query, it would have to cover at least one of these two. Is that clear? Because a record covering only Madison and NY would cover only .5, so we don't want to issue that query. On the other hand, if we issue the entire query, Madison SQ NY, its answer is a subset of the answer of either of these queries, so we'll still get all of those answers. So we'll get all answers with Jaccard containment of .8, .7, or 1 -- those are the only three values above the threshold that we can have here. Okay? So essentially, to answer a Jaccard containment query without transformations, we can find a collection of set containment queries which, on unioning their results, gives us the Jaccard containment answer. Okay? And this is how it's done. In the presence of transformations this generalizes nicely. Conceptually, you apply the transformations to get the transformed queries, the green ones, and for each of those you again do the same process of finding set containment queries that answer it.
Then you find their subsets as I described earlier, take all of these red queries, issue them, and take their union -- and that is the exact result you'd like for Jaccard containment in the presence of transformations. So that's conceptually one naive way of doing things; of course, you can see that there's a large explosion here. The red queries are what we'll call variants in the rest of the talk. These are variants of the query: set containment queries whose union is the exact answer for Jaccard containment with transformations. Okay? So this gives us a naive solution outline: you're given a query; generate all of its variants, which are set containment queries; for set containment queries we know inverted indexes work well, so use an inverted index to answer each of these queries, get the results, take the union -- the union essentially gives you IDs -- fetch those from your reference relation, and get the result. Okay? So this is a system that can easily be built using this naive solution. What are the problems with this naive solution? Why is it naive? One, too many variants. The number of transformed queries can be exponentially large because it's essentially a cross-product. Secondly, each set containment lookup itself can be very expensive, and we are having to do a lot of them to answer one request. If we had to do only one set containment lookup, it would be efficient enough, but when you have to do a large number of them, we want to optimize that part as well. One reason why it can be very expensive: think of a query which is Madison NY. Okay? Both of these are cities, so their lists are fairly large -- there are a lot of local places in Madison, there are a lot of places in NY. The intersection might be really small, things like Madison Square Garden, for instance. So doing an intersection of two long [inaudible] is too expensive for getting a very small result. So in this work we'll attack both of these angles. First, let's see the intuition for why we might not have to issue this large number of variants to solve the problem -- the intuition for issuing fewer queries. So again, the yellow is the original query, the green one is the transformed query, and the red ones are the variants. Conceptually we'd like to issue all variants to answer our query. Now think of a blue query, which is Madison. This arrow denotes that this query's result contains all the results of these two red queries, because if you have fewer keywords you get more results -- a superset of the results. So by issuing the blue query we ensure that we get all the results of those two red queries. Similarly, by issuing this other blue query, NY, we get those two. So at this point we can notice that by issuing these two queries we'll get a superset of the results of all the variants. This is what we call a variant covering. Notice that fewer queries are issued by doing the covering, but you can get false positives: there will be results which you get by issuing these two queries that are not results of any of those four variants. Okay? So by issuing fewer queries we might be able to get more results -- which is sort of a trivial statement, because by issuing the empty query you get the entire database, so the empty query is always a variant covering. Okay? So we'd essentially like to find a nice variant covering, and there are many of them.
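Before moving on to coverings, here is a tiny sketch (my own code, with hypothetical weights, not the paper's algorithm) of the variant generation described above: for a single transformed query, the variants are the minimal subsets whose total weight meets the threshold, and unioning their set containment results gives the exact Jaccard containment answer for that transformed query.

```python
# A sketch of generating variants: the minimal subsets of the (transformed)
# query whose weight reaches the threshold. Weights here are made up.
from itertools import combinations

def minimal_variants(tokens, weights, threshold):
    """Smallest subsets of `tokens` whose total weight is >= threshold."""
    variants = []
    for k in range(1, len(tokens) + 1):
        for subset in combinations(tokens, k):
            if sum(weights[t] for t in subset) >= threshold:
                # keep only if no already-found (smaller) variant is contained in it
                if not any(set(v) <= set(subset) for v in variants):
                    variants.append(subset)
    return variants

weights = {"madison": 0.3, "sq": 0.5, "ny": 0.2}     # hypothetical weights
print(minimal_variants(["madison", "sq", "ny"], weights, 0.6))
# -> [('madison', 'sq'), ('sq', 'ny')]
```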
There's another variant covering using Square or SQ, a variant covering with three queries here. So essentially the problem is finding a good covering. What this indicates is a solution outline, and how we might be able to change and improve the naive one. One, as I hinted earlier, we're going to improve on the inverted index to make it more efficient for doing set containment queries by introducing something called the minimal infrequent index, which is a more efficient version. The second part will be an algorithm called CoverGen, which generates good variant coverings efficiently; the intuition is that we want to issue fewer queries, because we can't issue an exponentially large number of queries. And the third part is that now we'll get a superset of the results, so we have to do an additional step of verifying which are the correct results. Because we are using a variant covering, we need this final phase of throwing away results which actually don't [inaudible]. For this third part, there's an algorithm in our paper which uses maximum matching, which I'll not discuss at all in this talk. That algorithm essentially decides, given a record and a query, whether the Jaccard containment is greater than the threshold, and does so efficiently. So first let's talk about the minimal infrequent index. One intuition in the minimal infrequent index is that we are not only going to create lists for individual keywords, we are going to create lists for sets of keywords as well, which can be thought of as queries. So instead of just having four lists for each of these four keywords, you could have a list which materializes the result for the query Madison NY, which is a set. Okay? But we are not going to generate all of them; we are going to be smart about that. So we'll have a parameter A, a frequency threshold, which says that any list longer than A is too frequent and we will not index it. So we'll not index lists which are longer than this parameter A; we'll only index lists smaller than A. In addition, we'll only index lists which are minimal among these infrequent lists. So, for instance, we'll not index the list for Madison Square NY, because there is already an indexed list for Square NY, which is small enough and contains all the results for Madison Square NY. Okay? So we'll index minimal infrequent sets, or minimal infrequent queries, for a threshold A. Okay? Notice that we only index three of these lists in this index, for example. Now, how is this index useful? Think first of only queries which have fewer than A results. So think of a query such that we know it has fewer than A results in its output. To answer this query, by definition there is a minimal infrequent set that we have indexed which is contained in the query and whose list contains all the query's results. We can find it, fetch all of those records -- which are fewer than A -- verify whether or not they actually satisfy the query, and create our answer. And this example shows that: Madison Square NY was our query and Madison Square is a list we have. We're able to answer this query by first quickly finding which list this is, and secondly just scanning this list and verifying each record in it. Now, when we have queries with more than A results, you can think of an exponential set of parameters A, 2A, 4A, ... which you can all [inaudible] in the same index.
And now a query which conceptually has between A and 2A results will be answered by the 2A index, a query with fewer than A results will be answered by the A index, and so on. Okay? What this says is that if a query's number of results is greater than A -- or rather, greater than A over 2 -- we are going to answer it in time at most two times the number of results. But if it has, say, zero results, we still need time A to make sure it has zero results, or fewer than A results. So we have an output-sensitive guarantee for doing set containment lookups. Okay? This is the index we will use for exact containment lookups in our solution. The index will be useful in other ways as well, because it has metadata about the frequency of various queries. Since we've materialized queries beyond single keywords, we have more knowledge about how big their results are, which we'll use in the CoverGen part of the paper. But this index itself, with an output-sensitive guarantee for set containment, is of independent interest, because it provides essentially a tradeoff between space and efficiency: if you want better query time, obviously you index more things. So we've seen the minimal infrequent index for efficient set containment. One of the -- yes? >>: [inaudible] index, so do you make any assumptions on the way the data is distributed for this index to be applicable? >> Parag Agrawal: The index [inaudible] the size of the index can really blow up -- >>: Realistically [inaudible] to me, realistically [inaudible]. >> Parag Agrawal: So -- okay. If you have small records, the index is usually small. If you have large records, the index can get large as defined. But another thing to note is that I used a definition of the index which is frequency based. You can use any monotonic function instead of just the frequency function, and that can be used to control the size of your index. For instance, if you have long records, one thing we talk about in the paper is requiring that two things be close to each other in some distance for them to be co-located in the index, in addition to the frequency threshold. Or you can say things like: the biggest set we ever materialize has three keywords, and we'll not do five-keyword sets. So for various settings you can choose different monotonic functions to control the size of the index. We also did some investigation into how big the index would be for general data, and there is evidence in the frequent itemsets literature that the index is usually small, because not all combinations occur frequently. Okay? It's the same intuition as why the a priori algorithm is usually efficient. It also generates all of these sets; in fact, it generates way more than the minimal infrequent sets. But -- yes? >>: Can you still [inaudible] index of single words too? >> Parag Agrawal: We don't need it, but our index is essentially a superset of that. Think of it this way. Since we have an exponentially spaced set of parameters, think of any individual keyword and its list. Its frequency has to lie between two of these parameters, so it will be minimal infrequent for one of them. Hence, that list will be materialized for one of these. So our index contains all the lists that are in the inverted index, plus more. Okay?
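A hedged sketch of how the tiered parameters might be used at lookup time: build minimal infrequent indexes for thresholds A, 2A, 4A, ..., and answer a containment query from the smallest tier that has an indexed subset of the query. This reuses build_minimal_infrequent_index from the previous sketch; the tier-selection details are my guess at the idea, not the paper's code.

```python
def build_tiered_index(records, A, levels=4):
    # one minimal infrequent index per threshold A, 2A, 4A, ...
    return [(A << i, build_minimal_infrequent_index(records, A << i))
            for i in range(levels)]

def tiered_containment_lookup(query_tokens, tiered_index, records):
    q = frozenset(query_tokens)
    for threshold, index in tiered_index:               # smallest threshold first
        # any indexed minimal infrequent subset of q has a list that is a
        # superset of q's answer and shorter than this tier's threshold
        for itemset, plist in index.items():            # linear scan; a real
            if itemset <= q:                            # system would hash this
                return {rid for rid in plist if q <= records[rid]}
    # no tier applies: the query is very frequent, fall back to a full scan
    return {rid for rid, toks in records.items() if q <= toks}
```

If the query has fewer than A results, tier A applies and the scanned list has under A entries; a query with between A and 2A results falls through to the 2A tier, and so on, which is where the output-sensitive bound comes from.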
Another observation: when we use a parameter of A equals infinity as the lowest parameter, the index for that parameter value is exactly the inverted index, just because every individual keyword is minimal infrequent for A equals infinity. Yes. So we had a question about index size. We don't really make many contributions about index size and index build; we basically use the literature from frequent itemsets to do our index build, and we have arguments to support why the index is usually small, and tools to control that. So the next part is: how do we issue fewer queries? How do we generate a covering which is good? There are two things we care about here. One, we don't want to verify too much -- one covering would be the entire [inaudible], get the entire database and verify it; that's too inefficient. And we don't want to spend all our time in [inaudible]. Those are the two things we want to optimize for. As I hinted earlier, information from our index will feed into how we do CoverGen. The index has metadata which says that this query has this many results, not just for single items. So we essentially have a frequency diagram, in some sense, of our data. Now, one approach, which is the obvious approach, is the set cover approach for generating this covering. Let me try to formalize the problem. We had a query, a threshold, and a bunch of transformations, and we are able to generate our variants from these. Given these variants and the lists that we have -- we have a parameter A collection, a parameter 2A collection, a parameter 4A collection, and these are all indexed -- we also know the length of all of these lists. The question is: can we cover? Madison covers the first two queries, Madison Square covers only the first query, Square covers the middle two queries, and so on. So this is a set cover problem in the sense of: can we cover our variants using some lists that we have? The cost for each set is the length of its list, because we are going to fetch all of those records and verify against them. So I'll argue the blue set here, the three middle queries, is a variant covering because it covers all of our variants. Okay. But the thing is that we can -- by knowing the cost from our index -- yes? >>: [inaudible] the cost [inaudible] also have the CoverGen and index and cover costs? Does that also cover [inaudible]. >> Parag Agrawal: Okay. So [inaudible] this cost metric is for defining what a good cover is. This cost is essentially only measuring the goodness of our covering in terms of false positives: how many false positives will you get as a result of using this covering? >>: Not [inaudible] not the only thing, right? I mean there are two other aspects, the CoverGen and [inaudible]. >> Parag Agrawal: Absolutely. And that's why I'll say this second approach is not a good solution and then give you an alternative. >>: But if you plug in that cost function [inaudible] all these costs then you can [inaudible]. >> Parag Agrawal: That's possible. The way we went about this is that we are trying to get a good quality covering, and we're not accounting for how efficiently the covering is generated. It's like saying in query optimization that you want a good plan, but for now you're not accounting for how expensive it is to compute that plan. Okay? Think of these blue things as a plan.
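For concreteness, here is the textbook greedy heuristic for weighted set cover applied to this formulation: elements are the variants, candidate sets are indexed itemsets (an itemset covers every variant it is a subset of, since fewer keywords give a superset of results), and the cost of a set is its posting-list length. This is shown as an illustration of the idea, not as the paper's CoverGen algorithm.

```python
def greedy_variant_cover(variants, candidate_lists):
    """variants: set of frozensets. candidate_lists: {itemset: list_length}."""
    uncovered = set(variants)
    chosen = []
    while uncovered:
        best, best_ratio, best_covered = None, float("inf"), set()
        for itemset, cost in candidate_lists.items():
            covered = {v for v in uncovered if itemset <= v}   # itemset ⊆ variant
            if covered:
                ratio = cost / len(covered)       # cost per newly covered variant
                if ratio < best_ratio:
                    best, best_ratio, best_covered = itemset, ratio, covered
        if best is None:
            raise ValueError("no candidate covers the remaining variants")
        chosen.append(best)
        uncovered -= best_covered
    return chosen
```

The greedy choice gives the usual logarithmic approximation guarantee on total cost, which is what bounds the number of false positives to verify; the catch, as the talk notes next, is that you must enumerate all variants before you can even run it.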
So the cost as we are implementing it is exactly the verification cost, modulo overlap. Recall how this is going to work: we're going to take all of these lists, collect the IDs, fetch all of those records -- which is again linear in this cost -- and for each of those records run the verification algorithm. So the cost after generating the covering is proportional to the cost I have here. The cost of generating the covering is something I'm not measuring in this definition, okay? And that's actually where the set cover algorithm will have its downside. Okay? >>: [inaudible] like a dominating portion of the cost. >> Parag Agrawal: We'll see that in the [inaudible]; our experiments will show differences in where the cost goes. Okay. So to recap, using metadata from our index and a greedy set cover algorithm, we can now generate a variant covering which has the usual set cover guarantee that you're not getting too many false positives. There's a bound on the number of false positives you will get as a result of doing this. So set cover has this nice property, a bound on false positives: you will not necessarily verify a lot of things. The problem, as was noticed, is that generating the covering itself may be expensive. Since there can be a very large number of variants, in the set cover algorithm we are generating all of these variants and then running a set cover computation over them. So the set cover instance itself is very large, and hence this might be too expensive to generate, especially in the presence of a lot of transformations. To that end we have an alternative solution, which is what we call the hitting set approach, which I'll not talk about in great detail. But here's the intuition: it generates a covering -- these blue queries to issue to the set containment part -- without actually enumerating all the variants. So we can guarantee that we will cover all our variants without ever generating them. This essentially uses results from monotone dualization, or lattice theory, as applied to frequent itemsets. This intuition exists in the existing literature. But one thing that's interesting is that we require our index to have the property that it enumerates all minimal infrequent sets for this algorithm to work, because it basically works on this lattice going up and down and it has to hit the bottom. Okay? Another thing to note is that while the variant covering may be generated more efficiently than with set cover, it is heuristic in the number of false positives. Set cover had this nice guarantee, a theoretical bound on the number of false positives we'll have; here that bound no longer applies. It's a heuristic. But in our experiments we'll see how that affects us. So this is the second approach, which addresses the problem of generating the covering itself efficiently, not only limiting false positives. To set up the one experimental result that I'll show you: we have a prototype implementation as a stand-alone library. It uses an in-memory index, as you'd expect. We used a bunch of datasets. The experiment I'll show you is on a places dataset, which is probably similar to the motivating example we've been running through. It has seven million records.
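For completeness, here is what the verification phase looks like in the transformation-free case, as a minimal sketch: for each fetched candidate, check whether the Jaccard containment of the query in the record meets the threshold. I'm taking Jaccard containment of q in r to be |q ∩ r| / |q|; the paper's actual verifier additionally handles transformations via a maximum-matching formulation, which is not shown here.

```python
def jaccard_containment(query_tokens, record_tokens):
    q, r = set(query_tokens), set(record_tokens)
    return len(q & r) / len(q) if q else 1.0

def verify_candidates(query_tokens, candidate_ids, records, threshold):
    # records: {record_id: token set}; keep only true matches among candidates
    return [rid for rid in candidate_ids
            if jaccard_containment(query_tokens, records[rid]) >= threshold]
```

The total verification cost is linear in the number of candidate IDs returned by the covering, which is exactly why the quality of the covering (few false positives) matters.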
Let me talk a little about the transformations we used in this experiment. We go up to 20 million transformations across different experiments. These can be static, of the form Bob is Robert or SQ is Square, and so on -- that's a small number of them. Then there's a large number of programmatic transformations, like edit distance or abbreviations. Edit-distance transformations are everything within edit distance one or two of a word. Abbreviations are things like A could go to Alex, or J could go to Jack, John, Jim, whatever. These are provided programmatically: they're not materialized in a table anywhere but generated at query time. The experiment I'll show you compares the hitting set approach with the set cover approach, and we'll see the time split across the various parts of the system. To compare with prior work: unfortunately -- or fortunately -- no prior work handles the problem of Jaccard containment with transformations. As we mentioned, there is prior work called the prefix filter which handles the Jaccard containment case without transformations. One observation you can make is that our hitting set algorithm degenerates to this prior work if you don't have transformations. Secondly, our index, when you use the parameter A equals infinity, degenerates to exactly the index the prefix filter uses, which is the inverted index. So we can essentially implement the prefix filter directly in our implementation, and that is what we will compare to. Okay? This also gives us a natural way of saying how the prefix filter generalizes to the case of transformations: we run our implementation with transformations and the index parameter equal to infinity. Okay? So this is the plot comparing these. There's a lot of setup here. On the X axis we are changing the amount of transformations across the three sets of plots. Each set has the three algorithms I've been mentioning: the prefix filter, the hitting set, and set cover. On the Y axis we have performance; lower is better, quicker. The three colors represent the various parts of the algorithm: CoverGen is the red part; the lookup, fetch from the index, and verify is the yellow -- gold, whatever -- part; and the overheads are the green part. One thing to note for all three sets of plots is that the first plot uses less memory than the second and third because it uses the regular inverted index, while the second and third use our minimal infrequent index, which for this dataset had a 20 percent overhead. So we're comparing performance, but there is a memory difference between the first part and the second and third parts. Now let's see what this result says. When you have no transformations, the gain from using just our index is essentially a threefold improvement in performance. And the hitting set and set cover algorithms are essentially similar -- since there's not a very large number of variants, just because of the Jaccard containment threshold, we don't see a difference. One thing this says is that the hitting set heuristic works well in this case. As we increase the number of transformations, everything becomes slower, which is expected, because there's now a larger number of queries you want to issue, more error tolerance, more results. But as the number of transformations becomes really large, you start seeing that for set cover the red part blows up.
So generating the cover with set cover becomes very expensive at that point. Okay? And the hitting set has a significant benefit when the number of transformations is large. Also notice that the hitting set's index-plus-verify part is only a little more than set cover's. So while it is a heuristic, in practice for this experiment the generated covering was not bad in terms of false positives. Yes? >>: To what degree do these numbers rely on the fact that your index [inaudible] index where sort of navigating to the start of the posting list in your index becomes relatively much more expensive. How would these things change and can you adapt your algorithm to this? >> Parag Agrawal: Okay. Let me repeat the question to see if I got it. If our lists were [inaudible] on disk, there's an overhead of getting to the top of a list. So the question is how does our algorithm fare relative to the others? That's a good question. I would say that our algorithms would actually become less efficient in this case. If you look at the inverted-list approach, there's a keyword query and there are transformations, so there's a set of reachable keywords from that query, which is a small set -- linear in the size of the query and the number of transformations. So if you have 10 transformations which go to different endpoints, you have these 10 plus five, 15 essential keywords to look up. You can collect all of them and do the work in memory, in some sense, to get the answer. So the inverted index solution touches fewer posting lists; our solution touches far more posting lists. Hence, from that point of view, if you put it on disk my intuition suggests that our solution will become less efficient, because it has to touch a larger number of lists. There are two opposing factors, but my guess is that the prefix filter would sort of win. Okay. Yes? >>: [inaudible] how many words do you have in [inaudible] experiments here? >> Parag Agrawal: Query words? >>: Yes. Unique keywords that you [inaudible]. >> Parag Agrawal: I actually [inaudible] that question, so I do not have an answer. >>: How many distinct words in the whole collection? Like, how many inverted lists would it have? >> Parag Agrawal: So yeah. The conclusion of this experiment was that the hitting set, with a large number of transformations, can help significantly, and though it is a heuristic, the verification that happens is not too bad. Again, as before, there's a lot of related work that goes into this. There's work on similarity functions; fuzzy lookups have been defined for various other settings. There's obviously inverted index and list intersection work. There's work on fuzzy autocompletion, which is also relevant. Obviously we used a lot of work from frequent itemset mining. And maybe more that I forgot to list here. So we've seen the two main parts of the talk. Allow me to give you one-slide overviews of the trendy topics to promote discussion. The first one is: how do you use SQL inside what people call NoSQL systems? We started with a NoSQL system called PNUTS. This is essentially a data store for serving Web content. What this system does is range scans and point lookups, and the key ingredients of the system are that it works at very large scale with horizontal scalability and very low latency for Web serving. It does geographic replication to get low latency for workloads which are global.
It gives weak consistency guarantees at a per-record level rather than coordination across records. The system is designed to be highly available and very fault tolerant; it thinks about a bunch of failure modes and tries to make sure that the data stays safe and performance is not badly affected under various kinds of failures. What we did was add a notion called replica views, which enables you to do a richer set of SQL on this rather than just the simple functionality: secondary access, simple joins, equijoins, some group-by aggregates. There's a lot of fine print as to what part of SQL you can do. You cannot do ad hoc SQL; you can only do predeclared SQL, and you can only do [inaudible] SQL. But we claim those are the interesting parts for Web serving. I'm not going into the fine print and will just leave you with the good stuff. The key idea was replica views, which saved us from having to worry about keeping all of this working at scale: by treating our views as PNUTS tables, the same way replicas of PNUTS tables are treated, we were able to carry over all the same consistency abstractions, we were able to get the whole geographic replication, and we were able to get the fault tolerance. So the cool idea was that by reusing the machinery and conceptually thinking of our views as replicas, which just act a bit differently, you get a lot of benefits in terms of abstractions and engineering. Okay? This was work I did with Yahoo!. A second one was shared scans in a MapReduce-like system. The context is that Yahoo! has a Hadoop cluster with a bunch of datasets, some of them very large. Pig sits on top of this, and there are queries being written against all of these datasets continuously. One observation is that a large part of the time spent by this system goes to analyzing a few very popular datasets -- there's a bunch of queries attacking the same datasets. So there's an opportunity to share work across these queries: one scan can answer, or help answer, multiple queries. That's the opportunity. The way we went about it is we modelled how to anticipate query arrivals, because essentially to do the sharing you have to make one query wait while another one arrives that it can share with. By anticipating query arrivals we were able to build a priority-based scheduler which can do this scan sharing to enable efficient computation of the queries. The challenge is that this increases throughput while possibly hurting latency. So our contribution was to define a metric for how to share scans, because in theory you can just make everyone wait a long while and get more sharing -- conceptually more throughput -- but the latency increases. Okay? So we defined metrics for how to do this scan sharing, and we had a paper about that in VLDB. So now I've spoken about two parts in detail and two parts with just one slide to promote discussion. Let me just throw up a bunch of other work that I've done. I'm currently working with folks at AT&T on a data mining problem, which is how do you summarize databases. I've been part of a study group essentially to make the case for RAMClouds, essentially a PNUTS-like system except that [inaudible] all your data is in memory. It's still durable, but it's always [inaudible] from memory. This introduces a bunch of challenges and a bunch of opportunities, and we essentially ran a study group to figure this out.
And we made the case for RAMClouds with a large number of people at Stanford. And then there's a bunch of other work on uncertain data, which is not about integration but about core uncertain data. So with that, let me throw up a list of my coauthors over the last few years. I've worked with people at Yahoo! Research, Microsoft Research, AT&T Labs, and Stanford Computer Science outside of the InfoLab, and a bunch of people at the InfoLab. I'd like to thank all of them, as well as you guys for attending. Thank you very much. And I have like a minute for questions. [laughter]. [applause]. >> Christian Konig: Okay. If we have one question, we'll take that. Otherwise we'll just stop here. I think we'll stop. Let's thank the speaker again. [applause]