>> Christian Konig: Good morning. It's my great pleasure to introduce Parag
Agrawal from Stanford University. He's been working in a wide range of areas.
The main thrust of his work is on probabilistic and uncertain data, but he's also
been working on data cleaning, namely indexing and containment of data, but also
things like secondary index maintenance and distributed systems and even
something as farfetched as blog ranking. So if you meet with him later, feel free
to ask him about any of these areas. And other than that I'll hand it over to
Parag.
>> Parag Agrawal: Thanks, Christian, for the introduction. He already finished off
two of my slides, so please forgive the repetition.
So as Christian said, I'm going to talk about -- the title of the talk is Coping with
Uncertain Data. I'll talk about some of my thesis work, which is all about
uncertain data. And then I'll talk about some work I did here at Microsoft
Research with Drago and Arvin [phonetic] which was about indexing [inaudible]
containment but it has challenges. One of the challenges comes out of dealing
with uncertainty. Okay?
So as Christian also mentioned, I do a bunch of other things. I'll try to give you
some introduction to some of those things, like one-slide or one-bullet overviews,
to promote talking about those things in my one-on-ones.
So let's get started with uncertain data. Let me start with an example and tell you
what uncertain data and uncertain databases are. So let's say I wanted to
extract the dates of the 2010 World Series games. And let's say for the sake of
PowerPoint that there are three games in the World Series. So suppose I have a
Wikipedia article that I want to extract from. And I write a very, very naive data
extractor. It's not smart, it's not optimized. And it just finds two dates
and nothing else. It doesn't tell you which game is when or any such information.
So at this point, if I think about it, this could mean that game one and game two
are on October 28th and 30th, given these two dates. But it could also mean
that game one is on the 28th and game three is on the 30th, or it could mean that
games two and three are on the 28th and 30th.
Notice that I've used some knowledge here in generating these alternatives or
possibilities, which is that game one is before game two, which is before game
three, correct? And I've also sort of implicitly assumed that these dates are
associated with at least one game and that games are on distinct dates.
So based on all this knowledge I've sort of conceptually created what we call
possible worlds. So we have three possible worlds listed here for a database
which is about games and their dates. So I'd argue these three yellow tables
are three possible worlds of an uncertain database.
So intuitively an uncertain database consists of possible worlds, and the intuitive
meaning is that the real state of the database is one of these possible worlds.
Okay? Let's move on.
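As an aside, here is a minimal sketch of what those three possible worlds look like as data; the tuple values are illustrative, not from the slides.

# Illustrative sketch: the three possible worlds from the World Series example,
# each world a set of (game, date) tuples consistent with the two extracted
# dates and the ordering/distinctness assumptions.
possible_worlds = [
    {("game1", "Oct 28"), ("game2", "Oct 30")},
    {("game1", "Oct 28"), ("game3", "Oct 30")},
    {("game2", "Oct 28"), ("game3", "Oct 30")},
]
# The uncertain database is this collection; its real state is assumed to be
# exactly one of these worlds.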
Another way uncertain data can be created is through sensors. Let's assume
we have three rooms. And we have a sensor, let's say the blue sensor,
which can sense some part of room A and room B. Okay? And the
sensing is of the form that if there's an object within the shaded blue area of room
A or room B, the sensor is going to read its ID. So suppose this sensor reads an
ID. At this point it reads the ID of object 2, so it knows that object 2 is in this
blue shaded area. So it would have to be either in room A or in room B. So it's
created, conceptually, this uncertain database which says room A or room B.
Similarly, suppose there is this other sensor, the yellow sensor, which now again
detected object 2. This again reports an uncertain database which says that
object 2 is in room B or room C. At this point, knowing what the sensors do,
we know the object has to be in this light green area, because it's detected by
both of the sensors. Also notice this color scheme will sort of go through the
talk: blue and yellow combine to form green.
So at this point, by combining information from both of these sensors -- doing
data integration in some sense -- you should be able to conclude that the object is
in room B. Okay? So here what we've conceptually done for this example is
some reasoning with uncertainty, and we've sort of resolved or removed the
uncertainty. Okay? And this will be a theme during the first part of the talk.
Another kind of uncertainty, which is very different, is that data has non-canonical
representations. So here the motivating example I'm talking about is we have
venues and their names. But people refer to these names using different ways or
different representations.
A user could refer to Square as SQ and New York as NY. So due to these
non-canonical representations, what we conceptually have is multiple names for
the same venue. And there could be a very large number of them if there are
more such non-canonical representations; in fact, an exponentially large
number of them.
One of the challenges this introduces -- in this case, for the motivating example of
searching these names -- is one of efficiency. How do we do efficient search in the
presence of this exponential explosion? The second part of the talk
will sort of deal with this challenge and get into more details there.
So with this, the talk outline is: in the first part of the talk we'll talk about data
integration, which was about reasoning with uncertainty as we saw on one of the
slides. The second part will deal with efficiency -- efficient fuzzy lookups, which
is work I did at Microsoft Research with Drago and Arvin.
And then I'll just have one-slide overviews of some work I've done on the trendy
topics to promote discussions.
So what's an uncertain database, revisiting that? As I mentioned, an uncertain
database is essentially a collection of possible worlds. I'm not talking about
uncertain databases with probabilities in this talk; I'm focusing on uncertain
databases rather than probabilistic databases. Probabilistic databases typically
are these possible worlds with a probability distribution over them, and I'm not
going to get into that during this talk. So think of uncertain databases as sets of
possible worlds.
And what are the semantics of queries for uncertain databases? When a query is
issued, it in concept is issued for each possible world. The answer is
created as a possible world of the result. So the result is also an uncertain
database, which is a set of possible worlds.
So conceptually the query is applied to each possible world and you get an
uncertain database as a result. So this is the semantic layer, never actually done
in practice, because the set of possible worlds can be very large.
These are represented using a compact representation. Think of the
non-canonical slide, wherein I essentially said that by storing just how a user
might refer to Square -- SQ for Square -- we have a compact representation of a
large number of possible worlds.
So you would have a compact representation for an uncertain database. Query
processing now involves getting a compact representation of the result without
ever expanding out to the possible worlds. So this is the efficiency part, and this
is what -- the bottom part is what a typical uncertain database implements, and
the upper part defines what they want to implement, the semantics part. Okay?
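As a rough illustration of this semantic layer (not how any real system implements it), a query can be thought of as being applied world by world; the function name and tuple shapes below are assumptions for the sketch.

def query_uncertain(possible_worlds, predicate):
    # Conceptually apply the query to every possible world; the distinct
    # per-world answers form the possible worlds of the uncertain result.
    results = []
    for world in possible_worlds:
        answer = frozenset(t for t in world if predicate(t))
        if answer not in results:
            results.append(answer)
    return results

worlds = [frozenset({("game1", "Oct 28"), ("game2", "Oct 30")}),
          frozenset({("game2", "Oct 28"), ("game3", "Oct 30")})]
# "Which games are on Oct 28?" yields two possible answer worlds.
print(query_uncertain(worlds, lambda t: t[1] == "Oct 28"))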
So what are applications of uncertain databases? I mentioned sensors, which is
a common application. Extraction is another important application
because it generates a lot of uncertain data.
Scientific data is another place where a lot of uncertainty occurs. And data
integration, because you mostly might not know exact results for
entity matching. But one thing you can observe in all these applications is
that there's often not only one source of data, the blue source, but also other
sources like the yellow and red sources.
For example, you have multiple sensors which are reporting partially overlapping
values. You can have multiple extractors on the same webpage. You can have
multiple webpages which are talking about the same information. So you would
conceptually like to be able to combine information from all of these. And each
source essentially is an uncertain database. Okay? In scientific data as well you
could have multiple experiments or multiple observations supporting the same
hypothesis.
Data integration by definition has multiple sources. So one common theme in a
lot of applications of uncertain data is that you have multiple sources of
information, all of which could be uncertain. So that's sort of what we'll handle in
this talk.
I forgot to mention: please stop me at any point to ask questions, clarifications,
anything. So let's move on with this motivation to talk about uncertain data
integration. So what would -- let's say [inaudible] same picture. What would
uncertain data integration look like? So what we have is a collection of uncertain
databases: the yellow, the blue, and the red and more.
In typical data integration fashion, we'd like to have a unified query interface for
querying information from all of these sources, which would be a mediated
schema. We need a bunch of mappings to associate each database with the
mediated schema. This is just standard data integration applied to the uncertain
context.
And now one question is: when a query is issued to this mediated schema, what
should the result be? Okay. So this is one thing we'll talk about during this talk.
The second part, which is how do we compactly represent these uncertain
databases and how do we efficiently do query processing over these compact
representations to get a result representation, is something I will not
discuss in this talk, but we have some work in our papers about this. So what
I'll do is focus on the upper part, defining what should be the result of
uncertain data integration. Okay?
So this will be primarily theory and definitions. The second part will be more
applied. Okay?
So let's take a step back and see what data integration typically looks like and
the high-level objective. In a very simplistic world I have two certain tables, the
blue table and the yellow table. I'm trying to combine them. So intuitively what I
want to do is get their union in some sense.
So here what I'm trying to show is that the green part is tuples that occur in both
databases, the yellow part is tuples that are only in the yellow database, and
similarly for the blue part. So what we've gotten is more data as a result of data
integration. Notice that I've sort of assumed that both of these are in the same
schema with identity mappings and everything. But intuitively we get more
data as a result of data integration.
In the uncertain world what we have is we're starting with two sets of possible
worlds, two uncertain databases, and we're trying to combine them. What I am
going to argue is that you're going to get more data, but you're also going to get
less uncertainty as a result of this combination, as in the motivating example about
getting a certain result for where the object was. So this will be a theme in the
talk: how do we get less uncertainty, how do we make this happen.
So again, one intuition that -- sure?
>>: When you say it's only going to get less uncertain, how do you know [inaudible].
>> Parag Agrawal: So that's a very good question. So it's not always less
uncertain. For instance, if you have two databases which are totally unrelated
to each other, that is we cannot correlate the possible worlds of one with the other,
that tells you there's absolutely no overlap in them. That means that when you
combine them, you're not going to get any less uncertainty. So
more or less uncertain is sort of ill-defined at the moment, right? One way
to think about it is, in the context of knowing about the same data, the fewer
possible worlds there are, the more certainty there is. But that is in the context of
the same data. If you get information about a hundred more tuples, counting just
doesn't do it, okay? So let's now revisit the extraction example. What we
had was one extractor which gives two dates, the 28th and 30th. And from
that we created this uncertain database with all the assumptions I spoke about.
Suppose you have another extractor which gives two other dates, okay, the 27th and
28th. At this point, this also has an associated uncertain database. So what we'd like
to do is combine these two uncertain databases to achieve this result, which at
this point is built on assumptions. Because if you reason it out yourself, we now
know three dates. We have the assumption that all of these dates
are associated with games. We also know that these games are on distinct dates
and game one is before game two, which is before game three. So you can reason
your way to saying, okay, this should be my result.
And the way you get this from our uncertain databases is that two of the
possible worlds are agreeing with each other on the game two date, and the
other possible worlds are contradicted by information in the possible worlds of
the other database. Okay? So we essentially want to formalize the
intuitive reasoning you did to create this certain result. We'd like to formalize it in
terms of data integration over possible worlds. Okay?
So let's move on. The setting we'll be thinking about for our data integration
problem is the local-as-view setting. What the local-as-view setting
means is you have a collection of sources and a mediated schema, which is the
unified query interface. And the way the mappings are defined between these
sources and the schema is what is local-as-view about it. So you can think of each of
the sources as being a view over the actual database. The mediated database is
the actual database. Each of these sources is defined by a query, Qi, which is a
view over that database, and the constraint is that the mediated database, the actual
database, should contain at least all of the information in the view. And, hence,
each of the sources is a view of this mediated schema. What this implies is that M
has to be such that it has enough data so that when the query [inaudible] is applied
to it, you recover your views and maybe more. Okay?
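A small sketch of that local-as-view constraint for the certain case might look as follows; the source sets, identity views, and helper names are illustrative assumptions, not anything from the talk or paper.

def is_valid_mediated_db(M, sources, view_queries):
    # Local-as-view constraint for certain data: for every source S_i and its
    # view query Q_i, the source must be contained in Q_i applied to M.
    return all(S <= Q(M) for S, Q in zip(sources, view_queries))

identity = lambda M: M
blue, yellow = {("a", 1), ("b", 2)}, {("b", 2), ("c", 3)}
M1 = blue | yellow | {("d", 4)}   # the union plus one extra "red" tuple
print(is_valid_mediated_db(M1, [blue, yellow], [identity, identity]))    # True
print(is_valid_mediated_db(blue, [blue, yellow], [identity, identity]))  # False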
So let's see how this works in the certain data case. Again, we have the
blue and the yellow database. You combine them. These [inaudible] are satisfied
by this M, called M1, which is essentially the union as I said before, plus one
extra red tuple, which is not in the sources.
This satisfies the local-as-view mappings, so this is a valid mediated
database. Okay? Similarly there could be others which have some other red
tuples. Okay? But among them there is this one, which is exactly the union,
which is in some sense the least informative of all valid mediated databases,
which does not have any information that is not implied by one of our sources.
Okay? And that's essentially the definition. So I've viewed it in terms of the union,
but you can apply the same argument over queries on mediated
databases.
So essentially the set of [inaudible] answers defined for data integration is the least
informative answer from all valid mediated databases. Yes?
>>: So what if I added a possible world in which [inaudible] a fourth [inaudible] so
that if we backed up to the previous slide you would [inaudible] and now there really
isn't going to be a certain answer that I believe is going to be covered by all of
them. Then what -- does that mean data integration is dead?
>> Parag Agrawal: No. So actually, if you had uncertainty of the
form that there are four dates and only three real games, then conceptually
the answer at this point, given our current knowledge, is an uncertain
database. And that is the need for doing uncertain data integration.
So, for instance, to go to our example actually -- let me try to rewind that a little
bit. Sorry.
>>: [inaudible].
>> Parag Agrawal: Okay. Was it clear? I can explain it. Sorry about this. Yes.
So suppose these two extractors had now reported, let's say, completely different
dates -- say they had reported the 29th and 31st. Okay? But we've currently
used the reasoning that all of these dates are correct. So we haven't
created a possible world saying that the 28th may not be the date for any of the
games, in which case having these two sources would in some sense become
inconsistent, which will come up later in the talk.
But now if we had allowed the possibility that maybe these dates don't
correspond to any games, we would have an empty database here, and databases
with just single tuples saying that game one is on the 28th, or game two is on the
28th, or game three is on the 28th, and we would have created all of these
possible worlds. In that case, when you combine them, you'll essentially get the
intuitive answer that you'd like, which is that there are these four dates, and in
some order, in some permutation, some three of them correspond to games one,
two, and three. That's what would happen as a result of what we do. That's a
very good question. Thanks.
So moving on, in the certain case we said we have this notion of certain answers,
which is the least informative of all valid mediated database answers. Yes?
>>: [inaudible] is it the same subset of [inaudible] is it S a subset of Qi of M
[inaudible].
>> Parag Agrawal: So it is: S is a subset of Qi of M. So think of it this way.
These are constraints for the mediated schema. It's local-as-view -- what you may
be thinking of is global-as-view, where you tell us how the global mediated
database is built, so you can add information from the sources. Here what
we are saying is that we put constraints on the mediated database such that it has
to have enough data to satisfy these mappings. Okay?
>>: [inaudible].
>> Parag Agrawal: And this example essentially tries to show that anything
which has all the blue tuples is allowed. So the Qi in this example is the
identity. Okay? So this is the certain database case -- yes?
>>: [inaudible].
>> Parag Agrawal: Yes. This is a query over the [inaudible] mediated schema. And
we'd like to define what the answer for that is. Okay? Now, conceptually, according
to these mappings, any M is valid as long as it contains essentially all
your sources, if you think of identity mappings. Okay. So all of M1, M2, and M3
are valid mediated databases. Okay?
So now the semantics of the query are the answers which are in all of these. So
in the certain case this boils down to saying the least informative answer from all
valid mediated databases, in concept, is the certain answer. Okay? So that's what
the notion of certain answers is. And now we are going to try to find the
corresponding notion in the uncertain case: define what the result is,
what these mappings mean. That's where we're going. Okay?
So the key here that I'd like to point out is this definition of containment.
We've used containment in defining what these mappings mean, and we've used
containment in defining -- sorry -- what least informative is. So conceptually, in
the certain case, we know what the definition of containment is: if all the
[inaudible] from database one are contained in the other, it's contained in the
other, right? And that same intuitive notion we have in our minds is what we've
used in defining what these mappings mean, in terms of what is a more
informative mediated database than all your sources, and which is the least
informative among all valid mediated databases.
So what we'll essentially try to do is define a containment definition for uncertain
databases which captures this notion of one being more informative than another,
and then derive the corresponding definitions of what these mappings mean and
what the certain answer is. Okay? So the key is the definition of containment.
So let me now formalize what I mean by an uncertain database so that I can
move on to defining containment. So an uncertain database consists of
two parts. One is the tuple set. Intuitively, the tuple set is the collection of tuples
that this database is aware of. A database is under the open-world assumption, so
it does not know about all the tuples in the world. It knows about some slice of the
world. Okay?
So that's the tuple set shown here in light blue. So it knows about, say, object 2,
rooms A and B, and its possibility of being there. And there's a second
part, which is the set of possible worlds. Within these tuples it knows that here
are the possible worlds for these tuples. So each possible world is a subset of the
tuple set. In this example, W1 and W2 essentially say that one possible
world is that the object is in room A and the other is that the object is in
room B.
Here the information it is telling us is that object 2 is in at least one of these two
rooms, because if we did not have that information, it would have added the
empty database as a possible world. It is also telling us that object 2 is not
simultaneously in both room A and room B, because if that were the case, it
would have added another possible world which had both of these tuples. Okay?
So its information comes from not enumerating some possible worlds. So think
now of the [inaudible] case where the possible world set is just the power set of
the tuple set, so all combinations. In that case this database is really giving us no
information. It essentially says "I don't know" to every bit of information you can
ask it. I don't know whether this happens, I don't know whether these two happen
together, I just don't know anything. So this is the no-information case.
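For concreteness, a rough sketch of this two-part formalization (tuple set plus possible worlds) and of the no-information case, using the sensor example; the class and names are assumptions for illustration.

from itertools import chain, combinations

class UncertainDB:
    # An uncertain database: a tuple set plus possible worlds, each world a
    # subset of the tuple set.
    def __init__(self, tuple_set, worlds):
        self.tuple_set = frozenset(tuple_set)
        self.worlds = [frozenset(w) for w in worlds]
        assert all(w <= self.tuple_set for w in self.worlds)

# The sensor example: object 2 is in room A or room B, not both, not neither.
sensor = UncertainDB({("obj2", "A"), ("obj2", "B")},
                     [{("obj2", "A")}, {("obj2", "B")}])

def power_set(s):
    s = list(s)
    return [set(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

# The no-information case: every subset of the tuple set is a possible world.
no_info = UncertainDB({("obj2", "A"), ("obj2", "B")},
                      power_set({("obj2", "A"), ("obj2", "B")}))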
Second, again to stress that absence is information -- okay. So suppose now the
tuple set contained a third tuple, 3P: is object 3 in room P?
Now suppose I keep the possible worlds the same while the tuple set is
increased. So now what it's saying is that 3P does not occur in any possible world.
So the absence of a tuple in a possible world is also information, because here
it's saying that object 3 cannot be in room P. It has somehow exhaustively
looked at room P and determined that object 3 is not in there. So absence of a
tuple in a possible world is also information. Okay?
With that intuition let's move forward. Now let's try to define containment in one
special case, which is when an uncertain database has only one possible world.
This sounds very much like a certain database, but with the twist that absence
of information is also information -- absence of a tuple is also information. So think
of this example.
Let's say we have an uncertain database U1 and its tuple set is this light set
of tuples. Since there's only one possible world, I can show it using this
representation: the dark tuples are the ones present in W1, while the
tuple set is the entire thing.
So in some sense it is saying that the light tuples here are not present in the
database. Okay? So that's the information here. Suppose we similarly have a
second database, the blue one. Yes?
>>: [inaudible].
>> Parag Agrawal: So [inaudible] you can think of it this way.
>>: [inaudible] some measure of --
>> Parag Agrawal: [inaudible] So typically what will happen is the extractor will
enumerate the set of possible worlds for you, and hence any tuple that occurs in
any of these possible worlds is definitely in the tuple set. But there may be more.
Think of it this way. In the base data, in the original relations, we may not
encounter scenarios where an extractor explicitly gives you negative information,
which is of the form that an event just did not happen.
But when we start combining these uncertain databases and we get a resulting
uncertain database, we were able to eliminate the possibility that object 2 is in
room A. We'd now like to capture that, because by combining information we can
get to negative information even if we started with no negative information.
Okay? So to keep our model complete in terms of incorporating results of
combinations of uncertain databases, and to keep those within the world of
uncertain databases, we need to be able to represent such negative information
in our definition of uncertain databases. Okay?
But you can also imagine that these sources do tell you about negative results.
Think of it this way. Suppose I created a source like [inaudible]: I have a
sensor which completely records information for two rooms. Okay? And it does
not detect the object there. Okay? So it can now assert that object 2 is not in
room A and it's not in room B. Okay? This can be information that the uncertain
database corresponding to this sensor can report. Okay?
So now, again to go back to the slide: in the world of single possible world
databases, the light tuples are ones where this database is asserting that these
tuples are not present in the database. The dark ones are ones it is asserting are
present in the database. We are trying to figure out whether U1 is
contained in U2.
The definition here is kind of trivial, right: you want to make sure that
for everything that U1 knows about and asserts is definitely contained, U2
must know about it too. So in terms of presence of tuples, U2 should have more
information than U1. Similarly, in terms of absence of tuples, U2 should
have more information than U1. Okay?
If these things are satisfied, U1 is said to be contained in U2. So this is just a
minor extension to the containment definition for regular databases, incorporating
this absence-of-tuple information. Okay? So now let's try to generalize this to two
sets of possible worlds. Okay?
So again we have these two uncertain databases, the red one and the blue one.
We are trying to figure out whether the blue uncertain database has more
information than the red uncertain database. Intuitively, fewer possible worlds
means more information. Okay?
So to do this, what we want to say is: for each possible world in U2, does there
exist a possible world in U1 such that the single possible world definition we had
before applies? Okay? So basically what it's saying is that we don't need all red
possible worlds to be represented in the blue one, because the blue one is
allowed to be more certain. It is allowed to eliminate possibilities from the red
one. Right? But everything in the blue one should be allowed by the red
database; the blue database is not allowed to contain a possible world that is not
backed by the red database. Okay? So for every possible world in the blue
database, we would like there to be a corresponding one in the red database so
that there's this containment, which says that this blue one is more informative in
that it is able to eliminate some of the possible worlds of the red database.
But it contains all the information in the other one -- the worlds which it allows
are bigger than the corresponding red ones. Okay? So that's the definition of
containment for uncertain databases, explained intuitively. For people familiar
with power domains, this is a Smyth lifting of the single possible world
containment definition. Okay?
So the key changes we've made in going from regular to uncertain databases are:
we said absence of tuples is information in the context of uncertain databases,
and that intuition was used to create a definition for single possible world
databases. And then we said that fewer possible worlds means more information,
and hence we used the Smyth lifting to define the containment relation for
uncertain databases. Okay?
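Here is a hedged sketch of my reading of these two definitions -- single possible world containment plus its Smyth lifting; it is illustrative, not code from the paper.

def single_world_contained(present1, tuples1, present2, tuples2):
    # A single-possible-world database is (present, tuple_set): tuples in
    # "present" are asserted present, tuples in tuple_set - present are
    # asserted absent. U2 must carry at least U1's presence and absence info.
    return present1 <= present2 and (tuples1 - present1) <= (tuples2 - present2)

def uncertain_contained(worlds1, tuples1, worlds2, tuples2):
    # Smyth lifting: U1 is contained in U2 (U2 at least as informative) iff
    # every possible world of U2 extends some possible world of U1.
    return all(any(single_world_contained(w1, tuples1, w2, tuples2) for w1 in worlds1)
               for w2 in worlds2)

blue_worlds, blue_tuples = [{("obj2", "A")}, {("obj2", "B")}], {("obj2", "A"), ("obj2", "B")}
merged_worlds = [{("obj2", "B")}]
merged_tuples = {("obj2", "A"), ("obj2", "B"), ("obj2", "C")}
print(uncertain_contained(blue_worlds, blue_tuples, merged_worlds, merged_tuples))  # True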
Now let's see how this definition takes us to data integration. So again, we have
the same setup, the [inaudible] setup. We have [inaudible] mappings from
sources to the mediated schema. We have for this example now two sources,
each of which is uncertain, the blue and the yellow.
A possible mediated database is formed by combining two possible worlds and
adding some tuples as before, the red ones. Similarly there could be another such
database, which is [inaudible]. These arrows correspond to containment: this M2,
which is an uncertain database -- again a single possible world uncertain
database -- contains one entire source, and it also contains the corresponding
possible world demonstrated for S1.
So now you could have many such ones, but you could also have uncertain
databases which contain both of these; hence the query result can be uncertain
as a result of data integration, as we observed before. Again, there will be a least
informative valid mediated database. And we show in the paper a result which
says that there is a unique such answer to every query.
So given that the definition of containment is a partial order, this might not
always be unique. For the definition I have presented, there is a unique
minimum, which is what we'll call the strongest correct answer. So notice again
that we use the definition of containment in two places: we have used it to
define what valid mediated databases are, and we've used it to define which one
of these is the least informative one. And we can show that the least informative
one is unique. So that is the query result you would like, and the result
[inaudible]. Okay?
>>: The least informative one?
>> Parag Agrawal: Yes. So think of it this way. The valid mediated databases are
the set of databases constrained by the mappings to be more informative than
each of your sources. There are a lot of such databases. But some of these
databases will have information that is not implied by any of your sources --
these red tuples -- which is just junk in some sense, in that none of your
sources tell you this information, but the mappings still permit it as a valid
mediated database, okay?
So you have this large set, each member of which is more informative than all of
your sources. Among these you want the least informative one as your answer,
just like the intuition for certain answers. And that's what we call the strongest
correct answer in this case. And it's the notion corresponding to certain answers
for uncertain data integration.
>>: Smallest database [inaudible] necessarily [inaudible] you want the smallest
database [inaudible].
>> Parag Agrawal: Yes, the smallest database, yes.
>>: So [inaudible]. So when you answer, you don't require a single instance of
the mediated database, right? [inaudible] constraints. For the given query you
want the intersection of the result over all of these, right?
>>: Yes. Yes.
>>: So you have now [inaudible] saying that there are multiple possible answer
databases. And what does it mean to say you [inaudible].
>>: The intersection.
>>: [inaudible] intersection?
>>: Yes, in some sense.
>>: The query result on any one of these databases is an uncertain database?
>>: Yes.
>>: And the query result on -- and so on. So you have a bunch of uncertain
databases as a query [inaudible] what [inaudible].
>> Parag Agrawal: Okay. So think of the intersection operation in the certain
world as the operation which takes the containment defined according to
just set containment for tuples. Okay? The intersection gives you the unique
minimal set that is contained in all of your answers. That is the intersection
operation. Right?
So the corresponding operation here is: we have a containment definition for
uncertain databases. You have a bunch of answers which are all uncertain
databases. The database contained in each of these is the result of the
intersection operation. So you've taken the intersection operation and defined it
according to a partial --
>>: [inaudible] there exists a unique --
>> Parag Agrawal: Yes, it's well defined. Exactly.
>>: [inaudible] the negative tuples.
>> Parag Agrawal: What?
>>: Compared to the other case, the main difference is you're accounting for the
negative tuples; is that correct?
>> Parag Agrawal: No, there are two differences. Okay. When you degenerate
this to single possible world databases, the difference is the negative information.
But the Smyth lifting is the second part. So there are two ingredients:
one, knowing that uncertain databases can have information about tuples not
existing; two, that fewer possible worlds is more information. And these two
combine to form this containment definition, which forms the core of the entire
semantics for uncertain data integration. Okay? So --
>>: So --
>> Parag Agrawal: Yes?
>>: You said there's a unique database.
>> Parag Agrawal: There's a unique uncertain database.
>>: There's a unique uncertain database, yes. Is that uncertain database a single
world or is it a distribution over possible worlds?
>> Parag Agrawal: It is a collection of possible worlds. So in this case, M4 is the
unique one -- let's say the mappings are identity and the query is identity for this
example, in which case M4 would be the unique answer we would present,
which is an uncertain database. But --
>>: [inaudible].
>> Parag Agrawal: One question which I [inaudible] earlier in response to a
question was that you can get inconsistency. How does that happen? Think of
the two uncertain databases, the red one and the blue one, and these are just two
tuples. The red database asserts that A and B are mutually exclusive: they do
not occur together in any world, because it presumably knows about both of
them and does not allow the AB possible world.
Similarly, the blue database asserts that A and B always occur together. So if A
occurs, B has to occur, and if B occurs, A has to occur.
These two pieces of information are conceptually inconsistent. There can be no
possible world where both of these things are true, and hence no uncertain
database, according to our definition, can contain both of these uncertain
databases. Hence, this set of sources is now inconsistent.
So the definition of consistency, which is fairly natural, is: does there exist a
mediated database which satisfies all your sources and all your mappings?
Okay? So that's what the definition of consistency is. And that's an example
where databases may not be consistent.
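A brute-force sketch of this consistency check, under the simplifying assumption of identity mappings; it enumerates one possible world per source, which is exactly the exponential blow-up in the number of sources that the hardness result below is about.

from itertools import product

def sources_consistent(sources):
    # Each source is (worlds, tuple_set), with identity mappings assumed.
    # The sources are consistent iff we can pick one possible world per source
    # such that no tuple required present by one pick is required absent by another.
    for picks in product(*[[(w, t) for w in worlds] for worlds, t in sources]):
        required = set().union(*[w for w, _ in picks])
        forbidden = set().union(*[t - w for w, t in picks])
        if not (required & forbidden):
            return True
    return False

# The slide's example: red says A and B are mutually exclusive, blue says
# A and B always occur together.
red = ([{"A"}, {"B"}], {"A", "B"})
blue = ([set(), {"A", "B"}], {"A", "B"})
print(sources_consistent([red, blue]))  # False: no mediated database exists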
>>: So these are [inaudible].
>> Parag Agrawal: Yes. [inaudible] sources.
>>: And so why couldn't I have a distribution over the possible worlds? What is
it -- every possible world in the mediated schema has to contain both of them, is
that the problem?
>> Parag Agrawal: No, so the problem would be that -- think of an uncertain
database. If it has to contain information from the red database, then for the first
possible world it knows about A and B, so it has to have, according to the single
possible world definition, a world where A exists and B does not exist.
So for all of its worlds it would have to have this information that B does not exist
when A exists. Okay?
>>: I see. So every possible world has to be consistent --
>> Parag Agrawal: [inaudible] yes.
>>: And there's no possible world that can be --
>> Parag Agrawal: Exactly.
>>: Consistent with these two sources?
>> Parag Agrawal: Yes. Thanks for that question. So now this naturally leads us
to the first question, which is how do we check whether a collection of sources is
consistent? So the problem is completely defined: you're given a collection of
sources, you're given their mediated schema, and you're given a bunch of
mappings. The first question is, without even having a query over the mediated
schema, are these sources consistent with each other? It's easy to see that this
problem is NP-hard.
Another observation you can make is that it's PTIME for a constant number of
sources. So the hardness comes from a large number of sources. Okay? But
think about applications which perform extraction: we said that each extractor
over each webpage could be an uncertain database. So we'd like to be able to
handle a larger number of sources. So this hardness result is in some sense bad
for us.
So the hardness is not in the size of each uncertain database but in the number
of uncertain databases. Okay. We also have another PTIME result that says that
whenever these sources induce acyclic hypergraphs -- and I'm not going into how
hypergraphs are induced from sources, but I'm happy to explain that later when
we have more time -- we can show that case is in PTIME. Okay.
So these are just a collection of theory results to set up the problem. Yes?
>>: [inaudible] just wondering why [inaudible] completely extract [inaudible]
produce the --
>> Parag Agrawal: Mediated databases.
>>: [inaudible].
>> Parag Agrawal: So I think -- okay. So there are two answers to that. One,
local-as-view in itself may not be very interesting. But the containment definition
is not restricted in its application to local-as-view; the containment definition
guides us in doing global-as-view as well. We just haven't [inaudible] there.
Okay?
Second, local-as-view is interesting when you don't want to materialize the
mediated database. And the reason for it being interesting more so in the
context of uncertain databases is that you have a collection of sources, each of
which has its own representation. Since [inaudible] uncertain databases are sets
of possible worlds, this union operation of just putting things in may not be
computationally efficient. It might be computationally helpful not to create this
mediated database and to defer this uncertainty reasoning, which is the hard
part, to the query answering part, doing it on demand only when required.
>>: [inaudible] as far as I understand [inaudible] definition [inaudible].
>> Parag Agrawal: So, yeah, [inaudible] for sure. So monotonic queries, basically.
What we're talking about for mappings and for answering queries are monotonic
queries.
>>: Mappings I can -- yeah, I can assume that that's [inaudible].
>> Parag Agrawal: Right, yeah, but these sorts of definitions break when you
don't have monotonic queries, because certain answers then become trickier to
define. The definitions will break. So that's correct. If there are no more
questions, I'll move on.
There's a paper with a collection of interesting theory results that people can look
at if they are interested.
With this, I'm going to conclude this part of the talk and move on to the second
part. So I'll just tell you what else is interesting here. This slide essentially shows
how much we understand: the darker the item, the more we understand about
how to do it; the lighter items we don't.
In the talk I always kept identity mappings and very simple things, but we can do
multiple tables and monotonic queries. We understand how to do those. We know
how to do some things with efficient representation and how to do efficient query
answering for these restricted cases.
When we start adding probabilities, we have some ideas that we are writing up in
a new paper. And the idea is essentially, at a high level, to think of each uncertain
database as a belief function from evidence theory. Belief functions are a
generalization of probability functions which let you express ignorance. I'm happy
to talk about this stuff later, but this is very important because each of our
uncertain databases is ignorant in many ways: it does not know about certain
tuples, it does not have complete information. So that's why we use belief
functions in the context of uncertain data integration. And using that, we can now
do interesting things like represent non-reliability of sources, which can be part of
ignorance. Okay?
We can also make the consistency of sources partial. [inaudible] Without
probabilities it is absolute, zero or one. With probabilities, only some probability,
some mass, is consistent. So you get this partial inconsistency.
Then you can also think about the fact that when you have probabilities, these
sources may be dependent or independent in the way these probabilities arise,
and you need to reason differently. This is something we understand less. We
have not gone beyond local-as-view. We have not even thought of an
implementation, because this is foundational work to figure out how the uncertain
data integration problem can be solved. What you would eventually want is a
specific application built on it rather than a [inaudible].
I just want to mention quickly that a lot of related work has happened in the
uncertain databases world and in the world of data integration, which we
leveraged, including some work about how uncertainty is introduced during data
integration and what you do with it.
One interesting insight was that uncertain data integration is useful even if you
have certain data and uncertain mappings. Our definitions still apply; they give
the correct answers. That is an observation made in another paper which is just
talking about data integration under uncertainty.
There's been work on data exchange in the uncertain space. We are leveraging
evidence theory now quite deeply, and there's a lot more work that has gone into
and contributed to what we did.
So we've finished one part of the talk, which is the theory part. Yes?
>>: Are there extractors that will give you probabilities in their answers, or --
>> Parag Agrawal: They give you scores. You get scores everywhere. You
cannot get probabilities. The only place where we know you can get probabilities
is, for instance, if you are using some sort of a forecaster -- say we have this
parameterized model forecasting sales in a region which says what our sales will
be. People use probabilistic models to do this forecasting. So for real-world
events, the data you get are usually scores. You can sort of map them into
probabilities, or find probabilities any way you like, but for forecasting it's easier
to think of the data you're getting as probabilities.
>>: So is there value in interpreting pseudo-probabilities derived from scores in
order to do the kind of inferencing you're doing on the uncertain data, and is there
reason to believe that that would be a useful thing to do?
>> Parag Agrawal: So that's a nice question for a philosophical argument, but let
me just give you a very short answer on that.
If you have exactly one number associated with each event, Cox's theorem says
that treating those numbers as probabilities is the only correct way of doing it if
you want certain nice properties. This is a very old theorem, Cox's theorem, which
is the justification of all of probability theory. If you allow only one number as a
score to represent your information about these events and you are to do
reasoning with that, probability theory is the only way which has all these
properties which are natural in [inaudible]. Okay? So that's sort of the argument
for why, whatever you have, if you are going to do anything with it, you should
somehow bring it into the world of probabilities and then deal with it. How to do
that is the hard part. And no one has really answered that. Okay?
So I'm really running low on time, so I'll quickly move on to the second part, which
hopefully you'll find interesting as well.
I'm going to rush through this part a little bit because I'm low on time, so please
interrupt me if I'm going too fast.
>>: [inaudible].
>> Christian Konig: So don't worry. We can go up to 12.
>> Parag Agrawal: Okay. Great. So fuzzy lookups is the topic of the second
part. I skipped through this slide earlier. The challenge here will be dealing
with efficiency in the presence of uncertainty.
So think of this part as the bottom part of that earlier picture: we will define the
semantics, the top part, in terms of possible worlds a priori, and then see how we
can represent and efficiently execute those semantics for efficiency in the lower,
representation part.
So fuzzy lookups is the topic of this part of the talk. Our motivating applications
for fuzzy lookups are record matching and local search. In general, fuzzy lookups
are based on a similarity function. You have a large reference relation and you
are trying to look up tuples which are similar to a target query. The reason
they're fuzzy is that they want to permit error tolerance: the query may not be an
exact match, an equality match, which you could do using a simple hash table.
And various similarity functions are used to this end -- commonly edit distance,
Jaccard similarity, Winkler distance, and so on.
The motivating applications we had for this were in the context of record
matching, so in a data cleaning platform you might want to do data cleaning
and entity resolution.
As a first cut to find candidates for things which might be the same record, you
want to do efficient lookups rather than doing pairwise comparisons across
[inaudible]. The other motivating application is local search. In both of these
applications we are thinking about conceptually small records. So we are not
searching documents, we are searching records. Okay? And this will become
important later.
Another key point is that there are various similarity functions which are useful in
various contexts. But set-based similarity lookups, like say Jaccard similarity,
can form a primitive for executing multiple such similarity functions. And that's
why, in a data cleaning platform or as a generic primitive, such similarity
functions are more interesting. And this was something identified in previous
work.
So now we are going to use a set-based similarity measure called Jaccard
containment, and let me try to define what it is. There are a bunch of you here
who know this part, so please be patient. Set containment is what regular
keyword search is usually thought of as: you have a reference relation, think of
the records in the reference relation, and you have a query. A record is in the
result if the entire query is contained in it.
So if all the words in the query are in your record, it's a set containment lookup.
In this case, we also have another word, pasta. Suppose we also have weights.
Now Jaccard containment is defined as: what fraction of the query is covered by
the record? Okay? So think of it this way: in this case the word pasta is not in the
record called Olive Garden. These are not probabilities, although they sum to 1
for convenience; the weights sum to 1, and weight .8 of them is in the record. So
a .8 fraction of the query is covered by this record, and hence the Jaccard
containment is .8. Okay? This is an asymmetric version of Jaccard similarity,
because we are now penalizing only by the size of the query, not by the size of
the union. Okay? So we are not penalizing large records in the reference relation
in doing our similarity.
So the problem of fuzzy lookups is to efficiently find all records whose Jaccard
containment value is above a threshold. So that's our lookup problem. Okay?
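A small sketch of weighted Jaccard containment as just defined; the tokens and weights are illustrative.

def jaccard_containment(query_weights, record_tokens):
    # Fraction of the query's total token weight that the record covers.
    covered = sum(w for tok, w in query_weights.items() if tok in record_tokens)
    return covered / sum(query_weights.values())

query = {"madison": 0.3, "sq": 0.5, "ny": 0.2}
print(jaccard_containment(query, {"madison", "sq", "garden"}))  # 0.8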
Now, let me introduce some uncertainty into it. We know that users sometimes
abbreviate square as SQ and New York as NY, and make countless other kinds
of variations. So what we conceptually have, let's say, is a collection of
transformations like this, which say that when a user says SQ, he might mean
square. So one way to think about it is that the name is uncertain, and we'd like
to make all of these names -- which are just representations of this name, given
what kinds of representations users employ -- searchable. So we'd like these four
possible worlds in some sense to be searchable.
And now we'd like to define Jaccard containment as the maximum Jaccard
containment over all possible worlds. So the Jaccard containment for this venue
is the maximum value over all possible worlds, which is the max match. Okay?
So this is the semantics we're attacking: you would like all records whose max
Jaccard containment over all possible worlds exceeds the threshold. Okay?
The challenge is efficiently processing this without actually creating all of this
large number of possible worlds, by using the compact representation. Another
way to think about this is that the database is certain and the query is uncertain.
So when the user says Madison SQ, NY, he means any one of these four
queries: when he says SQ, he could mean SQ or he could mean square, and
when he says NY, he could mean NY or he could mean New York. So now the
relation is certain but our query is uncertain. Okay? So conceptually, taking the
union of all these results is the same as the possible-worlds semantics that I
defined before.
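A hedged sketch of this "uncertain query" view: expand the query using a transformation table and take the maximum Jaccard containment over the expansions. The transformation table, weights, and names are assumptions for illustration.

from itertools import product

def jaccard_containment(query_weights, record_tokens):
    covered = sum(w for tok, w in query_weights.items() if tok in record_tokens)
    return covered / sum(query_weights.values())

def transformed_queries(query_weights, transforms):
    # transforms maps a query token to the alternatives it might stand for;
    # each alternative inherits the weight of the original token.
    tokens = list(query_weights)
    options = [[(t, t)] + [(alt, t) for alt in transforms.get(t, [])] for t in tokens]
    for combo in product(*options):
        yield {alt: query_weights[orig] for alt, orig in combo}

def max_jaccard_containment(query_weights, transforms, record_tokens):
    return max(jaccard_containment(q, record_tokens)
               for q in transformed_queries(query_weights, transforms))

transforms = {"sq": ["square"], "ny": ["new york"]}
record = {"madison", "square", "garden", "new york"}
print(max_jaccard_containment({"madison": 0.3, "sq": 0.5, "ny": 0.2}, transforms, record))  # 1.0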
So in general, this notion of transformations is extremely powerful. The reason is
that it can capture similarity which is not captured by any textual method, for
instance, or it can capture similarities which may not be amenable to the set
containment world or the set similarity world. Right? So this essentially lets us
program similarity functions by just creating, conceptually, a relation which keeps
information about synonyms. Okay?
So what we have is two kinds of error tolerance in this problem. Yes?
>>: [inaudible] synonyms, then couldn't uncertainty also be in the reference
relation?
>> Parag Agrawal: Yes. So conceptually -- okay. So we don't treat these as
synonyms; these are one-way things. You can conceptually have transformations
applied to both the query side and the reference relation side. So, yeah, that is a
valid question. In this talk we'll keep things simple and do it only on one side.
So now --
>>: [inaudible].
>> Parag Agrawal: Yes?
>>: So let's say that in the simple case where there's no transformation, just
Jaccard containment over, let's say, a large dataset, is this problem of efficiently
running Jaccard containment lookups well solved, or --
>> Parag Agrawal: There is a solution for that part of the problem called prefix
filter. Our solution, as you will see, will be more efficient than that solution. So,
yeah, even Jaccard containment without transformations is a hard problem, and
our solution will essentially improve on that problem while also solving the
transformations case. That's a very nice question. Yes. Okay. And the
experiments should show that. If not, please ask me again.
So now let's try to do Jaccard containment without transformations for a minute
and see one way we might be able to solve it. We have the query Madison SQ,
NY, and we'd like to do a Jaccard containment lookup with a threshold of .6. What
this says is that at least a .6 fraction of this query should be covered by any
record that is in the answer. Hence, one thing we can immediately see is that if
we issue these two red queries, Madison SQ and SQ NY, both of which have
weights greater than the threshold, and take the union of the results, that is the
correct result. The reason is that for anything to cover more than .6 of the query,
it would have to contain at least one of these two. Is that clear? If we only had a
query which is Madison NY, that's only .5, so we don't want to issue that query.
On the other hand, if we tried to issue the entire query, Madison SQ, NY, the
answer there is a subset of the answer of either one of these queries, so we'll
get all of those answers. So we'll get all answers with Jaccard containment of
either .8 or .7 or 1; those are the only three values Jaccard containment can take
on this query above the threshold.
Okay? So essentially, to issue a Jaccard containment query without
transformations, we can find a collection of set containment queries whose
union gives us the Jaccard containment answer. Okay? And this is how it's
done.
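One way to see why this works: the variants are exactly the minimal subsets of query tokens whose weight meets the threshold, and a record satisfies the Jaccard containment query iff it contains at least one of them. A naive sketch (not the paper's algorithm):

from itertools import combinations

def variants(query_weights, threshold):
    # Minimal subsets of query tokens whose weight fraction meets the
    # threshold; a record passes the Jaccard containment lookup iff it
    # contains at least one of these subsets.
    tokens, total, found = list(query_weights), sum(query_weights.values()), []
    for r in range(1, len(tokens) + 1):
        for combo in combinations(tokens, r):
            weight = sum(query_weights[t] for t in combo)
            if weight / total >= threshold and not any(set(v) <= set(combo) for v in found):
                found.append(combo)   # keep only minimal qualifying subsets
    return found

print(variants({"madison": 0.3, "sq": 0.5, "ny": 0.2}, 0.6))
# [('madison', 'sq'), ('sq', 'ny')] -- the two red queries from the slide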
In the presence of transformations this generalizes nicely. You conceptually apply
the transformations to get the transformed queries, the green ones, and for each
of those you again do the same process of finding set containment queries that
answer it. So you generate all transformed queries, the green ones, by applying
transformations, then you find their subsets as I defined earlier, and you take all
of these red queries, issue them, and take their union, and now that is the exact
result you'd like for Jaccard containment in the presence of transformations. So
this is conceptually one naive way of doing things. Of course you can see that
there's a large explosion here.
The red queries are what we'll call variants in the rest of the talk. These are
variants of the query, which are set containment queries whose union is the
exact answer for Jaccard containment with transformations. Okay?
So this gives us a naive solution outline: you're given a query; generate all of its
variants, which are set containment queries. For set containment queries we
know inverted indexes work well, so use an inverted index to answer each of
these queries, get the results, and take their union. The union essentially gives
you IDs. Fetch those from your reference relation and get the result. Okay?
So this is a system that can be easily built using this naive solution. What are the
problems with this naive solution? Why is it naive? One, too many variants. The
number of transformed queries can be exponentially large because it's
essentially a cross-product. Secondly, each set containment lookup itself can be
very expensive when we are having to do a lot of them to answer one request. If
we had to do only one set containment lookup, it is efficient enough, but when
you have to do a large number of them, we want to optimize that part as well.
One reason why it can be very expensive: think of a query which is Madison, NY.
Okay? Both of these are cities, so their lists are fairly large. There are a lot of
local places in Madison, there are a lot of places in NY. The intersection might be
really small -- things like Madison Square Garden, for instance. So doing an
intersection of two long [inaudible] lists is too expensive to get a very small result.
So we'll attack both of these angles. First let's see the intuition for how we might
not have to issue this large number of variants to solve the problem. So there's
intuition for issuing fewer queries. Again, the yellow is the original query, the
green one is a transformed query, and the red ones are the variants. We
conceptually would like to issue all variants to answer our query. Think of a blue
query, which is Madison. This arrow denotes that this query's result contains all
the results from these two red queries, okay, because if you have fewer keywords
you get more results, a superset of results.
So by issuing the blue query we ensure that we get all the results from those two
red queries. Similarly, by issuing this other blue query, NY, we get those two. So
at this point we can notice that by issuing these two queries we'll get a superset
of all results for all the variants. And this is what is called a variant covering.
Notice that fewer queries are possibly issued by doing the covering, but you can
get false positives. There will be results which you'll get by issuing both of these
queries that are not results of any of those four queries. Okay? So by issuing
fewer queries we might be able to get more results, which is sort of a trivial
statement, because by issuing the empty query you get the entire database. So
the empty query is always a variant covering. Okay?
So we'd essentially like to find a nice variant covering. And there are many of
them: there's another variant covering using square or SQ, there's a variant
covering with three queries here. So essentially the problem is finding a good
covering.
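A rough greedy sketch of what a variant covering computation might look like (CoverGen itself is different; this is only to illustrate the covering property that every variant must contain some issued sub-query):

def greedy_variant_cover(variant_list, candidates):
    # Pick sub-queries so that every variant contains at least one of them;
    # the union of their set containment results then covers the exact answer.
    uncovered, cover = [set(v) for v in variant_list], []
    while uncovered:
        best = max(candidates, key=lambda c: sum(1 for v in uncovered if set(c) <= v))
        if not any(set(best) <= v for v in uncovered):
            break                      # no candidate helps; issue leftovers as-is
        cover.append(best)
        uncovered = [v for v in uncovered if not set(best) <= v]
    return cover + [tuple(v) for v in uncovered]

four_variants = [("madison", "sq"), ("madison", "square"), ("sq", "ny"), ("square", "ny")]
single_keywords = [("madison",), ("ny",), ("sq",), ("square",)]
print(greedy_variant_cover(four_variants, single_keywords))  # [('madison',), ('ny',)]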
So what this indicates is a solution outline, which how you might be able to
change and improve it. One, as I hinted earlier, we're going to improve on the
inverted index to make it more efficient for doing set containment queries by
introducing something called the minimal infrequent index, which is a more
efficient version.
The second part will be -- will be doing an algorithm called CoverGen, which will
generate good variant coverings efficiently. The intuition is that we want to issue
fewer queries because we can't issue like an exponentially large number of
queries.
And the third part is that now we'll get a superset of the results, so we have to
do this additional step of verifying which are the correct results. Okay? Because
we are using a variant covering, we need to do this final phase of throwing away
results which actually don't qualify.
For this third part, there's an algorithm in our paper which uses maximal
matching -- maximum matching -- which I'll not discuss at all in this talk. That
algorithm essentially answers, given a record and a query, is the Jaccard
containment greater than the threshold, and does this efficiently.
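As a rough picture of what that verification decides, here is a minimal Python
sketch of the plain Jaccard containment test, using a standard definition,
|Q intersect R| / |Q|. It ignores transformations entirely; handling them is
exactly what the matching-based algorithm in the paper is for, and this sketch
does not attempt that.

    def jaccard_containment(query_tokens, record_tokens):
        # |Q intersect R| / |Q|: the fraction of the query found in the record
        q, r = set(query_tokens), set(record_tokens)
        return len(q & r) / len(q) if q else 1.0

    def verify(query_tokens, record_tokens, threshold):
        # keep a fetched record only if it clears the containment threshold
        return jaccard_containment(query_tokens, record_tokens) >= threshold

    print(verify({"madison", "square", "ny"},
                 {"madison", "square", "garden", "ny"}, 0.8))  # True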
So first let's talk about a minimal infrequent index. In a minimal infrequent index,
one intuition is we are not only going to create lists for individual keywords, we
are going to create lists for sets of keywords as well, which can be thought of as
queries. So instead of just having four lists for each of these four keywords you
could have a list which materializes the result for the query Madison, NY, which
is for a set. Okay?
But we are not going to generate all of them. We are going to be smart about
that. So what we'll have is a parameter A, a frequency threshold which says that
any list longer than size A is too frequent, okay, and we will not index it. So
we'll not index long lists which are longer than this parameter A. We'll only
index lists smaller than A.
In addition, we'll only index lists which are minimal among these infrequent
lists. So for instance, we'll not index the list for Madison Square, NY because
there is already an indexed list, Square, NY, which is small enough and which
contains all the results of Madison Square, NY. Okay? So we'll index minimal
infrequent sets, or minimal infrequent queries, for a threshold A. Okay?
So notice that we only index like three of these lists in this index, for example.
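Here is a minimal Python sketch of how such an index might be built level-wise,
Apriori-style; the data layout, names, and the brute-force posting-list
computation are assumptions for illustration, not the paper's build algorithm,
which leans on frequent-itemset machinery.

    from itertools import combinations

    def build_min_infrequent_index(records, A):
        # records: {record id: set of keywords}; A: frequency threshold
        # returns {frozenset of keywords: posting list} for minimal infrequent sets
        def posting(keyset):
            return [rid for rid, toks in records.items() if keyset <= toks]

        vocab = sorted({t for toks in records.values() for t in toks})
        index, frequent = {}, []
        for t in vocab:                          # level 1: single keywords
            s = frozenset([t])
            plist = posting(s)
            if len(plist) <= A:
                index[s] = plist                 # infrequent, trivially minimal
            else:
                frequent.append(s)               # too frequent: extend at the next level
        while frequent:                          # level k+1: extend frequent k-sets
            freq_set, next_frequent = set(frequent), []
            candidates = {a | b for a, b in combinations(frequent, 2)
                          if len(a | b) == len(a) + 1}
            for cand in candidates:
                if any(cand - {t} not in freq_set for t in cand):
                    continue                     # some proper subset is not frequent: not minimal
                plist = posting(cand)
                if len(plist) <= A:
                    index[cand] = plist          # minimal infrequent: materialize its list
                else:
                    next_frequent.append(cand)
            frequent = next_frequent
        return index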
Now, how is this index useful? Think first only of queries which have fewer than
A results. So think of a query such that we know it has less than A results in
its output. For this query, by definition there is a minimal infrequent set that
we have indexed whose list contains all of this query's results. We can find that
set, fetch all of those records -- fewer than A of them -- verify whether or not
they actually satisfy the query, and create our answer. In this example, Madison
Square, NY was our query and Madison Square is a list we have. And we're able to
answer this query by first quickly finding which this list is and secondly just
scanning this list and verifying each record in it.
Now, when we have queries with more than A results, you can think of an
exponential set of parameters A, 2A, 4A, which you can all maintain in the same
index. And now a query which conceptually has between A and 2A results will be
answered by the 2A index, a query which has less than A results will be answered
by the A index, and so on. Okay?
So what this says is: if a query has a lot of results -- more than A, or rather
more than A over 2 -- we are going to answer it in time at most two times the
number of results. But if it has, say, zero results, we still need time A to make
sure that it has zero results, or less than A results. So we have an
output-sensitive guarantee for doing set containment lookups. Okay?
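A minimal sketch of that tiered lookup, assuming indexes built as in the sketch
above with parameters A, 2A, 4A, and so on; the linear scan over indexed sets is
only for clarity, and a real implementation would locate the relevant list
directly.

    def containment_lookup(query, tiered_index, records):
        # tiered_index: [(A_i, index_i), ...] sorted by A_i, where each index_i maps
        # frozenset of keywords -> posting list (e.g. from build_min_infrequent_index)
        q = frozenset(query)
        for a_i, index in tiered_index:           # try the smallest tier first
            # an indexed set that is a subset of the query has a posting list that
            # is a superset of the query's exact-containment result
            candidates = [plist for s, plist in index.items() if s <= q]
            if candidates:
                shortest = min(candidates, key=len)      # at most a_i ids to verify
                return [rid for rid in shortest if q <= records[rid]]
        # no tier applies: fall back to scanning the whole relation
        return [rid for rid, toks in records.items() if q <= toks]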
So this is the index we will use for exact containment lookups in our solution.
The index will be useful in other ways as well because it has metadata about the
frequency of various queries. Since we've materialized queries other than just
single keywords, we have more knowledge about how big their results are, which
we'll use in the CoverGen part of the paper.
But also this index itself, which has an output-sensitive guarantee for set
containment, is of independent interest, because it provides essentially a
tradeoff between space and efficiency: to save query time, you obviously index
more things.
So we've seen the minimal infrequent index for efficient set containment. One of
the -- yes?
>>: [inaudible] index -- so do you make any assumptions on the way the data is
distributed for this index to be applicable?
>> Parag Agrawal: The index [inaudible] -- the size of the index can really blow
up --
>>: Realistically [inaudible] to me, realistically [inaudible].
>> Parag Agrawal: So -- okay. So if you have sort of small records, the index is
usually small. If you have large records, the index can get large as defined. But
another thing to notice is that this index is defined -- I used a definition of
the index which is frequency based. You can use any monotonic function instead of
just a frequency function, and that can be used to control the size of your
index. For instance, if you have long records, one thing we talk about in the
paper is that two things have to be close to each other in some distance for them
to be co-located in the index, instead of a frequency threshold.
Or you can say things like: the biggest set we have ever materialized has three
keywords in it and we'll not do five-keyword sets. So you can essentially choose,
for various settings, different monotonic functions to control the size of the
index.
We also did some investigation about how big the index would be for general data.
And there is evidence in the frequent itemsets literature to suggest that the
index is usually small, because not all combinations occur frequently. Okay? It's
the same intuition as why the Apriori algorithm is usually efficient. It also
generates all of these sets; in fact, it generates way more than the minimal
infrequent sets. But -- yes?
>>: Can you still [inaudible] index of single words too?
>> Parag Agrawal: We don't need it, but our index is essentially a superset of
that. We will definitely -- so think of it this way. Since we have an exponentially
large number of parameters, think of any individual keyword and its list. Its
frequency has to lie between two of these guys. So it will be minimal infrequent
for one of them. For something, right? Hence, that list will be materialized for
one of these.
So, versus an inverted index -- our index contains all the lists that are in the
inverted index, plus more. Okay? Another observation to make is that when we use
a parameter of A equals infinity as the lowest parameter, the index is the
inverted index. So the index for that parameter value is exactly the inverted
index, just because every individual keyword is minimal infrequent for A equals
infinity.
Yes. So I didn't -- we had a question about index size. We don't really make too
many contributions about index size and build. We basically use literature from
frequent itemsets to do our index build. And we have arguments to support why the
index is usually small and tools to enable that.
So the next part is how do we issue fewer queries -- how do we generate a
covering which is good. There are two things we are balancing here. One, we don't
want to verify too much; we don't want something like the covering being the
empty query: get the entire database and verify it. That's too inefficient. Two,
we don't want to spend all our time generating the covering itself. So these are
the two things we want to optimize for.
As I hinted earlier, the information from our index will feed into how we do our
CoverGen. From the index we have metadata which says that this query has this
many results, instead of just having that for single keywords. So we essentially
have the frequency distribution, in some sense, of our data.
Now, one approach, which is the obvious approach, is the set cover approach for
generating this covering. Let me try to formalize what this problem is. So we had
a query, a threshold, and a bunch of transformations, and we are able to generate
our variants from these. Given these variants and given the lists that we have --
we have a parameter A collection, a parameter 2A collection, a parameter 4A
collection, and these are all indexed -- we now also know the lengths of all of
these lists. The question is: can we cover? Now, Madison covers the first two
queries, Madison Square covers only the first query, Square covers the middle two
queries, and so on. So this is a set cover problem in the sense: can we cover our
variants using some lists that we have?
So the cost for each set is the length of that list, because we are going to
fetch all of those records and verify against them. So I'll argue the blue set
here -- the three middle queries -- is a variant covering because it covers all
of our variants. Okay. So the thing is that we can -- by knowing the cost from
our index -- yes?
>>: [inaudible] the cost -- you also have the CoverGen and index and covering
costs? Does that also cover [inaudible]?
>> Parag Agrawal: Okay. So [inaudible] -- this cost metric is for defining what
a good cover is. So this cost is essentially only measuring the goodness of our
covering in terms of false positives: how many false positives will you get as a
result of using this covering?
>>: Not [inaudible] -- not the only thing, right? I mean there are two other
aspects, the CoverGen and [inaudible].
>> Parag Agrawal: Absolutely. And that's why I'll say in a second that this is
not a good solution and then give you an alternative.
>>: So but if you plug in that cost function [inaudible] all these costs then you
can [inaudible].
>> Parag Agrawal: So that's possible. So the way we went about this is: we are
trying to get a good quality covering, and we're not accounting for how
efficiently we generate the covering. It's like saying in query optimization you
want a good plan, but for now we're not accounting for how expensive it is to
compute that plan. Okay? Think of these blue things as the plan. So the cost, as
we are defining it here, is exactly the verification cost, modulo overlap.
So recall how this is going to work. We're going to take all of these lists,
collect their IDs, fetch all of those records, which is again linear in this
cost, and for each of those records run the verification algorithm. So the cost
after generating the covering is proportional to the cost I have here. The cost
of generating the covering is something I'm not measuring in defining this,
okay? And that's actually where the set cover algorithm will essentially have its
downside. Okay?
>>: [inaudible] like a dominating portion of the cost.
>> Parag Agrawal: We'll see that in the experiments -- our experiments will show
the difference in where the cost goes. So okay. To recap, using metadata from our
index and a greedy set cover algorithm, we can now generate a variant covering
which has the usual set cover guarantee that you're not getting too many false
positives. There's a bound on the number of false positives you will get as a
result of doing this.
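A minimal sketch of that greedy step, assuming the index metadata supplies the
posting-list length of each materialized keyword set as its cost; this is the
textbook greedy heuristic for weighted set cover, not necessarily the paper's
exact implementation.

    def greedy_variant_cover(variants, list_costs):
        # variants: list of keyword sets (the query variants to cover)
        # list_costs: {frozenset of keywords: posting-list length} from index metadata
        # a list `s` covers variant `v` when s <= v (its results are a superset of v's)
        uncovered, cover = set(range(len(variants))), []
        while uncovered:
            best, best_newly, best_ratio = None, None, None
            for s, cost in list_costs.items():
                newly = {i for i in uncovered if s <= variants[i]}
                if newly:
                    ratio = cost / len(newly)   # false-positive cost per newly covered variant
                    if best_ratio is None or ratio < best_ratio:
                        best, best_newly, best_ratio = s, newly, ratio
            if best is None:
                raise ValueError("no covering exists with the available lists")
            cover.append(best)
            uncovered -= best_newly
        return cover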
So set cover has this nice property, which is a bound on false positives: you
will not necessarily verify a lot of things. The problem with this, as was
noticed, is that generating the covering itself may be expensive. Since, as I
said, there can be a large number of these variants, in the set cover algorithm
we are generating all of this large number of variants and then running a set
cover computation over them.
So the set cover instance itself is very large, and hence it might be too
expensive to generate, especially in the presence of a lot of transformations. So
to that end we have an alternative solution, what we call the hitting set
approach, which I'll not talk about in great detail. But here's the intuition. It
essentially generates a covering -- these blue queries to hand to the set
containment part -- without actually enumerating all the variants. So we can
guarantee that we will cover all our variants without ever generating these
variants. And this essentially builds on monotone dualization, or lattice theory,
as applied to frequent itemsets; this intuition exists in existing literature.
But one thing that's interesting is that we require our index to have this
property that it enumerates all minimal infrequent sets for this algorithm to
work, because it basically works on this lattice, going up and down, and it has
to hit the bottom. Okay? Another thing to note here is that while the variant
covering may be generated more efficiently than with set cover, it is heuristic
in the number of false positives. Set cover had this nice guarantee, a
theoretical bound on the number of false positives we'll have.
Here that bound no longer applies; it's heuristic. But in our experiments we'll
see how that affects us. So this is the second approach, which essentially
addresses the problem of generating the covering itself efficiently and not only
looking at false positives.
So to set up the one experimental result that I'll show you: we have a prototype
implementation as a stand-alone library. It uses an in-memory index, as you'd
expect.
So we used a bunch of datasets. The experiment I'll show you is from a places
dataset. It is probably similar to the motivating example we've been running
through. It has seven million records. We talked a little about the
transformations we used in this experiment; we go up to 20 million
transformations in different experiments. These transformations could be static,
which are of the form that Bob is Robert or SQ is Square and so on.
These are applied -- so there's a small number of these. Then there's a large
number of programmatic transformations like edit distance or abbreviations. Edit
distance transformations are everything within edit distance one or two of a
word. Abbreviations are: A could go to Alex, or J could go to Jack, John, Jim,
whatever. These are programmatically provided. They are not materialized in a
table anywhere but programmatically provided at query time.
The experiment I'll show you compares the hitting set approach with the set cover
approach, and we'll see how the time splits across the various parts of the
algorithm.
To compare to prior work: unfortunately -- or fortunately -- no prior work
handles the problem of Jaccard containment with transformations. As we mentioned,
there is prior work called prefix filter which handles the Jaccard containment
case without transformations. One observation you can make is that our hitting
set algorithm degenerates to this prior work if you don't have transformations.
Secondly, our index, when you use the parameter A equal to infinity, becomes
exactly the index that prefix filter uses, which is the inverted index. So we can
essentially implement prefix filter directly in our implementation, and this is
what we will compare to. Okay? This also gives us a natural way of saying how
prefix filter generalizes to the case of transformations: we use our
implementation with transformations and the index parameter equal to infinity.
Okay?
So this is the plot comparing these. There's a lot of setup here. On the X axis
we are changing the amount of transformations across the three sets of plots.
Each set has the three algorithms I've been mentioning: prefix filter, hitting
set, and set cover. On the Y axis we have performance; lower is better, quicker.
The three colors represent time spent in various parts of the algorithm. The
CoverGen is the red part, the lookup -- fetch from the index and verify -- is the
yellow, gold, whatever part, okay, and the overheads are the green part.
One thing to note for all three sets of plots is that the first plot uses less
memory than the second and third because it uses the regular inverted index,
while the second and third use our minimal infrequent index, which for this
dataset had a 20 percent overhead. Okay? So we're comparing performance, but
there is a memory difference between the first plot and the second and third.
So now let's see what this result says. When you have no transformations, just
the gain from using our index is essentially a threefold improvement in
performance. And the hitting set and set cover algorithms are essentially
similar -- since there's not a very large number of variants in general, just
from the Jaccard containment threshold, we don't see a difference. One thing this
says is that the hitting set heuristic works well in this case.
As we increase the number of transformations, everything becomes slower, which is
kind of expected because there's now a larger number of queries you want to
issue, more error tolerance, more results. But as the number of transformations
becomes really large, you start seeing the red part blow up for set cover. So
generating the cover becomes very expensive for set cover at that point. Okay?
And the hitting set has a significant benefit when the number of transformations
is large. Also notice that the hitting set's index-plus-verify part is only a
little bit more than set cover's. So while heuristic, in practice for this
experiment the generated covering was not bad in terms of false positives.
Yes?
>>: To what degree do these numbers rely on the fact that your index is in
memory, versus an index where navigating to the start of a posting list becomes
relatively much more expensive? How would these things change, and can you adapt
your algorithm to this?
>> Parag Agrawal: Okay. So let me repeat the question to see if I got it. If our
lists were stored on disk, there's an overhead for getting to the top of a list.
So the question is how does our algorithm fare relative to the others? That's a
good question.
I would say that our algorithms would actually become less efficient in this
case, because if you look at the inverted list approach, there's a keyword query
and there are transformations. So there's a set of keywords reachable from the
query, which is a small set -- linear in the size of the query and the number of
transformations you have. So if you have 10 transformations which go to different
endpoints, you have these 10 plus five, 15 essential keywords to look up. You can
essentially collect all of those lists and do the work in memory, in some sense,
to get the answer.
So the inverted index solution touches fewer posting lists. Our solution touches
far more posting lists than this. Hence, from that point of view, if you put it
on disk, my intuition suggests that our solution will become less efficient
because it has to touch a larger number of lists. So there's two opposing
factors, and my guess is that the prefix filter will sort of win. Okay. Yes?
>>: [inaudible] how many words do you have in [inaudible] experiments here?
>> Parag Agrawal: Query word?
>>: Yes. Unique keywords that you [inaudible].
>> Parag Agrawal: I actually [inaudible] that question. So I do not have an
answer.
>>: How many distinct words in the whole collection? How -- like inverted lists
how many lists would it have?
>> Parag Agrawal: So yeah. The conclusion of this experiment was that the hitting
set approach, when there's a large number of transformations, can help
significantly, and the effect of the heuristic on the index lookup and
verification that happens is not too bad.
Again, as before, there's a lot of related work that goes into this. There's work
on similarity functions; fuzzy lookups have been defined for various other
settings. There's obviously inverted index and list intersection work. There's
work from fuzzy autocompletion that seems relevant. Obviously we used a lot of
work from frequent itemset mining. And maybe more that I forgot to list here.
So we've seen the two main parts of our talk. Allow me to give you one-slide
overviews of the trendy topics part, to promote discussion. The first one is: how
do you use SQL inside what people are calling NoSQL systems? So we started with a
NoSQL system called PNUTS. This is essentially a data store for serving Web
content. What this system does is range scans and point lookups. The key
ingredients of this system are that it works at very large scale with horizontal
scalability and very low latency for Web serving. It does geographic replication
to get low latency for workloads which are global. It gives weak consistency
guarantees at a per-record level rather than cross-record coordination.
The system is designed to be highly available and very fault tolerant; it thinks
about a bunch of failure modes and tries to make sure that the data stays safe
and performance is not badly affected under various kinds of failures.
What we did was add a notion called replica views, which essentially enables you
to do a richer set of SQL on this rather than just the simple functionality:
secondary access, simple joins -- equi-joins -- and some group-by aggregates.
So there's a lot of fine print as to what part of SQL you can do. You cannot do
ad hoc SQL; you can do only predeclared SQL, and you can only do [inaudible] SQL.
But we claim that those are the interesting parts for Web serving. I'm not going
into the fine print and will just leave you with the good stuff. The key idea was
replica views, which helped us not have to worry about all those concerns at
scale: by treating our views as PNUTS tables, the same way replicas of PNUTS
tables were treated, we were able to get all the same consistency abstractions,
we were able to get the whole geographic replication, and we were able to get the
fault tolerance. So the cool idea was that by just reusing and conceptually
thinking of our views as replicas, which are just accessed differently, you get a
lot of benefits in terms of abstractions and engineering. Okay? So this was work
I did with Yahoo!
A second one was shared scans in a MapReduce-like system. The context was that
Yahoo! has a Hadoop cluster with a bunch of datasets, some of them very large.
Pig sits on top of this, and there are queries being written over all of these
datasets continuously.
One observation is that a large part of the time spent by this system was spent
analyzing a few very popular datasets -- there's a bunch of queries attacking the
same datasets. So there's an opportunity to share work across these queries: one
scan can answer, or help answer, multiple queries. So that's the opportunity.
The way we went about it is we modelled how to anticipate these query arrivals,
because essentially to do the sharing you have to make one query wait while
another one arrives which it can share with. So by anticipating query arrivals we
were able to build a priority-based scheduler, which can help you do this scan
sharing to enable efficient computation of these queries.
The challenge there is that this increases throughput while possibly increasing
latency as well. So our contribution was to define a metric for how to share
scans, because in theory you can just make everyone wait for a long while and you
get more sharing, so conceptually more throughput, but the latency increases.
Okay?
So we defined metrics for how to do this scan sharing, and we had a paper about
that in VLDB.
So now I've spoken about two parts in detail and two parts with just one slide to
promote discussion. Let me just throw up a bunch of other work that I've done.
I'm currently working with folks at AT&T on a data mining problem, which is how
do you summarize databases. I was in a study group essentially to make a case for
RAMClouds, essentially a PNUTS-like system except that all your data is in
memory. It's still durable, but it's always served from memory. And this
introduces a bunch of challenges and a bunch of opportunities; we essentially ran
a study group to figure this out, and we made a case for RAMClouds with a large
number of people at Stanford.
And then a bunch of other work on uncertain data which is not about integration
but about core uncertain data.
So with that, let me throw up a list of my coauthors over the last few years.
I've worked with people at Yahoo! Research, Microsoft Research, AT&T Labs, and
Stanford Computer Science outside of the InfoLab, and a bunch of people at the
InfoLab. I'd like to thank all of them, as well as you guys for listening. Thank
you very much. And I have like a minute for questions. [laughter].
[applause].
>> Christian Konig: Okay. If we have one question, we'll take that. Otherwise
we'll just stop here. I think we'll stop. Let's thank the speaker again.
[applause]