>> Vivek Narasayya: My name is Vivek Narasayya, and it's my pleasure to introduce
Hongrae Lee. So Hongrae started out in his career by working in the area of nuclear
engineering at Seoul National University. You're all wondering how he got to computer
science. So the story is that in Korea you're required to do mandatory military training.
But as an interesting catch, you don't necessarily have to do the training. You could
substitute that by working in IT companies for a few years.
So Hongrae spent about six years, including two years of his own start-up, working in IT
companies, and then got sort of hooked into computer science. He decided that his
passion was going to be to do research in the area of computer science, and then he got
into -- he did a master's at Seoul National University and decided to come to UBC to do his Ph.D. in computer science, more narrowly in the area of database systems. And he's working with Professor Raymond Ng, who is his advisor. So today Hongrae will talk
to us about selectivity estimation of approximate predicates on text.
>> Hongrae Lee: Thanks, Vivek. Nice introduction. Good morning, everyone. Thanks for coming. Today I'll be talking about selectivity estimation of approximate predicates on text. This is mostly joint work with my supervisor Raymond Ng and Professor [indiscernible]. Let's just start with motivation. With the advancement of technology, we have more and more text data in our relational databases. For instance, [indiscernible] names, comments and profiles. And they often co-exist with structured data.
Here is one example of [indiscernible]. As you can see, there is basic structured information, and there is also text information, such as personal interests or his favorite TV shows and favorite movies. And there are many more examples, like reviews, product information, customer databases. Basically, text data is virtually everywhere.
The difficulty in handling text data is that it can be [indiscernible]. This is a news article about typosquatting. The idea is that people buy misspelled variants of popular URLs and then place ads on them, and those ads are often paid for by the original site owner to redirect the traffic. It turns out that this is quite a lucrative business. My point is not about finding some URLs and making money. My message is that people make typos all the time. And it may not be just typos; people use different textual representations all the time. For instance, if you go to a baby name website, you can see people using different spellings for similar names all the time.
There could be many reasons for this, and one of the main reasons is user input. In the user profile example, the text data is generated by user input: the user actually types in his personal interests or favorite movies, and it's very easy to make typos.
In this profile there are two typos. It should be 12 Monkeys, not [indiscernible], and the spelling of Prison Break is not the correct name of the TV show.
With the presence of this type of error, exact matching fails to deliver the intended information. For instance, when you click a link in this profile, what the system does is go to the database and retrieve the related information, such as fan pages, people who like the same movie, or comments left by your friends, based on exact textual matching.
For 24, there's no typo. If you click that, you get thousands of pages from multiple databases: pages, people. What about if you click Prison Break? Then you get almost 0 hits from the database. In this case there happens to be one fan page with the wrong spelling, but apart from that you miss thousands of interesting pieces of information. Due to this prevalence in relational databases, approximate matching, or fuzzy matching, I'll use those terms interchangeably in this talk, meaning finding something similar, has recently been incorporated into relational databases: Microsoft SQL Server full-text search, Oracle Text, the IBM DB2 text search engine. They first started from full-text search, and now they're supporting more advanced features. And there are many things we need to successfully support this approximate matching on text. First of all, we need some index structure that enables efficient lookup.
For approximate text matching, each vendor in the previous slide has its own infrastructure; I'll call them collectively the domain index in this talk. We also need operators that can utilize the index, be integrated with other operators, and then actually process the query.
For approximate text matching, again the vendors of commercial databases have their own versions of the operators. A popular choice for many is CONTAINS. And another crucial thing that we need is selectivity estimation. I'll give you more detail in the next slide.
For ordinary predicates you have a histogram or something like that. For this, to the best of my knowledge, only ad hoc solutions have been employed, using a constant or a very simple heuristic. This is obviously less than ideal, since we have more and more text data and the data are not clean. And this will be the main focus of my talk: the selectivity estimation of approximate [indiscernible] on text. I won't talk about indexes or operators; I assume that they exist, and they do exist, and I'll focus on the selectivity estimation.
Okay. At a very high level, you can think of selectivity like this: the number of tuples satisfying a given predicate. Why does this matter? It affects various optimizer choices, such as the access method, index scan or table scan. It affects the choice of algorithms for join operators and the join ordering, and so on throughout the query optimizer. Many choices depend on the selectivity, and poor selectivity estimation can lead to suboptimal plans. So let's see an actual example.
This is on a publication citation database in a commercial database. You have author names and paper titles.
Now suppose you want to find a paper by a certain author in 2008, and you may not be sure about the spelling, or the data is not clean. Here I express this with "maybe equal", which means the name is similar to this spelling. To define this predicate, we need to specify a similarity measure, how you measure the similarity, and we also need a threshold on how similar.
And this is an actual query in this specific database. You don't need to memorize any of this; it's just to give you a feeling of how it may look. Here the similarity measure is based on edit distance, the user specifies a threshold, and it may have other optional parameters.
Now, what plans are possible with this approximate, fuzzy predicate? Here there are two selection predicates. One is an equality predicate on year, and the other, name similar to INDYK, is a predicate on the text.
Depending on the selectivity, the query optimizer generates different plans. If you play with those parameters, it generates different plans. Here is one example, focusing on the choices related to the fuzzy text predicate. Here it uses the domain index that supports the fuzzy matching, and the optimizer chose to use a nested loop. And this is just one example.
Depending on the selectivity, you may choose to do a table scan with a hash join, while in other cases the checking may be embedded in a filter with a nested loop. Or again you use the domain index, but with a hash join.
So as you can see, just like the other predicates, the fuzzy predicate is embedded in the plan and used in diverse shapes. Does it matter whether we choose one over the other? You will see an example. This is again a simple query: name similar to micro, in year 2008. I'm hiding the actual, more complicated query. In this particular test database, its selectivity is like this; there are that many tuples satisfying the predicate in the database.
And this is the optimizer-generated plan. Ideally this is supposed to be the best plan, or close to the best. But there are many plans that beat this plan. For instance, this plan is much faster than the optimizer-chosen plan. That is, the optimizer produced a suboptimal plan. So why is this happening? If you track down the reason, the identified reason is that the estimated selectivity of the fuzzy predicate, it may not be clear here, but it is far off. You can see there's a huge, orders of magnitude difference in the selectivity.
So the optimizer got the selectivity wrong, and that's why the [indiscernible] plan. And this is just one simple example; in reality, it can be [indiscernible] bad. Again, poor selectivity estimation on fuzzy predicates can lead to the choice of a suboptimal plan, and the results can be arbitrarily bad.
And this is the motivation of why I studied the selectivity estimation of approximate predicates on text. The first motivation is query optimization in [indiscernible].
But the utility of selectivity information in general, the scope of this problem, is not just important in the RDBMS optimizer; it has a broader meaning. For instance, in the literature, selectivity information has been emphasized as a key tool for choosing algorithms. In information extraction and summarization tasks, it has been emphasized that depending on the selectivity, one method may be better than some other algorithm.
And even in querying structured RDBMS [phonetic], depending on the query selectivity, one plan may be better than another plan. So it affects the choice of algorithm [phonetic].
So this serves as my second motivation on why I study the selectivity estimation problem.
With this motivation, I studied the selectivity estimation of two kinds of operators in RDBMS. The first kind is selection operators, and the second kind is join operators. For selection, I studied strings and then substrings; I'll explain how they're different in a later slide. For join, I first studied [indiscernible] join, and then I studied a more generalized similarity join. For today's talk I'll be focusing on the first three technical contributions.
Let me give you a little bit of context on related work. This shows the techniques on selectivity estimation: on the left, selection; on the right, join. First, there are substring selectivity estimation techniques, but they only handle exact matching. They don't consider errors or variants.
Going to the approximate work, the string problem was proposed first, and my first contribution is in that category. I'll compare my work with the prior art there. There is also [indiscernible] selectivity estimation; now the data are sets, but they're related, as you will see.
The string problem is a special case of the substring problem. Because of the additional complexity of approximate matching, the special case, strings, was studied first, but the more natural extension of exact matching is substrings. So then I studied substring selectivity estimation with fuzzy predicates. This was for selection. For join, there are many works on join selectivity estimation. Most of them are random sampling-based algorithms, but they come in very diverse flavors, like join sampling, [indiscernible] approximate query answering, and estimation in the streaming environment. But these are more related to the relational setting.
Going to the approximate work, my work is the first published work on similarity join size estimation. And it is also a generalized, join version of set selectivity estimation.
I will start with the string problem, then the substring problem, and then the join problem. For each of them I will deliver one key idea with a sample of the results. Okay, let's start from strings. One of the most popular similarity measures on strings is edit distance. It's defined as the minimum number of edit operations, insertion, deletion, and replacement, to convert one string to another.
Here is an example. For instance, to convert one spelling of Sylvia to the other you need one replacement and one insertion, so their edit distance is two. The problem statement is: given a query string and a threshold tau on edit distance, what's the number of strings similar to the query string within the edit distance threshold? Here is one example database, and you can ask questions like: how many strings are similar to Sylvia within an edit distance of 2? Here are several variations, and you want to count these strings, the number of these strings.
And the major challenge, as you will see shortly, is that there are too many possible extensions. In this talk I'll use the term extension to denote a possible variant generated using the edit operations.
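To make the problem concrete, here is a minimal sketch in Python (not the speaker's system) of the quantity being estimated: the standard dynamic-programming edit distance, and a brute-force count of the strings within threshold tau of a query. The little database and the names are invented for illustration; a real estimator exists precisely to avoid this full scan.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, replace)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                             # delete from a
                         cur[j - 1] + 1,                          # insert into a
                         prev[j - 1] + (a[i - 1] != b[j - 1]))    # replace
        prev = cur
    return prev[n]

def true_selectivity(db, query, tau):
    """Exact answer the estimator tries to approximate: too slow on a large table."""
    return sum(1 for s in db if edit_distance(s, query) <= tau)

db = ["sylvia", "silvia", "sylvie", "sally", "olivia"]
print(true_selectivity(db, "sylvia", 2))  # counts strings within edit distance 2
```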
This is a one-slide, very high-level overview of what's going on in exact matching. Here is a string database, and you can ask questions about string frequencies. You can ask questions like: what's the frequency of DBMS, how many times does it appear in the string database? One important data structure is the q-gram table. A q-gram is simply a substring of length q. For example, if you build 2-grams from "index", you get IN, ND, DE, EX, et cetera. The q-gram table stores substrings of length q or less with their frequencies. So this entry means DB appears this many times as a substring in this database.
Of course, you can also have tree-based structures such as a trie or a suffix tree. And there are substring selectivity estimation algorithms that can answer these questions using these data structures. I use this count notation to mean the number of strings that match that form.
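As a rough sketch of the q-gram table just described, here is one plausible in-memory version; the dictionary layout and the convention of counting each gram once per string are assumptions for illustration, not necessarily what the actual systems store.

```python
from collections import Counter

def build_qgram_table(strings, q=2):
    """Count how many strings in the database contain each substring of length <= q."""
    table = Counter()
    for s in strings:
        grams = set()
        for length in range(1, q + 1):
            for i in range(len(s) - length + 1):
                grams.add(s[i:i + length])
        for g in grams:          # count each gram once per string
            table[g] += 1
    return table

db = ["index", "dbms", "dbmss", "database"]
table = build_qgram_table(db, q=2)
print(table["db"])   # number of strings containing the 2-gram "db"
print(table["in"])   # number of strings containing "in"
```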
Now, with this, let's focus on a special case. There's no insertion and no deletion; we only allow replacement. Considering the query DBMS and a threshold of one, we want to count the number of strings that can be converted to DBMS with at most one replacement. So what would be the naive solution?
The naive solution could be: first we enumerate all possible extensions, ABMS, BBMS, and so on for [indiscernible], and for each of them we estimate its selectivity using the technique in the previous slide, using the q-gram table with a substring selectivity estimation algorithm. That gives one selectivity per extension.
Well, what's the problem with this approach?
>>: I have one clarification. So I'm wondering how you estimate each one of them using this q-gram table.
>> Hongrae Lee: Yeah, so that's what these algorithms do. Yes. This algorithm --
>>: Previous work?
>> Hongrae Lee: Yes, that's previous work. If you are curious, I can give you more detail, but that would probably be too much. The idea is that you don't generally see the whole, very long string. You have very summarized information, like all the substrings of length two or three; that's all you have. Now the query could be longer than that. So the previous algorithms can estimate based on those substrings. That's what they do. But the problem is there are too many extensions. Maybe it's okay in this case, but what about this? The query is longer now and the [indiscernible], and then there are many more extensions, and it takes 20 seconds just to sum them up. In short, this is not scalable. For comparison, we want to do this in milliseconds.
Now, to cope with the challenge of too many extensions, our first idea is to use wildcards. Here the wildcard represents any single character, so it abstracts a replacement or an insertion. Before giving the technique, let's see how it can be implemented in the data structure. This is the [indiscernible] q-gram table, but now we extend it with wildcards. So it has all the entries of the q-gram table, and now it has additional entries like DB?, which means: what's the number of occurrences of this form, DB followed by any single character?
Let me give you a few highlights. It can support queries with wildcards. For instance, this question is: how many strings are of this form, starting with DBM followed by any single character? Combined with the q-gram table and the previously mentioned substring estimation algorithms, we can answer this type of question.
And it naturally supports [indiscernible] because it doesn't touch existing entries; it just has additional entries. You can implement this with tree-based structures as well; it's straightforward. And the size is rather compact. Of course, it depends on the choice of q. If you want a compact structure, you can go to less than one percent; but in general, if you want more accuracy, then you can go to around ten percent. So it's up to you.
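Here is a minimal sketch of the wildcard extension, again with invented data: on top of the ordinary grams, we also register grams in which one position is replaced by '?', standing for any single character, so counts for forms like DB? can be looked up directly. For brevity this sketch adds only single-wildcard entries; the structure in the talk also supports forms with more wildcards, such as DB??.

```python
from collections import Counter

def build_extended_qgram_table(strings, q=3):
    """q-gram table augmented with wildcard entries such as 'db?' ('?' = any one char)."""
    table = Counter()
    for s in strings:
        grams = set()
        for length in range(1, q + 1):
            for i in range(len(s) - length + 1):
                g = s[i:i + length]
                grams.add(g)
                for p in range(length):               # extra entries: one wildcard position
                    grams.add(g[:p] + "?" + g[p + 1:])
        for g in grams:
            table[g] += 1
    return table

db = ["dbms", "dbmt", "dbs", "rdbms"]
t = build_extended_qgram_table(db)
print(t["db?"])   # strings containing 'db' followed by any single character
print(t["d?m"])   # strings containing 'd', any character, then 'm'
```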
This idea may seem deceptively simple, but if you look at the data structures proposed by other work, they employ very heterogeneous structures just to support the fuzzy matching; they don't go well with the existing data structures. Some of them rely on a q-gram table, but they add positional information or extra information, and they don't easily plug into the existing data structures.
Now, with this wildcard, our next approach is like this. We do the same thing, but now, using wildcards, we have fewer forms. Then we can sum up all the counts. The point is that now we have much fewer counts, so it's feasible. So do you see the problem here?
>>: How many times do we add [indiscernible]?
>> Hongrae Lee: Yes, this is coming next. But --
>>: [indiscernible].
>> Hongrae Lee: Yeah, exactly. There are overlaps. Well, in this case, maybe it's okay, you can consider all of them. But you see it gets more complicated. And does it matter? Yes, it can make an order of magnitude difference. But let me delay that result until SSJoin, because there we have the same problem.
Now this may be okay, but here we allow two replacements. We have six possible forms for tau = 2, and now the overlaps get a little complicated. You see? So what we want to do is to compute the union size. We want to count the number of strings that match any of these six forms, and we want to count each string only once.
>>: It's not just that, because wouldn't you need to know how many [indiscernible] in the index?
>> Hongrae Lee: Yes. There is some limit, generally, on the edit distance. You're not talking about an edit distance like 200 or 300; generally it's one, two, three. If you need a very large edit distance, there are probably much better choices. So here, of course, there's some limit, like four or five; that's reasonable. We're not targeting DNA sequences or Web documents.
Considering the union is not difficult; there is the inclusion-exclusion (IE) principle for it. For instance, the size of A union B union C is: first we add the individual sizes, then we subtract the sizes of the pairwise intersections, and then we add back the size of the intersection of A, B, and C. We can do the same thing here; there are intersections. And notice that we can compute the intersection sizes. For instance, the first term counts the number of strings that satisfy both forms. What are the strings that satisfy, say, DB?? and also D?M?? Those are exactly the strings of the form DBM?. So we can intersect; it's like drilling down to a more specific case. We know the specific form, and then, using the q-gram table with the substring selectivity estimation algorithms, we can estimate this size. It's an estimate.
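To illustrate the inclusion-exclusion step on the replace-only case, here is a small sketch that unions the one-replacement wildcard forms of DBMS. It counts exactly on a toy database instead of plugging in the q-gram estimates the talk uses, and it shows why the naive per-form sum overcounts.

```python
from itertools import combinations

def matches(s, form):
    """True if string s fits a wildcard form of the same length ('?' = any char)."""
    return len(s) == len(form) and all(f == "?" or f == c for f, c in zip(form, s))

def intersect_forms(forms):
    """Intersection of same-length wildcard forms, or None if they conflict."""
    out = []
    for chars in zip(*forms):
        fixed = {c for c in chars if c != "?"}
        if len(fixed) > 1:
            return None
        out.append(fixed.pop() if fixed else "?")
    return "".join(out)

def union_size_by_ie(db, forms):
    """|F1 u ... u Fk| via inclusion-exclusion; exponential in the number of forms."""
    total = 0
    for r in range(1, len(forms) + 1):
        for subset in combinations(forms, r):
            inter = intersect_forms(subset)
            if inter is not None:
                count = sum(matches(s, inter) for s in db)
                total += (-1) ** (r + 1) * count
    return total

# One-replacement forms of "dbms" (replace-only, same length).
forms = ["?bms", "d?ms", "db?s", "dbm?"]
db = ["dbms", "dbmt", "abms", "dbs", "xyzw"]
naive = sum(sum(matches(s, f) for s in db) for f in forms)  # counts "dbms" four times
print(naive, union_size_by_ie(db, forms))                   # naive sum > true union size
```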
Okay, this is fine; we get the answer. The only problem is that it is exponential. So how do we compute this exponential IE formula? Our second idea is the extension lattice. Our goal is to simplify this IE formula using a lattice. Let's follow step by step and see what's happening in the IE formula. Here we have six nodes; you can think of each node as the set of strings that satisfy that form. And notice that we have a node for each of the original sets in the union. Now, let's follow the intersections. What's the intersection of these two? We already saw the intersected form in the previous slide, and that's DBM?.
So we replace it with DBM?. What would be the next one? The set of strings that satisfy this form and this form; that's the intersection between the first and third. The intersection is the same thing, so we can apply the same substitution again. And this one is the intersection between the second and third, the same thing again.
So as you can see, there are many duplicate entries here. And what about the intersection of three sets at the second level? That's again DBM?.
So there is some regularity in this IE formula, which we capture using the lattice. If you do the whole thing, this is all you get; this is the aggregated result in the lattice. So the idea is to focus on the contribution of this set, the set of strings that satisfy this form. It appears three times in the [indiscernible], the ones in the previous part. So the net contribution of that form is minus two times its size.
Okay, so our idea is: we started from the original IE formula, and using this lattice we get a much more simplified form, and now the number of terms is much more manageable. Generalizing this idea gives us the replace-only formula, which actually gives the exact answer for the Hamming distance case.
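Here is a rough sketch of the collapse the lattice achieves: instead of evaluating every subset term of the inclusion-exclusion formula separately, we group subsets by their intersected form and keep only the net coefficient per distinct form, the way DBM? in the talk's example ends up with a single coefficient of minus two. Note that this toy version still enumerates subsets to find the coefficients; the point of the lattice in the talk is to derive these net coefficients directly, without the exponential enumeration.

```python
from collections import Counter
from itertools import combinations

def intersect_forms(forms):
    """Intersection of same-length wildcard forms, or None if positions conflict."""
    out = []
    for chars in zip(*forms):
        fixed = {c for c in chars if c != "?"}
        if len(fixed) > 1:
            return None
        out.append(fixed.pop() if fixed else "?")
    return "".join(out)

def collapsed_ie_terms(base_forms):
    """Net inclusion-exclusion coefficient per distinct intersected form."""
    coeff = Counter()
    for r in range(1, len(base_forms) + 1):
        for subset in combinations(base_forms, r):
            form = intersect_forms(subset)
            if form is not None:
                coeff[form] += (-1) ** (r + 1)
    return {f: c for f, c in coeff.items() if c != 0}

forms = ["?bms", "d?ms", "db?s", "dbm?"]
print(collapsed_ie_terms(forms))
# Each base form keeps coefficient +1, and the fully specified form "dbms"
# absorbs all the higher-order intersection terms into one net coefficient.
```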
>>: So can you ensure it's always going to be a linear number of terms?
>> Hongrae Lee: Yes, this is linear in the number of base terms. Of course, if you consider an arbitrary edit distance it can blow up, but in terms of the number of base forms, yes, it's always linear. Let me stop here on the details. This was one key idea in the string work, and today I presented this part. For the general edit distance case we have OptEQ. Let's see a sample result. We compared our technique with the state-of-the-art algorithm, the prior art; this shows accuracy, so the lower the better. As you can see, OptEQ is much more accurate.
>>: Did you compare that with [indiscernible], just taking a sample?
>> Hongrae Lee: Yeah, sure. Sure. [indiscernible] random sampling.
>>: Random sampling.
>> Hongrae Lee: Yes, random sampling. We also considered random sampling; it's not shown here. Random sampling is always the baseline method. But the thing is, string comparison can be CPU-intensive; comparing strings and computing the distance may not always be cheap.
With random sampling, when you take a random sample and just compare values, it's fine; but when you think about similarity comparison, it's generally CPU-intensive. And of course we compared our technique to random sampling, and it also comes up in the substring problem as well.
For OptEQ there are two bars, and the right one uses more space. As you can see, there is a clear trade-off between space and accuracy. This is not necessarily the case for the other data structures. Let me move on to substrings. Now, if you look at the problem statement, it looks very similar. The difference is that now we want to count --
>>: Does the technique work for [indiscernible] distance also? The technique for Hamming distance, does it also work for edit distance?
>> Hongrae Lee: Sure, yeah, yeah.
>>: Maybe we can --
>> Hongrae Lee: Yeah.
>>: [indiscernible].
>> Hongrae Lee: OptEQ handles the general case; the replace-only formula is just the key idea. And again, the lattice structure is the key idea behind it; otherwise it's super-exponential, very hard. Okay, going back to the substring problem. This is the only difference: we count the number of strings that contain some substring similar to the query. Maybe an example will be easier. Here are titles, and now we can ask questions like: how many titles contain something similar to this string?
So let me distinguish the two problems. In the string problem, you match the whole column value. So if you want to query a longer column like title, you still have to input the whole column value, which may not be ideal. But with substrings, you only specify that the column contains some query string; this is like a general LIKE predicate. The counterpart of this in exact matching is the LIKE predicate with wildcard patterns.
In the string problem, of course, there are many possible extensions, but there may be some natural clustering of strings; this is a common underlying idea behind the solutions for the string problem. But in the substring problem, there are many more substrings you need to consider, and they may be overlapping or correlated, so the counting gets a lot messier.
For substrings, we have two solutions. The first is a simple one; it's very much a heuristic. The second one is a more refined one, with more data structure. Let me give you the key idea and intuition behind the first one, without going into the technical details.
For the first one, we generalized the SIS assumption from exact matching. In substring selectivity estimation for exact matching, there is the SIS (short identifying substring) assumption, which states that a string tends to have an identifying substring. Here is one example: if some string contains eatt, it's likely to be Seattle. Of course, it could be something else, but for a large fraction of the strings, the enclosing string is Seattle. This assumption is used for improving the accuracy.
In approximate matching, we have an extended version, the generalized SIS assumption, which states that there are identifying extensions, like typical variations; you can think of them as common typos.
For instance, if you consider the query string Sylvia and then find all strings within edit distance one, there are other forms, but these three forms explain most of the true answers. This one form alone explains more than 70 percent of the answer count. Of course, in theory millions of extensions are possible, but in reality only a limited set of variations actually occurs. So this is the key idea behind our first, simpler solution. For substrings I'll just give you an overview and then conclude this part with the results.
We have a simple solution. Again, it uses the q-gram table, and it's based on the generalized SIS assumption. It estimates what would be the most probable form and then scales it up. That's the idea.
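As a hedged illustration of the simple solution's idea, estimating one identifying form and scaling it up, here is a toy sketch. The hand-picked gram, the 70 percent coverage factor, and the counting convention are all assumptions for illustration, not the actual MO estimator.

```python
from collections import Counter

def build_qgram_table(strings, q=3):
    """Number of strings containing each substring of length <= q."""
    table = Counter()
    for s in strings:
        grams = {s[i:i + L] for L in range(1, q + 1) for i in range(len(s) - L + 1)}
        for g in grams:
            table[g] += 1
    return table

def mo_style_estimate(table, identifying_gram, coverage=0.7):
    """Estimate #rows containing something similar to the query by taking the
    count of one identifying form and scaling it up by its assumed coverage."""
    return table[identifying_gram] / coverage

titles = ["sylvia plath", "silvia smith", "sylvie and bruno", "solar power", "sylvia ii"]
table = build_qgram_table(titles, q=3)
# 'ylv' is used here as a hand-picked identifying gram for queries like 'sylvia';
# a real estimator would pick the most probable extension form from its statistics.
print(mo_style_estimate(table, "ylv", coverage=0.7))
```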
The second solution, LBS, extends the q-gram table with min-hash signatures, which come back in SSJoin. Then we can consider multiple extensions.
MO considers just one form, so obviously it's simple, but it may not be robust. LBS considers multiple extensions and their correlations and overlaps. This is a performance highlight. We compared random sampling adapted to the substring problem against MO and LBS. The first chart shows the accuracy, the relative error, and the second shows the runtime; for both, lower is better. As you can see, the proposed technique is a clear winner on both metrics.
>>: One question before you go on to the next slide. The SIS assumption that you mention, it looks applicable even to the earlier problem, where you are not estimating the frequency of substrings but rather of [indiscernible] strings. So why didn't you use it there? Is it something with respect to substrings that makes it particularly suited for SIS, or --
>> Hongrae Lee: No. Here you identify a substring and estimate the selectivity of the enclosing string. Hmm, how can I put this? Can you explain more how you would --
>>: I don't know. If you have an estimate, I can just focus on eatt, but adding the extra [indiscernible] I'm actually [indiscernible].
>> Hongrae Lee: Then maybe the difference is just that we are counting substrings. This considers substrings.
>>: Exact matching versus approximate. If you have a solution for substrings, can you use it for exact matching also?
>> Hongrae Lee: Yes, yes. Yes. Another question?
>>: That's fine.
>>: How do you define L2? Is it L2 with respect to the true --
>> Hongrae Lee: The true selectivity, yes. Over the true count; the absolute error with respect to the true count.
>>: Absolute true count?
>> Hongrae Lee: Yes.
>>: Do you know what the distribution of the actual selectivity is? Is it fairly close to
normal?
>> Hongrae Lee: No, it's not close to normal.
>>: Very skewed.
>> Hongrae Lee: In general it's very skewed. And for substrings, generally the [indiscernible] is higher, the values are bigger.
>>: So what is the dataset you used for that?
>> Hongrae Lee: We used DBLP. And sometimes the IMDB data.
>>: IMDB.
>> Hongrae Lee: Author names, [indiscernible] titles.
>>: What are the queries? What's the sample of queries you're estimating over?
>> Hongrae Lee: The sample queries are sampled from the database.
>>: Uniformly?
>> Hongrae Lee: Yeah, uniformly. Okay. Let's move on to the join problem. So why set similarity join? Finding all pairs of objects that are similar is one of the very important operations: near-duplicate document detection and elimination, correlation detection, duplicate record detection. For instance, with addresses you may want to find duplicate addresses, and SSJoin, set similarity join, has been proposed as a general framework for such operations. The idea is that you represent an object with a set.
Then you can run a set join algorithm. For example, you can represent a document with a set of words or n-grams, transform each document, and then compare the sets. That's the basic idea of set similarity join. And I studied the size estimation of the SSJoin problem. Again, there are index structures, the selectivity affects the choice of algorithm, and it can be embedded with other predicates. As you will see in the next slide, the selectivity changes dramatically depending on the threshold, so this estimation is even more important.
For sets, one of the popular similarity measures is Jaccard similarity, which is defined as the size of the intersection over the size of the union. The problem is: we have as input a collection of sets and a threshold tau on Jaccard similarity, and we want to count the number of set pairs that satisfy the given threshold. Here is one example. We have five sets, and with this threshold the answer is two, because there are two pairs satisfying the given threshold.
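To pin down the quantity being estimated, here is a minimal brute-force sketch of the SSJoin size under Jaccard similarity. The five sets are invented, and the quadratic scan is exactly what an estimator is meant to avoid on real data.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Size of intersection over size of union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def ssjoin_size(sets, tau):
    """Number of pairs whose Jaccard similarity is at least tau (exact, quadratic)."""
    return sum(1 for a, b in combinations(sets, 2) if jaccard(a, b) >= tau)

collection = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "b", "c", "d"},
    {"x", "y"},
    {"x", "y", "z"},
]
print(ssjoin_size(collection, 0.5))  # counts pairs at or above the threshold
```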
The challenge is that the join selectivity changes dramatically depending on the input threshold. Here's one example on DBLP: when the threshold is .1, there are more than 100 billion pairs, but when the threshold is .9, there are only about 42K pairs satisfying the given threshold.
>>: The selectivity is over the cross product [indiscernible]?
>> Hongrae Lee: Over the cross product, yes. Here the selectivity is shown as a fraction. You can see the selectivity becomes extremely small as the threshold goes higher, which means random sampling is very hard. You can adapt traditional join sampling algorithms to this problem, but they work only for similarity around .4 or lower; they don't work for the higher similarity range.
But in general we are more interested in high similarity thresholds, between .5 and .9. Another difficulty: in ordinary join size estimation, if you know the value distribution of the join column, you are done. For instance, if you know a value appears 10 times here and 20 times there, you know that there are 200 pairs joining on that value, so you don't need to actually compare the 200 pairs. But in similarity join it doesn't work like that; you need to actually compare the pairs.
So this is the background of our technique. We use min-hash signatures. Let me give you an idea of what a min-hash signature is. A signature, you can think of it as a representation of an object. Comparing the original objects may be expensive, like documents, you can imagine. A signature is a vector of values, and you can compare the signatures instead and still achieve the same thing. That's the idea behind signatures.
Depending on what kind of comparison you need, many signature schemes have been developed, and for Jaccard similarity, the min-hash signature is one of the popular choices. I'll skip the details, but the idea is that a min-hash signature is a vector of values. Here is the original set, and here the signature size is 4, which means it's a vector of four values. And it preserves Jaccard similarity, which means you can estimate the Jaccard similarity of the original objects just by looking at the signatures.
In this case, you can see there are matches at two positions, so without looking at the original objects, just by looking at the signatures, you can estimate the Jaccard similarity as two out of four.
This is the original database, and we perform the analysis on this signature representation of the database. So we work on this representation here.
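Here is a small, self-contained sketch of min-hash signatures. It uses Python's built-in hash with per-position salts as the hash family, which is only one convenient choice and not necessarily what the talk's system uses; the property illustrated is the standard one, that the fraction of matching signature positions estimates the Jaccard similarity.

```python
import random

def minhash_signature(s: set, seeds) -> list:
    """One min-hash value per seed: the minimum salted hash over the set's elements."""
    return [min(hash((seed, x)) for x in s) for seed in seeds]

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
seeds = [random.random() for _ in range(128)]  # signature size 128 (4 in the talk's example)

a = {"data", "base", "system", "query"}
b = {"data", "base", "system", "text"}
sa, sb = minhash_signature(a, seeds), minhash_signature(b, seeds)
print(estimated_jaccard(sa, sb))  # should be close to the true Jaccard of 3/5 = 0.6
```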
For SSJoin, I'll give you just the key idea. The key insight is that there is a relationship between the SSJoin size and the number of frequent patterns in the signature database.
Let's start from one single pattern. A single pattern tells us about some number of pairs satisfying the given threshold, and if you consider multiple patterns together, that gives the SSJoin size. That's the idea.
This is the signature database. If you look at the signature database, you may observe that the same values occur at some positions over and over again. In this case, 4 and 3 occur at the first and second positions in all these sets. So if you think of a signature as a transaction and a value at a position as an item, you can define patterns just like in a transactional database.
So we define a signature pattern like this: [indiscernible] position, which means that 4 and 3 occur at the first and second positions.
And you can define the usual quantities, like pattern length. Here the pattern length is two, and the support count is three, because it matches three sets, the signatures of [indiscernible] sets. So what's the relation between this and the number of pairs? Observe that these are three sets. If you pick any two sets, for instance R1 and R2, or R2 and R3, or R3 and R1, they have 4 and 3 at the first and second positions in their signatures, which means for any of these pairs the estimated similarity is at least two out of four. So this is a single pattern; of course, there are more patterns.
The pattern is 4, 3, X, X and the support count is three, and it tells us that there are three pairs, three because you can choose any two, three choose two, satisfying tau = .5. We know they match at least at the first and second positions; we don't care about the other positions. They may match at more positions; that's fine. This is the connection between a single pattern and a number of pairs.
If you consider multiple patterns, that gives us the SSJoin size. So here tau is .5, and we first find all patterns with length at least two. These are the results in the previous example. The first pattern is the pattern I just mentioned. Each of these gives some number of pairs. For instance, here the support count is three, which means three choose two: there are three pairs satisfying .5, and they match at least on 4 and 3. Each of these patterns tells us some number of pairs.
And any pair satisfying the threshold matches at least one of the signature patterns, so one naive solution could be to just sum them up. Again, do you see the problem? Yes, it's because there are overlaps among the counts. For instance, this pair, the pair of signatures R1 and R3, matches this pattern and this pattern and this pattern. So again there are overlaps. Our observation is that by slightly twisting the previous definition, there is again some lattice structure here, and then we can consider the union.
So we have lattice counting that considers the union count. This is the key idea of the join problem.
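Here is a toy sketch of the pattern-to-pairs connection, with an invented signature database. For a threshold that requires at least two matching positions out of four, each pattern of length two with support s promises s choose 2 qualifying pairs; summing over patterns double-counts pairs that match several patterns, which is the overlap the lattice counting corrects. The exact union is computed here by brute force only to show the gap.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

signatures = {          # toy signature database, signature size 4
    "R1": (4, 3, 7, 1),
    "R2": (4, 3, 2, 1),
    "R3": (4, 3, 7, 9),
    "R4": (5, 6, 7, 1),
}
min_len = 2             # tau = .5 with size-4 signatures needs >= 2 matching positions

# Mine signature patterns: (positions, values) -> supporting signature ids.
support = defaultdict(set)
for sid, sig in signatures.items():
    for positions in combinations(range(4), min_len):
        support[(positions, tuple(sig[p] for p in positions))].add(sid)
patterns = {pat: ids for pat, ids in support.items() if len(ids) >= 2}

naive = sum(comb(len(ids), 2) for ids in patterns.values())   # overlapping counts summed

qualifying = {pair for ids in patterns.values() for pair in combinations(sorted(ids), 2)}
print(naive, len(qualifying))   # naive sum exceeds the true number of qualifying pairs
```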
Let's see the --
>>: Did you say which patterns you need to consider? Obviously there are many patterns, some of length one, some of length two and three. Is there any comment on which patterns you're looking at?
>> Hongrae Lee: Given a threshold, that tells us what the pattern length should be. The similarity is related to the pattern length; in this case the length is two, and it tells us that any pair from here matches at least two positions. So if the threshold is .5, you don't care about patterns of length one. Given the input threshold, we know the minimum length we are interested in, and we just mine that information.
>>: The statistics you have, are they precomputed or something [indiscernible]? You only have this -- if I set tau to .4, then the information [indiscernible].
>> Hongrae Lee: Yes. In the signature database we can mine this efficiently; actually, at runtime we can get the information.
>>: You have this as part of the index?
>> Hongrae Lee: Yes, but it can be small. We can maintain a small sample of the signature database.
>>: Sample database.
>> Hongrae Lee: Of course, if it's too small there may not be many interesting pairs, so one percent or five percent. So given the threshold --
>>: Let's say we were only interested in thresholds greater than .25. Then you don't need to keep the entire signature.
>> Hongrae Lee: Yes. All the information we need is the count information; you don't need to actually store the patterns themselves. As you can see, the support counts and the number of patterns are all we need.
>>: Support, like, are you looking for the frequent patterns?
>> Hongrae Lee: Yes, we are looking for the frequent patterns that satisfy the length constraint. One way to get that information is to run a frequent pattern mining algorithm, which can be very efficient. And second, just last year there were independently reported techniques for estimating the number of frequent patterns, entirely independent of this work; in theory you can use those techniques. The key reason this works is that we don't need the actual patterns, we only need the counts.
>>: But you don't need the entire --
>> Hongrae Lee: A sample, yes.
>>: But one specific thing: for the estimation, what is the technique you use [indiscernible] connected to this join solution? Can you use your technique for that?
>> Hongrae Lee: For join?
>>: Yes, the special case of join.
>> Hongrae Lee: One way is, for instance, let's say we want the pairs with similarity exactly .5. One naive way is we estimate the size for .5 and then subtract the size for .6.
>>: No, here -- output one.
>> Hongrae Lee: Tau = .5 means you want to count the pairs with similarity greater than or equal to .5. Let's say you want to find the exact number of pairs that are at .5; you're not interested in .6, .7, or .8. Of course, one possible approach is to first get the answer for .5 and subtract the answer for .6, discretizing the similarity space.
>>: But his question is different. I think his question was: let's say you make tau very high, close to 1. What happens? How do the techniques compare?
>> Hongrae Lee: So we are not sampling here. In this experiment we stored the whole signature database, so we know the answer. But as I said, in principle we can store samples, right? So if we run the algorithm -- yeah, the point is that this process is much more efficient than random sampling. For instance, if you run random sampling and use pairwise computation it doesn't work, but our technique is fast, so we can keep a larger sample size and we don't miss the true pairs. Does that answer your question?
>>: We can discuss it later. Question: so you described the technique for the self-join case. If you have two tables, what do you do?
>> Hongrae Lee: With two tables, with additional overhead, you can extend it to [indiscernible] join. For instance, one naive way would be: you estimate the size on R alone, on S alone, and then on the combined R and S; then you [indiscernible] and you get the cross pairs. Those are the answers. So it's possible with additional overhead. Do you see my point? That's one possible procedure. But in many cases the self-join is the case of special interest; in many cases the application is a self-join.
Let me show you the results. This shows the actual set similarity join size for each Jaccard similarity threshold. We compared our technique with an adapted [indiscernible]; basically you can think of it as random sampling. The black line is the true answer size, the red is random sampling, and the blue is ours. As you can see, when the threshold is low, it works fine. But when the threshold is high, there are not enough true pairs; random sampling misses a lot of them. Here you can see our technique closely follows the true answer.
And here, what if we don't consider the overlap? This shows the relative error, the accuracy. The first bar, the tall dark green bar, shows the error when you don't consider the overlap. As you can see, it can make about two orders of magnitude difference if you don't consider the overlap.
Another point: we used signatures, which enable efficient counting, but there is a price for it. This shows the actual join size computed by exact pairwise matching, and then the join size computed using the min-hash signatures. The x-axis is Jaccard similarity and the y-axis shows the join size. The red line shows the true size; there we actually compared the original sets. The blue line again compares all the pairs, but this time using the min-hash signatures. As you can see, there is a distribution shift, so there's a price for using min-hash counting. The point is that using min-hash signatures for measuring similarities is fine, but when you use them for counting, you have to be very careful, because there can be a distribution shift when the dataset is skewed.
I can give you more detail after the meeting. We identified the reason, and we have a correction step that corrects this overestimation; when you correct the error, we can reduce the error. Again, this part looks short compared to the first part, but it can also make an order of magnitude difference.
Okay. Summary. For the string problem, I proposed the extended q-gram table and the replace-only formula, and then OptEQ for the general edit distance case. For the substring problem, we have a simple version that again uses the q-gram table; but if you are given more space, we can augment it with min-hash signatures and then consider correlations between multiple extensions. For SSJoin, again we use min-hash signatures, the replace-only formula is used as a subroutine, and there are more routines for improving the efficiency and accuracy.
I studied selection and join operators, and my personal feeling is that, if carefully done, selectivity estimation for approximate text predicates is possible with reasonable space and runtime overhead. One nice feature is that it can be nicely integrated with existing data structures. Let me conclude with other research and future work. There's another work, the last chapter of my technical contributions; it's about a more generalized version with performance guarantees, and I plan to make all the results public. I've also done some work in [indiscernible], mostly through my internships here. In the first I studied efficient detection at runtime, and last summer I studied parametric query optimization; traditionally people have been interested in minimizing the average cost, but in this work we also consider the variance. And there's another work on document clustering.
For future research, as a direct extension of my Ph.D. work, I'm interested in text processing in RDBMS in general.
One difficulty is that approximate matching is one thing, but there are a bunch more issues; if you start considering text, it gets very difficult, with things like soundex, synonyms, abbreviations, or other distance measures. So it could be very challenging to support a general framework.
I'm also interested in robust query optimization. This is influenced by my internships here. In query optimization there are many unsolved problems, like how to deal with unknown statistics, UDFs, or [indiscernible] environments and dynamically changing environments. This is quite challenging. And automating DBMS management is also interesting. There are highly paid jobs just for tuning the parameters in an RDBMS; at least to me, that's not an ideal situation.
And recently some people seem to hate databases; there are new structures proposed these days, like key-value stores or MapReduce. It's not that I believe all the buzzwords out there, but I think there is some truth in their approach. So I'd be very interested in making RDBMS more scalable, learning from those techniques. Okay, as a conclusion: there is a growing amount of unclean text data, and RDBMS can better support text with the selectivity estimation techniques I presented. Thank you.
[applause]
>> Vivek Narasayya: Let's thank the speaker.
[applause]