>> Christian Konig: Thank you very much for coming. It is my pleasure to introduce Bolin Ding from UIUC. He's a fifth-year Ph.D. student. He's now a veteran of a total of four internships at Microsoft, three of those in the DMX group at MSR Redmond, and he's worked on, as you will see, a wide range of topics, including data mining, privacy, and work on sensor nets. And he's going to talk today mainly about his work on processing information retrieval queries, text search and text analytics.
And with that, I'll hand it over to Bolin.
>> Bolin Ding: Okay. Hello, everyone. Welcome to my talk. And as Christian
said, my talk today is about how to support efficient and effective keyword
search. So before going into the major part of the talk, I will first give you
a very high level overview of my Ph.D. research.
So there are two major streams. The first stream is based on structured data, including relational databases, data cubes and networks. I study how to build efficient index structures and data models. [indiscernible] on the same data models, two kinds of problems: one is keyword search and text analytics, and the other one is data mining.
The second stream of my work is about data privacy. Because we want to encourage people to share their data, we need to promise that their data is safe in our indexing model. My work in this stream provides privacy-preserving search, query and mining.
And during my internships here, I worked on [indiscernible] with Christian. And in school I also collaborated with people from other groups to transfer data mining and database techniques into different areas like sensor networks and this other [indiscernible].
But in this talk, I will focus on this stream. In particular, it's about indexing and data models for different types of structured data to support keyword search.
Here is the outline of this talk. The first part will start from a very basic operator to support keyword search and text analytics on plain text. It is the [indiscernible] operator. And this work was done during my internship at MSR.
And then in the second part, I will discuss how to incorporate taxonomy into search and analytics. Simply put, a taxonomy is a hierarchy of terms, and a query can be rewritten by replacing terms with instances; for example, company can be replaced by IBM, Google, Microsoft.
So the goal of this work is to optimize an index structure to process such query rewritings in an efficient way. This work originates from the so-called ProBase system, which is led by Haixun Wang at Microsoft Research Asia. And our technique for optimizing the index is now being integrated into a part of the system. And it's my pleasure to introduce this work in public for the first time, because it was just accepted by SIGMOD this year.
And then for the third part, I will introduce our system, called text cube with top cells, to handle structured [indiscernible] on plain text. It is based on a collection of works which was originally funded by NASA. Before we finished this funding, we built a system where I acted as the team leader. This system was so exciting to the NASA people that they gave us another round of funding to add new features to it. And I will give you some details on this system later.
So overall, this is the first part, second part and third part. And this is the outline.
Let's go to the first part for a start. The first part is about the basic operator called set intersection. It's very important in many applications like keyword search, data mining and maybe query classification. From reports by Google, it is said that speed does matter in web search. There are many factors that can affect the latency, like the network status and other ranking [indiscernible], but set intersection is one of the most important operators. So we study how to improve the performance of this operator. And in particular, 100 milliseconds matters. We'll show how our approach can improve the performance by this much in most cases.
So set intersection plays an important role in keyword search, of course. A simple example: we have lists of documents which contain some terms. This so-called inverted index is constructed in the pre-processing. In the online processing, once the query comes, the output is the set of documents containing all the terms. So it essentially is simply the intersection of these two lists. This provides us a motivation to study the set intersection operator.
And our algorithms and index structure will focus on an in-memory compact index. It aims to provide fast online processing algorithms. The contribution is two-fold. On one side, in theory, it provides better time complexity. But more importantly, in practice, it has robust performance.
Simply put, compared to other techniques for this operator, our approach is the best in most cases, and otherwise it is close to the best. The improvement is around, I mean, we can reduce the processing time to one half in most cases. So this is the so-called "best in most cases" case.
And here's the complexity, but I will introduce it later. So there are different types of related work. This is the very simple one: we just merge two sorted lists. And these three are designed to handle the case where one side is very large and the other side very small.
And this one is a very recent work from the theory community. It has very good performance guarantees in theory, but it performs very badly in practice.
So let's come to the basic idea of our approaches. In the pre-processing stage, we first partition the lists into small groups. To remind you, every element in the lists could be from a large universe, depending on how large the document [indiscernible] is. But we partition them into small groups first. And then each small group is mapped from the large universe to a small universe, using the same hash function H for all the small groups.
And the size of the range is W, where W is the word size. Here's a simple example: for each small group from the large universe, we map it into a small universe from 1 to W. So if W is 16, then the hash value is a number from 1 to 16.
This image in the small universe can be encoded as a word in a computer. It simply means if 1 is in the image, then the first bit is set to 1; if 6 is in the image, then the sixth bit is set to 1. So in this way, we can compactly represent the hash image.
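To make the encoding concrete, here is a minimal Python sketch of a hash image packed into one machine word; the function names and the toy hash are illustrative assumptions, not the talk's actual implementation.

```python
W = 64  # word size: each hash value lands in [0, W)

def hash_image(group, h):
    """Encode the hash image of a small group as a W-bit word:
    if h(x) = i for some element x in the group, bit i is set to 1."""
    word = 0
    for x in group:
        word |= 1 << h(x)
    return word

# Toy hash function (assumption: any pairwise-independent hash into [0, W) would do).
h = lambda x: (x * 2654435761) % W

g1 = [103, 522, 980]
g2 = [245, 522, 771]
# One bitwise-AND tests the two hash images for a possible intersection.
if hash_image(g1, h) & hash_image(g2, h) == 0:
    print("hash images disjoint: this pair can be skipped for sure")
```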
Now, there are three observations. The first one is that if a pair has an empty intersection, then the hash images will have an empty intersection too, with high probability. Of course, how large the probability is depends on how we partition them and how we select the hash function. But I will introduce that later.
The second fact is more obvious. It says that if the intersection of the hash images is empty for some pair, then we can skip this pair for sure, simply because if there are some common elements in this pair, then they must be mapped to the same entry in the hash image, because we use the same hash function for all pairs.
And the third one is that typically, the intersection size is much smaller than the list size, because otherwise we would simply materialize this pair in the index. So in most cases, we found that the intersection size is relatively small.
>>: [inaudible] how are the small groups [inaudible]?
>> Bolin Ding: Oh, good question. So there are different ways to partition into small groups, but suppose W is the word size. A good choice for the group size is the square root of W. So if W is 64, then the square root is 8, so a group size of 8. And I will show you why this choice makes sense.
>>: [inaudible]
>> Bolin Ding: One partitioning scheme is based on sorting. But the other is based on hashing. So I'll introduce both a bit.
In online processing, we first compute the intersection of hash images. If it is empty, as we previously said, we can skip. But otherwise, we need to compute the intersection of the two small groups. In this example, the intersection of the hash images we found is the green one, which is not empty. But note that since they are encoded as a word, the intersection of the hash images can be done more efficiently using a bitwise-AND.
And now there are two questions: how to partition, and how to compute the intersection of two small groups.
The first partitioning method is called fixed-width partition. In this one, we first sort the two lists, L1 and L2, and partition them into equal-size groups, with square root of W elements in each group, where W is the word size.
Now we compute the intersection of two small groups using this subroutine. But we only need to consider some subset of all the pairs. Simply put, if a pair has an overlapping range, we compute its set intersection. Otherwise, we don't need to, like this one and this one: the maximum value of one is smaller than the minimum value of the other. So there are at most (N + M) over square root of W pairs to be considered; actually, there is a factor of two.
So for each pair, how to process it quickly? This is our subroutine called QuickIntersect. In pre-processing, we map each group into a small hash word, the hash image. In online processing, we first compute the intersection of the hash words using one operation, bitwise-AND. And for each one-bit in the result, we go back to check whether the two elements that are mapped to that value are the same. If yes, we add this element into the result. Otherwise, if these two are different, then they are not in the intersection result.
So essentially, the cost for this step is the number of one-bits in this hash word. As I said, there are two cases: one is these two elements are indeed in the intersection; the other case is these two elements are different but are mapped to the same entry. This is the so-called bad case. The good news is that the number of so-called bad cases is bounded by one for each pair in expectation. That's the very key reason why this approach performs well.
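Here is a hedged sketch of what such a QuickIntersect subroutine could look like; the structure and names are mine, and it assumes for simplicity that the hash function has no collisions within a single group (the real algorithm handles in-group collisions).

```python
def quick_intersect(g1, g2, img1, img2, h):
    """g1, g2: small groups; img1, img2: their precomputed hash-image
    words; h: the shared hash function."""
    common = img1 & img2            # one bitwise-AND over the hash images
    if common == 0:
        return []                   # empty hash intersection: skip for sure
    by_hash1 = {h(x): x for x in g1}
    by_hash2 = {h(x): x for x in g2}
    result = []
    while common:
        bit = (common & -common).bit_length() - 1  # lowest set bit
        if by_hash1[bit] == by_hash2[bit]:
            result.append(by_hash1[bit])           # truly common element
        # else: a "bad case" -- two distinct elements colliding under h
        common &= common - 1                       # clear the lowest set bit
    return result
```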
So let me be clearer about why this is true. Here's a bit of analysis. The number of bad cases for each pair is bounded by this one. In particular, if two elements X1 and X2 are different, the probability that they are mapped to the same hash value is bounded by 1 over W. And if we consider all the pairs of elements, since the group size is square root of W, the sum is at most 1. So given that, the total complexity is this: the number of bad cases across all the pairs sums up to this factor, and the intersection of the hash words for all the pairs of small groups sums up to this factor. So overall we get this one. And we can optimize the parameters a bit to get a better complexity by selecting a different group size.
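Spelling out that argument in notation of my own (a reconstruction from the definitions above, not the slide's exact formula):

```latex
\Pr[h(x_1) = h(x_2)] \le \tfrac{1}{W} \quad \text{for } x_1 \ne x_2, \qquad
\mathbb{E}[\#\text{bad cases per pair}]
  \le \sum_{x_1 \in g_1}\sum_{x_2 \in g_2} \Pr[h(x_1) = h(x_2)]
  \le \sqrt{W}\cdot\sqrt{W}\cdot\tfrac{1}{W} = 1.
```

Summed over the O((n + m)/sqrt(W)) candidate pairs, this gives an expected total of O((n + m)/sqrt(W) + r), where r is the intersection size.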
The second partitioning scheme is based on randomized partition. The first question is why we need randomization. The reason is simply that in the fixed-width partition, note that these two sets are partitioned into three groups each. For the fixed-width partition, we need to consider the six pairs. But we will show that in the randomized partition, only three pairs need to be considered, which could be more efficient.
So it works in this way. For the two sets, we use another hash function G, called the grouping hash function, to map all the elements to a universe with size equal to the number of groups. For each list, the elements that are mapped to the same value are grouped together, so that we have this number of groups for each list.
Now for the two sets, we only need to consider the pairs of groups which are mapped to the same value according to the grouping function G, for a similar reason as before. And again, the key question becomes: for each pair, how to compute the intersection efficiently? And again, we use the same subroutine, and it works exactly the same as the previous one.
The analysis is also similar. The key part is to bound the number of bad cases for each pair, which can be shown to be 1: the probability here is at most 1 over W, and the number of pairs of elements sums up from here, so it's at most 1. And by a similar analysis, we can get this complexity.
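A rough sketch of this randomized grouping, reusing the quick_intersect and hash_image sketches from above (again, names and structure are my assumptions; in the real index the hash images would be precomputed):

```python
from collections import defaultdict

def partition_by_hash(lst, g, num_groups):
    """Group elements of one list by the grouping hash function g."""
    groups = defaultdict(list)
    for x in lst:
        groups[g(x) % num_groups].append(x)
    return groups

def randomized_intersect(l1, l2, g, h, num_groups):
    """Only aligned pairs (same g value) can share elements, so at most
    num_groups pairs of groups are ever considered."""
    p1 = partition_by_hash(l1, g, num_groups)
    p2 = partition_by_hash(l2, g, num_groups)
    result = []
    for key, grp1 in p1.items():
        grp2 = p2.get(key)
        if grp2 is not None:
            result.extend(quick_intersect(grp1, grp2,
                                          hash_image(grp1, h),
                                          hash_image(grp2, h), h))
    return result
```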
By the way, this [indiscernible] can be generalized to K-set intersection. Now, the two approaches are better approaches in theory, but in our experiments, we found that --
>>: You said generalized to K-sets.
>> Bolin Ding: Yes.
>>: How do you do that?
>> Bolin Ding: So for K-sets, it's quite similar. There are two sets here. If there are other sets, then very simply [indiscernible] we consider the triples of groups with the same G value. So that means in the first step [indiscernible] triples. And then we need to revise the [indiscernible] subroutine. It's also not so complicated: in this step, we compute the intersection of three hash images, and similarly, for each one-bit, we go back to the three lists to check whether the elements are the same.
>>: You mentioned the [indiscernible].
>> Bolin Ding: Um-hmm. You mean partitioned?
>>: So is this algorithm, is it a distributed [indiscernible]?
>> Bolin Ding: This is a good question. So I'll answer it in two ways. The first one: on each partition, we can apply this technique, because --
>>: [indiscernible].
>> Bolin Ding: Yes, yes, we get the same results. We get the same result even if this algorithm is applied in each partition; [indiscernible] we get the same result. And the second answer is, as you can see, we use a grouping function here. This function can even be used to direct us on how to do the partitioning on different machines. But this is a little bit out of the scope of the talk, okay.
>>: So you place a lot of emphasis on [indiscernible]. Why don't you just
have one hash [indiscernible] and spray them across a bit vector and chop the
bit vector up [indiscernible] process?
>> Bolin Ding: I'm sorry. Your question?
>>: You're hashing. So why don't you just generate one large bit vector, chop
it up into pieces, and do the intersections of bit vectors in pieces? I don't
understand what the role of the grouping is doing for you when you're using
hashing for those steps.
>> Bolin Ding: Oh, I see. So you just said we can simply do the intersections for this part, this part?
>>: Yeah, you map them into one large bit vector and just chop it up into pieces and do the pieces one after the other. Isn't that equivalent to what you're doing?
>> Bolin Ding: Interesting. So this is very similar to our practical version of the second algorithm. This subroutine seems to be complicated; we have this kind of structure. And in our practical version, we show that we can simply do the independent merge on these pairs, right. But the hash word is still useful, because we can first test whether this hash image has an empty intersection. If yes, we can skip this pair.
>>: That would be true if you just chopped the result and the resulting bit vector up into pieces and processed it a piece at a time.
>> Bolin Ding: That's possible, yes.
>>: All right.
>> Bolin Ding: But okay. Yeah. But yes, that essentially means that our approach can be applied in a distributed or parallel style, right. And I will show you that the practical version is very similar to what you described. In this version we found, well, [indiscernible] cheap, so we just chop the lists into segments, and for each segment, we merge to compute the intersection. And we found that this simple algorithm is even faster in wall-clock time. And we really [indiscernible] that, so that's why we analyze it this way. But we add a bit more features on that. That is, in the pre-processing, we still map each pair of groups using a hash function, and we use one additional word to store the hash image.
So before we process each pair of [indiscernible], we first check whether the hash images have a non-empty intersection. If the intersection of the hash images is empty, we can skip this pair. But otherwise, we use a simple merge algorithm to compute the intersection of this pair.
So this is the idea. We don't use this structure anymore. Instead, if the intersection of the hash images here is non-empty, then we do a linear scan to merge; otherwise, we skip this scan. And an extension to this framework is to add more hash functions. It simply means we add more words for each pair, so that we scan a pair only if all these pairs of hash words have a non-empty intersection. If one of them is empty, we can skip that pair.
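A minimal sketch of this practical variant, with one or more hash words per pair acting as filters before a plain merge (illustrative code, not the measured implementation):

```python
def grouped_merge(g1, g2, imgs1, imgs2):
    """g1, g2: sorted segments; imgs1, imgs2: lists of precomputed
    hash-image words, one per hash function (filter)."""
    # Skip the pair if ANY filter proves the intersection empty.
    for w1, w2 in zip(imgs1, imgs2):
        if w1 & w2 == 0:
            return []
    # Otherwise fall back to a plain linear-scan merge.
    out, i, j = [], 0, 0
    while i < len(g1) and j < len(g2):
        if g1[i] == g2[j]:
            out.append(g1[i]); i += 1; j += 1
        elif g1[i] < g2[j]:
            i += 1
        else:
            j += 1
    return out
```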
So why does this work well? Because this kind of hash function can be used to filter the pairs of groups with empty intersection. And I will show you the power of this filtering is very significant, in the sense that this kind of event, called failed filtering -- that is, the intersection is empty, but we have a non-empty intersection for all the hash words -- happens with very low probability, which is bounded by a constant.
Because of that, we can use [indiscernible] functions to further reduce this probability. And to show this constant, we did some analysis, but I will skip it here. So now the total running time of this algorithm consists of three parts.
The first one is computing the hash functions and computing the intersection of hash images for each pair of groups. And for each element, we still need to [indiscernible], which is very low, especially if we use [indiscernible] of two or four. But for the elements which are in the intersection, we still need to scan them; so that's the third part of the complexity.
This slide includes many details, but I will emphasize one point. That is, in our analysis, in theory, the grouping of the elements depends on the query; for different queries, the elements may be grouped in different ways. But we can use a [indiscernible] structure with just linear space to represent all the possible groupings, so that we don't spend too much more time or too much more space.
So in summary: in theory, we have some better performance guarantees; and in practice, this grouped merge with hash functions for filtering performs efficiently. We also propose another algorithm using [indiscernible] of the same index to handle the case where one set is large and the other set is small. And they can all be generalized to K-sets.
So what we care about is the [indiscernible] performance. Before we introduce the performance in experiments, let me discuss the reasons why the [indiscernible] can be improved. The first reason is that in our algorithms, the constant, the real number of operations, is relatively small. And the second reason, especially in this one, is we try to do a linear scan in memory instead of random access, to achieve better performance. So we can see that in experiments, this approach is the best in most cases and otherwise close to the best among other existing approaches.
In the experiments, we first generated some synthetic data and used real data from Bing and Wikipedia. All the algorithms are carefully [indiscernible] into [indiscernible] to achieve the best possible performance. And the time is measured in milliseconds. We implemented some typical approaches of different types and compared them with our four approaches. This one is the practical version of the second one. And the summary is: on average, our approach is twice as fast as the best competitor.
It is also robust in the sense that when ours are not the best, they are close to the best. There are some parameters we want to study in the experiments: one is the [indiscernible] of the lists, another is the relative size of the intersection, and also the relative sizes of the lists in the query as well as the number of sets included in the query.
In the first one, we vary the size of the sets: we have two sets with the same size, varying from one million to ten million. We found our practical version is always the best, followed by our first approach, and you can see the improvement; then the [indiscernible], which is simple and effective. The improvement is still a factor of two: we spend only half of the processing time. And when we vary the size of the intersection, we found that before the size of the intersection reaches, like, 70% of the list,
our approach is still the best, this practical version, followed by the first approach and the second approach. But after that, merging is the best; merging is this blue line. The reason is simply that merge is very simple and has very little additional cost. But actually, in practice, if the intersection size is so large, then we may just materialize this pair in the index, and this really happens in search engines.
And in the third one, we vary the ratio of the two sets. We can see that before the ratio becomes too large, meaning one set is small and the other set is very large, ours is still the best. But after that, the hash algorithm is the best. The hash algorithm simply scans the short list and uses hash lookups to check whether each element is in the other list. So hash is the best if the ratio is very large.
But we can see that ours is still close to hash; hash is the purple line, and ours is the red line. And then on the real data, we normalize the performance of all the algorithms to merge, where merge is one, and the profile shows the relative performance of each. You can see that our random group scan is around half of merge. And in particular, we report the performance for queries with different numbers of keywords. If there are two keywords, merge is here and here. But as the number of keywords in a query grows, we found that the performance improvement is even more, simply because the filtering power of the hash functions grows for queries with more keywords.
Okay. So let's conclude this part. We propose simple and fast set intersection structures and algorithms, with novel guarantees in theory. But more importantly, we want to show they perform really better in practice: they are the best in most cases and otherwise close to the best. For future work, storage compression is an important issue, and we have some preliminary results, but they are not included in this talk. And the second issue is how to select the best algorithm and parameters. A very simple case is that merge sometimes performs better; so how to make a selection between ours and merge, and why.
>>: [indiscernible] in real data, what's the size of the sets?
>> Bolin Ding: You mean in real data, what's --
>>: Yeah, set size, in real data.
>> Bolin Ding: In real data?
>>: Yes.
>> Bolin Ding: Oh, in real data, because the query log is from Bing, for different queries the lengths of the lists may be different. But on average, since we're using Wikipedia to build the word index, I remember it's around 1 million to 5 million. So --
>>: [indiscernible] should keyword [indiscernible].
>> Bolin Ding: Yes.
>>: So if the size of the set is like one million, on one web page, you have one million of this --
>> Bolin Ding: No, no. That means one million pages contain this word. Yeah, and actually, because the web pages here are not partitioned; the technique can actually be used in a partitioned setting as well.
>>: Okay.
>>: So your algorithm and the algorithms you studied here, they are [indiscernible] section of the lists, right?
>> Bolin Ding: Yes.
>>: In practice, especially dealing with more practical ways [indiscernible] to do some sort of [indiscernible] processing, whereas you only need to process a very small portion of the lists. So can your algorithms [indiscernible] to handle such cases?
>> Bolin Ding: Yeah, that's quite possible. But this work focuses on [indiscernible], because sometimes we may need the [indiscernible] to accomplish some task like query optimization. But when only the top-k is interesting, [indiscernible] as well, because most of them require only a linear scan with the list sorted by the relevance score. So actually, we can organize our index [indiscernible] in a similar way. But the extension, yeah, is an interesting piece of future work; it's not completely solved by this one.
>>: I think your first version is basically based on size ranges. This can be very easy to adapt [indiscernible] randomized. The second one, I'm not sure if you have --
>> Bolin Ding: Yeah, for the randomized one, it will work too -- well, a possible way is that we can treat the relevance score as a hash function. But it may not have such perfect performance guarantees in theory. But if we treat it in this way, it will be possible.
>>: So the issue of the set sizes is [indiscernible] hashtag [indiscernible].
>> Bolin Ding: Yes.
>>: So in [indiscernible] kind of ways you might expect, for example,
[indiscernible]. So that will be the common case.
>> Bolin Ding: Okay. So what I want to show here is this. Here's the threshold, and it is shown that hash performs better if the ratio is larger than 100. But given that, our approach is very close to hash. And actually, our technique is about bounding the worst latency. Because if this case happens, the processing time is already very small, so we don't care so much; the difference here is not so much, but we care about the worst case. In this case, the gap here is large and we [indiscernible] them. And another comment is we may need to switch to hash if we can detect this case happening.
>>: I see [indiscernible] we should be able to [indiscernible] this merge we can have [indiscernible] getting from the big structure to merge. But with your way, you don't have much of this benefit.
>> Bolin Ding: Okay. Let me -- so basically, you're saying about the pipeline, right?
>>: Yes.
>> Bolin Ding: Okay. So for the pipeline, a major issue in pipelines is load balancing, right? Or maybe in the distributed environment, if we want several machines to handle one list, one important property is that if the load is balanced among the different machines, then the set intersection can be computed in parallel very efficiently.
And I want to say the advantage of our approach in this environment is that we use a hash function for the grouping. The hash function can group a list into small groups with approximately equal sizes. So this feature can handle the load-balancing issue very well. It simply means in our randomized grouping approach, here, we can put, say, these groups onto one machine and these groups onto the other machine. And when we process them, it can be done in parallel.
>>: [inaudible].
>> Bolin Ding: Do you have any question on this part? Okay, then let's move on to the second part, which is a bit more high level than this one. Sorry, sorry, sorry. Okay. That's it. It's about going beyond set intersection. And the focus of this part is how to optimize the index structure to support taxonomy keyword search. Consider such an example: I want to find a job in Seattle at some IT company. So I type these terms into Bing or Google. In Bing, I find these search results. The first one is not relevant at all. And the second one includes --
>>: [inaudible].
>> Bolin Ding: Yeah. So the second one is about some company called All Covered, but I'm not sure why it is there; it is there only because the terms are included. And all the rest are about some job-seeking websites, and I may need to retype my query there to use the application.
So similar things happen in Google, so we don't need to feel so bad. As a user, what I need to do next with this search engine is to rewrite the IT company as company names, like: Google in Seattle, whether they have any job opening. We found, well, here it is. How about Facebook? We also found some in the top results.
So here's the motivation why this is important for a search engine. The first problem is, if the user has no knowledge about which companies are in Seattle, then this rewriting is essentially impossible for them; they have to refer to other websites. And the second problem is, if there are hundreds of IT companies, do we require the user to rewrite the query hundreds of times? Not reasonable, of course. So our final goal is to provide automatic rewriting to the user, using a taxonomy.
Here is a demo of the ProBase system. So this work originates from ProBase, which is led by Haixun Wang of MSR Asia. And in this demo, it shows that, well, in this example, company is replaced by IT company, tech company. It shows that if we click this button, different possible rewritings will be expanded here. Now, the focus of this work is how to process all the possible rewritings using a carefully designed index.
The motivation: first, enhance the user experience in keyword search. Much work in IR has been done on how to find different query rewritings, but this work focuses on the efficiency part: how to optimize the index structure to efficiently process all the query rewritings.
And this technique can be applied in other applications in text analytics with a taxonomy. So there are two big challenges. One is that for a keyword query and a taxonomy, which is large nowadays, we have a large number of possible query rewritings. On the other hand, the space for the index structure is limited: once the index structure is doubled, that means we need to spend double the amount of money to buy machines.
And then let's come to the taxonomy and define it more formally. A taxonomy is a hierarchy of terms with concept-instance relationships; it could be generated by a particular system. ProBase is currently the largest taxonomy. Our work supposes that this taxonomy is given. The taxonomy can be visualized as a tree; each term has its instances.
And in particular, this relationship is transitive, in the sense that the sub-tree contains all the instances of the concept here. Like IBM, Google and Microsoft are instances of company and also instances of IT company. And taxonomy keyword search is based on such a taxonomy.
So given a query like "database conference, Asian city", database conference can be replaced by SIGMOD, WWW, or VLDB, and similarly for Asian city. In another example, company, or IT company, can be replaced by a bunch of things. And some rewritings may not be so meaningful, like this one. But our goal is, among all these possible rewritings, to answer all of them, even though some of them are not interesting; the user doesn't know that in advance, right? So we want to answer all the possible rewritings in an efficient way.
This is the focus of this work, which will appear in SIGMOD 2012, and it's being integrated into the ProBase system. So it's about the index to support taxonomy keyword search. We are given a taxonomy and try to answer all the possible rewritings. The space is limited, and the problem is how to optimize our index. So the key problem here is what needs to be materialized and what does not.
Let's define taxonomy keyword search more formally. Suppose a query has three terms, T2, T10 and T17, in this taxonomy. Then we can replace each term by each of its instances: like IT company by Facebook, or IT company by IBM, Google, Microsoft. And also for the other terms, like T10, and so on. So there are too many ways to rewrite the query.
The overall picture is that we're not asking the users to rewrite that; this will be handled by the search engine and our technique. So another way to give the same set of results is this: we compute the union of this part. That means we compute the list of documents which contain any instance of T2. This is essentially the union of a bunch of merging lists from the index. And the answer is simply the intersection of these three unions.
But the problem is we cannot store all the result lists in memory; it would just consume too much space. Actually, even with a very simple tree taxonomy, the space will be three [indiscernible], three meaning three times the [indiscernible] lists, which is not acceptable. There are two baselines. One is to index nothing, and then answering this query is equivalent to evaluating such a large DNF. Why is it large? Because this union part is over the instances of some terms; if the taxonomy is large, this part is really huge.
But we also know that this is the lower bound of the space consumption, because we only materialize the [indiscernible]. And on the other side, if we materialize all the result lists, then the query can be evaluated in this way, but we need to materialize the result lists for all the terms. We just cannot afford so much space.
So our proposal is to partially materialize. First, we materialize all the [indiscernible]. And second, for the result lists, we select a subset of the terms to materialize. Here, the green terms are the set P; we only materialize these result lists. Then to process the query, to get the same result as before, we only need to evaluate a smaller DNF: for term T2, this will include the merging lists for T2 and T9, but for instances T3, T6, we can utilize the partial materialization of these result lists.
So this could be, well, something in the middle. But the problem is how to select the set P for materialization. And this is the so-called workload-aware index optimization. The set P is selected based on a workload of queries; in particular, this workload is a query log. For the subset P of terms, we materialize the result lists, but for the others, we materialize only the merging lists. So this P is selected based on the taxonomy structure and our query log.
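To illustrate query evaluation under a partially materialized set P, here is a hedged Python sketch; the data layout (taxonomy as a child map, plus 'inverted' and 'materialized' dictionaries) is my own simplification, not the system's actual structures.

```python
def union_list(term, taxonomy, inverted, materialized):
    """Documents containing `term` or any instance below it in the taxonomy."""
    if term in materialized:
        return set(materialized[term])      # result list was precomputed
    docs = set(inverted.get(term, ()))
    for child in taxonomy.get(term, ()):    # otherwise expand the smaller DNF
        docs |= union_list(child, taxonomy, inverted, materialized)
    return docs

def answer(query_terms, taxonomy, inverted, materialized):
    """Intersecting the per-term unions answers all rewritings at once."""
    lists = [union_list(t, taxonomy, inverted, materialized)
             for t in query_terms]
    return set.intersection(*lists)
```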
So now, the major problem: we have a space budget. We have a constraint that the overall index structure takes space no more than a certain amount. So it becomes an optimization problem, in which we have a constraint on the space, and we want to minimize the cost. I will introduce different cost models later, but the cost is simply the average cost to process this workload on this materialization.
And we use this set of queries Q as something like a training set. So the index is optimized based on Q, but we show that the optimized index is very effective even for future queries.
Here are the different cost models. The first model is the linear scan model: we just apply the merge algorithms. To compute this DNF, we first compute the union for each term in the query using a linear scan, and then we use a linear scan again to compute the set intersection.
Then the overall cost can be approximated by the sizes of the lists we retrieve from the index, because the linear scan essentially scans all these lists. And the second model is hash lookup, which evaluates the DNF in a different way: we select one list as the candidate, and for each element in the candidate, we use hash lookups to check its membership.
So the cost can be approximated by the total number of hash lookups. And this cost can be computed as a weighted sum across all the queries in the workload. There are two interpretations. The first one is that this cost is the expected processing cost for a random query in Q. The other is that if Q is large enough, this cost can be used to predict future queries' processing cost using the index on P.
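A small sketch of how these two cost models and the workload average could be computed; lists_retrieved, list_size and freq are assumed inputs of my own naming, not the paper's API.

```python
def linear_scan_cost(query, P, lists_retrieved, list_size):
    # merge-based evaluation scans every retrieved list once, so the cost
    # is approximated by the total size of the lists pulled from the index
    return sum(list_size[l] for l in lists_retrieved(query, P))

def hash_lookup_cost(query, P, lists_retrieved, list_size):
    # probe with the shortest list as the candidate; the cost is
    # approximated by the number of hash lookups against the other lists
    sizes = sorted(list_size[l] for l in lists_retrieved(query, P))
    return sizes[0] * (len(sizes) - 1)

def workload_cost(Q, freq, P, cost_one):
    # weighted average over the query log: the expected processing cost
    # of a random query drawn from the workload Q
    total = sum(freq[q] for q in Q)
    return sum(freq[q] * cost_one(q, P) for q in Q) / total
```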
Now, if we use this cost model, it can be shown that the problem is [indiscernible]-hard, and we propose different algorithms to select the set P, which is the key problem here. The [indiscernible] basically selects the terms based on their frequency: we'd like to select terms with higher frequency in the workload, because they are likely to be referred to again in the future. But we show that this is suboptimal. In the first case, if T1, T2, T3 are all frequent, they are included in P. But actually, a better solution is to include T4, with even smaller space but similar improvement on the performance.
>>: What's the [indiscernible] example of that?
>> Bolin Ding: You mean here?
>>: Yeah, the keyword [indiscernible] taxonomy [indiscernible].
>> Bolin Ding: Okay. So an example could be like -- suppose T4 is Microsoft. Microsoft is an IT company; IT company [indiscernible] through Microsoft. And in another case, we may say, well, a successful startup could refer to Microsoft. Then we could materialize these things, of course. But if we materialize Microsoft, this materialization can be utilized by all of these terms. So the space we need for this term is smaller, but we can achieve similar improvement on the --
>>: [Inaudible] this example might be [indiscernible] more frequent than [indiscernible].
>> Bolin Ding: Yeah, this may not be so good an example. Yeah. Let me see. Maybe that's --
>>: [inaudible].
>> Bolin Ding: Maybe let's say this one. Suppose this is Facebook and this is FB. Both might be frequent, but we only need to materialize one of them.
Then, because of that -- it's not optimal -- the second one is based on [indiscernible]. So the DP idea is that suppose all the terms are ordered by DFS. Then the decision on whether to include this one is determined only by the status of the path from this term to the root.
Maybe, for the limited time, I'll skip the details. But the key thing is that we can have such a [indiscernible] optimal solution, and it can be computed in this time. But the problem is that the budget B is in the complexity, which can be large. So we need to scale it down. After it is scaled down, we can apply the same idea, but we may not find the optimal solution anymore; it becomes an approximation.
For the third one, we utilize the submodularity of the cost function and use a greedy algorithm. We show that this function is monotone and sub-modular. By sub-modular, I mean that if there are two sets, one a super-set and one a subset, then the benefit of including a term in the super-set is smaller than the benefit of including the same term in the subset. And because of this property, this is the intuition.
We can use a greedy algorithm. It starts from an empty set, and at each iteration it selects the term with maximal marginal benefit, until the space is used up.
It is a constant approximation, based on the submodularity and monotonicity. And this is its time complexity, but it should be noted that the real running time is much faster, because it terminates after the space budget is used up.
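For concreteness, a minimal greedy sketch matching that description; benefit and space here are stand-ins for the paper's cost-reduction and storage functions, so treat this as an assumption-laden illustration.

```python
def greedy_select(terms, benefit, space, budget):
    """Start from the empty set; repeatedly add the term with maximal
    marginal benefit that still fits, until the budget is used up."""
    P, used = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in terms - P:
            if used + space(t) > budget:
                continue
            gain = benefit(P | {t}) - benefit(P)  # marginal benefit of t
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:
            return P          # budget exhausted or no positive gain left
        P.add(best)
        used += space(best)
```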
Now, let's come to the experiments. The taxonomy is given by ProBase, which contains more than 200 thousand terms; it's currently the largest taxonomy. The query log is [indiscernible] from 2007 to February 2010 and contains all [indiscernible] queries with frequency larger than 300. This query workload is used to train the index, but we use a different set of queries from the future, from March to August, to test the optimized index. And the documents are a sample of Bing's web pages. We compare different approaches for the index.
And the summary is: this baseline uses no additional space, but the processing time is bad. This baseline uses three times the space, but it gives a lower bound on the processing time in this model. And we will show that our approach uses only 10% additional space. Go ahead.
>>: [indiscernible]
>> Bolin Ding: ProBase is actually a project on how to build a taxonomy and how to apply it. So this work is an application of this taxonomy.
>>: What is the final keyword [indiscernible]?
>> Bolin Ding: It's just answering all the possible query rewritings. It's just a set intersection of the unions on this taxonomy.
>>: Is there a ranking?
>> Bolin Ding: There's no ranking at this point; we return all the answers. And, of course, ranking can later be included. Let's put it this way.
>>: You are talking [indiscernible].
>> Bolin Ding: Yeah, let's put it this way. The index we constructed could be used for ranking, but the performance evaluation is about the case when we want to extract all the results. Hopefully, this optimized index can perform well for ranking too, because we essentially materialized something similar to the [indiscernible] lists; hopefully the performance of ranking on this optimized [indiscernible] is similar, but we need to evaluate that for sure. Yeah? Any other questions?
So basically, the major conclusion is that we can use a bit more space to achieve similar [indiscernible] processing time as the best possible one. But note that this best possible one uses three times as much space as the [indiscernible] lists. So, different curves.
For the space budget, we found that 10% of the space is enough. This is the lower bound, this is the hash lookup model, and this is how it handles future queries. The query log is from September to February 2010, and the test queries are from March, April, May, up to August. We can see, especially in this one, that our approach is pretty stable for future queries. After August, there is a bit of an increase because the distribution changes, but we can then re-optimize the whole index.
Comparing different models, we also test the scalability on different sizes of the [indiscernible] index. We can see that the performance is stable as we enlarge the size of the materialized index.
And this is to show that the index optimization part only occupies a small portion of the overall pre-processing. One extension is how our algorithm can be extended to handle a general taxonomy; the only thing we need to show is the submodularity of the cost model on a general taxonomy, which can be done.
The second one is that we can derive other algorithms; [indiscernible] pointed out we can use a special cost model for a ranking algorithm here. So essentially, this is a general framework, flexible enough to handle that.
And to conclude this part: we support taxonomy keyword search, formulate the index optimization problem, and show on a Bing data set that it gains such good results.
For future work: how to handle updates, to determine when to re-optimize, and how to enlarge the search space to get better performance.
Any questions for this part? Okay. Let's quickly finish the third part. So the third part is a system we built to support text analytics on multidimensional data.
This was originally funded by NASA. And after we built the system, they were very excited, so I will [indiscernible] the system here briefly instead of the algorithms.
So the system is like this. Suppose on a product review site we have customer reviews associated with different properties like brand [indiscernible], and the query is "lightweight, powerful". We want to find which brand name and which model has these features.
The second one is the NASA [indiscernible]; we originally worked on this one. It's a data set where we have reports written by the captains of each flight after the travel. And the reports are also associated with different attributes, like the weather of this travel, the phase of the anomaly, what kind of anomaly, and some others.
So the problem is how to find under which conditions this kind of action happens and what the reason is, based on the dimensions. It could be caused by the weather, or by the light, or by other things.
So now the big picture is: we have multidimensional data with text and dimensions, and given keyword queries, we want to decide what to rank and how to rank. Our proposal is the text cube model: we aggregate the text data on subsets of dimensions. The aggregation units here are entities formed by a combination of a subset of dimensions and their values.
And actually, this is an entity which combines the [indiscernible] mentioned values. So for example, here's a laptop. Here are the three attributes of the laptop, and this is what the customers say in reviews of the laptop. The attribute settings of the laptop can be combined in any way.
Compared with the ranking-objects approach proposed by people at Microsoft Research, our work was started independently, but I will show the difference. Their work assumes there's a relation, called a document-object relation, to associate each object with a set of documents.
Now, the question is, given the query, how to rank the objects. Differently, we assume we have a document-dimension relation instead of an object-document relation, in which we have the documents and the dimensions. Based on the dimensions, we can generate a hierarchy of entities; from dimensions like model, we can generate a bunch of different settings of different laptops.
Based on that, a simple way is to insert all of the objects into their relation. But the problem is there are too, too many objects to be inserted.
So our proposal is the text cube model. In this cube model, we have different entities in different sub-spaces, and we do a partial materialization of the whole cube, so that the space is not so large but we can do the online query processing efficiently.
This is a simple example. This laptop, Acer 110, aggregates two documents. This one, Acer with system XP, aggregates another two documents. And this is the NASA data.
So the overall objective of our work is: given [indiscernible] data and a keyword query, through the text cube we give a ranking of the cells, the entities of different sub-spaces, where the relevance is the relevance between the text here and the keyword query. So they are ranked.
And we are interested in the top-k. The top-k search algorithm has two challenges. The first one is that the number of [indiscernible] is exponential in the number of dimensions. And the second one is that the relevance score is query-dependent, which means we cannot materialize the score of a cell in pre-processing.
So we proposed this algorithm. I will skip the details, but the major idea is to estimate lower bounds and upper bounds for certain subsets, to speed up the search and try to find a shortcut to the top-k.
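A hedged sketch of such bound-based pruning for top-k cell search; the best-first structure and all the names are my assumptions, since the talk does not spell out the exact algorithm. It also assumes children() enumerates each cell once (e.g., via a canonical parent).

```python
import heapq, itertools

def topk_cells(root_cells, k, children, upper_bound, score):
    """Best-first search over the cube lattice: expand cells in order of
    their upper bound; stop once no remaining bound can beat the k-th score."""
    tie = itertools.count()                       # tie-breaker for the heaps
    frontier = [(-upper_bound(c), next(tie), c) for c in root_cells]
    heapq.heapify(frontier)
    topk = []                                     # min-heap of (score, tie, cell)
    while frontier:
        neg_ub, _, cell = heapq.heappop(frontier)
        if len(topk) == k and -neg_ub <= topk[0][0]:
            break                                 # shortcut: prune the rest
        s = score(cell)                           # query-dependent relevance
        if len(topk) < k:
            heapq.heappush(topk, (s, next(tie), cell))
        elif s > topk[0][0]:
            heapq.heapreplace(topk, (s, next(tie), cell))
        for child in children(cell):
            heapq.heappush(frontier, (-upper_bound(child), next(tie), child))
    return sorted(((s, c) for s, _, c in topk), key=lambda t: -t[0])
```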
This is a very simple experiment on efficiency to show the scalability. This is the baseline and this is our approach. The baseline is very simple, so the comparison may not be so [indiscernible], but I will show you the system, which can show the effectiveness.
And this system is based on NASA's data, as I described above, and here's the address; you can access it from any place. The interface is like this: we simply type keywords here and search.
This is the system the NASA people were excited about. For example, in this sample, we try to find under what kind of situation a runway excursion happens. Runway excursion simply means a very dangerous event when the aircraft wants to take off or land. We found that a rainy night is a very dangerous condition for this event: when this happens, the score of the documents in the cells related to this one will be very high.
Another interesting observation is that this model seems to run into this problem more easily. And the purpose of this system is to give, say, some training to new captains, to let them pay more attention when this condition happens, to avoid this event.
Another example: we are interested in what will cause flight delay. So we type "flight delay". We found that mostly, it's something related to the company and their policy. It may be because of the weather or because of some problem, and in some cases, it's interesting to note that the flight delay is caused by improper maintenance. This is because once the flight wants to take off, suddenly some maintenance problem is found; this will also cause flight delay. So this is something interesting for this query.
And the third one: what will light fog cause for a flight? The top two results are excursions on the taxiway and runway. Especially when the fog happens, the reaction of the captain could be a rejected takeoff; when we find the weather's bad, we found flight delays, and they just reject the takeoff. And it's also more critical during the night; light fog could be even more dangerous. And several other examples, like when the --
>>: So you're not really trimming out -- say all the data occurred in 2010. So would you always show the column that said -- say there's a column, the year that it occurred. Would you always return that column as being 2010? I guess you would.
I'm just saying, say the data is the same in all the documents that come back for a given attribute. Are you actually removing that attribute from being relevant? Because say it occurs in all the documents across the entire [indiscernible], as opposed to just the ones that you're searching for.
>> Bolin Ding: Same document or same query term?
>>: Let's say your whole corpus is just 2010 events.
>> Bolin Ding: Okay.
>>: And then you type in some keyword which restricts the documents. Are you still going to return 2010? Let's say there's an attribute called year, right. Are you still going to return 2010 as being an important attribute?
>> Bolin Ding: Well, in that case, it's not interesting to show 2010, right. Then we'll replace 2010 with a star here, because what we return is only a subset of dimensions which are relevant to the query, and their values. Yeah.
>>: Okay. But then you also have to look at all the other documents in the
queries and say that it's not really, you know, there's no decision tree that's
separating, like, the documents that have this keyword versus the other
documents, right?
>> Bolin Ding: Well, it's not a decision tree. Actually, it's a query-dependent procedure: for different queries, different cells may be ranked as the top ones. So it's not a decision tree; it's essentially a ranking technique.
>>: Just within the documents that have those keywords are you going to do that
ranking? You're not going to rank those features against all the other
documents that don't have the keywords?
>> Bolin Ding: Well, the suggestion you mention is related to, well, the IR model we apply. On the other side, if one term appears in many other documents, then it's not so interesting, because all the documents contain this term; then including this term will have a lower weight compared to including other rare terms. It's essentially about what kind of ranking score, what scoring function, we use.
I'll skip a lot of details, but the key point is that our search algorithm can be combined with different ranking functions. And in this system, we simply use the TF-IDF model from IR, which is typically used in analyzing terms, to get the relevance score here.
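As an illustration of the standard TF-IDF scoring mentioned here, applied to a cell's aggregated text (a textbook formula, not necessarily the system's exact scoring function):

```python
import math

def tfidf_score(query_terms, cell_docs, df, total_docs):
    """Score the text aggregated in one cell against the keyword query.
    cell_docs: list of token lists; df: corpus document frequencies."""
    text = [w for doc in cell_docs for w in doc]   # aggregate the cell's text
    score = 0.0
    for t in query_terms:
        tf = text.count(t) / max(len(text), 1)     # term frequency in the cell
        idf = math.log(total_docs / (1 + df.get(t, 0)))  # rarer terms weigh more
        score += tf * idf
    return score
```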
>>: So your relevance score -- I mean, if you look at the lattice of the cube, typically the cube's [indiscernible] lattice tends to have more [indiscernible]. It seems like when we have an aggregation of these top cubes, we keep getting a higher number of [indiscernible].
>> Bolin Ding: Actually, it will happen in the reverse way, because for cells at a higher level, all kinds of documents are aggregated together, right. So then for a particular query, the relevance score could be lower, because it aggregates essentially different situations, and maybe only a certain type of situation is relevant to the query.
But the problem you mention actually [indiscernible] in our system, because once we go down to a single document, there could be some over-fitting issue. That means if one document is very relevant to the query, it's possible that the top cell could be just this document. But this is not interesting, because we want our cells to be large enough to support the evidence. So actually, we have a parameter called the support threshold, which is the minimum number of documents a cell needs to contain in order to be output in the top-k. So yes, that's how we handle that.
Any questions? That's, I think, all, and it's been almost one and a half hours. Yeah. And these are different examples, and I would like to say thank you. Here's an overview of my research; if you're interested, we can discuss some components of it. And otherwise, that's it. Thank you.