>> Christian Konig: Thank you very much for coming. It is my pleasure to introduce Bolin Ding from UIUC. He's a fifth-year Ph.D. student. He's now a veteran of a total of four internships at Microsoft, three of those in the DMX group at MSR Redmond, and he's worked on, as you will see, a wide range of topics, including data mining, privacy, and work on sensor nets. And he's going to talk today mainly about his work on processing information retrieval queries, text search and text analytics.
And with that, I'll hand it over to Bolin.
>> Bolin Ding: Okay. Hello, everyone. Welcome to my talk. And as Christian
said, my talk today is about how to support efficient and effective keyword
search. So before going into the major part of the talk, I will first give you
a very high level overview of my Ph.D. research.
So there are two major streams. The first stream is based on structured data, including relational databases, data cubes and networks. I study how to build efficient index structures and data models. [indiscernible] on the same data models, two kinds of problems: one is keyword search and text analytics, and the other one is data mining.
The second stream of my work is about data privacy. Because we want to encourage people to share their data, we need to promise that their data is safe in our indexing model. My work in this stream provides privacy-preserving search, query and mining.
And during my internships here, I worked on [indiscernible] with Christian. And in school I also collaborated with people from other groups to transfer data mining and database techniques into different areas like sensor networks and this other [indiscernible].
But in this talk, I will focus on this stream. In particular, it's about indexing and data models for different types of structured data to support keyword search.
Here is the outline of this talk. The first part will start from a very basic operator to support keyword search and text analytics on plain text. It is the [indiscernible] operator. And this work was done during my internship at MSR.
And then in the second part, I will discuss how to incorporate taxonomy into search and analytics. Simply put, a taxonomy is a hierarchy of terms, and a query can be rewritten by replacing terms with instances; for example, company can be replaced by IBM, Google, Microsoft.
So the goal of this work is to optimize an index structure to process such query rewritings in an efficient way. This work originates from the so-called ProBase system, which is led by Haixun Wang at Microsoft Research Asia. And our technique for optimizing the index is now being integrated into a part of the system. And it's my pleasure to introduce this work in public for the first time, because it was just accepted by SIGMOD this year.
And then for the third part, I will introduce our system, called text cube with top cells, to handle structured [indiscernible] on plain text. It is based on a collection of works which was originally funded by NASA. Before we finished this funding, we built a system where I acted as the team leader. This system was so exciting to the NASA people that they gave us another round of funding to add new features to it. And I will give you some details on this system later.
So overall, this is the first part, second part and third part. And this is the outline.
Let's go to the first part for a start. The first part is about the basic operator called set intersection. It's very important in many applications like keyword search, data mining and maybe query classification. From reports by Google, it is said that speed does matter in web search. There are many factors that can affect the latency, like the network status and other ranking [indiscernible], but set intersection is one of the most important operators. So we study how to improve the performance of this operator. And in particular, 100 milliseconds matters. We'll show how our approach can improve the performance by this much in most cases.
So set intersection plays an important role in keyword search, of course. A simple example: we have lists of documents which contain some terms. This so-called inverted index is constructed in the pre-processing. In the online processing, once the query comes, the output is the set of documents containing all the terms. So it essentially is simply the intersection of these two lists. This provides us a motivation to study the set intersection operator.
And our algorithms and index structure will focus on an in-memory compact index. It aims to provide fast online processing algorithms. The contribution is two-fold. On one side, in theory, it provides better time complexity. But more importantly, in practice, it has robust performance.
Simply put, compared to other techniques for this operator, our approach is the best in most cases, and otherwise it is close to the best. The improvement is around, I mean, we can reduce the processing time to one half in most cases. So this is the so-called "best in most cases" case.
And here's the complexity, but I will introduce it later. So there are different types of related work. This is the very simple one: we just merge two sorted lists. And these three are designed to handle the case where one side is very large and the other side very small.
And this one is a very recent work from the theory community. It has very good performance guarantees in theory, but it performs very badly in practice.
So let's come to the basic idea of our approaches. In the pre-processing stage, we first partition the lists into small groups. To remind you, every element in the lists could be from a large universe, depending on how large the document [indiscernible] is. But we partition them into small groups first. And then each small group is mapped from the large universe to a small universe, using the same hash function H for all the small groups.
And the size of the range is W, where W is the word size. Here's a simple example: for each small group from the large universe, we map it into a small universe from 1 to W. So if W is 16, then the hash value is a number from 1 to 16.
This image in the small universe can be encoded as a word in a computer. It simply means if 1 is in the image, then the first bit is set to 1; if 6 is in the image, then the sixth bit is set to 1. So in this way, we can compactly represent the hash image.
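To make the encoding concrete, here is a minimal Python sketch of a hash image packed into one machine word; the function names and the toy hash are illustrative assumptions, not the talk's actual implementation.

```python
W = 64  # word size: each hash value lands in [0, W)

def hash_image(group, h):
    """Encode the hash image of a small group as a W-bit word:
    if h(x) = i for some element x in the group, bit i is set to 1."""
    word = 0
    for x in group:
        word |= 1 << h(x)
    return word

# Toy hash function (assumption: any pairwise-independent hash into [0, W) would do).
h = lambda x: (x * 2654435761) % W

g1 = [103, 522, 980]
g2 = [245, 522, 771]
# One bitwise-AND tests the two hash images for a possible intersection.
if hash_image(g1, h) & hash_image(g2, h) == 0:
    print("hash images disjoint: this pair can be skipped for sure")
```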
Now, there are three observations. The first one is that if a pair has an empty intersection, then the hash images will have an empty intersection too, with high probability. Of course, how large the probability is depends on how we partition them and how we select the hash function. But I will introduce that later.
The second fact is more obvious. It says that if the intersection of the hash images is empty for some pair, then we can skip this pair for sure, simply because if there are some common elements in this pair, then they must be mapped to the same entry in the hash image, because we use the same hash function for all pairs.
And the third one is that typically, the intersection size is much smaller than the list size, because otherwise we would simply materialize this pair in the index. So in most cases, we found that the intersection size is relatively small.
>>: [inaudible] how are the small groups [inaudible]?
>> Bolin Ding: Oh, good question. So there are different ways to partition into small groups, but suppose W is the word size. A good choice for the group size is the square root of W. So if W is 64, then the square root is 8, so a group size of 8. And I will show you why this choice makes sense.
>>: [inaudible]
>> Bolin Ding: One partitioning scheme is based on sorting. But the other is based on hashing. So I'll introduce both a bit.
In online processing, we first compute the intersection of hash images. If it is empty, as we previously said, we can skip. But otherwise, we need to compute the intersection of the two small groups. In this example, the intersection of the hash images we found is the green one, which is not empty. But note that since they are encoded as a word, the intersection of the hash images can be done more efficiently using a bitwise-AND.
And now there are two questions: how to partition, and how to compute the intersection of two small groups.
The first partitioning method is called fixed-width partition. In this one, we first sort the two lists, L1 and L2, and partition them into equal-size groups, with square root of W elements in each group, where W is the word size.
Now we compute the intersection of two small groups using this subroutine. But we only need to consider some subset of all the pairs. Simply put, if a pair has an overlapping range, we compute its set intersection. Otherwise, we don't need to, like this one and this one: the maximum value of one is smaller than the minimum value of the other. So there are at most (N + M) over square root of W pairs to be considered; actually, there is a factor of two.
So for each pair, how to process it quickly? This is our subroutine called QuickIntersect. In pre-processing, we map each group into a small hash word, the hash image. In online processing, we first compute the intersection of the hash words using one operation, bitwise-AND. And for each one-bit in the result, we go back to check whether the two elements that are mapped to that value are the same. If yes, we add this element into the result. Otherwise, if these two are different, then they are not in the intersection result.
So essentially, the cost for this step is the number of one-bits in this hash word. As I said, there are two cases: one is these two elements are indeed in the intersection; the other case is these two elements are different but are mapped to the same entry. This is the so-called bad case. The good news is that the number of so-called bad cases is bounded by one for each pair in expectation. That's the very key reason why this approach performs well.
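Here is a hedged sketch of what such a QuickIntersect subroutine could look like; the structure and names are mine, and it assumes for simplicity that the hash function has no collisions within a single group (the real algorithm handles in-group collisions).

```python
def quick_intersect(g1, g2, img1, img2, h):
    """g1, g2: small groups; img1, img2: their precomputed hash-image
    words; h: the shared hash function."""
    common = img1 & img2            # one bitwise-AND over the hash images
    if common == 0:
        return []                   # empty hash intersection: skip for sure
    by_hash1 = {h(x): x for x in g1}
    by_hash2 = {h(x): x for x in g2}
    result = []
    while common:
        bit = (common & -common).bit_length() - 1  # lowest set bit
        if by_hash1[bit] == by_hash2[bit]:
            result.append(by_hash1[bit])           # truly common element
        # else: a "bad case" -- two distinct elements colliding under h
        common &= common - 1                       # clear the lowest set bit
    return result
```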
So let me be clearer about why this is true. Here's a bit of analysis. The number of bad cases for each pair is bounded by this one. In particular, if two elements X1 and X2 are different, the probability that they are mapped to the same hash value is bounded by 1 over W. And if we consider all the pairs of elements, since the group size is square root of W, the sum is at most 1. So given that, the total complexity is this: the number of bad cases across all the pairs sums up to this factor, and the intersection of the hash words for all the pairs of small groups sums up to this factor. So overall we get this one. And we can optimize the parameters a bit to get a better complexity by selecting a different group size.
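Spelling out that argument in notation of my own (a reconstruction from the definitions above, not the slide's exact formula):

```latex
\Pr[h(x_1) = h(x_2)] \le \tfrac{1}{W} \quad \text{for } x_1 \ne x_2, \qquad
\mathbb{E}[\#\text{bad cases per pair}]
  \le \sum_{x_1 \in g_1}\sum_{x_2 \in g_2} \Pr[h(x_1) = h(x_2)]
  \le \sqrt{W}\cdot\sqrt{W}\cdot\tfrac{1}{W} = 1.
```

Summed over the O((n + m)/sqrt(W)) candidate pairs, this gives an expected total of O((n + m)/sqrt(W) + r), where r is the intersection size.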
The second partitioning scheme is based on randomized partition. The first question is why we need randomization. The reason is simply that in the fixed-width partition, note that these two sets are partitioned into three groups each. For the fixed-width partition, we need to consider the six pairs. But we will show that in the randomized partition, only three pairs need to be considered, which could be more efficient.
So it works in this way. For the two sets, we use another hash function G, called the grouping hash function, to map all the elements to a universe with size equal to the number of groups. For each list, the elements that are mapped to the same value are grouped together, so that we have this number of groups for each list.
Now for the two sets, we only need to consider the pairs of groups which are mapped to the same value according to the grouping function G, for a similar reason as before. And again, the key question becomes: for each pair, how to compute the intersection efficiently? And again, we use the same subroutine, and it works exactly the same as the previous one.
The analysis is also similar. The key part is to bound the number of bad cases for each pair, which can be shown to be 1: the probability here is at most 1 over W, and the number of pairs of elements sums up from here, so it's at most 1. And by a similar analysis, we can get this complexity.
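A rough sketch of this randomized grouping, reusing the quick_intersect and hash_image sketches from above (again, names and structure are my assumptions; in the real index the hash images would be precomputed):

```python
from collections import defaultdict

def partition_by_hash(lst, g, num_groups):
    """Group elements of one list by the grouping hash function g."""
    groups = defaultdict(list)
    for x in lst:
        groups[g(x) % num_groups].append(x)
    return groups

def randomized_intersect(l1, l2, g, h, num_groups):
    """Only aligned pairs (same g value) can share elements, so at most
    num_groups pairs of groups are ever considered."""
    p1 = partition_by_hash(l1, g, num_groups)
    p2 = partition_by_hash(l2, g, num_groups)
    result = []
    for key, grp1 in p1.items():
        grp2 = p2.get(key)
        if grp2 is not None:
            result.extend(quick_intersect(grp1, grp2,
                                          hash_image(grp1, h),
                                          hash_image(grp2, h), h))
    return result
```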
By the way, this [indiscernible] can be generalized to K-set intersection. Now, the two approaches are better approaches in theory, but in our experiments, we found that --
>>: You said generalized to K-sets.
>> Bolin Ding: Yes.
>>: How do you do that?
>> Bolin Ding: So for K-sets, it's quite similar. There are two sets here. If there are other sets, then very simply [indiscernible] we consider the triples of groups with the same G value. So that means in the first step [indiscernible] triples. And then we need to revise the [indiscernible] subroutine. It's also not so complicated: in this step, we compute the intersection of three hash images, and similarly, for each one-bit, we go back to the three lists to check whether the elements are the same.
>>: You mentioned the [indiscernible].
>> Bolin Ding: Um-hmm. You mean partitioned?
>>: So is this algorithm, is it a distributed [indiscernible]?
>> Bolin Ding: This is a good question. So I'll answer it in two ways. The first one: on each partition, we can apply this technique, because --
>>: [indiscernible].
>> Bolin Ding: Yes, yes, we get the same results. We get the same result even if this algorithm is applied in each partition; [indiscernible] we get the same result. And the second answer is, as you can see, we use a grouping function here. This function can even be used to direct us on how to do the partitioning on different machines. But this is a little bit out of the scope of the talk, okay.
>>: So you place a lot of emphasis on [indiscernible]. Why don't you just
have one hash [indiscernible] and spray them across a bit vector and chop the
bit vector up [indiscernible] process?
>> Bolin Ding: I'm sorry. Your question?
>>: You're hashing. So why don't you just generate one large bit vector, chop
it up into pieces, and do the intersections of bit vectors in pieces? I don't
understand what the role of the grouping is doing for you when you're using
hashing for those steps.
>> Bolin Ding: Oh, I see. So you just said we can simply do the intersections for this part, this part?
>>: Yeah, you map them into one large bit vector and just chop it up into pieces and do the pieces one after the other. Isn't that equivalent to what you're doing?
>> Bolin Ding: Interesting. So this is very similar to our practical version of the second algorithm. This subroutine seems to be complicated; we have this kind of structure. And in our practical version, we show that we can simply do the independent merge on these pairs, right. But the hash word is still useful, because we can first test whether this hash image has an empty intersection. If yes, we can skip this pair.
>>: That would be true if you just chopped the result and the resulting bit vector up into pieces and processed it a piece at a time.
>> Bolin Ding: That's possible, yes.
>>: All right.
>> Bolin Ding: But okay. Yeah. But yes, that essentially means that our approach can be applied in a distributed or parallel style, right. And I will show you that the practical version is very similar to what you described. In this version we found, well, [indiscernible] cheap, so we just chop the lists into segments, and for each segment, we merge to compute the intersection. And we found that this simple algorithm is even faster in wall-clock time. And we really [indiscernible] that, so that's why we analyze it this way. But we add a bit more features on that. That is, in the pre-processing, we still map each pair of groups using a hash function, and we use one additional word to store the hash image.
So before we process each pair of [indiscernible], we first check whether the hash images have a non-empty intersection. If the intersection of the hash images is empty, we can skip this pair. But otherwise, we use a simple merge algorithm to compute the intersection of this pair.
So this is the idea. We don't use this structure anymore. Instead, if the intersection of the hash images here is non-empty, then we do a linear scan to merge; otherwise, we skip this scan. And an extension to this framework is to add more hash functions. It simply means we add more words for each pair, so that we scan a pair only if all these pairs of hash words have a non-empty intersection. If one of them is empty, we can skip that pair.
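A minimal sketch of this practical variant, with one or more hash words per pair acting as filters before a plain merge (illustrative code, not the measured implementation):

```python
def grouped_merge(g1, g2, imgs1, imgs2):
    """g1, g2: sorted segments; imgs1, imgs2: lists of precomputed
    hash-image words, one per hash function (filter)."""
    # Skip the pair if ANY filter proves the intersection empty.
    for w1, w2 in zip(imgs1, imgs2):
        if w1 & w2 == 0:
            return []
    # Otherwise fall back to a plain linear-scan merge.
    out, i, j = [], 0, 0
    while i < len(g1) and j < len(g2):
        if g1[i] == g2[j]:
            out.append(g1[i]); i += 1; j += 1
        elif g1[i] < g2[j]:
            i += 1
        else:
            j += 1
    return out
```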
So why does this work well? Because this kind of hash function can be used to filter the pairs of groups with empty intersection. And I will show you the power of this filtering is very significant, in the sense that this kind of event, called failed filtering -- that is, the intersection is empty, but we have a non-empty intersection for all the hash words -- happens with very low probability, which is bounded by a constant.
Because of that, we can use [indiscernible] functions to further reduce this probability. And to show this constant, we did some analysis, but I will skip it here. So now the total running time of this algorithm consists of three parts.
The first one is computing the hash functions and computing the intersection of hash images for each pair of groups. And for each element, we still need to [indiscernible], which is very low, especially if we use [indiscernible] of two or four. But for the elements which are in the intersection, we still need to scan them; so that's the third part of the complexity.
This slide includes many details, but I will emphasize one point. That is, in our analysis, in theory, the grouping of the elements depends on the query; for different queries, the elements may be grouped in different ways. But we can use a [indiscernible] structure with just linear space to represent all the possible groupings, so that we don't spend too much more time or too much more space.
So in summary: in theory, we have some better performance guarantees; and in practice, this grouped merge with hash functions for filtering performs efficiently. We also propose another algorithm using [indiscernible] of the same index to handle the case where one set is large and the other set is small. And they can all be generalized to K-sets.
So what we care about is the [indiscernible] performance. Before we introduce the performance in experiments, let me discuss the reasons why the [indiscernible] can be improved. The first reason is that in our algorithms, the constant, the real number of operations, is relatively small. And the second reason, especially in this one, is we try to do a linear scan in memory instead of random access, to achieve better performance. So we can see that in experiments, this approach is the best in most cases and otherwise close to the best among other existing approaches.
In the experiments, we first generated some synthetic data and used real data from Bing and Wikipedia. All the algorithms are carefully [indiscernible] into [indiscernible] to achieve the best possible performance. And the time is measured in milliseconds. We implemented some typical approaches of different types and compared them with our four approaches. This one is the practical version of the second one. And the summary is: on average, our approach is twice as fast as the best competitor.
It is also robust in the sense that when ours are not the best, they are close to the best. There are some parameters we want to study in the experiments: one is the [indiscernible] of the lists, another is the relative size of the intersection, and also the relative sizes of the lists in the query as well as the number of sets included in the query.
In the first one, we vary the size of the sets: we have two sets with the same size, varying from one million to ten million. We found our practical version is always the best, followed by our first approach, and you can see the improvement; then the [indiscernible], which is simple and effective. The improvement is still a factor of two: we spend only half of the processing time. And when we vary the size of the intersection, we found that before the size of the intersection reaches, like, 70% of the list,
our approach is still the best, this practical version, followed by the first approach and the second approach. But after that, merging is the best; merging is this blue line. The reason is simply that merge is very simple and has very little additional cost. But actually, in practice, if the intersection size is so large, then we may just materialize this pair in the index, and this really happens in search engines.
And in the third one, we vary the ratio of the two sets. We can see that before the ratio becomes too large, meaning one set is small and the other set is very large, ours is still the best. But after that, the hash algorithm is the best. The hash algorithm simply scans the short list and uses hash lookups to check whether each element is in the other list. So hash is the best if the ratio is very large.
But we can see that ours is still close to hash; hash is the purple line, and ours is the red line. And then on the real data, we normalize the performance of all the algorithms to merge, where merge is one, and the profile shows the relative performance of each. You can see that our random group scan is around half of merge. And in particular, we report the performance for queries with different numbers of keywords. If there are two keywords, merge is here and here. But as the number of keywords in a query grows, we found that the performance improvement is even more, simply because the filtering power of the hash functions grows for queries with more keywords.
Okay. So let's conclude this part. We propose simple and fast set intersection structures and algorithms, with novel guarantees in theory. But more importantly, we want to show they perform really better in practice: they are the best in most cases and otherwise close to the best. For future work, storage compression is an important issue, and we have some preliminary results, but they are not included in this talk. And the second issue is how to select the best algorithm and parameters. A very simple case is that merge sometimes performs better; so how to make a selection between ours and merge, and why.
>>: [indiscernible] in real data, what's the size of the sets?
>> Bolin Ding: You mean in real data, what's --
>>: Yeah, set size, in real data.
>> Bolin Ding: In real data?
>>: Yes.
>> Bolin Ding: Oh, in real data, because the query log is from Bing, for different queries the lengths of the lists may be different. But on average, since we're using Wikipedia to build the word index, I remember it's around 1 million to 5 million. So --
>>: [indiscernible] should keyword [indiscernible].
>> Bolin Ding: Yes.
>>: So if the size of the set is like one million, on one web page, you have one million of this --
>> Bolin Ding: No, no. That means one million pages contain this word. Yeah, and actually, because the web pages here are not partitioned; the technique can actually be used in a partitioned setting as well.
>>: Okay.
>>: So your algorithm and the algorithms you studied here, they are [indiscernible] section of the lists, right?
>> Bolin Ding: Yes.
>>: In practice, especially dealing with more practical ways [indiscernible] to do some sort of [indiscernible] processing, whereas you only need to process a very small portion of the lists. So can your algorithms [indiscernible] to handle such cases?
>> Bolin Ding: Yeah, that's quite possible. But this work focuses on [indiscernible], because sometimes we may need the [indiscernible] to accomplish some task like query optimization. But when only the top-k is interesting, [indiscernible] as well, because most of them require only a linear scan with the list sorted by the relevance score. So actually, we can organize our index [indiscernible] in a similar way. But the extension, yeah, is an interesting piece of future work; it's not completely solved by this one.
>>: I think your first version is basically based on size ranges. This can be very easy to adapt [indiscernible] randomized. The second one, I'm not sure if you have --
>> Bolin Ding: Yeah, for the randomized one, it will work too -- well, a possible way is that we can treat the relevance score as a hash function. But it may not have such perfect performance guarantees in theory. But if we treat it in this way, it will be possible.
>>: So the issue of the set sizes is [indiscernible] hashtag [indiscernible].
>> Bolin Ding: Yes.
>>: So in [indiscernible] kind of ways you might expect, for example,
[indiscernible]. So that will be the common case.
>> Bolin Ding: Okay. So what I want to show here is this. Here's the threshold, and it is shown that hash performs better if the ratio is larger than 100. But given that, our approach is very close to hash. And actually, our technique is about bounding the worst latency. Because if this case happens, the processing time is already very small, so we don't care so much; the difference here is not so much, but we care about the worst case. In this case, the gap here is large and we [indiscernible] them. And another comment is we may need to switch to hash if we can detect this case happening.
>>: I see [indiscernible] we should be able to [indiscernible] this merge we can have [indiscernible] getting from the big structure to merge. But with your way, you don't have much of this benefit.
>> Bolin Ding: Okay. Let me -- so basically, you're saying about the pipeline, right?
>>: Yes.
>> Bolin Ding: Okay. So for the pipeline, a major issue in pipelines is load balancing, right? Or maybe in the distributed environment, if we want several machines to handle one list, one important property is that if the load is balanced among the different machines, then the set intersection can be computed in parallel very efficiently.
And I want to say the advantage of our approach in this environment is that we use a hash function for the grouping. The hash function can group a list into small groups with approximately equal sizes. So this feature can handle the load-balancing issue very well. It simply means in our randomized grouping approach, here, we can put, say, these groups onto one machine and these groups onto the other machine. And when we process them, it can be done in parallel.
>>: [inaudible].
>> Bolin Ding: Do you have any question on this part? Okay, then let's move on to the second part, which is a bit more high level than this one. Sorry, sorry, sorry. Okay. That's it. It's about going beyond set intersection. And the focus of this part is how to optimize the index structure to support taxonomy keyword search. Consider such an example: I want to find a job in Seattle at some IT company. So I type these terms into Bing or Google. In Bing, I find these search results. The first one is not relevant at all. And the second one includes --
>>: [inaudible].
>> Bolin Ding: Yeah. So the second one is about some company called All Covered, but I'm not sure why it is there; it is there only because the terms are included. And all the rest are about some job-seeking websites, and I may need to retype my query there to use the application.
So similar things happen in Google, so we don't need to feel so bad. As a user, what I need to do next with this search engine is to rewrite the IT company as company names, like: Google in Seattle, whether they have any job opening. We found, well, here it is. How about Facebook? We also found some in the top results.
So here's the motivation why this is important for a search engine. The first problem is, if the user has no knowledge about which companies are in Seattle, then this rewriting is essentially impossible for them; they have to refer to other websites. And the second problem is, if there are hundreds of IT companies, do we require the user to rewrite the query hundreds of times? Not reasonable, of course. So our final goal is to provide automatic rewriting to the user, using a taxonomy.
Here is a demo of the ProBase system. So this work originates from ProBase, which is led by Haixun Wang of MSR Asia. And in this demo, it shows that, well, in this example, company is replaced by IT company, tech company. It shows that if we click this button, different possible rewritings will be expanded here. Now, the focus of this work is how to process all the possible rewritings using a carefully designed index.
The motivation: first, enhance the user experience in keyword search. Much work in IR has been done on how to find different query rewritings, but this work focuses on the efficiency part: how to optimize the index structure to efficiently process all the query rewritings.
And this technique can be applied in other applications in text analytics with a taxonomy. So there are two big challenges. One is that for a keyword query and a taxonomy, which is large nowadays, we have a large number of possible query rewritings. On the other hand, the space for the index structure is limited: once the index structure is doubled, that means we need to spend double the amount of money to buy machines.
And then let's come to the taxonomy and define it more formally. A taxonomy is a hierarchy of terms with concept-instance relationships; it could be generated by a particular system. ProBase is currently the largest taxonomy. Our work supposes that this taxonomy is given. The taxonomy can be visualized as a tree; each term has its instances.
And in particular, this relationship is transitive, in the sense that the sub-tree contains all the instances of the concept here. Like IBM, Google and Microsoft are instances of company and also instances of IT company. And taxonomy keyword search is based on such a taxonomy.
So given a query like "database conference, Asian city", database conference can be replaced by SIGMOD, WWW, or VLDB, and similarly for Asian city. In another example, company, or IT company, can be replaced by a bunch of things. And some rewritings may not be so meaningful, like this one. But our goal is, among all these possible rewritings, to answer all of them, even though some of them are not interesting; the user doesn't know that in advance, right? So we want to answer all the possible rewritings in an efficient way.
This is the focus of this work, which will appear in SIGMOD 2012, and it's being integrated into the ProBase system. So it's about the index to support taxonomy keyword search. We are given a taxonomy and try to answer all the possible rewritings. The space is limited, and the problem is how to optimize our index. So the key problem here is what needs to be materialized and what does not.
Let's define taxonomy keyword search more formally. Suppose a query has three terms, T2, T10 and T17, in this taxonomy. Then we can replace each term by each of its instances: like IT company by Facebook, or IT company by IBM, Google, Microsoft. And also for the other terms, like T10, and so on. So there are too many ways to rewrite the query.
The overall picture is that we're not asking the users to rewrite that; this will be handled by the search engine and our technique. So another way to give the same set of results is this: we compute the union of this part. That means we compute the list of documents which contain any instance of T2. This is essentially the union of a bunch of merging lists from the index. And the answer is simply the intersection of these three unions.
But the problem is we cannot store all the result lists in memory; it would just consume too much space. Actually, even with a very simple tree taxonomy, the space will be three [indiscernible], three meaning three times the [indiscernible] lists, which is not acceptable. There are two baselines. One is to index nothing, and then answering this query is equivalent to evaluating such a large DNF. Why is it large? Because this union part is over the instances of some terms; if the taxonomy is large, this part is really huge.
But we also know that this is the lower bound of the space consumption, because we only materialize the [indiscernible]. And on the other side, if we materialize all the result lists, then the query can be evaluated in this way, but we need to materialize the result lists for all the terms. We just cannot afford so much space.
So our proposal is to partially materialize. First, we materialize all the [indiscernible]. And second, for the result lists, we select a subset of the terms to materialize. Here, the green terms are the set P; we only materialize these result lists. Then to process the query, to get the same result as before, we only need to evaluate a smaller DNF: for term T2, this will include the merging lists for T2 and T9, but for instances T3, T6, we can utilize the partial materialization of these result lists.
So this could be, well, something in the middle. But the problem is how to select the set P for materialization. And this is the so-called workload-aware index optimization. The set P is selected based on a workload of queries; in particular, this workload is a query log. For the subset P of terms, we materialize the result lists, but for the others, we materialize only the merging lists. So this P is selected based on the taxonomy structure and our query log.
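To illustrate query evaluation under a partially materialized set P, here is a hedged Python sketch; the data layout (taxonomy as a child map, plus 'inverted' and 'materialized' dictionaries) is my own simplification, not the system's actual structures.

```python
def union_list(term, taxonomy, inverted, materialized):
    """Documents containing `term` or any instance below it in the taxonomy."""
    if term in materialized:
        return set(materialized[term])      # result list was precomputed
    docs = set(inverted.get(term, ()))
    for child in taxonomy.get(term, ()):    # otherwise expand the smaller DNF
        docs |= union_list(child, taxonomy, inverted, materialized)
    return docs

def answer(query_terms, taxonomy, inverted, materialized):
    """Intersecting the per-term unions answers all rewritings at once."""
    lists = [union_list(t, taxonomy, inverted, materialized)
             for t in query_terms]
    return set.intersection(*lists)
```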
So now, the major problem: we have a space budget. We have a constraint that the overall index structure takes space no more than a certain amount. So it becomes an optimization problem, in which we have a constraint on the space, and we want to minimize the cost. I will introduce different cost models later, but the cost is simply the average cost to process this workload on this materialization.
And we use this set of queries Q as something like a training set. So the index is optimized based on Q, but we show that the optimized index is very effective even for future queries.
Here are the different cost models. The first model is the linear scan model: we just apply the merge algorithms. To compute this DNF, we first compute the union for each term in the query using a linear scan, and then we use a linear scan again to compute the set intersection.
Then the overall cost can be approximated by the sizes of the lists we retrieve from the index, because the linear scan essentially scans all these lists. And the second model is hash lookup, which evaluates the DNF in a different way: we select one list as the candidate, and for each element in the candidate, we use hash lookups to check its membership.
So the cost can be approximated by the total number of hash lookups. And this cost can be computed as a weighted sum across all the queries in the workload. There are two interpretations. The first one is that this cost is the expected processing cost for a random query in Q. The other is that if Q is large enough, this cost can be used to predict future queries' processing cost using the index on P.
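A small sketch of how these two cost models and the workload average could be computed; lists_retrieved, list_size and freq are assumed inputs of my own naming, not the paper's API.

```python
def linear_scan_cost(query, P, lists_retrieved, list_size):
    # merge-based evaluation scans every retrieved list once, so the cost
    # is approximated by the total size of the lists pulled from the index
    return sum(list_size[l] for l in lists_retrieved(query, P))

def hash_lookup_cost(query, P, lists_retrieved, list_size):
    # probe with the shortest list as the candidate; the cost is
    # approximated by the number of hash lookups against the other lists
    sizes = sorted(list_size[l] for l in lists_retrieved(query, P))
    return sizes[0] * (len(sizes) - 1)

def workload_cost(Q, freq, P, cost_one):
    # weighted average over the query log: the expected processing cost
    # of a random query drawn from the workload Q
    total = sum(freq[q] for q in Q)
    return sum(freq[q] * cost_one(q, P) for q in Q) / total
```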
Now, if we use this cost model, it can be shown that the problem is [indiscernible]-hard, and we propose different algorithms to select the set P, which is the key problem here. The [indiscernible] basically selects the terms based on their frequency: we'd like to select terms with higher frequency in the workload, because they are likely to be referred to again in the future. But we show that this is suboptimal. In the first case, if T1, T2, T3 are all frequent, they are included in P. But actually, a better solution is to include T4, with even smaller space but similar improvement on the performance.
>>: What's the [indiscernible] example of that?
>> Bolin Ding: You mean here?
>>: Yeah, the keyword [indiscernible] taxonomy [indiscernible].
>> Bolin Ding: Okay. So an example could be like -- suppose T4 is Microsoft. Microsoft is an IT company; IT company [indiscernible] through Microsoft. And in another case, we may say, well, a successful startup could refer to Microsoft. Then we could materialize these things, of course. But if we materialize Microsoft, this materialization can be utilized by all of these terms. So the space we need for this term is smaller, but we can achieve similar improvement on the --
>>: [Inaudible] this example might be [indiscernible] more frequent than [indiscernible].
>> Bolin Ding: Yeah, this may not be so good an example. Yeah. Let me see. Maybe that's --
>>: [inaudible].
>> Bolin Ding: Maybe let's say this one. Suppose this is Facebook and this is FB. Both might be frequent, but we only need to materialize one of them.
Then, because of that -- it's not optimal -- the second one is based on [indiscernible]. So the DP idea is that suppose all the terms are ordered by DFS. Then the decision on whether to include this one is determined only by the status of the path from this term to the root.
Maybe, for the limited time, I'll skip the details. But the key thing is that we can have such a [indiscernible] optimal solution, and it can be computed in this time. But the problem is that the budget B is in the complexity, which can be large. So we need to scale it down. After it is scaled down, we can apply the same idea, but we may not find the optimal solution anymore; it becomes an approximation.
For the third one, we utilize the submodularity of the cost function and use a greedy algorithm. We show that this function is monotone and sub-modular. By sub-modular, I mean that if there are two sets, one a super-set and one a subset, then the benefit of including a term in the super-set is smaller than the benefit of including the same term in the subset. And because of this property, this is the intuition.
We can use a greedy algorithm. It starts from an empty set, and at each iteration it selects the term with maximal marginal benefit, until the space is used up.
It is a constant approximation, based on the submodularity and monotonicity. And this is its time complexity, but it should be noted that the real running time is much faster, because it terminates after the space budget is used up.
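For concreteness, a minimal greedy sketch matching that description; benefit and space here are stand-ins for the paper's cost-reduction and storage functions, so treat this as an assumption-laden illustration.

```python
def greedy_select(terms, benefit, space, budget):
    """Start from the empty set; repeatedly add the term with maximal
    marginal benefit that still fits, until the budget is used up."""
    P, used = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in terms - P:
            if used + space(t) > budget:
                continue
            gain = benefit(P | {t}) - benefit(P)  # marginal benefit of t
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:
            return P          # budget exhausted or no positive gain left
        P.add(best)
        used += space(best)
```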
Now, let's come to the experiments. The taxonomy is given by ProBase, which contains more than 200 thousand terms; it's currently the largest taxonomy. The query log is [indiscernible] from 2007 to February 2010 and contains all [indiscernible] queries with frequency larger than 300. This query workload is used to train the index, but we use a different set of queries from the future, from March to August, to test the optimized index. And the documents are a sample of Bing's web pages. We compare different approaches for the index.
And the summary is: this baseline uses no additional space, but the processing time is bad. This baseline uses three times the space, but it gives a lower bound on the processing time in this model. And we will show that our approach uses only 10% additional space. Go ahead.
>>: [indiscernible]
>> Bolin Ding: ProBase is actually a project on how to build a taxonomy and how to apply it. So this work is an application of this taxonomy.
>>: What is the final keyword [indiscernible]?
>> Bolin Ding: It's just answering all the possible query rewritings. It's just a set intersection of the unions on this taxonomy.
>>: Is there a ranking?
>> Bolin Ding: There's no ranking at this point; we return all the answers. And, of course, ranking can later be included. Let's put it this way.
>>: You are talking [indiscernible].
>> Bolin Ding: Yeah, let's put it this way. The index we constructed could be used for ranking, but the performance evaluation is about the case when we want to extract all the results. Hopefully, this optimized index can perform well for ranking too, because we essentially materialized something similar to the [indiscernible] lists; hopefully the performance of ranking on this optimized [indiscernible] is similar, but we need to evaluate that for sure. Yeah? Any other questions?
So basically, the major conclusion is that we can use a bit more space to achieve similar [indiscernible] processing time as the best possible one. But note that this best possible one uses three times as much space as the [indiscernible] lists. So, different curves.
For the space budget, we found that 10% of the space is enough. This is the lower bound, this is the hash lookup model, and this is how it handles future queries. The query log is from September to February 2010, and the test queries are from March, April, May, up to August. We can see, especially in this one, that our approach is pretty stable for future queries. After August, there is a bit of an increase because the distribution changes, but we can then re-optimize the whole index.
Comparing different models, we also test the scalability on different sizes of the [indiscernible] index. We can see that the performance is stable as we enlarge the size of the materialized index.
And this is to show that the index optimization part only occupies a small portion of the overall pre-processing. One extension is how our algorithm can be extended to handle a general taxonomy; the only thing we need to show is the submodularity of the cost model on a general taxonomy, which can be done.
The second one is that we can derive other algorithms; [indiscernible] pointed out we can use a special cost model for a ranking algorithm here. So essentially, this is a general framework, flexible enough to handle that.
And to conclude this part: we support taxonomy keyword search, formulate the index optimization problem, and show on a Bing data set that it gains such good results.
For future work: how to handle updates, to determine when to re-optimize, and how to enlarge the search space to get better performance.
Any questions for this part? Okay. Let's quickly finish the third part. So the third part is a system we built to support text analytics on multidimensional data.
This was originally funded by NASA. And after we built the system, they were very excited, so I will [indiscernible] the system here briefly instead of the algorithms.
So the system is like this. Suppose on a product review site we have customer reviews associated with different properties like brand [indiscernible], and the query is "lightweight, powerful". We want to find which brand name and which model has these features.
The second one is the NASA [indiscernible]; we originally worked on this one. It's a data set where we have reports written by the captains of each flight after the travel. And the reports are also associated with different attributes, like the weather of this travel, the phase of the anomaly, what kind of anomaly, and some others.
So the problem is how to find under which conditions this kind of action happens and what the reason is, based on the dimensions. It could be caused by the weather, or by the light, or by other things.
So now the big picture is: we have multidimensional data with text and dimensions, and given keyword queries, we want to decide what to rank and how to rank. Our proposal is the text cube model: we aggregate the text data on subsets of dimensions. The aggregation units here are entities formed by a combination of a subset of dimensions and their values.
And actually, this is an entity which combines the [indiscernible] mentioned values. So for example, here's a laptop. Here are the three attributes of the laptop, and this is what the customers say in reviews of the laptop. The attribute settings of the laptop can be combined in any way.
Compared with the ranking-objects approach proposed by people at Microsoft Research, our work was started independently, but I will show the difference. Their work assumes there's a relation, called a document-object relation, to associate each object with a set of documents.
Now, the question is, given the query, how to rank the objects. Differently, we assume we have a document-dimension relation instead of an object-document relation, in which we have the documents and the dimensions. Based on the dimensions, we can generate a hierarchy of entities; from dimensions like model, we can generate a bunch of different settings of different laptops.
Based on that, a simple way is to insert all of the objects into their relation. But the problem is there are too, too many objects to be inserted.
So our proposal is the text cube model. In this cube model, we have different entities in different sub-spaces, and we do a partial materialization of the whole cube, so that the space is not so large but we can do the online query processing efficiently.
This is a simple example. This laptop, Acer 110, aggregates two documents. This one, Acer with system XP, aggregates another two documents. And this is the NASA data.
So the overall objective of our work is: given [indiscernible] data and a keyword query, through the text cube we give a ranking of the cells, the entities of different sub-spaces, where the relevance is the relevance between the text here and the keyword query. So they are ranked.
And we are interested in the top-k. The top-k search algorithm has two challenges. The first one is that the number of [indiscernible] is exponential in the number of dimensions. And the second one is that the relevance score is query-dependent, which means we cannot materialize the score of a cell in pre-processing.
So we proposed this algorithm. I will skip the details, but the major idea is to estimate lower bounds and upper bounds for certain subsets, to speed up the search and try to find a shortcut to the top-k.
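A hedged sketch of such bound-based pruning for top-k cell search; the best-first structure and all the names are my assumptions, since the talk does not spell out the exact algorithm. It also assumes children() enumerates each cell once (e.g., via a canonical parent).

```python
import heapq, itertools

def topk_cells(root_cells, k, children, upper_bound, score):
    """Best-first search over the cube lattice: expand cells in order of
    their upper bound; stop once no remaining bound can beat the k-th score."""
    tie = itertools.count()                       # tie-breaker for the heaps
    frontier = [(-upper_bound(c), next(tie), c) for c in root_cells]
    heapq.heapify(frontier)
    topk = []                                     # min-heap of (score, tie, cell)
    while frontier:
        neg_ub, _, cell = heapq.heappop(frontier)
        if len(topk) == k and -neg_ub <= topk[0][0]:
            break                                 # shortcut: prune the rest
        s = score(cell)                           # query-dependent relevance
        if len(topk) < k:
            heapq.heappush(topk, (s, next(tie), cell))
        elif s > topk[0][0]:
            heapq.heapreplace(topk, (s, next(tie), cell))
        for child in children(cell):
            heapq.heappush(frontier, (-upper_bound(child), next(tie), child))
    return sorted(((s, c) for s, _, c in topk), key=lambda t: -t[0])
```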
This is a very simple experiment on efficiency to show the scalability. This is the baseline and this is our approach. The baseline is very simple, so the comparison may not be so [indiscernible], but I will show you the system, which can show the effectiveness.
And this system is based on NASA's data, as I described above, and here's the address; you can access it from any place. The interface is like this: we simply type keywords here and search.
This is the system the NASA people were excited about. For example, in this sample, we try to find under what kind of situation a runway excursion happens. Runway excursion simply means a very dangerous event when the aircraft wants to take off or land. We found that a rainy night is a very dangerous condition for this event: when this happens, the score of the documents in the cells related to this one will be very high.
Another interesting observation is that this model seems to run into this problem more easily. And the purpose of this system is to give, say, some training to new captains, to let them pay more attention when this condition happens, to avoid this event.
Another example: we are interested in what will cause flight delay. So we type "flight delay". We found that mostly, it's something related to the company and their policy. It may be because of the weather or because of some problem, and in some cases, it's interesting to note that the flight delay is caused by improper maintenance. This is because once the flight wants to take off, suddenly some maintenance problem is found; this will also cause flight delay. So this is something interesting for this query.
And the third one: what will light fog cause for a flight? The top two results are excursions on the taxiway and runway. Especially when the fog happens, the reaction of the captain could be a rejected takeoff; when we find the weather's bad, we found flight delays, and they just reject the takeoff. And it's also more critical during the night; light fog could be even more dangerous. And several other examples, like when the --
>>: So you're not really trimming out -- say all the data occurred in 2010. So would you always show the column that said -- say there's a column, the year that it occurred. Would you always return that column as being 2010? I guess you would.
I'm just saying, say the data is the same in all the documents that come back for a given attribute. Are you actually removing that attribute from being relevant? Because say it occurs in all the documents across the entire [indiscernible], as opposed to just the ones that you're searching for.
>> Bolin Ding: Same document or same query term?
>>: Let's say your whole corpus is just 2010 events.
>> Bolin Ding: Okay.
>>: And then you type in some keyword which restricts the documents. Are you still going to return 2010? Let's say there's an attribute called year, right. Are you still going to return 2010 as being an important attribute?
>> Bolin Ding: Well, in that case, it's not interesting to show 2010, right. Then we'll replace 2010 with a star here, because what we return is only a subset of dimensions which are relevant to the query, and their values. Yeah.
>>: Okay. But then you also have to look at all the other documents in the
queries and say that it's not really, you know, there's no decision tree that's
separating, like, the documents that have this keyword versus the other
documents, right?
>> Bolin Ding: Well, it's not a decision tree. Actually, it's a query-dependent procedure: for different queries, different cells may be ranked as the top ones. So it's not a decision tree; it's essentially a ranking technique.
>>: Just within the documents that have those keywords are you going to do that
ranking? You're not going to rank those features against all the other
documents that don't have the keywords?
>> Bolin Ding: Well, the suggestion you mention is related to, well, the IR model we apply. On the other side, if one term appears in many other documents, then it's not so interesting, because all the documents contain this term; then including this term will have a lower weight compared to including other rare terms. It's essentially about what kind of ranking score, what scoring function, we use.
I'll skip a lot of details, but the key point is that our search algorithm can be combined with different ranking functions. And in this system, we simply use the TF-IDF model from IR, which is typically used in analyzing terms, to get the relevance score here.
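As an illustration of the standard TF-IDF scoring mentioned here, applied to a cell's aggregated text (a textbook formula, not necessarily the system's exact scoring function):

```python
import math

def tfidf_score(query_terms, cell_docs, df, total_docs):
    """Score the text aggregated in one cell against the keyword query.
    cell_docs: list of token lists; df: corpus document frequencies."""
    text = [w for doc in cell_docs for w in doc]   # aggregate the cell's text
    score = 0.0
    for t in query_terms:
        tf = text.count(t) / max(len(text), 1)     # term frequency in the cell
        idf = math.log(total_docs / (1 + df.get(t, 0)))  # rarer terms weigh more
        score += tf * idf
    return score
```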
>>: So your relevance score -- I mean, if you look at the lattice of the cube, typically the cube's [indiscernible] lattice tends to have more [indiscernible]. It seems like when we have an aggregation of these top cubes, we keep getting a higher number of [indiscernible].
>> Bolin Ding: Actually, it will happen in the reverse way, because for cells at a higher level, all kinds of documents are aggregated together, right. So then for a particular query, the relevance score could be lower, because it aggregates essentially different situations, and maybe only a certain type of situation is relevant to the query.
But the problem you mention actually [indiscernible] in our system, because once we go down to a single document, there could be some over-fitting issue. That means if one document is very relevant to the query, it's possible that the top cell could be just this document. But this is not interesting, because we want our cells to be large enough to support the evidence. So actually, we have a parameter called the support threshold, which is the minimum number of documents a cell needs to contain in order to be output in the top-k. So yes, that's how we handle that.
Any questions? That's, I think, all, and it's been almost one and a half hours. Yeah. And these are different examples, and I would like to say thank you. Here's an overview of my research; if you're interested, we can discuss some components of it. And otherwise, that's it. Thank you.