XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS
Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram

Abhishek Chennaka, Alekhya Gade
Advanced Database Systems - Semester Project

OUTLINE
• Introduction
• Ranking Idea
• Search Techniques
• Experimental Evaluations
• Conclusion

INTRODUCTION
• Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
• XML can have user-defined tags, which can be nested.
• HTML is a presentation language and hence cannot capture much semantics.
• HTML search techniques cannot be employed for XML searches.
• XQuery is complicated for end users.
• XRANK provides a simple keyword search query interface.

INTRODUCTION
• Challenges:
  • The element containing the search keywords is returned, not the whole document.
  • The ranking of elements depends on several factors.
  • Keyword proximity has to be considered in two dimensions: keyword distance and ancestor distance.

INTRODUCTION
• XML Data Model:
  • A collection of hyperlinked XML documents can be defined as a directed graph G = (N, CE, HE):
    • N : the set of nodes, N = NE ∪ NV
    • NE : the set of elements
    • NV : the set of values
    • CE : the set of containment edges relating nodes
    • HE : the set of hyperlink edges relating nodes

RANKING IDEA
• ElemRank – for ranking a single element.
• Overall rank – for ranking an ancestor of an element by considering the ElemRank of the child element.

RANKING IDEA – ELEMRANK
• ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
• It is obtained by refining Google's PageRank algorithm.
• PageRank: the PageRank of a document v, p(v), is defined in terms of:
  • Nd – the total number of documents.
  • Nh(u) – the number of outgoing hyperlinks from document u.
  • d – a constant (typically 0.85).

RANKING IDEA – ELEMRANK
• But PageRank is unidirectional.
• We need ElemRank (denoted by the function e()) to be bidirectional.
• So reverse containment edges are added to the formula, which uses:
  • v – the element for which the rank is being calculated.
  • Ne – the number of XML elements.
  • Nh(u) – the number of outgoing hyperlinks from u.
  • Nc(u) – the number of sub-elements of u.
  • d – a constant (typically 0.85).
  • E = HE ∪ CE ∪ CE⁻¹, where CE⁻¹ is the set of reverse containment edges.

RANKING IDEA – ELEMRANK
• But containment edges and hyperlink edges need to be differentiated.
• After differentiating hyperlink edges from containment edges, the formula uses:
  • v – the element for which the rank is being calculated.
  • Ne – the number of XML elements.
  • Nh(u) – the number of outgoing hyperlinks from u.
  • Nc(u) – the number of sub-elements of u.
  • d1, d2 – the probabilities of navigating through hyperlinks and containment edges, respectively.

RANKING IDEA – ELEMRANK
• But this still weights forward and reverse containment relationships equally.
• After differentiating hyperlink edges, containment edges, and reverse containment edges, the formula uses:
  • v – the element for which the rank is being calculated.
  • Ne – the number of XML elements.
  • Nh(u) – the number of outgoing hyperlinks from u.
  • Nde(v) – the number of elements in the XML document containing the element v.
  • Nc(u) – the number of sub-elements of u.
  • d1, d2, d3 – the probabilities of navigating through hyperlinks, forward containment edges, and reverse containment edges, respectively.

RANKING IDEA – OVERALL RANK
• The rank of an element v1 with respect to a keyword ki is calculated from the element vt that contains ki.
• decay is a parameter that can be set to a value in the range 0 to 1.
• For multiple occurrences of ki in v1, the individual ranks are combined by a function; here the function is the maximum of the ranks of v1 with respect to the m occurrences of ki.

RANKING IDEA – OVERALL RANK
• The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
• The function p(v1, k1, k2, …, kn) can be any function that ranges from 0 to 1.
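Using the symbols listed on the ElemRank slides, the PageRank starting point and its three refinements can be written out as below. This is a reconstruction from those symbol lists and from the description of each step, offered as a reading aid rather than a verbatim copy of the slides. In particular, the "+1" in the denominators (counting the edge to the parent) and the N_d × N_de(v) base term in the last equation are assumptions about how the listed quantities enter.

```latex
\begin{align*}
% PageRank over hyperlink edges only
p(v) &= \frac{1-d}{N_d} + d \sum_{(u,v)\in HE} \frac{p(u)}{N_h(u)} \\[4pt]
% Step 1: add containment and reverse containment edges, E = HE \cup CE \cup CE^{-1}
e(v) &= \frac{1-d}{N_e} + d \sum_{(u,v)\in E} \frac{e(u)}{N_h(u)+N_c(u)+1} \\[4pt]
% Step 2: separate hyperlink edges (d_1) from containment edges (d_2)
e(v) &= \frac{1-d_1-d_2}{N_e}
      + d_1 \sum_{(u,v)\in HE} \frac{e(u)}{N_h(u)}
      + d_2 \sum_{(u,v)\in CE\cup CE^{-1}} \frac{e(u)}{N_c(u)+1} \\[4pt]
% Step 3: separate forward (d_2) from reverse (d_3) containment edges
e(v) &= \frac{1-d_1-d_2-d_3}{N_d \, N_{de}(v)}
      + d_1 \sum_{(u,v)\in HE} \frac{e(u)}{N_h(u)}
      + d_2 \sum_{(u,v)\in CE} \frac{e(u)}{N_c(u)}
      + d_3 \sum_{(u,v)\in CE^{-1}} e(u)
\end{align*}
```

Read this way, a parent splits its rank among its sub-elements along forward containment edges, while each sub-element passes its rank to the parent undivided along the reverse edge.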
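The overall-rank computation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's implementation: it assumes that the rank of v1 with respect to one occurrence of a keyword is ElemRank(vt) scaled by decay once per containment edge between v1 and vt, that multiple occurrences are combined with max, and that the proximity value p is supplied by the caller. The names `occurrence_rank`, `keyword_rank`, and `overall_rank`, and the sample numbers, are hypothetical.

```python
from typing import Dict, List, Tuple

# One occurrence of a keyword below v1:
# (ElemRank of the containing element vt, containment edges between v1 and vt).
Occurrence = Tuple[float, int]


def occurrence_rank(elem_rank: float, edges: int, decay: float) -> float:
    """Assumed form: ElemRank(vt) scaled by decay once per containment edge."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in the range 0 to 1")
    return elem_rank * decay ** edges


def keyword_rank(occurrences: List[Occurrence], decay: float) -> float:
    """Combine multiple occurrences of one keyword with max (0.0 if none)."""
    return max((occurrence_rank(e, k, decay) for e, k in occurrences), default=0.0)


def overall_rank(per_keyword: Dict[str, List[Occurrence]],
                 decay: float, proximity: float) -> float:
    """Sum of the per-keyword ranks, multiplied by the proximity measure p."""
    if not 0.0 <= proximity <= 1.0:
        raise ValueError("proximity must be in the range 0 to 1")
    return sum(keyword_rank(occ, decay) for occ in per_keyword.values()) * proximity


# Hypothetical occurrences of two query keywords below one candidate element v1.
example = {
    "XQL": [(0.4, 1), (0.1, 0)],
    "Ricardo": [(0.2, 2)],
}
score = overall_rank(example, decay=0.5, proximity=0.5)
print(score)
```

With a decay below 1, an occurrence deep inside v1 contributes less than one close to it, which is how ancestor distance enters the ranking.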
SEARCH TECHNIQUES – NAÏVE APPROACH
• Main difference between XML and HTML keyword search:
  • The granularity of query results
  • XML keyword search returns elements
  • HTML keyword search returns documents
• One way to do XML keyword search:
  • Treat each element as a document
• Problems:
  • Space overhead
  • Spurious query results
  • Inaccurate ranking of results

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
• Dewey IDs idea:

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
• For each keyword, an inverted list of all the elements that contain the keyword is created.
• Each entry has three fields: the Dewey ID of the element, its ElemRank, and the positions in the element where the keyword occurs.
• The list is sorted by Dewey ID.

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
• This algorithm works in a single pass.
• The key idea is to merge the keyword inverted lists while simultaneously computing the longest common prefix of the Dewey IDs in the different lists.

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
• Example Dewey IDs: 5.0.3.0, 5.0.3.0.0, 5.0.3.0.1

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
• “If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”
• We can start directly with the elements that are likely to have higher ranks.
• In this way, we compute only the top m results requested by the user rather than all of them.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
• In RDIL:
  • Inverted lists are ordered by ElemRank.
  • Each inverted list has a B+-tree index on the Dewey ID field.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Working:
• Pick a keyword ki and take the Dewey ID of a top-ranked element containing ki.
• Now pick another keyword kj and, from its B+-tree (which is sorted by Dewey ID), pick the smallest Dewey ID that is greater than the Dewey ID of ki.
• The Dewey ID in kj's list that shares the longest common prefix with the Dewey ID of ki is either the Dewey ID we just picked or its predecessor in the B+-tree.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Example:
• Consider the query “XQL Ricardo”.
• Dewey ID 9.0.4.2.0 is a top-ranked Dewey ID that contains the keyword “XQL”.
• Pick the smallest Dewey ID greater than 9.0.4.2.0 from the leaf nodes of the B+-tree for the keyword “Ricardo”.
• Consider the IDs 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, … in the B+-tree of “Ricardo”.
• We pick the ID 9.0.5.6, as it is the first one greater than 9.0.4.2.0.
• The Dewey ID with the longest common prefix will be either 9.0.5.6 or its predecessor, 9.0.4.1.2.
• Here the predecessor 9.0.4.1.2 shares the longer prefix, 9.0.4, so the element with Dewey ID 9.0.4 contains both “XQL” and “Ricardo”.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
• Consider a query whose keywords occur relatively frequently in the document collection but rarely occur together in the same document.
• RDIL has to scan most (or all) of the inverted lists to produce the output.
• The overhead of performing random index lookups in RDIL can sometimes outweigh the benefit of processing the inverted lists in rank order.

SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)
• The key idea is to combine the benefits of both DIL and RDIL.
• We dynamically switch from RDIL to DIL depending on the query performance.
• So we need inverted lists sorted by ElemRank for RDIL and by Dewey ID for DIL.
• But RDIL is likely to outperform DIL only if it scans a small fraction of the full inverted list.
• So we store only a small fraction of the inverted list sorted by rank.
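The Dewey ID lookup behind the RDIL example above (“XQL Ricardo”) can be sketched as follows. This is a minimal illustration, not the system's code: a sorted Python list searched with `bisect` stands in for the leaf level of the B+-tree, Dewey IDs are parsed into integer tuples so that they compare component by component, and the helper names `parse`, `common_prefix`, and `deepest_common_ancestor` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

DeweyID = Tuple[int, ...]


def parse(dewey: str) -> DeweyID:
    """Turn '9.0.4.2.0' into (9, 0, 4, 2, 0) so IDs sort numerically."""
    return tuple(int(part) for part in dewey.split("."))


def common_prefix(a: DeweyID, b: DeweyID) -> DeweyID:
    """Longest common prefix of two Dewey IDs, i.e. their deepest shared ancestor."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def deepest_common_ancestor(probe: DeweyID, sorted_ids: List[DeweyID]) -> DeweyID:
    """Longest prefix that `probe` shares with any ID in `sorted_ids`.

    Because the list is sorted, only the first ID greater than `probe`
    and its predecessor need to be checked.
    """
    pos = bisect_right(sorted_ids, probe)
    candidates = sorted_ids[max(pos - 1, 0):pos + 1]
    return max((common_prefix(probe, c) for c in candidates), key=len, default=())


# IDs from the slide: the list for "Ricardo" and a top-ranked ID for "XQL".
ricardo = [parse(s) for s in ["8.2.1.4.2", "9.0.4.1.2", "9.0.5.6", "10.8.3"]]
xql_top = parse("9.0.4.2.0")
ancestor = deepest_common_ancestor(xql_top, ricardo)
print(".".join(map(str, ancestor)))  # prints 9.0.4
```

The same `common_prefix` helper is the comparison that a single-pass DIL merge would apply to consecutive entries of the Dewey-ordered lists.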
SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)
• The dynamic switching between RDIL and DIL is based on the following factors:
  • The time spent so far, t
  • The number of results above the threshold so far, r
• Based on these, we estimate the remaining time for RDIL as (m − r) × t / r, where m is the number of results requested.
• Switch to DIL if this estimate is more than the expected time for DIL.
• We start with RDIL and then switch to DIL based on the above computation.

EXPERIMENTAL EVALUATIONS
• Data sets used: DBLP and XMark.
• We measure the time taken by each of the search techniques as a function of the number of keywords and the correlation among them.

CONCLUSION
• We have presented the design, implementation, and evaluation of the XRANK system for ranked keyword search over XML documents, taking into account:
  • (a) the hierarchical and hyperlinked structure of XML documents
  • (b) a two-dimensional notion of keyword proximity, when computing the ranking for XML keyword search queries

THANK YOU.