XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Abhishek Chennaka, Alekhya Gade

XRANK: RANKED
KEYWORD SEARCH OVER
XML DOCUMENTS
Lin Guo
Feng Shao
Chavdar Botev
Jayavel Shanmugasundaram
Abhishek Chennaka, Alekhya Gade
Advanced Database Systems - Semester Project
OUTLINE
• Introduction
• Ranking Idea
• Search Techniques
• Experimental Evaluations
• Conclusion
INTRODUCTION
• Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
• XML allows user-defined tags, which can be nested.
• HTML is a presentation language and hence cannot capture much semantics.
• HTML search techniques cannot be employed for XML searches.
• XQuery is too complicated for end users.
• XRANK provides a simple keyword search query interface.
INTRODUCTION
• Challenges:
• Results are XML elements that contain the search keywords, not entire documents.
• The ranking of the elements depends on several factors.
• Keyword proximity has to be considered in two dimensions: keyword distance and ancestor distance.
INTRODUCTION
• XML Data Model :
• A collection of hyperlinked XML documents can be defined as a directed graph:
G = (N, CE, HE)
N : The set of nodes, N = NE ∪ NV
NE : The set of elements
NV : The set of values
CE : The set of containment edges relating nodes
HE : The set of hyperlink edges relating nodes
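One way to hold this graph in memory is sketched below in Python. The class name `XmlGraph`, its field names, and the sample nodes are illustrative assumptions, not part of XRANK.

```python
from dataclasses import dataclass, field


@dataclass
class XmlGraph:
    """G = (N, CE, HE): nodes plus containment and hyperlink edges."""
    elements: set = field(default_factory=set)      # NE: element nodes
    values: set = field(default_factory=set)        # NV: value nodes
    containment: set = field(default_factory=set)   # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE: (source, target) pairs

    @property
    def nodes(self):
        # N = NE U NV
        return self.elements | self.values


# Hypothetical sample data: a paper element with a title and a citation link.
g = XmlGraph()
g.elements |= {"paper", "title", "cite"}
g.values |= {"XRANK"}
g.containment |= {("paper", "title"), ("paper", "cite"), ("title", "XRANK")}
g.hyperlinks |= {("cite", "paper")}
```

Keeping CE and HE as separate edge sets matters later, because ElemRank weights the two kinds of edges differently.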
RANKING IDEA
• ElemRank – For ranking a single element
• Overall rank – For ranking an ancestor of an element by considering the ElemRank value of the child element.
RANKING IDEA – ELEMRANK
• ElemRank is a measure of the objective importance of an XML element and is based
on the hyperlinked structure of XML docs.
• This is obtained by refining the PageRank algorithm of Google.
• PageRank: the PageRank of a document v, p(v), is defined in terms of:
• Nd is the total number of documents.
• Nh (u) is the number of outgoing hyperlinks from document u.
• d is a constant (typically 0.85).
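The equation itself is not preserved on this slide. The standard PageRank formula, written with the symbols defined above and with HE as the set of hyperlink edges from the data model, is:

```latex
p(v) = \frac{1 - d}{N_d} + d \times \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)}
```

The first term is the probability of a random jump to v, and the second is the rank that flows into v along incoming hyperlinks.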
RANKING IDEA – ELEMRANK
• But PageRank propagates rank in one direction only.
• We need ElemRank (denoted by the function e()) to be bidirectional, so reverse containment edges are added to the formula.
• v – Element for which the rank is being calculated.
• Ne – Number of XML elements.
• Nh (u) – Number of outgoing hyperlinks from element u.
• Nc (u) – Number of sub-elements of u.
• d is a constant (typically 0.85).
• E = HE ∪ CE ∪ CE⁻¹, where CE⁻¹ is the set of reverse containment edges.
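The equation for this step is not preserved on the slide. A reconstruction consistent with the symbols above, to be read as a sketch rather than a verbatim copy, is:

```latex
e(v) = \frac{1 - d}{N_e} + d \times \sum_{(u,v) \in E} \frac{e(u)}{N_h(u) + N_c(u) + 1}
```

The denominator counts every edge leaving u: its hyperlinks, its containment edges to sub-elements, and one reverse containment edge to its parent.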
RANKING IDEA – ELEMRANK
• But containment edges and hyperlink edges need to be differentiated.
• After differentiating the hyperlink edges and containment edges, the formula is defined in terms of:
• v – Element for which the rank is being calculated.
• Ne – Number of XML elements.
• Nh (u) – Number of outgoing hyperlinks from element u.
• Nc (u) – Number of sub-elements of u.
• d1 and d2 are the probabilities of navigating through hyperlinks and containment edges, respectively.
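The equation for this refinement is not preserved on the slide. A reconstruction consistent with the symbols above, offered as a sketch, separates the hyperlink term from the containment term:

```latex
e(v) = \frac{1 - d_1 - d_2}{N_e}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE \cup CE^{-1}} \frac{e(u)}{N_c(u) + 1}
```

Forward and reverse containment edges still share the single weight d2 here.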
RANKING IDEA – ELEMRANK
• But this formula still weights forward and reverse containment relationships equally.
• After differentiating the hyperlink edges, containment edges and reverse containment edges, the formula is defined in terms of:
• v – Element for which the rank is being calculated.
• Ne – Number of XML elements.
• Nh (u) – Number of outgoing hyperlinks from element u.
• Nde (v) – Number of elements in the XML document containing the element v.
• Nc (u) – Number of sub-elements of u.
• d1 , d2 , and d3 are the probabilities of navigating through hyperlinks, forward containment edges, and reverse containment edges, respectively.
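The final equation is not preserved on the slide. A reconstruction consistent with the symbols above is given below as a sketch. It assumes Nd is the total number of documents, as in the PageRank definition.

```latex
e(v) = \frac{1 - d_1 - d_2 - d_3}{N_d \times N_{de}(v)}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
     + d_3 \sum_{(u,v) \in CE^{-1}} e(u)
```

In this form a parent's rank is split among its sub-elements through the d2 term, while each sub-element passes its rank up to its parent undivided through the d3 term.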
RANKING IDEA – OVERALL RANK
The rank of v1 with respect to one occurrence of the keyword ki is calculated from the ElemRank of vt, the descendant element that directly contains ki, scaled down by decay for each containment level between v1 and vt.
decay is a parameter that can be set to a value in the range 0 to 1.
For multiple occurrences of ki in v1, the ranks are combined by a function f, here the maximum of the ranks of v1 over the m occurrences of ki.
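The equations on this slide are not preserved. As a sketch, assuming v1, v2, …, vt is the containment path from v1 down to the element vt that directly contains ki, the two ranks can be written as:

```latex
r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathit{decay}^{\,t-1}

\hat{r}(v_1, k_i) = f(r_1, r_2, \ldots, r_m), \qquad f = \max
```

Here r_1, …, r_m are the ranks of v1 for each of the m occurrences of ki, and the exponent t − 1 is the number of containment edges between v1 and vt.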
RANKING IDEA – OVERALL RANK
• The overall ranking is the sum of the ranks with respect to each query keyword,
multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
• Function p(v1 , k1 , k2 , …, kn ) can be any function that ranges from 0 to 1.
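A minimal Python sketch of the overall ranking described above follows. The function names, the sample ElemRank values, the decay of 0.5, and the proximity of 0.5 are all illustrative assumptions. Per-occurrence ranks use decay raised to the number of containment levels, and occurrences of one keyword are combined with max.

```python
def occurrence_rank(elem_rank_vt, levels, decay):
    """Rank of ancestor v1 for one keyword occurrence.

    levels is the number of containment edges from v1 down to vt (t - 1).
    """
    return elem_rank_vt * decay ** levels


def keyword_rank(occurrences, decay):
    """Combine all occurrences of one keyword under v1 using max."""
    return max(occurrence_rank(e, levels, decay) for e, levels in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay):
    """Sum of the per-keyword ranks, multiplied by a proximity value in [0, 1]."""
    if not 0.0 <= proximity <= 1.0:
        raise ValueError("proximity must lie in [0, 1]")
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return total * proximity


# Hypothetical query with two keywords; each occurrence is (ElemRank(vt), levels).
k1_occurrences = [(0.4, 1), (0.2, 0)]
k2_occurrences = [(0.8, 2)]
score = overall_rank([k1_occurrences, k2_occurrences], proximity=0.5, decay=0.5)
```

A keyword that occurs deep below v1 contributes less than one that occurs directly in v1, which is the ancestor-distance dimension of proximity.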
SEARCH TECHNIQUES – NAÏVE APPROACH
• Main Difference between XML and HTML keyword search:
• The granularity of query results
• XML keyword search returns elements
• HTML keyword search returns documents
• One way to do XML keyword search
• Treat each element as a document
• Problems:
• Space Overhead
• Spurious Query Results
• Inaccurate ranking of results
SEARCH TECHNIQUES – DEWEY INVERTED
LIST (DIL)
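• Dewey IDs idea: each element's ID is its parent's ID followed by the element's position among its siblings, so the ID of an ancestor is a prefix of the IDs of all its descendants.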
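A short Python sketch of Dewey numbering, assuming the usual scheme in which a child's ID is its parent's ID plus the child's position among its siblings. The sample document, the leading document number 5, and the function name `assign_dewey_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET


def assign_dewey_ids(root, root_id):
    """Map Dewey ID tuples to elements by walking the tree top-down."""
    ids = {}

    def visit(elem, dewey):
        ids[dewey] = elem
        # Each child extends the parent's ID with its sibling position.
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, root_id)
    return ids


doc = ET.fromstring(
    "<workshop><title>XML</title>"
    "<paper><title>XQL</title><author>Ricardo</author></paper></workshop>"
)
ids = assign_dewey_ids(doc, root_id=(5,))
for dewey, elem in sorted(ids.items()):
    print(".".join(map(str, dewey)), elem.tag)
```

Because every ancestor's ID is a prefix of its descendants' IDs, finding a common ancestor of two elements reduces to finding the common prefix of their IDs.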
SEARCH TECHNIQUES – DEWEY INVERTED
LIST (DIL)
• For each keyword, an inverted list of all the elements that directly contain the keyword is created.
• Each entry has three fields: the Dewey ID of the element, its ElemRank, and the positions in the element where the keyword occurs.
• The list is sorted by Dewey ID.
SEARCH TECHNIQUES – DEWEY INVERTED
LIST (DIL)
• This algorithm works in a single pass.
• Key idea is to merge the keyword inverted lists by simultaneously computing the
longest common prefix of the Dewey IDs in the different lists.
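The primitive behind this merge is the longest common prefix of Dewey IDs, sketched below in Python. This is only the prefix computation, not the full single-pass merge, and the helper names are assumptions.

```python
def parse_dewey(text):
    """Turn '5.0.3.0.1' into the tuple (5, 0, 3, 0, 1)."""
    return tuple(int(part) for part in text.split("."))


def longest_common_prefix(*dewey_ids):
    """Longest prefix shared by all given Dewey ID tuples."""
    prefix = []
    for components in zip(*dewey_ids):
        # Stop at the first position where the IDs disagree.
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return tuple(prefix)


a = parse_dewey("5.0.3.0.0")
b = parse_dewey("5.0.3.0.1")
ancestor = longest_common_prefix(a, b)
```

The resulting prefix is the Dewey ID of the deepest element that contains both input elements.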
SEARCH TECHNIQUES – DEWEY INVERTED
LIST (DIL)
[Figure: DIL merge example using Dewey IDs 5.0.3.0, 5.0.3.0.0, and 5.0.3.0.1]
SEARCH TECHNIQUES – RANKED DEWEY
INVERTED LIST (RDIL)
• “If inverted lists are long (due to common keywords or large document collections)
even the cost of a single scan of the inverted list can be expensive, especially if the
users want only the top few results”
• Instead, we can start directly with the elements that are likely to have higher ranks.
• In this way, we compute only the top m results requested by the user rather than all of them.
SEARCH TECHNIQUES – RANKED DEWEY
INVERTED LIST (RDIL)
• In RDIL,
• Inverted lists are ordered by ElemRank.
• Each inverted list has a B+-tree index on the Dewey ID field.
SEARCH TECHNIQUES – RANKED DEWEY
INVERTED LIST (RDIL)
Working:
• Pick a keyword ki and read from its inverted list the Dewey ID of a top-ranked element containing ki.
• Pick another keyword kj and, from its B+-tree (which is sorted by Dewey ID), find the smallest Dewey ID that is greater than the Dewey ID found for ki.
• The entry for kj that shares the longest common prefix with the Dewey ID found for ki is either the Dewey ID just picked or its immediate predecessor in the B+-tree. That prefix identifies the deepest element containing both keywords.
SEARCH TECHNIQUES – RANKED DEWEY
INVERTED LIST (RDIL)
Example:
• Consider the query “XQL Ricardo”.
• The Dewey ID 9.0.4.2.0 belongs to a top-ranked element that contains the keyword “XQL”.
• Look up the smallest Dewey ID greater than 9.0.4.2.0 in the leaf nodes of the B+-tree for the keyword “Ricardo”.
• Consider the IDs 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, … in the B+-tree of “Ricardo”.
• We pick the ID 9.0.5.6, as it is the smallest ID greater than 9.0.4.2.0.
• The Dewey ID with the longest common prefix is either 9.0.5.6 or its predecessor, 9.0.4.1.2.
• 9.0.4.1.2 shares the prefix 9.0.4 with 9.0.4.2.0, whereas 9.0.5.6 shares only 9.0, so the element with Dewey ID 9.0.4 is the deepest element that contains both “XQL” and “Ricardo”.
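The lookup in this example can be sketched in Python. A sorted list searched with `bisect` stands in for the leaf level of the B+-tree, which is a simplifying assumption. The function names are illustrative.

```python
from bisect import bisect_right


def parse_dewey(text):
    """Turn '9.0.4.2.0' into the tuple (9, 0, 4, 2, 0)."""
    return tuple(int(part) for part in text.split("."))


def common_prefix_len(a, b):
    """Number of leading components that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def deepest_common_ancestor(target, sorted_ids):
    """Deepest element containing target and some entry of sorted_ids.

    Only the successor and the predecessor of target need to be checked.
    """
    i = bisect_right(sorted_ids, target)           # first ID greater than target
    candidates = sorted_ids[max(i - 1, 0): i + 1]  # predecessor and successor
    if not candidates:
        return ()
    best = max(candidates, key=lambda d: common_prefix_len(target, d))
    return target[:common_prefix_len(target, best)]


ricardo = [parse_dewey(s) for s in ["8.2.1.4.2", "9.0.4.1.2", "9.0.5.6", "10.8.3"]]
xql = parse_dewey("9.0.4.2.0")
result = deepest_common_ancestor(xql, ricardo)
```

Comparing Dewey IDs as integer tuples gives the same order as the B+-tree, including the rule that an ancestor sorts before its descendants.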
SEARCH TECHNIQUES – RANKED DEWEY
INVERTED LIST (RDIL)
• Consider an individual query where keywords occur relatively frequently in the
document collection but rarely occur together in the same document.
• RDIL has to scan most (or all) of the inverted lists to produce the output.
• The overhead of performing random index lookups in RDIL can sometimes outweigh
the benefit of processing the inverted lists in rank order
SEARCH TECHNIQUES – HYBRID DEWEY
INVERTED LIST (HDIL)
• The key idea here is to combine the benefits of both DIL and RDIL.
• We dynamically switch between RDIL and DIL depending on the query performance.
• So we will need to have Inverted list sorted by ElemRank for RDIL and Dewey ID for
DIL.
• But RDIL is likely to outperform DIL only if it scans a small fraction of the full inverted
list.
• So we store only a small fraction of the inverted list sorted by rank.
SEARCH TECHNIQUES – HYBRID DEWEY
INVERTED LIST (HDIL)
• The dynamic switching between RDIL and DIL is based on the following factors:
• The time spent so far – t
• The number of results above the threshold so far – r
• Based on this, we estimate the remaining time for RDIL as (m − r) × t / r, where m is the number of results requested.
• Switch to DIL if this estimate is more than the expected time for DIL.
• We start with RDIL and switch to DIL based on the above computation.
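The switching rule can be sketched in Python as below. The function names and the sample timings are illustrative assumptions. The slide does not say what happens when no results have been found yet (r = 0), so this sketch simply rejects that case.

```python
def estimated_rdil_remaining(t, r, m):
    """Estimate the remaining RDIL time as (m - r) * t / r.

    t: time spent so far, r: results above the threshold so far,
    m: number of results requested.
    """
    if r <= 0:
        raise ValueError("estimate is undefined until at least one result is found")
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL's estimated remaining time exceeds DIL's expected time."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time


# Hypothetical measurements (seconds): slow progress favours switching to DIL.
slow = should_switch_to_dil(t=2.0, r=2, m=10, expected_dil_time=5.0)
fast = should_switch_to_dil(t=2.0, r=8, m=10, expected_dil_time=5.0)
```

The estimate assumes that the remaining results arrive at the same average rate as the results found so far.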
EXPERIMENTAL EVALUATIONS
• Data Sets Used : DBLP and XMark.
• We measure the time taken by each of the search techniques as the number of keywords and the correlation among them are varied.
CONCLUSION
• We have presented the design, implementation and evaluation of the XRANK system
for ranked keyword search over XML documents taking into account:
• (a) the hierarchical and hyperlinked structure of XML documents
• (b) a two-dimensional notion of keyword proximity, when computing the ranking for XML
keyword search queries
THANK YOU.