Efficient Physical Operators for a cost-based XPath Execution Engine
Haris Georgiadis, Minas Charalambides, Vasilis Vassalos
Athens University of Economics and Business

Motivation (1)
XPath query: /s/r/*/it[mb/m/to='x']//k
Three navigation alternatives (among others):
- Straightforward navigation: retrieve all it elements under /s/r/*/it; keep those having at least one to descendant under /mb/m/to with text value 'x' and, for the it elements left, return their k descendants.
- Starting from to: retrieve all to elements under /s/r/*/it/mb/m/to; keep only those with text value 'x', then go backward via parent::m/parent::mb/parent::it. For the it elements left, return their k descendants.
- Starting from k: return all k elements with at least one it ancestor, which in turn:
  • has a to descendant under /mb/m/to with text value 'x', and
  • has an s document element ancestor via relative path parent::*/parent::r/parent::s.

Motivation (2)
- Many XPath processing algorithms: PPFS+, Staircase Join, Sort Merge-based structural joins, PathStack, Twig2Stack etc.
- Many physical data models and storage techniques:
  - Shredding on relations: schema-based mapping vs. edge-based mapping
  - Storage into disk pages preserving XML hierarchy
  - Structural encodings: region encoding vs. prefix-based encoding
  - Data structures: XB-trees, F&B Index, Path indexes

Contribution I
- GeCOEX: the first generic XPath cost-based execution and optimization framework
  - Agnostic to the underlying XML storage system and the access methods it supports
  - Independent of the techniques and algorithms available for XPath processing; these are encapsulated in operator implementations and rewriting rules
  - Cost-based optimization

Contribution II
- XPalgebra: a novel XPath logical algebra
  - Good fit with many XPath processing techniques
- Lookup and SM: two novel and efficient families of physical operators for XPath
- Multiple storage engines
- Experimental evaluation: direct comparison of operator implementations

GeCOEX System Architecture
[Architecture diagram. Components shown: XPath query, Parser, Query Optimization, Physical Plan Selector, Rewriting Rules, Query Execution, Physical Plan Executor, Physical Operators, result, Database Statistics Descriptors, Physical Operator Descriptors, Cost Models Descriptors, Primitive Access Method Cost Models, XPA Driver, Primitive Access Methods, Data Model, XPA API.]

XPalgebra
- Generic sequence-based logical algebra for a subset of XPath
  - Forward and backward axes
  - Non-positional predicates involving conjunctive boolean expressions
- Maintains the navigational nature of XPath
- Data Model: Element Sequence
  - Duplicate-free list of elements in document order
- Sequence Operators: (mainly) navigation
  - Input and Output: Sequence
- Boolean Operators: used for filtering
  - Input: Element
  - Output: True or False

XPalgebra – Sequence Operators
- Both the input and the output of a Sequence operator are sequences of nodes
- The input sequence is called the context sequence
- BoolExpr: const | Ъ1 ∧ Ъ2 ∧ … ∧ Ъn, where Ъi : Boolean Operator

XPalgebra – Boolean Operators
- Applied on single nodes only
- The input element is called the context element
- Return boolean values
- Example: …[d//c] corresponds to f(S, Ъfp_{d//c})
- BoolExpr: const | Ъ1 ∧ Ъ2 ∧ … ∧ Ъn, where Ъi : Boolean Operator

XPalgebra – examples
/s/r/*/it[mb/m/to='x']//k
d_k( f( fp_{/s/r/*/it}(root), Ъfp_{/mb/m/to}( Ъvf_{text()=x} ) ) )

Physical Operators
- Implement the Sequence interface of the XPA API
- Access the XML data using the AccessMethods interface of the XPA API
- Example: a physical operator implementation
- That is how physical operators stay agnostic to the physical data model

Physical Operators
Large number of physical operators, divided roughly into four 'families':
- Lookup operators (LU)
  - Inspired by indexed nested loops join
  - dLUa: for each element n from input sequence S, make a lookup using XPAAPI.Descs(n, a)
- SortMerge-based operators (SM)
  - Inspired by Sort Merge join
  - dSMa: scan all elements from input sequence S and all a elements (using XPAAPI.Descs(root, a)) and find 'ancestor-descendant' matches
- Staircase Join operators [Grust 2003]
- PathStack operators [Bruno 2002]

Physical Operators
[Table: logical operators covered by each family. Columns: LU*, SM*, Staircase [Grust 2003], PathStack [Bruno 2002]. Rows, with cell marks in extraction order (column alignment lost): c (child) **; d (descendant) ** X **; a (ancestor) **; bp (backward path) ** X; cs (cousin) X X; s; fp (forward path); p (parent). Legend: **: inspired by original.]

5 XML Storage Systems and their XPA drivers
- The PE-basic Native XML storage system: Dewey encoding, 1 B-Tree per tag name
- The RE-basic Native XML storage system: Pre/Post/Level encoding, 1 B-Tree per tag name
- The PE-Path Native XML storage system: Dewey encoding, 1 B-Tree per tag name, Paths B-Tree
- The RE-Path Native XML storage system: Pre/Post/Level encoding, 1 B-Tree per tag name, Paths B-Tree
- The Edge-RE Native XML storage system: Pre/Post/Level encoding, 1 B-Tree for all elements

Lookup Operators
- Novel efficient algorithms for holistically evaluating forward and backward multi-step paths
  - Based on root-to-node filtering
- buffered-leaping: a new technique for pipelined duplicate elimination and document order preservation
- Search a minimum window of elements for each element in the context sequence
  - window: the result of calling the method from the AccessMethods interface of the XPA API (e.g. Descs(), Ancs()) corresponding to the XPath axis (e.g. descendant, ancestor) for a given context element

Example: fpLU/c/f
[Animated example over a document tree with root r, elements b1 to b9, f1 to f17, and c, d, e elements. For each context element the operator fetches a window, e.g. window = XPAAPI.Descs(b1,'f'), XPAAPI.Descs(b2,'f'), XPAAPI.Descs(b3,'f'), XPAAPI.Descs(b9,'f'), and tests each candidate with regExprFilter(fN.getRTNPath(), /c//f, level), which returns true or false. State tracked: contextEl, chain, rootAnc. Output of successive next() calls, as far as recoverable: f1, f3, f5, f7, f12, f13, f17, null.]
- The size of chain at any time is very small and upper bounded by the depth of the XML document

Example: LU bp parent::c/ancestor::b
[Animated example over the same document tree. Context elements f2, f3, f5, f6, f8, f11; windows window = XPAAPI.Ancs(f2,'b'), XPAAPI.Ancs(f3,'b'), XPAAPI.Ancs(f5,'b'), XPAAPI.Ancs(f6,'b'), XPAAPI.Ancs(f8,'b'), XPAAPI.Ancs(f11,'b'). State tracked: contextEl, sortedElements (virtual elements #b1, #b2, #b3, #b4, #b5, #b7).]
- reverseOf(parent::c/ancestor::b) = /c//f
- V: regExprFilter(f3.getRTNPath(), /c//f, 1) = true
- Cheap implementation of Ancs() in the PE-Path driver:
  - Dewey(f2) = 1.1.2.1.1
  - RTN(f2) = /r/b/c/f => there is a 'b' ancestor b' at level 2
  - Dewey(b') = substr(dewey(f2), …) = 1.1
  - RTN(b') = substr(RTN(f2), …) = /r/b
  - Ancs() outputs n without actually retrieving b1 from the database. n is the virtual representation of b1, denoted as #b1

SM Operators
- Inspired by sort-merge join algorithms
- Traverse two sequences of elements, left and right
  - left: the context sequence (the input sequence)
  - right: always consists of all the elements of the requested tag name
- Keeping track of the current elements on left and right, try to find matching pairs according to the appropriate navigation axis and condition
- Novel techniques for holistic SM-based forward path and backward path operators with guaranteed low memory requirements

Performance Comparison
[Charts not recoverable.]

Performance Comparison
Sensitivity to context selectivity
[Charts for: descendant, forward path, ancestor.]

Conclusions I
- Novel techniques for evaluating forward and backward multi-step paths
  - Pipelined duplicate elimination and document order preservation
  - Lookup fp, Lookup bp, Lookup cs, SM fp, SM bp, SM cs
- Fast backwards navigation that fully exploits the capabilities of the underlying storage system
- Algorithms perform well across a variety of different physical storage models
- First steps towards building cost models for XPath

Conclusions II
- Operator-based XPath processing provides significant optimization opportunities
- Different implementations of logical operators can provide benefits in different circumstances
  - E.g. context selectivity
- Query plans can be much more efficient than (existing) monolithic (twig) techniques in most circumstances

Thank you!
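Supplementary sketch: Lookup vs. SortMerge for the descendant operator. The slides describe dLUa as one lookup per context element and dSMa as a single scan over the context sequence and all a elements. A minimal Python illustration of that contrast follows. Everything in it is an assumption made for illustration: the Elem tuple, the ToyStore class and its descs() method only stand in for a pre/post region-encoded store, and the duplicate handling in d_lu is a simple skip of nested context elements, not the buffered-leaping technique from the talk.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical region-encoded element (pre-order and post-order ranks).
Elem = namedtuple("Elem", "label tag pre post")

def is_descendant(d, a):
    """Region-encoding test: d is a proper descendant of a."""
    return a.pre < d.pre and d.post < a.post

class ToyStore:
    """Toy stand-in for an access-methods layer: one pre-sorted list per tag."""
    def __init__(self, elems):
        self.by_tag = {}
        for e in sorted(elems, key=lambda e: e.pre):
            self.by_tag.setdefault(e.tag, []).append(e)

    def descs(self, n, tag):
        """All `tag` descendants of n in document order (index-style lookup)."""
        cands = self.by_tag.get(tag, [])
        start = bisect_right([e.pre for e in cands], n.pre)
        out = []
        for e in cands[start:]:
            if e.post > n.post:      # first element after n's subtree: stop
                break
            out.append(e)
        return out

def d_lu(store, context, tag):
    """Lookup flavour: one descs() call per non-nested context element."""
    out, last = [], None
    for n in context:                # context is in document order
        if last is not None and is_descendant(n, last):
            continue                 # n's descendants were already returned
        out.extend(store.descs(n, tag))
        last = n
    return out

def d_sm(store, context, tag):
    """SortMerge flavour: one pass over context (left) and all tag elements (right)."""
    out, i, max_post = [], 0, -1
    for r in store.by_tag.get(tag, []):
        while i < len(context) and context[i].pre < r.pre:
            max_post = max(max_post, context[i].post)
            i += 1
        if max_post > r.post:        # some earlier context element encloses r
            out.append(r)
    return out

# Hypothetical document: r( b1( f1, b2( f2 ) ), b3( f3 ), f4 )
DOC = [Elem("r", "r", 0, 7), Elem("b1", "b", 1, 3), Elem("f1", "f", 2, 0),
       Elem("b2", "b", 3, 2), Elem("f2", "f", 4, 1), Elem("b3", "b", 5, 5),
       Elem("f3", "f", 6, 4), Elem("f4", "f", 7, 6)]
store = ToyStore(DOC)
ctx = store.by_tag["b"]              # context sequence: b1, b2, b3

print([e.label for e in d_lu(store, ctx, "f")])  # ['f1', 'f2', 'f3']
print([e.label for e in d_sm(store, ctx, "f")])  # ['f1', 'f2', 'f3']
```

Both functions return the same duplicate-free, document-ordered sequence. The lookup variant touches only the windows under the context elements, while the merge variant always reads every f element, which is one way to picture the sensitivity to context selectivity mentioned in the performance slides.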
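Supplementary sketch: virtual ancestors from Dewey labels. The slide on the cheap Ancs() implementation in the PE-Path driver derives an ancestor's Dewey label and root-to-node (RTN) path by taking prefixes of the context element's own labels, so the ancestor need not be fetched. The Python sketch below shows that prefix computation under a stated assumption: each step of the RTN path corresponds to exactly one Dewey component. The function name virtual_ancestors and the sample labels are hypothetical and are not taken from the XPA API.

```python
def virtual_ancestors(dewey, rtn_path, tag):
    """Derive (dewey, rtn_path) labels for every proper ancestor named `tag`.

    Works purely on the element's own labels; no stored element is read.
    Assumes one Dewey component per step of the root-to-node path.
    """
    ids = dewey.split(".")
    steps = rtn_path.strip("/").split("/")
    if len(ids) != len(steps):
        raise ValueError("expected one Dewey component per path step")
    out = []
    # depth is the 1-based level of the candidate ancestor; the element
    # itself sits at level len(steps) and is excluded.
    for depth in range(1, len(steps)):
        if steps[depth - 1] == tag:
            out.append((".".join(ids[:depth]), "/" + "/".join(steps[:depth])))
    return out

# Hypothetical f element at /r/b/c/f with Dewey label 1.1.2.1
print(virtual_ancestors("1.1.2.1", "/r/b/c/f", "b"))
# [('1.1', '/r/b')]

# Two nested b ancestors are returned root first, i.e. in document order
print(virtual_ancestors("1.2.1.3.1", "/r/b/c/b/f", "b"))
# [('1.2', '/r/b'), ('1.2.1.3', '/r/b/c/b')]
```

Each returned pair plays the role of the virtual element written as #b1 on the slide: enough to identify and order the ancestor, and it can be materialised later only if an operator further up the plan needs the element's content.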