Processing XML Streams with Deterministic Automata Denis Mindolin Gaurav Chandalia Introduction XML data stream XPath query 1 XPath query 2 XPath query 3 Consumer 1 XML Stream Router Consumer 2 Consumer 3 Related Work The problem was introduced in [Altinel and Franklin 2000] for a system XFilter. [Chan et al. 2002] describes techniques to solve the problem based on a trie (XTrie) [Diao et al. 2003] discusses a method based on optimized NFAs(YFilter) [Green et al. 2003] introduces how to solve the problem using lazy DFA DFA approach in general Convert the set of XPath expressions into the set of NFA’s Convert the set of NFA’s into a single NFA Convert the single NFA into a DFA Process XML data stream with DFA (using SAX model) DFA approach in general (cont) Linear XPath expression: P ::= /N | //N | PP N ::= E | A | * | text() | text() = S where E – element label What about predicates? A – attribute label / - child axis To be decomposed into // - descendant axis linear XPath expressions * - wild card S – constant string DFA approach in general (cont) Consider two XPath expressions /datasets/dataset[//tableHead//*/text()=“Galaxy”]/title /datasets/dataset[/history]/tableHead[/field] Corresponding query tree $D IN $R/datasets/dataset $H IN $D/history $T IN $D/title sax f = true $TH IN $D/tableHead sax f = true $N IN $D//tableHead//* $F IN $TH/field $V IN $N/text()="Galaxy" Conversion of XPath expressions into NFA and DFA Query tree $X IN $R/a $Y IN $X//*/b $Z IN $X/b/* $U IN $Z/d Query NFA Query DFA Eager DFA vs. Lazy DFA DFA is eager if it is obtained by the standard algorithm of conversion of NFA to DFA [Hopcroft and Ullman 1979] DFA is lazy if it is constructed at run-time on demand. Initially it has a single state and whenever we attempt to make a transition into a missing state we compute it and update a transition. Eager DFA P = p0 // p1 //… // pk pi = N1 / N2 /… / Nni k= # of //’s ni= length of pi, i=0,…,k m= max # of *’s in each pi n= length (or depth) of P, i.e. s= alphabet size || n i 0... k i Theorem. Given a linear XPath expression P, define prefix(P) = n0, and k 2 1 2 m body(P) = ( when k>0, and body(P) = 1 ( n n ) 2 ( n n ) n 1 ) s 0 0 k 2 2k when k = 0. Then eager DFA for P has at most prefix(P) + body(P) states. In particular, if m = 0 and k 1, then DFA has at most (n+1) states. Lazy DFA. Example DFA Queries 1 \a\\*\b a \a\b\*\d Sample XML document 2 * b * * <a> 7 <b> <b> 3 * 6 <d/> </b> </b> </a> b b * * b 4 8 d b b * 5 b d Lazy DFA Graph schema (based on DTD) d – the maximum number of simple cycles that a simple path can intersect D – the total number of nonempty, simple paths starting at the root d = 2, D = 13 Lazy DFA (cont) Theorem. Consider a graph schema with d, D, and let Q be set of XPath expressions of maximum depth n. Then on any XML input satisfying the schema, the lazy DFA has at most 1 + D(1+n)d states Corollary. The number of states of lazy DFA does not depend on the number of XPath expressions, only on their depth. If n = 10, and the number of XPath expressions is equal to 100,000. Eager DFA may have 2100,000 states Lazy DFA will have 1574 states Lazy DFA. Implementation To process XML stream, it uses SAX model The subset of XPath considered in the implementation No text() and attribute values tests Only child and descendant axes All predicates of a query must fire before the target element Restrictions of the implementation XPath queries Sample XML document 1. All predicates fire before the target element <courses> \\courses[level]\section <course>367-203</course> <title>MEDIA WORKSHOP</title> <level>U</level> <section> 2. Predicates fire between the starting and closing tags of the target element \\courses[days]\section 3. Predicates fire after the target element \\courses[credits]\section <section>Se 101</section> <days>T</days> <hours> <start>1:30pm</start> <end>5:20pm</end> </hours> </section> <credits>1-3</credits> </courses> Processing attributes When processing a stream, all attributes are converted into elements <section_listing> <section name=“Se 101“ <hours <section_listing> <section> description=“”/> <@name>Se 101</@name> start="1:30pm“ <@description/> end="5:20pm"/> </section_listing> </section> <hours> <@start>1:30pm</@start> <@end>5:20pm</@end> </hours> </section_listing> Testing Reference implementation: Galax 1.0.3.5 Testing XML stream: World geographic database http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml Maximum XML depth of the stream was 6 Number of queries was 14 The depth of queries had a range of 1 to 5 The number of predicates had a range of 0 to 3 The depth of predicates had a range of 1 to 4 Method used Number of states used NFA 22 Eager DFA 87 Lazy DFA 22 (1MB) Reference Todd J. Green et al, Processing XML Streams with Deterministic Automata and Stream Indexes,, ACM Transactions on Computational Logic, 12/2004 Altinel, M. and Franklin, M. 2000. Efficient filtering of XML documents for selective dissemination, In Proceedings of VLDB. Cairo Chen J et al, 2000, NiagaraCQ: a scalable continuous query system for internet databases. In Proceedings of the ACM/SIGMOD Conference on Management of Data Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of VLDB. Berlin, Germany. John E. Hopcroft, Jeffrey D. Ullman 1987, Introduction to automata theory, languages, and computation