chandalia

advertisement
Processing XML Streams
with Deterministic Automata
Denis Mindolin
Gaurav Chandalia
Introduction
XML data
stream
XPath query 1
XPath query 2
XPath query 3
Consumer 1
XML Stream
Router
Consumer 2
Consumer 3
Related Work




The problem was introduced in [Altinel and
Franklin 2000] for a system XFilter.
[Chan et al. 2002] describes techniques to
solve the problem based on a trie (XTrie)
[Diao et al. 2003] discusses a method based
on optimized NFAs(YFilter)
[Green et al. 2003] introduces how to solve
the problem using lazy DFA
DFA approach in general




Convert the set of XPath expressions into the
set of NFA’s
Convert the set of NFA’s into a single NFA
Convert the single NFA into a DFA
Process XML data stream with DFA (using
SAX model)
DFA approach in general (cont)

Linear XPath expression:
P ::= /N | //N | PP
N ::= E | A | * | text() | text() = S
where
E – element label
What about predicates?
A – attribute label
/ - child axis
To be decomposed into
// - descendant axis
linear XPath expressions
* - wild card
S – constant string
DFA approach in general (cont)

Consider two XPath expressions
/datasets/dataset[//tableHead//*/text()=“Galaxy”]/title
/datasets/dataset[/history]/tableHead[/field]

Corresponding query tree
$D IN $R/datasets/dataset
$H IN $D/history
$T IN $D/title
sax f = true
$TH IN $D/tableHead sax f = true
$N IN $D//tableHead//*
$F IN $TH/field
$V IN $N/text()="Galaxy"
Conversion of XPath expressions into
NFA and DFA
Query tree
$X IN $R/a
$Y IN $X//*/b
$Z IN $X/b/*
$U IN $Z/d
Query NFA
Query DFA
Eager DFA vs. Lazy DFA

DFA is eager if it is obtained by the standard
algorithm of conversion of NFA to DFA [Hopcroft
and Ullman 1979]

DFA is lazy if it is constructed at run-time on demand.
Initially it has a single state and whenever we attempt
to make a transition into a missing state we compute
it and update a transition.
Eager DFA


P = p0 // p1 //… // pk
pi = N1 / N2 /… / Nni
k=
# of //’s
ni=
length of pi, i=0,…,k
m=
max # of *’s in each pi
n=
length (or depth) of P, i.e.
s=
alphabet size ||
n
i  0... k
i
Theorem. Given a linear XPath expression P, define prefix(P) = n0, and
k 2 1
2
m
body(P) = (
when k>0, and body(P) = 1
(
n

n
)

2
(
n

n
)

n

1
)
s
0
0
k
2
2k
when k = 0. Then eager DFA for P has at most prefix(P) + body(P) states. In
particular, if m = 0 and k 1, then DFA has at most (n+1) states.
Lazy DFA. Example
DFA
Queries
1
\a\\*\b
a
\a\b\*\d
Sample XML document
2
*
b
*
*
<a>
7
<b>
<b>
3
*
6
<d/>
</b>
</b>
</a>
b
b
*
*
b
4
8
d
b
b
*
5
b
d
Lazy DFA
Graph schema (based on DTD)
d – the maximum number of simple
cycles that a simple path can
intersect
D – the total number of nonempty,
simple paths starting at the root
d = 2, D = 13
Lazy DFA (cont)

Theorem. Consider a graph schema with d, D, and let Q be set of
XPath expressions of maximum depth n. Then on any XML input
satisfying the schema, the lazy DFA has at most 1 + D(1+n)d states

Corollary. The number of states of lazy DFA does not depend on
the number of XPath expressions, only on their depth.

If n = 10, and the number of XPath expressions is equal to 100,000.

Eager DFA may have  2100,000 states

Lazy DFA will have  1574 states
Lazy DFA. Implementation


To process XML stream, it uses SAX model
The subset of XPath considered in the
implementation

No text() and attribute values tests

Only child and descendant axes

All predicates of a query must fire before the target
element
Restrictions of the implementation
XPath queries
Sample XML document
1. All predicates fire before the target
element
<courses>
\\courses[level]\section
<course>367-203</course>
<title>MEDIA WORKSHOP</title>
<level>U</level>
<section>
2. Predicates fire between the starting and
closing tags of the target element
\\courses[days]\section
3. Predicates fire after the target element
\\courses[credits]\section
<section>Se 101</section>
<days>T</days>
<hours>
<start>1:30pm</start>
<end>5:20pm</end>
</hours>
</section>
<credits>1-3</credits>
</courses>
Processing attributes
When processing a stream, all attributes are converted into elements
<section_listing>
<section name=“Se 101“
<hours
<section_listing>
<section>
description=“”/>
<@name>Se 101</@name>
start="1:30pm“
<@description/>
end="5:20pm"/>
</section_listing>
</section>
<hours>
<@start>1:30pm</@start>
<@end>5:20pm</@end>
</hours>
</section_listing>
Testing


Reference implementation: Galax 1.0.3.5
Testing XML stream: World geographic database
http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml





Maximum XML depth of the stream was 6
Number of queries was 14
The depth of queries had a range of 1 to 5
The number of predicates had a range of 0 to 3
The depth of predicates had a range of 1 to 4
Method used
Number of states used
NFA
22
Eager DFA
87
Lazy DFA
22
(1MB)
Reference





Todd J. Green et al, Processing XML Streams with Deterministic Automata
and Stream Indexes,, ACM Transactions on Computational Logic, 12/2004
Altinel, M. and Franklin, M. 2000. Efficient filtering of XML documents for
selective dissemination, In Proceedings of VLDB. Cairo
Chen J et al, 2000, NiagaraCQ: a scalable continuous query system for
internet databases. In Proceedings of the ACM/SIGMOD Conference on
Management of Data
Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML
message brokering. In Proceedings of VLDB. Berlin, Germany.
John E. Hopcroft, Jeffrey D. Ullman 1987, Introduction to automata theory,
languages, and computation
Download