XML Data Management
Deterministic DTDs and Schemas
Werner Nutt
How Expressive can a Schema Be?
<xsd:element name="A" type="oneB"/>

<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>

<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>

This schema is a frequent example in teaching material on XML Schema.
What would documents look like that satisfy this schema?
Arbitrarily deep binary trees of A elements that contain a single B element
How would one check validity? What would be the cost?
What are the pros and cons of allowing such schemas?
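One possible answer to the validity question, as a minimal sketch: compute bottom-up, for every element, the set of types it could have. The function names `possible_types` and `is_valid` are hypothetical, text content is assumed to be irrelevant to typing, and the recursion depth grows with the depth of the tree.

```python
import xml.etree.ElementTree as ET

def possible_types(node):
    """Return the subset of {'string', 'onlyAs', 'oneB'} that fits `node`."""
    kids = list(node)
    if not kids:
        # No child elements: only plain string content is possible.
        return {"string"}
    if node.tag != "A":
        return set()  # B elements must be leaves
    kid_types = [possible_types(k) for k in kids]
    types = set()
    if len(kids) == 1:
        if kids[0].tag == "B" and "string" in kid_types[0]:
            types.add("oneB")      # choice branch: a single B
        if kids[0].tag == "A" and "string" in kid_types[0]:
            types.add("onlyAs")    # choice branch: a single string-typed A
    elif len(kids) == 2 and all(k.tag == "A" for k in kids):
        left, right = kid_types
        if "onlyAs" in left and "onlyAs" in right:
            types.add("onlyAs")
        if ("onlyAs" in left and "oneB" in right) or \
           ("oneB" in left and "onlyAs" in right):
            types.add("oneB")
    return types

def is_valid(xml_text):
    root = ET.fromstring(xml_text)
    return root.tag == "A" and "oneB" in possible_types(root)

print(is_valid("<A><A><A>x</A></A><A><B>y</B></A></A>"))
```

Every node is visited once, so the check is linear in the size of the document, but the type of a node is only known after its whole subtree has been inspected.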
Let’s see what SAXON says …
Here is the Full Error Message from Eclipse
• cos-element-consistent: Error for type 'oneB'.
Multiple elements with name 'A', with different types,
appear in the model group.
• cos-element-consistent: Error for type 'onlyAs'.
Multiple elements with name 'A', with different types,
appear in the model group.
I.e., in a given context, elements with the same name
must have the same content. Easy to check!
• cos-nonambig: A and A (or elements from their substitution group)
violate "Unique Particle Attribution". During validation against this
schema, ambiguity would be created for those two particles.
• cos-nonambig: A and A (or elements from their substitution group)
violate "Unique Particle Attribution". During validation against this
schema, ambiguity would be created for those two particles.
That’s more subtle ...
The Country Example in XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org/country"
xmlns="http://www.example.org/country"
elementFormDefault="qualified">
<xsd:element name="country">
<xsd:complexType>
<xsd:choice>
<xsd:element name="king" type="xsd:string"></xsd:element>
<xsd:element name="queen" type="xsd:string"></xsd:element>
<xsd:sequence>
<xsd:element name="king" type="xsd:string"></xsd:element>
<xsd:element name="queen" type="xsd:string"></xsd:element>
</xsd:sequence>
</xsd:choice>
</xsd:complexType>
</xsd:element>
</xsd:schema>
As DTD:
<!ELEMENT country (king | queen | (king,queen))>
This schema is rejected by the validator as well …
• cos-nonambig: king and king (or elements from their
substitution group) violate "Unique Particle Attribution".
During validation against this schema, ambiguity would be
created for those two particles.
Let’s check what this means!
What the W3C Standard Explains …
Schema Component Constraint:
Unique Particle Attribution
A content model must be formed such that during
·validation· of an element information item sequence,
the particle contained directly, indirectly or ·implicitly·
therein with which to attempt to ·validate· each item in the
sequence in turn can be uniquely determined
without examining the content or attributes of that item,
and without any information about the items in
the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig
Questions and Ideas
Questions:
• How can one make the standard formal?
• How can a validator implement the standard?
Ideas:
• Content models are specified by regular expressions
• A regular expression E can be translated into
a finite state automaton A (Glushkov automaton)
that checks which strings satisfy E
⇒ Construct A from E and check
whether A is deterministic
Formalization
• Alphabet Σ (i.e., set of symbols):
the element names occurring in the content model
In the following, we denote concatenation by a dot,
no longer by a comma.
• Regular expressions over Σ are generated with the rule
e, f → a | (e · f) | (e | f) | (e)+ | (e)*
where e, f are expressions and a ∈ Σ
• Language L(e) of an expression e (inductively defined)
• Exercise: Which of the following are in the language
defined by a* · (b | c) · a+ ?
– aab
– aba
– aaacaaa
– abca
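To check answers to the exercise mechanically, the expression can be handed to a regex engine. This is a small sketch using Python's `re` module, where concatenation is implicit and `fullmatch` requires the whole word to match. The helper name `in_language` is an assumption made for illustration.

```python
import re

# a* · (b | c) · a+ in Python's regex syntax (concatenation is implicit).
CONTENT_MODEL = re.compile(r"a*(b|c)a+")

def in_language(word: str) -> bool:
    """True iff the whole word belongs to the language of the expression."""
    return CONTENT_MODEL.fullmatch(word) is not None

for word in ["aab", "aba", "aaacaaa", "abca"]:
    print(word, in_language(word))
```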
Regular Expressions and DTDs
These are formalizations of DTDs and validation:
A DTD is a pair (d, s) where
• s ∈ Σ is the start symbol
• d maps every Σ-symbol to a regular expression over Σ
A document tree t satisfies d (t is valid wrt d) iff
• the root of t is labeled s
• for every node n in t, with symbol a,
the string formed by the names of the children of n
satisfies d(a)
⇒ Validation is checking whether a string satisfies a regexp
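The definition can be sketched directly in code. In this illustration (not a real validator) a tree is a hypothetical `Node` class, d is a dict from element names to Python regexes, and each child name is followed by one space so that names cannot run into each other.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def is_valid(tree, d, s):
    """Check a tree against the DTD (d, s) as defined above."""
    if tree.label != s:
        return False                      # the root must be labeled s
    stack = [tree]
    while stack:
        n = stack.pop()
        if n.label not in d:
            return False                  # undeclared element name
        # The string formed by the names of the children of n.
        word = "".join(c.label + " " for c in n.children)
        if re.fullmatch(d[n.label], word) is None:
            return False
        stack.extend(n.children)
    return True

# The country example: (king | queen | (king, queen)); leaves have no children.
d = {"country": r"king |queen |king queen ", "king": r"", "queen": r""}
doc = Node("country", [Node("king"), Node("queen")])
print(is_valid(doc, d, "country"))
```

Note that the regex engine is free to backtrack here; the determinism requirement discussed next is what lets a validator avoid that.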
Markings
Distinguish between the different occurrences of a symbol in
a regexp by using numbers: markings of regexps
Examples:
• a1* · (b2 | c3) · a4+ is a marking of a* · (b | c) · a+
• king1 | queen2 | king3 · queen4 is a marking of
king | queen | king · queen
Definition
A marking e′ of a regular expression e is
an assignment of numbers to every symbol in e.
Unmarked Version
Consider a regular expression e and a marking e′ of e
Definition:
For w ∈ L(e′), we denote by w#
the corresponding unmarked string in L(e).
Example:
If w = b2a1a3, then w# = baa
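Marking and unmarking are easy to express in code. The sketch below assumes an expression is given as a list of tokens and represents a marked symbol as a pair such as `("a", 1)`; both representations and the names `mark` and `unmark` are choices made for this illustration.

```python
OPERATORS = set("*+|()·")

def mark(tokens):
    """Number every symbol occurrence from left to right; keep operators."""
    marked, counter = [], 0
    for tok in tokens:
        if tok in OPERATORS:
            marked.append(tok)
        else:
            counter += 1
            marked.append((tok, counter))
    return marked

def unmark(word):
    """Map a marked string w to its unmarked version w#."""
    return [name for name, _ in word]

print(mark(["a", "*", "·", "(", "b", "|", "c", ")", "·", "a", "+"]))
print(unmark([("b", 2), ("a", 1), ("a", 3)]))
```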
“Unique Particle Attribution”: Formalization
Brüggemann-Klein/Wood [1998]
Definition: A regular expression r is deterministic iff
there are no strings uxv, uyw ∈ L(r′) with
• |x| = |y| = 1
• x ≠ y
(x and y are different marked symbols)
• x# = y#
(their unmarking is the same).
Example: (a | b)* a is not deterministic because there are
• the marking (a1 | b2)* a3
• the strings
b2 a1 a3 (u = b2, x = a1, v = a3) and
b2 a3 (u = b2, y = a3, w = ε)
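The definition can be tried out on a finite sample of marked strings. The sketch below looks for a pair uxv, uyw as in the definition; marked symbols are pairs such as `("a", 1)` and the name `find_witness` is hypothetical. A sample can only refute determinism, never prove it, since L(r′) is usually infinite.

```python
from itertools import combinations

def find_witness(marked_words):
    """Return (u, x, y) with x != y and x# == y#, or None if no pair conflicts."""
    for w1, w2 in combinations(marked_words, 2):
        # u must be a common prefix that ends where the two words first differ.
        i = 0
        while i < len(w1) and i < len(w2) and w1[i] == w2[i]:
            i += 1
        if i < len(w1) and i < len(w2):
            x, y = w1[i], w2[i]          # different marked symbols
            if x[0] == y[0]:             # same symbol after unmarking
                return w1[:i], x, y
    return None

# The two strings from the example for (a | b)* a.
sample = [(("b", 2), ("a", 1), ("a", 3)), (("b", 2), ("a", 3))]
print(find_witness(sample))
```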
How can we check,
whether e is deterministic?
Finite State Automata
• Regular languages can also be defined using automata
• A finite state automaton (FSA) consists of:
– a set of states Q
– an alphabet Σ (i.e., a set of symbols)
– a transition function δ,
which maps every pair (q,a) to a set of states q’
– an initial state q0
– a set of accepting states F
The automaton is deterministic if every pair (q,a) is
mapped to only a single state
• A word a1…an is in the language defined by an automaton
if there is a path from q0 to a state in F
with edges labeled a1,…,an
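A minimal sketch of these definitions, assuming δ is stored as a dict from (state, symbol) pairs to sets of states, with missing pairs meaning "no transition". The example automaton for (a | b)* a is an assumption made for illustration. Note that the determinism test below also accepts partial transition functions (at most one target per pair).

```python
def accepts(delta, q0, accepting, word):
    """Follow all paths labeled by `word` at once; accept if one ends in F."""
    current = {q0}
    for a in word:
        current = {q2 for q in current for q2 in delta.get((q, a), set())}
    return bool(current & accepting)

def is_deterministic(delta):
    """True iff no pair (q, a) is mapped to more than one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical automaton for (a | b)* a: guess when the final a is read.
delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
print(accepts(delta, "q0", {"q1"}, "ba"))
print(is_deterministic(delta))
```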
Which Language Does this FSA Define?
[Figure: automaton with states q0, q1, q2, q3 and transitions labeled a, b, and c]
Non-Deterministic Automata
• An automaton is non-deterministic if
there is a state q and a letter a such that
there are at least two transitions from q
via edges labeled with a
What words are in the language of
a non-deterministic automaton?
• We now create a Glushkov automaton
from a regular expression
Creating a Glushkov Automaton
from a Regular Expression
Step 1: Create a marking
of the expression
a*(b|c)a+
a1*(b1|c1)a2+
Creating a Glushkov Automaton
from a Regular Expression
Step 2: Create a state q0
and create a state
for each subscripted letter
a1*(b1|c1)a2+
Step 3: Choose as accepting states
all subscripted letters with which
it is possible to end a word
How do we find
these states?
[Figure: states q0, a1, b1, c1, a2]
Creating a Glushkov Automaton
from a Regular Expression
Step 4: Create a transition
from a state li to a state kj if
there is a word in which kj follows li.
Label the transition with k
a1*(b1|c1)a2+
How do we find these transitions?
[Figure: states q0, a1, b1, c1, a2, with transitions to be added]
Exercises
What are the Glushkov automata of
• a* · b (a · b)*
• (a | b)* · a · (a | b)
• (a | b)* · a
?
Recognizing Deterministic Regular Expressions
Theorem (Book et al 1971, Brüggemann-Klein, Wood, 1998)
A regular expression is deterministic (one-unambiguous)
iff its Glushkov automaton is deterministic.
Construction of the Glushkov Automaton
For an arbitrary alphabet Σ and a language L ⊆ Σ*
we define two sets
first(L) = { a ∈ Σ | ∃u ∈ Σ*. au ∈ L }
last(L) = { a ∈ Σ | ∃u ∈ Σ*. ua ∈ L }
and the function
follow(L,a) = { b ∈ Σ | ∃u,v ∈ Σ*. uabv ∈ L }.
Consider an expression e and its marking e′
We can construct the Glushkov automaton for e if we know
the sets first(L(e′)) , last(L(e′)) ,
the function follow(L(e′), · ) ,
and if we know whether ε ∈ L(e′) (ε is the empty word).
Why?
Construction of the Glushkov Automaton
Where do we get this info?
If e′ = a1 , then
• first(L(e′)) = { a1 }
• last(L(e′)) = { a1 }
• follow(L(e′), li) is not defined for any li
Also, ε ∉ L(e′)
For e′ = f*, f+, f · g:
exercise!
If e′ = (f | g) , then
• first(L(e′)) = first(L(f)) ∪ first(L(g))
• last(L(e′)) = last(L(f)) ∪ last(L(g))
• follow(L(e′), li) is follow(L(f), li) if li occurs in f and
follow(L(g), li) if li occurs in g
Also, ε ∈ L(e′) if ε ∈ L(f) or ε ∈ L(g)
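The inductive computation, including one possible solution for the cases left as an exercise, can be sketched as follows. Expressions are assumed to be nested tuples such as `("cat", f, g)`, marked symbols are pairs such as `("a", 1)`, and follow of a single symbol is stored as the empty set. All names are hypothetical. In the resulting automaton, q0 has transitions to the symbols in first, each symbol has transitions to its follow set, and the accepting states are last (plus q0 if the empty word is accepted).

```python
from itertools import count

def glushkov(expr):
    """Return (first, last, follow, nullable) for the marked version of expr."""
    ids = count(1)
    follow = {}

    def visit(e):
        kind = e[0]
        if kind == "sym":
            pos = (e[1], next(ids))          # marked symbol, e.g. ("a", 1)
            follow[pos] = set()
            return {pos}, {pos}, False
        if kind in ("star", "plus"):
            first, last, nullable = visit(e[1])
            for p in last:                   # the body may be repeated
                follow[p] |= first
            return first, last, nullable or kind == "star"
        f_first, f_last, f_null = visit(e[1])
        g_first, g_last, g_null = visit(e[2])
        if kind == "alt":
            return f_first | g_first, f_last | g_last, f_null or g_null
        # kind == "cat": whatever ends f can be followed by whatever starts g
        for p in f_last:
            follow[p] |= g_first
        first = f_first | g_first if f_null else f_first
        last = f_last | g_last if g_null else g_last
        return first, last, f_null and g_null

    first, last, nullable = visit(expr)
    return first, last, follow, nullable

def is_deterministic(expr):
    """No state may have two outgoing transitions with the same unmarked label."""
    first, _, follow, _ = glushkov(expr)
    for successors in [first, *follow.values()]:
        names = [name for name, _ in successors]
        if len(names) != len(set(names)):
            return False
    return True

A, B, C = ("sym", "a"), ("sym", "b"), ("sym", "c")
K, Q = ("sym", "king"), ("sym", "queen")
running = ("cat", ("cat", ("star", A), ("alt", B, C)), ("plus", A))  # a*(b|c)a+
ends_in_a = ("cat", ("star", ("alt", A, B)), A)                      # (a|b)*a
country = ("alt", ("alt", K, Q), ("cat", K, Q))    # king | queen | king·queen
for e in (running, ends_in_a, country):
    print(is_deterministic(e))
```

This sketch favors clarity over speed and makes no attempt to reach the complexity bounds discussed next.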
Recognizing Deterministic Regular Expressions
Observation:
• For each operator, first, last, and follow can be computed
in quadratic time.
This yields an O(n³) algorithm.
Theorem (Brüggemann-Klein, Wood, 1998)
• There is an O(n²) algorithm to check whether a regexp
is deterministic.
More Results
Theorems (Brüggemann-Klein, Wood, 1998)
• Not every regular language can be denoted by a
deterministic regular expression.
E.g., (a | b)* a (a | b)
• Deterministic regular languages are not closed under
union, concatenation, or Kleene-star.
I.e., there is no easy syntactic characterization
• If it exists, an equivalent deterministic regular expression
can be constructed in exponential time.
It is possible to help users, but that is costly
Theory for XML Schema
XML Schema allows schemas where
• the same element appears with different types
However,
• it is illegal to have two elements of the same name,
but different types in one content model.
Also, content models must be deterministic.
Consequence:
Documents can be validated
in a deterministic top-down pass
References
This material draws upon slides by
• Sara Cohen
• Frank Neven,
notes by
• Leonid Libkin
and the papers by A. Brüggemann-Klein and D. Wood