Complex Document Structures Michael Sperberg-McQueen Claus Huitfeldt

Complex Document Structures
Michael Sperberg-McQueen
W3C / MIT
Claus Huitfeldt
University of Bergen
Living texts
16 October 2008, e-Science Institute, Edinburgh
16.10.08
1
The structure of this talk
– Document markup (XML)
– TexMecs (notation)
– Goddag (data structure)
– Rabbit-Duck (validation mechanism)
Problems addressed
– Overlap
– Alternate ordering
– Discontinuity
XML – Extensible Markup Language
• The world before SGML/XML:
– Text as data type, a sequence of chars
• chars ≈ graphemes
• documents are not simply sequences of graphemes
– more or less milestone markup:
• troff, WordStar
• Cocoa
– proprietary formats
• XML provides:
– a simple notation
– a natural and useful data structure
– a powerful control mechanism
XML Notation
<text>
  <front> ... </front>
  <body language="eng">
    <title> ... </title>
    <p>...</p>
    <p>...</p>
    <!-- material omitted here -->
  </body>
  <back> ... </back>
</text>
XML Constraint language (DTD)
<!DOCTYPE text [
<!ELEMENT text  (front?,body,back?) >
<!ELEMENT front (#PCDATA) >
<!ELEMENT body  (title,p+) >
<!ATTLIST body  language CDATA #IMPLIED >
<!ELEMENT back  (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT p     (#PCDATA) >
]>
XML Data structure (the XML tree)
text
  front
    …
  body
    title
      …
    p
      …
    p
      …
  back
    …
The XML tree
• arcs = parent/child relations
– ascription of properties and part/whole relations
• each node is a component
– each parent’s children are ordered
• leaf nodes’ order = character sequence of the document
– positional meaning
Interpreting the XML tree
• ascend from leaf to collect properties and relations
– simple inheritance (additive meaning / override)
• descend from node to collect structure and contents
• supports operations like retrieval, extraction, editing (delete/add) of leaf contents and nodes
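The two directions of interpretation can be sketched in Python with the standard library's ElementTree. This is only an illustration on the sample document from the XML Notation slide; the helper names inherited and contents are hypothetical, and "nearest ancestor wins" is assumed as the override rule.

```python
import xml.etree.ElementTree as ET

# Sample document from the "XML Notation" slide (ellipses kept as-is).
DOC = """<text>
  <front>...</front>
  <body language="eng">
    <title>...</title>
    <p>...</p>
    <p>...</p>
  </body>
  <back>...</back>
</text>"""

root = ET.fromstring(DOC)

# ElementTree stores no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def inherited(node, name):
    """Ascend from node towards the root; the nearest value overrides."""
    while node is not None:
        if name in node.attrib:
            return node.attrib[name]
        node = parent_of.get(node)
    return None


def contents(node):
    """Descend from node and collect its character content in order."""
    return "".join(node.itertext())


first_p = root.find("body/p")
print(inherited(first_p, "language"))            # value found on <body>
print(inherited(root.find("front"), "language"))  # no ancestor carries it
print(contents(first_p))
```

Here a p has no language attribute of its own, so the ascent picks up the value declared on body, while front inherits nothing.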
• SGML/XML and the OHCO model
– OHCO: Ordered Hierarchy of Content Objects
• Problems:
– overlap
– alternate co-existing orders
– discontinuous elements
• Observation:
– not such a bad idea, after all
MLCD: Markup Languages for Complex Documents
– TexMecs (notation)
– Goddag (data structure)
– Rabbit-Duck (validation mechanism)
TexMecs
Start- and end-tags: <e| ... |e>
Sole tags: <e>
… plus attributes, entity references, comments, etc.
TexMecs is almost like XML:
<div|
<s n="1"|Die Welt ist alles, was der Fall ist.|s>
<s n="1.1"|Die Welt ist die Gesamtheit der
Tatsachen, nicht der Dinge.|s>
<* ... *>
<s n="1.2"|Die Welt zerfällt in Tatsachen.|s>
<* ... *>
|div>
but TexMecs allows overlap:
<head|
<np|Broadway <vp|Hit|np> or Miss|vp>
|head>
(assuming "np" for "noun phrase" and "vp" for "verb phrase")
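A reader of such a stream can detect overlap by noticing an end-tag that does not close the most recently opened element. The following Python sketch illustrates that idea only; it is not a TexMecs parser. The regular expression is an assumption that covers bare tags of the form <name| and |name> and ignores attributes, suffixes, and the other tag types.

```python
import re

# Assumption: only simple start-tags "<name|" and end-tags "|name>".
TAG = re.compile(r"<(\w+)\||\|(\w+)>")


def overlapping_pairs(texmecs):
    """Return (closed, still_open) pairs for every improperly nested end-tag."""
    open_names = []
    pairs = []
    for match in TAG.finditer(texmecs):
        start, end = match.group(1), match.group(2)
        if start:
            open_names.append(start)
            continue
        # Find the most recent open element with this name
        # (raises ValueError if the end-tag matches nothing).
        idx = len(open_names) - 1 - open_names[::-1].index(end)
        # Everything opened after it and not yet closed overlaps with it.
        pairs.extend((end, later) for later in open_names[idx + 1:])
        del open_names[idx]
    return pairs


HEAD = "<head| <np|Broadway <vp|Hit|np> or Miss|vp> |head>"
print(overlapping_pairs(HEAD))
print(overlapping_pairs("<a|<b|x|b>|a>"))
```

On the headline the only pair reported is np with vp. A properly nested stream such as the second call yields no pairs.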
<act|
<sp who="Åse"|
<L|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping.|stage>
No, I am not!|L>|sp>
<sp who="Åse"|<L|Well then, swear that
it is true!|L>|sp>
<sp who="Peer"|<L|Swear? Why should I?|sp>
<sp who="Åse"|See, you dare not!|L>
<L|It's a lie from first to last.|L>|sp>
|act>
… and self-overlap:
<head|
<coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
|head>
(assuming "coll" for "collocation")
N.B. the suffixes (here "~1" and "~2") have no significance beyond allowing co-indexing.
Spurious overlap
<a| penumbra <b| umbra |a> penumbra |b>
Normal:
<a|…<b|…|a>…|b>
Spurious:
<a|<b|…|a>…|b>
<a|…<b||a>…|b>
<a|…<b|…|a>|b>
…and discontinuous elements:
<p id="P-6" n="Carroll, Alice I p6"|
Alice was beginning to get very tired of sitting by
her sister on the bank, and of having nothing to
do: once or twice she had peeped into the book
her sister was reading, but it had no pictures or
conversations in it,
<q|and what is the use of a book,|-q>
thought Alice
<+q|without pictures or conversation?|q>
|p>
…and virtual elements:
(Donald's house. Hughie, Louis, and Dewey.)
HUGHIE: How did that translation go?
  da de dum de dum,
  gets a new frog,
  ...
LOUIS: Er ...
  it's a new pond.
DEWEY: Ah ...
  When the old pond
  Right. That's it.

Is this poem in the text?
  When the old pond
  gets a new frog,
  it's a new pond.
<act|<scene|
<sp who="HUGHIE"|
<p|How did that translation go?|p>
<Lg type="haiku"|
<L|da de dum de dum,|L>
<L@frog|gets a new frog,|L>
<L|...|L>|Lg> |sp>
<sp who="LOUIS"|
<p|Er ...|p> <Lg|
<L@new|it's a new pond.|L>|Lg> |sp>
<sp who="DEWEY"|
<p|Ah ...|p> <Lg|
<L@pond|When the old pond|L>|Lg>
<p|Right. That's it.|p> |sp>
|scene>|act>
<Lg|<^L^pond><^L^frog><^L^new>|Lg>
…and unordered elements:
<birthday|
<|gifts||
<gift| a tie |gift>
<gift| socks |gift>
<gift| after shave |gift>
<gift| a kiss |gift>
||gifts|>
|birthday>
TexMecs overview
Elements
Start- and end-tags: <e| ... |e>
- with ID & attribute: <e@foo type="a"| ... |e>
- overlap: <e| ... <f| ... |e> ... |f>
- self-overlap: <e~1|...<e~2|...|e~1>...|e~2>
- discontinuous: <e|...|-e> ... <+e|...|e>
- unordered: <|e|| .. ||e|>
Sole elements: <e>
- virtual: <^a^foo type="b">
TexMecs overview, cont.
Entities
Internal
- unstructured: <&eacute>
- structured: <&dot.fullstop> vs. <&dot.decimal>
External: <<url>>, e.g. <<http://www.w3.org/XML>>
Comments: <* ... *>
CDATA sections: <#CDATA< ... >>
MLCD data structure: Goddag
<head| <np|Broadway <vp|Hit|np> or Miss|vp> |head>
Not:
[Figure: nodes head, np, vp]
MLCD data structure: Goddag
<head| <np|Broadway <vp|Hit|np> or Miss|vp> |head>
But:
[Figure: nodes head, np, vp]
Goddag
Generalized Ordered-Descendant Directed Acyclic Graph
Overlap is simply multiple parentage.
• Informally, Goddags are tangled trees with shared substructures.
• More formally:
– directed acyclic graphs
– arcs = parent/child relation
– each parent's children ordered*
– parents may share children ≡ multiple parentage
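A minimal Python sketch of such a structure, assuming one plausible reading of the "Broadway Hit or Miss" headline in which the leaf "Hit" is a child of both np and vp. The class Node and the helper leaf_text are hypothetical names for illustration, not part of any MLCD software.

```python
class Node:
    """A Goddag node: ordered children, possibly several parents."""

    def __init__(self, label):
        self.label = label
        self.children = []  # order is significant
        self.parents = []   # more than one entry means shared substructure

    def add(self, child):
        self.children.append(child)
        child.parents.append(self)
        return child


def leaf_text(node):
    """Leaves below node in order, visiting each shared leaf only once."""
    seen, out = set(), []

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        if not n.children:
            out.append(n.label)
        for child in n.children:
            visit(child)

    visit(node)
    return out


# Assumed shape: head dominates np and vp; "Hit" has two parents.
head, np, vp = Node("head"), Node("np"), Node("vp")
head.add(np)
head.add(vp)
np.add(Node("Broadway "))
hit = np.add(Node("Hit"))
vp.add(hit)                 # multiple parentage expresses the overlap
vp.add(Node(" or Miss"))

print("".join(leaf_text(head)))
print([p.label for p in hit.parents])
```

An ordinary XML tree is the special case in which every parents list has at most one entry.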
Trees are Goddags
Overlaps form Goddags
and also self-overlap:
<head|
<coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
|head>
and virtual elements:
<act|<scene|
<sp who="HUGHIE"|
<p|How did that translation go?|p>
<Lg type="haiku"|
<L|da de dum de dum,|L>
<L@frog|gets a new frog,|L>
<L|...|L>|Lg> |sp>
<sp who="LOUIS"|
<p|Er ...|p> <Lg|
<L@new|it's a new pond.|L>|Lg> |sp>
<sp who="DEWEY"|
<p|Ah ...|p> <Lg|
<L@pond|When the old pond|L>|Lg>
<p|Right. That's it.|p> |sp>
|scene>|act>
<Lg|<^L^pond><^L^frog><^L^new>|Lg>
Hughie, Louis and Dewey Goddag
Goddag – concerns and problems
• Containment versus dominance
• Discontinuity
– TexMecs <e|…|-e> … <+e| … |e>
• Virtual elements – types or tokens?
• Reordering and reserialization
– Spurious overlap
– Overlap and discontinuity vs. virtual elements
– In and out of XML-mechanisms
Peer Gynt, containment relation
Peer, dramatic view
Peer, verse view
Peer Gynt, dominance relations
Validation
• Without validation, no proper markup system.
• Without a constraint language, no validation.
• So, how do we find a constraint language?
• Directions:
– rabbit/duck grammars
– predicate-based validation
– graph grammars?
Predicate-based validation
• For each element of a given type:
• Define predicates which must be true.
• Specify predicates using what language?
– XPath
– Goddag-extended XPath?
– other?
• Traverse the document, checking predicates. Cf. Schematron.
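The traversal can be sketched in Python over an ordinary XML tree. This is an illustration only: the predicates are plain Python functions rather than XPath or a Goddag-extended language, and the RULES table is a hypothetical restatement of the body and p declarations from the DTD slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element type -> list of (message, predicate).
RULES = {
    "body": [
        ("body starts with a title",
         lambda e: len(e) > 0 and e[0].tag == "title"),
        ("body has at least one p",
         lambda e: e.find("p") is not None),
    ],
    "p": [
        ("p has no element children", lambda e: len(e) == 0),
    ],
}


def validate(root):
    """Visit every element and report each predicate that is false."""
    failures = []
    for elem in root.iter():
        for message, predicate in RULES.get(elem.tag, []):
            if not predicate(elem):
                failures.append((elem.tag, message))
    return failures


good = ET.fromstring("<text><body><title>t</title><p>x</p></body></text>")
bad = ET.fromstring("<text><body><p>x</p></body></text>")

print(validate(good))
print(validate(bad))
```

For overlapping markup the same loop would have to run over a Goddag rather than a tree, which is where the open question about the predicate language arises.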
Rabbit/duck grammars
• Modest extension to CONCUR
• Define L as
L1 ∩ L2 ∩ … ∩ Ln ∩
• Allow constraints on interaction
• Require shared structures
• Allow selective visibility
Class system
• Each grammar in a set partitions* the vocabulary:
– First-class elements ("normal")
– Second-class ("milestones")
– Third-class ("transparent")
– Fourth-class ("opaque")
• Each GI in a vocabulary must be first-class in some grammar(s) …
• … but may be 2d, 3d, 4th-class in others.
• No two first-class element types may overlap.
Peer Gynt: Verse Grammar
play ::= act+
act ::= scene+
scene ::= (L | stage | #tag(sp))*
L ::= ( #PCDATA | stage | #stag(sp) | #etag(sp))*
stage ::= (#PCDATA)
Peer Gynt: Dramatic Structure
play ::= act+
act ::= scene+
scene ::= (sp | stage | #tag(L))*
sp ::= ( #PCDATA | stage | #stag(L))*
stage ::= (#PCDATA)
Two invariants
1 Each rule in each grammar is enforced wherever it applies.
2 When two elements overlap there are multiple readings: each element
– is parsed as 1st class in at least one reading
– is 2nd or 3d class in other(s)
Self-overlap
page ::= #overlap(#PCDATA | #tag(p) | #tag(lim) | ~lim~)*
lim ::= #overlap(#PCDATA | #tag(p) | #tag(lim) | ~lim~)*
NB: here lim occurs as both first- and second-class.
Implementation
• filter into n XML documents, validate each
• parse against each grammar directly; lexical scanner distinguishes elements by class
• parse in parallel
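The first option, filtering into n XML documents, can be sketched in Python as follows. Everything here is an assumption made for illustration: the token stream is a hand-made rendering of the first exchange in the Peer Gynt example, the milestone element and its attributes are invented names, and fourth-class (opaque) elements are left out. Each projection is only checked for well-formedness; validating it against its grammar would be a further step.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical token stream: sp and L overlap, as in the Peer Gynt slide.
TOKENS = [
    ("start", "act"),
    ("start", "sp"), ("start", "L"), ("text", "Peer, you're lying!"),
    ("end", "sp"),
    ("start", "sp"), ("text", "No, I am not!"), ("end", "L"),
    ("end", "sp"),
    ("end", "act"),
]


def project(tokens, classes):
    """Serialize one view. classes maps a GI to 1, 2 or 3 (default 1)."""
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(escape(value))
            continue
        cls = classes.get(value, 1)
        if cls == 1:    # first-class: ordinary start- or end-tag
            out.append(f"<{value}>" if kind == "start" else f"</{value}>")
        elif cls == 2:  # second-class: tag survives as an empty milestone
            out.append(f'<milestone type="{kind}" gi="{value}"/>')
        # third-class (transparent): the tag is dropped, content stays
    return "".join(out)


verse = project(TOKENS, {"sp": 2})   # L first-class, sp as milestones
drama = project(TOKENS, {"L": 2})    # sp first-class, L as milestones

for view in (verse, drama):
    ET.fromstring(view)              # raises ParseError if not well-formed
print(verse)
print(drama)
```

Serializing with both sp and L as first-class would not be well-formed XML, which is why no two first-class element types may overlap within a single grammar.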
Related work
• Text Encoding Initiative
• LMNL (Piez, Tennison, Caton, Cowan)
• Xconcur (Witt, Schoenefeld)
• The ARCHway Project (Kentucky, Dekhtyar et al.)
• JITTs (Durusau and O’Donnell)
• Multi-colored trees (Jagadish et al.)
• Standoff Markup (NITE, Carletta et al.)
• MultiX (Calabretto et al., Lyon)
• SGF (Bielefeld, Stührenberg, Goecke et al.)
• … and others
16.10.08
55
Download