Complex Document Structures
Michael Sperberg-McQueen, W3C / MIT
Claus Huitfeldt, University of Bergen
Living texts
16 October 2008, e-Science Institute, Edinburgh

The structure of this talk
– Document markup (XML)
– TexMecs (notation)
– Goddag (data structure)
– Rabbit-Duck (validation mechanism)
Problems addressed
– Overlap
– Alternate ordering
– Discontinuity

XML – Extensible Markup Language
• The world before SGML/XML:
  – Text as data type, a sequence of chars
    • chars ≈ graphemes
    • documents are not simply sequences of graphemes
  – more or less milestone markup:
    • troff, WordStar
    • Cocoa
  – proprietary formats
• XML provides:
  – a simple notation
  – a natural and useful data structure
  – a powerful control mechanism

XML Notation
    <text>
      <front> ... </front>
      <body language="eng">
        <title> ... </title>
        <p>...</p>
        <p>...</p>
        <!-- material omitted here -->
      </body>
      <back> ... </back>
    </text>

XML Constraint language (DTD)
    <!DOCTYPE text [
    <!ELEMENT text (front?,body,back?) >
    <!ELEMENT front (#PCDATA) >
    <!ELEMENT body (title,p+)>
    <!ATTLIST body language CDATA #IMPLIED >
    <!ELEMENT back (#PCDATA) >
    <!ELEMENT title (#PCDATA) >
    <!ELEMENT p (#PCDATA) >
    ]>

XML Data structure (the XML tree)
    text
      front …
      body
        title …
        p …
        p …
      back …

The XML tree
• arcs = parent/child relations
  – ascription of properties and part/whole relations
• each node a component,
  – each parent's children are ordered,
• leaf nodes' order = char sequence of document
  – positional meaning

Interpreting the XML tree
• ascend from leaf to collect properties and relations
  – simple inheritance (additive meaning / override)
• descend from node to collect structure and contents
• supports operations like retrieval, extraction, editing (delete/add) of leaf contents and nodes

• SGML/XML and the OHCO model
  – OHCO: Ordered Hierarchy of Content Objects
• Problems:
  – overlap
  – alternate co-existing orders
  – discontinuous elements
• Observation:
  – not such a bad idea, after all

MLCD: Markup Languages for Complex Documents
– TexMecs (notation)
– Goddag (data structure)
– Rabbit-Duck (validation mechanism)

TexMecs
Start- and end-tags: <e| ... |e>
Sole tags: <e>
… + attributes, entity references, comments, etc.

TexMecs is almost like XML:
    <div|
    <s n="1"|Die Welt ist alles, was der Fall ist.|s>
    <s n="1.1"|Die Welt ist die Gesamtheit der Tatsachen, nicht der Dinge.|s>
    <* ... *>
    <s n="1.2"|Die Welt zerfällt in Tatsachen.|s>
    <* ... *>
    |div>

but TexMecs allows overlap:
    <head|
    <np|Broadway <vp|Hit|np> or Miss|vp>
    |head>
(assuming "np" for "noun phrase" and "vp" for "verb phrase")

    <act|
    <sp who="Åse"| <L|Peer, you're lying!|sp>
    <sp who="Peer"|<stage|without stopping.|stage> No, I am not!|L>|sp>
    <sp who="Åse"|<L|Well then, swear that it is true!|L>|sp>
    <sp who="Peer"|<L|Swear? Why should I?|sp>
    <sp who="Åse"|See, you dare not!|L>
    <L|It's a lie from first to last.|L>|sp>
    |act>

… and self-overlap:
    <head|
    <coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
    |head>
(assuming "coll" for "collocation")
N.B. the suffixes (here "~1" and "~2") have no significance beyond allowing co-indexing.

Spurious overlap
    <a| penumbra <b| umbra |a> penumbra |b>
Normal:
    <a|…<b|…|a>…|b>
Spurious:
    <a|<b|…|a>…|b>
    <a|…<b||a>…|b>
    <a|…<b|…|a>|b>

…and discontinuous elements:
    <p id="P-6" n="Carroll, Alice I p6"|
    Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it,
    <q|and what is the use of a book,|-q>
    thought Alice
    <+q|without pictures or conversation?|q>
    |p>

…and virtual elements:
(Donald's house. Hughie, Louis, and Dewey.)
HUGHIE: How did that translation go?
    da de dum de dum,
    gets a new frog,
    ...
LOUIS: Er ...
    it's a new pond.
DEWEY: Ah ...
    When the old pond
    Right. That's it.
Is this poem in the text?
    When the old pond
    gets a new frog,
    it's a new pond.

    <act|<scene|
    <sp who="HUGHIE"|
      <p|How did that translation go?|p>
      <Lg type="haiku"|
        <L|da de dum de dum,|L>
        <L@frog|gets a new frog,|L>
        <L|...|L>|Lg>
    |sp>
    <sp who="LOUIS"|
      <p|Er ...|p>
      <Lg| <L@new|it's a new pond.|L>|Lg>
    |sp>
    <sp who="DEWEY"|
      <p|Ah ...|p>
      <Lg| <L@pond|When the old pond|L>|Lg>
      <p|Right. That's it.|p>
    |sp>
    |scene>|act>
    <Lg|<^L^pond><^L^frog><^L^new>|Lg>

…and unordered elements:
    <birthday|
    <|gifts||
      <gift| a tie |gift>
      <gift| socks |gift>
      <gift| after shave |gift>
      <gift| a kiss |gift>
    ||gifts|>
    |birthday>

TexMecs overview
Elements
Start- and end-tags: <e| ... |e>
 - with ID & attribute: <e@foo type="a"| ... |e>
 - overlap: <e| ... <f| ... |e> ... |f>
 - self-overlap: <e~1|...<e~2|...|e~1>...|e~2>
 - discontinuous: <e|...|-e> ... <+e|...|e>
 - unordered: <|e|| .. ||e|>
Sole elements: <e>
 - virtual: <^a^foo type="b">

TexMecs overview, cont.
Entities
Internal
 - unstructured: <&eacute>
 - structured: <&dot.fullstop> vs. <&dot.decimal>
External: <<url>>, e.g. <<http://www.w3.org/XML>>
Comments: <* ... *>
CDATA sections: <#CDATA< ... >>

MLCD data structure: Goddag
    <head|
    <np|Broadway <vp|Hit|np> or Miss|vp>
    |head>
Not: [figure with nodes head, np, vp]

MLCD data structure: Goddag
    <head|
    <np|Broadway <vp|Hit|np> or Miss|vp>
    |head>
But: [figure with nodes head, np, vp]

Goddag
Generalized Ordered-Descendant Directed Acyclic Graph
Overlap is simply multiple parentage.
• Informally, Goddags are tangled trees with shared substructures.
• More formally:
  – directed acyclic graphs
  – arcs = parent/child relation
  – each parent's children ordered*
  – parents may share children ≡ multiple parentage

Trees are Goddags

Overlaps form Goddags

and also self-overlap:
    <head|
    <coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
    |head>

and virtual elements:
    <act|<scene|
    <sp who="HUGHIE"|
      <p|How did that translation go?|p>
      <Lg type="haiku"|
        <L|da de dum de dum,|L>
        <L@frog|gets a new frog,|L>
        <L|...|L>|Lg>
    |sp>
    <sp who="LOUIS"|
      <p|Er ...|p>
      <Lg| <L@new|it's a new pond.|L>|Lg>
    |sp>
    <sp who="DEWEY"|
      <p|Ah ...|p>
      <Lg| <L@pond|When the old pond|L>|Lg>
      <p|Right. That's it.|p>
    |sp>
    |scene>|act>
    <Lg|<^L^pond><^L^frog><^L^new>|Lg>

Hughie, Louis and Dewey Goddag

Goddag – concerns and problems
• Containment versus dominance
• Discontinuity
  – TexMecs <e|…|-e> … <+e| … |e>
• Virtual elements
  – types or tokens?
• Reordering and reserialization
  – Spurious overlap
  – Overlap and discontinuity vs. virtual elements
  – In and out of XML-mechanisms

Peer Gynt, containment relation

Peer, dramatic view

Peer, verse view

Peer Gynt, dominance relations

Validation
• Without validation, no proper markup system.
• Without a constraint language, no validation.
• So, how do we find a constraint language?
• Directions:
  – rabbit/duck grammars
  – predicate-based validation
  – graph grammars?

Predicate-based validation
• For each element of a given type:
• Define predicates which must be true.
• Specify predicates using what language?
  – XPath
  – Goddag-extended XPath?
  – other?
• Traverse the document, checking predicates.
Cf. Schematron.

Rabbit/duck grammars
• Modest extension to CONCUR
• Define L as L1 ∩ L2 ∩ … ∩ Ln
• Allow constraints on interaction
• Require shared structures
• Allow selective visibility

Class system
• Each grammar in a set partitions* the vocabulary:
  – First-class elements ("normal")
  – Second-class ("milestones")
  – Third-class ("transparent")
  – Fourth-class ("opaque")
• Each GI in a vocabulary must be first-class in some grammar(s) …
• … but may be second-, third-, or fourth-class in others.
• No two first-class element types may overlap.

Peer Gynt: Verse Grammar
    play ::= act+
    act ::= scene+
    scene ::= (L | stage | #tag(sp))*
    L ::= ( #PCDATA | stage | #stag(sp) | #etag(sp))*
    stage ::= (#PCDATA)

Peer Gynt: Dramatic Structure
    play ::= act+
    act ::= scene+
    scene ::= (sp | stage | #tag(L))*
    sp ::= ( #PCDATA | stage | #stag(L))*
    stage ::= (#PCDATA)

Two invariants
1. Each rule in each grammar is enforced wherever it applies.
2. When two elements overlap there are multiple readings: each element
  – is parsed as 1st class in at least one reading
  – is 2nd or 3rd class in other(s)

Self-overlap
    page ::= #overlap(#PCDATA | #tag(p) | #tag(lim) | ~lim~)*
    lim ::= #overlap(#PCDATA | #tag(p) | #tag(lim) | ~lim~)*
NB: here lim occurs as both first- and second-class.

Implementation
• filter into n XML documents, validate each
• parse against each grammar directly; lexical scanner distinguishes elements by class
• parse in parallel

Related work
• Text Encoding Initiative
• LMNL (Piez, Tennison, Caton, Cowan)
• Xconcur (Witt, Schoenefeld)
• The ARCHway Project (Kentucky, Dekhtyar et al.)
• JITTs (Durusau and O’Donnell)
• Multi-colored trees (Jagadish et al.)
• Standoff Markup (NITE, Carletta et al.)
• MultiX (Calabretto et al., Lyon)
• SGF (Bielefeld, Stührenberg, Goecke et al.)
• …and others
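
Sketch: overlap as multiple parentage. The Goddag slides state that "overlap is simply multiple parentage". The following is a minimal Python sketch of that idea, not the MLCD implementation: it reads only plain start- and end-tags (no attributes, discontinuity, or virtual elements), assumes each tag name (including any ~n suffix) occurs once, and approximates dominance by span containment. The function names parse_spans and goddag_parents are hypothetical.

```python
import re

# Start-tag <name| or end-tag |name>; names may carry a ~n co-index suffix.
TOKEN = re.compile(r"<([\w~]+)\||\|([\w~]+)>")

def parse_spans(texmecs):
    """Return (text, spans); spans maps each tag name to (start, end) offsets in text."""
    parts, spans, open_at, pos, offset = [], {}, {}, 0, 0
    for m in TOKEN.finditer(texmecs):
        chunk = texmecs[pos:m.start()]
        parts.append(chunk)
        offset += len(chunk)
        pos = m.end()
        if m.group(1):                      # start-tag: remember where it opened
            open_at[m.group(1)] = offset
        else:                               # end-tag: close the matching span
            spans[m.group(2)] = (open_at.pop(m.group(2)), offset)
    parts.append(texmecs[pos:])
    return "".join(parts), spans

def contains(a, b):
    """True if span a properly contains span b."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def innermost(names, spans):
    """Keep the names whose span contains no other span in the same set."""
    return sorted(n for n in names
                  if not any(contains(spans[n], spans[m]) for m in names))

def goddag_parents(texmecs):
    """Return (leaf_parents, element_parents) under the assumptions stated above."""
    text, spans = parse_spans(texmecs)
    cuts = sorted({p for s in spans.values() for p in s} | {0, len(text)})
    leaves = []
    for lo, hi in zip(cuts, cuts[1:]):      # one leaf per stretch between tag positions
        covering = [n for n, s in spans.items() if s[0] <= lo and hi <= s[1]]
        leaves.append((text[lo:hi], innermost(covering, spans)))
    elements = {n: innermost([m for m in spans if contains(spans[m], s)], spans)
                for n, s in spans.items()}
    return leaves, elements

leaves, elements = goddag_parents("<head|<np|Broadway <vp|Hit|np> or Miss|vp>|head>")
for leaf, parents in leaves:
    print(repr(leaf), parents)
print(elements)
```

On the "Broadway Hit or Miss" example, the leaf "Hit" ends up with two parents (np and vp), while np and vp each have the single parent head, which is the shared substructure that a tree cannot express.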
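
Sketch: filtering into one XML document per grammar. The Implementation slide lists "filter into n XML documents, validate each" as one option. The Python sketch below illustrates only the filtering step under stated assumptions: elements that are first-class in a given grammar become ordinary XML elements, and all others become empty milestone elements. The milestone names (name-start, name-end) and the function project are hypothetical, attribute values are assumed to need no escaping, and only well-formedness is checked here, not validity against a grammar.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Start-tag with optional attributes, or end-tag.
TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\||\|(\w+)>')

def project(texmecs, first_class):
    """Serialize one view: first-class names as elements, the rest as milestones."""
    out, pos = [], 0
    for m in TAG.finditer(texmecs):
        out.append(escape(texmecs[pos:m.start()]))   # character data between tags
        pos = m.end()
        name, attrs, end_name = m.groups()
        if name:                                     # start-tag
            if name in first_class:
                out.append(f"<{name}{attrs}>")
            else:
                out.append(f"<{name}-start{attrs}/>")
        elif end_name in first_class:                # end-tag of a first-class element
            out.append(f"</{end_name}>")
        else:                                        # end-tag of a milestone element
            out.append(f"<{end_name}-end/>")
    out.append(escape(texmecs[pos:]))
    return "".join(out)

# Shortened from the Peer Gynt example: the verse line L overlaps two speeches sp.
SAMPLE = ('<act|<sp who="Åse"|<L|Peer, you\'re lying!|sp>'
          '<sp who="Peer"|<stage|without stopping.|stage> No, I am not!|L>|sp>|act>')

dramatic = project(SAMPLE, {"act", "sp", "stage"})   # sp first-class, L as milestones
verse = project(SAMPLE, {"act", "L", "stage"})       # L first-class, sp as milestones

# Both views parse as XML, so each could be handed to an ordinary XML validator.
drama_root = ET.fromstring(dramatic)
verse_root = ET.fromstring(verse)
print(dramatic)
print(verse)
```

Each view is well-formed because no two element types that are first-class in the same view overlap, which mirrors the rule on the Class system slide.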