XML for Information Management
12.1.-16.1. 2009, University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen, http://users.jyu.fi/~airi/

XML for Information Management – Day 4: Logical and Physical Structure of XML Documents
Airi Salminen

Outline
1. Components of the logical structure
2. XML documents as trees
3. Entity types
4. Entity declarations and references
5. XML processor treatment of entity references
6. Motivations for the use of entities

1. Components of the logical structure

• declarations
• elements
• comments
• processing instructions

document ::= prolog element Misc*

  prolog    declarations, comments, processing instructions
  element   elements, comments, processing instructions
  Misc*     comments, processing instructions

Declarations (bracketed numbers are production numbers in the XML specification):
‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
  • element type declaration [45] (to constrain the logical structure)
  • attribute list declaration [52] (to constrain the logical structure)
  • entity declaration [70] (to constrain the physical structure)
  • notation declaration [82] (to constrain the physical structure)
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]
Typical element type declarations:

  <!ELEMENT product     (mfg, model, description, clock?)>   element content defined
  <!ELEMENT model       (#PCDATA)>                           mixed content defined
  <!ELEMENT description (#PCDATA | feature)*>                mixed content defined
  <!ELEMENT clock       EMPTY>                               empty element defined

Empty element defined:

  <!ELEMENT clock EMPTY>

Two forms of the element are allowed in a well-formed document:

  <clock></clock>
  <clock/>

Element content: defined by content models built with metasymbols

  *    iteration (zero or more)
  +    iteration (one or more)
  |    alternatives
  ?    optional
  ,    sequence
  ()   grouping

Example from the XHTML 1.0 Strict DTD:

  <!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>

#PCDATA is not allowed in an element content model!

Mixed content: the definition has basically two forms

  (#PCDATA)
  (#PCDATA | e1 | … | en)*

Examples:

  <!ELEMENT text    (#PCDATA)>
  <!ELEMENT section (#PCDATA | subsection)*>
  <!ELEMENT section (#PCDATA | subsection | paragraph)*>

#PCDATA is always included in the content specification and comes first in the list of alternatives.

Attribute list declarations
• define the set of attributes pertaining to a given element type
• establish type constraints for these attributes
• provide default values for attributes
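The content-model metasymbols listed earlier in this section behave much like regular-expression operators applied to the sequence of child element names. The following Python sketch relies on that analogy: it translates an element content model into a regular expression and tests a list of child names against it. The names `content_model_to_regex` and `children_match` are illustrative only and do not belong to any XML library; a real validator does considerably more (for instance, it also requires content models to be deterministic).

```python
import re

# Tokens of an element content model: element names and the metasymbols ( ) | ? * + ,
TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|[()|?*+,]")

def content_model_to_regex(model):
    """Translate a DTD element content model into a compiled regular expression."""
    parts = []
    for tok in TOKEN.findall(model):
        if tok == ",":
            continue                      # sequence is plain concatenation in a regex
        if tok in "()|?*+":
            parts.append(tok)             # same meaning in DTDs and in regexes
        else:
            # each child name is matched together with a trailing separator space
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def children_match(model, children):
    """Return True if the list of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None

PRODUCT = "(mfg, model, description, clock?)"
TABLE = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"

print(children_match(PRODUCT, ["mfg", "model", "description"]))
print(children_match(TABLE, ["tbody", "tr"]))
```

The separator space after each name keeps `col` from matching the beginning of `colgroup`. Mixed content models and `EMPTY` are deliberately outside the scope of this sketch.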
Example of an attribute list declaration:

  <!ATTLIST poem author CDATA #REQUIRED>

  poem        element type
  author      attribute name
  CDATA       attribute type: string
  #REQUIRED   constraint: the attribute must be specified for all elements of type poem

Defining constraints

  [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)

  #REQUIRED         the attribute must be provided in all elements of the given type
  #IMPLIED          the attribute may be provided in an element; no default value is given
  AttValue          a default value is given between single or double quotes
  #FIXED AttValue   instances of the attribute must match the given default value

Attribute types

  [54] AttType ::= StringType | TokenizedType | EnumeratedType

string type:
• CDATA: character string

tokenized types:
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers

enumerated types:
• NOTATION: identifies notations
• enumeration

Example:

  <?xml version="1.0"?>
  <!DOCTYPE text [
  <!ELEMENT text (line+)>
  <!ELEMENT line (#PCDATA)>
  <!ATTLIST line
      id      ID     #REQUIRED
      seeline IDREFS #IMPLIED>
  ]>
  <text>
    <line id="r1">This is the first line</line>
    <line id="r2" seeline="r1">
      This is the second line, but look at the first too
    </line>
  </text>
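The ID and IDREFS constraints in the example above are validity constraints, so a non-validating parser such as the one behind Python's `xml.etree.ElementTree` does not check them. The sketch below checks them by hand for this one document type. It assumes the attribute names `id` and `seeline` on `line` elements, exactly as in the example; the function name `check_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
    id      ID     #REQUIRED
    seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""

def check_ids(xml_text):
    """Return a list of ID/IDREFS violations found in the document (empty if none)."""
    root = ET.fromstring(xml_text)
    errors = []
    seen = set()
    # First pass: every line needs an id, and ID values must be unique.
    for line in root.iter("line"):
        ident = line.get("id")
        if ident is None:
            errors.append("line without required id")
        elif ident in seen:
            errors.append("duplicate ID: " + ident)
        else:
            seen.add(ident)
    # Second pass: every name in an IDREFS value must match some ID.
    for line in root.iter("line"):
        for ref in line.get("seeline", "").split():
            if ref not in seen:
                errors.append("IDREFS value without matching ID: " + ref)
    return errors

print(check_ids(DOC))
```

A validating processor derives the same checks from the DTD itself instead of hard-coding the attribute names.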
2. XML documents as trees

  <Chapter section='1'><Narration narrator='Benjy'>
  <Imagery place='tree' mode='simile' sense='smell'>
  <Fragment code='1.12'><Paragraph id='143'>
  <Subject person='Caddy'>She</Subject>smelled like trees.
  </Paragraph></Fragment></Imagery>
  </Narration></Chapter>

XML-aware web browsers support the visualization of the hierarchical structure.

The XML specification defines a concrete syntax for XML documents. W3C has defined four slightly different abstract models to describe the abstract syntax of XML documents:
• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model

Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.

Example document:

  <poem author="Murasaki Shikibu" born="974">
  <!-- The poem is translated from Japanese by Kenneth Rexroth -->
  <line>This life of ours would not cause you sorrow</line>
  <line>if you thought of it as like</line>
  <line>the mountain cherry blossoms</line>
  <line>which bloom and fade in a day.</line>
  </poem>

Node types of XPath 1.0 in the example document:

  Root node
    Element node: poem
      Attribute node: author = Murasaki Shikibu
      Attribute node: born = 974
      Comment node: The poem is translated from Japanese by Kenneth Rexroth
      Element node: line
        Text node: This life of ours would not cause you sorrow
      Element node: line
        Text node: if you thought of it as like
      Element node: line
        Text node: the mountain cherry blossoms
      Element node: line
        Text node: which bloom and fade in a day.
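The tree view of the poem can be approximated with Python's `xml.dom.minidom`. Note the assumption: minidom implements the DOM model, not the XPath 1.0 model, so the sketch below only maps DOM node kinds onto the XPath 1.0 node type names used above. Whitespace-only text nodes between elements are skipped to keep the listing short, and the function name `describe` is illustrative.

```python
from xml.dom import minidom, Node

POEM = """<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day.</line>
</poem>"""

def describe(node, depth=0, out=None):
    """Collect one indented line per node, labelled with its node type."""
    out = [] if out is None else out
    pad = "  " * depth
    kind = node.nodeType
    if kind == Node.DOCUMENT_NODE:
        out.append(pad + "root")
    elif kind == Node.ELEMENT_NODE:
        out.append(pad + "element: " + node.tagName)
        # attributes belong to the element but are not among its children
        for name, value in sorted(node.attributes.items()):
            out.append(pad + "  attribute: " + name + "=" + value)
    elif kind == Node.COMMENT_NODE:
        out.append(pad + "comment: " + node.data.strip())
    elif kind == Node.TEXT_NODE and node.data.strip():
        out.append(pad + "text: " + node.data.strip())
    if kind in (Node.DOCUMENT_NODE, Node.ELEMENT_NODE):
        for child in node.childNodes:
            describe(child, depth + 1, out)
    return out

print("\n".join(describe(minidom.parseString(POEM))))
```

Listing attributes separately from the children mirrors the XPath 1.0 view, in which an element is the parent of its attribute nodes although they are not its children.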
3. Entity types

The physical structure of an XML document consists of entities. An entity is a unit recognized by the XML processor; the content of an entity is text or some other kind of data.

3-dimensional categorization:

  parsed entities     --  unparsed entities
  internal entities   --  external entities
  general entities    --  parameter entities

parsed entity
• intended to be parsed by the XML processor
• content consists of marked-up text

unparsed entity
• not intended to be parsed by the XML processor
• content can be any kind of data

internal entity
• name and value given in an entity declaration
• always a parsed entity

external entity
• not internal
• parsed or unparsed

general entity
• used in elements and attributes
• parsed or unparsed
• internal or external

parameter entity
• used in the document type definition
• always parsed
• internal or external

Alternatives:

  parsed                 unparsed
  internal parameter     external general
  internal general
  external parameter
  external general
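The five alternatives follow mechanically from the two restrictions stated above: internal entities are always parsed, and parameter entities are always parsed. As a small check, the Python sketch below enumerates all eight combinations of the three dimensions and keeps those that satisfy both restrictions. The function names are illustrative.

```python
from itertools import product

def allowed(parsed, internal, general):
    """Apply the two restrictions from the entity type definitions."""
    if internal and not parsed:
        return False          # an internal entity is always parsed
    if not general and not parsed:
        return False          # a parameter entity is always parsed
    return True

def entity_categories():
    """Return the names of all permitted combinations of the three dimensions."""
    names = []
    for parsed, internal, general in product([True, False], repeat=3):
        if allowed(parsed, internal, general):
            names.append(" ".join([
                "parsed" if parsed else "unparsed",
                "internal" if internal else "external",
                "general" if general else "parameter",
            ]))
    return names

for name in entity_categories():
    print(name)
```

Of the eight combinations, the three that fail are unparsed internal general, unparsed internal parameter, and unparsed external parameter.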
Entities and the XML processor:

INPUT FILES for XML processing:
• root entity, external subset of DTD
• other files intended for XML processing

INTERNAL ENTITIES:
• name and textual content given in DTD

UNPARSED ENTITIES:
• files not intended for XML processing but referred to by entity references in the INPUT FILES

The XML processor gives the application information about:
• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities

4. Entity declarations and references

  EntityDecl ::= GEDecl | PEDecl
  GEDecl     ::= '<!ENTITY' S Name S EntityDef S? '>'
  PEDecl     ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
  EntityDef  ::= EntityValue | (ExternalID NDataDecl?)
  PEDef      ::= EntityValue | ExternalID

  EntityValue   entity definition for an internal entity
  ExternalID    entity definition for an external entity

internal entity
• name and value (= literal value) given

  <!ENTITY % Shape "(rect | circle | poly | default )">
  <!ENTITY JY "Jyväskylän yliopisto">

  Shape, JY                     name
  the string between quotes     literal value
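To see an internal general entity at work, the following sketch parses a tiny document with Python's `xml.etree.ElementTree`, whose underlying parser (expat) is non-validating but does replace references to internal general entities declared in the internal DTD subset. The element name `note` and the surrounding text are made up for the illustration; only the declaration of `JY` comes from the slide.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE note [
<!ENTITY JY "Jyväskylän yliopisto">
]>
<note>Welcome to &JY;!</note>"""

def expand(xml_text):
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(xml_text).text

print(expand(DOC))
```

The application receives only the expanded character data; the reference `&JY;` itself is not visible in the parsed tree.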
external entity
• name and system identifier (possibly together with a public identifier) given; for an unparsed entity also a notation

  <!ENTITY % HTMLsymbol PUBLIC
     "-//W3C//ENTITIES Symbols for XHTML//EN"
     "xhtml-symbol.ent">
  <!ENTITY % HTMLspecial PUBLIC
     "-//W3C//ENTITIES Special for XHTML//EN"
     "xhtml-special.ent">

Declarations from the XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

  <!ENTITY virtuaaliyliopistouutiset SYSTEM
     "http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">

Unparsed entity

  <!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>

Here gif is the notation name. The notation must have been declared, for example:

  <!NOTATION gif PUBLIC
     "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">

References to parameter entities:

  %Shape;
  %HTMLsymbol;

References to parsed general entities:

  &JY;
  &virtuaaliyliopistouutiset;

Reference to an unparsed general entity:

  <poem image="image1">

The type of the attribute has to be ENTITY or ENTITIES.

In addition to entity references, XML documents may contain character references. A character reference
• refers to a specific character of Unicode
• provides a decimal or hexadecimal representation of the character's code point in Unicode

Example: &#34;

One-character entity defined:

  <!ENTITY quot "&#34;">

Where can an entity or character reference occur?
  reference to               can occur in

  parameter entity           ‣ document type definition

  parsed general entity      ‣ element content
                             ‣ attribute value (either in the start-tag or in the attribute definition)
                             ‣ entity value

  unparsed general entity    ‣ attribute value (either in the start-tag or in the attribute definition)

  character                  ‣ element content
                             ‣ attribute value (either in the start-tag or in the attribute definition)
                             ‣ entity value

5. XML processor treatment of entity references

References to unparsed entities

A validating processor makes the identifiers for the entities and associated notations available to the application.

  <poem image="figure1">
  <!-- From a poem of Aale Tynni -->
  <line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
  <line>Ihana pesä.</line>
  </poem>

References to parsed entities

The processor deals with two kinds of entity values:

literal value
• the character string written between quotes in the entity definition

replacement text
• derived by replacing the character references and parameter entity references in the literal value by their character values and replacement texts, respectively

The XML processor replaces the entity reference by its replacement text.
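As a rough illustration of how a processor can hand unparsed-entity information to the application, the sketch below uses the expat bindings in Python's standard library (`xml.parsers.expat`). Expat is non-validating, yet it reports entity and notation declarations through the callbacks `EntityDeclHandler` and `NotationDeclHandler`. The sample document reuses the `image1` and `gif` declarations from Section 4; the element declarations around them and the function name `collect_unparsed` are assumptions made for this example.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE poem [
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
<!ELEMENT poem (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST poem image ENTITY #IMPLIED>
]>
<poem image="image1"><line>Ihana puu.</line></poem>"""

def collect_unparsed(xml_text):
    """Return (notations, unparsed entities) as reported by the parser."""
    notations = {}
    unparsed = {}

    def on_notation(name, base, system_id, public_id):
        notations[name] = (public_id, system_id)

    def on_entity(name, is_parameter_entity, value, base,
                  system_id, public_id, notation_name):
        # Only unparsed entities carry a notation name (the NDATA part).
        if notation_name is not None:
            unparsed[name] = (system_id, notation_name)

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.EntityDeclHandler = on_entity
    parser.Parse(xml_text, True)
    return notations, unparsed

notations, unparsed = collect_unparsed(DOC)
print(unparsed)
print(notations)
```

The parser never opens `birdnest.gif`; it only passes on the system identifier and the notation name, and the application decides what to do with them.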
entity declaration:

  <!ENTITY rhyme1 '<rhyme xml:lang="fi">
  <line>Ole aina iloinen</line>
  <line>niin kuin pikku varpunen</line>
  </rhyme>'>

entity reference:

  <rhymecollection>
  &rhyme1;
  </rhymecollection>

replacement text = literal value:

  <rhyme xml:lang="fi">
  <line>Ole aina iloinen</line>
  <line>niin kuin pikku varpunen</line>
  </rhyme>

Declarations from the XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

  <!ENTITY % StyleSheet "CDATA">   <!-- style sheet data -->
  <!ENTITY % Text "CDATA">         <!-- used for titles etc. -->
  <!ENTITY % coreattrs
   "id     ID            #IMPLIED
    class  CDATA         #IMPLIED
    style  %StyleSheet;  #IMPLIED
    title  %Text;        #IMPLIED">

literal value of coreattrs:

  id     ID            #IMPLIED
  class  CDATA         #IMPLIED
  style  %StyleSheet;  #IMPLIED
  title  %Text;        #IMPLIED

replacement text of coreattrs:

  id     ID     #IMPLIED
  class  CDATA  #IMPLIED
  style  CDATA  #IMPLIED
  title  CDATA  #IMPLIED

Exercise 10 (Course Text, Chapter 5)

Entity declaration from the XHTML Strict DTD:

  <!ENTITY % Block "(%block; | form | %misc; )*">

What is the (a) literal value and (b) replacement text of the entity Block?

(a) literal value: (%block; | form | %misc; )*
Other entity declarations needed from the DTD:

  <!ENTITY % heading "h1| h2| h3| h4| h5| h6">
  <!ENTITY % lists "ul | ol | dl">
  <!ENTITY % blocktext "pre | hr | blockquote | address">
  <!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
  <!ENTITY % misc.inline "ins | del | script">
  <!ENTITY % misc "noscript | %misc.inline;">

Declarations from the XHTML specification: http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

(b) Deriving the replacement text of Block: the references to parameter entities in the literal value (%block; | form | %misc; )* are replaced by their replacement texts.

Literal value of block:
  p | %heading; | div | %lists; | %blocktext; | fieldset | table

Replacement text of block:
  p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table

Literal value of misc:
  noscript | %misc.inline;

Replacement text of misc:
  noscript | ins | del | script

Replacement text of Block:
  (p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table | form | noscript | ins | del | script )*

6. Motivations for the use of entities

The use of entities supports:
• use of non-textual data (audio, graphics, etc.) in XML documents (though such data can also be added via stylesheets)
• modularization of documents
• consistency
• reuse of definitions
• adding semantic information through informative entity names and comments attached to entity declarations
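Returning to Exercise 10 in Section 5, the derivation of a replacement text from literal values can be sketched in a few lines of Python. The sketch assumes that the literal values contain only parameter entity references (no character references) and that every referenced entity is present in the dictionary; the names `LITERALS`, `PE_REF` and `replacement_text` are illustrative and not part of any XML library.

```python
import re

# Literal values of the parameter entities, copied from the declarations in Section 5.
LITERALS = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc; )*",
}

# A parameter entity reference: '%' Name ';'
PE_REF = re.compile(r"%([A-Za-z_:][\w.:\-]*);")

def replacement_text(name, literals):
    """Replace every parameter entity reference in the literal value of `name`
    by the replacement text of the referenced entity."""
    return PE_REF.sub(lambda m: replacement_text(m.group(1), literals),
                      literals[name])

print(replacement_text("Block", LITERALS))
```

An XML processor builds each replacement text once, at the point of declaration, which is why a DTD has to declare a parameter entity before referring to it. The recursion here produces the same strings without depending on declaration order.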