Chapter 4: Logical Structure of XML Documents

XML for Information Management
12.1.-16.1. 2009
University of Erlangen-Nuremberg
Computational Linguistics
Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/
XML for Information Management – Day 4: Logical and Physical Structure of XML Documents
Airi Salminen
Day 4: Logical and Physical Structure of XML Documents
Outline
1. Components of the logical structure
2. XML documents as trees
3. Entity types
4. Entity declarations and references
5. XML processor treatment of entity references
6. Motivations for the use of entities
1. Components of the logical structure
• declarations
• elements
• comments
• processing instructions
1. Components of the logical structure
document ::= prolog element Misc*
• prolog: declarations, comments, processing instructions
• element: elements, comments, processing instructions
• Misc*: comments, processing instructions
1. Components of the logical structure
Declarations:
‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
• element type declaration [45] (to constrain the logical structure)
• attribute list declaration [52] (to constrain the logical structure)
• entity declaration [70] (to constrain the physical structure)
• notation declaration [82] (to constrain the physical structure)
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]
1. Components of the logical structure
Typical element type declarations:
<!ELEMENT product     (mfg, model, description, clock?)>
<!ELEMENT model       (#PCDATA)>
<!ELEMENT description (#PCDATA | feature)*>
<!ELEMENT clock       EMPTY>
• product: element content defined
• model, description: mixed content defined
• clock: empty element defined
1. Components of the logical structure
empty element defined:
<!ELEMENT clock EMPTY>
two forms of the element allowed in a well-formed
document:
<clock></clock>
<clock/>
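The equivalence of the two forms can be checked with a short Python sketch. It uses the standard library module xml.etree.ElementTree, which is not part of the course material; the helper is_empty is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# The two well-formed spellings of an empty element.
long_form = ET.fromstring("<clock></clock>")
short_form = ET.fromstring("<clock/>")


def is_empty(element):
    """True if the element has no character data and no child elements."""
    return element.text is None and len(element) == 0


print(long_form.tag, is_empty(long_form))
print(short_form.tag, is_empty(short_form))
```

Both spellings give the parser the same information: an element named clock without content.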
1. Components of the logical structure
element content: definition by content models with
metasymbols
*     iteration (none or more)
+     iteration (once or more)
|     alternatives
?     optional
,     successive
()    grouping
Example from XHTML 1.0 Strict DTD:
<!ELEMENT table
(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
#PCDATA is not allowed in an element content model!
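Because the metasymbols behave like regular-expression operators, an element content model can be tested against a sequence of child element names. The following Python sketch is only an illustration under simplifying assumptions: it handles element content models only (no #PCDATA, EMPTY or ANY), and the names model_to_regex and matches are hypothetical helpers, not part of any XML library.

```python
import re

# The XHTML 1.0 Strict content model for table, as quoted on the slide.
TABLE_MODEL = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"

# Simplified pattern for element names (an assumption; real XML names
# allow more characters).
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def model_to_regex(model):
    """Translate a content model into a regular expression over a
    sequence of child names, each name terminated by ';'."""
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # ',' means plain succession, so commas and whitespace can be dropped.
    return re.sub(r"[\s,]", "", pattern)


def matches(model, children):
    """True if the list of child element names satisfies the model."""
    sequence = "".join(name + ";" for name in children)
    return re.fullmatch(model_to_regex(model), sequence) is not None


print(matches(TABLE_MODEL, ["caption", "tr", "tr"]))
print(matches(TABLE_MODEL, ["tbody", "tr"]))
```

The first call satisfies the model; the second does not, because tbody+ and tr+ are alternatives and cannot be mixed.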
1. Components of the logical structure
mixed content: definition has basically two forms
(#PCDATA)
(#PCDATA | e1 | … | en)*
examples:
<!ELEMENT text    (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
#PCDATA is always included in the content specification
and comes first in the list of alternatives
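A small Python sketch shows how mixed content looks to an application. It uses the standard library module xml.etree.ElementTree (an assumption of this example, not of the course); the sample section text is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A section with mixed content: character data interleaved with a child element.
section = ET.fromstring(
    "<section>Intro text <subsection>nested</subsection> and trailing text</section>"
)

child = section[0]
print(repr(section.text))   # character data before the first child element
print(child.tag, repr(child.text))
print(repr(child.tail))     # character data following the child element

# All character data of the section in document order.
full_text = "".join(section.itertext())
print(full_text)
```

ElementTree stores the character data that follows a child element in that child's tail attribute, so no text of the mixed content is lost.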
1. Components of the logical structure
Attribute list declarations
• to define the set of attributes pertaining to a given element type
• to establish type constraints for these attributes
• to provide default values for attributes
1. Components of the logical structure
<!ATTLIST poem author CDATA #REQUIRED>
• poem: element type
• author: attribute name
• CDATA: attribute type (string)
• #REQUIRED: constraint; the attribute must be specified for all
elements of type poem
1. Components of the logical structure
Defining constraints
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
#REQUIRED: the attribute must be provided in all elements of
the given type
#IMPLIED: the attribute may be provided in an element; no default value
is provided
AttValue: the default value is given between single or double quotes
#FIXED AttValue: instances of the attribute must match the given
default value
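The effect of default declarations can be observed with a non-validating parser. The sketch below uses Python's standard xml.parsers.expat module; the attributes lang, format and source are hypothetical additions to the poem example, made up for this illustration. Because the parser is non-validating, it supplies default values but does not enforce #REQUIRED.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE poem [
<!ELEMENT poem (#PCDATA)>
<!ATTLIST poem
    author CDATA #REQUIRED
    lang   CDATA "fi"
    format CDATA #FIXED "verse"
    source CDATA #IMPLIED>
]>
<poem author="Aale Tynni">Ihana puu.</poem>"""


def reported_attributes(document):
    """Return the attributes the parser reports for the first start-tag."""
    collected = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: collected.append(dict(attrs))
    parser.Parse(document, True)
    return collected[0]


attributes = reported_attributes(DOCUMENT)
for name in sorted(attributes):
    print(name, "=", attributes[name])
```

The start-tag specifies only author. The parser adds lang and format from their declared defaults, while the #IMPLIED attribute source is simply absent.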
1. Components of the logical structure
Attribute types
[54] AttType ::= StringType | TokenizedType | EnumeratedType
string type:
• CDATA: character string
tokenized types:
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters
accepted in names
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers
enumerated types:
• NOTATION: identifies notations
• enumeration
1. Components of the logical structure
<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
    id      ID     #REQUIRED
    seeline IDREFS #IMPLIED> ]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">
This is the second line, but look at the first too
</line>
</text>
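A validating processor checks that ID values are unique and that every IDREFS token refers to an existing ID. Python's standard library does not validate, so the following sketch performs the two checks by hand. The function check_ids and its parameters are hypothetical names for this illustration, and the attribute names are passed in explicitly instead of being read from the DTD.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">
This is the second line, but look at the first too
</line>
</text>"""


def check_ids(root, id_attr="id", idrefs_attr="seeline"):
    """Return a list of problems: duplicate IDs and dangling IDREFS tokens."""
    problems = []
    seen = set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append("duplicate ID: " + value)
            seen.add(value)
    for element in root.iter():
        # An IDREFS value is a whitespace-separated list of ID references.
        for token in element.get(idrefs_attr, "").split():
            if token not in seen:
                problems.append("dangling IDREF: " + token)
    return problems


print(check_ids(ET.fromstring(DOCUMENT)))
```

For the document above the list of problems is empty; changing seeline to a value such as "r1 r9" would report r9 as a dangling reference.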
2. XML documents as trees
<Chapter section = '1' ><Narration narrator='Benjy'>
<Imagery place='tree' mode='simile' sense='smell'>
<Fragment code='1.12'><Paragraph id='143'>
<Subject person='Caddy'>She</Subject>smelled like trees.
</Paragraph></Fragment></Imagery>
</Narration></Chapter>
XML-aware web browsers support the visualization of
the hierarchical structure.
2. XML documents as trees
The XML specification defines a concrete syntax for XML
documents.
W3C has defined four slightly different abstract models to
describe the abstract syntax of XML documents:
• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model
Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for
XML document database systems. Proc. of the ACM Symposium on Document Engineering
(DocEng '01) (pp. 85-94). New York: ACM Press.
2. XML documents as trees
<poem author = "Murasaki Shikibu" born = "974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>
2. XML documents as trees
Node types of XPath 1.0 (tree of the poem document)
Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
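A comparable tree can be produced programmatically. The sketch below uses the DOM implementation xml.dom.minidom from Python's standard library, so it follows the DOM model and not the XPath 1.0 model, but for this document the node kinds correspond. The poem is written without line breaks between elements to avoid whitespace-only text nodes, and describe is a hypothetical helper name.

```python
from xml.dom import minidom, Node

POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    "<!-- The poem is translated from Japanese by Kenneth Rexroth -->"
    "<line>This life of ours would not cause you sorrow</line>"
    "<line>if you thought of it as like</line>"
    "<line>the mountain cherry blossoms</line>"
    "<line>which bloom and fade in a day. </line>"
    "</poem>"
)


def describe(node, depth=0, out=None):
    """Collect one indented line per node, labelled with its node kind."""
    out = [] if out is None else out
    indent = "  " * depth
    if node.nodeType == Node.DOCUMENT_NODE:
        out.append(indent + "root")
    elif node.nodeType == Node.ELEMENT_NODE:
        out.append(indent + "element: " + node.tagName)
        for name, value in sorted(node.attributes.items()):
            out.append(indent + "  attribute: " + name + " = " + value)
    elif node.nodeType == Node.TEXT_NODE:
        out.append(indent + "text: " + node.data)
    elif node.nodeType == Node.COMMENT_NODE:
        out.append(indent + "comment: " + node.data.strip())
    for child in node.childNodes:
        describe(child, depth + 1, out)
    return out


lines = describe(minidom.parseString(POEM))
print("\n".join(lines))
```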
3. Entity types
The physical structure of XML documents consists of
entities.
An entity is a unit recognized by the XML processor;
its content is text or some other kind of data.
3. Entity types
3-dimensional categorization:
• parsed entities -- unparsed entities
• internal entities -- external entities
• general entities -- parameter entities
3. Entity types
parsed entity
intended to be parsed by the XML processor, content
consists of marked-up text
unparsed entity
not intended to be parsed by the XML processor,
content can be any kind of data
3. Entity types
internal entity
name and value given in an entity declaration
always a parsed entity
external entity
not internal
parsed or unparsed
3. Entity types
general entity
used in elements and attributes
parsed or unparsed
internal or external
parameter entity
used in the document type definition
always parsed
internal or external
3. Entity types
Alternatives
• parsed
  - internal: parameter or general
  - external: parameter or general
• unparsed
  - external: general
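The five alternatives follow from two rules stated on the previous slides: internal entities are always parsed, and parameter entities are always parsed. A short Python sketch derives them by filtering all eight combinations of the three dimensions; the function name is_allowed is a hypothetical helper.

```python
from itertools import product


def is_allowed(parsing, location, usage):
    """Apply the two restrictions given in the entity type definitions."""
    if location == "internal" and parsing != "parsed":
        return False  # an internal entity is always a parsed entity
    if usage == "parameter" and parsing != "parsed":
        return False  # a parameter entity is always a parsed entity
    return True


combinations = [
    combo
    for combo in product(
        ("parsed", "unparsed"), ("internal", "external"), ("general", "parameter")
    )
    if is_allowed(*combo)
]

for combo in combinations:
    print(" ".join(combo))
```

Of the eight combinations five remain, and the only unparsed one is the external general entity.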
3. Entity types
INPUT FILES for XML processing:
• root entity, external subset of DTD
• other files intended for XML processing
UNPARSED ENTITIES:
• files not intended for XML processing but referred to by entity
references in the INPUT FILES
INTERNAL ENTITIES:
• name and textual content given in DTD
The XML processor passes to the application information about:
• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities
4. Entity declarations and references
EntityDecl ::= GEDecl | PEDecl
GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
PEDecl ::= '<!ENTITY' S '%' Name S PEDef S? '>'
EntityDef ::= EntityValue | ( ExternalID NDataDecl?)
PEDef ::= EntityValue | ExternalID
• EntityValue: entity definition for internal entity
• ExternalID (with optional NDataDecl): entity definition for external entity
4. Entity declarations and references
internal entity
name and value ( = literal value) given
<!ENTITY % Shape "(rect | circle | poly | default )">
<!ENTITY JY "Jyväskylän yliopisto">
• Shape, JY: name
• the quoted string: literal value
4. Entity declarations and references
external entity
name and system identifier (possibly together with public
identifier) given, for an unparsed entity also notation
<!ENTITY % HTMLsymbol
PUBLIC
"-//W3C//ENTITIES Symbols for XHTML//EN"
"xhtml-symbol.ent">
<!ENTITY % HTMLspecial
PUBLIC
"-//W3C//ENTITIES Special for XHTML//EN"
"xhtml-special.ent">
Declarations from XHTML specification:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
<!ENTITY virtuaaliyliopistouutiset SYSTEM
"http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">
4. Entity declarations and references
Unparsed entity
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
• gif: notation name
The notation must have been declared, for example:
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
4. Entity declarations and references
References to parameter entities:
%Shape;
%HTMLsymbol;
References to parsed general entities:
&JY;
&virtuaaliyliopistouutiset;
Reference to an unparsed general entity:
<poem image="image1">
The type of the attribute has to be ENTITY or ENTITIES
4. Entity declarations and references
In addition to entity references, XML documents may
contain character references.
A character reference refers to a specific Unicode character and
gives a decimal or hexadecimal representation of the
character's code point.
Example:
&#34; (decimal) or &#x22; (hexadecimal) for the quotation mark "
One-character entity defined:
<!ENTITY quot "&#34;">
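The following Python sketch, using the standard library module xml.etree.ElementTree, illustrates that the decimal and hexadecimal character references and the predefined entity quot all denote the same character. The element name q and the sample text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Decimal reference, hexadecimal reference, and the predefined entity quot.
element = ET.fromstring("<q>&#34;quoted&#x22; and &quot;</q>")
print(element.text)

# Code point of the quotation mark in decimal and in hexadecimal.
code_point = ord('"')
print(code_point, hex(code_point))
```

After parsing, the application sees only the characters themselves; the form of the reference used in the source is not preserved.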
4. Entity declarations and references
Where can an entity or character reference occur?
reference to               can occur in
parameter entity           ‣ document type definition
parsed general entity      ‣ element content
                           ‣ attribute value (either in the start-tag or
                             in the attribute definition)
                           ‣ entity value
unparsed general entity    ‣ attribute value (either in the start-tag or
                             in the attribute definition)
character                  ‣ element content
                           ‣ attribute value (either in the start-tag or
                             in the attribute definition)
                           ‣ entity value
5. XML processor treatment of entity references
References to unparsed entities
A validating processor makes the identifiers of the
entities and associated notations available to the
application.
<poem image="figure1">
<!-- From a poem of Aale Tynni -->
<line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
<line>Ihana pesä.</line>
</poem>
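How a processor hands this information to the application can be sketched with Python's standard xml.parsers.expat module. Expat is a non-validating parser, but it reports notation and unparsed entity declarations through callbacks. The sketch reuses the image1 and gif declarations from the earlier slides; the ATTLIST declaration and the function collect_unparsed are assumptions added for this illustration.

```python
import xml.parsers.expat

DOCUMENT = """<!DOCTYPE poem [
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
<!ATTLIST poem image ENTITY #IMPLIED>
]>
<poem image="image1"><line>Ihana puu.</line></poem>"""


def collect_unparsed(document):
    """Collect the notation and unparsed entity declarations reported by expat."""
    notations = {}
    entities = {}

    def on_notation(name, base, system_id, public_id):
        notations[name] = (public_id, system_id)

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        entities[name] = (system_id, notation_name)

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.Parse(document, True)
    return notations, entities


notations, entities = collect_unparsed(DOCUMENT)
print(entities)
print(sorted(notations))
```

The parser does not open the image file. The application receives the system identifier and the notation name and decides itself what to do with them.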
5. XML processor treatment of entity references
References to parsed entities
Dealing with two kinds of entity values:
literal value - the character string written between quotes in the
entity definition
replacement text - derived by replacing the character
references and parameter entity references in the literal value
by their character values and replacement texts, respectively.
The XML processor replaces the entity reference
by its replacement text.
5. XML processor treatment of entity references
entity declaration
<!ENTITY rhyme1
'<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>'>
entity reference
<rhymecollection>
&rhyme1;
</rhymecollection>
replacement text = literal value
<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>
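The replacement of a parsed entity reference can be observed with Python's standard xml.etree.ElementTree module, which expands internal general entities declared in the internal DTD subset. This is a minimal sketch, assuming the entity value is delimited by single quotes so that it can contain the double-quoted attribute value.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE rhymecollection [
<!ENTITY rhyme1
'<rhyme xml:lang="fi">
<line>Ole aina iloinen</line>
<line>niin kuin pikku varpunen</line>
</rhyme>'>
]>
<rhymecollection>
&rhyme1;
</rhymecollection>"""

root = ET.fromstring(DOCUMENT)

# After parsing, the entity reference has been replaced by its replacement
# text, so the rhyme element is an ordinary child of rhymecollection.
rhyme = root[0]
lines = [line.text for line in rhyme.iter("line")]
print(rhyme.tag)
print(lines)
```

The application cannot tell from the resulting tree whether the rhyme was written directly in the document or included through an entity reference.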
5. XML processor treatment of entity references
<!ENTITY % StyleSheet "CDATA">
<!-- style sheet data -->
<!ENTITY % Text "CDATA">
<!-- used for titles etc. -->
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED">
Declarations from XHTML specification:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
literal value of coreattrs:
id          ID             #IMPLIED
class       CDATA          #IMPLIED
style       %StyleSheet;   #IMPLIED
title       %Text;         #IMPLIED
replacement text of coreattrs:
id          ID             #IMPLIED
class       CDATA          #IMPLIED
style       CDATA          #IMPLIED
title       CDATA          #IMPLIED
5. XML processor treatment of entity references
Exercise 10 (Course Text, Chapter 5)
Entity declaration from XHTML Strict-DTD:
<!ENTITY % Block
"(%block; | form | %misc; )*">
What is the
(a) literal value
(b) replacement text
of entity Block?
(a) literal value: (%block; | form | %misc; )*
5. XML processor treatment of entity references
Other entity declarations needed from the DTD:
<!ENTITY % heading
"h1| h2| h3| h4| h5| h6">
<!ENTITY % lists
"ul | ol | dl">
<!ENTITY % blocktext
"pre | hr | blockquote | address">
<!ENTITY % block
"p | %heading; | div | %lists; | %blocktext; | fieldset | table">
<!ENTITY % misc.inline
"ins | del | script">
<!ENTITY % misc
"noscript | %misc.inline;">
Declarations from XHTML specification:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html
5. XML processor treatment of entity references
Deriving the replacement text of Block :
references to parameter entities in the literal value
(%block; | form | %misc;)* replaced by their replacement texts.
Literal value of block:
p | %heading; | div | %lists; | %blocktext; | fieldset | table
Replacement text of block:
p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote |
address | fieldset | table
Literal value of misc : noscript | %misc.inline;
Replacement text of misc : noscript | ins | del | script
Replacement text of Block :
(p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote |
address | fieldset | table | form | noscript | ins | del | script )*
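The derivation above can be reproduced with a few lines of Python. This is a sketch under simplifying assumptions: the literal values are copied into a dictionary by hand, character references and circular declarations are not handled, and replacement_text is a hypothetical helper, not a function of any XML library.

```python
import re

# Literal values of the parameter entities quoted from the XHTML Strict DTD.
DECLARATIONS = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc; )*",
}

# A parameter entity reference: '%', a name, ';'.
PE_REFERENCE = re.compile(r"%([A-Za-z_:][\w.:\-]*);")


def replacement_text(name, declarations):
    """Replace every parameter entity reference in the literal value of
    the named entity by the replacement text of the referenced entity."""
    literal = declarations[name]
    return PE_REFERENCE.sub(
        lambda match: replacement_text(match.group(1), declarations), literal
    )


print(replacement_text("misc", DECLARATIONS))
print(replacement_text("Block", DECLARATIONS))
```

The recursion mirrors the manual derivation: block and misc are expanded first, and their replacement texts are then substituted into the literal value of Block.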
6. Motivations for the use of entities
The use of entities supports:
• use of non-textual data (audio, graphics, etc.) in XML
documents (such data can also be added through stylesheets)
• modularization of documents
• consistency
• multiuse of definitions
• adding semantic information by informative entity names and
comments attached to entity declarations