COMP60411 Semi-structured Data and the Web Basic concepts, week 1 Uli Sattler

COMP60411
Semi-structured Data and the Web
Basic concepts, week 1
Uli Sattler
University of Manchester
Friday, 30 September 2011
Organisational
• COMP60411 is taught by:
1. Bijan Parsia and
2. myself, Uli Sattler
• Prerequisites: some familiarity with programming, Java
• Teaching period: Fridays of the next 5 weeks
– with demonstrators available for questions Mondays to Thursdays
• ...and we will use Blackboard for additional material and the coursework
• ...which links to the course homepage at
http://www.cs.manchester.ac.uk/pgt/COMP60411/
• Please do not hesitate to ask if you have a question!
Organisational
• Lab Demonstrators:
– Samantha Bail
– Alexandru Constantin
– Chiara Del Vescovo
– Azad Dehghan
– Rafael Goncalves
• will be introduced later today
• cover the MSc lab for your questions
• ...times will be announced on the course web site
• http://www.cs.manchester.ac.uk/pgt/COMP60411/
Organisational
• Assessment: 50% exam, 50% coursework
• Coursework and Exercises:
– 1.5 days per week plus
– reading plus
– half of week 6
Coursework and Exercises
• All work is distributed and collected through Blackboard
– always retain a copy of your work elsewhere!
– backup!
• Marks & feedback are distributed through Blackboard
• We encourage you to use Blackboard’s discussion system
– we will help with questions
• Up-to-date announcements on Twitter
– #uMan #ssd60411
• Lateness
– work is generally due one week after assignment, i.e., Fridays at 9am
– work that is late: marked 0, no late submission
– if your overall coursework mark is < 50% due to missed deadlines, you can submit some designated, additional coursework in reading week to help bring your coursework mark up to max. 50%
Coursework and Exercises
Each week, we give you 4 pieces of coursework:
• [10 marks] a couple of small, short questions, often multiple choice
– to ensure you grasp the basic concepts
• [5 marks] a short essay of ~200 - 300 words
– about an average blog post
– to make you think & practise writing (project!)
• [5 - 10 marks] a small modelling task
– to appreciate the numerous ways in which things can be done
– to get your hands dirty
• [15 - 20 marks] assignment
– a programming task
– in Java, XQuery, XSLT, etc.
➡ 40 marks per week
Marking and Your Expectation
• Remember:
– >= 70% is distinction level
– hence 90% or above will be very rare
• You will do
– 5 weeks of coursework, each
– 40 marks
– which gives you a total of 200 marks
• Hence each coursework mark is worth 0.5% of your coursework mark
• ...and your coursework mark counts 50% of your course unit mark
• So, please don’t panic if
– you get only 20 marks in the first week
– ...this is still 50% and thus at pass level
– ...to be improved in the next weeks!
Plagiarism & Academic Malpractice
• We assume that you have all by now successfully completed the Plagiarism and Malpractice Test on Blackboard
• ...if you haven’t: do so before you submit any coursework (assignment or assessment)
• ...because we work under the assumption that you know what you are doing
• ...and if you don’t, and submit coursework where you have copied incorrectly, it will cost you marks at the least, and possibly more, e.g., your MSc
Literature
To obtain more detailed information, please refer to
• W3C documents at http://www.w3.org/TR/...
• S. Abiteboul, P. Buneman, and D. Suciu: Data on the Web. Morgan
Kaufmann Publishers, 2000.
• E. R. Harold and W. S. Means: XML in a Nutshell. O’Reilly, 2004.
• ...and follow the various available web resources linked from the course
web page
• ... we assume that you
– are enthusiastic about your subject
– go and find out about stuff you don’t know yet
• no need to buy a book
• and we will have a comprehensive list on the course web site
Preliminary outline of the course
1. Introduction
• data models
• query languages
• semi-structured data & XML basics
• trees & regular expressions
• parsing & serialisation
• SAX & DOM
• DTDs as schemas
2. Extending the horizon: first query language and another schema language
• XPath
• Namespaces
• XML schema
3. Extending the horizon: another query language and types
• more XML schema -- type derivation
• XQuery, functions, types on expressions
Preliminary outline of the course (ctd)
4. Extending the horizon:
• RelaxNG, another schema language
• XSLT, another query language
• tree grammars for comparing schema languages
• error handling
5. Other concepts
• schema containment and emptiness
• keys and uniqueness constraints
• select topics, e.g., XSugar
Please note that this course is different from last year’s!
Storing & Manipulating Data: Relational Databases
• proven technology, currently storing/managing vast amounts of data in tables
– we impose a certain structure on our data
• separation between 3 levels:
– conceptual: ER diagrams, are transformed/normalised into
– logical: tables, and
– physical: implementation of tables, indices
• Data model is quite close to the logical level:
– a table is a relation, i.e., a set of tuples (i.e., unordered!),
– each column has its attribute (also unordered)
Picture from http://en.wikipedia.org/wiki/Relational_database
Storing & Manipulating Data: Relational Databases
• Query Language
– a query returns a table [callout on the slide: Datamodel!]
– operations used in query correspond to operations on relations, e.g., select, project, join
• normal forms
– methodology to achieve good behaviour/performance
• Schema Languages and Integrity Constraints
– declarative way to
  • describe meaningful entries and
  • to prevent ‘meaningless’ data entries
– updates are checked against those
• ...a lot of research into
– query optimisation
– view maintenance
– integrity constraints,
– query languages
If you don’t know…
• ...or don’t remember your UG database class, read a text book on databases
• e.g., Ullman & Widom’s “A first course in database systems”
• or don’t remember the difference between a
– set, e.g., {a,b,c}
– bag/multiset, e.g., {{a,a,b,c}}
– list, e.g., <a,c,b,a,c>
➡ ...read it up
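As a quick reminder of the difference, here is a minimal Java sketch. The class name CollectionKinds and the helper asBag are made up for this illustration; the bag is simply modelled as a map from items to their number of occurrences.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CollectionKinds {
    // Hypothetical helper: views a list as a bag by counting occurrences.
    static Map<String, Integer> asBag(List<String> items) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> l1 = Arrays.asList("a", "c", "b", "a", "c");
        List<String> l2 = Arrays.asList("a", "a", "b", "c", "c");

        // list: order and number of occurrences both matter
        System.out.println(l1.equals(l2));
        // bag/multiset: only the number of occurrences matters
        System.out.println(asBag(l1).equals(asBag(l2)));
        // set: neither order nor number of occurrences matters
        System.out.println(new HashSet<>(l1).equals(new HashSet<>(Arrays.asList("a", "b", "c"))));
    }
}
```

The two lists differ as lists, but agree when read as bags, and both collapse to the set {a,b,c}.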
Picture from: Oracle9iAS TopLink Getting Started, Release 2 (9.0.3), Part Number B10061-01
Storing & Manipulating Data: Relational Databases
• main goal:
– efficient implementation of query answering over large DBs
• pressing data into tables is a non-trivial task & might cause difficulties
– normalisation
– assume/impose regularity or
– ...think of storing people’s phone numbers/email addresses, etc.
– if structure of data changes, tables need to change...
– if you want to integrate with data from other tables, you need to find common keys
• information about the data is only in table ‘headers’
– e.g., is the 3rd cell income or expenses?
– metadata needs to be taken into account
• data integration requires a lot of handcrafting & data cleaning & ...
• used/accessed mainly by “insiders”
Alternative Database Models
• Hierarchical model: tree of records
• Network model: DAG of records
• Object-oriented model/object-relational model: linked objects with attributes
• Semi-structured model:
– OEM
– Lore
– XML
• how does this work?
• what are the underlying principles & available technologies?
• how can we use these to use XML well?
• and why would we want to use XML?
Protein data from UniProt
UniProt
• provides a web query interface to the UniProt database
• e.g., query http://www.uniprot.org/uniprot/ for ‘BRCA’
• ...biologists need to integrate, share, query, analyse, and search this data
• ...so what format is/should it be in?
• ...or what format should it be made available in to be integrated with other data?
Protein data from UniProt: an example
<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="2005-01-04" modified="2010-08-10" version="80">
<accession>Q9BX63</accession>
<accession>Q3MJE2</accession>
<accession>Q8NCI5</accession>
<name>FANCJ_HUMAN</name>
<protein>
<recommendedName ref="1">
<fullName>Fanconi anemia group J protein</fullName>
<shortName>Protein FACJ</shortName>
</recommendedName>
<alternativeName>
<fullName>ATP-dependent RNA helicase BRIP1</fullName>
</alternativeName>
<alternativeName>
<fullName>BRCA1-interacting protein C-terminal helicase 1</fullName>
<shortName>BRCA1-interacting protein 1</shortName>
</alternativeName>
<alternativeName>
<fullName>BRCA1-associated C-terminal helicase 1</fullName>
</alternativeName>
</protein>
<gene>
<name type="primary">BRIP1</name>
<name type="synonym">BACH1</name>
<name type="synonym">FANCJ</name>
</gene>
…….
<organism>
<name type="scientific">Homo sapiens</name>
<name type="common">Human</name>
<dbReference type="NCBI Taxonomy" id="9606" key="2"/>
<lineage>
<taxon>Eukaryota</taxon>
<taxon>Metazoa</taxon>
<taxon>Chordata</taxon>
<taxon>Craniata</taxon>
<taxon>Vertebrata</taxon>
<taxon>Euteleostomi</taxon>
<taxon>Mammalia</taxon>
<taxon>Eutheria</taxon>
<taxon>Euarchontoglires</taxon>
<taxon>Primates</taxon>
<taxon>Haplorrhini</taxon>
<taxon>Catarrhini</taxon>
<taxon>Hominidae</taxon>
<taxon>Homo</taxon>
</lineage>
</organism>
<reference key="3">
<citation type="journal article" date="2001" name="Cell" volume="105" first="149" last="160">
<title>BACH1, a novel helicase-like protein, interacts directly with BRCA1 and contributes to its DNA repair
function.</title>
<authorList>
<person name="Cantor S.B."/>
<person name="Bell D.W."/>
<person name="Ganesan S."/>
<person name="Kass E.M."/>
<person name="Drapkin R."/>
The Basics First: Semi-structured data
{name: {first: "Uli", last: "Sattler"},
tel: 56176,
email: "sattler@cs.man.ac.uk"}
Semi-structured data
• predates XML
• is an attempt to reconcile
– (Web) document view and
– (DB) strict structures
[callout on the slide: there is structure, but not too much structure!]
• is data organised in semantic entities, where
– similar entities are grouped together
– entities in same group may not have same attributes
• often defined as a possibly nested set of attribute-value pairs
• order of attributes is not necessarily important
– having sets or lists of telephone numbers makes a difference:
– fixing an order allows us to give meaning to rank
• not all attributes may be required
• carries its own description
The Basics First: Semi-structured data
Example (ctd):
Values can in turn be structured:
{name: {first: "Uli", last: "Sattler"},
tel: 56176,
email: "sattler@cs.man.ac.uk"}
And we can have several values for the same attribute:
{name: {first: "Uli", last: "Sattler"},
tel: 56176,
tel: 56182,
email: "sattler@cs.man.ac.uk"}
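The Basics First: Semi-structured data (SSD)
Graphical representation as a tree (where order of children doesn’t necessarily matter):
[Figure: a tree whose root has four outgoing edges labelled name, tel., tel., and email; the name edge leads to an inner node with edges first (leaf "Uli") and last (leaf "Sattler"); the tel. edges lead to the leaves 56176 and 56182; the email edge leads to the leaf "sattler@cs.man.ac.uk"]
{name: {first: "Uli", last: "Sattler"},
tel: 56176,
tel: 56182,
email: "sattler@cs.man.ac.uk"}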
The Basics First: Semi-structured data (SSD)
Graphical representation as a tree (where order of children doesn’t necessarily matter):
[Figure: the same tree as before, but with the two tel. leaves swapped: 56182 first, then 56176]
Is this the same or a different tree?
Is this the same or different data?
{name: {first: "Uli", last: "Sattler"},
tel: 56182,
tel: 56176,
email: "sattler@cs.man.ac.uk"}
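One way to explore these two questions is to model the nested attribute-value pairs in a small program and compare the two pieces of data under an ordered and an unordered reading. The following is only a sketch under assumptions: the class SsdSketch, its Pair type and the helpers obj, p, render and contact are invented for this illustration and are not part of any SSD formalism or library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SsdSketch {
    // One attribute-value pair; the value is an atom (String, Integer)
    // or a nested list of pairs.
    static final class Pair {
        final String label;
        final Object value;
        Pair(String label, Object value) { this.label = label; this.value = value; }
    }

    static List<Pair> obj(Pair... pairs) { return Arrays.asList(pairs); }
    static Pair p(String label, Object value) { return new Pair(label, value); }

    // Renders SSD as text. With sorted == true, sibling pairs are first put
    // into a canonical order, so the order of children no longer matters.
    @SuppressWarnings("unchecked")
    static String render(Object value, boolean sorted) {
        if (!(value instanceof List)) {
            return (value instanceof String) ? "\"" + value + "\"" : String.valueOf(value);
        }
        List<String> parts = new ArrayList<>();
        for (Pair pair : (List<Pair>) value) {
            parts.add(pair.label + ": " + render(pair.value, sorted));
        }
        if (sorted) Collections.sort(parts);
        return "{" + String.join(", ", parts) + "}";
    }

    // The contact data from the slides, with the two tel values in the given order.
    static List<Pair> contact(int firstTel, int secondTel) {
        return obj(p("name", obj(p("first", "Uli"), p("last", "Sattler"))),
                   p("tel", firstTel), p("tel", secondTel),
                   p("email", "sattler@cs.man.ac.uk"));
    }

    public static void main(String[] args) {
        List<Pair> a = contact(56176, 56182);
        List<Pair> b = contact(56182, 56176);
        System.out.println(render(a, false).equals(render(b, false))); // ordered reading
        System.out.println(render(a, true).equals(render(b, true)));   // unordered reading
    }
}
```

Under the ordered reading the two differ, under the unordered reading they coincide; which reading is appropriate depends on whether the rank of a telephone number is meant to carry meaning.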
The Basics First: Semi-structured data (SSD)
• In general, a piece of SSD/nested set of attribute-value pairs,
– can be represented as a graph
  • leaf nodes standing for single data items
  • inner nodes carry no label
  • edges labelled with attribute names
{name: {first: "Uli", last: "Sattler"},
tel: 56182,
tel: 56176,
email: "sattler@cs.man.ac.uk"}
[Figure: the tree for this data, with edges labelled name, tel., tel., email, first, and last, and leaves "Uli", "Sattler", 56176, 56182, and "sattler@cs.man.ac.uk"]
Semi-structured data: tuples with variations
We can easily represent nested tuples
[[[Uli, Sattler], 56176, sattler@cs.man.ac.uk],
[Bijan, 56183, 783 4672, bparsia@cs.man.ac.uk],
[Leo, 8488342, leo@gmx.com]]
as sets of attribute-value pairs
even if they have missing or duplicated pairs
...best if we know which element belongs to what
e.g., is "783 4672" Bijan’s telephone number? his email address? age?
{person:
{name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"},
person:
{name: "Bijan", tel: 56183, tel: 783 4672, email: "bparsia@cs.man.ac.uk"},
person:
{name: "Leo", tel: 8488342, email: "leo@gmx.com"}}
Semi-structured data: tuples with variations
We can easily represent nested tuples
[[[Uli, Sattler], 56176, sattler@cs.man.ac.uk],
[Bijan, 56183, 783 4672, bparsia@cs.man.ac.uk],
[Leo, 8488342, leo@gmx.com]]
as sets of attribute-value pairs
even if they have missing or duplicated pairs
...but also without knowing the role of elements:
{1:
{1: {1: "Uli", 2: "Sattler"}, 2: 56176, 3: "sattler@cs.man.ac.uk"},
2:
{1: "Bijan", 2: 56183, 3: 783 4672, 4: "bparsia@cs.man.ac.uk"},
3:
{1: "Leo", 2: 8488342, 3: "leo@gmx.com"}}
Semi-structured data: tuples with variations
{person:
{name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"},
person:
{name: "Bijan", tel: 56183, tel: 783 4672, email: "bparsia@cs.man.ac.uk"},
person:
{name: "Leo", tel: 8488342, email: "leo@gmx.com"}}
SSD
• can be serialized:
– convert SSD into a byte stream
– for transmission
• is self-describing:
– each data-item (e.g., 56176) is annotated with its description (e.g., tel:)
– space consuming, but enhances inter-operability
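SSD: representing relational data
Consider two relations:
R:
a   b   c
a1  b1  c1
a2  b2  c2
S:
c   d
c2  d2
c3  d3
c4  d4
and their tree representation:
[Figure: a tree whose root has edges labelled R and S; below R there is one row edge per tuple of R, below S one row edge per tuple of S; each row node has one edge per column, leading to a leaf with the cell value. In the attribute-value notation used so far:]
{R: {row: {a: a1, b: b1, c: c1},
     row: {a: a2, b: b2, c: c2}},
 S: {row: {c: c2, d: d2},
     row: {c: c3, d: d3},
     row: {c: c4, d: d4}}}
➔ we can represent relational data, though with an overhead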
SSD: representing object databases
• we can represent data from object-oriented DBMSs or SE as SSD
– provided we have object identifiers, e.g., &o1
– so that objects can refer to each other
Example:
{persons:
{person: &o1 {name: "John",
              age: 47,
              relatives: {child: &o2,
                          child: &o3}},
 person: &o2 {name: "Mary",
              age: 21,
              relatives: {father: &o1,
                          sister: &o3}},
 person: &o3 {name: "Paula",
              age: 23,
              relatives: {father: &o1,
                          sister: &o2}}}}
• Draw a graph representation of this piece of semi-structured data!
SSD: how to represent/store
• there have been various formalisms suggested to store semi-structured data
– e.g., Object Exchange Model (OEM, close to previous examples)
– e.g., Lore
– e.g., XML
– different mechanisms for self-describing
– different datatypes supported
– different description mechanisms for (semi) structure
  • which attributes are allowed/required where
  • which values allowed/required where
– different query languages & manipulation mechanisms
2. XML - eXtensible Markup Language
Q1-2
XML
• is a format for the representation of semi-structured data
• is not designed to specify the lay-out of documents
• alone will not solve the problem of efficiently querying (web) data: we might have to use RDBMS technology as well
A brief history of XML
• GML (Generalised Markup Language), 1960s, IBM
• SGML (Standard Generalised Markup Language), 1986 (ISO 8879):
– flexible, expressive, with DTDs
– custom tags
• HTML (Hypertext Markup Language), early 1990s:
– application of SGML
– designed for presentation of documents
– single document type, presentation-oriented tags, e.g., <h1>...</h1>
– led to the web as we know it
• XML, 1998 first edition of XML 1.0 (now 5th edition)
– a W3C standard [callout on the slide: W3C?!]
– subset/fragment of SGML
– designed
  • to be “web friendly”
  • for the exchange/sharing of data
  • to allow for the principled decentralized extension of HTML and
  • the elimination or radical reduction of errors on the web
• XHTML is an application of XML
– almost a fragment of HTML
A rough map of a small part of the acronym world
[Diagram, given here as a list of its labelled arrows; the arrow targets are reconstructed from the surrounding slides]
• HTML is an application of SGML
• XML is basically a restriction of SGML
• XHTML is an application of XML
• XHTML is basically a restriction of HTML
• DTD, XML Schema, Schematron, and RelaxNG each describe XML
• XQuery, XSLT, and XPath each query XML
• XPath is part of XQuery and part of XSLT
An XML Example
A snippet of XML describing the above Dilbert cartoon
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
<panel colour="none">
<scene> Pointy-Haired Boss and Dilbert sitting at table. </scene>
<bubbles>
<bubble>
<speaker>Dilbert</speaker>
<speech>You havenʼt given me enough resources to do my project.</speech>
</bubble>
</bubbles>
</panel>
...
</panels>
</cartoon>
What is XML?
[Note on the slide: technical terms, when used for the first time, are marked red]
• XML is a specialization of SGML
• XML is a W3C standard since 1998, see http://www.w3.org/XML/
• XML was designed to be simple, generic, and extensible
• an XML document is a piece of text that
– “contains”
  • structure
  • data
– can be associated with a tree, its DOM tree or infoset
• an XML document is divided into smaller pieces called elements (associated with nodes in tree):
– an XML document contains elements
– elements can contain elements
– with a non-ambiguous hierarchical structure amongst elements
• an XML document consists of
– some administrative information followed by
– a root element containing all other elements
[Figure: the tree for the contact data from the semi-structured data slides]
Example
And here is the full XML document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
....
</panels>
</cartoon>
[Callouts on the slide: the first two lines are the administrative information; <cartoon>...</cartoon> is the root element]
What is XML? (ctd)
The above mentioned administrative information of an XML document:
1. XML declaration, e.g., <?xml version="1.0" encoding="iso-8859-1"?>
(optional) identifies the
– XML version (1.0) and
– character encoding (iso-8859-1)
2. document type declaration (optional) references a grammar describing the document, called Document Type Definition (DTD)
– e.g. <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
– a DTD constrains the structure, content & tags of a document
– can either be local or remote
3. then we find the root element -- also called document element
4. which in turn contains other elements with possibly more elements....
XML elements
• elements are delimited by tags
• tags are enclosed in angle brackets, e.g., <panel>, </from>
• tags are case-sensitive, i.e., <FROM> is not the same as <from>
• we distinguish
– start tags: <...>, e.g., <panel>
– end tags: </...>, e.g., </from>
• a pair of matching start- and end tags delimits an element (like parentheses)
• attributes specify properties of an element
e.g., <cartoon copyright="United Feature Syndicate">
Example
And here is the full XML document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
....
</panels>
</cartoon>
[Callouts on the slide mark the attributes (copyright, year), a start tag, and an end tag]
XML Core Concepts: elements (the main concept)
<element-name attr-decl1 ... attr-decln>
element-content
</element-name>
• arbitrary number of attributes is allowed
• each attr-decli is of the form attr-name="attr-value"
• but each attr-name occurs at most once in one element
• the element-content can be
– empty
– text only: simple content
– text and one or more elements: mixed content
– one or more elements only: element content
• an empty element can be abbreviated as
<element-name attr-decl1 ... attr-decln/>
Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
....
</panels>
</cartoon>
[Callouts on the slide mark an element with simple content and an element with element content]
XML Core Concepts: Prologue -- XML declaration
More at http://www.w3.org/TR/REC-xml/
<?xml param1 param2 ...?>
Each parami is of the form
parameter-name="parameter-value"
e.g., <?xml version="1.0" encoding="US-ASCII" standalone="yes"?>
Parameters for
• the XML version used within the document
• the character encoding
• whether the document is standalone or uses external declarations
(see validity constraint for when standalone="yes" is required)
An XML document should have an XML declaration (but does not need to)
XML Core Concepts: Prologue -- doctype declaration
<!DOCTYPE element-name PUBLIC "pub-id" "f-name.dtd" |
SYSTEM "f-name.dtd" |
[dt-declarations]>
• at most one such declaration, before the root element
• element-name is the name of the root element of the document
• the optional dt-declarations is
– called internal subset
– a list of document type definitions
• the optional f-name.dtd refers to the external subset also containing document type definitions
• e.g., <!DOCTYPE html PUBLIC "http://www.abc.org/dtds/html.dtd" "http://www.abc.org/dtds/html.dtd">
• more later...
What is XML? (ctd)
• in XML, the set of tags is not fixed
– in HTML, the tag set is fixed
– <h1>, <b>, <ul>,...
• elements can be nested, to arbitrary depth
• the same element name can occur many times in a document,
– e.g., <character>
• XML itself is not a markup language, but we can specify markup languages with XML
– an XML document can contain or refer to its specification: !DOCTYPE
How to view or edit XML?
• XML is not for human consumption
– far too verbose
– in contrast to HTML, your browser won’t help: you can only do a “view source” or
– first style it (using XSLT or CSS, more later) to transform XML into HTML, then use your web browser to view it
• XML is text, so you can use your favourite editor, e.g., emacs in xml mode
• Or you can use an XML editor, e.g., XMLSpy, Stylus Studio, <oXygen/>, MyEclipse, and many more
• <oXygen/> runs on the lab machines
– it supports many features
– query languages
– schemas, etc.
– has been given to us for free
– if you want to use it at home/on your laptop, use a free 30 day trial
XML and HTML
• XML is always case sensitive, i.e., "Hello" is different from "hello"
– HTML isn’t: it uses SGML's default "ignore case"
• in XML, all tags must be present
– in HTML, some "tag omission" may be permissible (e.g., <br>)
• in XML, we have a special way to write empty tags <myname/>
– which can’t be used in HTML
• in XML, all attribute values must be quoted, e.g., <name lang="eng">...
– in SGML (and therefore in HTML) this is only required if value contains space
• in XML, attribute names cannot be omitted
– in HTML they may be omitted using shorttags
When is an XML document well-formed?
An XML document is well-formed if
1. there is exactly one root element
2. tags, <, and > are correct (incl. no unescaped < or & in character data)
3. tags are properly nested
4. attributes are unique for each tag and attribute values are quoted
5. no comments inside tags
Q3-7
This is a very weak notion of well-formedness: basically,
it only ensures that we can parse a document into a tree
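Since well-formedness is exactly what a parser needs to build a tree, one way to test it is to hand the text to a parser and see whether it complains. The following sketch uses the standard Java JAXP classes; the class name WellFormedCheck and the sample strings are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the text can be parsed into a tree, i.e., is well-formed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and otherwise stays silent.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the text
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not readable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));      // one root, properly nested
        System.out.println(isWellFormed("<a/><b/>"));         // violates 1: two root elements
        System.out.println(isWellFormed("<a>1 < 2</a>"));     // violates 2: unescaped <
        System.out.println(isWellFormed("<a><b></a></b>"));   // violates 3: improper nesting
        System.out.println(isWellFormed("<a x='1' x='2'/>")); // violates 4: attribute not unique
        System.out.println(isWellFormed("<a x=1/>"));         // violates 4: value not quoted
    }
}
```

Only the first call reports a well-formed document. Note that this check says nothing about whether the document is valid with respect to a DTD or schema.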
Trees come in different shapes!
[Figures: the attribute-value tree for the contact data (edges name, tel., tel., email, first, last), and a DOM tree with Document, Element, Text, PI, and Attribute nodes, each annotated with its nodeType, e.g., DOCUMENT_NODE, ELEMENT_NODE, TEXT_NODE]
Interlude: Abstract trees - nodes as strings!
[Figures: a tree with nodes as strings, namely ε, 0, 1, (0,0), (0,1), (0,2), and (1,0); and a tree over {A,B,C} with the same nodes, where each node additionally carries a label]
• so we can refer to nodes by names
• order matters!
– the node 0,0 is different from 0,1
• so we can distinguish
– a node from
– a node’s label
Interlude: Abstract trees - nodes as strings!
[Figure: the labelled tree over {A,B,C} from the previous slide, with nodes ε, 0, 1, (0,0), (0,1), (0,2), and (1,0)]
๏ We use ℕ for the non-negative integers (including 0)
๏ we use ℕ* for the set of all (finite) strings over ℕ
• ε is used for the empty string
• 0,1,0 is a string of length 3
• each string stands for a node
๏ An alphabet is a finite set of symbols
๏ A tree T over an alphabet Σ is a partial mapping T: ℕ* → Σ whose domain
๏ is finite
i.e., T(w) is defined for only finitely many strings w over ℕ
each tree has only finitely many nodes
๏ contains ε
i.e., T(ε) is defined
each tree has a root ε
๏ is prefix-closed
i.e., if T(w,n) is defined, then T(w) is as well
the predecessor w of a node (w,n) is in T
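The three conditions on the domain of T can be checked mechanically. Here is a minimal sketch, assuming we represent a node as a list of integers and T as a Java Map from nodes to labels; the class AbstractTree and its methods node and isTree are names invented for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AbstractTree {
    // A node is a finite string over the non-negative integers; node() is ε.
    static List<Integer> node(Integer... positions) {
        return Arrays.asList(positions);
    }

    // Checks that t is a tree over the given alphabet: its domain contains ε
    // and is prefix-closed, all positions are non-negative, and all labels are
    // in the alphabet. Finiteness holds trivially, since a Map is finite.
    static boolean isTree(Map<List<Integer>, String> t, Set<String> alphabet) {
        if (!t.containsKey(node())) return false; // no root ε
        for (Map.Entry<List<Integer>, String> entry : t.entrySet()) {
            List<Integer> w = entry.getKey();
            if (!alphabet.contains(entry.getValue())) return false;
            for (int n : w) {
                if (n < 0) return false;
            }
            // Checking the direct predecessor of every node suffices for prefix-closure.
            if (!w.isEmpty() && !t.containsKey(w.subList(0, w.size() - 1))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> sigma = new HashSet<>(Arrays.asList("A", "B", "C"));
        Map<List<Integer>, String> t = new HashMap<>();
        t.put(node(), "A");
        t.put(node(0), "B");
        t.put(node(1), "A");
        t.put(node(0, 0), "B");
        t.put(node(1, 0), "C");
        System.out.println(isTree(t, sigma)); // all conditions hold

        t.put(node(2, 5), "A");               // predecessor 2 is not in the domain
        System.out.println(isTree(t, sigma));
    }
}
```

Note that the definition as stated does not require siblings to be numbered without gaps: a node (w,1) may exist without (w,0).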
Interlude: Abstract trees - nodes as strings!
• Explanation:
– the strings in the domain of T represent T’s nodes
– (w,n) is a successor of w,
– T(w) is the label of w (as shown in picture)
– we use nodes(T) for the (finite) domain of/nodes in T
• Is the following mapping T a tree? If yes, draw the tree T!
Σ = {W, X, Y, Z}
T(ε) = X
T(0) = X
T(1) = X
T(2) = X
T(3) = Z
T(0,0) = Y
T(0,0,0) = Y
T(3,1) = Z
[Figure: the mapping drawn as a tree: root ε (X) with children 0 (X), 1 (X), 2 (X), and 3 (Z); node 0 has the child 0,0 (Y), which has the child 0,0,0 (Y); node 3 has the child 3,1 (Z)]
A Datamodel for XML documents
• An XML document is a piece of text
– it has tags, etc.
– it has no nodes, structure, successors, etc.
• having a datamodel for XML documents makes many things easier:
– talking about documents, elements, nodes, etc.
– ignoring things like whitespace issues, etc.
– implementing software that handles XML
– specifying schema languages, other formalisms around it
➡ think of relational model as basis for rel. DBMSs
• this has motivated the
– XML Information Set recommendation,
– Document Object Model (DOM), and others
• unsurprisingly, they model an XML document as a tree
Level                  | Data unit examples                 | Information or property required
cognitive              |                                    |
application            |                                    |
tree adorned with...   |                                    |
  namespace            | tree of Element, Attribute nodes   | nothing
  schema               | tree of Element, Attribute nodes   | a schema
tree                   | tree of Element, Attribute nodes   | well-formedness
token                  |                                    |
  complex              | <foo:Name t="8">Bob                |
  simple               | <foo:Name t="8">Bob                |
character              | < foo:Name t="8">Bob               | which encoding (e.g., UTF-8)
bit                    | 10011010                           |
DOM: datamodel for XML documents
• we will use the DOM tree as a datamodel: it can be viewed as an implementation of the slightly more abstract infoset
• DOM is a platform & language independent specification of an API for accessing an XML document in the form of a tree
– “DOM parser” is a parser that outputs a DOM tree
– but DOM is much more
[Figure: an XML document, i.e., text (strings), is turned by a parser, e.g., a DOM parser, into a structure behind a standard API, e.g., a DOM tree, which your application uses; a serializer leads from there back to the XML document]
Programmatic Manipulation of XML Documents
As a rule, whenever we manipulate XML documents in an application, we should use standard APIs:
[Figure: an XML document, i.e., text (strings), is turned by a parser, e.g., a DOM parser, into a structure behind a standard API, e.g., DOM, which your application uses; a serializer leads from there back to the XML document]
parser: analyses document, generates parse tree with nodes labelled with tags, text content, and attribute-value pairs
serializer: takes a (tree) data structure and generates an XML document
Parsing & Serializing XML documents
[Figure: XML document → parser → standard API → your application → serializer → XML document]
• parser:
– reads & analyses XML document
– may generate a parse tree that reflects the document’s element structure, e.g., a DOM tree
  • with nodes labelled with
    – tags,
    – text content, and
    – attributes and their values
• serializer:
– takes a data structure, e.g., some trees, linked objects, etc.
– generates an XML document
• round tripping:
– XML ➙ tree ➙ XML
– ...doesn’t have to lead to identical XML document...more later
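Round tripping can be tried out with the parser and serializer that ship with Java. The sketch below parses a small document into a DOM tree and serializes it again via an identity transformation; the class name RoundTrip and the sample document are assumptions made for this illustration, and the exact serialized text depends on the serializer implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RoundTrip {
    // Parser: text in, DOM tree out.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    // Serializer: DOM tree in, text out (an identity transformation to a string).
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize", e);
        }
    }

    public static void main(String[] args) {
        // Note the two spaces and the single quotes in the start tag.
        String original = "<mytext  content='medium'><title>Hallo!</title></mytext>";
        String roundTripped = serialize(parse(original));
        System.out.println(roundTripped);
        System.out.println(roundTripped.equals(original));                    // same text?
        System.out.println(parse(roundTripped).isEqualNode(parse(original))); // same tree?
    }
}
```

Details such as the quote style of attribute values or whitespace inside tags are not part of the tree, so the serializer is free to write them differently.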
Level                  | Data unit examples                 | Information or property required
cognitive              |                                    |
application            |                                    |
tree adorned with...   |                                    |
  namespace            | tree of Element, Attribute nodes   | nothing
  schema               | tree of Element, Attribute nodes   | a schema
tree                   | tree of Element, Attribute nodes   | well-formedness
token                  |                                    |
  complex              | <foo:Name t="8">Bob                |
  simple               | <foo:Name t="8">Bob                |
character              | < foo:Name t="8">Bob               | which encoding (e.g., UTF-8)
bit                    | 10011010                           |
[Arrows on the slide: parsing leads up the levels to the tree, serializing leads back down]
DOM trees as a datamodel for XML documents
A simple example:
<?xml version="1.0" encoding="UTF-8"?>
<mytext content="medium">
<title>Hallo!</title>
<content>Bye!</content>
</mytext>
[Figure: the DOM tree of this document, with the following nodes]
• Document: nodeType = DOCUMENT_NODE, nodeName = #document, nodeValue = (null)
– PI: nodeType = Processing Instruction
– Element: nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = (null), with links firstchild, lastchild, and attributes
  • Attribute (via attributes): nodeType = ATTRIBUTE_NODE, nodeName = content, nodeValue = medium
  • Element (via firstchild): nodeType = ELEMENT_NODE, nodeName = title, nodeValue = (null)
    – Text (via firstchild): nodeType = TEXT_NODE, nodeName = #text, nodeValue = Hallo!
  • Element (via lastchild): nodeType = ELEMENT_NODE, nodeName = content, nodeValue = (null)
    – Text (via firstchild): nodeType = TEXT_NODE, nodeName = #text, nodeValue = Bye!
DOM trees as a datamodel for XML documents
• In general, we have the following correspondence:
– XML document D → tree t(D)
– element e in D → node t(e) in t(D)
– empty element → leaf node
– root element e in D → not the root node of t(D): the root of t(D) is the document node - see previous example!
• DOM’s Node interface provides the following attributes to navigate around a node in the DOM tree:
– parentNode
– previousSibling, nextSibling
– firstChild, lastChild, childNodes
– attributes
• and also methods such as appendChild, hasAttributes, insertBefore, etc.
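To see the correspondence and the navigation attributes at work, here is a small sketch using the Java binding of DOM, where the attributes appear as getter methods such as getParentNode() and getFirstChild(). The class name DomNavigation and the helper describe are invented for this illustration; the sample document is written without whitespace between elements, so that no whitespace-only text nodes get in the way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNavigation {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    // Describes a node by the three properties shown in the DOM tree picture.
    static String describe(Node n) {
        return "nodeType=" + n.getNodeType() + " nodeName=" + n.getNodeName()
                + " nodeValue=" + n.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse("<mytext content=\"medium\"><title>Hallo!</title>"
                + "<content>Bye!</content></mytext>");
        Node root = doc.getDocumentElement();

        System.out.println(describe(doc));   // the document node is the root of the tree
        System.out.println(describe(root));  // the root element is its child
        System.out.println(root.getParentNode().isSameNode(doc));

        Node title = root.getFirstChild();
        System.out.println(title.getNextSibling().isSameNode(root.getLastChild()));
        System.out.println(describe(title.getFirstChild())); // the text node below title
        System.out.println(describe(root.getAttributes().getNamedItem("content")));
    }
}
```

The numeric node types printed here correspond to the constants of the Node interface, e.g., Node.DOCUMENT_NODE, Node.ELEMENT_NODE, Node.TEXT_NODE, and Node.ATTRIBUTE_NODE.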
DOM by example
mydocument.xml:
<mytext content="medium">
  <title>Hallo!</title>
  <body>Bye!</body>
</mytext>
A little Java example:
find the content of the 2nd child of each mytext element if its 1st child contains “Hallo!”
1. Let a parser build the DOM of mydocument.xml:
// needs: import javax.xml.parsers.*; import org.w3c.dom.*;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder myParser = factory.newDocumentBuilder();
Document parseTree = myParser.parse("mydocument.xml");
2. Retrieve all “mytext” nodes into a NodeList:
NodeList mytextNodes = parseTree.getElementsByTagName("mytext");
3. Navigate and retrieve the contents:
// Note: this assumes there are no whitespace text nodes between the
// elements; with the indentation shown above, getFirstChild() would
// return a whitespace text node rather than the title element.
String returnstring = null;
for (int i = 0; i < mytextNodes.getLength(); i++) {
  Node actmytextNode = mytextNodes.item(i);
  Node acttitleNode = actmytextNode.getFirstChild();
  String actstring = acttitleNode.getFirstChild().getNodeValue();
  if (actstring.equals("Hallo!")) {
    Node actcontentNode = acttitleNode.getNextSibling();
    returnstring = actcontentNode.getFirstChild().getNodeValue();
    break;
  }
}
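With a real parser, line breaks and indentation between elements usually become text nodes of their own, so getFirstChild() on mytext may not be the title element. A hedged sketch of the same lookup that skips non-element nodes follows; the class SecondChildFinder and the helper toElement are illustrative names, and the document is parsed from a string rather than a file to keep the sketch self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class for illustration only.
class SecondChildFinder {

    // Move forward from n (inclusive) to the next element node, or null.
    static Node toElement(Node n) {
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n;
    }

    // Text of the 2nd element child of the first <mytext> whose
    // 1st element child has exactly the text `wanted`; null if none.
    static String find(String xml, String wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList mytextNodes = doc.getElementsByTagName("mytext");
        for (int i = 0; i < mytextNodes.getLength(); i++) {
            Node first = toElement(mytextNodes.item(i).getFirstChild());
            if (first == null || !wanted.equals(first.getTextContent())) {
                continue;
            }
            Node second = toElement(first.getNextSibling());
            if (second != null) {
                return second.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mytext content=\"medium\">\n"
                   + "  <title>Hallo!</title>\n"
                   + "  <body>Bye!</body>\n"
                   + "</mytext>";
        System.out.println(find(xml, "Hallo!"));
    }
}
```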
Parsing XML
• DOM parsers parse an XML document into a DOM tree
  – this might be huge/not fit in memory
  – your application may take a few relevant bits from it and build its own data structure, so the (DOM) tree was short-lived/built in vain
[Diagram: XML document (i.e., text/strings) ↔ parser/serializer ↔ standard API, e.g. DOM ↔ your application]
• SAX parsers work very differently
  – they don’t build a tree but
  – go through the document depth first and “shout out” their findings...
SAX parser in brief
• “SAX” is short for Simple API for XML
• not a W3C standard, but “quite standard”
• there is SAX and SAX2, using different names
• originally only for Java, now supported by various languages
• can be said to be based on a parser that is
  – multi-step, i.e., parses the document step-by-step
  – push, i.e., the parser has the control, not the application
  a.k.a. event-based
• in contrast to DOM,
  – no parse tree is generated/maintained
    ➥ useful for large documents
  – it has no generic object model
    ➥ no objects are generated & trashed
SAX in brief
• how the parser (or XML reader) is in control and the application “listens”
[Diagram: the application starts the SAX parser; the parser parses the XML document and passes info to the application's event handler]
• SAX creates a series of events based on its depth-first traversal of the document
E.g.,
<?xml version="1.0" encoding="UTF-8"?>    start document
<mytext content="medium">                 start Element: mytext attribute content value medium
  <title>                                 start Element: title
    Hallo!                                characters: Hallo!
  </title>                                end Element: title
  <content>                               start Element: content
    Bye!                                  characters: Bye!
  </content>                              end Element: content
</mytext>                                 end Element: mytext
SAX in brief
• SAX parser, when started, goes through the document while “commenting” what it does
• application listens to these comments, i.e., to a list of all pieces of an XML document
  – whilst “taking notes”: when it’s gone, it’s gone!
• the primary interface is the ContentHandler interface
  – provides methods for relevant structural types in an XML document, e.g. startElement(), endElement(), characters()
• we need implementations of these methods:
  – we can use DefaultHandler
  – we can create a subclass of DefaultHandler and re-use as much of it as we see fit
• let’s see a trivial example of such an application...
  from http://www.javaworld.com/javaworld/jw-08-2000/jw-0804-sax.html?page=4
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class Example extends DefaultHandler {

  // Override methods of the DefaultHandler
  // class to gain notification of SAX Events.
  @Override
  public void startDocument() throws SAXException {
    System.out.println("SAX E.: START DOCUMENT");
  }

  @Override
  public void endDocument() throws SAXException {
    System.out.println("SAX E.: END DOCUMENT");
  }

  @Override
  public void startElement(String namespaceURI, String localName,
                           String qName, Attributes attr) throws SAXException {
    System.out.println("SAX E.: START ELEMENT[ " + localName + " ]");
    // and let's print the attributes!
    for (int i = 0; i < attr.getLength(); i++) {
      System.out.println("   ATTRIBUTE: " + attr.getLocalName(i)
          + " VALUE: " + attr.getValue(i));
    }
  }

  @Override
  public void endElement(String namespaceURI, String localName,
                         String qName) throws SAXException {
    System.out.println("SAX E.: END ELEMENT[ " + localName + " ]");
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    // The parser may call this several times for one stretch of text.
    System.out.println("SAX E.: CHARACTERS[ " + new String(ch, start, length) + " ]");
  }

  public static void main(String[] argv) {
    System.out.println("Example1 SAX Events:");
    try {
      // Create SAX 2 parser (XMLReaderFactory is deprecated since Java 9 but still available)...
      XMLReader xr = XMLReaderFactory.createXMLReader();
      // Set the ContentHandler...
      xr.setContentHandler(new Example());
      // Parse the file...
      xr.parse(new InputSource(new FileReader("myexample.xml")));
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
The green parts (highlighted on the original slide) are to be replaced with something more sensible, e.g.:
if ( localName.equals( "FirstName" ) ) {
  cust.firstName = contents.toString();
  ...
}
• when applied to
<?xml version="1.0"?>
<simple date="7/7/2000" >
  <name> Bob </name>
  <location> New York </location>
</simple>
• this program results in
Example1 SAX Events:
SAX E.: START DOCUMENT
SAX E.: START ELEMENT[ simple ]
   ATTRIBUTE: date VALUE: 7/7/2000
SAX E.: CHARACTERS[
 ]
SAX E.: START ELEMENT[ name ]
SAX E.: CHARACTERS[ Bob ]
SAX E.: END ELEMENT[ name ]
SAX E.: CHARACTERS[
 ]
SAX E.: START ELEMENT[ location ]
SAX E.: CHARACTERS[ New York ]
SAX E.: END ELEMENT[ location ]
SAX E.: CHARACTERS[
 ]
SAX E.: END ELEMENT[ simple ]
SAX E.: END DOCUMENT
SAX: some pros and cons
+ fast: we don’t need to wait until XML document is parsed before we start
doing things
+ memory efficient: the parser does not keep the parse tree in memory
+ we might create our own structure anyway, so why duplicate effort?!
– we cannot “jump around” in the document; it might be tricky to keep track of
the document’s structure
– unusual concept, so it might take some time to get used to using a SAX
parser
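One common way to cope with the “keeping track of the document’s structure” problem is to let the handler maintain a stack of currently open elements. The following is a sketch under that assumption; the class PathTracker and its members are made-up names, and it uses the JDK's SAXParserFactory. It records the path of element names above each non-whitespace piece of text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
class PathTracker extends DefaultHandler {
    // Names of the elements that are currently open, outermost first.
    private final Deque<String> open = new ArrayDeque<>();
    // "Notes" taken while listening: one entry per piece of text.
    final List<String> found = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one stretch of text in several chunks;
        // this sketch simply records each non-blank chunk separately.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            found.add("/" + String.join("/", open) + " = " + text);
        }
    }

    static List<String> run(String xml) throws Exception {
        PathTracker handler = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<simple date=\"7/7/2000\"><name> Bob </name>"
                   + "<location> New York </location></simple>";
        run(xml).forEach(System.out::println);
    }
}
```

The stack is the only state kept, so memory use grows with the nesting depth of the document rather than with its size.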
DOM and SAX -- summary
• so, if you are developing an application that needs to extract information from an XML document, you have the choice:
  1. write your own XML reader
  2. use some other XML reader
  3. use DOM
  4. use SAX
• all have pros and cons, e.g.,
  1. might be time-consuming but may result in something really efficient because it is application specific
  2. might be less time-consuming, but is it portable? supported? re-usable?
  3. relatively easy, but possibly memory-hungry
  4. a bit tricky to grasp, but memory-efficient
Self-describing?!
• XML is said to be self-describing...what does this mean?
<a123>
  <b345 b345="$%#987">Hi there!</b345>
</a123>
• ...can you understand what this is about?
• Let’s compare to CSV (comma separated values):
  – each line is a record
  – commas separate fields (and no commas in fields!)
  – each record has the same number of fields
Bijan,Parsia,2.32
Uli,Sattler,2.24
  – ...can you understand what this is about?
Self-describing?!
• One way of translating our example into XML
Bijan,Parsia,2.32
Uli,Sattler,2.24
<csvFile>
  <record>
    <field>Bijan</field>
    <field>Parsia</field>
    <field>2.32</field>
  </record>
  <record>
    <field>Uli</field>
    <field>Sattler</field>
    <field>2.24</field>
  </record>
</csvFile>
  – ...can you understand what this is about?
Self-describing?!
Name,Surname,Room
Bijan,Parsia,2.32
Uli,Sattler,2.24
• Let’s consider a self-describing CSV (ExCSV)
  – first line is header with field names
  – ...can you understand what this is about?
• We could even generically translate such CSVs into XML:
<csvFile>
  <record>
    <name>Bijan</name>
    <surname>Parsia</surname>
    <room>2.32</room>
  </record>
  <record>
    <name>Uli</name>
    <surname>Sattler</surname>
    <room>2.24</room>
  </record>
</csvFile>
or, manually, even better:
<addresses>
  <address>
    <name>Bijan</name>
    <surname>Parsia</surname>
    <room>2.32</room>
  </address>
  <address>
    <name>Uli</name>
    <surname>Sattler</surname>
    <room>2.24</room>
  </address>
</addresses>
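The generic ExCSV-to-XML translation can be sketched in a few lines of Java. This is an illustration only: the class name ExCsvToXml is made up, and it assumes (as on the CSV slide) that fields contain no commas, that every record has as many fields as the header, and that the lower-cased header names are legal XML element names.

```java
import java.util.Locale;

// Hypothetical converter for illustration only.
class ExCsvToXml {

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // First line = header with field names; every further line = one record.
    static String convert(String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] header = lines[0].split(",");
        StringBuilder xml = new StringBuilder("<csvFile>\n");
        for (int i = 1; i < lines.length; i++) {
            String[] fields = lines[i].split(",", -1);
            xml.append("  <record>\n");
            for (int j = 0; j < header.length; j++) {
                String tag = header[j].trim().toLowerCase(Locale.ROOT);
                xml.append("    <").append(tag).append('>')
                   .append(escape(fields[j]))
                   .append("</").append(tag).append(">\n");
            }
            xml.append("  </record>\n");
        }
        return xml.append("</csvFile>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("Name,Surname,Room\nBijan,Parsia,2.32\nUli,Sattler,2.24\n"));
    }
}
```

The element names record and csvFile are fixed by the converter; nothing in the ExCSV file says that a record is an address, which is why the manual translation can do better.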
Self-describing versus Guessability
• We can go a long way by guessing
  – CSV was less guessable
    • requires background knowledge
Bijan,Parsia,2.32
Uli,Sattler,2.24
  – ExCSV was more guessable
    • still some guessing
    • could read the field tags & guess intent
    • had to guess the record type
Name,Surname,Room
Bijan,Parsia,2.32
Uli,Sattler,2.24
<address>
  <name>Bijan</name>
  <surname>Parsia</surname>
  <room>2.32</room>
</address>
  – Guessability is tricky
• Is self-describing just being more or less guessable?
Self-describing
The Essence of XML (Siméon and Wadler 2003):
“From the external representation one should be able to derive the corresponding internal representation.”
• Internal: e.g., the DOM tree, our application’s interpretation of the content
• External: the XML document, i.e., text!
• Are CSV, ExCSV, XML self-describing?
• Which is more self-describing?
• Given
  1. a base format, e.g., ExCSV
  2. a/some specific document(s), e.g.,
     Name,Surname,Room
     Bijan,Parsia,2.32
     Uli,Sattler,2.24
• what data structures can we extract?
Self-describing
• Given
  1. a base format, e.g., ExCSV
  2. a/some specific document(s), e.g.,
     Name,Surname,Room
     Bijan,Parsia,2.32
     Uli,Sattler,2.24
• what data structures can we extract?
• CSV, ExCSV: tables, flat records, arrays, lists, etc.
• XML: labelled, ordered trees of unbounded depth!
• Clearly, you could parse specific CSV files into trees, but you’d need to use extra-CSV rules for that.
Schemas: why?
• SGML
  – Parsing & Validation
  – User/developer documentation
• RDBMS
  – No database without schema
  – DB schema determines tables, attributes, names, etc.
  – Query optimization, integrity, etc.
• XML
  – No schema needed at all!
  – Well-formed XML can be
    • parsed to yield data that can be
    • manipulated, queried, etc.
  – Non-well-formed XML....not so much
  – Well-formedness is a universal minimal schema
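Checking this “minimal schema” needs no schema file at all: any XML parser reports a fatal error on input that is not well-formed. A minimal sketch, assuming the JDK's built-in parser; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class WellFormedCheck {

    // True if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler ignores warnings and errors but rethrows fatal
        // errors, which is what well-formedness violations are.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<a123><b345 b345=\"$%#987\">Hi there!</b345></a123>"));
        System.out.println(isWellFormed("<a123><b345>Hi there!</a123>"));
    }
}
```

The first document is meaningless to a human but well-formed, so it parses; the second has an unclosed b345 element and is rejected.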
Schemas for XML: why?
• Well-formedness is minimal
  – any name can appear as an element or attribute name
  – any shape of content/structure of nesting is permitted
• Few applications want that…
• we’d like to rely on a format with
  – core concepts that result in
  – core (tag & attribute) names and
  – intended structure
  – intended datatypes
    e.g., string for names, integer for age
  – although you might want to keep it extensible & flexible
<addresses>
  <name>
    <address>Bijan</address>
    <surname>Parsia</surname>
    <room>2.32</room>
  </name>
  <room>
    <room><room>Uli</room> </room>
    <room>Sattler</room>
    <room>2.21</room>
  </room>
</addresses>
<addresses>
  <address>
    <name>Bijan</name>
    <surname>Parsia</surname>
  </address>
  <address>
    <name>Uli</name>
    <minit>M</minit>
    <surname>Sattler</surname>
    <room>2.21</room>
  </address>
</addresses>
Schemas for XML: why?
• A schema describes aspects of data:
  – what’s legal (what a document can contain)
  – what’s expected (what a document must contain)
  – what’s assumed (default values)
• Two modes for using a schema
  – descriptive:
    • describing documents
    • for other people
    • so that they know how to serialize
  – prescriptive:
    • prevent your application from using wrong documents
<addresses>
  <address>
    <name>Bijan</name>
    <surname>Parsia</surname>
  </address>
  <address>
    <name>Uli</name>
    <minit>M</minit>
    <surname>Sattler</surname>
    <room>2.21</room>
  </address>
</addresses>
Benefits of an (XML) schema
• Specification
  – you document/describe/publish your format
  – so that it can be used across multiple implementations
• Document for applications
  – applications can do error-checking in a format-independent way
    • checking whether an XML document conforms to a schema can be done by a generic tool,
    • which does not need to be changed when the schema changes
    • automatically!
• Optimization (à la RDBMS)
  – query answering can take schema into account to improve performance
• Extra support for authoring
  – Default values & auto-completion
  – Nicer queries (see coursework week 3)
  – see <oXygen/>
• Key questions: when are these benefits worth the hassle? Which schema language to choose (there are many!)?
Things an XML schema describes
• At least:
  – legal names
    • elements
    • attributes
    • entities, etc.
  – legal relationships between items
    • content model for elements and attributes describing what is allowed as contents
    • value of an entity
• Many XML schema languages:
  – DTDs (old but interesting)
  – W3C XML Schema (data champion)
  – RelaxNG (document champion)
  – Schematron (error handling star)
  – and many more
  These fall into three families: grammar based, OO/type based, and rule based.
DTDs: a starter!
<addresses>
  <address type="acad">
    <name>Bijan</name>
    <surname>Parsia</surname>
    <room loc="Kilb">2.32</room>
  </address>
  <address type="acad">
    <name>Uli</name>
    <surname>Sattler</surname>
    <room loc="Kilb">2.21</room>
  </address>
</addresses>
• A DTD is a collection of declarations that specify
  – which elements are allowed,
  – how elements can be nested, i.e., what child elements an element can have, incl. their
    • type, order, and (weakly) number
  – what attributes an element can have, incl. their
    • attribute name, type, number (required or optional)
  – what entities may appear, incl. their value
• DTDs are standardized in XML 1.0
  – widely implemented, but diminishingly used
DTDs: a starter!
• Derived from SGML, but simplified
• non-XML syntax
  – not parseable/manipulatable by a DOM
  – not extensible
    • you can’t ‘import’ one DTD into another one
  – can be included in XML document as internal subset
  – human readable/writable
• limited expressivity
  – limited for describing structure of documents (more later)
  – limited for describing data
    • e.g., no ‘date’, ‘real’, etc.
DTD Declarations
• To describe logical structure:
  – elements
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT person (name,address+,email*)>
    <!ELEMENT address (city,(nr,street)?)>
  – attributes
    <!ATTLIST name type (family|personal|place) "personal">
    “as attributes for elements name you can use type, and its value is either “family” or “personal” or “place”, with “personal” being the default”
    ok are:
    <name>Bijan</name>
    <name type="personal">Bijan</name>
    <name type="family">Bijan</name>
    not ok is:
    <name type="DontKnow">Bijan</name>
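The effect of this ATTLIST declaration can be observed with a validating parser. The sketch below is an illustration, assuming the JDK's built-in parser; the class AttlistDemo and the method typeOf are made-up names, and the two declarations are put into an internal subset so the example is self-contained. A validating parser fills in the default value and reports a (recoverable) validity error for a value outside the enumeration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for illustration only.
class AttlistDemo {
    static final String DOCTYPE =
        "<!DOCTYPE name [\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ATTLIST name type (family|personal|place) \"personal\">\n"
      + "]>\n";

    // Value of the type attribute as seen by the application,
    // or null if the parser reported a validity error.
    static String typeOf(String nameElement) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);               // validate against the DTD
        DocumentBuilder builder = factory.newDocumentBuilder();
        final boolean[] invalid = {false};
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                invalid[0] = true;                 // validity errors arrive here
            }
        });
        Document doc = builder.parse(new InputSource(new StringReader(DOCTYPE + nameElement)));
        return invalid[0] ? null : doc.getDocumentElement().getAttribute("type");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(typeOf("<name>Bijan</name>"));
        System.out.println(typeOf("<name type=\"family\">Bijan</name>"));
        System.out.println(typeOf("<name type=\"DontKnow\">Bijan</name>"));
    }
}
```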
Element content model in DTDs & regular expressions
• In a DTD, we have 1 element declaration per element name
  <!ELEMENT element-name (element-content)>
• and element-content is a regular expression over element names
• Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N and
  – if e1 and e2 ∈ regexp(N), then so are
    • e1,e2 (concatenation)
    • e1|e2 (choice)
    • e1* (repetition)
• In DTDs, we use
  – EMPTY instead of ε
  – ANY instead of N*
Regular expressions
• Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N and
  – if e1 and e2 ∈ regexp(N), then so are
    • e1,e2 (concatenation)
    • e1|e2 (choice)
    • e1* (repetition)
• Given a regular expression e, a string w matches e,
• if w = ε = e or w = n = e for some n in N, or
• if w = w1 w2 and e = (e1 , e2) and w1 matches e1 and w2 matches e2, or
• if e = (e1 | e2) and w matches e1 or w matches e2, or
• if w = ε and e = e1*, or
• if w = w1 w2 ... wn and e = e1* and each wi matches e1
Regular expressions
•
Hence we can use
– e+ as abbreviation for (e,e*)
– e? as abbreviation for (e|ε)
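Because DTD content models are regular expressions, matching a sequence of child names against one can be tried out with java.util.regex. The following is only a sketch: the class ContentModel and its translation trick (each element name becomes a group followed by a space, commas are dropped) are assumptions made for illustration, and EMPTY, ANY and #PCDATA are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ContentModel {

    // Translate a DTD content model such as (name,address+,email*) into a
    // java.util.regex pattern over a string in which every child name is
    // followed by one space, e.g. "name address address ".
    static Pattern toPattern(String model) {
        String regex = model
            .replaceAll("\\s+", "")                                  // drop whitespace
            .replaceAll("[A-Za-z_][A-Za-z0-9_.-]*", "(?:$0 )")       // name -> (?:name )
            .replace(",", "");                                       // ',' is plain concatenation
        return Pattern.compile(regex);
    }

    // Does the sequence of child names match the content model?
    static boolean matches(String model, String... childNames) {
        StringBuilder w = new StringBuilder();
        for (String n : childNames) {
            w.append(n).append(' ');
        }
        return toPattern(model).matcher(w).matches();
    }

    public static void main(String[] args) {
        String person = "(name,address+,email*)";
        System.out.println(matches(person, "name", "address", "address")); // true
        System.out.println(matches(person, "name"));                       // false: address+ needs at least one
        // e+ behaves like (e,e*):
        System.out.println(matches("(name,address,address*,email*)", "name", "address", "address")); // true
        String address = "(city,(nr,street)?)";
        System.out.println(matches(address, "city"));                      // true
        System.out.println(matches(address, "city", "nr"));                // false: nr needs street
    }
}
```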
DTD Declarations
•
To describe physical structure:
– entities
<!ENTITY WelshPlace
"Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch">
– notations
• can be ignored or read up
• for non-textual data
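An entity declared this way is expanded by the parser wherever the document refers to it as &WelshPlace;. A small sketch, assuming the JDK's built-in DOM parser with its default settings; the class EntityDemo and the element name trip are made up for illustration, and the declaration is placed in an internal subset to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class EntityDemo {

    // Parse a tiny document that uses the entity and return the text
    // content of its root element as the application sees it.
    static String expand() throws Exception {
        String xml =
            "<!DOCTYPE trip [\n"
          + "  <!ENTITY WelshPlace \"Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch\">\n"
          + "]>\n"
          + "<trip>We went to &WelshPlace;!</trip>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expand());
    }
}
```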
XML with external subset
1 Dilbert cartoon, in say Dilbert678.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
<panel colour="none">
<scene> Pointy-Haired Boss and Dilbert sitting
at table. </scene>
<bubbles>
<bubble>
<speaker>Dilbert</speaker>
<speech>You haven't given me enough resources to do my project.</speech>
</bubble>
</bubbles>
</panel>
...
</panels>
</cartoon>
using DTD in cartoon.dtd:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT cartoon (prolog, panels)>
<!ATTLIST cartoon copyright CDATA #REQUIRED>
<!ATTLIST cartoon year CDATA #REQUIRED>
<!ELEMENT prolog (series, author, characters)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT characters (character)*>
<!ELEMENT character (#PCDATA)>
<!ELEMENT panels (panel)+>
<!ELEMENT panel (scene, bubbles)>
<!ATTLIST panel colour CDATA #IMPLIED>
<!ELEMENT scene (#PCDATA)>
<!ELEMENT bubbles (bubble)*>
<!ELEMENT bubble (speaker, speech)*>
<!ELEMENT speaker (#PCDATA)>
<!ELEMENT speech (#PCDATA)>
XML with internal subset
1 Dilbert cartoon, in say Dilbert678.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon [
<!ELEMENT cartoon (prolog, panels)>
<!ATTLIST cartoon copyright CDATA #REQUIRED>
<!ATTLIST cartoon year CDATA #REQUIRED>
<!ELEMENT prolog (series, author, characters)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT characters (character)*>
<!ELEMENT character (#PCDATA)>
<!ELEMENT panels (panel)+>
<!ELEMENT panel (scene, bubbles)>
<!ATTLIST panel colour CDATA #IMPLIED>
<!ELEMENT scene (#PCDATA)>
<!ELEMENT bubbles (bubble)*>
<!ELEMENT bubble (speaker, speech)*>
<!ELEMENT speaker (#PCDATA)>
<!ELEMENT speech (#PCDATA)>]>
<cartoon copyright="United Feature Syndicate"
year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
<panel colour="none">
<scene> Pointy-Haired Boss and Dilbert
sitting at table. </scene>
<bubbles>
<bubble>
<speaker>Dilbert</speaker>
<speech>You haven't given me enough resources to do my project.</speech>
</bubble>
</bubbles>
</panel>
...
</panels>
</cartoon>
XML with mixed internal and external subset
1 Dilbert cartoon, in say Dilbert678.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd"
[<!ATTLIST cartoon oneMore CDATA #IMPLIED>]>
<cartoon copyright="United Feature Syndicate" year="2000">
<prolog>
<series>Dilbert</series>
<author>Scott Adams</author>
<characters>
<character>The Pointy-Haired Boss</character>
<character>Dilbert</character>
</characters>
</prolog>
<panels>
<panel colour="none">
<scene> Pointy-Haired Boss and Dilbert sitting
at table. </scene>
<bubbles>
<bubble>
<speaker>Dilbert</speaker>
<speech>You haven't given me enough resources to do my project.</speech>
</bubble>
</bubbles>
</panel>
...
</panels>
</cartoon>
using DTD in cartoon.dtd:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT cartoon (prolog, panels)>
<!ATTLIST cartoon copyright CDATA #REQUIRED>
<!ATTLIST cartoon year CDATA #REQUIRED>
<!ELEMENT prolog (series, author, characters)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT characters (character)*>
<!ELEMENT character (#PCDATA)>
<!ELEMENT panels (panel)+>
<!ELEMENT panel (scene, bubbles)>
<!ATTLIST panel colour CDATA #IMPLIED>
<!ELEMENT scene (#PCDATA)>
<!ELEMENT bubbles (bubble)*>
<!ELEMENT bubble (speaker, speech)*>
<!ELEMENT speaker (#PCDATA)>
<!ELEMENT speech (#PCDATA)>
Validity of XML documents w.r.t. DTDs
A document
• can have a doctype declaration, and this declaration specifies the
– external subset or
– internal subset or
– both.
– ...the DTD associated with a document is the union of both subsets.
• is valid w.r.t. a DTD if it satisfies all constraints in DTD
• In particular, each element X in doc
– must have an element declaration for X’s name n in DTD and
– must conform to it, i.e.,
<!ELEMENT n (element-content)>
• its child nodes must match element-content, i.e.,
– if element-content = #PCDATA,
then X must have simple (text) content
– if element-content is regular expression e,
then the sequence of X’s child nodes’ names must match e
• its attributes must conform to n’s !ATTLIST declaration(s)
Validity of XML documents w.r.t. DTDs
• A document D is valid if
  – D is associated with a DTD (internal or external or both) and
  – D is valid w.r.t. that DTD and
  – the element named in the doctype declaration is D’s root element, e.g., cartoon in
    <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
• Note: a document can be valid, and also valid w.r.t. another DTD
  – which might be more or less strict than the associated DTD
• Try <oXygen/>
  – for your coursework
  – to write XML documents and DTDs
  – it automatically checks
    • whether your document is well-formed and
    • whether your document conforms to your DTD!
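The same validity check can also be run from Java. The sketch below is an illustration under stated assumptions: it uses the JDK's built-in validating parser, the class DtdValidity and method countErrors are made-up names, and the test documents carry a small internal subset taken from the cartoon DTD so that no external file is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class DtdValidity {
    static final String DOCTYPE =
        "<!DOCTYPE bubble [\n"
      + "  <!ELEMENT bubble (speaker, speech)*>\n"
      + "  <!ELEMENT speaker (#PCDATA)>\n"
      + "  <!ELEMENT speech (#PCDATA)>\n"
      + "]>\n";

    // Number of validity errors reported for a well-formed document that
    // carries its own doctype declaration; 0 means the document is valid.
    static int countErrors(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errors = {0};
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++;   // validity violations are recoverable errors
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        // valid: children match (speaker, speech)*
        System.out.println(countErrors(DOCTYPE
            + "<bubble><speaker>Dilbert</speaker><speech>Hi</speech></bubble>"));
        // not valid: speech without speaker
        System.out.println(countErrors(DOCTYPE + "<bubble><speech>Hi</speech></bubble>"));
        // not valid: root element is not the one named in the doctype declaration
        System.out.println(countErrors(DOCTYPE + "<speaker>Dilbert</speaker>"));
    }
}
```

All three documents are well-formed, so parsing completes in each case; only the error count distinguishes the valid document from the other two.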
Pfew - Summary of today
• Semi-structured data
  – datamodel
  – XML
  – trees
• Parsing & serializing
• DOM & SAX
• Self-describing
• Why schemas
• A first schema language: DTDs
• ...see you in the labs - now - to get you started on the coursework!