COMP60411 Semi-structured Data and the Web
Basic concepts, week 1
Uli Sattler
University of Manchester
Friday, 30 September 2011

Organisational
• COMP60411 is taught by:
  1. Bijan Parsia and
  2. myself, Uli Sattler
• Prerequisites: some familiarity with programming, Java
• Teaching period: Fridays of the next 5 weeks
  – with demonstrators present to ask Mondays - Thursdays
• ...and we will use Blackboard for additional material and the coursework
• ...which links to the course homepage at http://www.cs.manchester.ac.uk/pgt/COMP60411/
• Please do not hesitate to ask if you have a question!

Organisational
• Lab Demonstrators:
  – Samantha Bail
  – Alexandru Constantin
  – Chiara Del Vescovo
  – Azad Dehghan
  – Rafael Goncalves
• will be introduced later today
• cover the MSc lab for your questions
• ...times will be announced on the course web site http://www.cs.manchester.ac.uk/pgt/COMP60411/

Organisational
• Assessment: 50% exam, 50% coursework
• Coursework and Exercises:
  – 1.5 days per week plus
  – reading plus
  – half of week 6

Coursework and Exercises
• All work is distributed and collected through Blackboard
  – always retain a copy of your work elsewhere!
  – backup!
• Marks & feedback are distributed through Blackboard
• We encourage you to use Blackboard’s discussion system
  – we will help with questions
• Up-to-date announcements on Twitter
  – #uMan #ssd60411
• Lateness
  – work is generally due one week after assignment, i.e., Fridays at 9am
  – work that is late: marked 0, no late submission
  – if your overall coursework mark < 50% due to missed deadlines,
  – you can submit some designated, additional coursework in reading week
  – to help bring your coursework mark up to max.
50%

Coursework and Exercises
Each week, we give you 4 pieces of coursework:
• [10 marks] a couple of small, short questions, often multiple choice
  – to ensure you grasp the basic concepts
• [5 marks] a short essay of ~200 - 300 words
  – about an average blog post
  – to make you think & practise writing (project!)
• [5 - 10 marks] a small modelling task
  – to appreciate the numerous ways in which things can be done
  – to get your hands dirty
• [15 - 20 marks] assignment
  – a programming task
  – in Java, XQuery, XSLT, etc.
➡ 40 marks per week

Marking and Your Expectation
• Remember:
  – >= 70% is distinction level
  – hence 90% or above will be very rare
• You will do
  – 5 weeks of coursework, each
  – 40 marks
  – which gives you a total of 200 marks
• Hence each coursework mark is worth 0.5% of your coursework mark
• ...and your coursework mark counts 50% of your course unit mark
• So, please don’t panic if
  – you get only 20 marks in the first week
  – ...this is still 50% and thus at pass level
  – ...to be improved in the next weeks!

Plagiarism & Academic Malpractice
• We assume that you have all by now successfully completed the Plagiarism and Malpractice Test on Blackboard
• ...if you haven’t: do so before you submit any coursework (assignment or assessment)
• ...because we work under the assumption that you know what you do
• ...and if you don’t, and submit coursework where you have copied incorrectly, it costs you at least marks or more, e.g., your MSc

Literature
To obtain more detailed information, please refer to
• W3C documents at http://www.w3.org/TR/...
• S. Abiteboul, P. Buneman, and D. Suciu: Data on the Web. Morgan Kaufmann Publishers, 2000.
• E. R. Harold and W. S. Means: XML in a Nutshell. O’Reilly, 2004.
• ...and follow the various available web resources linked from the course web page
• ...
we assume that you
  – are enthusiastic about your subject
  – go and find out about stuff you don’t know yet
• no need to buy a book
• and we will have a comprehensive list on the course web site

Preliminary outline of the course
1. Introduction
  • data models
  • query languages
  • semi-structured data & XML basics
  • trees & regular expressions
  • parsing & serialisation
  • SAX & DOM
  • DTDs as schemas
2. Extending the horizon: first query language and another schema language
  • XPath
  • Namespaces
  • XML Schema
3. Extending the horizon: another query language and types
  • more XML Schema -- type derivation
  • XQuery, functions, types on expressions

Preliminary outline of the course (ctd)
4. Extending the horizon:
  • RelaxNG, another schema language
  • XSLT, another query language
  • tree grammars for comparing schema languages
  • error handling
5. Other concepts
  • schema containment and emptiness
  • keys and uniqueness constraints
  • select topics, e.g., XSugar
Please note that this course is different from last year’s!

Storing & Manipulating Data: Relational Databases
• proven technology, currently storing/managing vast amounts of data in tables
  – we impose a certain structure on our data
• separation between 3 levels:
  – conceptual: ER diagrams, are transformed/normalised into
  – logical: tables, and
  – physical: implementation of tables, indices
• Data model is quite close to the logical level:
  – a table is a relation, i.e., a set of tuples (i.e., unordered!),
  – each column has its attribute (also unordered)
Picture from http://en.wikipedia.org/wiki/Relational_database

Storing & Manipulating Data: Relational Databases
• Query Language (Datamodel!)
  – a query returns a table
  – operations used in query correspond to operations on relations, e.g., select, project, join
• normal forms
  – methodology to achieve good behaviour/performance
• Schema Languages and Integrity Constraints
  – declarative way to
    • describe meaningful entries and
    • to prevent ‘meaningless’ data entries
  – updates are checked against those
• ...a lot of research into
  – query optimisation
  – view maintenance
  – integrity constraints,
  – query languages

If you don’t know…
• ...or don’t remember your UG database class, read a text book on databases
• e.g., Ullman & Widom’s “A first course in database systems”
• or don’t remember the difference between a
  – set, e.g., {a,b,c}
  – bag/multiset, e.g., {{a,a,b,c}}
  – list, e.g., <a,c,b,a,c>
➡ ...read it up

[Figure] From: Oracle9iAS TopLink Getting Started Release 2 (9.0.3) Part Number B10061-01

Storing & Manipulating Data: Relational Databases
• main goal:
  – efficient implementation of query answering over large DBs
• pressing data into tables is a non-trivial task & might cause difficulties
  – normalisation
  – assume/impose regularity or
  – ...think of storing people’s phone numbers/email addresses, etc.
  – if structure of data changes, tables need to change...
  – if you want to integrate with data from other tables, you need to find common keys
• information about the data is only in table ‘headers’
  – e.g., is the 3rd cell income or expenses?
  – metadata needs to be taken into account
• data integration requires a lot of handcrafting & data cleaning & ...
• used/accessed mainly by “insiders”

Alternative Database Models
• Hierarchical model: tree of records
• Network model: DAG of records
• Object-oriented model/object-relational model: linked objects with attributes
• Semi-structured model:
  – OEM
  – Lore
  – XML
• how does this work?
• what are the underlying principles & available technologies?
• how can we use these to use XML well?
• and why would we want to use XML?

Protein data from UniProt
UniProt
• provides a web query interface to the UniProt database
• e.g., query http://www.uniprot.org/uniprot/ for ‘BRCA’
• ...biologists need to integrate, share, query, analyse, and search this data
• ...so what format is/should it be in?
• ...or what format should it be made available in to be integrated with other data?

Protein data from UniProt: an example
<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2005-01-04" modified="2010-08-10" version="80">
    <accession>Q9BX63</accession>
    <accession>Q3MJE2</accession>
    <accession>Q8NCI5</accession>
    <name>FANCJ_HUMAN</name>
    <protein>
      <recommendedName ref="1">
        <fullName>Fanconi anemia group J protein</fullName>
        <shortName>Protein FACJ</shortName>
      </recommendedName>
      <alternativeName>
        <fullName>ATP-dependent RNA helicase BRIP1</fullName>
      </alternativeName>
      <alternativeName>
        <fullName>BRCA1-interacting protein C-terminal helicase 1</fullName>
        <shortName>BRCA1-interacting protein 1</shortName>
      </alternativeName>
      <alternativeName>
        <fullName>BRCA1-associated C-terminal helicase 1</fullName>
      </alternativeName>
    </protein>
    <gene>
      <name type="primary">BRIP1</name>
      <name type="synonym">BACH1</name>
      <name type="synonym">FANCJ</name>
    </gene>
    …….
    <organism>
      <name type="scientific">Homo sapiens</name>
      <name type="common">Human</name>
      <dbReference type="NCBI Taxonomy" id="9606" key="2"/>
      <lineage>
        <taxon>Eukaryota</taxon>
        <taxon>Metazoa</taxon>
        <taxon>Chordata</taxon>
        <taxon>Craniata</taxon>
        <taxon>Vertebrata</taxon>
        <taxon>Euteleostomi</taxon>
        <taxon>Mammalia</taxon>
        <taxon>Eutheria</taxon>
        <taxon>Euarchontoglires</taxon>
        <taxon>Primates</taxon>
        <taxon>Haplorrhini</taxon>
        <taxon>Catarrhini</taxon>
        <taxon>Hominidae</taxon>
        <taxon>Homo</taxon>
      </lineage>
    </organism>
    <reference key="3">
      <citation type="journal article" date="2001" name="Cell" volume="105" first="149" last="160">
        <title>BACH1, a novel helicase-like protein, interacts directly with BRCA1 and contributes to its DNA repair function.</title>
        <authorList>
          <person name="Cantor S.B."/>
          <person name="Bell D.W."/>
          <person name="Ganesan S."/>
          <person name="Kass E.M."/>
          <person name="Drapkin R."/>

The Basics First: Semi-structured data
{name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"}
Semi-structured data
• predates XML
• is an attempt to reconcile
  – (Web) document view and
  – (DB) strict structures
  (there is structure! ...but not too much structure!)
• is data organised in semantic entities, where
  – similar entities are grouped together
  – entities in the same group may not have the same attributes
• often defined as a possibly nested set of attribute-value pairs
• order of attributes is not necessarily important
  – having sets or lists of telephone numbers makes a difference:
  – fixing an order allows us to give meaning to rank
• not all attributes may be required
• carries its own description

The Basics First: Semi-structured data
Example (ctd):
Values can in turn be structured:
{name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"}
And we can have several values for the same attribute:
{name: {first: "Uli", last: "Sattler"}, tel: 56176, tel: 56182, email: "sattler@cs.man.ac.uk"}

The Basics First: Semi-structured data (SSD)
Graphical representation as a tree (where order of children doesn’t necessarily matter):
[Figure: tree with edges name, tel, tel, email from the root; name leads to an inner node with edges first ("Uli") and last ("Sattler"); the tel edges lead to 56176 and 56182; email leads to "sattler@cs.man.ac.uk"]
{name: {first: "Uli", last: "Sattler"}, tel: 56176, tel: 56182, email: "sattler@cs.man.ac.uk"}

The Basics First: Semi-structured data (SSD)
Graphical representation as a tree (where order of children doesn’t necessarily matter):
[Figure: the same tree with the two tel values swapped, i.e., 56182 before 56176]
Is this the same or a different tree? Is this the same or different data?
{name: {first: "Uli", last: "Sattler"}, tel: 56182, tel: 56176, email: "sattler@cs.man.ac.uk"}

The Basics First: Semi-structured data (SSD)
• In general, a piece of SSD/nested set of attribute-value pairs,
  – can be represented as a graph
    • leaf nodes standing for single data items
    • inner nodes carry no label
    • edges labelled with attribute names
{name: {first: "Uli", last: "Sattler"}, tel: 56182, tel: 56176, email: "sattler@cs.man.ac.uk"}
[Figure: tree representation of this data, as on the previous slides]
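
The question "same or different data?" for the two records with swapped tel values can be made concrete with a small sketch. It is only an illustration under assumptions: the class `SsdSketch`, the `Pair` type, and the helpers `p`, `ssd`, `sameOrdered`, and `sameUnordered` are hypothetical names, not part of any SSD standard or library. A piece of SSD is modelled as a list of attribute-value pairs whose values are atoms or nested lists, and two notions of equality are compared: one where the order of pairs matters (list) and one where it does not but duplicates still count (bag).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SsdSketch {
    /** One attribute-value pair; the value is an atom (String/Integer) or a nested List<Pair>. */
    static final class Pair {
        final String attr;
        final Object value;
        Pair(String attr, Object value) { this.attr = attr; this.value = value; }
    }

    static Pair p(String attr, Object value) { return new Pair(attr, value); }
    static List<Pair> ssd(Pair... pairs) { return Arrays.asList(pairs); }

    /** Equality when the order of pairs matters (list semantics). */
    @SuppressWarnings("unchecked")
    static boolean sameOrdered(Object a, Object b) {
        if (!(a instanceof List) || !(b instanceof List)) return a.equals(b);
        List<Pair> x = (List<Pair>) a;
        List<Pair> y = (List<Pair>) b;
        if (x.size() != y.size()) return false;
        for (int i = 0; i < x.size(); i++) {
            if (!x.get(i).attr.equals(y.get(i).attr)) return false;
            if (!sameOrdered(x.get(i).value, y.get(i).value)) return false;
        }
        return true;
    }

    /** Equality when order is ignored but duplicates count (bag semantics). */
    @SuppressWarnings("unchecked")
    static boolean sameUnordered(Object a, Object b) {
        if (!(a instanceof List) || !(b instanceof List)) return a.equals(b);
        List<Pair> x = (List<Pair>) a;
        List<Pair> rest = new ArrayList<Pair>((List<Pair>) b);
        if (x.size() != rest.size()) return false;
        for (Pair px : x) {
            boolean matched = false;
            for (int i = 0; i < rest.size(); i++) {
                Pair py = rest.get(i);
                if (px.attr.equals(py.attr) && sameUnordered(px.value, py.value)) {
                    rest.remove(i);   // each pair of b may be matched only once
                    matched = true;
                    break;
                }
            }
            if (!matched) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Object a = ssd(p("name", ssd(p("first", "Uli"), p("last", "Sattler"))),
                       p("tel", 56176), p("tel", 56182), p("email", "sattler@cs.man.ac.uk"));
        Object b = ssd(p("name", ssd(p("first", "Uli"), p("last", "Sattler"))),
                       p("tel", 56182), p("tel", 56176), p("email", "sattler@cs.man.ac.uk"));
        System.out.println("same as ordered lists:  " + sameOrdered(a, b));
        System.out.println("same as unordered bags: " + sameUnordered(a, b));
    }
}
```

Under list semantics the two records differ, under bag semantics they are the same data; which reading is intended has to be fixed by the data model, not by the notation.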

Semi-structured data: tuples with variations
We can easily represent nested tuples
[[[Uli, Sattler], 56176, sattler@cs.man.ac.uk],
 [Bijan, 56183, 783 4672, bparsia@cs.man.ac.uk],
 [Leo, 8488342, leo@gmx.com]]
as sets of attribute-value pairs, even if they have missing or duplicated pairs
...best if we know which element belongs to what
e.g., is “783 4672” Bijan’s telephone number? his email address? age?
{person: {name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"},
 person: {name: "Bijan", tel: 56183, tel: 783 4672, email: "bparsia@cs.man.ac.uk"},
 person: {name: "Leo", tel: 8488342, email: "leo@gmx.com"}}

Semi-structured data: tuples with variations
We can easily represent nested tuples
[[[Uli, Sattler], 56176, sattler@cs.man.ac.uk],
 [Bijan, 56183, 783 4672, bparsia@cs.man.ac.uk],
 [Leo, 8488342, leo@gmx.com]]
as sets of attribute-value pairs, even if they have missing or duplicated pairs
...but also without knowing role of elements:
{1: {1: {1: "Uli", 2: "Sattler"}, 2: 56176, 3: "sattler@cs.man.ac.uk"},
 2: {1: "Bijan", 2: 56183, 3: 783 4672, 4: "bparsia@cs.man.ac.uk"},
 3: {1: "Leo", 2: 8488342, 3: "leo@gmx.com"}}

Semi-structured data: tuples with variations
{person: {name: {first: "Uli", last: "Sattler"}, tel: 56176, email: "sattler@cs.man.ac.uk"},
 person: {name: "Bijan", tel: 56183, tel: 783 4672, email: "bparsia@cs.man.ac.uk"},
 person: {name: "Leo", tel: 8488342, email: "leo@gmx.com"}}
SSD
• can be serialized:
  – convert SSD into a byte stream
  – for transmission
• is self-describing:
  – each data-item (e.g., 56176) is annotated with its description (e.g., tel:)
  – space consuming, but enhances inter-operability

SSD: representing relational data
Consider two relations:
  R:  a   b   c        S:  c   d
      a1  b1  c1           c2  d2
      a2  b2  c2           c3  d3
                           c4  d4
and their tree representation:
[Figure: tree with edges R and S from the root; below each, one row edge per tuple; below each row, one edge per column (a, b, c for R; c, d for S) leading to the cell value]
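
The tree encoding of R and S can be sketched in a few lines of Java. This is a minimal illustration, not a standard serialisation: the class `RelationToSsd` and the method `toSsd` are hypothetical names, and the output simply reuses the `{attribute: value}` notation of the earlier slides, with one `row` entry per tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RelationToSsd {
    /** Renders one relation as "name: {row: {col: val, ...}, row: {...}}". */
    static String toSsd(String name, String[] columns, List<String[]> tuples) {
        List<String> rows = new ArrayList<String>();
        for (String[] tuple : tuples) {
            List<String> pairs = new ArrayList<String>();
            for (int i = 0; i < columns.length; i++) {
                // the column name is repeated for every single cell
                pairs.add(columns[i] + ": " + tuple[i]);
            }
            rows.add("row: {" + String.join(", ", pairs) + "}");
        }
        return name + ": {" + String.join(", ", rows) + "}";
    }

    public static void main(String[] args) {
        String r = toSsd("R", new String[] {"a", "b", "c"},
                Arrays.asList(new String[] {"a1", "b1", "c1"}, new String[] {"a2", "b2", "c2"}));
        String s = toSsd("S", new String[] {"c", "d"},
                Arrays.asList(new String[] {"c2", "d2"}, new String[] {"c3", "d3"},
                              new String[] {"c4", "d4"}));
        System.out.println("{" + r + ",\n " + s + "}");
    }
}
```

The sketch also makes the cost visible: in a table the column names appear once in the header, whereas here every row repeats them.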
➔ we can represent relational data, though with an overhead

SSD: representing object databases
• we can represent data from object-oriented DBMSs or SE as SSD
  – provided we have object identifiers, e.g., &o1
  – so that objects can refer to each other
Example:
{persons:
  {person: &o1 {name: "John", age: 47, relatives: {child: &o2, child: &o3}},
   person: &o2 {name: "Mary", age: 21, relatives: {father: &o1, sister: &o3}},
   person: &o3 {name: "Paula", age: 23, relatives: {father: &o1, sister: &o2}}}}
• Draw a graph representation of this piece of semi-structured data!

SSD: how to represent/store
• there have been various formalisms suggested to store semi-structured data
  – e.g., Object Exchange Model (OEM, close to previous examples)
  – e.g., Lore
  – e.g., XML
  – different mechanisms for self-describing
  – different datatypes supported
  – different description mechanisms for (semi) structure
    • which attributes are allowed/required where
    • which values allowed/required where
  – different query languages & manipulation mechanisms

2.
XML - eXtensible Markup Language
Q1-2

XML
• is a format for the representation of semi-structured data
• is not designed to specify the lay-out of documents
• alone will not solve the problem of efficiently querying (web) data: we might have to use RDBMSs technology as well

A brief history of XML
• GML (Generalised Markup Language), 60ies, IBM
• SGML (Standard Generalised Markup Language), 1985:
  – flexible, expressive, with DTDs
  – custom tags
• HTML (Hypertext Markup Language), early 1990ies:
  – application of SGML
  – designed for presentation of documents
  – single document type, presentation-oriented tags, e.g., <h1>...</h1>
  – led to the web as we know it
• XML, 1998 first edition of XML 1.0 (now 4th edition)
  – a W3C standard (W3C?!)
  – subset/fragment of SGML
  – designed
    • to be “web friendly”
    • for the exchange/sharing of data
    • to allow for the principled decentralized extension of HTML and
    • the elimination or radical reduction of errors on the web
• XHTML is an application of XML
  – almost a fragment of HTML

A rough map of a small part of the acronym world
[Figure: diagram relating the acronyms; the relations shown are]
• HTML is an application of SGML
• XML is basically a restriction of SGML
• XHTML is an application of XML
• XHTML is basically a restriction of HTML
• DTD, XML Schema, Schematron, and RelaxNG each describe XML
• XQuery, XSLT, and XPath each query XML
• XPath is part of XQuery and part of XSLT

An XML Example
A snippet of XML describing the above Dilbert cartoon
<cartoon copyright="United Feature Syndicate" year="2000">
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    <panel colour="none">
      <scene>
        Pointy-Haired Boss and Dilbert sitting at table.
      </scene>
      <bubbles>
        <bubble>
          <speaker>Dilbert</speaker>
          <speech>You haven’t given me enough resources to do my project.</speech>
        </bubble>
      </bubbles>
    </panel>
    ...
  </panels>
</cartoon>

What is XML?
(Technical terms, when used for the first time, are marked red)
• XML is a specialization of SGML
• XML is a W3C standard since 1998, see http://www.w3.org/XML/
• XML was designed to be simple, generic, and extensible
• an XML document is a piece of text that
  – “contains”
    • structure
    • data
  – can be associated with a tree, its DOM tree or infoset
• an XML document is divided into smaller pieces called elements (associated with nodes in tree):
  – an XML document contains elements
  – elements can contain elements
  – with a non-ambiguous hierarchical structure amongst elements
• an XML document consists of
  – some administrative information followed by
  – a root element containing all other elements
[Figure: the SSD tree from the earlier slides (edges name, tel, tel, email)]

Example
And here is the full XML document
<?xml version="1.0" encoding="UTF-8"?>                        ← administrative information
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">                       ← administrative information
<cartoon copyright="United Feature Syndicate" year="2000">    ← root element
  <prolog>
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    ....
  </panels>
</cartoon>

What is XML? (ctd)
The above mentioned administrative information of an XML document:
1. XML declaration, e.g., <?xml version="1.0" encoding="iso-8859-1"?> (optional)
   identifies the
   – XML version (1.0) and
   – character encoding (iso-8859-1)
2. document type declaration (optional)
   references a grammar describing the document, called Document Type Definition
   – e.g., <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
   – a DTD constrains the structure, content & tags of a document
   – can either be local or remote
3.
then we find the root element -- also called document element
4. which in turn contains other elements with possibly more elements....

XML elements
• elements are delimited by tags
• tags are enclosed in angle brackets, e.g., <panel>, </from>
• tags are case-sensitive, i.e., <FROM> is not the same as <from>
• we distinguish
  – start tags: <...>, e.g., <panel>
  – end tags: </...>, e.g., </from>
• a pair of matching start- and end tags delimits an element (like parentheses)
• attributes specify properties of an element, e.g., <cartoon copyright="United Feature Syndicate">

Example
And here is the full XML document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">    ← attributes: copyright, year
  <prolog>                                                    ← start tag
    <series>Dilbert</series>
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>           ← </character> is an end tag
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    ....
  </panels>
</cartoon>

XML Core Concepts: elements (the main concept)
<element-name attr-decl1 ... attr-decln>
  element-content
</element-name>
• arbitrary number of attributes is allowed
• each attr-decli is of the form attr-name="attr-value"
• but each attr-name occurs at most once in one element
• the element-content can be
  – empty
  – text (simple content)
  – text and one or more elements (mixed content)
  – one or more elements (element content)
• an empty element can be abbreviated as <element-name attr-decl1 ... attr-decln/>

Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cartoon SYSTEM "cartoon.dtd">
<cartoon copyright="United Feature Syndicate" year="2000">
  <prolog>
    <series>Dilbert</series>                                  ← simple content
    <author>Scott Adams</author>
    <characters>
      <character>The Pointy-Haired Boss</character>
      <character>Dilbert</character>
    </characters>
  </prolog>
  <panels>
    ....
  </panels>
</cartoon>
[callout: element content, i.e., content consisting only of elements, as in <cartoon>]

XML Core Concepts: Prologue -- XML declaration
More at http://www.w3.org/TR/REC-xml/
<?xml param1 param2 ...?>
Each parami is in the form parameter-name="parameter-value"
<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>
Parameters for
• the xml version used within document
• the character encoding
• whether document is standalone or uses external declarations (see validity constraint for when standalone="yes" is required)
An XML document should have an XML declaration (but does not need to)

XML Core Concepts: Prologue -- doctype declaration
<!DOCTYPE element-name PUBLIC "pub-id" "f-name.dtd" | SYSTEM "f-name.dtd" | [dt-declarations]>
• one such declaration, before root element
• element-name is the name of the root element of the document
• the optional dt-declarations is
  – called internal subset
  – a list of document type definitions
• the optional f-name.dtd refers to the external subset also containing document type definitions
• e.g., <!DOCTYPE html PUBLIC "http://www.abc.org/dtds/html.dtd" "http://www.abc.org/dtds/html.dtd">
• more later...

What is XML? (ctd)
• in XML, the set of tags is not fixed
  – in HTML, the tag set is fixed
  – <h1>, <b>, <ul>,...
• elements can be nested, to arbitrary depth
• the same element name can occur many times in a document,
  – e.g., <character>
• XML itself is not a markup language, but we can specify markup languages with XML
  – an XML document can contain or refer to its specification: !DOCTYPE

How to view or edit XML?
• XML is not for human consumption
  – far too verbose
  – in contrast to HTML, your browser won’t help: you can only do a “view source” or
  – first style it (using XSLT or CSS, later more) to transform XML into HTML, then use your web browser to view it
• XML is text, so you can use your favourite editor, e.g., emacs in xml mode
• Or you can use an XML editor, e.g., XMLSpy, Stylus Studio, <oXygen/>, MyEclipse, and many more
• <oXygen/> runs on the lab machines
  – it supports many features
  – query languages
  – schemas, etc.
  – has been given to us for free
  – if you want to use it at home/on your laptop, use a free 30 day trial

XML and HTML
• XML is always case sensitive, i.e., "Hello" is different from "hello"
  – HTML isn’t: it uses SGML's default "ignore case"
• in XML, all tags must be present
  – in HTML, some "tag omission" may be permissible (e.g., <br>)
• in XML, we have a special way to write empty tags <myname/>
  – which can’t be used in HTML
• in XML, all attribute values must be quoted, e.g., <name lang="eng">...
  – in SGML (and therefore in HTML) this is only required if value contains space
• in XML, attribute names cannot be omitted
  – in HTML they may be omitted using shorttags

When is an XML document well-formed?
An XML document is well-formed if
1. there is exactly one root element
2. tags, <, and > are correct (incl. no unescaped < or & in character data)
3. tags are properly nested
4. attributes are unique for each tag and attribute values are quoted
5. no comments inside tags
Q3-7
This is a very weak notion of well-formedness: basically, it only ensures that we can parse a document into a tree

Trees come in different shapes!
[Figure: two trees side by side: the SSD tree from the earlier slides (edges name, tel, tel, email) and a DOM tree with Document, Element, Text, PI, and Attribute nodes]
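
Since well-formedness essentially means "a parser can turn the text into a tree", the conditions listed above can be tried out by handing candidate documents to a standard parser. The following is a minimal sketch using the JAXP `DocumentBuilder` that ships with Java; the class name `WellFormedCheck` and the method `isWellFormed` are made up for this illustration. It does not validate against a DTD or schema, it only reports whether parsing succeeds.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true iff a (non-validating) parser can build a tree for the given text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // the parser found a well-formedness error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "<a><b/></a>",        // fine
            "<a/><b/>",           // violates 1: two root elements
            "<a>1 < 2</a>",       // violates 2: unescaped < in character data
            "<a><b></a></b>",     // violates 3: tags not properly nested
            "<a x='1' x='2'/>"    // violates 4: attribute x occurs twice
        };
        for (String xml : candidates) {
            System.out.println(isWellFormed(xml) + "  " + xml);
        }
    }
}
```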

Interlude: Abstract trees - nodes as strings!
[Figure: three drawings of the same tree shape]
• A tree
• A tree with nodes as strings: root ε; 0 and 1 below ε; 0,0 and 0,1 and 0,2 below 0; 1,0 below 1
  – so we can refer to nodes by names
  – order matters!
  – the node 0,0 is different from 0,1
• A tree over {A,B,C}: every node carries a label, e.g., node 0 is labelled A and node 0,0 is labelled B
  – so we can distinguish
    • a node from
    • a node’s label

Interlude: Abstract trees - nodes as strings!
๏ We use ℕ for the non-negative integers (including 0)
๏ we use ℕ* for the set of all (finite) strings over ℕ
  • ε is used for the empty string
  • 0,1,0 is a string of length 3
  • each string stands for a node
๏ An alphabet is a finite set of symbols
๏ A tree T over an alphabet Σ is a (partial) mapping T: ℕ* → Σ whose domain
  ๏ is finite, i.e., T(w) is defined for only finitely many strings w over ℕ (each tree has only finitely many nodes)
  ๏ contains ε, i.e., T(ε) is defined (each tree has a root ε)
  ๏ is prefix-closed, i.e., if T(w,n) is defined, then T(w) is as well (the predecessor w of a node (w,n) is in T)
[Figure: the labelled tree from the previous slide]

Interlude: Abstract trees - nodes as strings!
• Explanation:
  • the strings in the domain of T represent T’s nodes
  • (w,n) is a successor of w,
  • T(w) is the label of w (as shown in picture)
  • we use nodes(T) for the (finite) domain of/nodes in T
• Is the following mapping T a tree? If yes, draw the tree T!
  Σ = {W, X, Y, Z}
  T(ε) = X, T(0) = X, T(1) = X, T(2) = X, T(3) = Z, T(0,0) = Y, T(0,0,0) = Y, T(3,1) = Z
[Figure: the tree T: root ε (X) with successors 0 (X), 1 (X), 2 (X), 3 (Z); 0 has successor 0,0 (Y), which has successor 0,0,0 (Y); 3 has successor 3,1 (Z)]

A Datamodel for XML documents
• An XML document is a piece of text
  – it has tags, etc.
  – it has no nodes, structure, successors, etc.
• having a datamodel for XML documents makes many things easier:
  – talking about documents, elements, nodes, etc.
  – ignoring things like whitespace issues, etc.
  – implementing software that handles XML
  – specifying schema languages, other formalisms around it
  ➡ think of relational model as basis for rel. DBMSs
• this has motivated the
  – XML Information Set recommendation,
  – Document Object Model (DOM), and others
• unsurprisingly, they model an XML document as a tree

Level                                       Data unit examples                    Information or Property required
cognitive
application
tree adorned with... (namespace, schema)    Element, Attribute, ... nodes         a schema
tree                                        Element, Attribute, ... nodes         well-formedness
token (complex, simple)                     <foo:Name t="8">Bob
character                                   < foo:Name t="8">Bob                  which encoding (e.g., UTF-8)
bit                                         10011010                              nothing

DOM: datamodel for XML documents
• we will use the DOM tree as a datamodel: it can be viewed as an implementation of the slightly more abstract infoset
• DOM is a platform & language independent specification of an API for accessing an XML document in the form of a tree
  – “DOM parser” is a parser that outputs a DOM tree
  – but DOM is much more
[Figure: XML document (i.e., text, strings) → parser (e.g., DOM parser) → standard API (e.g., DOM tree) → your application → serializer → XML document]

Programmatic Manipulation of XML Documents
As a rule, whenever we manipulate XML documents in an application, we should use standard APIs:
[Figure: XML document (i.e., text, strings) → parser (e.g., DOM parser) → standard API (e.g., DOM) → your application → serializer → XML document]
parser: analyses document, generates parse tree with nodes labelled with tags, text content, and attribute-value pairs
serializer: takes a (tree) data structure and generates an XML document

Parsing & Serializing XML documents
[Figure: XML document → parser → standard API → your application → serializer → XML document]
• parser:
  – reads & analyses XML document
  – may generate parse tree that reflects document’s element structure, e.g., DOM tree
    • with nodes labelled with
      – tags,
      – text content, and
      – attributes and their values
• serializer:
  – takes a data structure, e.g., some trees, linked objects, etc.
  – generates an XML document
• round tripping:
  – XML ➙ tree ➙ XML
  – ...doesn’t have to lead to identical XML document...more later

[Table: the level table from the earlier slide (bit, character, token, tree, tree adorned with..., application, cognitive), now annotated with two arrows: parsing leads from the bit/character levels to the tree level, serializing from the tree level back to characters/bits]

DOM trees as a datamodel for XML documents
A simple example:
<?xml version="1.0" encoding="UTF-8"?>
<mytext content="medium">
  <title>Hallo!</title>
  <content>Bye!</content>
</mytext>
[Figure: DOM tree of this document]
Document: nodeType = DOCUMENT_NODE, nodeName = #document, nodeValue = (null)
  Element: nodeType = ELEMENT_NODE, nodeName = mytext, nodeValue = (null); with firstChild, lastChild, attributes
    Attribute: nodeType = ATTRIBUTE_NODE, nodeName = content, nodeValue = medium
    Element: nodeType = ELEMENT_NODE, nodeName = title, nodeValue = (null); with firstChild
      Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Hallo!
    Element: nodeType = ELEMENT_NODE, nodeName = content, nodeValue = (null); with firstChild
      Text: nodeType = TEXT_NODE, nodeName = #text, nodeValue = Bye!
(the figure also shows a PI box, nodeType = Processing Instruction)
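
The node boxes of the simple mytext example can be reproduced with a few lines of Java on top of the standard `org.w3c.dom` API. This is a sketch under assumptions: the class `DomDump` and its methods are hypothetical names, and the document is written without whitespace between the tags, because otherwise the parser would add whitespace-only Text nodes that the figure leaves out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomDump {
    // the example document, deliberately without whitespace between tags
    static final String XML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<mytext content=\"medium\"><title>Hallo!</title><content>Bye!</content></mytext>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "DOCUMENT_NODE";
            case Node.ELEMENT_NODE:  return "ELEMENT_NODE";
            case Node.TEXT_NODE:     return "TEXT_NODE";
            default:                 return "OTHER";
        }
    }

    /** Adds one line per node: indentation, node type, nodeName = nodeValue. */
    static void dump(Node node, String indent, List<String> out) {
        out.add(indent + typeName(node) + " " + node.getNodeName() + " = " + node.getNodeValue());
        NamedNodeMap attrs = node.getAttributes();   // null unless node is an Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.add(indent + "  ATTRIBUTE_NODE " + a.getNodeName() + " = " + a.getNodeValue());
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        dump(parse(XML), "", lines);
        for (String line : lines) System.out.println(line);
    }
}
```

The listing starts at the document node, not at the root element, and attributes are reached through `getAttributes()` rather than through the child axis.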

DOM trees as a datamodel for XML documents
• In general, we have the following correspondence:
  – XML document D → tree t(D)
  – element e in D → node t(e) in t(D)
  – empty element → leaf node
  – root element e in D → not the root node of t(D): the root is the document node - see previous example!
• DOM’s Node interface provides the following attributes to navigate around a node in the DOM tree:
  parentNode, previousSibling, nextSibling, childNodes, firstChild, lastChild, attributes
• and also methods such as appendChild, hasAttributes, insertBefore, etc.

DOM by example
mydocument.xml:
<mytext content="medium">
  <title>Hallo!</title>
  <body>Bye!</body>
</mytext>
A little Java example: find the content of the 2nd child of a mytext element if its 1st child is “Hallo!”
(needs imports: javax.xml.parsers.DocumentBuilder, javax.xml.parsers.DocumentBuilderFactory, org.w3c.dom.Document, org.w3c.dom.Node, org.w3c.dom.NodeList)
1. let a parser build the DOM of mydocument.xml
   DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
   DocumentBuilder myParser = factory.newDocumentBuilder();
   Document parseTree = myParser.parse("mydocument.xml");
2. Retrieve all “mytext” nodes into a NodeList:
   NodeList mytextNodes = parseTree.getElementsByTagName("mytext");
3. Navigate and retrieve all contents:
   // note: this navigation assumes that the file has no whitespace between its tags;
   // otherwise getFirstChild() returns a whitespace Text node instead of <title>
   String returnstring = null;
   for (int i = 0; i < mytextNodes.getLength(); i++) {
     Node actmytextNode = mytextNodes.item(i);
     Node acttitleNode = actmytextNode.getFirstChild();
     String actstring = acttitleNode.getFirstChild().getNodeValue();
     if (actstring.equals("Hallo!")) {
       Node actcontentNode = acttitleNode.getNextSibling();
       returnstring = actcontentNode.getFirstChild().getNodeValue();
       break;
     }
   }

Parsing XML
• DOM parsers parse an XML document into a DOM tree
  – this might be huge/not fit in memory
  – your application may take a few relevant bits from it and build its own data structure, so the (DOM) tree was short-lived/built in vain
[Figure: your XML document (i.e., text, strings) → parser → standard API, e.g., DOM → application → serializer]
DOM application SAX parsers work very differently – they don’t build a tree but – go through document depth first and “shout out” their findings... 65 Friday, 30 September 2011 65 SAX parser in brief • • • • • “SAX” is short for Simple API for XML not a W3C standard, but “quite standard” there is SAX and SAX2, using different names originally only for Java, now supported by various languages can be said to be based on a parser that is – multi-step, i.e., parses the document step-by-step – push, i.e., the parser has the control, not the application a.k.a. event-based • in contrast to DOM, – no parse tree is generated/maintained ➥ useful for large documents – it has no generic object model ➥ no objects are generated & trashed 66 Friday, 30 September 2011 66 SAX in brief • how the parser (or XML reader) is in control and the application “listens” XML document • • info SAX parse parser start event handler application SAX creates a series of events based on its depth-first traversal of document E.g., <?xml version="1.0" encoding="UTF-8"?> <mytext content=“medium”> " " <title> " " " " Hallo! " " </title> " " " <content> " " " " Bye! " " </content> " " </mytext> Friday, 30 September 2011 start document start Element: mytext attribute content value medium start Element: title characters: Hallo! end Element: title start Element: content characters: Bye! end Element: content end Element: mytext " " 67 67 SAX in brief • • • • • SAX parser, when started, goes through document while “commenting” what it does application listens to these comments, i.e., to list of all pieces of an XML document – whilst “taking notes”: when it’s gone, it’s gone! the primary interface is the ContentHandler interface – provides methods for relevant structural types in an XML document, e.g. 
    startElement(), endElement(), characters()
• we need implementations of these methods:
  – we can use DefaultHandler
  – we can create a subclass of DefaultHandler and re-use as much of it as we see fit
• let's see a trivial example of such an application...
  from http://www.javaworld.com/javaworld/jw-08-2000/jw-0804-sax.html?page=4

    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import java.io.*;

    public class Example extends DefaultHandler {

        // Override methods of the DefaultHandler
        // class to gain notification of SAX Events.
        public void startDocument() throws SAXException {
            System.out.println("SAX E.: START DOCUMENT");
        }

        public void endDocument() throws SAXException {
            System.out.println("SAX E.: END DOCUMENT");
        }

        public void startElement(String namespaceURI, String localName,
                                 String qName, Attributes attr) throws SAXException {
            System.out.println("SAX E.: START ELEMENT[ " + localName + " ]");
            // and let's print the attributes!
            for (int i = 0; i < attr.getLength(); i++) {
                System.out.println(" ATTRIBUTE: " + attr.getLocalName(i)
                        + " VALUE: " + attr.getValue(i));
            }
        }

        public void endElement(String namespaceURI, String localName,
                               String qName) throws SAXException {
            System.out.println("SAX E.: END ELEMENT[ " + localName + " ]");
        }

        public void characters(char[] ch, int start, int length) throws SAXException {
            System.out.print("SAX E.: CHARACTERS[ ");
            try {
                OutputStreamWriter outw = new OutputStreamWriter(System.out);
                outw.write(ch, start, length);
                outw.flush();
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println(" ]");
        }

        public static void main(String[] argv) {
            System.out.println("Example1 SAX Events:");
            try {
                // Create SAX 2 parser...
                XMLReader xr = XMLReaderFactory.createXMLReader();
                // Set the ContentHandler...
                xr.setContentHandler(new Example());
                // Parse the file...
                xr.parse(new InputSource(new FileReader("myexample.xml")));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

The green parts (highlighted on the original slide) are to be replaced with something more sensible, e.g.:
    if ( localName.equals( "FirstName" ) ) { cust.firstName = contents.toString(); ... }

• when applied to
    <?xml version="1.0"?>
    <simple date="7/7/2000" >
      <name> Bob </name>
      <location> New York </location>
    </simple>
• this program results in
    Example1 SAX Events:
    SAX E.: START DOCUMENT
    SAX E.: START ELEMENT[ simple ]
     ATTRIBUTE: date VALUE: 7/7/2000
    SAX E.: CHARACTERS[ ]
    SAX E.: START ELEMENT[ name ]
    SAX E.: CHARACTERS[ Bob ]
    SAX E.: END ELEMENT[ name ]
    SAX E.: CHARACTERS[ ]
    SAX E.: START ELEMENT[ location ]
    SAX E.: CHARACTERS[ New York ]
    SAX E.: END ELEMENT[ location ]
    SAX E.: CHARACTERS[ ]
    SAX E.: END ELEMENT[ simple ]
    SAX E.: END DOCUMENT

SAX: some pros and cons
+ fast: we don't need to wait until the XML document is parsed before we start doing things
+ memory efficient: the parser does not keep the parse tree in memory
+ we might create our own structure anyway, so why duplicate effort?!
– we cannot "jump around" in the document; it might be tricky to keep track of the document's structure
– unusual concept, so it might take some time to get used to using a SAX parser

DOM and SAX -- summary
• so, if you are developing an application that needs to extract information from an XML document, you have the choice:
  1. write your own XML reader
  2. use some other XML reader
  3. use DOM
  4. use SAX
• all have pros and cons, e.g.,
  1. might be time-consuming but may result in something really efficient because it is application specific
  2. might be less time-consuming, but is it portable? supported? re-usable?
  3. relatively easy, but possibly memory-hungry
  4. a bit tricky to grasp, but memory-efficient

Self-describing?!
• XML is said to be self-describing...what does this mean?
    <a123>
      <b345 b345="$%#987">Hi there!</b345>
    </a123>
• ...can you understand what this is about?
• Let's compare to CSV (comma separated values):
  – each line is a record
  – commas separate fields (and no commas in fields!)
  – each record has the same number of fields
        Bijan,Parsia,2.32
        Uli,Sattler,2.24
  – ...can you understand what this is about?

Self-describing?!
• One way of translating our example into XML
  – ...can you understand what this is about?
    Bijan,Parsia,2.32
    Uli,Sattler,2.24

    <csvFile>
      <record>
        <field>Bijan</field>
        <field>Parsia</field>
        <field>2.32</field>
      </record>
      <record>
        <field>Uli</field>
        <field>Sattler</field>
        <field>2.24</field>
      </record>
    </csvFile>

Self-describing?!
• Let's consider a self-describing CSV (ExCSV)
  – first line is header with field names
  – ...can you understand what this is about?
    Name,Surname,Room
    Bijan,Parsia,2.32
    Uli,Sattler,2.24
• We could even generically translate such CSVs into XML:
    <csvFile>
      <record>
        <name>Bijan</name>
        <surname>Parsia</surname>
        <room>2.32</room>
      </record>
      <record>
        <name>Uli</name>
        <surname>Sattler</surname>
        <room>2.24</room>
      </record>
    </csvFile>
  or, manually, even better:
    <addresses>
      <address>
        <name>Bijan</name>
        <surname>Parsia</surname>
        <room>2.32</room>
      </address>
      <address>
        <name>Uli</name>
        <surname>Sattler</surname>
        <room>2.24</room>
      </address>
    </addresses>

Self-describing versus Guessability
• We can go a long way by guessing
  – CSV was less guessable
    • requires background knowledge
        Bijan,Parsia,2.32
        Uli,Sattler,2.24
  – ExCSV was more guessable
    • still some guessing
    • could read the field tags & guess intent
    • had to guess the record type
        Name,Surname,Room
        Bijan,Parsia,2.32
        Uli,Sattler,2.24

        <address>
          <name>Bijan</name>
          <surname>Parsia</surname>
          <room>2.32</room>
        </address>
  – Guessability is tricky
• Is self-describing just being more or less guessable?
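The generic translation of ExCSV into XML mentioned above can be sketched in a few lines of Java. This is only a minimal sketch under stated assumptions: the class name ExCsvToXml and its methods are made up for illustration, the header names are assumed to be valid XML names once lower-cased, and fields are assumed to contain no commas (as required for CSV above). It builds a DOM tree and then serializes it, so it also illustrates the parser/serializer distinction from earlier.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ExCsvToXml {

    // Builds a DOM tree from ExCSV text: the first line is the header,
    // every further line becomes one <record> element whose children
    // are named after the header fields.
    public static Document toDom(String exCsv) throws Exception {
        String[] lines = exCsv.trim().split("\\R");
        String[] header = lines[0].split(",");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("csvFile");
        doc.appendChild(root);
        for (int i = 1; i < lines.length; i++) {
            String[] fields = lines[i].split(",");
            Element record = doc.createElement("record");
            root.appendChild(record);
            for (int j = 0; j < header.length && j < fields.length; j++) {
                // Assumption: the lower-cased header name is a valid XML name.
                Element field = doc.createElement(header[j].trim().toLowerCase());
                field.setTextContent(fields[j].trim());
                record.appendChild(field);
            }
        }
        return doc;
    }

    // Serializes the DOM tree back into text (without XML declaration).
    public static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String exCsv = "Name,Surname,Room\nBijan,Parsia,2.32\nUli,Sattler,2.24\n";
        System.out.println(serialize(toDom(exCsv)));
    }
}
```

Note that the record type (address) cannot be produced by such a generic translation: it is not in the ExCSV file, which is exactly the part we had to guess.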
Self-describing
The Essence of XML (Siméon and Wadler 2003):
"From the external representation one should be able to derive the corresponding internal representation."
• Internal: e.g., the DOM tree, our application's interpretation of the content
• External: the XML document, i.e., text!
• Are CSV, ExCSV, XML self-describing?
• Which is more self-describing?
• Given
  1. a base format, e.g., ExCSV
  2. a/some specific document(s), e.g.,
         Name,Surname,Room
         Bijan,Parsia,2.32
         Uli,Sattler,2.24
• what data structures can we extract?

Self-describing
• Given
  1. a base format, e.g., ExCSV
  2. a/some specific document(s), e.g.,
         Name,Surname,Room
         Bijan,Parsia,2.32
         Uli,Sattler,2.24
• what data structures can we extract?
  – CSV, ExCSV: tables, flat records, arrays, lists, etc.
  – XML: labelled, ordered trees of unbounded depth!
• Clearly, you could parse specific CSV files into trees, but you'd need to use extra-CSV rules for that.

Schemas: why?
• SGML
  – Parsing & Validation
  – User/developer documentation
• RDBMS
  – No database without schema
  – DB schema determines tables, attributes, names, etc.
  – Query optimization, integrity, etc.
• XML
  – No schema needed at all!
  – Well-formed XML can be
    • parsed to yield data that can be
    • manipulated, queried, etc.
  – Non-well-formed XML...not so much
  – Well-formedness is a universal minimal schema

Schemas for XML: why?
• Well-formedness is minimal
  – any name can appear as an element or attribute name
  – any shape of content/structure of nesting is permitted
• Few applications want that…
• we'd like to rely on a format with
  – core concepts that result in
  – core (tag & attribute) names and
  – intended structure
  – intended datatypes, e.g., string for names, integer for age
  – although you might want to keep it extensible & flexible

    <addresses>
      <name>
        <address>Bijan</address>
        <surname>Parsia</surname>
        <room>2.32</room>
      </name>
      <room>
        <room><room>Uli</room></room>
        <room>Sattler</room>
        <room>2.21</room>
      </room>
    </addresses>

    <addresses>
      <address>
        <name>Bijan</name>
        <surname>Parsia</surname>
      </address>
      <address>
        <name>Uli</name>
        <minit>M</minit>
        <surname>Sattler</surname>
        <room>2.21</room>
      </address>
    </addresses>

Schemas for XML: why?
• A schema describes aspects of data:
  – what's legal (what a document can contain)
  – what's expected (what a document must contain)
  – what's assumed (default values)
• Two modes for using a schema
  – descriptive:
    • describing documents
    • for other people
    • so that they know how to serialize
  – prescriptive:
    • prevent your application from using wrong documents

    <addresses>
      <address>
        <name>Bijan</name>
        <surname>Parsia</surname>
      </address>
      <address>
        <name>Uli</name>
        <minit>M</minit>
        <surname>Sattler</surname>
        <room>2.21</room>
      </address>
    </addresses>

Benefits of an (XML) schema
• Specification
  – you document/describe/publish your format
  – so that it can be used across multiple implementations
• Document for applications
  – applications can do error-checking in a format-independent way
    • checking whether an XML document conforms to a schema can be done by a generic tool,
    • which does not need to be changed when the schema changes
    • automatically!
• Optimization (à la RDBMS)
  – query answering can take schema into account to improve performance
• Extra support for authoring
  – Default values & auto-completion
  – Nicer queries (see coursework week 3)
  – see <oXygen/>
• Key questions: when are these benefits worth the hassle? Which schema language to choose (there are many!)?

Things an XML schema describes
• At least:
  – legal names
    • elements
    • attributes
    • entities, etc.
  – legal relationships between items
    • content model for elements and attributes describing what is allowed as contents
    • value of an entity
• Many XML schema languages:
  – DTDs (old but interesting)
  – W3C XML Schema (data champion)
  – RelaxNG (document champion)
  – Schematron (error handling star)
  – and many more
  (kinds: grammar based, OO/type based, rule based)

DTDs: a starter!
    <addresses>
      <address type="acad">
        <name>Bijan</name>
        <surname>Parsia</surname>
        <room loc="Kilb">2.32</room>
      </address>
      <address type="acad">
        <name>Uli</name>
        <surname>Sattler</surname>
        <room loc="Kilb">2.21</room>
      </address>
    </addresses>
• A DTD is a collection of declarations that specify
  – which elements are allowed,
  – how elements can be nested, i.e., what child elements an element can have, incl. their
    • type, order, and (weakly) number
  – what attributes an element can have, incl. their
    • attribute name, type, number (required or optional)
  – what entities may appear, incl. their value
• DTDs are standardized in XML 1.0
  – widely implemented, but diminishingly used

DTDs: a starter!
• Derived from SGML, but simplified
• non-XML syntax
  – not parseable/manipulatable by a DOM
  – not extensible
    • you can't 'import' one DTD into another one
  – can be included in XML document as internal subset
  – human readable/writable
• limited expressivity
  – limited for describing structure of documents (more later)
  – limited for describing data
    • e.g., no 'date', 'real', etc.
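To see a DTD at work on the addresses example, here is a minimal sketch using the validating parser that ships with Java. The DTD itself is hypothetical (written for this illustration, not taken from the course material), it is attached as an internal subset, and the class name DtdCheck is made up. The sketch only collects the validity errors that the parser reports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdCheck {

    // Hypothetical DTD for the addresses example, given as an internal subset.
    public static final String DTD =
          "<!DOCTYPE addresses [\n"
        + "  <!ELEMENT addresses (address*)>\n"
        + "  <!ELEMENT address (name, surname, room)>\n"
        + "  <!ATTLIST address type CDATA #IMPLIED>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT room (#PCDATA)>\n"
        + "  <!ATTLIST room loc CDATA #IMPLIED>\n"
        + "]>\n";

    // Parses the document with DTD validation switched on and returns
    // the validity errors reported by the parser (empty list = valid).
    public static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String ok = DTD + "<addresses><address type=\"acad\"><name>Bijan</name>"
                  + "<surname>Parsia</surname><room loc=\"Kilb\">2.32</room></address></addresses>";
        // The second document leaves out <surname>, which the DTD requires.
        String notOk = DTD + "<addresses><address><name>Uli</name>"
                     + "<room>2.21</room></address></addresses>";
        System.out.println(validate(ok));
        System.out.println(validate(notOk));
    }
}
```

The first call should report no errors, the second at least one; the wording of the error messages depends on the parser. Note that the check itself is generic: only the DTD is specific to addresses.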
DTD Declarations
• To describe logical structure:
  – elements
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT person (name,address+,email*)>
        <!ELEMENT address (city,(nr,street)?)>
  – attributes
        <!ATTLIST name type (family|personal|place) "personal">
    "as an attribute for name elements you can use type, and its value is either "family" or "personal" or "place", with "personal" being the default"
    ok are:
        <name>Bijan</name>
        <name type="personal">Bijan</name>
        <name type="family">Bijan</name>
    not ok is:
        <name type="DontKnow">Bijan</name>

Element content model in DTDs & regular expressions
• In a DTD, we have 1 element declaration per element name
        <!ELEMENT element-name (element-content)>
• and element-content is a regular expression over element names
• Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N and
  – if e1 and e2 ∈ regexp(N), then so are
    • e1,e2 (concatenation)
    • e1|e2 (choice)
    • e1* (repetition)
• In DTDs, we use
  – EMPTY instead of ε
  – ANY instead of N*

Regular expressions
• Given a regular expression e ∈ regexp(N), defined as above, a string w matches e
  – if w = ε = e, or w = n = e for some n in N, or
  – if w = w1 w2 and e = (e1,e2) and w1 matches e1 and w2 matches e2, or
  – if e = (e1|e2) and w matches e1 or w matches e2, or
  – if w = ε and e = e1*, or
  – if w = w1 w2 ... wn and e = e1* and each wi matches e1

Regular expressions
• Hence we can use
  – e+ as abbreviation for (e,e*)
  – e? as abbreviation for (e|ε)

DTD Declarations
• To describe physical structure:
  – entities
        <!ENTITY WelshPlace "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch">
  – notations
    • can be ignored or read up
    • for non-textual data

XML with external subset
One Dilbert cartoon, in say Dilbert678.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
    <cartoon copyright="United Feature Syndicate" year="2000">
      <prolog>
        <series>Dilbert</series>
        <author>Scott Adams</author>
        <characters>
          <character>The Pointy-Haired Boss</character>
          <character>Dilbert</character>
        </characters>
      </prolog>
      <panels>
        <panel colour="none">
          <scene>
            Pointy-Haired Boss and Dilbert sitting at table.
          </scene>
          <bubbles>
            <bubble>
              <speaker>Dilbert</speaker>
              <speech>You haven't given me enough resources to do my project.</speech>
            </bubble>
          </bubbles>
        </panel>
        ...
      </panels>
    </cartoon>

using DTD in cartoon.dtd:
    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT cartoon (prolog, panels)>
    <!ATTLIST cartoon copyright CDATA #REQUIRED>
    <!ATTLIST cartoon year CDATA #REQUIRED>
    <!ELEMENT prolog (series, author, characters)>
    <!ELEMENT series (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT characters (character)*>
    <!ELEMENT character (#PCDATA)>
    <!ELEMENT panels (panel)+>
    <!ELEMENT panel (scene, bubbles)>
    <!ATTLIST panel colour CDATA #IMPLIED>
    <!ELEMENT scene (#PCDATA)>
    <!ELEMENT bubbles (bubble)*>
    <!ELEMENT bubble (speaker, speech)*>
    <!ELEMENT speaker (#PCDATA)>
    <!ELEMENT speech (#PCDATA)>

XML with internal subset
One Dilbert cartoon, in say Dilbert678.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE cartoon [
      <!ELEMENT cartoon (prolog, panels)>
      <!ATTLIST cartoon copyright CDATA #REQUIRED>
      <!ATTLIST cartoon year CDATA #REQUIRED>
      <!ELEMENT prolog (series, author, characters)>
      <!ELEMENT series (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT characters (character)*>
      <!ELEMENT character (#PCDATA)>
      <!ELEMENT panels (panel)+>
      <!ELEMENT panel (scene, bubbles)>
      <!ATTLIST panel colour CDATA #IMPLIED>
      <!ELEMENT scene (#PCDATA)>
      <!ELEMENT bubbles (bubble)*>
      <!ELEMENT bubble (speaker, speech)*>
      <!ELEMENT speaker (#PCDATA)>
      <!ELEMENT speech (#PCDATA)>
    ]>
    <cartoon copyright="United Feature Syndicate" year="2000">
      <prolog>
        <series>Dilbert</series>
        <author>Scott Adams</author>
        <characters>
          <character>The Pointy-Haired Boss</character>
          <character>Dilbert</character>
        </characters>
      </prolog>
      <panels>
        <panel colour="none">
          <scene>
            Pointy-Haired Boss and Dilbert sitting at table.
          </scene>
          <bubbles>
            <bubble>
              <speaker>Dilbert</speaker>
              <speech>You haven't given me enough resources to do my project.</speech>
            </bubble>
          </bubbles>
        </panel>
        ...
      </panels>
    </cartoon>

XML with mixed internal and external subset
One Dilbert cartoon, in say Dilbert678.xml:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE cartoon SYSTEM "cartoon.dtd" [
      <!ATTLIST cartoon oneMore CDATA #IMPLIED>
    ]>
    <cartoon copyright="United Feature Syndicate" year="2000">
      <prolog>
        <series>Dilbert</series>
        <author>Scott Adams</author>
        <characters>
          <character>The Pointy-Haired Boss</character>
          <character>Dilbert</character>
        </characters>
      </prolog>
      <panels>
        <panel colour="none">
          <scene>
            Pointy-Haired Boss and Dilbert sitting at table.
          </scene>
          <bubbles>
            <bubble>
              <speaker>Dilbert</speaker>
              <speech>You haven't given me enough resources to do my project.</speech>
            </bubble>
          </bubbles>
        </panel>
        ...
      </panels>
    </cartoon>

using DTD in cartoon.dtd:
    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT cartoon (prolog, panels)>
    <!ATTLIST cartoon copyright CDATA #REQUIRED>
    <!ATTLIST cartoon year CDATA #REQUIRED>
    <!ELEMENT prolog (series, author, characters)>
    <!ELEMENT series (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT characters (character)*>
    <!ELEMENT character (#PCDATA)>
    <!ELEMENT panels (panel)+>
    <!ELEMENT panel (scene, bubbles)>
    <!ATTLIST panel colour CDATA #IMPLIED>
    <!ELEMENT scene (#PCDATA)>
    <!ELEMENT bubbles (bubble)*>
    <!ELEMENT bubble (speaker, speech)*>
    <!ELEMENT speaker (#PCDATA)>
    <!ELEMENT speech (#PCDATA)>

Validity of XML documents w.r.t. DTDs
A document
• can have a doctype declaration, and this declaration specifies the
  – external subset or
  – internal subset or
  – both.
  – ...the DTD associated with a document is the union of both subsets.
• is valid w.r.t. a DTD if it satisfies all constraints in the DTD
• In particular, each element X in the document
  – must have an element declaration for X's name n in the DTD, i.e., <!ELEMENT n (element-content)>, and
  – must conform to it:
    • its child nodes must match element-content, i.e.,
      – if element-content = #PCDATA, then X must have simple (text) content
      – if element-content is a regular expression e, then the sequence of X's child nodes' names must match e
    • its attributes must conform to n's !ATTLIST declaration(s)

Validity of XML documents w.r.t. DTDs
• A document D is valid if
  – D is associated with a DTD (internal or external or both) and
  – D is valid w.r.t. that DTD and
  – the element named in the doctype declaration is D's root element, e.g., cartoon in
        <!DOCTYPE cartoon SYSTEM "cartoon.dtd">
• Note: a document can be valid, and also valid w.r.t. another DTD
  – which might be more or less strict than the associated DTD
• Try <oXygen/>
  – for your coursework
  – to write XML documents and DTDs
  – it automatically checks
    • whether your document is well-formed and
    • whether your document conforms to your DTD!

Phew - Summary of today
• Semi-structured data
  – datamodel
  – XML
  – trees
• Parsing & serializing
• DOM & SAX
• Self-describing
• Why schemas
• A first schema language: DTDs
• ...see you in the labs - now - to get you started on the coursework!
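Looking back at validity w.r.t. a DTD: the condition that the sequence of an element's child names must match its content model can be tried out with java.util.regex. This is a rough sketch, not how a validating parser is actually implemented; the class ContentModel and its methods are made up for illustration, and it assumes the content model contains only element names and the operators , | * + ? with brackets (no EMPTY, ANY or #PCDATA). Each name n is turned into the group (?:n ) so that one name cannot match a prefix of another, the comma is dropped (concatenation is implicit in Java regular expressions), and the other operators keep their meaning.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ContentModel {

    // Translates a DTD content model over element names into a Java regex.
    // Assumption: only names and the operators , | * + ? ( ) occur.
    public static Pattern compile(String contentModel) {
        String regex = contentModel
                .replaceAll("\\s+", "")                        // drop whitespace
                .replaceAll("[A-Za-z_:][\\w.:-]*", "(?:$0 )")  // name n becomes (?:n )
                .replace(",", "");                             // concatenation is implicit
        return Pattern.compile(regex);
    }

    // Checks whether the sequence of child element names matches the model.
    public static boolean matches(String contentModel, List<String> childNames) {
        StringBuilder sequence = new StringBuilder();
        for (String name : childNames) {
            sequence.append(name).append(' ');
        }
        return compile(contentModel).matcher(sequence).matches();
    }

    public static void main(String[] args) {
        String person = "(name,address+,email*)";
        // name followed by two addresses: address+ allows one or more
        System.out.println(matches(person, Arrays.asList("name", "address", "address"))); // true
        // no address at all: address+ requires at least one
        System.out.println(matches(person, Arrays.asList("name", "email")));              // false
    }
}
```

With the declaration <!ELEMENT address (city,(nr,street)?)> from the DTD Declarations slide, a child sequence city is accepted, whereas city, nr is rejected because nr and street are only optional together.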