XML 1 Introduction
Recycled from Gill Windall’s notes
© 2004 University of Greenwich

XML Basics
• This lecture aims to cover:
  – What is XML and why it is significant
  – Content versus presentation
  – Displaying XML documents
  – Well-formed XML documents
  – Further XML syntax
  – What XML is actually used for
  – Technologies related to XML
  – Introduction to DTDs and Schemas
  – Introduction to namespaces

What is XML?
1. A revolutionary and pervasive technology
   "XML is what we should be focussing on in the industry for the next 2 to 4 years"
   "XML gives us the freedom to do what we want"
   Don Box - IT Guru - Dec 2001
  – but pervasive things can be a bit difficult to get a handle on ...

What is XML?
2. eXtensible Markup Language
  – HTML tags and attributes are restricted to those that the browser has been coded to recognise
  – XML is extensible because tags and attributes can be invented to suit any application e.g.
    <book>
      <ISBN>1-34565-79-8</ISBN>
      <date>2001-07-03</date>
      <title>Hamsters and other Furry Rodents</title>
    </book>

What is XML?
3. A simplified version of SGML (Standard Generalized Markup Language) - a language for defining mark-up languages
  – XML and HTML are related (hence the family likeness) via SGML
  [Diagram: HTML and other SGML languages are defined using SGML; XML is a subset of SGML; XHTML and other XML languages are defined using XML]

What is XML?
  – SGML is too complex for easy automatic processing. Generic tools for manipulating SGML documents are expensive and large.
  – XML is designed for easy automatic processing. Generic tools for manipulating XML documents are relatively cheap and efficient.
4. A W3C standard - the core specification is XML 1.0
5. More than just hype (although it has been heavily hyped)

W3C Design Goals of XML
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
http://www.w3.org/TR/REC-xml/#sec-origin-goals

Why XML?
• HTML tags and attributes are pre-defined in the HTML (XHTML) standard and describe presentation
• XML tags and attributes are defined to describe content and structure
• XML separates content from presentation

Separation of Content and Presentation
• HTML: presentation defined, but content meaning ?????
    <tr>
      <td>1-56543-87-9</td>
      <td>1998-03-07</td>
      <td>Frogs and Toads of the British Isles</td>
    </tr>
• XML: content meaning clear, but presentation ?????
    <book>
      <ISBN>1-56543-87-9</ISBN>
      <date>1998-03-07</date>
      <title>Frogs and Toads of the British Isles</title>
    </book>

Separation of Content and Presentation
• Presentation can be rendered differently for different devices and needs
  [Diagram: the same <book> record is delivered to a catalogue, a web browser on a PC, a tablet, printed paper, an audio advert and a mobile phone]

Separation of Content and Presentation
• Enables meaningful searches
  [Diagram: a collection of <book> records feeds an XML search engine, which answers the query: FIND book WHERE ISBN= ]

Separation of Content and Presentation
• A universal format for data exchange and communication
  [Diagram: a book retailer (SQL Server on Windows) and a book publisher (Oracle server on UNIX) exchange data as XML]

Separation of Content and Presentation
• Data storage: an alternative to database technology?
  – Not really. XML is not a replacement for an RDBMS, but it may be used in places where a full RDBMS would be overkill.
  – XML schemas are well established but research is ongoing in the development of XML ontologies
    • ontology: classification of categories of being

Displaying XML documents
• XML documents define content but not presentation
• The more recent browsers can display XML documents as a hierarchical structure

Displaying XML documents
• So how do you tell browsers (or other presentation software) how to display documents that use XML-defined tags?
  – Using style sheets of course: XML document + style sheet = presentable document
• There are two main style sheet languages
  – CSS – Cascading Style Sheets
  – XSL – eXtensible Stylesheet Language
• XSL is much more complex and powerful
  – XSL-FO and XSLT
• For now we'll just use CSS to explore some possibilities

Displaying XML documents
books.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/css" href="books.css"?>
    <booklist>
      <book>
        <ISBN>1-34565-79-8</ISBN>
        <date>2001-07-03</date>
        <title>Hamsters and other Furry Rodents</title>
      </book>
      <book>
        <ISBN>1-56543-87-9</ISBN>
        <date>1998-03-07</date>
        <title>Frogs and Toads of the British Isles</title>
      </book>
    </booklist>
books.css
    book  { display:block }
    ISBN  { display:inline; font-family:arial; color:blue; font-size:10pt; font-weight:bold }
    title { display:inline; font-family:arial; }
    date  { display:none }

Well Formed and Valid XML Documents
• An XML document that conforms to the strict syntax rules in the XML 1.0 specification is well-formed.
• In addition, an XML document is valid if it conforms to a set of grammar rules defined in:
  – a Document Type Definition (DTD) or…
  – an XML Schema (XSD).
• XML documents don't need to have an associated DTD or Schema, in which case they can be checked for being well-formed but not for validity.

XML Syntax Rules
1. Document has a single root element
2. Tags must be properly nested
   • no overlapping tag pairs
3. All tags must have a closing tag
   • or be self closing
4. Tag names are case sensitive
5. Tag attributes are in the opening tag
   • unique attribute name
   • attribute value must be quoted

XML Syntax Rules
1. Only one root element is allowed in a document
   • This is called the document element
   • not well formed:
    <head>
      <title>Some HTML doc</title>
    </head>
    <body>
      A bit of text
    </body>
   • well formed:
    <html>
      <head>
        <title>Some HTML doc</title>
      </head>
      <body>
        A bit of text
      </body>
    </html>
   • To be well-formed an XML document must have a document element that encloses all the other elements

XML Syntax Rules
2. All elements must be "properly nested"
   • Any element contained inside another element has to be completely contained within it – you can't have one element partly within another
   • The following may work as HTML but it is not well-formed XML
    <b>bold text <i>bold italic text</b> italic text</i>
   • Whereas this is well-formed XML (XHTML)
    <b>bold text <i>bold italic text</i></b><i> italic text</i>

XML Syntax Rules
Rules 1 and 2 combined mean that it is always possible to represent an XML document as a simple hierarchical tree
    <html>
      <head><title>Some HTML doc</title></head>
      <body><p>A bit of text</p></body>
    </html>
  [Tree diagram]
    html
      head
        title
          Some HTML doc
      body
        p
          A bit of text

XML Syntax Rules
Quick Quiz
Draw a hierarchical tree to represent the following document
    <html><head><title>Flowers</title></head>
    <body>
      <p>List of <b>flowers</b></p>
      <ul>
        <li>daisy</li><li><i>buttercup</i></li>
      </ul>
      <hr></hr>
    </body></html>

XML Syntax Rules
3. All elements must have a closing tag
   • The following acceptable HTML is not well-formed XML
    <p>first paragraph
    <p>second paragraph
   • Whereas this is
    <p>first paragraph</p>
    <p>second paragraph</p>
   • If the element is truly empty (i.e. it has no content) then the empty-element tag notation may be used, so…
    <hr></hr>
   • may be rewritten as
    <hr />

XML Syntax Rules
4. Tag names are case sensitive
   • <title> is different to <Title> is different to <TITLE>
   • closing tags must match case – of course
    <title>Hamsters and other Furry Rodents</TITLE>
   • would be wrong

XML Syntax Rules
5. Some rules concerning attributes
   • Start tags and empty tags, but not end tags, can contain attributes
   • Attributes always exist as name="value" pairs
   • The attribute value must always be quoted with " or '
   • The attribute name must be unique within the tag
   • Some bad attribute examples:
    <film rating=PG>Snow White turns ugly</film>
    <car colour='silver trim' colour="red body">KKE 763L</car>
    <transaction>credit</transaction id="12543">
    <transaction synchronised>close account</transaction>

Some More XML Syntax
• Knowing about elements (i.e. tags), attributes and well-formed documents allows you to create basic XML documents
• Other aspects of XML syntax include
  – XML declaration
  – Processing instructions
  – Comments
  – Character references and Entities
  – Special symbols
  – CDATA sections

XML Declaration
• Ideally all XML documents should start with an XML declaration (an SGML processing instruction)
    <?xml version="1.0" encoding="UTF-8"?>
• If included the declaration must:
  – be the very first thing in the document (nothing, not even white space, may come before it)
  – begin with <?xml and end with ?>
  – include version= to indicate the version of XML
    • almost always "1.0" (XML 1.1 exists but is rarely used)
• The declaration may optionally include:
  – encoding= indicates the encoding used to store the file; typically this is "UTF-8" (an 8-bit encoding of Unicode)
  – standalone="yes" or standalone="no": does the document depend on external markup declarations?
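The syntax rules above can be tried out with any XML parser. The following is a minimal sketch, assuming Python and its standard library module xml.etree.ElementTree (neither is part of the original lecture); the helper name is_well_formed and the SAMPLES dictionary are invented for illustration. It checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as well-formed XML, otherwise False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Sample strings taken from the syntax-rule slides, plus one corrected version.
SAMPLES = {
    "unquoted attribute value": "<film rating=PG>Snow White turns ugly</film>",
    "duplicate attribute name": '<car colour=\'silver trim\' colour="red body">KKE 763L</car>',
    "attribute in end tag": '<transaction>credit</transaction id="12543">',
    "attribute without a value": "<transaction synchronised>close account</transaction>",
    "mismatched case": "<title>Hamsters and other Furry Rodents</TITLE>",
    "overlapping elements": "<b>bold text <i>bold italic text</b> italic text</i>",
    "missing closing tags": "<p>first paragraph <p>second paragraph",
    "declaration not at the start": '\n<?xml version="1.0"?><hr />',
    "corrected attribute": '<film rating="PG">Snow White turns ugly</film>',
}

for label, text in SAMPLES.items():
    verdict = "well-formed" if is_well_formed(text) else "NOT well-formed"
    print(f"{label}: {verdict}")
```

Only the last sample should be reported as well-formed. The parser stops at the first error it meets, so a document with several problems is reported once.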
Processing Instructions
• Instructions intended for an application processing the XML document
• PIs have the form <?target instruction ?>
  – target identifies the program that the instruction is intended for
  – instruction is the instruction to the target program
• A very common PI is
    <?xml-stylesheet href="mystyle.css" type="text/css"?>
  – target: xml-stylesheet
  – instruction: href="mystyle.css" type="text/css"

Character References
• As in HTML these can be used to include nonstandard characters in the document
  – i.e. things that can be displayed but not easily entered from a standard keyboard
• Format is:
  – &#NNN; or &#xHHH;
  – NNN is the decimal number, or HHH is the hex number, representing the character in the Unicode character set.
    <test>it's Greek to me &#934; &#916; &#x394;</test>
• it's Greek to me Φ Δ Δ

Entities
• Some symbols have a special meaning in XML and must be entered as entities (or character references)
• Predefined entities
  – Less than symbol (<) - &lt;
  – Greater than symbol (>) - &gt;
  – Quotation mark (") - &quot;
  – Apostrophe (') - &apos;
  – Ampersand (&) - &amp;
• Copyright (©) - &copy; is predefined in HTML but not in XML; declare it in a DTD or use the character reference &#169;
• Customised ones e.g. &copyw; to insert a predefined (e.g. in a DTD) copyright statement.

CDATA Sections
• A way of including data that you don't want interpreted as XML
• Form is <![CDATA[the data not to be interpreted as XML]]>
• Why would you do this?
  – Perhaps to include examples of XML in a document which you don't want processed as XML e.g.
    <![CDATA[ <wrong attr=val />]]>
• Comments, as in HTML, use <!-- and -->

XML Applications
Standard vocabularies for representing and exchanging specialist data e.g.
legal, scientific, medical, mathematical vocabularies
(excerpt)
    <molecule convention="MDLMol" id="dopamine" title="DOPAMINE">
      <date day="22" month="11" year="1995"></date>
      <atomArray>
        <atom id="a1">
          <string builtin="elementType">C</string>
          <float builtin="x2">0.0222</float>
          <float builtin="y2">0.8115</float>
        </atom>

XML Applications
• Used by human-facing client software e.g.
  – eXtensible Hypertext Markup Language - XHTML
  – Wireless Markup Language - WML
  – Synchronised Multimedia Integration Language - SMIL
  – Scalable Vector Graphics - SVG
  – Mathematical Markup Language - MathML
  – Voice Extensible Markup Language - VoiceXML

XML Applications
• Meta data (data about data) to describe resources e.g.
  – Resource Description Framework - RDF
  – Really Simple Syndication - RSS
  – DARPA Agent Markup Language - DAML
  – Ontology Integration Language - OIL
  – Web Ontology Language - OWL
    <rdf:Description about="http://www.gre.ac.uk/examregs.html">
      <cd:Creator>Fred Bloggs</cd:Creator>
      <cd:Date>20021212</cd:Date>
    </rdf:Description>

XML Applications
• Web services
• Buried deep in computer to computer communications
  – XML-RPC, SOAP, WSDL, UDDI
    <SOAP-ENV:Body>
      <proc:GetCurrentPrice xmlns:proc="proc-URI"/>
• Business to business (B2B) data exchange
  – BizTalk, ebXML
    <BusinessPartnerRole name="Buyer">
      <Performs initiatingRole="Buyer"/>
• More B2B than B2C

XML in the Enterprise
• Web site: XML documents transformed using XSLT for multi-channel delivery (XHTML, WML, HTML, VoiceXML)
• XML multimedia
• XML aware search engines
• Enterprise systems: XML communication within a distributed system (SOAP, XML-RPC)
• B2B links: XML data exchange
• XML based web services: call to third party services e.g. Microsoft Passport
• XML enabled databases e.g. Oracle, DB2, SQL Server

XML Technologies
• Applications of XML: CML, MathML, WML, VoiceXML, XHTML, SMIL, SVG, RDF, SOAP, UDDI, WSDL, ebXML etc.
• Supporting specifications: XPath, XLink, XPointer, XQuery, XSLT, XSL-FO, CSS, DOM etc.
• Supporting tools:
  – Browsers – IE, Mozilla
  – APIs – DOM, SAX
  – Parsers – Expat, MSXML, Xerces
  – IDEs – XMLSpy, Stylus
• Core XML: Syntax, DTD, XSD, Namespaces

DTDs and Schemas
• DTDs and schemas (XSD) are alternative ways of defining an XML language.
• They contain rules that specify things such as
  – the tags in the vocabulary
  – which tags are allowed to be nested in other tags
  – which tags and attributes are optional / mandatory
  – which values are allowed for attributes
• XML languages defined by DTDs or schemas are used to create valid XML documents.

DTDs and Schemas
• For an XML document to be valid it must conform to the rules specified in its DTD or Schema
  [Diagram: a DTD or Schema defines an XML language; XML documents use the language defined in the DTD or Schema]

Example XML with DTD
transactions.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE transactions SYSTEM "translang.dtd">
    <transactions>
      <transaction>
        <trantype>credit</trantype>
        <amount>2000</amount>
      </transaction>
      <transaction>
        <trantype>debit</trantype>
        <amount>1000</amount>
      </transaction>
      <transaction>
        <trantype>credit</trantype>
        <amount>300</amount>
      </transaction>
    </transactions>
• the DOCTYPE declaration associates a DTD in a separate file (translang.dtd) with this document
translang.dtd
    <?xml version="1.0" encoding="UTF-8"?>
    <!ELEMENT transactions (transaction*)>
    <!ELEMENT transaction (trantype, amount)>
    <!ELEMENT trantype (#PCDATA)>
    <!ELEMENT amount (#PCDATA)>
translang.dtd says that:
• the transactions element contains zero or more transaction elements
• each transaction element contains a trantype element followed by an amount element
• each trantype element contains data
• each amount element contains data

XML Schema
• DTDs:
  – easy for humans to cope with
  – older than schemas
    • supported by a much wider range of XML tools and software
  – have poor support for namespaces
• Schemas:
  – more verbose
  – much more expressive than DTDs
    • data types, constraints on values
  – an XML based vocabulary
    • can be manipulated with general purpose XML tools
  – support namespaces
  – declared in the root element of the XML document
    <transactions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:noNamespaceSchemaLocation="translang.xsd">

translang.xsd
    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
      <xs:element name="transactions">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="transaction" minOccurs="0" maxOccurs="100"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="transaction">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="trantype"/>
            <xs:element ref="amount"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="trantype">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="credit"/>
            <xs:enumeration value="debit"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="amount" type="xs:integer"/>
    </xs:schema>
translang.xsd says that:
• the transactions element contains between 0 and 100 transaction elements
• the transaction element contains a trantype element followed by an amount element
• the trantype element contains a string with either the value "credit" or "debit"
• the amount element contains an integer

Quick Quiz
• Is the following document valid according to either or both of the DTD or Schema above?
    <transactions>
      <transaction>
        <trantype>credit</trantype><amount>24.75</amount>
      </transaction>
      <transaction>
        <trantype>credit</trantype><amount>650</amount>
      </transaction>
    </transactions>

Namespaces
• Namespaces are a way of avoiding name conflicts
  – where different XML vocabularies use the same element names to mean different things.
• Consider two hypothetical XML languages: ShoeML and PicML
  – in ShoeML the <size> element refers to shoe size
  – in PicML the <size> element refers to the size of an image
• The problem comes when you want to mix several vocabularies: what does size mean?
    <shoe>
      <style>SupaFeet</style>
      <size>39</size>
      <image>
        <filename>supafeet.jpg</filename>
        <size>100kb</size>
      </image>
    </shoe>

Namespaces
• The previous example is well-formed XML but it is difficult for applications to know how to process <size>.
• The solution is to use prefixes for the element names to distinguish between them
  – can also be used for attributes
  – each prefix must be bound to a namespace URI with an xmlns:prefix attribute (the declarations are omitted from the example below)
• Here shoe vocabulary element names are prefixed by shu: and image element names are prefixed by img:
    <shu:shoe>
      <shu:style>SupaFeet</shu:style>
      <shu:size>39</shu:size>
      <img:image>
        <img:filename>supafeet.jpg</img:filename>
        <img:size>100kb</img:size>
      </img:image>
    </shu:shoe>

References
• There are masses of XML books and websites.
  – "Professional XML" - Birbeck et al, Wrox Press
    • Very comprehensive book.
    • This lecture covers much of the material in chapters 1 and 2
  – "SAMS Teach Yourself XML in 24 hours" - Morrison
    • Cheap as chips, good scope but little depth
• W3Schools online tutorial http://www.w3schools.com
  – Try their online XML test
• World Wide Web consortium at http://www.w3.org
  – The home of the XML specification and so much more.
• XML in practice from http://www.xml.org
  – Articles, white papers, user groups and more
• XML resources and information from http://www.xml.com
  – Provided by Tim O’Reilly

Summary
• XML is a meta-language used to define application specific markup languages
  – XHTML, MathML, CML, WML, ShoeML, etc.
• XML is designed to be straightforward and easy to use
• XML provides simple syntactic rules that result in well-formed, hierarchically structured documents
• DTDs or Schemas are used to define valid XML languages
  – namespaces avoid conflicts between XML languages
• XML separates content from presentation
  – CSS and XSL can be used to render XML documents in a readable form
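The namespace idea from the ShoeML and PicML slides can be sketched in code. This is an illustration only, assuming Python's standard xml.etree.ElementTree module. The two namespace URIs are hypothetical and invented for this sketch, as are the xmlns declarations, which the slide example omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, made up for this sketch; a URI is only a unique name.
SHOE_NS = "http://example.org/shoeml"
PIC_NS = "http://example.org/picml"

# The shoe document from the slides, with the two prefixes bound to URIs.
doc = f"""
<shu:shoe xmlns:shu="{SHOE_NS}" xmlns:img="{PIC_NS}">
  <shu:style>SupaFeet</shu:style>
  <shu:size>39</shu:size>
  <img:image>
    <img:filename>supafeet.jpg</img:filename>
    <img:size>100kb</img:size>
  </img:image>
</shu:shoe>
"""

root = ET.fromstring(doc)

# Map prefixes to URIs for searching; these prefixes need not match the document's.
ns = {"shu": SHOE_NS, "img": PIC_NS}
shoe_size = root.find("shu:size", ns).text
image_size = root.find("img:image/img:size", ns).text
print("shoe size:", shoe_size)
print("image size:", image_size)

# A prefix that is never bound to a URI is rejected by a namespace-aware parser.
try:
    ET.fromstring("<shu:size>39</shu:size>")
    unbound_prefix_rejected = False
except ET.ParseError:
    unbound_prefix_rejected = True
print("unbound prefix rejected:", unbound_prefix_rejected)
```

The parser identifies each element by its namespace URI plus local name, so the two size elements can no longer be confused even though their local names match.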