XML 1
Introduction
Recycled from Gill Windall’s notes
© 2004
University of Greenwich
XML Basics
• This lecture aims to cover:
– What is XML and why it is significant
– Content versus presentation
– Displaying XML documents
– Well-formed XML documents
– Further XML syntax
– What XML is actually used for
– Technologies related to XML
– Introduction to DTDs and Schemas
– Introduction to namespaces
What is XML?
1. A revolutionary and pervasive technology
"XML is what we should be focussing on
in the industry for the next 2 to 4 years"
"XML gives us the freedom to do what we
want"
Don Box - IT Guru - Dec 2001
– but pervasive things can be a bit difficult to get
a handle on ...
What is XML?
2. eXtensible Markup Language
– HTML tags and attributes are restricted to those that
the browser has been coded to recognise
– XML is extensible because tags and attributes can be
invented to suit any application e.g.
<book>
<ISBN>1-34565-79-8</ISBN>
<date>2001-07-03</date>
<title>
Hamsters and other Furry Rodents
</title>
</book>
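A minimal sketch of how a program might read these invented tags. It assumes Python and its standard-library xml.etree.ElementTree parser, neither of which is part of the original notes:

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
  <ISBN>1-34565-79-8</ISBN>
  <date>2001-07-03</date>
  <title>
    Hamsters and other Furry Rodents
  </title>
</book>"""

book = ET.fromstring(BOOK_XML)  # parse the text into an element tree

# The parser knows nothing about books; the application decides
# what <ISBN>, <date> and <title> mean.
record = {child.tag: child.text.strip() for child in book}
print(record)
```

The parser accepts any tag names as long as the document is well formed, which is the sense in which XML is extensible.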
What is XML?
3. A simplified version of SGML (Standard
Generalized Markup Language) - a language for
defining mark-up languages
– XML and HTML are related (hence the family likeness)
via SGML
[Diagram: HTML and other SGML languages are defined using SGML.
XML is a subset of SGML.
XHTML and other XML languages are defined using XML.]
What is XML?
– SGML is too complex for easy automatic processing.
Generic tools for manipulating SGML documents are
expensive and large.
– XML is designed for easy automatic processing.
Generic tools for manipulating XML documents are
relatively cheap and efficient.
4. A W3C standard - the core specification is XML
1.0
5. More than just hype (although it has been
heavily hyped)
W3C Design Goals of XML
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the
absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
http://www.w3.org/TR/REC-xml/#sec-origin-goals
Why XML?
• HTML tags and attributes are pre-defined
in the HTML (XHTML) standard and
describe presentation
• XML tags and attributes are defined to
describe content and structure
XML separates content
from presentation
Separation of Content and Presentation
HTML:
<tr>
<td>1-56543-87-9</td>
<td>1998-03-07</td>
<td>Frogs and Toads of the British Isles</td>
</tr>
– content
– meaning ?????
– presentation defined
XML:
<book>
<ISBN>1-56543-87-9</ISBN>
<date>1998-03-07</date>
<title>Frogs and Toads of the British Isles</title>
</book>
– content
– meaning clear
– presentation ?????
Separation of Content and Presentation
Presentation can be rendered differently
for different devices and needs
<book>
<ISBN>1-56543-87-9</ISBN>
<date>1998-03-07</date>
<title>Frogs and Toads of the British Isles</title>
</book>
[Diagram: the same XML source is rendered for each of the following]
– web browser on a PC
– tablet
– mobile phone
– printed paper
– catalogue
– audio advert
Separation of Content and Presentation
Enables meaningful searches
[Diagram: a stack of XML documents, each of the form]
<book>
<ISBN>1-56543-87-9</ISBN>
<date>1998-03-07</date>
<title>Frogs and Toads of the British Isles</title>
</book>
[is searched by an XML search engine using a structured query]
query:
FIND book
WHERE ISBN=
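A search by element name can be sketched in Python with the standard-library xml.etree.ElementTree module. This is an illustration, not the search engine the slide has in mind: the `find_books_by_isbn` helper is hypothetical, and the `booklist` wrapper is borrowed from the books.xml example used later in these notes.

```python
import xml.etree.ElementTree as ET

CATALOGUE = """<booklist>
  <book>
    <ISBN>1-34565-79-8</ISBN>
    <date>2001-07-03</date>
    <title>Hamsters and other Furry Rodents</title>
  </book>
  <book>
    <ISBN>1-56543-87-9</ISBN>
    <date>1998-03-07</date>
    <title>Frogs and Toads of the British Isles</title>
  </book>
</booklist>"""

def find_books_by_isbn(xml_text, isbn):
    """Return the titles of all <book> elements whose <ISBN> equals isbn."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a small XPath subset, including [tag='text'] predicates.
    matches = root.findall(f"./book[ISBN='{isbn}']")
    return [book.findtext("title") for book in matches]

titles = find_books_by_isbn(CATALOGUE, "1-56543-87-9")
print(titles)
```

Because the ISBN sits in its own element, the query matches on meaning rather than on the position of a table cell.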
Separation of Content and Presentation
A universal format for data exchange and
communication
[Diagram: Book retailer (SQL Server on Windoze) <-- XML --> Book publisher (Oracle server on UNIX)]
Separation of Content and Presentation
Data storage
An alternative to Database technology?
– Not really, XML is not a replacement for a
RDBMS but may be used in places where a
full RDBMS may be overkill.
– XML schemas are well established but
research is ongoing in the development of
XML ontologies
• ontology: classification of categories of being
Displaying XML documents
• XML documents define content but not presentation
• The more recent browsers can display XML documents
as a hierarchical structure
Displaying XML documents
• So how do you tell browsers (or other presentation
software) how to display documents that use XML-defined
tags?
– Using style sheets of course:
XML document + style sheet = presentable document
• There are two main style sheet languages
CSS – Cascading Style Sheets
XSL – eXtensible Stylesheet Language
• XSL is much more complex and powerful
XSL-FO and XSLT
• For now we'll just use CSS to explore some possibilities
Displaying XML documents
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="books.css"?>
<booklist>
  <book>
    <ISBN>1-34565-79-8</ISBN>
    <date>2001-07-03</date>
    <title>Hamsters and other Furry Rodents</title>
  </book>
  <book>
    <ISBN>1-56543-87-9</ISBN>
    <date>1998-03-07</date>
    <title>Frogs and Toads of the British Isles</title>
  </book>
</booklist>

books.css
book {
  display:block }
ISBN {
  display:inline;
  font-family:arial;
  color:blue;
  font-size:10pt;
  font-weight:bold }
title {
  display:inline;
  font-family:arial; }
date {
  display:none }
Well Formed and Valid XML Documents
• An XML document that conforms to the strict
syntax rules in the XML 1.0 specification can be
considered to be well-formed.
• In addition, an XML document can be
considered as valid if it conforms to a set of
grammar rules defined in:
– a Document Type Definition (DTD) or…
– an XML Schema (XSD).
• XML documents don't need to have an
associated DTD or Schema
– in which case they can only be checked for being well
formed but not for validity.
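The well-formedness half of this distinction is easy to test in code. The sketch below assumes Python and its standard-library xml.etree.ElementTree parser, which checks well-formedness only and never validates against a DTD or Schema. The `is_well_formed` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<book><title>Frogs</title></book>"))      # properly nested
print(is_well_formed("<book><title>Frogs</book></title>"))      # overlapping tags
print(is_well_formed("<p>first paragraph<p>second paragraph"))  # unclosed tags
```

A document that passes this check may still be invalid. Deciding that needs a validating parser and the DTD or Schema itself.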
XML Syntax Rules
1. Document has a single root element
2. Tags must be properly nested
• no overlapping tag pairs
3. All tags must have a closing tag
• or be self closing
4. Tag names are case sensitive
5. Tag attributes are in the opening tag
• unique attribute name
• attribute value must be quoted
XML Syntax Rules
1. Only one root element is allowed in a document
This is called the document element
<head>
<title>Some HTML doc</title>
</head>
<body>
A bit of text
</body>
not well formed

<html>
<head>
<title>Some HTML doc</title>
</head>
<body>
A bit of text
</body>
</html>
well formed
To be well-formed an XML document must have a
document element that encloses all the other elements
XML Syntax Rules
2. All elements must be "properly nested"
• Any element contained inside another element has to be
completely contained within it
– you can't have one element partly within another
• The following may work as HTML in a forgiving browser but it is
not well formed XML
<b>bold text <i>bold italic text</b> italic text</i>
• Whereas this is well formed XML (XHTML)
<b>bold text <i>bold italic text</i></b><i> italic text</i>
XML Syntax Rules
Rules 1 and 2 combined mean that it is
always possible to represent an XML
document as a simple hierarchical tree
<html>
<head><title>Some HTML doc</title></head>
<body><p>A bit of text</p></body>
</html>
html
  head
    title
      Some HTML doc
  body
    p
      A bit of text
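The tree view can be produced mechanically. The following is a sketch under assumptions: it uses Python's standard-library xml.etree.ElementTree, and `tree_lines` is a hypothetical helper that walks the parsed document and indents each level.

```python
import xml.etree.ElementTree as ET

def tree_lines(element, depth=0):
    """Return indented lines for the element, its text and its descendants."""
    indent = "  " * depth
    lines = [indent + element.tag]
    if element.text and element.text.strip():
        lines.append(indent + "  " + element.text.strip())
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
        # Text that follows a child element but is still inside this element.
        if child.tail and child.tail.strip():
            lines.append(indent + "  " + child.tail.strip())
    return lines

DOC = """<html>
<head><title>Some HTML doc</title></head>
<body><p>A bit of text</p></body>
</html>"""

outline = tree_lines(ET.fromstring(DOC))
print("\n".join(outline))
```

The recursion only works because every element is completely contained in its parent, which is what rules 1 and 2 guarantee.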
XML Syntax Rules
Quick Quiz
Draw a hierarchical tree to
represent the following document
<html><head><title>Flowers</title></head>
<body>
<p>List of <b>flowers</b></p>
<ul>
<li>daisy</li><li><i>buttercup</i></li>
</ul>
<hr></hr>
</body></html>
XML Syntax Rules
3. All elements must have a closing tag
• The following acceptable HTML is not well-formed XML
<p>first paragraph
<p>second paragraph
• Whereas this is
<p>first paragraph</p>
<p>second paragraph</p>
• If the tag is truly empty (i.e. it has no content) then the empty tag
notation may be used so…
<hr></hr>
• may be rewritten as
<hr />
XML Syntax Rules
4. Tag names are case sensitive
• <title> is different to <Title> is
different to <TITLE>
• closing tags must match case – of course
<title>Hamsters and other Furry Rodents</TITLE>
• would be wrong
XML Syntax Rules
5. Some rules concerning attributes
• Start tags and empty tags but not end tags can contain
attributes
• Attributes always exist as name="value" pairs
• The attribute value must always be quoted with " or '
• The attribute name must be unique within the tag
• Some bad attribute examples:
<film rating=PG>Snow White turns ugly</film>
<car colour='silver trim' colour="red body">KKE 763L</car>
<transaction>credit</transaction id="12543">
<transaction synchronised>close account</transaction>
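Each of the bad examples can be fed to a parser to see it rejected. This sketch assumes Python's standard-library xml.etree.ElementTree. The `parse_error` helper is hypothetical, and the exact wording of the error messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

BAD_EXAMPLES = [
    "<film rating=PG>Snow White turns ugly</film>",                  # unquoted value
    "<car colour='silver trim' colour=\"red body\">KKE 763L</car>",  # duplicate name
    "<transaction>credit</transaction id=\"12543\">",                # attribute in end tag
    "<transaction synchronised>close account</transaction>",         # name without value
]

def parse_error(xml_text):
    """Return the parser's error message, or None if the text is well formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)

errors = [parse_error(example) for example in BAD_EXAMPLES]
for example, error in zip(BAD_EXAMPLES, errors):
    print(f"{example}\n  -> {error}")

# Quoting the value is enough to make the first example well formed.
fixed = ET.fromstring('<film rating="PG">Snow White turns ugly</film>')
print(fixed.get("rating"))
```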
Some More XML Syntax
• Knowing about elements (i.e. tags), attributes
and well-formed documents allows you create
basic XML documents
• Other aspects of XML syntax include
– XML declaration
– Processing instructions
– Comments
– Character references and Entities
– Special symbols
– CDATA sections
XML Declaration
• Ideally all XML documents should start with an
XML declaration (SGML processing instruction)
<?xml version="1.0" encoding="UTF-8"?>
• If included the declaration must:
– be the very first thing in the document (nothing, not even
white space, may come before it)
– begin with <?xml and end with ?> (conventionally written on a
single line)
– include version= to indicate the version of XML
• this is almost always "1.0" (XML 1.1 exists but is rarely used)
– the declaration may optionally include:
• encoding= indicates the encoding used to store the file
typically this is "UTF-8" (an 8-bit encoding of Unicode)
• standalone="[yes|no]" indicates whether the document depends on
external markup declarations
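The position rule can be checked with a parser. This sketch assumes Python's standard-library xml.etree.ElementTree, and the `booklist` document is only an illustration. It parses one document whose declaration comes first and one where a newline precedes it.

```python
import xml.etree.ElementTree as ET

WITH_DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n<booklist/>'
# The same document, but with a newline in front of the declaration.
MISPLACED = b"\n" + WITH_DECLARATION

root = ET.fromstring(WITH_DECLARATION)
print(root.tag)

try:
    ET.fromstring(MISPLACED)
    misplaced_ok = True
except ET.ParseError as exc:
    misplaced_ok = False
    print("rejected:", exc)
```

The documents are passed as bytes so that the parser, not Python, interprets the encoding= setting.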
Processing Instructions
• Instructions intended for an application
processing the XML document
• PIs have the form
<?target instruction ?>
– target identifies the program that the instruction is
intended for
– instruction is the instruction to the target program
• A very common PI is
<?xml-stylesheet href="mystyle.css" type="text/css"?>
– target: xml-stylesheet
– instruction: href="mystyle.css" type="text/css"
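An application can read the target and the instruction separately. The sketch below assumes Python's standard-library xml.dom.minidom, and the `booklist` root element is only a placeholder. It lists the processing instructions that appear before the document element.

```python
from xml.dom import Node, minidom

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="mystyle.css" type="text/css"?>
<booklist/>"""

document = minidom.parseString(DOC)

# Collect (target, instruction) pairs from the top level of the document.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The XML declaration on the first line is not reported as a processing instruction, even though it uses the same <? ... ?> delimiters.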
Character References
• As in HTML these can be used to include non-standard characters in the document
– i.e. things that can be displayed but not easily entered
from a standard keyboard
• Format is:
– &#NNN; or &#xHHH;
– NNN is the decimal number or HHH is the hex
number representing the character in the Unicode
character set.
<test>it's Greek to me &#934; &#916; &#x394;</test>
• displays as: it's Greek to me Φ Δ Δ
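A parser resolves character references before the application sees the text. This sketch assumes Python's standard-library xml.etree.ElementTree. It parses a document containing the references &#934;, &#916; and &#x394; and prints the resulting code points.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<test>it's Greek to me &#934; &#916; &#x394;</test>")
text = element.text
print(text)

# The last three words are single characters; show their Unicode code points.
codes = [ord(word) for word in text.split()[-3:]]
print(codes)
```

The decimal reference &#916; and the hex reference &#x394; name the same character, so the last two code points are equal.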
Entities
• Some symbols have a special meaning in XML
and must be entered as entities (or character
references)
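• Standard symbols (the five entities predefined by XML)
– Less than symbol (<) - &lt;
– Greater than symbol (>) - &gt;
– Quotation mark (") - &quot;
– Apostrophe (') - &apos;
– Ampersand (&) - &amp;
• Copyright (©) - &copy; is an HTML entity and is not predefined in XML;
use the character reference &#169; or declare the entity in a DTD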
• Customised ones e.g. &copyw; to insert a
predefined (e.g. in a DTD) copyright statement.
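When a program generates XML, the special symbols are usually escaped by a library rather than by hand. The following sketch assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree). It also shows that an entity name which XML does not predefine is rejected unless it has been declared.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Frogs & Toads: size < 10cm"
escaped = escape(raw)  # replaces &, < and > with entity references
print(escaped)

# The parser turns the entity references back into the original characters.
element = ET.fromstring(f"<title>{escaped}</title>")
print(element.text == raw)

try:
    ET.fromstring("<note>&copy; 2004</note>")
    copy_defined = True
except ET.ParseError:
    copy_defined = False  # no DTD declares &copy;, so the parser rejects it
print(copy_defined)
```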
CDATA Sections
• A way of including data that you don't want
interpreted as XML
• Form is
<![CDATA[the data not to be interpreted as XML]]>
• Why would you do this?
– Perhaps to include examples of XML in a document
which you don't want processed as XML e.g.
<![CDATA[ <wrong attr=val />]]>
• Comments like HTML use <!-- and -->
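The effect of a CDATA section can be seen by parsing one. This sketch assumes Python's standard-library xml.etree.ElementTree, and the `example` wrapper element is invented for the illustration.

```python
import xml.etree.ElementTree as ET

doc = "<example><![CDATA[ <wrong attr=val />]]></example>"
element = ET.fromstring(doc)

print(repr(element.text))  # the markup-like characters arrive as plain text
print(len(element))        # number of child elements created
```

Without the CDATA wrapper the same content would make the document not well formed, because the attribute value is unquoted.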
XML Applications
Standard vocabularies for representing and
exchanging specialist data
e.g. legal, scientific, medical, mathematical
vocabularies
<molecule convention="MDLMol" id="dopamine" title="DOPAMINE">
<date day="22" month="11" year="1995"></date>
<atomArray>
<atom id="a1">
<string builtin="elementType">C</string>
<float builtin="x2">0.0222</float>
<float builtin="y2">0.8115</float>
</atom>
XML Applications
• Used by human-facing client software e.g.
– eXtensible Hypertext Markup Language - XHTML
– Wireless Markup Language - WML
– Synchronised Multimedia Integration
Language - SMIL
– Scalable Vector Graphics - SVG
– MathML
– Voice Extensible Markup Language - VoiceXML
XML Applications
• Meta data (data about data) to describe
resources e.g.
– Resource Description Framework - RDF
– Really Simple Syndication - RSS
– DARPA Agent Markup Language - DAML
– Ontology Inference Layer - OIL
– Web Ontology Language - OWL
<rdf:Description about="http://www.gre.ac.uk/examregs.html">
<cd:Creator>Fred Bloggs</cd:Creator>
<cd:Date>20021212</cd:Date>
</rdf:Description>
XML Applications
• Web services
• Buried deep in computer to computer
communications
– XML-RPC, SOAP, WSDL, UDDI
<SOAP-ENV:Body>
<proc:GetCurrentPrice xmlns:proc="proc-URI"/>
• Business to business (B2B) data exchange
– BizTalk, ebXML
<BusinessPartnerRole name="Buyer">
<Performs initiatingRole="Buyer"/>
• More B2B than B2C
XML in the Enterprise
[Diagram: XML links the parts of an enterprise]
• Web Site: XML documents transformed using XSLT for multi-channel
delivery (XHTML, WML, HTML, VoiceXML)
• XML multimedia
• XML aware search engines
• Enterprise Systems: XML communication within a distributed system
(SOAP, XML-RPC)
• B2B links: XML data exchange
• XML based web services: call to third party services
e.g. Microsoft Passport
• XML enabled databases e.g. Oracle, DB2, SQL Server
XML Technologies
• Applications of XML
– CML, MathML, WML, VoiceXML, XHTML, SMIL, SVG,
RDF, SOAP, UDDI, WSDL, ebXML etc. etc.
• Supporting Specifications
– XPath, XLink, XPointer, XQuery, XSLT, XSL-FO, CSS, DOM etc.
• Supporting Tools
– Browsers – IE, Mozilla
– APIs – DOM, SAX
– Parsers – Expat, MSXML, Xerces
– IDEs – XMLSpy, Stylus
• Core XML
– Syntax, DTD, XSD, Namespaces
DTDs and Schemas
• DTDs and schemas (XSD) are alternative ways
of defining an XML language.
• They contain rules that specify things such as
– the tags in the vocabulary
– which tags are allowed to be nested in other tags
– which tags and attributes are optional / mandatory
– which values are allowed for attributes
• XML languages defined by DTDs or schemas
are used to create valid XML documents.
DTDs and Schemas
• For an XML document to be valid it must
conform to the rules specified in its DTD or
Schema
[Diagram: a DTD or Schema defines an XML language;
XML documents use the language defined in the DTD or Schema]
Example XML with DTD
transactions.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE transactions SYSTEM "translang.dtd">
<transactions>
  <transaction>
    <trantype>credit</trantype>
    <amount>2000</amount>
  </transaction>
  <transaction>
    <trantype>debit</trantype>
    <amount>1000</amount>
  </transaction>
  <transaction>
    <trantype>credit</trantype>
    <amount>300</amount>
  </transaction>
</transactions>

The DOCTYPE declaration associates a DTD in a separate
file (translang.dtd) with this document.

translang.dtd
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT transactions (transaction*)>
<!ELEMENT transaction (trantype, amount)>
<!ELEMENT trantype (#PCDATA)>
<!ELEMENT amount (#PCDATA)>

translang.dtd says that:
• the transactions element contains zero
or more transaction elements
• each transaction element contains a
trantype element followed by an
amount element
• each trantype element contains data
• each amount element contains data
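Python's standard library does not include a validating parser, so the sketch below is not DTD validation. It is a hand-written check that mirrors the four content models of translang.dtd, to show what those rules constrain. The `CONTENT_MODELS` table and the `conforms` and `is_valid` helpers are hypothetical, and the check ignores stray text between elements.

```python
import re
import xml.etree.ElementTree as ET

# Content models from translang.dtd, rewritten by hand as regular expressions
# over a space-terminated list of child tag names (an illustrative shortcut).
CONTENT_MODELS = {
    "transactions": re.compile(r"(transaction )*"),  # (transaction*)
    "transaction": re.compile(r"trantype amount "),  # (trantype, amount)
    "trantype": re.compile(r""),                     # (#PCDATA): no child elements
    "amount": re.compile(r""),                       # (#PCDATA): no child elements
}

def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # element not declared in the DTD
    children = "".join(child.tag + " " for child in element)
    if not model.fullmatch(children):
        return False
    return all(conforms(child) for child in element)

def is_valid(xml_text):
    """The DOCTYPE names transactions as the root, so check that as well."""
    root = ET.fromstring(xml_text)
    return root.tag == "transactions" and conforms(root)

GOOD = ("<transactions><transaction><trantype>credit</trantype>"
        "<amount>2000</amount></transaction></transactions>")
SWAPPED = ("<transactions><transaction><amount>2000</amount>"
           "<trantype>credit</trantype></transaction></transactions>")
print(is_valid(GOOD), is_valid(SWAPPED), is_valid("<transactions/>"))
```

A real validating parser reads the DTD itself and applies the same kind of content-model matching to every element.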
XML Schema
• DTDs:
– easy for humans to cope with
– older than schemas
• supported by a much wider range of XML tools and software
– have poor support for namespaces
• Schemas:
– more verbose
– much more expressive than DTDs
• data types, constraints on values
– an XML based vocabulary
• can be manipulated with general purpose XML tools
– support namespaces
– declared in the root element of the XML document
<transactions
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="translang.xsd">
translang.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified">
  <!-- the transactions element contains between 0 and 100 transaction elements -->
  <xs:element name="transactions">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="transaction" minOccurs="0" maxOccurs="100"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- the transaction element contains a trantype element followed by an amount element -->
  <xs:element name="transaction">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="trantype"/>
        <xs:element ref="amount"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- the trantype element contains a string with either the value "credit" or "debit" -->
  <xs:element name="trantype">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="credit"/>
        <xs:enumeration value="debit"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <!-- the amount element contains an integer -->
  <xs:element name="amount" type="xs:integer"/>
</xs:schema>
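The extra expressiveness of the schema lies in its constraints on values. Python's standard library has no XSD validator, so the sketch below is not schema validation. It is a hand-written check of the three value rules in translang.xsd, and the `value_problems` helper and its constants are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

ALLOWED_TRANTYPES = {"credit", "debit"}  # the two xs:enumeration values
INTEGER = re.compile(r"[+-]?[0-9]+")     # lexical form of xs:integer
MAX_TRANSACTIONS = 100                   # maxOccurs="100"

def value_problems(xml_text):
    """List the ways a document breaks the value rules in translang.xsd."""
    root = ET.fromstring(xml_text)
    problems = []
    transactions = root.findall("transaction")
    if len(transactions) > MAX_TRANSACTIONS:
        problems.append("more than 100 transaction elements")
    for number, transaction in enumerate(transactions, start=1):
        trantype = transaction.findtext("trantype", default="")
        amount = transaction.findtext("amount", default="").strip()
        if trantype not in ALLOWED_TRANTYPES:
            problems.append(f"transaction {number}: trantype {trantype!r} not allowed")
        if not INTEGER.fullmatch(amount):
            problems.append(f"transaction {number}: amount {amount!r} is not an integer")
    return problems

OK = ("<transactions><transaction><trantype>debit</trantype>"
      "<amount>1000</amount></transaction></transactions>")
BAD = ("<transactions><transaction><trantype>refund</trantype>"
       "<amount>ten pounds</amount></transaction></transactions>")
print(value_problems(OK))
print(value_problems(BAD))
```

None of these checks can be expressed in the DTD, where trantype and amount are both plain #PCDATA.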
Quick Quiz
• Is the following document valid according
to either or both of the DTD or Schema
above?
<transactions>
<transaction>
<trantype>credit</trantype><amount>24.75</amount>
</transaction>
<transaction>
<trantype>credit</trantype><amount>650</amount>
</transaction>
</transactions>
Namespaces
• Namespaces are a way of avoiding name conflicts
– where different XML vocabularies use the same element names
to mean different things.
• Consider two hypothetical XML languages; ShoeML and
PicML
– in the language ShoeML the <size> element refers to shoe size
– in PicML the <size> element refers to the size of an image.
• The problem comes when you want to mix several
vocabularies
<shoe>
  <style>SupaFeet</style>
  <size>39</size>
  <image>
    <filename>supafeet.jpg</filename>
    <size>100kb</size>
  </image>
</shoe>
what does size mean?
Namespaces
• The previous example is well-formed XML but it is
difficult for applications to know how to process <size>.
• The solution is to use prefixes for the element names to
distinguish between them
– can also be used for attributes
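• Here shoe vocabulary element names are prefixed by
shu: and image element names are prefixed by img:
(in a complete document each prefix must also be bound to a
namespace URI with an xmlns:prefix attribute)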
<shu:shoe>
<shu:style>SupaFeet</shu:style>
<shu:size>39</shu:size>
<img:image>
<img:filename>supafeet.jpg</img:filename>
<img:size>100 kb</img:size>
</img:image>
</shu:shoe>
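A namespace-aware parser identifies an element by its namespace URI plus its local name, not by the prefix. The sketch below assumes Python's standard-library xml.etree.ElementTree. The two URIs are made up for the example, since the notes do not give any for ShoeML or PicML.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; any unique strings would do.
SHOE_NS = "http://example.org/shoeml"
PIC_NS = "http://example.org/picml"

DOC = f"""<shu:shoe xmlns:shu="{SHOE_NS}" xmlns:img="{PIC_NS}">
  <shu:style>SupaFeet</shu:style>
  <shu:size>39</shu:size>
  <img:image>
    <img:filename>supafeet.jpg</img:filename>
    <img:size>100 kb</img:size>
  </img:image>
</shu:shoe>"""

root = ET.fromstring(DOC)
prefixes = {"shu": SHOE_NS, "img": PIC_NS}

# The two size elements are now distinguishable by namespace.
shoe_size = root.findtext("shu:size", namespaces=prefixes)
image_size = root.findtext("img:image/img:size", namespaces=prefixes)
print(shoe_size, "|", image_size)
print(root.tag)  # ElementTree stores the name as {namespace URI}local-name
```

The prefixes shu: and img: are only shorthand. Two documents may use different prefixes for the same URI and still mean the same elements.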
References
• There are masses of XML books and websites.
– "Professional XML" - Birbeck et al, Wrox Press
• Very comprehensive book.
• This lecture covers much of the material in chapters 1 and 2
– “SAMS Teach Yourself XML in 24 hours” - Morrison
• Cheap as chips, good scope but little depth
• W3Schools online tutorial http://www.w3schools.com
– Try their online XML test
• World Wide Web consortium at http://www.w3.org
– The home of the XML specification and so much more.
• XML in practice from http://www.xml.org
– Articles, white papers, user groups and more
• XML resources and information from http://www.xml.com
– Provided by Tim O’Reilly
Summary
• XML is a meta-language used to define application
specific markup languages
– XHTML, MathML, CML, WML, ShoeML, etc.
• XML is designed to be straightforward and easy to use
• XML provides simple syntactic rules that result in well-formed, hierarchically structured documents
• DTDs or Schemas are used to define valid XML
languages
– namespaces avoid conflicts between XML languages
• XML separates content from presentation
– CSS and XSL can be used to render XML documents in a
readable form