XML & Related Languages
Unit 1
Introduction & XML Essentials
XML & Related Languages
1
Introduction & XML Essentials
© 2003 John E. Arnold All Rights Reserved
Introduction: XML
• XML = eXtensible Markup Language
 “… the universal format for structured documents
and data on the Web.”
 www.w3c.org/XML
 “… simple, very flexible text format derived from
SGML (ISO 8879).”
 Originally designed to meet challenges of large-scale electronic publishing
 Increasingly important role for exchanging a wide
variety of data on the Web (and not on the Web)
XML: an example
<?xml version="1.0" ?>
<Course CatalogId="RSEG-0151-G1">
<Title Nickname='XML Course'>XML and Related Languages
</Title>
<Credits>3</Credits>
<Offering>
<Term>Fall 2003</Term>
<Instructor>John Arnold</Instructor>
<Location OnCampus="Y">Waltham</Location>
<Schedule>
<Weekly>
<DayOfWeek>Tuesday</DayOfWeek>
<StartTime>1800</StartTime>
<EndTime>2100</EndTime>
</Weekly>
</Schedule>
</Offering>
</Course>
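The catalog entry above can be read back by any XML parser. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree (the slides do not prescribe a parser, and the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

COURSE_XML = """<?xml version="1.0" ?>
<Course CatalogId="RSEG-0151-G1">
  <Title Nickname='XML Course'>XML and Related Languages
  </Title>
  <Credits>3</Credits>
  <Offering>
    <Term>Fall 2003</Term>
    <Instructor>John Arnold</Instructor>
    <Location OnCampus="Y">Waltham</Location>
    <Schedule>
      <Weekly>
        <DayOfWeek>Tuesday</DayOfWeek>
        <StartTime>1800</StartTime>
        <EndTime>2100</EndTime>
      </Weekly>
    </Schedule>
  </Offering>
</Course>"""

course = ET.fromstring(COURSE_XML)

# Attributes are read from the element that carries them.
catalog_id = course.get("CatalogId")
on_campus = course.find("Offering/Location").get("OnCampus")

# Element content is always text; the application converts it as needed.
title = course.findtext("Title").strip()
credits = int(course.findtext("Credits"))
day = course.findtext("Offering/Schedule/Weekly/DayOfWeek")

print(catalog_id, title, credits, day, on_campus)
```

The parser only hands back names and text. Deciding that Credits is a number or that OnCampus="Y" means "yes" is left to the application.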
XML: Key Goals
• Simplicity
 Strictly and simply structured
 Easy to get started ‘reading’ XML
 All features recognized by all XML-supporting tools & applications
• Compatibility
 Platform-independent
 No reliance on hardware endian-ness, etc.
 Support a wide variety of applications
 Easy to adapt to a number of problem domains, programming
environments, etc.
• Legibility
 Human-readable
 XML-literate person can look at an XML document and figure it out
XML in 10 Points
(http://www.w3c.org/XML/1999/XML-in-10-points)
• XML is for structuring data
 Structured data of just about anything you can think of
 Address books, database records, vector graphics, etc.
 A set of rules (or conventions or guidelines) for designing
text formats for structured data
 Extensible = you (or a group) decide on the meaning
 Not a programming language
 Makes it easy for computer programs to generate data, read
data, and ensure the data is not ambiguous
 Unicode-compliant
 Others have taken care of multi-language issues (e.g., 2-byte
characters)
XML in 10 points
• XML looks a bit like HTML
 XML and HTML each have elements and attributes
 … but HTML is not XML
 HTML specifies meaning of each element and each attribute
and how it will look in a browser
 Ex: <p> is a paragraph
 XML uses elements to delimit pieces of data. The
application decides what it means
 Ex: <p> is a ???
In an XML document, a single element name can mean
different things in different contexts!
XML in 10 points
• XML is text… but not meant to be read
 It allows the programmer to look at the data
 Especially helpful during design and debugging
 Don’t need a working program to look at the data
 It works around hardware endian differences
 XML parsing rules are strict
 No need for each application to determine whether a
data file is ‘broken’ (legally defined)
 No second-guessing a ‘broken’ file’s meaning
XML in 10 points
• XML is verbose by design
 Yes, typical XML data files are bigger than an
equivalent binary representation
 Advantages (see previous slide) are believed to
outweigh disadvantages
 Disk space is getting cheaper
 Compression can be good and fast
 Protocols (e.g., HTTP/1.1) can compress on the fly and
save bandwidth as needed
XML in 10 points
• XML is a family of technologies
 XML 1.0 is a specification that defines elements, attributes,
etc.
 Technologies based on XML 1.0 define a growing set of
modules, services, etc. for common tasks
 XLink – add hyperlinks to XML file
 XPointer – access to parts of an XML file
 XSLT – a transformation language for rearranging, adding,
deleting elements and attributes
 DOM – a standard set of function calls for accessing XML from
a programming language
 XML Schema – help developers precisely define the structures
of their own XML documents
 …
XML in 10 points
• XML is new… but not that new
 Development started in 1996
 A W3C Recommendation since February 1998
 SGML has been an ISO standard since 1986
 HTML started in 1990
 Best parts of SGML + lessons learned from the
HTML experience (the good and the bad) = XML
 Powerful, more regular, and simple to use and
understand
XML in 10 points
• XML leads HTML to XHTML
 XHTML is an important XML application (a
document format with a specific purpose):
 Most of the same elements as HTML
 Slight syntax changes to conform to XML rules
Results in
 Syntax that is correct (well-formed)
 Adds meaning (semantics) to the syntax
» Ex: <p> means paragraph
XML in 10 points
• XML is modular
 You can define a new document format by
combining and reusing existing formats
 Namespace mechanism
» Avoids confusion that can arise from use of same basic
name
 XML Schema
» Defines document structure
» Provides a mechanism for combining existing schemas
into a new schema
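How a namespace keeps two uses of the same basic name apart can be sketched as follows. This assumes Python's xml.etree.ElementTree; the two namespace URIs and the element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define <title>; the URIs are hypothetical.
DOC = """<catalog xmlns:bk="http://example.com/ns/book"
         xmlns:job="http://example.com/ns/job">
  <bk:title>XML Essentials</bk:title>
  <job:title>Technical Editor</job:title>
</catalog>"""

root = ET.fromstring(DOC)
ns = {"bk": "http://example.com/ns/book", "job": "http://example.com/ns/job"}

# The prefix selects which vocabulary's <title> is meant.
book_title = root.findtext("bk:title", namespaces=ns)
job_title = root.findtext("job:title", namespaces=ns)

# Internally the parser expands each prefix to its full namespace URI.
expanded_tags = [child.tag for child in root]
print(book_title, job_title, expanded_tags)
```

The prefix itself (bk, job) is only shorthand. The names differ because the URIs they map to differ.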
XML in 10 points
• XML is the basis for RDF and ‘the
semantic web’
 RDF (Resource Description Framework)
 XML format that supports resource description
and metadata applications
» Ex: music playlists, image collections, bibliographies
 RDF integrates applications and agents into
one semantic web
» Content and a description of the content (metacontent)
XML in 10 points
• XML is license-free, platform-independent,
and very well supported
 Define your own document structures
 Choose from a growing number of formats agreed upon by industry consortia
 Wide variety of tools
 Works on Linux, Unix, Windows, Mac, …
 Lets you focus on applications rather than
infrastructure!
Building Applications with XML
• XML = low-level syntax
for representing structured data
• Supports a wide variety
of applications
• Simple data representation
and organization model
reduces data incompatibility,
need for re-keying, etc.
 Database queries output in XML
 Transformed from XML
(using XSLT) into HTML
 Separate data from presentation
http://www.w3.org/XML/Activity
Other uses of XML
• Method invocation on remote servers through a
firewall
 SOAP: Simple Object Access Protocol
• Storing configuration and deployment data for
applications
 OS-independent formats for .ini, .config files
• Templates describing various fields and attributes of
business forms
 Again separating the meaning of the data on the form (e.g.,
name) from the layout and look of the form
XML Tools
• Text Editors & Browsers
• XML Parsers
• XSLT Processors
• XML Validation, etc.
 References:
» http://www.xml.com/buyersguide
» http://www.w3c.org/XML/Schema
» http://www.codenotes.com (CN: XM000101)
Text Editors & Browsers
• Since XML is text only, you can get
started with a simple text editor &
browser
 emacs, vi, Notepad, etc.
 Just edit a file and save it with the .xml
extension
 Netscape, Internet Explorer, and Opera
 Can process an XML file
 Useful for viewing or checking basic syntax
Text Editors & Browsers (part 2)
• Many more advanced editors are
available
 XML Spy: Popular, very complete XML editor,
validator, etc. with add-ons for XSLT, etc.
(Windows)
 Not free, but 30-day free download/trial
» http://www.xmlspy.com
 SoftQuad XMetaL (Windows)
 ChannelPoint Merlot (Any OS with Java)
 Tibco Turbo
 Microsoft XML Notepad
XML Parsers
• APIs used to read and navigate XML
files
 Most browsers have a simple one built in
 Two popular, free parsers
 Java Web Services Developer Pack 1.2
» Includes JAXP (Java APIs for XML Processing)
» http://java.sun.com/webservices/webservicespack.html
 Microsoft XML Core Services (MSXML v4.0 SP2)
» http://msdn.microsoft.com/xml (look for downloads)
XSLT Processors
• Transform XML to
 Another XML format
 A non-XML format
• Typical use: transform XML data to
presentation (UI)
• JAXP and MSXML have XSLT support
• SAXON: XSLT and XML SAX parser (v 6.5.2)
 http://saxon.sourceforge.net
XML Validation, etc.
• XSV
 Command line XML validator (for Win32)
 ftp://ftp.cogsci.ed.ac.uk/pub/XSV
 Web-based XML validator
 http://www.w3c.org/2001/03/webdata/xsv
XML Essentials (part 1)
• XML = a way of presenting structured
information in a text-based document
• A ‘meta-markup’ language
 It can be used to define other mark-up
grammars
 What’s legal syntax for a particular XML
application
XML: 2 Key Concepts
• Well-formed
 Basic, overall syntax rules for all XML
documents
• Valid
 Additional rules for an XML Application:
 A particular ‘family’ of XML documents
 Document structure conforms to a DTD or
an XML Schema (more later)
Basic Syntax: XML Documents
• Structuring or ‘marking up’ data
 Putting data in XML format…
 … in an XML document
• An XML document is all text
 Some text is ‘mark up’ data: providing the
structure
 Some text is ‘parsed character data’ (PCDATA):
providing the data values in the context of the
structure
Basic Syntax: XML Documents
(part 2)
• An XML document contains exactly
one top-level element (called the root element or
document element)…
 The root element's start-tag and end-tag enclose
all of the document's other elements and data
• … all other elements are nested inside
the root element
 Nesting can be as deep as necessary
Basic Syntax: XML Documents
(part 3)
• An XML Document may contain a
prolog…
 Prolog = text before the root element
 Not part of the structured data
 Typically, an XML declaration and/or
processing instructions
» references to grammar, etc.
 more later
Basic Syntax: XML Documents
(part 4)
• Basic XML document
<?xml version="1.0" encoding="UTF-8"
standalone="yes"?>
<root>
<tag>Parsed Character Data</tag>
</root>
Basic Syntax: Elements
• Elements
 Primary organizational mechanism in XML
 Containers that hold and organize
information
 XML has no pre-defined elements
 Each element must have a start-tag and
a matching end-tag (with leading slash)
 Start tag: <tag>
 End tag: </tag>
Basic Syntax: Elements (part 2)
• Inside the element’s start- and end-tag
 Any combination of character data and other
elements
• Mixed-content element: an element that contains
both other elements and text
• Empty-element: contains neither other elements
nor text
 Ex: <empty></empty> or <empty/>
Basic Syntax: Elements (part 3)
• Find the mixed-content element(s) and empty element(s)
<element1>Character data</element1>
<element2>
<tag1>Some more</tag1>
<tag2>character data</tag2>
</element2>
<element3></element3>
<element3>
Even <element4>more</element4> character data
</element3>
<element4/>
• Would this be ‘well-formed’ XML?
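One way to check an answer to the exercise above is to hand the text to a parser. A minimal sketch, assuming Python's xml.etree.ElementTree; the helper names (is_well_formed, classify) and the wrapping <root> element are illustrative additions, not part of the slide.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<element1>Character data</element1>
<element2>
  <tag1>Some more</tag1>
  <tag2>character data</tag2>
</element2>
<element3></element3>
<element3>
  Even <element4>more</element4> character data
</element3>
<element4/>"""

def is_well_formed(text):
    """Return True if the parser accepts the text as one XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def classify(el):
    """Label an element by what sits between its start- and end-tag."""
    has_children = len(el) > 0
    has_text = bool((el.text or "").strip()) or any(
        (child.tail or "").strip() for child in el
    )
    if has_children and has_text:
        return "mixed"
    if not has_children and not has_text:
        return "empty"
    return "elements only" if has_children else "text only"

# As written, the text has several top-level elements and no single root.
fragment_ok = is_well_formed(FRAGMENT)

# Wrapped in one (hypothetical) root element, the same text parses.
wrapped = ET.fromstring("<root>" + FRAGMENT + "</root>")
kinds = [(el.tag, classify(el)) for el in wrapped]
print(fragment_ok, kinds)
```

Whitespace used only for indentation is ignored by classify, which is a choice made for this sketch. A parser itself reports that whitespace as character data.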
Basic Syntax: Elements (part 4)
• Relationships between elements
 Root (or document) element
 Ancestor (parent, parent of parent, etc.)
 Parent
 Sibling
 Child
 Descendant (children, children of children, etc.)
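These relationships can be read straight off a parsed tree. A minimal sketch, assuming Python's xml.etree.ElementTree, which stores children but not parents, so a parent lookup table (parent_of, an illustrative name) is built first. The small document reuses element names from the course example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Course><Offering><Term>Fall 2003</Term>"
    "<Instructor>John Arnold</Instructor></Offering></Course>"
)

# Map every child element to its parent element.
parent_of = {child: parent for parent in root.iter() for child in parent}

term = root.find("Offering/Term")
parent = parent_of[term].tag
siblings = [el.tag for el in parent_of[term] if el is not term]
descendants = [el.tag for el in root.iter() if el is not root]

# Walk upward from Term to collect its ancestors, nearest first.
ancestors = []
node = term
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

print(parent, siblings, ancestors, descendants)
```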
Basic Syntax: Elements (part 5)
• Properties of element names
 Case sensitive: Element ≠ element
 Cannot contain spaces!
 Cannot start with letters ‘xml’ in any combination
of upper/lower case
 Reserved for use by the XML spec
 1st character must be a letter or underscore (“_”)
 ‘letter’ is broad definition. Not just English letter
 Can contain numerals (0-9), hyphen (“-”), and
period (“.”) in any position except the 1st character
 Colon (“:”) is reserved for namespace prefixes (more later)
Basic Syntax: Elements (part 6)
• Examples: Good or bad?
<RootElement>
<My Element>data</My Element>
<TagName>data</TagName>
<3rdRock>data</3rdRock>
<xMLKing>I have a dream</xMLKing>
</RootElement>
Basic Syntax: Elements (part 7)
• Examples: Good or bad?
<RootElement>                          ok
<My Element>data</My Element>          no space allowed
<TagName>data</TagName>                ok
<3rdRock>data</3rdRock>                no leading digit
<xMLKing>I have a dream</xMLKing>      use of XML reserved
</RootElement>
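The same names can be tried against a parser. A minimal sketch, assuming Python's xml.etree.ElementTree (which uses the expat parser); parses is an illustrative helper. Note one difference from the answers above: a name beginning with 'xml' is reserved by the specification, but this particular parser does not reject it, so treat the reservation as a rule to follow rather than one every tool enforces.

```python
import xml.etree.ElementTree as ET

def parses(fragment):
    """Return True if the fragment parses inside a <RootElement> wrapper."""
    try:
        ET.fromstring("<RootElement>" + fragment + "</RootElement>")
        return True
    except ET.ParseError:
        return False

results = {
    "TagName": parses("<TagName>data</TagName>"),
    "My Element": parses("<My Element>data</My Element>"),
    "3rdRock": parses("<3rdRock>data</3rdRock>"),
    # Reserved by the spec, yet accepted by this parser.
    "xMLKing": parses("<xMLKing>I have a dream</xMLKing>"),
}
print(results)
```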
Basic Syntax: Attributes
• Attribute
 A name/value pair listed in an element’s start-tag. The name is associated with the value by an equals sign (=)
 An element can have 0 or more attributes
 Each attribute must be unique for that element
 That is, the same attribute can’t appear twice in the
element’s start-tag
 Attribute value must be enclosed in ‘single’ or
“double” quotes
 Remember, attribute values are text, too!
 Attributes cannot appear in end-tag
Basic Syntax: Attributes (part 2)
• Examples
<examples>
<stock symbol="EMC" price="10.00">EMC Corporation</stock>
<auto year="2002" make="Toyota" model='Corolla'>
<color>Maroon</color>
<VIN>XXYYZZ123456789</VIN>
</auto>
<department cost_center="123">
<employees>
<employee badge="1234">John Doe</employee>
<employee badge="5678">Jane Smith</employee>
</employees>
</department>
</examples>
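Reading the attributes in these examples might look like the following. A minimal sketch, assuming Python's xml.etree.ElementTree; the "currency" attribute at the end is hypothetical and deliberately absent from the data.

```python
import xml.etree.ElementTree as ET

EXAMPLES = """<examples>
  <stock symbol="EMC" price="10.00">EMC Corporation</stock>
  <auto year="2002" make="Toyota" model='Corolla'>
    <color>Maroon</color>
    <VIN>XXYYZZ123456789</VIN>
  </auto>
  <department cost_center="123">
    <employees>
      <employee badge="1234">John Doe</employee>
      <employee badge="5678">Jane Smith</employee>
    </employees>
  </department>
</examples>"""

root = ET.fromstring(EXAMPLES)

stock = root.find("stock")
symbol = stock.get("symbol")
raw_price = stock.get("price")   # attribute values arrive as text
price = float(raw_price)         # the application does the conversion

auto_attrs = dict(root.find("auto").attrib)
badges = {el.get("badge"): el.text for el in root.iter("employee")}

# Asking for an attribute that is not present returns the given default.
currency = stock.get("currency", "n/a")
print(symbol, price, auto_attrs, badges, currency)
```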
Basic Syntax: Attributes (part 3)
• Properties of attribute names
 Like elements
 Case-sensitive
 Cannot contain spaces
 Cannot start with ‘xml’
 Must start with letter or underscore
Basic Syntax: Attributes (part 4)
• Why attributes?
 Often used to contain metadata about the element
or hold key values, but…
 … no firm rules
 One of the design decisions you will make is
whether, when, and how to use attributes.
 We’ll see some reasons for them when we talk
about DTDs. But, for now, we’ll just think about it.
Well-Formed XML
• To be considered an XML document, the document
must be well-formed:
 Syntactically correct
 If not well-formed, parsers will fail to read the document
 No almost correct…
 It’s well-formed or it’s not an XML document
“A data object is an XML document if it is well-formed, as defined
in this specification.”
• The well-formedness rules of XML are stricter
than those of HTML
Well-Formed XML (part 2)
• Every element must have a start-tag
and an end-tag
<elementX>any mix of markup and/or character
data</elementX>
Well-Formed XML (part 3)
• XML elements cannot overlap
 End-tag of an inner element must be
present before the end-tag of the parent
element
<parent>This is an outer element
<child>with a properly enclosed inner
element</child>.
</parent>
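What happens when elements do overlap can be seen by parsing both forms. A minimal sketch, assuming Python's xml.etree.ElementTree; the overlapping variant is made up for contrast, and check is an illustrative helper.

```python
import xml.etree.ElementTree as ET

NESTED = (
    "<parent>This is an outer element "
    "<child>with a properly enclosed inner element</child>."
    "</parent>"
)
# Hypothetical broken variant: </parent> arrives before </child>.
OVERLAPPING = (
    "<parent>This is an outer element "
    "<child>with an inner element that is never closed in time"
    "</parent></child>"
)

def check(text):
    """Report whether the parser accepts the text, with its error if not."""
    try:
        ET.fromstring(text)
        return "well-formed"
    except ET.ParseError as err:
        return "not well-formed: " + str(err)

nested_result = check(NESTED)
overlapping_result = check(OVERLAPPING)
print(nested_result)
print(overlapping_result)
```

The parser stops at the first violation and returns no partial tree. There is no "almost correct" result to work with.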
Well-Formed XML (part 4)
• Every XML document must have exactly one
root element (also called the document
element)
 No special name
 Can be any legal element name
<anyElementNameWeWant>
<data>some of our data</data>
<data>more data.</data>
<otherData>Other data with a different
element name</otherData>
</anyElementNameWeWant>
Well-Formed XML (part 5)
• Attributes
 Any specific attribute can appear only once for
any given element
 Can’t model 2 values of an attribute with the same
attribute appearing twice
 Attribute name is separated from the value with
an equals sign (=)
 Whitespace around the = is optional
 Attribute values must be enclosed in single or
double quotes and they must match
 No difference in meaning… the parsers won’t even tell
you which was used
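The attribute rules above can be exercised directly. A minimal sketch, assuming Python's xml.etree.ElementTree; the <stock> snippets are illustrative and well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    """Return True if the parser accepts the text as a document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Either quote style yields the same value; whitespace around = is optional.
single = ET.fromstring("<stock symbol='EMC'/>").get("symbol")
double = ET.fromstring('<stock symbol="EMC"/>').get("symbol")
spaced = ET.fromstring('<stock symbol = "EMC"/>').get("symbol")

# Each of these breaks one rule and is rejected.
checks = {
    "repeated attribute": well_formed('<stock symbol="EMC" symbol="IBM"/>'),
    "mismatched quotes": well_formed("<stock symbol=\"EMC'/>"),
    "unquoted value": well_formed("<stock symbol=EMC/>"),
}
print(single, double, spaced, checks)
```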
Well-Formed XML (part 6)
• Attributes (continued)
 So… how do you suppose we model an attribute
value that contains quotation marks?
 One way is to use the alternate quotation mark for
the value delimiter
 Ex:
character='Peter "Spider-Man" Parker'
or
character="Peter 'Spider-Man' Parker"
 Another way is to use an entity reference…
Well-Formed XML (part 7)
• Legal (and illegal) characters in
character data
 We’ve seen that < and > have a special
function in XML markup.
 Other characters that have special
function: & " '
 You can always get these characters to
appear in your data using an entity
reference
Well-Formed XML (part 8)
• Entity Reference
 Escape sequence for reserved characters
 General form:
&refname;
Reserved character    Entity reference
<                     &lt;
>                     &gt;
"                     &quot;
'                     &apos;
&                     &amp;
Well-Formed XML (part 9)
• Putting it all together in an example
<?xml version="1.0"?>
<question instruction='Press "ENTER" for the
answer . . .'>
<content>True or false:</content>
<content>6 &lt; 7</content>
</question>
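A parser turns entity references back into the characters they stand for, and a library can do the escaping when a document is written. A minimal sketch, assuming Python's xml.etree.ElementTree and xml.sax.saxutils; the question document is written here with straight quotes and an escaped less-than sign, and the strings passed to escape and quoteattr are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

QUESTION = """<?xml version="1.0"?>
<question instruction='Press "ENTER" for the answer . . .'>
  <content>True or false:</content>
  <content>6 &lt; 7</content>
</question>"""

question = ET.fromstring(QUESTION)
instruction = question.get("instruction")
contents = [c.text for c in question.findall("content")]

# All five predefined entity references decode to single characters.
decoded = ET.fromstring("<t>&lt;&gt;&quot;&apos;&amp;</t>").text

# A bare < in character data is rejected by the parser.
try:
    ET.fromstring("<content>6 < 7</content>")
    bare_accepted = True
except ET.ParseError:
    bare_accepted = False

# Going the other way: escape text and quote attribute values for output.
escaped = escape("6 < 7 & 8 > 7")
quoted = quoteattr('Press "ENTER"')
print(instruction, contents, decoded, bare_accepted, escaped, quoted)
```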
Other XML Syntax
• Some features added to the basic XML
syntax of elements and attributes to
provide a fully functional markup
language:
 XML declaration
 Processing Instructions
 Comments
 CDATA
XML declaration
• Identifies intent that a text file is
(supposed to be) an XML document
 Not strictly required… but a ‘best practice’
 If present, it must be the 1st line of the
prolog
 Before any comments, processing instructions,
and the root element
<?xml version="1.0" encoding="UTF-16"
standalone="yes"?>
XML declaration (part 2)
• Note that this is not an element
• version attribute is required
 Value must be 1.0 (until a new standard is released)
• encoding attribute is optional (default: UTF-8, or UTF-16
if the file starts with a byte-order mark)
 Describes how text in document is encoded
 Typical values are UTF-8 (Unicode Transformation Format with
8-bit code units; plain ASCII text is already valid UTF-8) or
UTF-16 (Unicode in 16-bit code units)
 An incorrect value can cause your document to be read (or
displayed) incorrectly
• standalone attribute is optional (default: no)
 Indicates whether the document relies on an external DTD
(more later) or not
 Really a hint since parsers decide what to do with this value
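The effect of a wrong encoding value can be reproduced in a few lines. A minimal sketch, assuming Python's xml.etree.ElementTree; the <city> document and the ISO-8859-1 encoding are illustrative choices, not taken from the slides.

```python
import xml.etree.ElementTree as ET

TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><city>Zürich</city>'

# Declaration matches the bytes: the text comes back intact.
latin1_doc = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")
city_ok = ET.fromstring(latin1_doc).text

# UTF-8 bytes mislabeled as ISO-8859-1: parses, but the text is garbled.
mislabeled = TEMPLATE.format(enc="ISO-8859-1").encode("utf-8")
city_garbled = ET.fromstring(mislabeled).text

# ISO-8859-1 bytes mislabeled as UTF-8: the parser rejects the document.
try:
    ET.fromstring(TEMPLATE.format(enc="UTF-8").encode("iso-8859-1"))
    rejected = False
except ET.ParseError:
    rejected = True

print(city_ok, city_garbled, rejected)
```

The second case is the more dangerous one because nothing fails. The document parses and the damage shows up only when someone reads the text.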
Processing Instructions
• An XML document may include processing
instructions
 Intent is that some application will read the
document and interpret these instructions as
some kind of command or guidance
 General form: <?target instructions?>
 Typically used to inform parser/browser that XML
document is associated with a particular CSS or
XSL file
 Triggers the transformation
Processing Instructions (part 2)
• Not restricted to prolog
 But can’t appear inside an element’s tag
• Example
<?xml-stylesheet type="text/css"
href="mysheet.css"?>
• More about these when we look at CSS
and XSLT
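An application that wants to act on a processing instruction first has to find it. A minimal sketch, assuming Python's xml.dom.minidom, which keeps processing instructions as nodes of the document; the <root> content is illustrative.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="mysheet.css"?>
<root><tag>data</tag></root>"""

doc = minidom.parseString(DOC)

# Collect (target, instructions) for each processing instruction in the prolog.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```

The parser reports the instructions as one uninterpreted string. Splitting it into type and href is up to whichever application recognizes the xml-stylesheet target.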
Comments
• Provide additional information about the
document’s contents
• Parser will ignore the comments
 Really exist only for human reader
 General form:
<!-- comment here -->
• Be very careful to close the comment!
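Both points can be checked with a parser: comments do not show up in the parsed data, and a comment that is never closed breaks the document. A minimal sketch, assuming Python's xml.etree.ElementTree with its default settings; the <note> document is made up.

```python
import xml.etree.ElementTree as ET

WITH_COMMENT = (
    "<note><!-- reviewer: check the date -->"
    "<date>2003-09-02</date></note>"
)
# Same document with the closing --> left off.
UNCLOSED = (
    "<note><!-- reviewer: check the date "
    "<date>2003-09-02</date></note>"
)

note = ET.fromstring(WITH_COMMENT)
child_tags = [child.tag for child in note]   # the comment is not among them
date = note.findtext("date")

try:
    ET.fromstring(UNCLOSED)
    unclosed_accepted = True
except ET.ParseError:
    unclosed_accepted = False

print(child_tags, date, unclosed_accepted)
```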
CDATA
• Forces text (including markup) to be treated
as character data
• Easiest way to handle element text that
contains a lot of illegal characters
 So you don’t have to use entity references for all
of them
• Can occur anywhere in the root element or its
children
• Cannot be nested
CDATA (part 2)
• General form:
<![CDATA[
your raw character data here
]]>
CDATA (part 3)
• Example:
<example>
This is some raw HTML: <![CDATA[<html>
<head><title>XML Course</title></head>
<body bgcolor="blue"><p>This is some
text.</p></body></html>]]>
</example>
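Parsing this example shows what "treated as character data" means in practice. A minimal sketch, assuming Python's xml.etree.ElementTree; the short <e> documents at the end are illustrative.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<example>
This is some raw HTML: <![CDATA[<html>
<head><title>XML Course</title></head>
<body bgcolor="blue"><p>This is some
text.</p></body></html>]]>
</example>"""

example = ET.fromstring(EXAMPLE)

# The HTML tags were not parsed as elements: <example> has no children.
child_count = len(example)
text = example.text

# A CDATA section and entity references are two spellings of the same data.
via_cdata = ET.fromstring("<e><![CDATA[a < b & c]]></e>").text
via_entities = ET.fromstring("<e>a &lt; b &amp; c</e>").text
print(child_count, via_cdata == via_entities)
print(text)
```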
Unit 1: Summary
• XML is a low-level syntax used to represent
structured data in text
• The basis for many technologies that build on
XML 1.0 to solve particular problems for
general or specific domains
• Platform-independent with broad vendor
support and a lot of tools
Unit 1: Summary (continued)
• Primary building blocks
 Elements
 Attributes – applied to element – appear in start-tag
• XML document must be well-formed:
 Exactly one root element (a.k.a. document element)
 Every element has a start-tag and an end-tag
 Element tags cannot overlap
 Attribute values must be enclosed in single or double quotes
 Reserved characters (< > & " ') need to be replaced with
entity references
For next unit…
• Class readings: see Syllabus
 Namespaces
 DTDs
 Validating XML documents
• Also:
 Visit www.w3c.org and see the breadth of technologies that
are related to XML
 Choose an XML parser, XSLT tool, etc.
 Install it
 Try it out on some XML examples to get comfortable with it