XML & Related Languages
Unit 1: Introduction & XML Essentials
© 2003 John E. Arnold All Rights Reserved

Introduction: XML
• XML = eXtensible Markup Language
    “… the universal format for structured documents and data on the Web.” (www.w3c.org/XML)
    “… simple, very flexible text format derived from SGML (ISO 8879).”
    Originally designed to meet the challenges of large-scale electronic publishing
    Increasingly important role in exchanging a wide variety of data on the Web (and off the Web)

XML: an example
    <?xml version="1.0" ?>
    <Course CatalogId="RSEG-0151-G1">
      <Title Nickname='XML Course'>XML and Related Languages</Title>
      <Credits>3</Credits>
      <Offering>
        <Term>Fall 2003</Term>
        <Instructor>John Arnold</Instructor>
        <Location OnCampus="Y">Waltham</Location>
        <Schedule>
          <Weekly>
            <DayOfWeek>Tuesday</DayOfWeek>
            <StartTime>1800</StartTime>
            <EndTime>2100</EndTime>
          </Weekly>
        </Schedule>
      </Offering>
    </Course>

XML: Key Goals
• Simplicity
    Strictly and simply structured
    Easy to get started ‘reading’ XML
    All features recognized by all XML-supporting tools & applications
• Compatibility
    Platform-independent
      No reliance on hardware endianness, etc.
    Supports a wide variety of applications
      Easy to adapt to a number of problem domains, programming environments, etc.
• Legibility
    Human-readable
      An XML-literate person can look at an XML document and figure it out

XML in 10 Points (http://www.w3c.org/XML/1999/XML-in-10-points)
• XML is for structuring data
    Structured data of just about anything you can think of
      Address books, database records, vector graphics, etc.
    A set of rules (or conventions or guidelines) for designing text formats for structured data
    Extensible = you (or a group) decide on the meaning
    Not a programming language
    Makes it easy for computer programs to generate data, read data, and ensure the data is not ambiguous
    Unicode-compliant
      Others have taken care of multi-language issues (e.g., 2-byte characters)

XML in 10 points
• XML looks a bit like HTML
    XML and HTML each have elements and attributes … but HTML is not XML
    HTML specifies the meaning of each element and each attribute and how it will look in a browser
      Ex: <p> is a paragraph
    XML uses elements to delimit pieces of data. The application decides what they mean
      Ex: <p> is a ???
    In an XML document, a single element name can mean different things in different contexts!

XML in 10 points
• XML is text… but not meant to be read
    It allows the programmer to look at the data
      Especially helpful during design and debugging
      Don’t need a working program to look at the data
      It works around hardware endian differences
    XML parsing rules are strict
      No need for each application to decide whether a data file is ‘broken’ (the spec defines this precisely)
      No second-guessing a ‘broken’ file’s meaning

XML in 10 points
• XML is verbose by design
    Yes, typical XML data files are bigger than an equivalent binary representation
    Advantages (see previous slide) are believed to outweigh disadvantages
      Disk space is getting cheaper
      Compression can be good and fast
      Protocols (e.g., HTTP/1.1) can compress on the fly and save bandwidth as needed
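The claim that compression offsets XML's verbosity can be checked with a small experiment. This is only a sketch: the employee records are hypothetical data made up for illustration, and Python's standard `zlib` module (DEFLATE, the algorithm behind gzip) stands in for whatever compression a real protocol or file system would apply.

```python
import zlib

# Build a deliberately repetitive XML document (hypothetical data).
rows = "".join(
    f'<employee badge="{n}"><name>Employee {n}</name></employee>'
    for n in range(1000)
)
xml_text = f'<?xml version="1.0"?><employees>{rows}</employees>'
raw = xml_text.encode("utf-8")

# Repeated tag names are exactly what DEFLATE compresses well.
compressed = zlib.compress(raw)

print(len(raw), "bytes as plain XML")
print(len(compressed), "bytes after compression")

# Compression is lossless: decompressing restores the original bytes.
restored = zlib.decompress(compressed)
```

The exact sizes depend on the data, so none are quoted here; the point is only that the repeated markup costs far less once compressed than its raw size suggests.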
XML in 10 points
• XML is a family of technologies
    XML 1.0 is a specification that defines elements, attributes, etc.
    Technologies based on XML 1.0 define a growing set of modules, services, etc. for common tasks
      XLink – add hyperlinks to an XML file
      XPointer – access to parts of an XML file
      XSLT – a transformation language for rearranging, adding, and deleting elements and attributes
      DOM – a standard set of function calls for accessing XML from a programming language
      XML Schema – helps developers precisely define the structures of their own XML documents
      …

XML in 10 points
• XML is new… but not that new
    Development started in 1996
    A W3C Recommendation since February 1998
    SGML has been an ISO standard since 1986
    HTML started in 1990
    Best parts of SGML + lessons learned from the HTML experience (the good and the bad) = XML
      Powerful, more regular, and simple to use and understand

XML in 10 points
• XML leads HTML to XHTML
    XHTML is an important XML application (a document format with a specific purpose):
      Most of the same elements as HTML
      Slight syntax changes to conform to XML rules
    Results in
      Syntax that is correct (well-formed)
      Adds meaning (semantics) to the syntax
        » Ex: <p> means paragraph

XML in 10 points
• XML is modular
    You can define a new document format by combining and reusing existing formats
      Namespace mechanism
        » Avoids confusion that can arise from use of the same basic name
      XML Schema
        » Defines document structure
        » Provides a mechanism for combining existing schemas into a new schema
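To make the namespace point concrete, here is a minimal sketch of two vocabularies that both define a `title` element. The namespace URIs and the `bk`/`hr` prefixes are hypothetical, chosen only for illustration; the parsing calls are from Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both use the name "title".
doc = """<?xml version="1.0"?>
<catalog xmlns:bk="http://example.com/ns/book"
         xmlns:hr="http://example.com/ns/person">
  <bk:title>XML and Related Languages</bk:title>
  <hr:title>Instructor</hr:title>
</catalog>"""

root = ET.fromstring(doc)

# Map prefixes to namespace URIs for lookups. The prefixes used here
# need not match those in the document; only the URIs matter.
ns = {
    "bk": "http://example.com/ns/book",
    "hr": "http://example.com/ns/person",
}
book_title = root.find("bk:title", ns).text
job_title = root.find("hr:title", ns).text
print(book_title)
print(job_title)

# ElementTree stores each qualified name as {namespace-URI}local-name.
first_tag = root[0].tag
print(first_tag)
```

Because each `title` is qualified by a different URI, an application can tell the two apart even though the basic name is the same.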
XML in 10 points
• XML is the basis for RDF and ‘the semantic web’
    RDF (Resource Description Framework)
      XML format that supports resource description and metadata applications
        » Ex: music playlists, image collections, bibliographies
    RDF integrates applications and agents into one semantic web
        » Content and a description of the content (metacontent)

XML in 10 points
• XML is license-free, platform-independent, and very well supported
    Define your own document structures
    Choose from a growing number of formats agreed upon by industry consortia
    Wide variety of tools
    Works on Linux, Unix, Windows, Mac, …
    Lets you focus on applications rather than infrastructure!

Building Applications with XML (http://www.w3.org/XML/Activity)
• XML = low-level syntax for representing structured data
• Supports a wide variety of applications
• Simple data representation and organization model reduces data incompatibility, need for re-keying, etc.
    Database queries output in XML
    Transformed from XML (using XSLT) into HTML
    Separate data from presentation

Other uses of XML
• Method invocation on remote servers through a firewall
    SOAP: Simple Object Access Protocol
• Storing configuration and deployment data for applications
    OS-independent formats for .ini, .config files
• Templates describing various fields and attributes of business forms
    Again separating the meaning of the data on the form (e.g., name) from the layout and look of the form

XML Tools
• Text Editors & Browsers
• XML Parsers
• XSLT Processors
• XML Validation, etc.
    References:
      » http://www.xml.com/buyersguide
      » http://www.w3c.org/XML/Schema
      » http://www.codenotes.com (CN: XM000101)

Text Editors & Browsers
• Since XML is text only, you can get started with a simple text editor & browser
    emacs, vi, Notepad, etc.
      Just edit a file and save it with the .xml extension
    Netscape, Internet Explorer, and Opera
      Can process an XML file
      Useful for viewing or checking basic syntax

Text Editors & Browsers (part 2)
• Many more advanced editors are available
    XML Spy: Popular, very complete XML editor, validator, etc. with add-ons for XSLT, etc. (Windows)
      Not free, but 30-day free download/trial
        » http://www.xmlspy.com
    SoftQuad XMetaL (Windows)
    ChannelPoint Merlot (Any OS with Java)
    Tibco Turbo
    Microsoft XML Notepad

XML Parsers
• APIs used to read and navigate XML files
    Most browsers have a simple one built in
    Two popular, free parsers
      Java Web Services Developer Pack 1.2
        » Includes JAXP (Java APIs for XML Processing)
        » http://java.sun.com/webservices/webservicespack.html
      Microsoft XML Core Services (MSXML v4.0 SP2)
        » http://msdn.microsoft.com/xml (look for downloads)

XSLT Processors
• Transform XML to
    Another XML format
    A non-XML format
• Typical use: transform XML data to presentation (UI)
• JAXP and MSXML have XSLT support
• SAXON: XSLT and XML SAX parser (v 6.5.2)
    http://saxon.sourceforge.net

XML Validation, etc.
• XSV
    Command line XML validator (for Win32)
      ftp://ftp.cogsci.ed.ac.uk/pub/XSV
    Web-based XML validator
      http://www.w3c.org/2001/03/webdata/xsv

XML Essentials (part 1)
• XML = a way of presenting structured information in a text-based document
• A ‘meta-markup’ language
    It can be used to define other mark-up grammars
      What’s legal syntax for a particular XML application

XML: 2 Key Concepts
• Well-formed
    Basic, overall syntax rules for all XML documents
• Valid
    Additional rules for an XML Application: a particular ‘family’ of XML documents
    Document structure conforms to a DTD or an XML Schema (more later)

Basic Syntax: XML Documents
• Structuring or ‘marking up’ data
    Putting data in XML format…
    … in an XML document
• An XML document is all text
    Some text is ‘mark up’ data: providing the structure
    Some text is ‘parsed character data’ (PCDATA): providing the data values in the context of the structure

Basic Syntax: XML Documents (part 2)
• An XML document contains exactly one top-level element (called the root element or document element)…
    The root element’s start-tag and end-tag appear at the beginning and end of the document
• … all other elements are nested inside the root element
    Nesting can be as deep as necessary

Basic Syntax: XML Documents (part 3)
• An XML Document may contain a prolog…
    Prolog = text before the root element
    Not part of the structured data
    Typically, an XML declaration and/or processing instructions
      » references to grammar, etc.
      » more later

Basic Syntax: XML Documents (part 4)
• Basic XML document
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <root>
      <tag>Parsed Character Data</tag>
    </root>

Basic Syntax: Elements
• Elements
    Primary organizational mechanism in XML
    Containers that hold and organize information
    XML has no pre-defined elements
    Each element must have a start-tag and a matching end-tag (with leading slash)
      Start tag: <tag>
      End tag: </tag>

Basic Syntax: Elements (part 2)
• Inside the element’s start- and end-tag
    Any combination of character data and other elements
• Mixed-content element: an element that contains both other elements and text
• Empty element: contains neither other elements nor text
    Ex: <empty></empty> or <empty/>

Basic Syntax: Elements (part 3)
• Find the mixed-content element(s) and empty element(s)
    <element1>Character data</element1>
    <element2>
      <tag1>Some more</tag1>
      <tag2>character data</tag2>
    </element2>
    <element3></element3>
    <element3>
      Even <element4>more</element4> character data
    </element3>
    <element4/>
• Would this be ‘well-formed’ XML?

Basic Syntax: Elements (part 4)
• Relationships between elements
    Root (or document) element
    Ancestor (parent, parent of parent, etc.)
    Parent
    Sibling
    Child
    Descendant (children, children of children, etc.)

Basic Syntax: Elements (part 5)
• Properties of element names
    Case sensitive: Element ≠ element
    Cannot contain spaces!
    Cannot start with the letters ‘xml’ in any combination of upper/lower case
      Reserved for use by the XML spec
    1st character must be a letter or underscore (“_”)
      ‘Letter’ is broadly defined: not just English letters
    Can contain numerals (0-9), hyphen (“-”), and period (“.”) in any position except the 1st character
    Colon (“:”) should be used only with namespaces (it separates a namespace prefix from the rest of the name)

Basic Syntax: Elements (part 6)
• Examples: Good or bad?
    <RootElement>
      <My Element>data</My Element>
      <TagName>data</TagName>
      <3rdRock>data</3rdRock>
      <xMLKing>I have a dream</xMLKing>
    </RootElement>

Basic Syntax: Elements (part 7)
• Examples: Good or bad? (answers)
    <RootElement>                            ok
      <My Element>data</My Element>          no space allowed
      <TagName>data</TagName>                ok
      <3rdRock>data</3rdRock>                no leading digit
      <xMLKing>I have a dream</xMLKing>      use of reserved ‘xml’ prefix
    </RootElement>

Basic Syntax: Attributes
• Attribute
    A name/value pair listed in an element’s start-tag
      The name is associated with its value by an equals sign (=)
    An element can have 0 or more attributes
    Each attribute name must be unique within that element
      That is, the same attribute can’t appear twice in the element’s start-tag
    Attribute value must be enclosed in 'single' or "double" quotes
      Remember, attribute values are text, too!
    Attributes cannot appear in the end-tag
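The attribute rules above can be observed directly with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the `stock` element is illustrative sample data, and the error text a parser reports for the duplicate case varies between parsers.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are interchangeable as value delimiters.
stock = ET.fromstring(
    """<stock symbol="EMC" price='10.00'>EMC Corporation</stock>"""
)

# Attributes arrive as name/value pairs, and every value is text.
attributes = dict(stock.attrib)
print(attributes)

# Because values are text, numeric use needs an explicit conversion.
price = float(stock.get("price"))

# The same attribute name twice in one start-tag is not well-formed,
# so the parser refuses the whole document.
duplicate = '<stock symbol="EMC" symbol="IBM"/>'
try:
    ET.fromstring(duplicate)
    duplicate_accepted = True
except ET.ParseError as err:
    duplicate_accepted = False
    print("rejected:", err)
```

Note that the parsed result does not record which kind of quote was used; after parsing, only the name and the text value remain.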
Basic Syntax: Attributes (part 2)
• Examples
    <examples>
      <stock symbol="EMC" price="10.00">EMC Corporation</stock>
      <auto year="2002" make="Toyota" model='Corolla'>
        <color>Maroon</color>
        <VIN>XXYYZZ123456789</VIN>
      </auto>
      <department cost_center="123">
        <employees>
          <employee badge="1234">John Doe</employee>
          <employee badge="5678">Jane Smith</employee>
        </employees>
      </department>
    </examples>

Basic Syntax: Attributes (part 3)
• Properties of attribute names
    Like elements
      Case-sensitive
      Cannot contain spaces
      Cannot start with ‘xml’
      Must start with a letter or underscore

Basic Syntax: Attributes (part 4)
• Why attributes?
    Often used to contain metadata about the element or hold key values, but…
    … no firm rules
    One of the design decisions you will make is whether, when, and how to use attributes. We’ll see some reasons for them when we talk about DTDs. But, for now, we’ll just think about it.

Well-Formed XML
• To be considered an XML document, the document must be well-formed:
    Syntactically correct
    If not well-formed, parsers will fail to read the document
    No ‘almost correct’… It’s well-formed or it’s not an XML document
      “A data object is an XML document if it is well-formed, as defined in this specification.”
• XML’s well-formedness rules are stricter than HTML’s syntax rules

Well-Formed XML (part 2)
• Every element must have a start-tag and an end-tag
    <elementX>any mix of markup and/or character data</elementX>
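The "well-formed or not an XML document" rule is easy to try out. This is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` as the parser; `is_well_formed` is a hypothetical helper name introduced here, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample inputs, one per rule covered so far.
samples = {
    "matched tags": "<elementX>some <b>markup</b> and data</elementX>",
    "missing end-tag": "<elementX>some data",
    "case mismatch": "<Element>data</element>",
    "two root elements": "<a/><b/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted. There is no partial result for the others: the parser reports an error instead of guessing what the author meant.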
Well-Formed XML (part 3)
• XML elements cannot overlap
    End-tag of an inner element must be present before the end-tag of the parent element
    <parent>This is an outer element
      <child>with a properly enclosed inner element</child>.
    </parent>

Well-Formed XML (part 4)
• Every XML document must have exactly one root element (also called the document element)
    No special name
    Can be any legal element name
    <anyElementNameWeWant>
      <data>some of our data</data>
      <data>more data.</data>
      <otherData>Other data with a different element name</otherData>
    </anyElementNameWeWant>

Well-Formed XML (part 5)
• Attributes
    Any specific attribute can appear only once for any given element
      Can’t model 2 values of an attribute with the same attribute appearing twice
    Attribute name is separated from the value with an equals sign (=)
      Whitespace around the = is optional
    Attribute values must be enclosed in single or double quotes and the opening and closing quotes must match
      No difference in meaning… the parsers won’t even tell you which was used

Well-Formed XML (part 6)
• Attributes (continued)
    So… how do you suppose we model an attribute value that contains quotation marks?
    One way is to use the alternate quotation mark for the value delimiter
      Ex: character='Peter "Spider-Man" Parker' or character="Peter 'Spider-Man' Parker"
    Another way is to use an entity reference…

Well-Formed XML (part 7)
• Legal (and illegal) characters in character data
    We’ve seen that < and > have a special function in XML markup.
    Other characters that have a special function: & " '
    You can always get these characters to appear in your data using an entity reference

Well-Formed XML (part 8)
• Entity Reference
    Escape sequence for reserved characters
    General form: &refname;

    Reserved character    Entity reference
    <                     &lt;
    >                     &gt;
    "                     &quot;
    '                     &apos;
    &                     &amp;

Well-Formed XML (part 9)
• Putting it all together in an example
    <?xml version="1.0"?>
    <question instruction='Press "ENTER" for the answer . . .'>
      <content>True or false:</content>
      <content>6 &lt; 7</content>
    </question>

Other XML Syntax
• Some features added to the basic XML syntax of elements and attributes to provide a fully functional markup language:
    XML declaration
    Processing Instructions
    Comments
    CDATA

XML declaration
• Identifies intent that a text file is (supposed to be) an XML document
    Not strictly required… but a ‘best practice’
    If present, it must be the very first thing in the document
      Before any comments, processing instructions, and the root element
    <?xml version="1.0" encoding="UTF-16" standalone="yes"?>
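The entity references and quoting rules from the preceding slides rarely need to be applied by hand. As a sketch, Python's standard `xml.sax.saxutils` module provides `escape` for character data and `quoteattr` for attribute values; the sample strings below are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, < and > in character data.
text = "6 < 7 & 7 > 6"
escaped = escape(text)
print(escaped)  # 6 &lt; 7 &amp; 7 &gt; 6

# quoteattr() picks a delimiter that does not clash with the value:
# the value contains double quotes, so single quotes are used.
instruction = 'Press "ENTER" for the answer'
attr = quoteattr(instruction)
print(attr)  # 'Press "ENTER" for the answer'

# Build a document from the escaped pieces and parse it back.
doc = f"<question instruction={attr}><content>{escaped}</content></question>"
root = ET.fromstring(doc)

# The parser resolves the entity references, so the application
# sees the original characters again.
round_trip_text = root.find("content").text
round_trip_attr = root.get("instruction")
```

Escaping matters only in the serialized text. Once the document is parsed, the values are ordinary strings again.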
XML declaration (part 2)
• Note that this is not an element
• version attribute is required
    Value must be 1.0 (until a new standard is released)
• encoding attribute is optional (default: UTF-8)
    Describes how text in the document is encoded
    Typical values are UTF-8 (8-bit Unicode Transformation Format, ASCII-compatible) or UTF-16 (16-bit Unicode Transformation Format)
    An incorrect value can cause your document to be read (or displayed) incorrectly
• standalone attribute is optional (default: no)
    Indicates whether the document relies on an external DTD (more later) or not
    Really a hint, since parsers decide what to do with this value

Processing Instructions
• An XML document may include processing instructions
    Intent is that some application will read the document and interpret these instructions as some kind of command or guidance
    General form: <?target instructions?>
    Typically used to inform a parser/browser that the XML document is associated with a particular CSS or XSL file
      Triggers the transformation

Processing Instructions (part 2)
• Not restricted to the prolog
    But can’t appear inside an element’s tag
• Example
    <?xml-stylesheet type="text/css" href="mysheet.css"?>
• More about these when we look at CSS and XSLT

Comments
• Provide additional information about the document’s contents
• Parser will ignore the comments
    Really exist only for the human reader
    General form: <!-- comment here -->
• Be very careful to close the comment!
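To see how a parser reports the declaration, processing instructions, and comments, the sketch below walks the top level of a small document with Python's standard `xml.dom.minidom`. The comment text and the `Course` content are illustrative; `mysheet.css` is the file name from the slide example.

```python
from xml.dom import Node, minidom

doc_text = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="mysheet.css"?>
<!-- hypothetical course listing -->
<Course><Title>XML and Related Languages</Title></Course>"""

doc = minidom.parseString(doc_text)

# Classify each node that sits directly under the document.
kinds = []
for node in doc.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target, node.data))
    elif node.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", node.data.strip()))
    elif node.nodeType == Node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

for item in kinds:
    print(item)
```

The XML declaration does not show up as a node, which matches the point that it is neither an element nor an ordinary processing instruction. This DOM implementation keeps comments available to the program; whether an application does anything with them is its own choice.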
CDATA
• Forces text (including markup) to be treated as character data
• Easiest way to handle element text that contains a lot of illegal characters
    So you don’t have to use entity references for all of them
• Can occur anywhere in the root element or its children
• Cannot be nested

CDATA (part 2)
• General form:
    <![CDATA[ your raw character data here ]]>

CDATA (part 3)
• Example:
    <example>
      This is some raw HTML:
      <![CDATA[<html>
        <head><title>XML Course</title></head>
        <body bgcolor="blue"><p>This is some text.</p></body></html>]]>
    </example>

Unit 1: Summary
• XML is a low-level syntax used to represent structured data in text
• The basis for many technologies that build on XML 1.0 to solve particular problems for general or specific domains
• Platform-independent with broad vendor support and a lot of tools

Unit 1: Summary (continued)
• Primary building blocks
    Elements
    Attributes – applied to an element – appear in the start-tag
• XML document must be well-formed:
    Exactly one root element (a.k.a. document element)
    Every element has a start-tag and an end-tag
    Element tags cannot overlap
    Attribute values must be enclosed in single or double quotes
    Reserved characters (< > & " ') need to be replaced with entity references

For next unit…
• Class readings: see Syllabus
    Namespaces
    DTDs
    Validating XML documents
• Also:
    Visit www.w3c.org and see the breadth of technologies that are related to XML
    Choose an XML parser, XSLT tool, etc.
    Install it
    Try it out on some XML examples to get comfortable with it
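As one possible starting point for that exercise, the sketch below reads the course example from the start of this unit. It assumes Python's built-in `xml.etree.ElementTree` as the chosen parser; any of the parsers listed under XML Tools would serve the same purpose through a different API.

```python
import xml.etree.ElementTree as ET

course_xml = """<?xml version="1.0" ?>
<Course CatalogId="RSEG-0151-G1">
  <Title Nickname='XML Course'>XML and Related Languages</Title>
  <Credits>3</Credits>
  <Offering>
    <Term>Fall 2003</Term>
    <Instructor>John Arnold</Instructor>
    <Location OnCampus="Y">Waltham</Location>
    <Schedule>
      <Weekly>
        <DayOfWeek>Tuesday</DayOfWeek>
        <StartTime>1800</StartTime>
        <EndTime>2100</EndTime>
      </Weekly>
    </Schedule>
  </Offering>
</Course>"""

# The root (document) element and one of its attributes.
course = ET.fromstring(course_xml)
print(course.tag, course.get("CatalogId"))

# A child element: its character data and its attribute.
title = course.find("Title")
print(title.text, "/", title.get("Nickname"))

# Element content is text, so numbers need explicit conversion.
credit_count = int(course.findtext("Credits"))

# Descendants at any depth can be reached by name.
for weekly in course.iter("Weekly"):
    day = weekly.findtext("DayOfWeek")
    start = weekly.findtext("StartTime")
    end = weekly.findtext("EndTime")
    print(day, start, end)
```

Editing the sample to break one of the well-formedness rules (for example, deleting an end-tag) and running it again is a quick way to see how the chosen parser reports errors.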