XML Refresher Course Bálint Joó School of Physics University of Edinburgh

May 02, 2003
Contents
XML Documents
Basic Structure
Parsing via SAX
Document Object Model (DOM)
Basic Tree Representation
DOM Node Types
DOM Notes
Conclusions
XML Documents
Begin with the prologue (XML declaration):
<?xml version="1.0"?>
A single root element follows, containing nested tags:
<foo>
<bar>
Some stuff
</bar>
<iHaveNoData/>
</foo>
Element Structure
Elements have a name:
<myName>
Must occur as pair of opening/closing tags
possibly containing data:
<myName> Data </myName>
or as empty tags (no data):
<myName/>
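As a minimal sketch of the two element forms above, the following Python snippet parses one element with data and one empty element using the standard library's xml.etree.ElementTree. The wrapper element <root> and the helper describe are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# <myName> comes from the slide; the <root> wrapper is a hypothetical
# container so that both forms can live in one well-formed document.
doc = "<root><myName> Data </myName><myName/></root>"
root = ET.fromstring(doc)

def describe(elem):
    """Return (name, data); data is None when the element holds no text."""
    text = elem.text.strip() if elem.text and elem.text.strip() else None
    return (elem.tag, text)

elements = [describe(child) for child in root]
print(elements)  # [('myName', 'Data'), ('myName', None)]
```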
Attributes
Elements can have one or more attributes
Attributes are name/value pairs
<myName attributeName="attributeValue"/>
Attributes are simple: they have no sub-tags
Some attributes have a special purpose (e.g. a declaration):
<myName xmlns:bj="http://www.ph.ed.ac.uk"/>
declares the namespace prefix bj
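A short sketch of reading attributes as name/value pairs, again with Python's xml.etree.ElementTree. The second attribute (other) and the fallback string are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# attributeName/attributeValue come from the slide; "other" is hypothetical.
elem = ET.fromstring('<myName attributeName="attributeValue" other="42"/>')

attrs = dict(elem.attrib)                        # all name/value pairs
value = elem.get("attributeName")                # look up one attribute
missing = elem.get("noSuchAttribute", "default") # fallback when absent

print(attrs)
print(value, missing)
```

Attribute values are always plain strings here; any conversion (for example to a number) is left to the caller.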
Namespaces
Allow reuse of tag names for different purposes
Consist of a prefix and a URI
Declared with an xmlns attribute:
<foo xmlns:prefix="http://www.uri.org/foo">
Tags/attributes from the namespace are prefixed:
<prefix:foo>
<foo prefix:bar="fred"/>
In some cases, attribute values may be prefixed
Namespaces in QCDML
Suppose the Metadata Working Group can't agree
on a convention for the parameter β: both
UKQCD and SciDAC want to use the name
beta, but with different meanings.
Define namespaces: sciDac and ukqcd
Can then have tags:
<sciDac:beta>
<ukqcd:beta>
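The two beta tags can coexist in one document, as the following sketch suggests. It uses Python's xml.etree.ElementTree; the namespace URIs, the enclosing <params> element, and the numeric values are all hypothetical placeholders, not part of any QCDML convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical URIs: the slides name the prefixes but not the URIs.
NS = {
    "sciDac": "http://example.org/scidac",
    "ukqcd": "http://example.org/ukqcd",
}

# <params> and the two values are made up for illustration.
doc = """<params xmlns:sciDac="http://example.org/scidac"
                 xmlns:ukqcd="http://example.org/ukqcd">
  <sciDac:beta>5.7</sciDac:beta>
  <ukqcd:beta>2.13</ukqcd:beta>
</params>"""

root = ET.fromstring(doc)

# The prefix map lets each beta be looked up unambiguously.
scidac_beta = root.find("sciDac:beta", NS).text
ukqcd_beta = root.find("ukqcd:beta", NS).text

# Internally the parser keys each tag on the URI, not on the prefix.
tags = [child.tag for child in root]
print(scidac_beta, ukqcd_beta)
print(tags)
```

The prefix is only shorthand: what distinguishes the two tags is the URI each prefix is bound to.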
Parsing XML via SAX
SAX - Simple API for XML
Treats the XML document as a "program"
SAX parsers provide hooks to let the user write
an "interpreter" for the "program"
Generally fast, with small memory footprint
BUT: writing interpreters is potentially
burdensome / problem specific
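To make the "interpreter" idea concrete, here is a minimal sketch using Python's standard xml.sax module. The handler class EventRecorder and its fields are hypothetical names; a real application would replace the event log with problem-specific logic, which is where the burden mentioned above comes from.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: records each parser callback with its depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening (or empty) tag.
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        # Called by the parser for every closing tag.
        self.depth -= 1
        self.events.append(("end", name, self.depth))

    def characters(self, content):
        # May be called several times for one run of text.
        if content.strip():
            self.events.append(("text", content.strip(), self.depth))

handler = EventRecorder()
xml.sax.parseString(b"<foo><bar>Some stuff</bar><iHaveNoData/></foo>", handler)

for event in handler.events:
    print(event)
```

No tree is ever built: the parser streams through the document and the handler keeps only what it chooses to, which is why the memory footprint stays small.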
Document Object Model (DOM)
DOM specifies a Dynamic Interface to XML
documents
Tree based representation
Various APIs for accessing the representation
Traversing
searching
creating/updating
We consider here the tree representation only
(as it is closely related to XPath)
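The three kinds of access listed above (traversing, searching, creating/updating) can be sketched with Python's xml.dom.minidom, one of several DOM implementations. The sample document and the new element name baz are assumptions chosen for this illustration.

```python
from xml.dom import minidom

# Sample document made up for this sketch.
doc = minidom.parseString("<foo><fred/><bar>Some stuff</bar></foo>")
root = doc.documentElement

# Traversing: walk the children of the root in document order.
child_names = [node.nodeName for node in root.childNodes]

# Searching: find elements by tag name anywhere in the tree.
bar = doc.getElementsByTagName("bar")[0]
bar_text = bar.firstChild.data

# Creating/updating: build a new element and attach it to the tree.
new_elem = doc.createElement("baz")  # "baz" is a hypothetical name
new_elem.setAttribute("added", "yes")
root.appendChild(new_elem)

result = root.toxml()
print(child_names, bar_text)
print(result)
```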
DOM Trees
[Figure: DOM tree for a small document beginning <?xml version="1.0"?>. The Document node holds a root link to the root node <foo>. Nodes are connected by parent/child links and by next/previous sibling links (sibling = brother/sister). Example nodes shown: <foo>, <fred/>, <bar/>.]
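The links named in the tree picture (parent, child, next and previous sibling) map directly onto node properties in most DOM implementations. A minimal sketch with Python's xml.dom.minidom follows; the sample document, in which <fred/> and <bar/> are both children of <foo>, is an assumption made for this example.

```python
from xml.dom import minidom

# Assumed layout: <fred/> and <bar/> are siblings under the root <foo>.
doc = minidom.parseString("<foo><fred/><bar/></foo>")

root = doc.documentElement   # root link from the Document node
fred = root.firstChild       # child link
bar = fred.nextSibling       # next-sibling link

links = {
    "root parent is document": root.parentNode is doc,
    "fred parent is root": fred.parentNode is root,
    "bar previous sibling is fred": bar.previousSibling is fred,
    "bar has no next sibling": bar.nextSibling is None,
}
for name, holds in links.items():
    print(name, holds)
```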
DOM Nodes
There are several types of Node. Most useful:
Document
Element
Corresponds to <foo>...</foo> or <foo/>
Attribute
The attribute in <foo attribute="value">
Text
The data in <foo> data </foo>
The value in <foo att="value">
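As a sketch of how these node types show up in practice, the snippet below inspects the nodeType of each node using Python's xml.dom.minidom. The sample document combines the slide's two fragments into one element; the attribute value is read through the value property of the attribute node.

```python
from xml.dom import minidom, Node

# One element carrying both an attribute and text data.
doc = minidom.parseString('<foo att="value"> data </foo>')

foo = doc.documentElement
attr = foo.getAttributeNode("att")
text = foo.firstChild

node_types = {
    "document": doc.nodeType == Node.DOCUMENT_NODE,
    "element": foo.nodeType == Node.ELEMENT_NODE,
    "attribute": attr.nodeType == Node.ATTRIBUTE_NODE,
    "text": text.nodeType == Node.TEXT_NODE,
}

print(node_types)
print(repr(text.data), repr(attr.value))
```

Note that the text node keeps its surrounding whitespace; trimming it is up to the caller.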
DOM Notes
DOM Preserves Document order
(parent/child, previous/next sibling links)
Getting Documents into DOM is easy
Using libxml:
xmlDocPtr doc = xmlParseFile("foo.xml");  /* declared in <libxml/parser.h> */
Many free DOM parsers exist even for C/C++
Apache Xerces, libxml
Difficulty shifts to extracting data from DOM
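To illustrate where the difficulty ends up, here is a sketch of the hand-written walking code needed to pull one value out of a DOM tree, using Python's xml.dom.minidom rather than libxml. The helper text_at and its path argument are hypothetical, not part of any DOM API.

```python
from xml.dom import minidom, Node

def text_at(node, path):
    """Hypothetical helper: follow child element names, return the text found."""
    for name in path:
        matches = [child for child in node.childNodes
                   if child.nodeType == Node.ELEMENT_NODE
                   and child.nodeName == name]
        if not matches:
            return None          # path does not exist in this document
        node = matches[0]        # take the first match at each level
    pieces = [child.data for child in node.childNodes
              if child.nodeType == Node.TEXT_NODE]
    return "".join(pieces).strip()

doc = minidom.parseString("<foo><bar> Some stuff </bar><iHaveNoData/></foo>")

value = text_at(doc, ["foo", "bar"])
empty = text_at(doc, ["foo", "iHaveNoData"])
missing = text_at(doc, ["foo", "nope"])
print(repr(value), repr(empty), repr(missing))
```

Loading the document is one line; every query needs this kind of node-type filtering and error handling unless a higher-level query interface is layered on top.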
Conclusions
This talk provided a basic introduction to XML
document structure
Discussed DOM representation of XML
Highlighted need to define Easy To Use API to
query DOM objects
What does Easy To Use mean?
What is Easy To Parse?
Stay Tuned for Part 2...