XML Refresher Course

Bálint Joó
School of Physics, University of Edinburgh
May 02, 2003

Contents

- XML Documents
  - Basic Structure
  - Parsing via SAX
- Document Object Model (DOM)
  - Basic Tree Representation
  - DOM Node Types
  - DOM Notes
- Conclusions

XML Documents

- Begin with a prologue:
    <?xml version="1.0"?>
- A sequence of tags follows:
    <foo>
      <bar> Some stuff </bar>
    </foo>
    <iHaveNoData/>

Element Structure

- Elements have a name: <myName>
- Must occur as a pair of opening/closing tags, possibly containing data:
    <myName> Data </myName>
- or as empty tags (no data):
    <myName/>

Attributes

- Elements can have one or more attributes
- Attributes are name/value pairs:
    <myName attributeName="attributeValue"/>
- Attributes are simple: they have no sub-tags
- Attributes may have a purpose (e.g. declaration):
    <myName xmlns:bj="http://www.ph.ed.ac.uk"/>
  declares namespace bj

Namespaces

- Allow reuse of tag names for different purposes
- Consist of a prefix and a URI
- Declared with the xmlns attribute:
    <foo xmlns:prefix="http://www.uri.org/foo">
- Tags/attributes from a namespace are prefixed:
    <prefix:foo>
    <foo prefix:bar="fred"/>
- In some cases, attribute values may be prefixed

Namespaces in QCDML

- Suppose the Metadata Working Group can't agree on a convention for the parameter β, but both UKQCD and SciDAC want to use the name beta with different meanings.
- Define namespaces: sciDac and ukqcd
- Can then have tags:
    <sciDac:beta>
    <ukqcd:beta>

Parsing XML via SAX

- SAX - Simple API for XML
- Treats the XML document as a "program"
- SAX parsers provide hooks to let the user write an "interpreter" for the "program"
- Generally fast, with a small memory footprint
- BUT: writing interpreters is potentially burdensome / problem specific

Document Object Model (DOM)

- DOM specifies a dynamic interface to XML documents
- Tree-based representation
- Various APIs for accessing the representation:
  - traversing
  - searching
  - creating/updating
- We consider here the tree representation only (as it is closely related to XPath)

DOM Trees

[Figure: DOM tree diagram. The Document node has a root link to the root node <foo>. Nodes are joined by parent, child, next-sibling and previous-sibling (brother/sister) links. Markup shown in the figure: <?xml version="1.0"?>, <foo>, <fred/>, </foo>, <bar/>.]

DOM Nodes

There are several types of node. Most useful:

- Document
- Element: corresponds to <foo>...</foo> or <foo/>
- Attribute: the attribute in <foo attribute="value">
- Text: the data in <foo> data </foo>; the value in <foo att="value">

DOM Notes

- DOM preserves document order (parent/child, previous/next sibling links)
- Getting documents into DOM is easy
  - Using libxml: doc = xmlParseFile("foo.xml");
- Many free DOM parsers exist, even for C/C++
  - Apache Xerces, libxml
- Difficulty shifts to extracting data from DOM

Conclusions

- This talk provided a basic introduction to XML document structure
- Discussed the DOM representation of XML
- Highlighted the need to define an Easy To Use API to query DOM objects
  - What does Easy To Use mean?
  - What is Easy To Parse?
- Stay tuned for Part 2...
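
Illustration (not part of the original slides): the contrast between the SAX "interpreter" style and walking a DOM tree can be sketched with Python's standard library, used here in place of libxml or Xerces. The sample document, the example.org namespace URIs, and the names TagRecorder, sax_element_names and dom_child_elements are all assumptions made for this sketch.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document. Tag names follow the slides;
# the namespace URIs are invented for illustration only.
XML = """<?xml version="1.0"?>
<foo xmlns:ukqcd="http://example.org/ukqcd"
     xmlns:sciDac="http://example.org/scidac">
  <ukqcd:beta>5.2</ukqcd:beta>
  <sciDac:beta>6.0</sciDac:beta>
  <fred/>
</foo>"""


class TagRecorder(xml.sax.ContentHandler):
    """SAX style: the parser calls our hooks as it reads the document."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # Called once per opening (or empty) tag, in document order.
        self.names.append(name)


def sax_element_names(text):
    """Return the qualified names of all elements, collected via SAX hooks."""
    handler = TagRecorder()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def dom_child_elements(text):
    """DOM style: build the whole tree, then follow child/sibling links."""
    doc = minidom.parseString(text)      # Document node
    root = doc.documentElement           # root Element node
    children = []
    node = root.firstChild
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE:
            # The data inside an element lives in a child Text node.
            first = node.firstChild
            data = first.data if first is not None else None
            children.append((node.namespaceURI, node.localName, data))
        node = node.nextSibling          # next sibling, in document order
    return root.tagName, children


if __name__ == "__main__":
    print(sax_element_names(XML))
    print(dom_child_elements(XML))
```

In the SAX version all state has to be carried by the handler object, which is what makes such interpreters problem specific. In the DOM version the two beta elements share a local name but are told apart by their namespace URI, and the loop has to skip the whitespace Text nodes between elements, a small example of the difficulty shifting to extracting data from the DOM.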