Getting Data out of XML Documents

Bálint Joó
School of Physics
University of Edinburgh
May 02, 2003

Contents
  - In search of a simple API for accessing DOM
  - The multiple tag problem
      - What is it?
      - Is it a problem for us?
      - How can we get around it?
  - XPath
  - What is easy to parse?
  - Software: XPathReader package
  - Conclusions

Motivation (Starting Points)
  - Lack of free data-binding tools for C/C++
  - Desire to read ILDG Metadata documents, marshal application data
    => Have to write our own tools
  - Would like simple API to get at document data
  - Would like same API to cope with ILDG metadata AND application data.
  - We got as far as reading into a DOM.

Start With Simple Idea
  - Consider simple API with functions
      push(tagname)             -- select tag with name tagname
      pop()                     -- move up a level
      getType(tagname, result)
      Type = string | float | double | int | bool;
  - Equivalent API: directory-like structure with no absolute paths:
      cd(tagname) = push(tagname), cd(..) = pop()
  - Simple data: no attributes, no namespaces, no empty elements.

Example

    Open("file.xml");
    push("foo");
    string bar;
    getString("bar", bar);
    double fred;
    getDouble("fred", fred);
    pop();

    <?xml version="1.0"?>
    <foo>
      <bar>String</bar>
      <fred>5.0</fred>
    </foo>

So far so good - nice and simple
  - Current UKQCD Schema has no attributes/namespaces
  - Empty tags serve no purpose except as placeholders
  - BUT soon we encounter...

The Multiple Tag Problem
  - Consider following snippet:

    <size>
      <axis>
        <dimension> 1 </dimension>
        <length>16</length>
      </axis>
      <axis>
        <dimension> 2 </dimension>
        <length>16 </length>
      </axis>
    </size>

  - Let's try our API:
      push("size");
  - But what does
      push("axis");
    do?
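The ambiguity can be made concrete with a small sketch. The C++ below is not the talk's implementation: Node, LocalReader, buildSizeDocument and demo are hypothetical names, and a hand-built in-memory tree stands in for a real DOM. Under those assumptions it shows a strictly local push/pop reader that refuses to guess when a tag name is repeated, next to an indexed push that selects the i-th occurrence in document order.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory element standing in for a DOM node.
struct Node {
    std::string name;
    std::string text;
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;

    Node* add(const std::string& childName, const std::string& childText = "") {
        children.push_back(std::make_unique<Node>());
        Node* child = children.back().get();
        child->name = childName;
        child->text = childText;
        child->parent = this;
        return child;
    }
};

// Builds the <size> snippet from the slide under a document root node.
std::unique_ptr<Node> buildSizeDocument() {
    auto doc = std::make_unique<Node>();
    Node* size = doc->add("size");
    for (const char* dim : {"1", "2"}) {
        Node* axis = size->add("axis");
        axis->add("dimension", dim);
        axis->add("length", "16");
    }
    return doc;
}

// Hypothetical local reader: it only ever looks at children of the current node.
class LocalReader {
public:
    explicit LocalReader(Node* root) : current_(root) {}

    // Number of children of the current node with this tag name.
    std::size_t count(const std::string& tag) const {
        std::size_t n = 0;
        for (const auto& c : current_->children)
            if (c->name == tag) ++n;
        return n;
    }

    // Strict push: refuses to guess when the tag is missing or repeated.
    void push(const std::string& tag) {
        if (count(tag) != 1)
            throw std::runtime_error("push(" + tag + "): tag is missing or ambiguous");
        push(tag, 1);
    }

    // Indexed push: selects the i-th occurrence (1-based, document order).
    void push(const std::string& tag, std::size_t occurrence) {
        std::size_t seen = 0;
        for (const auto& c : current_->children) {
            if (c->name == tag && ++seen == occurrence) {
                current_ = c.get();
                return;
            }
        }
        throw std::out_of_range("push: no such occurrence of " + tag);
    }

    void pop() {
        if (current_->parent == nullptr) throw std::logic_error("pop: already at root");
        current_ = current_->parent;
    }

    // Reads the text of a uniquely named child of the current node.
    std::string getString(const std::string& tag) {
        push(tag);
        std::string value = current_->text;
        pop();
        return value;
    }

private:
    Node* current_;
};

// Walks through the scenario from the slide.
void demo() {
    auto doc = buildSizeDocument();
    LocalReader reader(doc.get());
    reader.push("size");                 // unique tag: fine
    bool refused = false;
    try {
        reader.push("axis");             // two <axis> children: ambiguous
    } catch (const std::runtime_error&) {
        refused = true;
    }
    assert(refused);
    reader.push("axis", 2);              // occurrence index resolves it
    std::cout << "dimension of selected axis: "
              << reader.getString("dimension") << "\n";
}
```

Calling demo() from a main() function runs the scenario on the <size> snippet: push("size") succeeds, the plain push("axis") is rejected as ambiguous, and push("axis", 2) selects the second axis. Note that even the indexed version cannot select an axis by the content of its <dimension> tag without opening the axis first.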
Multiple Tag Problem (cont'd)
  - push("axis") could select in document order
  - We could add an index to push("axis"):
      push("axis", 1)
      push("axis", 2)
  - We could add an index attribute to <axis>:
      <axis index="1">
      <axis index="2">
    But then we'd need a mechanism to match the index attribute
  - We could change the names of axis: <axis1> <axis2>
  - We could put the different <axis> into different namespaces -- effectively the same as adding an attribute
  - We could try and match the <dimension> tag.

The consequences
  - Changing tag names for simplicity of parsing just seems wrong
  - Matching the <dimension> tag is not possible without first selecting an <axis> in our scheme (locality)
  - Adding attributes/namespaces complicates the API. This use of different namespaces would be philosophically wrong.
  - Adding an order-of-occurrence index into the API is cleanest
      - No need to change Schema, instance documents etc.
      - Document ordering removes random access capability

In General
  - For less simple (more general) XML documents duplicate tags can be distinguished by:
      - Occurrence order
      - Name
      - Attributes
      - Content
      - Namespace
  - An ideal, simple API should allow matching on all of these to interrogate any XML document.

What about Locality?
  - push(namespace, tagname, attributes, occurrence)
  - getType(ns, tagname, attributes, occurrence, result)
  - But NO local parser can match on element content:
      - need to open a tag based on value of content
      - BUT can't get to content without opening tag.

    <size>
      <num_dimensions>2</num_dimensions>
      <axis>
        <dimension>2</dimension>
        <length>16</length>
      </axis>
      <axis>
        <dimension>1</dimension>
        <length>16</length>
      </axis>
    </size>

  - Document order may not help here: the axes appear in a different order, yet the Schema is still satisfied.
  - Would like to match on the <dimension> tag
  - Need to abandon locality

Lesson
  - In order to avoid ambiguity we must
      - Restrict the form of markup we deal with
          - Force decisions onto our Schema writers
      - OR complicate our API:
          - rely on tag ordering (either implicitly or explicitly)
          - introduce attributes (forcing decision on Schema writers)
          - give up locality in the API

Global Queries: XPath
  - Would like a nice way to encode
      - tag name
      - attributes
      - order of occurrence
      - attribute/content matching predicates
  - Can this be done? YES! Using XPath

XPath Axes
  - Attribute axis: @
  - Parent axis: ..
  - Child axis: ./
  - Preceding-sibling axis (no compact selector)
  - Following-sibling axis (no compact selector)
  - XPath axes specify coordinates for DOM.
  - Some axes can include more than one node:
      - ancestors: parent and all its ancestors

XPath Selectors
    tagname   selects all children of current node called tagname
    *         selects all children of current node
    @name     selects all attribute nodes called name
    @*        selects all attribute nodes of current node
    name[i]   selects the i-th occurrence of child node called name
    ..        selects parent of current node
    //name    selects name with any set of ancestors

XPath Examples
  XPath Query: /

    <?xml version="1.0"?>
    <size>
      <axis>
        <dimension> 1 </dimension>
        <length>16</length>
      </axis>
      <axis>
        <dimension> 2 </dimension>
        <length>16 </length>
      </axis>
    </size>

  Selection: the document root

XPath Examples
  XPath Query: /size
  (same document as in the previous example)
  Selection: the <size> element

XPath Examples
  XPath Query: /size/axis OR /size/* OR //axis
  (same document as in the previous example)
  Selection: both <axis> elements

XPath Examples
  XPath Query:
    /size/axis[2]                 query on order of occurrence
    OR
    /size/axis[dimension="2"]     query on element content

    <?xml version="1.0"?>
    <size>
      <axis>
        <dimension> 1 </dimension>
        <length>16</length>
      </axis>
      <axis>
        <dimension>2</dimension>
        <length>16 </length>
      </axis>
    </size>

  Selection: the second <axis> element

XPath Examples
  XPath Query: /size/bj:axis     (namespaces are supported)

    <?xml version="1.0"?>
    <size xmlns:bj="http://fred.org">
      <bj:axis>
        <dimension> 1 </dimension>
        <length>16</length>
      </bj:axis>
      <axis index="2">
        <dimension> 2 </dimension>
        <length>16 </length>
      </axis>
    </size>

  Selection: the <bj:axis> element

XPath Examples
  XPath Query: /size/axis[@index="2"]     (attribute matching)
  (same document as in the previous example)
  Selection: the <axis index="2"> element

  Visit: http://www.zvon.org/xxl/XPathTutorial for more ...

XPath Notes
  - Can return sets of nodes - not just a unique node
  - Has more features:
      - Functions to turn query results into strings, numbers, booleans
  - Encodes all features we need
  - C/C++ linkable XPath processors exist:
      - Xerces, Xalan, libxml
  - Solves all our reader API problems in a nice way.
XPath Based Reader API
  - Basic functions:
      open(file/stream);
      getType(xpath_string, result);
      getAttributeType(xpath_string, attributeName, result);
  - Semantics: the xpath_string must identify a unique node.

What is Easy to Parse?
  - Stylistic discussion on the Metadata mailing list.
  - One particular question: "How should we mark up things?"

  Chris' Way:

    <size>
      <dimensions>4</dimensions>
      <axis>
        <name>X</name>
        <length>16</length>
      </axis>
      <axis>
        <name>Y</name>
        <length>16 </length>
      </axis>
    </size>

  Tomoteru's Way:

    <size>
      <x value="16"/>
      <y value="16"/>
      <z value="16"/>
      <t value="32"/>
    </size>

  - Known as the "Element vs. Attribute" debate in the XML world.

What is Easy to Parse?
  - One claim is that the attribute way is perhaps easier to parse.
  - With XPath, both ways are easy to parse.
  - To get the length of the x dimension:

  Chris' Way:
      number(//size/axis[normalize-space(string(name))="X"]/length)
      getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);

  Tomoteru's Way:
      number(//size/x/@value)
      getIntAttribute("//size/x", "value", intValue);

  - Chris' Way has the more complex query, but an equally simple API call.

Element vs. Attribute Debate (aside)
  - Looked on the Web
  - Tomoteru's way is preferred in general by object modellers (e.g. database people)
      - Mark up most "atomic" data as attributes
      - Use tags to indicate "table structure"
  - Chris' way is perhaps preferred by archivists or librarians (Go Kim!)
  - Decide for yourself, a discussion is available at:
      http://www.oasis-open.org/cover/elementsAndAttrs.html
  - Found no universally accepted best practice.

Software: XPathReader
  - Wrote software to implement the XPath Reader API in C++
  - Wraps around the free libxml2 (C) library
  - Uses overloading and templating
  - Two classes:
      - BasicXPathReader: use XPath to get at basic C++ types (ints, std::strings, etc.)
      - XPathReader: allows reading of complex numbers and arrays.
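How overloading and templating can keep a single getXPath call for every basic type is sketched below. This is an illustration only, not the XPathReader source: MockXPathReader, addResult, uniqueText and exampleReadLength are hypothetical names, and a lookup table keyed by the query string stands in for a real XPath engine such as libxml2. The sketch enforces the stated semantics that a query must identify a unique node.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an XPath engine: maps a query string to the text
// of every node it selects. A real reader would evaluate the query instead.
class MockXPathReader {
public:
    // Registers one node's text as a result of the given query.
    void addResult(const std::string& xpath, const std::string& nodeText) {
        results_.emplace(xpath, nodeText);
    }

    // Number of nodes selected by the query.
    int countXPath(const std::string& xpath) const {
        return static_cast<int>(results_.count(xpath));
    }

    // Generic version: parse the unique node's text with operator>>.
    template <typename T>
    void getXPath(const std::string& xpath, T& result) const {
        std::istringstream in(uniqueText(xpath));
        if (!(in >> result))
            throw std::runtime_error("cannot convert content of " + xpath);
    }

    // Overload for strings: return the whole text, embedded spaces included.
    void getXPath(const std::string& xpath, std::string& result) const {
        result = uniqueText(xpath);
    }

private:
    // Enforces the "must identify a unique node" rule.
    const std::string& uniqueText(const std::string& xpath) const {
        if (results_.count(xpath) != 1)
            throw std::runtime_error(xpath + " does not identify a unique node");
        return results_.find(xpath)->second;
    }

    std::multimap<std::string, std::string> results_;
};

// Example use: the same call reads an int because T is deduced from the result.
int exampleReadLength() {
    MockXPathReader reader;
    reader.addResult("/size/axis[2]/length", "16 ");
    int length = 0;
    reader.getXPath("/size/axis[2]/length", length);
    return length;
}
```

The caller never names the type: the compiler picks the template for numeric results and the non-template overload for std::string, so new result types can be supported by adding overloads without changing existing calls.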
XPathReader Class Public Members
  - open/close functions:
      void open(istream& is);
      void close(void);
  - count results of XPath query:
      int countXPath(const string& xpath_query);
  - get value of attribute from node identified by XPath:
      template <typename T>
      void getXPathAttribute(const string& xpath_to_node,
                             const string& attribute_name,
                             T& result);
  - get value of node identified by XPath:
      template <typename T>
      void getXPath(const string& xpath, T& result);

Complex Numbers and Arrays
  - XPathReader library provides classes for complex numbers and arrays:
      template<typename T> class TComplex { ... };
      template<typename T> class Array { ... };
  - Can have complex numbers of arrays, e.g. for storing real/imaginary parts of arrays:
      TComplex< Array< double > >
  - Can also have Complex-es templated on string-s
      - Mathematically not sensible...

Complex Number Markup & Marshal
  - Invented simple markup:

    <foo>
      <cmpx>
        <re>real part</re>
        <im>imag part</im>
      </cmpx>
    </foo>

  - Can maintain the API through C++ function overloading and recursion:

    template <typename T>
    void getXPath(const string& path, TComplex<T>& result)
    {
      // Each part is read by whichever getXPath overload matches T.
      getXPath(path + "/cmpx/re", result.real());
      getXPath(path + "/cmpx/im", result.imag());
    }

  - Similar but slightly more involved for Array.

Array Markup
  - Arrays were marked up as follows:

    <foo>
      <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
        <size>N</size>
        <el idx="x"> element[0] </el>
        <el idx="x+1"> element[1] </el>
        ...
        <el idx="x+N-1"> element[N-1] </el>
      </array>
    </foo>

  - This is a general markup -- suitable for local parsers too

Array Mark-Up Example

    <size>
      <array sizeName="num_dimensions" elemName="axis"
             indexName="dimension" indexStart="1">
        <num_dimensions>4</num_dimensions>
        <axis dimension="1">
          ...
        </axis>
        <axis dimension="2">
          ...
        </axis>
        ...
      </array>
    </size>

  - Minimally invasive
      - Insert <array> </array> tags
      - Copy <dimension> tag to attribute
  - Easy to implement with an XSL transformation
  - Working group needn't amend the current metadata schema for it.

Conclusions
  - Discussed API issues for parsing XML without full "data binding" tools.
  - Discussed the repeated tag problem
  - Concluded that XPath is a simple and elegant way to solve the problem - hopefully convinced you too.
  - Discussed C++ implementation of an XPathReader API
  - Discussed how to parse compound data types
  - Described markup for complex numbers and arrays
  - Suggest Complex and Array markup be standardised by the Metadata Working Group (but not necessarily that it be used in metadata documents) - to assist sharing of data.

References/Links
  - XML, DOM, XPath: http://www.w3.org
  - Tutorials (XPath/XSLT): http://www.zvon.org
  - libxml2: http://www.xmlsoft.org
  - Elements vs. Attributes (and other discussions): http://www.oasis-open.org/cover/elementsAndAttrs.html
  - XPathReader software:
      - send email to me: bj@ph.ed.ac.uk
      - SciDAC CVS repository at JLAB (xpath_reader)
  - SciDAC: http://www.lqcd.org