Getting Data out of XML Documents
Bálint Joó
School of Physics
University of Edinburgh
May 02, 2003
Contents
In search of a simple API for accessing DOM
The multiple tag problem
What is it?
Is it a problem for us?
How can we get around it?
XPath
What is easy to parse?
Software: XPathReader package
Conclusions
Motivation (Starting Points)
Lack of free data-binding tools for C/C++
Desire to read ILDG Metadata documents,
marshal application data
=> Have to write our own tools
Would like simple API to get at document data
Would like same API to cope with ILDG
metadata AND application data.
We got as far as reading into a DOM.
Start With Simple Idea
Consider simple API with functions
push(tagname) -- select tag with name tagname
pop() -- move up a level
getType( tagname , result )
Type = string | float | double | int | bool;
Equivalent API: directory like structure with no
absolute paths:
cd(tagname) = push(tagname) , cd(..) = pop()
Simple Data: No Attributes, No Namespaces, No
Empty Elements.
Example
Open("file.xml");
push("foo");
string bar; getString("bar", bar);
double fred;
getDouble("fred", fred);
pop();
<?xml version="1.0"?>
<foo>
<bar>String</bar>
<fred>5.0</fred>
</foo>
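The push/pop idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the document is held in a hand-built toy tree rather than parsed from file.xml, and the names Element, LocalReader and makeExampleDocument are hypothetical, not part of any package described here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Toy element: a name, text content and child elements. No attributes and
// no namespaces, matching the "simple data" restrictions.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Hypothetical local reader: keeps a stack of the currently open elements.
// The tree passed in must outlive the reader, since only pointers are kept.
class LocalReader {
public:
    explicit LocalReader(const Element& root) { stack_.push_back(&root); }

    // Select the first child called tagname and descend into it.
    void push(const std::string& tagname) { stack_.push_back(&find(tagname)); }

    // Move back up one level; the document root cannot be popped.
    void pop() {
        if (stack_.size() <= 1) throw std::runtime_error("pop() at document root");
        stack_.pop_back();
    }

    void getString(const std::string& tagname, std::string& result) const {
        result = find(tagname).text;
    }

    void getDouble(const std::string& tagname, double& result) const {
        result = std::stod(find(tagname).text);
    }

private:
    // Only looks at direct children of the current element (locality).
    const Element& find(const std::string& tagname) const {
        for (const Element& child : stack_.back()->children)
            if (child.name == tagname) return child;
        throw std::runtime_error("no child element <" + tagname + ">");
    }

    std::vector<const Element*> stack_;
};

// Builds the tree for the <foo> document; the unnamed top-level Element
// stands for the document root and has <foo> as its only child.
Element makeExampleDocument() {
    Element bar{"bar", "String", {}};
    Element fred{"fred", "5.0", {}};
    Element foo{"foo", "", {bar, fred}};
    return Element{"", "", {foo}};
}
```

Note that find() returns the first match only, which is exactly where repeated tags start to cause trouble.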
So far so good - nice and simple
Current UKQCD Schema has no attributes/namespaces
Empty tags serve no purpose except as placeholders
BUT Soon we encounter...
The Multiple Tag Problem
Consider following snippet:
<size>
<axis>
<dimension> 1 </dimension>
<length>16</length>
</axis>
<axis>
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Let's try our API: push("size");
But what does push("axis"); do?
Multiple Tag Problem (cont'd)
push("axis") could select in document order
We could add an index to push("axis")
push("axis", 1) push("axis", 2)
We could add an index attribute to <axis>
<axis index="1"> <axis index="2">
But then we'd need a mechanism to match index attribute
We could change the names of axis:
<axis1> <axis2>
We could put the different <axis> into different
namespaces -- effectively same as adding attribute
We could try and match the <dimension> tag.
The consequences
Changing tagnames for simplicity of parsing just seems
wrong
Matching the <dimension> tag is not possible without
first selecting an <axis> in our scheme (locality)
Adding attributes/namespaces complicates API.
This use of different namespaces would be
philosophically wrong.
Adding an order-of-occurrence index to the API is cleanest
No need to change Schema, Instance documents etc.
Document ordering removes random access capability
In General
For less simple (more general) XML documents
duplicate tags can be distinguished by:
Occurrence Order
Name
Attributes
Content
Namespace
An ideal, simple API should allow matching on
all of these to interrogate any XML document.
What about Locality ?
push(namespace, tagname, attributes, occurrence)
getType(ns, tagname, attributes, occurrence, result)
But NO local parser can match on element content.
need to open a tag based on value of content
BUT can't get to content without opening tag.
<size>
<num_dimensions>2</num_dimensions>
<axis>
<dimension>2</dimension>
<length>16</length>
</axis>
<axis>
<dimension>1</dimension>
<length>16</length>
</axis>
</size>
Document order may not help here
Schema document still satisfied.
Would like to match on <dimension> tag
Need to abandon locality
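The difference between selecting by occurrence and selecting by content can be sketched as follows. This is a minimal illustration on a hand-built toy tree; Element, nthChild, childWhere, makeAxis and makeSize are hypothetical names introduced only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Local selection: the n-th (1-based) child called tagname, in document order.
const Element& nthChild(const Element& parent, const std::string& tagname,
                        std::size_t n) {
    std::size_t seen = 0;
    for (const Element& child : parent.children)
        if (child.name == tagname && ++seen == n) return child;
    throw std::runtime_error("no such occurrence of <" + tagname + ">");
}

// Non-local selection: the first child called tagname that contains a child
// keyTag whose text equals keyValue. It has to look inside every candidate
// before choosing one, which is the step a purely local reader cannot do.
const Element& childWhere(const Element& parent, const std::string& tagname,
                          const std::string& keyTag,
                          const std::string& keyValue) {
    for (const Element& child : parent.children) {
        if (child.name != tagname) continue;
        for (const Element& grandchild : child.children)
            if (grandchild.name == keyTag && grandchild.text == keyValue)
                return child;
    }
    throw std::runtime_error("no <" + tagname + "> with matching <" + keyTag + ">");
}

Element makeAxis(const std::string& dimension, const std::string& length) {
    return Element{"axis", "", {Element{"dimension", dimension, {}},
                                Element{"length", length, {}}}};
}

// The <size> element from the snippet: dimension 2 is listed first.
Element makeSize() {
    return Element{"size", "", {Element{"num_dimensions", "2", {}},
                                makeAxis("2", "16"), makeAxis("1", "16")}};
}
```

With this document, nthChild(size, "axis", 1) returns the axis for dimension 2, while childWhere(size, "axis", "dimension", "1") returns the axis the caller presumably wanted.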
Lesson
In order to avoid ambiguity we must
Restrict the form of markup we deal with
Force decisions onto our Schema writers
OR complicate our API
rely on tag ordering (either implicitly or explicitly)
introduce attributes (forcing decision on Schema writers)
give up locality in the API
Global Queries: XPath
Would like a nice way to encode
tag name
attributes
order of occurrence
attribute/content matching predicates
Can this be done?
YES! Using XPath
XPath Axes
Attribute axis: @
Parent axis: ..
Child axis: ./
Preceding sibling axis (no compact selector)
Following sibling axis (no compact selector)
XPath axes specify coordinates for the DOM, relative to the current node.
Some axes can include more than one node:
ancestors: parent and all its ancestors
XPath Selectors
tagname -- selects all children of current node called tagname
* -- selects all children of current node
@name -- selects all attribute nodes called name
@* -- selects all attribute nodes of current node
name[i] -- selects the i-th occurrence of child node called name
.. -- selects parent of current node
//name -- selects name with any set of ancestors
XPath Examples
XPath Query:
/
<?xml version="1.0"?>
<size>
<axis>
<dimension> 1 </dimension>
<length>16</length>
</axis>
<axis>
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Selection: the document root (the whole document)
XPath Examples (cont'd)
XPath Query:
/size
<?xml version="1.0"?>
<size>
<axis>
<dimension> 1 </dimension>
<length>16</length>
</axis>
<axis>
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Selection: the <size> element
XPath Examples (cont'd)
XPath Query:
/size/axis
OR
/size/*
OR
//axis
<?xml version="1.0"?>
<size>
<axis>
<dimension> 1 </dimension>
<length>16</length>
</axis>
<axis>
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Selection: both <axis> elements
XPath Examples (cont'd)
XPath Query:
/size/axis[2] -- query on order of occurrence
OR
/size/axis[dimension="2"] -- query on element content
<?xml version="1.0"?>
<size>
<axis>
<dimension> 1 </dimension>
<length>16</length>
</axis>
<axis>
<dimension>2</dimension>
<length>16 </length>
</axis>
</size>
Selection: the second <axis> element
XPath Examples (cont'd)
XPath Query:
/size/bj:axis -- namespaces are supported
<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
<bj:axis>
<dimension> 1 </dimension>
<length>16</length>
</bj:axis>
<axis index="2">
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Selection: the <bj:axis> element
XPath Examples (cont'd)
XPath Query:
/size/axis[@index="2"] -- attribute matching
<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
<bj:axis>
<dimension> 1 </dimension>
<length>16</length>
</bj:axis>
<axis index="2">
<dimension> 2 </dimension>
<length>16 </length>
</axis>
</size>
Selection: the <axis index="2"> element
Visit http://www.zvon.org/xxl/XPathTutorial for more.
XPath Notes
Can return sets of nodes - not just unique node
Has more features:
Functions to turn query results into strings, numbers,
booleans
Encodes all features we need
C/C++ linkable XPath Processors exist
Xerces, Xalan, libxml
Solves all our reader API problems in nice way.
XPath Based Reader API
Basic Functions:
open(file/stream);
getType(xpath_string, result);
getAttributeType(xpath_string,
attributeName,
result);
Semantics:
The xpath_string must identify a unique node.
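The "unique node" rule can be illustrated with a tiny evaluator for a small subset of XPath. This is a sketch only: it handles child steps with at most one position predicate and one content predicate, takes the query already split into steps instead of parsing an XPath string, and works on a hand-built toy tree. Element, Step, evaluate, selectUnique and makeSizeDocument are hypothetical names; a real reader would hand the query string to an XPath processor.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// One pre-parsed child-axis step, e.g. axis, axis[2] or axis[dimension="2"].
struct Step {
    std::string name;        // element name to match
    std::size_t position;    // 1-based position predicate; 0 means none
    std::string childName;   // content predicate: child name; empty means none
    std::string childValue;  // text that child must have
};

bool hasChildWithText(const Element& e, const std::string& name,
                      const std::string& value) {
    for (const Element& child : e.children)
        if (child.name == name && child.text == value) return true;
    return false;
}

// Applies the steps in turn. The result is a node set, which may hold zero,
// one or many nodes. Positions are counted per context node.
std::vector<const Element*> evaluate(const Element& root,
                                     const std::vector<Step>& steps) {
    std::vector<const Element*> current{&root};
    for (const Step& step : steps) {
        std::vector<const Element*> next;
        for (const Element* context : current) {
            std::size_t seen = 0;
            for (const Element& child : context->children) {
                if (child.name != step.name) continue;
                ++seen;
                if (step.position != 0 && seen != step.position) continue;
                if (!step.childName.empty() &&
                    !hasChildWithText(child, step.childName, step.childValue))
                    continue;
                next.push_back(&child);
            }
        }
        current = next;
    }
    return current;
}

// Reader semantics: the query must identify exactly one node.
const Element& selectUnique(const Element& root, const std::vector<Step>& steps) {
    std::vector<const Element*> nodes = evaluate(root, steps);
    if (nodes.size() != 1)
        throw std::runtime_error("query must select exactly one node");
    return *nodes.front();
}

// Document root holding <size> with two <axis> children (dimensions 1 and 2).
Element makeSizeDocument() {
    Element axis1{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}};
    Element axis2{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}};
    Element size{"size", "", {axis1, axis2}};
    return Element{"", "", {size}};
}
```

In this sketch /size/axis yields two nodes and is rejected by selectUnique, while /size/axis[2] and /size/axis[dimension="2"] each yield one.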
What is Easy to Parse?
Stylistic discussion on Metadata Mailing list.
One particular question:
"How should we mark up things?"
Chris' Way:
<size>
<dimensions>4</dimensions>
<axis>
<name>X</name>
<length>16</length>
</axis>
<axis>
<name>Y</name>
<length>16 </length>
</axis>
</size>
Tomoteru's Way:
<size>
<x value="16"/>
<y value="16"/>
<z value="16"/>
<t value="32"/>
</size>
Known as the "Element vs. Attribute"
debate in the XML world.
What is Easy to Parse?
One statement is that the attribute way is perhaps easier
to parse?
With XPath, both ways are easy to parse.
To get the length of the x dimension:
Chris' Way:
number(//size/axis[normalize-space(string(name))="X"]/length)
getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);
Tomoteru's Way:
number(//size/x/@value)
getIntAttribute("//size/x", "value", intValue);
Chris' Way needs a more complex query, but the API call is
equally simple.
Element vs. Attribute Debate (aside)
Looked on Web
Tomoteru's way is preferred in general by object
modellers (eg. database people)
Mark up most "atomic" data as attributes
Use tags to indicate "table structure"
Chris' way is perhaps preferred by archivists or
librarians (Go Kim!)
Decide for yourself, a discussion is available at:
http://www.oasis-open.org/cover/elementsAndAttrs.html
Found no universally accepted best practice.
Software: XPathReader
Wrote software to implement XPath Reader API
in C++
Wraps around free libxml2 (C) library
Uses overloading and templating
Two Classes:
BasicXPathReader:
Use XPath to get at basic C++ types (ints, std::strings, etc)
XPathReader
Allows reading of Complex Numbers and Arrays.
XPathReader Class Public Members
open/close functions:
void open(istream& is); void close(void);
count results of XPath Query:
int countXPath(const string& xpath_query);
get value of attribute from node identified by XPath:
template <typename T>
void getXPathAttribute(const string& xpath_to_node,
const string& attribute_name,
T& result);
get value of node identified by XPath
template <typename T>
void getXPath(const string& xpath, T& result);
Complex Numbers and Arrays
XPathReader Library provides Classes for
Complex Numbers and Arrays:
template<typename T> class TComplex { ... };
template<typename T> class Array { ... };
Can have Complex numbers of arrays
Eg for storing real/imaginary parts of arrays:
TComplex< Array< double > >
Can also have TComplex templated on strings
Mathematically not sensible...
Complex Number Markup & Marshal
Invented simple mark up:
<foo>
<cmpx>
<re>real part</re>
<im>imag part</im>
</cmpx>
</foo>
can maintain API through C++ function overloading
and recursion:
template <typename T>
void getXPath(const string& path, TComplex<T>& result) {
getXPath( path+"/cmpx/re", result.real() );
getXPath( path+"/cmpx/im", result.imag() );
}
similar but slightly more involved for Array.
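The overloading trick can be tried out without an XML library. In the sketch below the document is replaced by a lookup table from XPath string to node text (FakeDocument, a hypothetical stand-in), so getXPath takes an extra doc argument that the real API does not have; the TComplex shown is a minimal assumption, not the library's class.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for a parsed document: maps an XPath string to the text of the
// unique node it selects. A real reader would evaluate the query instead.
using FakeDocument = std::map<std::string, std::string>;

// Minimal complex type; real() and imag() return references so that the
// reader can write straight into the two parts.
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }
private:
    T re_{};
    T im_{};
};

// Base case: convert the node text into a basic type via stream extraction.
template <typename T>
void getXPath(const FakeDocument& doc, const std::string& path, T& result) {
    FakeDocument::const_iterator it = doc.find(path);
    if (it == doc.end()) throw std::runtime_error("no node matches " + path);
    std::istringstream in(it->second);
    if (!(in >> result))
        throw std::runtime_error("cannot convert text at " + path);
}

// Overload for complex values: extend the path and recurse. The compiler
// prefers this more specialised overload whenever result is a TComplex.
template <typename T>
void getXPath(const FakeDocument& doc, const std::string& path,
              TComplex<T>& result) {
    getXPath(doc, path + "/cmpx/re", result.real());
    getXPath(doc, path + "/cmpx/im", result.imag());
}
```

The caller writes getXPath(doc, "/foo", z) for a TComplex<double> z exactly as it would for a plain double, which is the point of keeping a single function name.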
Array Markup
Arrays were marked up as follows:
<foo>
<array sizeName="size" elemName="el" indexName="idx"
indexStart="x">
<size>N</size>
<el idx="x"> element[0] </el>
<el idx="x+1"> element[1] </el>
...
<el idx="x+N-1"> element[N-1] </el>
</array>
</foo>
This is a general mark up -- suitable for local parsers too
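One way a reader might walk this markup is sketched below. It assumes that indexStart and the idx values are integers, so that element i carries index indexStart + i, and it again replaces the document by a lookup table from XPath string to text. FakeDocument, lookup, readArray and makeArrayExample are hypothetical names; this is not the XPathReader implementation.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a parsed document: XPath string -> text of the selected node.
using FakeDocument = std::map<std::string, std::string>;

std::string lookup(const FakeDocument& doc, const std::string& query) {
    FakeDocument::const_iterator it = doc.find(query);
    if (it == doc.end()) throw std::runtime_error("no node matches " + query);
    return it->second;
}

// Reads the <array> found under path. The attributes of <array> say what the
// size element, the element tag and the index attribute are called.
std::vector<double> readArray(const FakeDocument& doc, const std::string& path) {
    const std::string base = path + "/array";
    const std::string sizeName = lookup(doc, base + "/@sizeName");
    const std::string elemName = lookup(doc, base + "/@elemName");
    const std::string indexName = lookup(doc, base + "/@indexName");
    const int start = std::stoi(lookup(doc, base + "/@indexStart"));
    const int count = std::stoi(lookup(doc, base + "/" + sizeName));

    std::vector<double> values;
    for (int i = 0; i < count; ++i) {
        // Select each element by its index attribute, not by document order,
        // e.g. /foo/array/el[@idx="2"].
        const std::string query = base + "/" + elemName + "[@" + indexName +
                                  "=\"" + std::to_string(start + i) + "\"]";
        values.push_back(std::stod(lookup(doc, query)));
    }
    return values;
}

// Query results for a three-element array under <foo>, indices starting at 1.
FakeDocument makeArrayExample() {
    FakeDocument doc;
    doc["/foo/array/@sizeName"] = "size";
    doc["/foo/array/@elemName"] = "el";
    doc["/foo/array/@indexName"] = "idx";
    doc["/foo/array/@indexStart"] = "1";
    doc["/foo/array/size"] = "3";
    doc["/foo/array/el[@idx=\"1\"]"] = "0.5";
    doc["/foo/array/el[@idx=\"2\"]"] = "1.5";
    doc["/foo/array/el[@idx=\"3\"]"] = "2.5";
    return doc;
}
```

Because each element is fetched by its index attribute, the order of the el tags in the file does not matter to this reader.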
Array Mark - Up Example
<size>
<array sizeName="num_dimensions"
elemName="axis" indexName="dimension"
indexStart="1">
<num_dimensions>4</num_dimensions>
<axis dimension="1">
...
</axis>
<axis dimension="2">
...
</axis>
...
</array>
</size>
Minimally invasive:
Insert <array> </array> tags
Copy <dimension> tag to attribute
Easy to implement with XSL transformation
Working group needn't amend current metadata schema for it.
Conclusions
Discussed API Issues for Parsing XML without full
“data binding” tools.
Discussed Repeated Tag problem
Concluded that XPath is simple and elegant way to solve
problem - hopefully convinced you too.
Discussed C++ Implementation of an XPathReader API
Discussed how to parse compound data types
Described markup for Complex Numbers and Arrays
Suggest Complex and Array markup be standardised by
Metadata Working Group (but not necessarily that it be
used in metadata documents) - to assist sharing of data.
References/Links
XML, DOM, XPath: http://www.w3.org
Tutorials (XPath/XSLT): http://www.zvon.org
libxml2: http://www.xmlsoft.org
Attributes vs. Elements (and other discussions):
http://www.oasis-open.org/cover/elementsAndAttrs.html
XPathReader software
send email to me: bj@ph.ed.ac.uk
SciDAC CVS repository at JLAB (xpath_reader)
SciDAC: http://www.lqcd.org