Parsing XML in Perl

Perl/XML::DOM - reading
and writing XML from Perl
Dr. Andrew C.R. Martin
martin@biochem.ucl.ac.uk
http://www.bioinf.org.uk/
Aims and objectives
Refresh the structure of an XML (or XHTML) document
Know the problems in reading and writing XML
Understand the requirements of XML parsers and the two main types
Know how to write code using the DOM parser
PRACTICAL: write a script to read XML
An XML refresher!
<mutants>
  <mutant_group native='1abc01'>
    <structure>
      <method>x-ray</method>
      <resolution>1.8</resolution>
      <rfactor>0.20</rfactor>
    </structure>
    <mutant domid='2bcd01'>
      <structure>
        <method>x-ray</method>
        <resolution>1.8</resolution>
        <rfactor>0.20</rfactor>
      </structure>
      <mutation res='L24' native='ALA' subs='ARG' />
    </mutant>
  </mutant_group>
</mutants>
Tags: paired opening and closing tags
May contain data and/or other (nested) tags
Un-paired tags use special syntax
Attributes: optional, contained within the opening tag
Writing XML
Writing XML is straightforward
Generate XML from a Perl script using print() statements.
However, you must make sure that:
tags are correctly nested
quote marks are correctly paired
international character sets are handled
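A minimal sketch of the print() approach, assuming nothing beyond core Perl. The helper names xml_escape and mutation_xml are hypothetical (they are not from the slides); they show one way to keep quote marks paired and special characters escaped when building XML by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Emit UTF-8 so that non-ASCII characters are written consistently
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: escape the five characters that have special
# meaning in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or later escapes get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: build one un-paired <mutation> element as a string,
# escaping every attribute value so the quote marks stay paired.
sub mutation_xml {
    my ($res, $native, $subs) = @_;
    return sprintf("<mutation res='%s' native='%s' subs='%s' />",
                   map { xml_escape($_) } ($res, $native, $subs));
}

print "<mutants>\n";
print "  ", mutation_xml('L24', 'ALA', 'ARG'), "\n";
print "</mutants>\n";
```

Nesting is still the script's responsibility here: every opening tag printed must be followed by its own closing tag in the right order.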
Reading XML
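As simple or complex as you wish!
If you have full control over the XML, a simple pattern match may suffice
Otherwise, pattern matching may be dangerous:
it may rely on formatting that is not guaranteed, for example when the same data arrives without line breaks or indentation: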
<mutants><mutant_group native='1abc01'><structure><method>
x-ray</method><resolution>1.8</resolution><rfactor>0.20
</rfactor></structure><mutant domid='2bcd01'><structure>
<method>x-ray</method><resolution>1.8</resolution>
<rfactor>0.20</rfactor></structure><mutation res='L24'
native='ALA' subs='ARG'/></mutant></mutant_group></mutants>
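To illustrate the danger, here is a small sketch in core Perl. The function resolutions_by_line is a hypothetical line-based extractor that assumes each <resolution> element sits alone on its own line; it finds the value in pretty-printed input but silently misses it when the same data is packed differently.

```perl
use strict;
use warnings;

# Hypothetical line-based extractor: assumes one <resolution> element per line
sub resolutions_by_line {
    my ($xml) = @_;
    my @found;
    foreach my $line (split /\n/, $xml) {
        push @found, $1 if $line =~ m{^\s*<resolution>(.*)</resolution>\s*$};
    }
    return @found;
}

# The same data, formatted in two equally legal ways
my $pretty  = "<structure>\n<method>x-ray</method>\n"
            . "<resolution>1.8</resolution>\n</structure>\n";
my $compact = "<structure><method>x-ray</method><resolution>1.8\n"
            . "</resolution></structure>\n";

my @from_pretty  = resolutions_by_line($pretty);
my @from_compact = resolutions_by_line($compact);

printf "pretty-printed: %d match(es)\n", scalar @from_pretty;
printf "compact:        %d match(es)\n", scalar @from_compact;
```

Both inputs are well-formed XML describing the same structure, which is why a real parser is the safer choice when the formatting is outside your control.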
XML Parsers
Clear rules for data boundaries and
hierarchy
Predictable; unambiguous
Parser translates XML into
stream of events
complex data object
XML Parsers
Good parser will handle:
Different sources of data
files
character strings
remote references
different character encodings
standard Latin
Japanese
checking for well-formedness errors
XML Parsers
Read stream of characters
differentiate markup and data
Optionally replace entity references
(e.g. &lt; with <)
Assemble complete document from
disparate (perhaps remote) sources
Report syntax and validation errors
Pass data to client program
XML Parsers
If XML has no syntax errors it is
'well formed'
With a DTD, a validating parser will
also check that the document matches the DTD:
if it does, it is 'valid'
XML Parsers
Writing a good parser is a lot of work!
A lot of testing needed
Fortunately, many parsers available
Getting data to your program
Parser can generate 'events'
Tags are converted into events
Events triggered in your program as the
document is read
Parser acts as a pipeline converting
XML into processed chunks of data sent
to your program:
an 'event stream'
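The event-stream idea can be sketched with XML::Parser's handler interface. This is a minimal illustration, not the only way to do it: the handler names (on_start, on_end, on_char) and the @events log are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;    # log of events, in the order the parser fires them

# Each handler receives the underlying expat object first, then the event data
sub on_start { my ($expat, $tag, %attr) = @_; push @events, "start:$tag"; }
sub on_end   { my ($expat, $tag) = @_;        push @events, "end:$tag"; }
sub on_char  {
    my ($expat, $text) = @_;
    push @events, "text:$text" if $text =~ /\S/;    # skip whitespace-only data
}

my $parser = XML::Parser->new(
    Handlers => { Start => \&on_start, End => \&on_end, Char => \&on_char },
);

# Parse a string; the handlers are called as the document is read
$parser->parse('<structure><method>x-ray</method></structure>');

print "$_\n" foreach @events;
```

Nothing is kept by the parser itself: if the program wants to remember an earlier event, it has to store it, as @events does here.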
Getting data to your program
OR…
XML converted into a tree structure
Reflects organization of the XML
Whole document read into memory
before your program gets access
Pros and cons
Data structure
  Advantages:
    More convenient
    Can access data in any order
    Code usually simpler
  Disadvantages:
    May be impossible to handle very large files
    Need more processor time
    Need more memory
Event stream
  Advantages:
    Faster to access limited data
    Use less memory
  Disadvantages:
    Parser loses data at the next event
    More complex code
In the parser, everything is likely to be event-driven
Tree-based parsers create a data structure from the event stream
SAX and DOM
de facto standard APIs for XML parsing
SAX (Simple API for XML)
  event-stream API
  originally for Java, but now for several programming languages (including Perl)
  development promoted by Peter Murray-Rust, 1997-8
DOM (Document Object Model)
  W3C standard tree-based parser
  platform- and language-neutral
  allows update of document content and structure as well as reading
Perl XML parsers
Many parsers available
Differ in three major ways:
parsing style (event driven or data structure)
'standards-completeness'
speed (implementation in C or pure Perl)
Perl XML parsers
XML::Simple
Very easy to use
Designed for simple applications
Can’t handle 'mixed content'
tags containing both data and other tags
<p>This is <b>mixed</b> content</p>
Perl XML parsers
XML::Parser
Oldest Perl XML parser
Reasonably fast and flexible
Not very standards-compliant
Is a wrapper to 'expat', probably the first C XML parser,
written by James Clark
XML::Parser
Simple example - check well-formedness
use XML::Parser;

my $xmlfile = shift @ARGV;    # the file to parse

# initialize parser object and parse the file
# (ErrorContext => 2: remember 2 lines before an error)
my $parser = XML::Parser->new( ErrorContext => 2 );

# eval is used so a parser error doesn't cause the program to exit;
# $@ stores the error
eval { $parser->parsefile( $xmlfile ); };

# report error or success
if( $@ )
{
    $@ =~ s/at \/.*?$//s;    # remove module line number
    print STDERR "\nERROR in '$xmlfile':\n$@\n";
}
else
{
    print STDERR "'$xmlfile' is well-formed\n";
}
Perl XML parsers
XML::DOM
Implements W3C DOM Level 1
Built on top of XML::Parser
Very good: fast, stable and complete
Limited extended functionality
XML::SAX
Implements SAX2 wrapper to Expat
Fast, stable and complete
Perl XML parsers
XML::LibXML
Wrapper around GNOME libxml2
Very fast, complete and stable
Validating/non-validating
DOM and SAX support
Perl XML parsers
XML::Twig
DOM-like parser, BUT
Allows you to define elements which can
be parsed as discrete units
'twigs' (small branches of a tree)
Perl XML parsers
Several others
More specialized, adding…
XPath (to select data from XML document)
re-formatting (XSLT or other methods)
...
XML::DOM
DOM is a standard API
once learned moving to a different
language is straightforward
moving between implementations also easy
Suppose we want to extract some data
from an XML file...
XML::DOM
<data>
<species name='Felix domesticus'>
<common-name>cat</common-name>
<conservation status='not endangered' />
</species>
<species name='Drosophila melanogaster'>
<common-name>fruit fly</common-name>
<conservation status='not endangered' />
</species>
</data>
We want:
cat (Felix domesticus) not endangered
fruit fly (Drosophila melanogaster) not endangered
#!/usr/bin/perl
use XML::DOM;                          # import XML::DOM module
$file = shift @ARGV;                   # obtain filename
$parser = XML::DOM::Parser->new();     # initialize a parser
$doc = $parser->parsefile($file);      # parse the file

# Look at each species element in turn:
# getElementsByTagName returns each element matching 'species'
foreach $species ($doc->getElementsByTagName('species'))
{
    # $species contains content and descendant elements,
    # so we can now treat $species in the same way as $doc
    $common_name = $species->getElementsByTagName('common-name');
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;
    $name = $species->getAttribute('name');
    $conservation = $species->getElementsByTagName('conservation');
    $status = $conservation->item(0)->getAttribute('status');
    print "$cname ($name) $status\n";
}
$doc->dispose();

Nested loops correspond to nested tags
Extracting the common name:
$common_name = $species->getElementsByTagName('common-name');
$cname = $common_name->item(0)->getFirstChild->getNodeValue;
->getElementsByTagName('common-name') returns each element matching 'common-name'
$common_name contains content and any descendant elements
Here there is only one <common-name> element per <species>, but the parser can't know that.
We have to specify that we want the first (and only) <common-name> element, using ->item(0)
The <common-name> element contains only one child, which is data
Obtain the first (and only) child using ->getFirstChild
Extract the text from that child object using ->getNodeValue
->getElementsByTagName returns an array in list context, or a node-list object in scalar context
Here we work through the array:
foreach $species ($doc->getElementsByTagName('species'))
{
}
Here we obtain an object and use the ->item()
method to obtain the first item:
$common_name = $species->getElementsByTagName('common-name');
$cname = $common_name->item(0)->getFirstChild->getNodeValue;
Alternative is to access an individual array element:
@common_names = $species->getElementsByTagName('common-name');
$cname = $common_names[0]->getFirstChild->getNodeValue;
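The two calling styles can be compared side by side. This is a minimal sketch, assuming XML::DOM is installed; the second <common-name> ('house cat') is invented for the example so that the list has more than one item, and the document is parsed from a string rather than a file.

```perl
use strict;
use warnings;
use XML::DOM;

# Illustrative data: the extra <common-name> is not in the original example
my $xml = <<'END_XML';
<data>
<species name='Felix domesticus'>
<common-name>cat</common-name>
<common-name>house cat</common-name>
</species>
</data>
END_XML

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse($xml);       # parse a string rather than a file

# Scalar context: a node-list object, queried with ->getLength and ->item()
my $list  = $doc->getElementsByTagName('common-name');
my $count = $list->getLength;
my $first = $list->item(0)->getFirstChild->getNodeValue;

# List context: an ordinary Perl array of element objects
my @elements = $doc->getElementsByTagName('common-name');
my @names    = map { $_->getFirstChild->getNodeValue } @elements;

print "$count names, first is '$first'; all: @names\n";

$doc->dispose();
```

The scalar form is convenient when only one element is expected; the list form fits naturally into foreach and map.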
<species name='Felix domesticus'>
<common-name>cat</common-name>
<conservation status='not endangered' />
</species>
There could have been more than one <common-name> element within this <species> element
Obtain the actual text rather than an element object
$common_name = $species->getElementsByTagName('common-name');
$cname = $common_name->item(0)->getFirstChild->getNodeValue;
Attributes
Attributes much simpler!
Can only contain text (no nested elements)
Simply specify the attribute:
$name = $species->getAttribute('name');
<species name='Felix domesticus'>
<common-name>cat</common-name>
<conservation status='not endangered' />
</species>
DOM parser can't know there is only one <conservation> element per species
Extract the <conservation> elements from this <species> element
Extract the first (and only) one
Contains only a 'status' attribute: extract its value with ->getAttribute()
This is an empty element: there are no child elements, so we don't need ->getFirstChild
$conservation = $species->getElementsByTagName('conservation');
$status = $conservation->item(0)->getAttribute('status');
At the end of the loop body, print the extracted information, then loop back:
    print "$cname ($name) $status\n";    # print the extracted information
}                                        # loop back to next <species> element
$doc->dispose();                         # clean up and free memory
XML::DOM
Note
Not necessary to use variable names
that match the tags, but it is a very
good idea!
There are many more functions,
but this set covers most needs
Writing XML with
XML::DOM
#!/usr/bin/perl
use XML::DOM;                              # import XML::DOM

# initialize data
$nspecies = 2;
@names = ('Felix domesticus', 'Drosophila melanogaster');
@commonNames = ('cat', 'fruit fly');
@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;            # create an XML document object
$xml_pi = $doc->createXMLDecl ('1.0');     # utility method to
print $xml_pi->toString;                   # print XML header
$root = $doc->createElement('data');
for($i=0; $i<$nspecies; $i++)
{
    $species = $doc->createElement('species');
    $species->setAttribute('name', $names[$i]);
    $root->appendChild($species);
    $cname = $doc->createElement('common-name');
    $text = $doc->createTextNode($commonNames[$i]);
    $cname->appendChild($text);
    $species->appendChild($cname);
    $cons = $doc->createElement('conservation');
    $cons->setAttribute('status', $consStatus[$i]);
    $species->appendChild($cons);
}
print $root->toString;

Output so far:
<?xml version="1.0" ?>
Creating the root and <species> elements:
$root = $doc->createElement('data');               # create a 'root' element for our data
for($i=0; $i<$nspecies; $i++)                      # loop through each of the species
{
    $species = $doc->createElement('species');     # create a <species> element
    $species->setAttribute('name', $names[$i]);    # set its 'name' attribute
    $root->appendChild($species);                  # join the <species> element to the parent <data> element
    # (remaining statements of the loop body as in the full script)
}

Output so far:
<?xml version="1.0" ?>
<data>
<species name='Felix domesticus' />
</data>
Adding the <common-name> element (inside the loop):
    $cname = $doc->createElement('common-name');       # create a <common-name> element
    $text = $doc->createTextNode($commonNames[$i]);    # create a text node containing the specified text
    $cname->appendChild($text);                        # join this text node as a child of the <common-name> element
    $species->appendChild($cname);                     # join the <common-name> element as a child of <species>

Output so far:
<?xml version="1.0" ?>
<data>
<species name='Felix domesticus'>
<common-name>cat</common-name>
</species>
</data>
Adding the <conservation> element (inside the loop):
    $cons = $doc->createElement('conservation');       # create a <conservation> element
    $cons->setAttribute('status', $consStatus[$i]);    # set the 'status' attribute
    $species->appendChild($cons);                      # join the <conservation> element as a child of <species>

Output so far:
<?xml version="1.0" ?>
<data>
<species name='Felix domesticus'>
<common-name>cat</common-name>
<conservation status='not endangered' />
</species>
</data>
Finishing the loop and printing:
}                         # loop back to handle the next species
print $root->toString;    # finally print the resulting data structure

Final output:
<?xml version="1.0" ?>
<data>
<species name='Felix domesticus'>
<common-name>cat</common-name>
<conservation status='not endangered' />
</species>
<species name='Drosophila melanogaster'>
<common-name>fruit fly</common-name>
<conservation status='not endangered' />
</species>
</data>
Summary - reading XML
Create a parser
$parser = XML::DOM::Parser->new();
Parse a file
$doc = $parser->parsefile('filename');
Extract all elements matching tag-name
$element_set = $doc->getElementsByTagName('tag-name');
Extract first element of a set
$element = $element_set->item(0);
Extract first child of an element
$child_element = $element->getFirstChild;
Extract text from a child text node
$text = $child_element->getNodeValue;
Get the value of a tag’s attribute
$text = $element->getAttribute('attribute-name');
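The reading calls above can be combined into one defensive routine. This is a sketch under stated assumptions: species_lines is a hypothetical helper, the '?' and 'unknown' fallbacks are choices made for the example, and the sample data deliberately omits <conservation> from the second species to exercise the guard.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: one "common (scientific) status" line per <species>
sub species_lines {
    my ($xml) = @_;
    my $parser = XML::DOM::Parser->new();
    my $doc    = $parser->parse($xml);      # parse from a string
    my @lines;

    foreach my $species ($doc->getElementsByTagName('species')) {
        my $name = $species->getAttribute('name');

        # Guard against a missing or empty <common-name>
        my $cname  = '?';
        my $common = $species->getElementsByTagName('common-name');
        if ($common->getLength > 0
            && defined $common->item(0)->getFirstChild) {
            $cname = $common->item(0)->getFirstChild->getNodeValue;
        }

        # Guard against a missing <conservation>
        my $status = 'unknown';
        my $cons   = $species->getElementsByTagName('conservation');
        $status = $cons->item(0)->getAttribute('status')
            if $cons->getLength > 0;

        push @lines, "$cname ($name) $status";
    }

    $doc->dispose();                        # free the document's memory
    return @lines;
}

# Sample data: the second species has no <conservation> element (illustrative)
my $sample = "<data>"
    . "<species name='Felix domesticus'><common-name>cat</common-name>"
    . "<conservation status='not endangered' /></species>"
    . "<species name='Drosophila melanogaster'>"
    . "<common-name>fruit fly</common-name></species>"
    . "</data>";

my @lines = species_lines($sample);
print "$_\n" foreach @lines;
```

Checking ->getLength before calling ->item(0) avoids calling a method on an undefined value when an expected element is absent.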
Summary - writing XML
Create an XML document structure
$doc = XML::DOM::Document->new;
Utility to create an XML header
$header = $doc->createXMLDecl('1.0');
Create a tagged element
$element = $doc->createElement('tag-name');
Set an attribute for an element
$element->setAttribute('attrib-name', 'value');
Append a child element to an element
$parent_element->appendChild($child_element);
Create a text node
$text_node = $doc->createTextNode('text');
Print a document structure as a string
print $root_element->toString;
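The writing calls can be put together as follows. This sketch differs slightly from the script in the slides: it attaches the XML declaration and the root element to the document object (using setXMLDecl and appendChild on $doc) so that the whole document is serialized in one call, then parses the result back as a sanity check. The variable names are chosen for the example.

```perl
use strict;
use warnings;
use XML::DOM;

# Create the document and attach an XML declaration to it
my $doc = XML::DOM::Document->new;
$doc->setXMLDecl($doc->createXMLDecl('1.0'));

# Create the root element and make it the document element
my $root = $doc->createElement('data');
$doc->appendChild($root);

# Build one <species> with an attribute and a nested text-bearing element
my $species = $doc->createElement('species');
$species->setAttribute('name', 'Drosophila melanogaster');
$root->appendChild($species);

my $cname = $doc->createElement('common-name');
$cname->appendChild($doc->createTextNode('fruit fly'));
$species->appendChild($cname);

# Serialize the whole document, declaration included
my $xml = $doc->toString;
print $xml, "\n";

# Round trip: parse the generated XML and read the attribute back
my $check = XML::DOM::Parser->new()->parse($xml);
my $name  = $check->getElementsByTagName('species')
                  ->item(0)->getAttribute('name');
print "round-trip name: $name\n";

$check->dispose();
$doc->dispose();
```

Because the tree is built through the API, nesting and quoting are handled by the library rather than by hand-written print() statements.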
Summary
Two types of parser
Event-driven
Data structure
Writing a good parser is difficult!
Many parsers available
XML::DOM for reading and writing data