Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/ Aims and objectives Refresh the structure of a XML (or XHTML) document Know problems in reading and writing XML Understand the requirements of XML parsers and the two main types Know how to write code using the DOM parser PRACTICAL: write a script to read XML An XML refresher! <mutants> <mutant_group native='1abc01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> Tags: paired opening and Attributes: optional un-paired tags use <rfactor>0.20</rfactor> closing tags within – contained </structure> special syntax the opening tag <mutant domid='2bcd01'> May contain data and/or <structure> <method>x-ray</method> other (nested) tags <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure> <mutation res='L24' native='ALA' subs='ARG' /> </mutant> </mutant_group> </mutants> Writing XML Writing XML is straightforward Generate XML from a Perl script using print() statements. However: tags correctly nested quote marks correctly paired international character sets Reading XML As simple or complex as you wish! Full control over XML: simple pattern may suffice Otherwise, may be dangerous may rely on un-guaranteed formatting <mutants><mutant_group native='1abc01'><structure><method> x-ray</method><resolution>1.8</resolution><rfactor>0.20 </rfactor></structure><mutant domid='2bcd01'><structure> <method>x-ray</method><resolution>1.8</resolution> <rfactor>0.20</rfactor></structure><mutation res='L24’ native='ALA’ subs='ARG'/></mutant></mutant_group></mutants> XML Parsers Clear rules for data boundaries and hierarchy Predictable; unambiguous Parser translates XML into stream of events complex data object XML Parsers Good parser will handle: Different data sources of data files character strings remote references different character encodings standard Latin Japanese checking for well-formedness errors XML Parsers Read stream of characters differentiate markup and data Optionally replace entity references (e.g. &lt; with <) Assemble complete document disparate (perhaps remote) sources Report syntax and validation errors Pass data to client program XML Parsers If XML has no syntax errors it is 'well formed' With a DTD, a validating parser will check it matches: 'valid' XML Parsers Writing a good parser is a lot of work! A lot of testing needed Fortunately, many parsers available Getting data to your program Parser can generate 'events' Tags are converted into events Events triggered in your program as the document is read Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream' Getting data to your program OR… XML converted into a tree structure Reflects organization of the XML Whole document read into memory before your program gets access Pros and cons Data structure Event stream More convenient Can access data in any order Code usually simpler May be impossible to handle very large files Need more processor time Need more memory Faster to access limited data Use less memory Parser loses data at the next event More complex code In the parser, everything is likely to be event-driven tree-based parsers create a data structure from the event stream SAX and DOM de facto standard APIs for XML parsing SAX (Simple API for XML) event-stream API originally for Java, but now for several programming languages (including Perl) development promoted by Peter Murray Rust, 1997-8 DOM (Document Object Model) W3C standard tree-based parser platform- and language-neutral allows update of document content and structure as well as reading Perl XML parsers Many parsers available Differ in three major ways: parsing style (event driven or data structure) 'standards-completeness’ speed (implementation in C or pure Perl) Perl XML parsers XML::Simple Very easy to use Designed for simple applications Can’t handle 'mixed content' tags containing both data and other tags <p>This is <b>mixed</b> content</p> Perl XML parsers XML::Parser Oldest Perl XML parser Reasonably fast and flexibile Not very standards-compliant. Is a wrapper to 'expat’ probably the first C XML parser written by James Clark XML::Parser Simple example - check well-formedness Remember 2 lines use XML::Parser; before an error my $xmlfile = shift @ARGV; # the file to $@parse stores the error # initialize parser object and parse the string my $parser = XML::Parser->new( ErrorContext => 2 ); eval { $parser->parsefile( $xmlfile ); }; # report if( $@ ) { $@ =~ print } else { print } error or success usedmodule so parser s/at \/.*?$//s; Eval # remove line number STDERR "\nERROR in '$xmlfile':\n$@\n"; doesn’t cause program to exit STDERR "'$xmlfile' is well-formed\n"; Perl XML parsers XML::DOM Implements W3C DOM Level 1 Built on top of XML::Parser Very good fast, stable and complete Limited extended functionality XML::SAX Implements SAX2 wrapper to Expat Fast, stable and complete Perl XML parsers XML::LibXML Wrapper around GNOME libxml2 Very fast, complete and stable Validating/non-validating DOM and SAX support Perl XML parsers XML::Twig DOM-like parser, BUT Allows you to define elements which can be parsed as discrete units 'twigs' (small branches of a tree) Perl XML parsers Several others More specialized, adding… XPath (to select data from XML document) re-formatting (XSLT or other methods) ... XML::DOM DOM is a standard API once learned moving to a different language is straightforward moving between implementations also easy Suppose we want to extract some data from an XML file... XML::DOM <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species> </data> We want: cat (Felix domesticus) not endangered fruit fly (Drosophila melanogaster) not endangered #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); Import XML::DOM module obtain filename initialize a parser parse the file foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Can now treat $species $name = $species->getAttribute('name'); Returns each element in the same way as $doc $species->getElementsByTagName('conservation'); matching ‘species’ $conservation = $status = $conservation->item(0)->getAttribute('status'); Nested loops correspond to nested tags print "$cname ($name) $status\n"; } $doc->dispose();Look at each species element in turn $species contains content and descendant elements #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $common_name contains content $conservation = $species->getElementsByTagName('conservation'); and any descendant elements $status = $conservation->item(0)->getAttribute('status'); Returns each element element Here there is only one <common-name> matching ‘common-name’ per <species> element, but parser can't Extract the text from the element print "$cnameThe ($name) $status\n"; <common-name> contains only know that. Have toelement specify that we want object using ->getNodeValue } one child is data. the first (and only)which <common-name> element Obtain first (and only) child element $doc->dispose(); using ->getFirstChild ->getElementsByTagName returns an array Here we work through the array: foreach $species ($doc->getElementsByTagName('species')) { } Here we obtain an object and use the ->item() method to obtain the first item: $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Alternative is to access an individual array element: @common_names = $species->getElementsByTagName('common-name'); $cname = $common_names[0]->getFirstChild->getNodeValue; <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> There could have been Obtain more thethan actual text one <common-name>rather element than an within this <species> element element object $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Attributes #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; Attributes much simpler! $parser = XML::DOM::Parser->new(); Can only contain text (no $doc = $parser->parsefile($file); nested elements) Simply specify the attribute foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); <species name='Felix domesticus'> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> DOM parser can't know there is only one Extract thea<conservation> elements Contains only ‘status’ attribute, <conservation> element per species. from thiswith <species> element extract its value ->getAttribute() This isthe an first empty element, there are no Extract (and only) one. child elements so we don’t need ->getFirstChild $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); #!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) Print the extracted information { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; Loop back to next <species> element } $doc->dispose(); Clean up and free memory XML::DOM Note Not necessary to use variable names that match the tags, but it is a very good idea! There are many many more functions, but this set covers most needs Writing XML with XML::DOM #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); Import XML::DOM Initialize data Create an XML for($i=0; $i<$nspecies; $i++) document object { Utility method to $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); print XML header $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); <?xml version=“1.0” ?> $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); Create a ‘root’ element for our dataof Loop through each the species for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); Create a <species> $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); Set its ‘name’ element $cname->appendChild($text); attribute $species->appendChild($cname); Join the <species> element to the parent <data> $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; element <?xml version=“1.0” ?> <data> <species name='Felix domesticus’ /> </data> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) Create a <common-name> { Create a text node containing $species = $doc->createElement('species'); element the specified $species->setAttribute('name', $names[$i]); text $root->appendChild($species); $cname = $doc->createElement('common-name'); Join this text node as a $text = $doc->createTextNode($commonNames[$i]); Join the <name> element child of the <name> element $cname->appendChild($text); as a child of <species> $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> </species> </data> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); Create a <conservation> Set the ‘status’ $cname = $doc->createElement('common-name'); element Join the <conservation> element attribute $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); as a child of <species> $root->appendChild($species); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> </data> #!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); Loop back to handle the $doc->createElement('common-name'); next species $root->appendChild($species); $cname = $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status',Finally $consStatus[$i]); print the resulting $species->appendChild($cons); } print $root->toString; data structure <?xml version=“1.0” ?> <data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species> </data> Summary - reading XML Create a parser $parser = XML::DOM::Parser->new(); Parse a file $doc = $parser->parsefile('filename'); Extract all elements matching tag-name $element_set = $doc->getElementsByTagName('tag-name') Extract first element of a set $element = $element_set->item(0); Extract first child of an element $child_element = $element->getFirstChild; Extract text from an element $text = $element->getNodeValue; Get the value of a tag’s attribute $text = $element->getAttribute('attribute-name'); Summary - writing XML Create an XML document structure $doc = XML::DOM::Document->new; Utility to create an XML header $header = $doc->createXMLDecl('1.0'); Create a tagged element $element = $doc->createElement('tag-name'); Set an attribute for an element $element->setAttribute('attrib-name', 'value'); Append a child element to an element $parent_element->appendChild($child_element); Create a text node element $element = $doc->createTextNode('text'); Print a document structure as a string print $root_element->toString; Summary Two types of parser Event-driven Data structure Writing a good parser is difficult! Many parsers available XML::DOM for reading and writing data