Joy of XML - from Stanford

The Joy of SAX
(and DOM, and JDOM…)
Bill MacCartney
11 October 2004
Roadmap
What are XML APIs good for?
Overview of JAXP
    JAXP XML Parsers
    SAX vs. DOM
SAX
    SAX Architecture
    Using SAX
    SAXExample.java
DOM
    DOM Architecture
    Using DOM
    DOMExample.java
JDOM
What are XML APIs for?

You want to read/write data from/to XML files,
and you don't want to write an XML parser.

Applications:
processing an XML-tagged corpus
saving configs, prefs, parameters, etc. as XML files
sharing results with outside users in portable format
    example: typed dependency relations
alternative to serialization for persistent stores
    doesn't break with changes to class definition
    human-readable
Overview of JAXP

JAXP = Java API for XML Processing

Provides a common interface for creating and using the
standard SAX, DOM, and XSLT APIs in Java.

All JAXP packages are included standard in JDK 1.4+.
The key packages are:
javax.xml.parsers
The main JAXP APIs, which provide a common
interface for various SAX and DOM parsers.
org.w3c.dom
Defines the Document interface (a DOM), as well as
interfaces for all of the components of a DOM.
org.xml.sax
Defines the basic SAX APIs.
javax.xml.transform
Defines the XSLT APIs that let you transform XML
into other forms. (Not covered today.)
JAXP XML Parsers

javax.xml.parsers defines abstract classes
DocumentBuilder (for DOM) and SAXParser (for SAX).


It also defines factory classes DocumentBuilderFactory and
SAXParserFactory. By default, these give you the “reference
implementation” of DocumentBuilder and SAXParser, but they
are intended to be vendor-neutral factory classes, so that you
could swap in a different implementation if you preferred.
The JDK includes three XML parser implementations from
Apache:

Crimson: The original. Small and fast. Based on code donated to
Apache by Sun. Standard implementation for J2SE 1.4.

Xerces: More features. Supports XML Schema. Based on code
donated to Apache by IBM.

Xerces 2: The future. Standard implementation for J2SE 5.0.
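To see which implementation the vendor-neutral factories actually hand you on a given JDK, you can simply ask the factory objects for their class names. This is a minimal sketch; the class name WhichParser is made up for illustration, and the printed names will differ between JDK versions. (JAXP also consults the system properties javax.xml.parsers.SAXParserFactory and javax.xml.parsers.DocumentBuilderFactory when choosing an implementation.)

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, not part of the tutorial's example code.
public class WhichParser {
    /** Concrete class behind the vendor-neutral SAX factory. */
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Concrete class behind the vendor-neutral DOM factory. */
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryClass());
        System.out.println("DOM factory: " + domFactoryClass());
    }
}
```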
SAX vs. DOM
SAX = Simple API for XML

Java-specific

interprets XML as a stream of events

you supply event-handling callbacks

SAX parser invokes your event handlers as it parses

doesn't build data model in memory

serial access

very fast, lightweight

good choice when
a) no data model is needed, or
b) natural structure for data model is
list, matrix, etc.
DOM = Document Object Model

W3C standard for representing structured
documents

platform and language neutral
(not Java-specific!)

interprets XML as a tree of nodes

builds data model in memory

enables random access to data

therefore good for interactive apps

more CPU- and memory-intensive

good choice when data model has natural
tree structure
There is also JDOM … more later
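To make the contrast concrete, here is a small sketch that counts elements both ways: once by reacting to SAX events as they stream past, and once by building a DOM tree and querying it. The class and method names (CountBothWays, countWithSax, countWithDom) are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison class, not part of the tutorial's example code.
public class CountBothWays {
    static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** SAX: count matching elements as events arrive; no tree is kept. */
    static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        return count[0];
    }

    /** DOM: build the whole tree in memory, then query it. */
    static int countWithDom(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(stream(xml));
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dots><dot x=\"9\" y=\"81\"/>"
                + "<flip><dot x=\"196\" y=\"14\"/></flip></dots>";
        System.out.println("SAX count: " + countWithSax(xml, "dot"));
        System.out.println("DOM count: " + countWithDom(xml, "dot"));
    }
}
```

Both routes give the same answer; the SAX version never holds more than one event's worth of data, while the DOM version keeps the full tree around for further queries.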
SAX Architecture
[Figure: SAX architecture diagram]
Using SAX
Here’s the standard recipe for starting with SAX:
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
// get a SAXParser object
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
// invoke parser using your custom content handler
// (use whichever overload matches your input: InputStream, File, or URI String)
saxParser.parse(inputStream, myContentHandler);
saxParser.parse(file, myContentHandler);
saxParser.parse(url, myContentHandler);
(This reflects SAX 1, which you can still use, but SAX 2 prefers a new
incantation…)
Using SAX 2
In SAX 2, the following usage is preferred:
// tell SAX which XML parser you want (here, it’s Crimson)
System.setProperty("org.xml.sax.driver",
"org.apache.crimson.parser.XMLReaderImpl");
// get an XMLReader object
XMLReader reader = XMLReaderFactory.createXMLReader();
// tell the XMLReader to use your custom content handler
reader.setContentHandler(myContentHandler);
// Have the XMLReader parse input from Reader myReader:
reader.parse(new InputSource(myReader));
But where does myContentHandler come from?
Defining a ContentHandler

Easiest route: define a new class which extends
org.xml.sax.helpers.DefaultHandler.

Override event-handling methods from
DefaultHandler:
startDocument()   // receive notice of start of document
endDocument()     // receive notice of end of document
startElement()    // receive notice of start of each element
endElement()      // receive notice of end of each element
characters()      // receive a chunk of character data
error()           // receive notice of recoverable parser error
                  // ...plus more...
(All are no-ops in DefaultHandler.)
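A minimal handler along these lines might look like the sketch below. It overrides a few of the callbacks and records each event in a list. The class name EventLogger and its log() helper are invented for illustration; only the overridden methods come from DefaultHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not the tutorial's SAXExample.java.
public class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement: " + qName);
    }

    @Override
    public void error(SAXParseException e) {
        events.add("error: " + e.getMessage());
    }

    /** Parses the given XML text and returns the recorded events in order. */
    static List<String> log(String xml) throws Exception {
        EventLogger handler = new EventLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : log("<a><b/></a>")) {
            System.out.println(event);
        }
    }
}
```

Callbacks you do not override (characters(), for instance) keep their no-op behavior from DefaultHandler.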
startElement() and endElement()
The SAXParser invokes your callbacks to notify you of events:
startElement(String namespaceURI,   // for use w/ namespaces
             String localName,      // for use w/ namespaces
             String qName,          // "qualified" name -- use this one!
             Attributes atts)
endElement(String namespaceURI,
           String localName,
           String qName)

For simple usage, ignore namespaceURI and localName, and just use
qName (the “qualified” name).

XML namespaces are described in an appendix, below.

startElement() and endElement() events always come in pairs:

“<foo/>” will generate calls:
startElement("", "", "foo", atts)   // atts is an empty Attributes object, not null
endElement("", "", "foo")
SAX Attributes

Every call to startElement() includes an
Attributes object which represents all the
XML attributes for that element.

Methods in the Attributes interface:
getLength()              // return number of attributes
getIndex(String qName)   // look up attribute's index by qName
getValue(String qName)   // look up attribute's value by qName
getValue(int index)      // look up attribute's value by index
// ... and others ...
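As a sketch of how these methods fit together, the handler below walks the Attributes object by index and copies each name/value pair into a map. AttributeDump and attributesOf are made-up names; getQName(int) is one of the "others" in the Attributes interface. If several elements match the tag, later values overwrite earlier ones in this simple version.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
public class AttributeDump {
    /** Collects the attributes of elements named tag into a sorted map. */
    static Map<String, String> attributesOf(String xml, final String tag)
            throws Exception {
        final Map<String, String> out = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!qName.equals(tag)) {
                    return;
                }
                // Walk the attributes by index; names come from getQName(i).
                for (int i = 0; i < atts.getLength(); i++) {
                    out.put(atts.getQName(i), atts.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                attributesOf("<dots><dot x=\"9\" y=\"81\"/></dots>", "dot"));
    }
}
```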
SAX characters()
The characters() event handler receives notification of
character data (i.e. text content between tags, as opposed
to markup):
public void characters(char[] ch,    // buffer containing chars
                       int start,    // start position in buffer
                       int length)   // num of chars to read
May be called multiple times within each block of character data (for
example, once per line).
So, you may want to use calls to characters() to accumulate characters
in a StringBuffer, and stop accumulating at the next call to
startElement() or endElement().
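The accumulation pattern just described can be sketched as follows. This version uses StringBuilder (the unsynchronized sibling of StringBuffer) and flushes the buffer on both startElement() and endElement(), which is one reasonable choice rather than the only one. TextCollector and collect are names invented here; whitespace-only chunks are dropped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
public class TextCollector extends DefaultHandler {
    private final StringBuilder buf = new StringBuilder();
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May fire several times per text block, so only append here.
        buf.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    /** Emits the accumulated text as one chunk, skipping pure whitespace. */
    private void flush() {
        String text = buf.toString().trim();
        if (!text.isEmpty()) {
            chunks.add(text);
        }
        buf.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.chunks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                collect("<dots>before<dot/>after<extra>stuff</extra></dots>"));
    }
}
```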
SAXExample: Input XML
<?xml version="1.0" encoding="UTF-8"?>
<dots>
this is before the first dot
and it continues on multiple lines
<dot x="9" y="81" />
<dot x="11" y="121" />
<flip>
flip is on
<dot x="196" y="14" />
<dot x="169" y="13" />
</flip>
flip is off
<dot x="12" y="144" />
<extra>stuff</extra>
<!-- a final comment -->
</dots>
SAXExample: Code
Please see SAXExample.java
SAXExample: Input → Output
<?xml version="1.0" encoding="UTF-8"?>
<dots>
this is before the first dot
and it continues on multiple lines
<dot x="9" y="81" />
<dot x="11" y="121" />
<flip>
flip is on
<dot x="196" y="14" />
<dot x="169" y="13" />
</flip>
flip is off
<dot x="12" y="144" />
<extra>stuff</extra>
<!-- a final comment -->
</dots>
startDocument
startElement: dots (0 attributes)
characters: this is before the first dot
            and it continues on multiple lines
startElement: dot (2 attributes)
endElement: dot
startElement: dot (2 attributes)
endElement: dot
startElement: flip (0 attributes)
characters: flip is on
startElement: dot (2 attributes)
endElement: dot
startElement: dot (2 attributes)
endElement: dot
endElement: flip
characters: flip is off
startElement: dot (2 attributes)
endElement: dot
startElement: extra (0 attributes)
characters: stuff
endElement: extra
endElement: dots
endDocument
Finished parsing input. Got the following dots:
[(9, 81), (11, 121), (14, 196), (13, 169), (12, 144)]
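SAXExample.java itself is not reproduced here, but a handler that yields the dot list above could be sketched like this. It is a guess at the logic, inferred only from the output: dots inside <flip> appear with x and y swapped. The names DotCollector and parseDots are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reconstruction; the real SAXExample.java may differ.
public class DotCollector extends DefaultHandler {
    final List<String> dots = new ArrayList<>();
    private boolean flipped = false;   // true while inside <flip>

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (qName.equals("flip")) {
            flipped = true;
        } else if (qName.equals("dot")) {
            int x = Integer.parseInt(atts.getValue("x"));
            int y = Integer.parseInt(atts.getValue("y"));
            // Assumption: "flip" means swap the coordinates.
            dots.add(flipped ? "(" + y + ", " + x + ")"
                             : "(" + x + ", " + y + ")");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("flip")) {
            flipped = false;
        }
    }

    static List<String> parseDots(String xml) throws Exception {
        DotCollector handler = new DotCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.dots;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<dots>\n"
                + "this is before the first dot\n"
                + "and it continues on multiple lines\n"
                + "<dot x=\"9\" y=\"81\" />\n"
                + "<dot x=\"11\" y=\"121\" />\n"
                + "<flip>\n"
                + "flip is on\n"
                + "<dot x=\"196\" y=\"14\" />\n"
                + "<dot x=\"169\" y=\"13\" />\n"
                + "</flip>\n"
                + "flip is off\n"
                + "<dot x=\"12\" y=\"144\" />\n"
                + "<extra>stuff</extra>\n"
                + "<!-- a final comment -->\n"
                + "</dots>\n";
        System.out.println(parseDots(xml));
    }
}
```

Note that the XML comment never reaches the handler: comments are not reported through the ContentHandler callbacks.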
DOM Architecture
[Figure: DOM architecture diagram]
DOM Document Structure
XML Input:
<?xml version="1.0" encoding="UTF-8"?>
<dots>
this is before the first dot
and it continues on multiple lines
<dot x="9" y="81" />
<dot x="11" y="121" />
<flip>
flip is on
<dot x="196" y="14" />
<dot x="169" y="13" />
</flip>
flip is off
<dot x="12" y="144" />
<extra>stuff</extra>
<!-- a final comment -->
</dots>
Document structure:
Document
+---Element <dots>
    +---Text "this is before the first dot
    |         and it continues on multiple lines"
    +---Element <dot>
    +---Text ""
    +---Element <dot>
    +---Text ""
    +---Element <flip>
    |   +---Text "flip is on"
    |   +---Element <dot>
    |   +---Text ""
    |   +---Element <dot>
    |   +---Text ""
    +---Text "flip is off"
    +---Element <dot>
    +---Text ""
    +---Element <extra>
    |   +---Text "stuff"
    +---Text ""
    +---Comment "a final comment"
    +---Text ""

There’s a text node between every pair of element nodes here, even where it holds only whitespace such as the newline between tags (shown as "" above).

XML comments appear in special comment nodes.

Element attributes do not appear in tree—available through Element object.
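A tree listing like the one above can be produced with a short recursive walk over the Document. The sketch below is one way to do it, with text trimmed for display so that whitespace-only nodes print as "". TreeDump and its method names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
public class TreeDump {
    /** Appends one line per node, indented by depth. */
    static void dump(Node node, String indent, StringBuilder out) {
        out.append(indent);
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                out.append("Document");
                break;
            case Node.ELEMENT_NODE:
                out.append("Element <").append(node.getNodeName()).append(">");
                break;
            case Node.TEXT_NODE:
                out.append("Text \"").append(node.getNodeValue().trim()).append("\"");
                break;
            case Node.COMMENT_NODE:
                out.append("Comment \"").append(node.getNodeValue().trim()).append("\"");
                break;
            default:
                out.append(node.getNodeName());
        }
        out.append("\n");
        // Visit children via the sibling links rather than a NodeList.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, indent + "    ", out);
        }
    }

    static String dump(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        dump(doc, "", out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(dump("<dots>\n<dot x=\"9\"/>\n<!-- c -->\n</dots>"));
    }
}
```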
Using DOM
Here’s the basic recipe for getting started with DOM:
import javax.xml.parsers.*;
import org.w3c.dom.*;
// get a DocumentBuilder object
DocumentBuilderFactory dbf =
    DocumentBuilderFactory.newInstance();
DocumentBuilder db = null;
try {
    db = dbf.newDocumentBuilder();
} catch (ParserConfigurationException e) {
    e.printStackTrace();
}
// invoke parser to get a Document
// (use whichever overload matches your input: InputStream, File, or URI String)
Document doc = db.parse(inputStream);
// Document doc = db.parse(file);
// Document doc = db.parse(url);
DOM Document access idioms
OK, say we have a Document. How do we get at the pieces of it?
Here are some common idioms:
// get the root of the Document tree
Element root = doc.getDocumentElement();
// get nodes in subtree by tag name
NodeList dots = root.getElementsByTagName("dot");
// get first dot element
Element firstDot = (Element) dots.item(0);
// get x attribute of first dot
String x = firstDot.getAttribute("x");
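Putting these idioms together, a complete (if minimal) program that pulls every dot out of a document might look like the sketch below. DomDots and dots() are invented names. Note that getElementsByTagName() searches the whole subtree, so dots nested inside <flip> are found too, and that getAttribute() returns an empty string when the attribute is absent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not the tutorial's DOMExample.java.
public class DomDots {
    static List<String> dots(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // get the root of the Document tree
        Element root = doc.getDocumentElement();
        // get all <dot> elements in the subtree, in document order
        NodeList nodes = root.getElementsByTagName("dot");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element dot = (Element) nodes.item(i);
            out.add("(" + dot.getAttribute("x") + ", "
                    + dot.getAttribute("y") + ")");
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dots><dot x=\"9\" y=\"81\"/>"
                + "<flip><dot x=\"196\" y=\"14\"/></flip></dots>";
        System.out.println(dots(xml));
    }
}
```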
More Document accessors
Node access methods:
String     getNodeName()
short      getNodeType()          // e.g. DOCUMENT_NODE, ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, etc.
Document   getOwnerDocument()
boolean    hasChildNodes()
NodeList   getChildNodes()
Node       getFirstChild()
Node       getLastChild()
Node       getParentNode()
Node       getNextSibling()
Node       getPreviousSibling()
boolean    hasAttributes()
... and more ...
Element extends Node and adds these access methods:
String     getTagName()
boolean    hasAttribute(String name)
String     getAttribute(String name)
NodeList   getElementsByTagName(String name)
... and more ...
Document extends Node and adds these access methods:
Element        getDocumentElement()
DocumentType   getDoctype()
NodeList       getElementsByTagName(String name)   // same as in Element
... and more ...
Creating & manipulating Documents
The DOM API also includes lots of methods for creating and
manipulating Document objects:
// get new empty Document from DocumentBuilder
Document doc = db.newDocument();
// create a new <dots> Element and add to Document as root
Element root = doc.createElement("dots");
doc.appendChild(root);
// create a new <dot> Element and add as child of root
Element dot = doc.createElement("dot");
dot.setAttribute("x", "9");
dot.setAttribute("y", "81");
root.appendChild(dot);
More Document manipulators
Node manipulation methods:
void   setNodeValue(String nodeValue)
Node   appendChild(Node newChild)
Node   insertBefore(Node newChild, Node refChild)
Node   removeChild(Node oldChild)
... and more ...
Element manipulation methods:
void   setAttribute(String name, String value)
void   removeAttribute(String name)
... and more ...
Document manipulation methods:
Text      createTextNode(String data)
Comment   createComment(String data)
... and more ...
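Here is a short sketch that exercises several of these manipulators in sequence. EditTree and build() are invented names, and the particular edits are arbitrary; the point is only to show the calls in context.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class for illustration only.
public class EditTree {
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("dots");
        doc.appendChild(root);

        Element first = doc.createElement("dot");
        first.setAttribute("x", "9");
        root.appendChild(first);

        // insert a comment in front of the first dot
        root.insertBefore(doc.createComment("generated"), first);

        // append some text after the dot
        root.appendChild(doc.createTextNode("flip is off"));

        // add an element, then take it out again
        Element extra = doc.createElement("extra");
        root.appendChild(extra);
        root.removeChild(extra);

        // drop the attribute we set earlier
        first.removeAttribute("x");
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Element root = build().getDocumentElement();
        System.out.println("children of <dots>: "
                + root.getChildNodes().getLength());
    }
}
```

Nodes must be created by the Document they will live in (createElement, createTextNode, createComment), which is why every creation call goes through doc.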
Writing a Document as XML

Strangely, since JAXP 1.1, there is no simple, documented way
to write out a Document object as XML.

Instead, you can exploit an undocumented trick: cast the
Document to a Crimson XmlDocument, which knows how to
write itself out:
import org.apache.crimson.tree.XmlDocument;
XmlDocument x = (XmlDocument) doc;
x.write(out, "UTF-8");

There is a supported way to write Documents as XML via the
XSLT library, but it is far more clumsy than this two-line trick.

Of course, one could just walk the Document tree and write
XML using printlns.

JDOM remedies this with easy XML output!
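For reference, the supported route through the XSLT library mentioned above looks roughly like this: a Transformer created with no stylesheet acts as an identity transform, copying a DOMSource to a StreamResult. WriteXml, toXml and sample are invented names, and the exact formatting of the output depends on the transformer implementation in your JDK.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class for illustration only.
public class WriteXml {
    /** Serializes a Document to a String via an identity transform. */
    static String toXml(Document doc) throws Exception {
        // No stylesheet argument: the Transformer just copies input to output.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Builds a tiny <dots> document to serialize. */
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("dots");
        doc.appendChild(root);
        Element dot = doc.createElement("dot");
        dot.setAttribute("x", "9");
        dot.setAttribute("y", "81");
        root.appendChild(dot);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(sample()));
    }
}
```

It is more ceremony than the Crimson cast, but it does not depend on which parser implementation produced the Document.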
DOMExample: Code
Please see DOMExample.java
JDOM Overview

DOM can be awkward for Java programmers

Language-neutral → does not use Java features


Example: getChildNodes() returns a NodeList, which is not a List.
(NodeList.iterator() is not defined.)
JDOM looks like a good alternative:

open source project, Apache license, late beta

builds on top of JAXP, integrates with SAX and DOM

similar to DOM model, but no shared code

API designed to be easy & obvious for Java programmers

exploits power of Java language: collections, method overloading

rumored to become integrated in future JDKs


XML output is easy!
Key packages: org.jdom, org.jdom.transform,
org.jdom.input, org.jdom.output.
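If you stay with plain DOM, the NodeList awkwardness noted above can be worked around with a small adapter that presents a NodeList as a read-only java.util.List. This is only a sketch; NodeLists, asList and childNames are invented names, and the view reflects the NodeList's current contents each time it is read.

```java
import java.io.StringReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical adapter class for illustration only.
public class NodeLists {
    /** Wraps a NodeList in a read-only List view, so for-each works. */
    static List<Node> asList(final NodeList nodes) {
        return new AbstractList<Node>() {
            @Override
            public Node get(int index) {
                Node n = nodes.item(index);   // item() returns null if out of range
                if (n == null) {
                    throw new IndexOutOfBoundsException("index " + index);
                }
                return n;
            }

            @Override
            public int size() {
                return nodes.getLength();
            }
        };
    }

    /** Returns the node names of the root element's children. */
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        for (Node child : asList(doc.getDocumentElement().getChildNodes())) {
            names.add(child.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames("<r><a/><b/></r>"));
    }
}
```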
DOM vs. JDOM
The DOM way:
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root");
root.appendChild(text);
doc.appendChild(root);
The JDOM way:
Document doc = new Document();
Element e = new Element("root");
e.setText("This is the root");
doc.addContent(e);
schweet!
Pointers

Everything in this tutorial (slides, example code, example
data) will be archived at http://nlp.stanford.edu/local/
for your future reference.

There’s a good JAXP/SAX/DOM tutorial at:
http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/

You can learn more about JDOM at
http://www.jdom.org/docs/faq.html