XML and XML Processing in Java

Tutorial:
Introduction to XML and Java:
XML, dom4j and XPath
Eran Toch
Methodologies in the Development
of Information Systems
December 2003
XML and Java: XML, dom4j and Xpath – Eran Toch
Methodologies in Information System Development
Sources
• Major Sources:
– http://www.cis.upenn.edu/~cis550/slides/xml.ppt
CIS550 Course Notes, U. Penn, source for many
slides
– http://www.cs.technion.ac.il/~oshmu/
236804 - Seminar in Computer Science 4: XML Technology, Systems and Theory
– http://dom4j.org
Agenda
• Short Introduction to XML
– What is XML
– Structure and Terminology
– Java APIs for XML: an Overview
• dom4j
– Parsing an XML document
– Writing to an XML document
• XPath
– XPath Queries
– XPath in dom4j
• References
The Structure of XML
• XML consists of tags and text
• Tags come in pairs <date> ...</date>
• They must be properly nested
<date> <day> ... </day> ... </date> --- good
<date> <day> ... </date>... </day> --- bad
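The nesting rule can be checked mechanically. The following is a minimal sketch, assuming the JDK's built-in JAXP parser rather than dom4j; the class name NestingCheck and the sample content "12" are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class NestingCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so parse errors are not echoed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Improper nesting is reported as a parse error.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<date><day>12</day></date>")); // true
        System.out.println(isWellFormed("<date><day>12</date></day>")); // false
    }
}
```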
XML text
• XML has only one “basic” type -- text. It is
bounded by tags e.g.
<title> The Big Sleep </title>
<year> 1935 </year> --- 1935 is still text
• XML text is called PCDATA (for parsed
character data). It uses Unicode, so any character
can be written as a character reference,
e.g. &#x05DE; for the Hebrew letter Mem
Later we shall see how data types can be
specified (XML Schema)
XML structure
• Nesting tags can be used to express various
structures. E.g. A tuple (record):
<person>
<name> Jeff Cohen</name>
<tel> 04-828-1345 </tel>
<tel> 054-470-778 </tel>
<email> jeffc@cs.technion.ac.il </email>
</person>
XML structure (cont.)
• We can represent a list by using the same
tag repeatedly:
<addresses>
<person> ... </person>
<person> ... </person>
<person> ... </person>
...
</addresses>
XML structure (cont.)
• Nested tags can be part of a list too:
<addresses>
<person>
<name> Yossi Orr</name>
<tel> 04-828-1345 </tel>
<email> yossio@cs.technion.ac.il </email>
</person>
<person>
<name> Irma Levy</name>
<tel> 03-426-1142 </tel>
<email>irmal@yourmail.com</email>
</person>
</addresses>
Terminology
• The segment of an XML document between an opening tag and its
corresponding closing tag is called an element.
• Metadata about an element can appear in an attribute.
<person type="Friend">
<name>Ortal Derech</name>
<tel>04-8732122</tel>
<tel>054-646888</tel>
<email>oderech@tx.technion.ac.il</email>
</person>
Callouts: type="Friend" is an attribute; each of <name>, <tel> and <email>
is an element, and a sub-element of <person>; the characters between the
tags are text.
XML is tree-like
person
  name: Malcolm Atchison
  tel: (215) 898 4321
  tel: (215) 898 4321
  email: mp@dcs.gla.ac.sc
A Complete XML Document
The standalone attribute tells whether or not this document references an
external entity or an external data type specification:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE addresses SYSTEM
"http://www.technion.ac.il/~erant/addresses.dtd">
<addresses>
<person>
<name> Jeff Cohen</name>
<tel> 04-828-1345 </tel>
<tel> 054-470-778 </tel>
<email> jeffc@cs.technion.ac.il </email>
</person>
</addresses>
XML Structure Definitions
• DTD
– Document Type Definition – defines structure
constraints for XML documents
• XML Schema
– Same purpose as a DTD, but more powerful: it can
specify the data types of elements, and it is itself
written in XML.
• Namespaces
– Namespaces are a way of preventing name clashes
among elements from more than one source within
the same XML document.
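A name clash and its resolution can be sketched as follows, assuming the JDK's built-in DOM parser rather than dom4j. The two namespace URIs, the element names and the class name NamespaceDemo are hypothetical, chosen only for illustration: both sources define a title element, and the namespace tells them apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceDemo {
    // Hypothetical namespace URIs, used only for illustration.
    static final String BOOK_NS = "http://example.org/book";
    static final String PERSON_NS = "http://example.org/person";

    // Two <title> elements from different vocabularies in one document.
    static final String XML =
        "<entry xmlns:b='" + BOOK_NS + "' xmlns:p='" + PERSON_NS + "'>"
      + "<b:title>The Big Sleep</b:title>"
      + "<p:title>Dr.</p:title>"
      + "</entry>";

    /** Returns the text of the first title element in the given namespace. */
    static String titleIn(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace support is off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "title")
                    .item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleIn(BOOK_NS));   // The Big Sleep
        System.out.println(titleIn(PERSON_NS)); // Dr.
    }
}
```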
More Standards
• XPath
– XML Path Language, a language for locating parts of
an XML document.
• XQuery
– A query language for XML documents (like SQL…).
• XSLT
– XSL Transformations, a language for transforming
XML documents into other XML documents.
• RDF
– Resource Description Framework. A formal
knowledge model from the World Wide Web Consortium (W3C).
Why Is XML Important?
• Because it exists, and everybody uses it.
• Plain Text - you can create and edit files with
anything.
• Data Identification - XML tells you what kind
of data you have, not how to display it.
• Separation from style.
• Hierarchical, and easily processed.
An Overview of the APIs
• JAXP: Java API for XML Processing
– It provides a common interface for creating and using
the standard SAX, DOM, and XSLT APIs.
• JAXB: Java Architecture for XML Binding
– Defines a mechanism for writing out Java objects as
XML and reading XML back into Java objects.
• JDOM
– Represents an XML file as a tree of objects
(sophisticated version of DOM)
• dom4j
– Lightweight version of JDOM.
dom4j
• An Open Source XML framework for Java.
• Allows you to read, write, navigate, create
and modify XML documents.
• Integrates with DOM and SAX.
• Full XPath support.
• XSLT Support.
Download and Use
• Go to: http://dom4j.org.
• Go to http://dom4j.org/download.html, and
download the latest release (current = 1.4).
• Unzip.
• Don’t forget the classpath. When working in
an IDE, don’t forget to add the log4j.jar
library.
• Javadoc: http://dom4j.org/apidocs/index.html.
• Quick start guide: http://dom4j.org/guide.html.
Opening an XML Document
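import org.dom4j.*;
import org.dom4j.io.SAXReader;   // SAXReader lives in org.dom4j.io

public class Foo {
    // Parses the XML document found at the given system ID (file path or URL).
    public Document parse(String id) throws DocumentException {
        SAXReader reader = new SAXReader();
        Document document = reader.read(id);
        return document;
    }
}
read() accepts: File, URL, InputStream, or a String holding a system ID
(a file path or URL, not the XML text itself)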
Example XML File
<?xml version="1.0" encoding="UTF-8" ?>
<salesdata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="C:\Documents and Settings\eran\My Documents\Academic\Courses\XML\xpath_ass_schema.xsd">
  <year>
    <theyear>1997</theyear>
    <region><name>central</name><sales unit="millions">34</sales></region>
    <region><name>east</name><sales unit="millions">34</sales></region>
    <region><name>west</name><sales unit="millions">32</sales></region>
  </year>
  <year>
    <theyear>1998</theyear>
    <region><name>east</name><sales unit="millions">35</sales></region>
    <region><name>west</name><sales unit="millions">42</sales></region>
  </year>
</salesdata>
Accessing XML Elements
// Requires: import java.util.Iterator;
public void dump(Document document) throws DocumentException {
    // Accessing the root element
    Element root = document.getRootElement();
    // Retrieving the child elements
    for (Iterator i = root.elementIterator(); i.hasNext(); ) {
        Element element = (Element) i.next();
        // Retrieving the element name
        System.out.println(element.getQualifiedName());
        // Retrieving the element text
        System.out.println(element.getTextTrim());
        // Retrieving the text of the child element "theyear"
        System.out.println(element.elementText("theyear"));
    }
}
Accessing XML Elements – cont’d
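• What will be the output of dump()?
year
(empty line)
1997
year
(empty line)
1998
Why? getTextTrim() returns only the text directly inside <year>, which is
just whitespace; the value 1997 belongs to the child element <theyear>.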
Accessing XML Elements Recursively
public void go(Element element, int depth) {
    // Indent according to the depth in the tree
    for (int d = 0; d < depth; d++) {
        System.out.print("  ");
    }
    // Print the element name followed by its own text
    System.out.print(element.getQualifiedName());
    System.out.println(" " + element.getTextTrim());
    // Recurse into the child elements
    for (Iterator i = element.elementIterator(); i.hasNext(); ) {
        Element son = (Element) i.next();
        go(son, depth + 1);
    }
}
What will be the output?
Accessing Recursively – cont’d
salesdata
  year
    theyear 1997
    region
      name central
      sales 34
    region
      name east
      sales 34
    region
      name west
      sales 32
  year
    theyear 1998
    region
      name east
      sales 35
    region
      name west
      sales 42
The whole XML tree: element names + values
Creating an XML document
public Document createDocument() {
    Document document = DocumentHelper.createDocument();
    // Creating the root element
    Element root = document.addElement("phonebook");
    // Adding elements
    Element address1 = root.addElement("address")
        .addAttribute("name", "Yuval")
        .addAttribute("category", "family")
        .addText("Ehud 3, Jerusalem");
    Element address2 = root.addElement("address")
        .addAttribute("name", "Ortal")
        .addAttribute("category", "friends")
        .addText("Kibbutz Givaat Haim");
    return document;
}
What will we get when running go()?
Creating an XML document – cont’d
phonebook
  address Ehud 3, Jerusalem
  address Kibbutz Givaat Haim
XML tree structure of the new document

// Writing the XML document to a file
FileWriter out = new FileWriter("C:\\addresses.xml");
document.write(out);
out.close();   // flush the buffered output and release the file
// Retrieving the XML itself as a string
String xml = document.asXML();
Client Program
public static void main(String[] args) {
    Foo foo = new Foo();
    try {
        // Opening the file
        Document doc = foo.parse("C:\\Documents and Settings\\eran\\"
            + "My Documents\\Academic\\Courses\\XML\\sales.xml");
        // Dumping and printing recursively
        foo.dump(doc);
        foo.go(doc.getRootElement(), 0);
        foo.xpath(doc);
        // Creating a new document
        Document newDoc = foo.createDocument();
        foo.go(newDoc.getRootElement(), 0);
        FileWriter out = new FileWriter("C:\\addresses.xml");
        newDoc.write(out);
        out.close();   // flush the buffered output and release the file
    }
    catch (Exception e) {
        System.out.println(e);
    }
}
XPath - Introduction
• XML Path Language. XPath is a language for
addressing parts of an XML document.
• Enables locating and retrieving nodes, very
much like accessing directories in a file system.
• Limited (but not bad) filtering and querying
abilities.
• Retrieves the actual PCDATA or node sets.
XPath – Simple Path Selection
XPath expression: /salesdata/year/theyear
<theyear>1997</theyear>
<theyear>1998</theyear>
“/” signifies child-of
XPath expression: /salesdata/year[2]/theyear
<theyear>1998</theyear>
Filtering the level –
getting only the second
year element
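The two path expressions can be tried out with a few lines of Java. This is a minimal sketch that assumes the JDK's built-in javax.xml.xpath API instead of dom4j, and an inline copy of the sample sales file without its schema attributes; the class name PathSelectionDemo and the helper select() are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PathSelectionDemo {
    // Inline copy of the sample sales data (schema attributes omitted).
    static final String SALES_XML =
        "<salesdata>"
      + "<year><theyear>1997</theyear>"
      + "<region><name>central</name><sales unit='millions'>34</sales></region>"
      + "<region><name>east</name><sales unit='millions'>34</sales></region>"
      + "<region><name>west</name><sales unit='millions'>32</sales></region>"
      + "</year>"
      + "<year><theyear>1998</theyear>"
      + "<region><name>east</name><sales unit='millions'>35</sales></region>"
      + "<region><name>west</name><sales unit='millions'>42</sales></region>"
      + "</year>"
      + "</salesdata>";

    /** Evaluates the expression and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SALES_XML)),
                XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("/salesdata/year/theyear"));    // [1997, 1998]
        System.out.println(select("/salesdata/year[2]/theyear")); // [1998]
    }
}
```

Note that XPath positions start at 1, so year[2] is the second year element.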
XPath – Conditions
/salesdata/year/region[sales > 34]
Going down to region, and filtering according to the sales element:
<region>
<name>east</name>
<sales unit="millions">35</sales>
</region>
<region>
<name>west</name>
<sales unit="millions">42</sales>
</region>
/salesdata/year/region[sales > 34]/name
?
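One way to answer the question is to run the expression. The sketch below assumes the JDK's javax.xml.xpath API rather than dom4j and an inline copy of the sample data; the class name XPathConditionDemo and the helper select() are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathConditionDemo {
    // Inline copy of the sample sales data (schema attributes omitted).
    static final String SALES_XML =
        "<salesdata>"
      + "<year><theyear>1997</theyear>"
      + "<region><name>central</name><sales unit='millions'>34</sales></region>"
      + "<region><name>east</name><sales unit='millions'>34</sales></region>"
      + "<region><name>west</name><sales unit='millions'>32</sales></region>"
      + "</year>"
      + "<year><theyear>1998</theyear>"
      + "<region><name>east</name><sales unit='millions'>35</sales></region>"
      + "<region><name>west</name><sales unit='millions'>42</sales></region>"
      + "</year>"
      + "</salesdata>";

    /** Evaluates the expression and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SALES_XML)),
                XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The predicate keeps only regions whose sales value exceeds 34;
        // the final step then selects the name child of each such region.
        System.out.println(select("/salesdata/year/region[sales > 34]/name")); // [east, west]
    }
}
```

Both matches come from 1998: the 1997 regions have sales of 34, 34 and 32, none of which is greater than 34.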
XPath – Traveling Up the Tree
/salesdata/year/region[sales > 34]/parent::year/theyear
<theyear>1998</theyear>
Going up the XML tree (and
then down again)
XPath – Traveling Down Fast
/descendant::sales
<sales unit="millions">34</sales>
<sales unit="millions">34</sales>
<sales unit="millions">32</sales>
<sales unit="millions">35</sales>
<sales unit="millions">42</sales>
Going all the way down, until the sales element
//sales
Same same
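The parent and descendant axes can be compared side by side. This is a minimal sketch assuming the JDK's javax.xml.xpath API rather than dom4j, with an inline copy of the sample data; the class name AxisDemo and the helper select() are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AxisDemo {
    // Inline copy of the sample sales data (schema attributes omitted).
    static final String SALES_XML =
        "<salesdata>"
      + "<year><theyear>1997</theyear>"
      + "<region><name>central</name><sales unit='millions'>34</sales></region>"
      + "<region><name>east</name><sales unit='millions'>34</sales></region>"
      + "<region><name>west</name><sales unit='millions'>32</sales></region>"
      + "</year>"
      + "<year><theyear>1998</theyear>"
      + "<region><name>east</name><sales unit='millions'>35</sales></region>"
      + "<region><name>west</name><sales unit='millions'>42</sales></region>"
      + "</year>"
      + "</salesdata>";

    /** Evaluates the expression and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SALES_XML)),
                XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Up: both matching regions share one parent, so the year appears once.
        System.out.println(
            select("/salesdata/year/region[sales > 34]/parent::year/theyear")); // [1998]
        // Down: the explicit axis and the // shorthand select the same nodes.
        System.out.println(select("/descendant::sales")); // [34, 34, 32, 35, 42]
        System.out.println(select("//sales"));            // [34, 34, 32, 35, 42]
    }
}
```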
XPath – Advanced Queries
• The years in which the west region's sales (in millions) were above 32:
//region[name="west" and sales > 32]/sales[@unit='millions']/ancestor::year/theyear
Logical operators: "and" combines conditions inside a predicate.
Accessing attributes: @unit selects the unit attribute.
ancestor is the same as parent, but goes all the way up (here, to year).
<theyear>1998</theyear>
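The combined query can be checked against the sample data. This is a minimal sketch assuming the JDK's javax.xml.xpath API rather than dom4j; the class name AdvancedQueryDemo and the helper select() are hypothetical, and the expression uses single quotes only so that it fits inside a Java string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AdvancedQueryDemo {
    // Inline copy of the sample sales data (schema attributes omitted).
    static final String SALES_XML =
        "<salesdata>"
      + "<year><theyear>1997</theyear>"
      + "<region><name>central</name><sales unit='millions'>34</sales></region>"
      + "<region><name>east</name><sales unit='millions'>34</sales></region>"
      + "<region><name>west</name><sales unit='millions'>32</sales></region>"
      + "</year>"
      + "<year><theyear>1998</theyear>"
      + "<region><name>east</name><sales unit='millions'>35</sales></region>"
      + "<region><name>west</name><sales unit='millions'>42</sales></region>"
      + "</year>"
      + "</salesdata>";

    // Logical operator (and), attribute test (@unit) and the ancestor axis.
    static final String QUERY =
        "//region[name='west' and sales > 32]"
      + "/sales[@unit='millions']/ancestor::year/theyear";

    /** Evaluates the expression and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SALES_XML)),
                XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // West sold 32 in 1997 (not above 32) and 42 in 1998.
        System.out.println(select(QUERY)); // [1998]
    }
}
```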
XPath – Advanced Queries (cont’d)
• The years (text nodes) in which the west region
sales were higher than the east region sales; sales
may be expressed in thousands or in millions:
year[region[name="west"]/sales[@unit='millions'
*1000 or @unit='thousands'] >
region[name="east"]/sales[@unit='millions'
*1000 or @unit='thousands']]/theyear/text()
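In XPath 1.0, the multiplication in @unit='millions' * 1000 binds to the string literal 'millions' rather than to the sales value, so it may help to make the unit conversion explicit. The sketch below is one possible formulation, not the expression from the slide: it uses sum() to convert both sides to thousands before comparing. It assumes the JDK's javax.xml.xpath API rather than dom4j; the class name UnitConversionDemo, the helper select() and the 1999 record (with sales in thousands) are hypothetical additions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnitConversionDemo {
    // Sample sales data plus a hypothetical 1999 record that mixes units.
    static final String SALES_XML =
        "<salesdata>"
      + "<year><theyear>1997</theyear>"
      + "<region><name>central</name><sales unit='millions'>34</sales></region>"
      + "<region><name>east</name><sales unit='millions'>34</sales></region>"
      + "<region><name>west</name><sales unit='millions'>32</sales></region>"
      + "</year>"
      + "<year><theyear>1998</theyear>"
      + "<region><name>east</name><sales unit='millions'>35</sales></region>"
      + "<region><name>west</name><sales unit='millions'>42</sales></region>"
      + "</year>"
      + "<year><theyear>1999</theyear>"
      + "<region><name>east</name><sales unit='thousands'>30000</sales></region>"
      + "<region><name>west</name><sales unit='millions'>31</sales></region>"
      + "</year>"
      + "</salesdata>";

    // Each side is expressed in thousands; sum() of an empty node-set is 0,
    // so only the term matching the actual unit contributes.
    static final String WEST_BEATS_EAST =
        "//year["
      + "sum(region[name='west']/sales[@unit='millions']) * 1000"
      + " + sum(region[name='west']/sales[@unit='thousands'])"
      + " > "
      + "sum(region[name='east']/sales[@unit='millions']) * 1000"
      + " + sum(region[name='east']/sales[@unit='thousands'])"
      + "]/theyear/text()";

    /** Evaluates the expression and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(SALES_XML)),
                XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // 1997: 32000 vs 34000 (no); 1998: 42000 vs 35000 (yes);
        // 1999: 31000 vs 30000 (yes).
        System.out.println(select(WEST_BEATS_EAST)); // [1998, 1999]
    }
}
```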
XPath in dom4j
• XPath queries can be used in dom4j:
// Requires: import java.util.Iterator; import java.util.List;
// dom4j evaluates XPath through the Jaxen library, which must be on the classpath.
public void xpath(Document document) {
    // The XPath expression is fed to the xpathSelector
    XPath xpathSelector =
        DocumentHelper.createXPath("/salesdata/year/theyear");
    // The nodes are selected from the document, according to the XPath query
    List results = xpathSelector.selectNodes(document);
    for (Iterator iter = results.iterator(); iter.hasNext(); ) {
        Element element = (Element) iter.next();
        System.out.println(element.asXML());
    }
}
References - XML
• XML tutorial:
– http://www.w3schools.com/xml/default.asp
• XML Specification from W3C:
– http://www.w3.org/XML/
• The Java/XML Tutorial:
– http://java.sun.com/xml/tutorial_intro.html
• DTD Tutorial:
– http://www.xmlfiles.com/dtd/
• XML Schema Tutorial:
– http://www.w3schools.com/schema/default.asp
• XML Schema Resource Page:
– http://www.w3.org/XML/Schema
dom4j
• Web site:
– http://dom4j.org/
• Javadocs:
– http://dom4j.org/apidocs/index.html
• Quick Start:
– http://dom4j.org/guide.html
• Cookbook (main functionality):
– http://dom4j.org/cookbook.html
XPath
• XPath specification:
– http://www.w3.org/TR/xpath
• XPath tutorial:
– http://www.w3schools.com/xpath/default.asp
• XPath tutorial (extended):
– http://www.zvon.org/xxl/XPathTutorial/General/examples.html
• XPath reference:
– http://www.vbxml.com/xsl/XPathRef.asp