ppt - Internet Database Lab.

XML
SNU OOPSLA Lab.
October 2005
Contents
 Semistructured Data
 Introduction
 History
 XML Application
 DTD & XML Schema
 DOM & SAX
 Summary
 Online Resources
Semistructured Data(1/3)
 Semistructured Data and XML
 Integration of heterogeneous sources
 Data sources with non-rigid structure
 Biological data
 Web data
 Characteristics of Semistructured Data
 Missing or additional attributes
 Multiple attributes
 Different types in different objects
 Heterogeneous collections
 Self-describing, irregular data, no a priori structure
Semistructured Data(2/3)
Data Model
[Figure: Object Exchange Model (OEM) graph. The root Bib (&o1) is a complex object with two paper edges and one book edge leading to &o12, &o24 and &o29. Edges labeled author, title, year, publisher, references, http and page lead to further objects, down to atomic objects such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133.]
Semistructured Data(3/3)
Syntax for Semistructured Data
Bib: &o1 { paper: &o12 { … },
           book: &o24 { … },
           paper: &o29
             { author: &o52 "Abiteboul",
               author: &o96 { firstname: &o243 "Victor",
                              lastname: &o206 "Vianu"},
               title: &o93 "Regular path queries with constraints",
               references: &o12,
               references: &o24,
               pages: &o25 { first: &o64 122, last: &o92 133}
             }
         }
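The OEM syntax above can be mirrored in a few lines of Python. This is only a sketch: the class name `OemObject` and the helper `follow` are made up for illustration, and the contents of &o12 and &o24, which the slide elides, are left empty.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class OemObject:
    """One OEM object: an oid plus either an atomic value or labeled children."""
    oid: str
    value: Union[str, int, None] = None  # set only for atomic objects
    children: List[Tuple[str, "OemObject"]] = field(default_factory=list)

    def add(self, label: str, child: "OemObject") -> "OemObject":
        # Labels may repeat (two "author" edges, two "references" edges).
        self.children.append((label, child))
        return self


def follow(obj: OemObject, *labels: str) -> List[OemObject]:
    """Return every object reachable from obj along the given label path."""
    frontier = [obj]
    for label in labels:
        frontier = [c for o in frontier for (l, c) in o.children if l == label]
    return frontier


# Objects whose contents are elided ("…") in the slide stay empty here.
o12 = OemObject("&o12")
o24 = OemObject("&o24")

vianu = OemObject("&o96")
vianu.add("firstname", OemObject("&o243", "Victor"))
vianu.add("lastname", OemObject("&o206", "Vianu"))

pages = OemObject("&o25")
pages.add("first", OemObject("&o64", 122))
pages.add("last", OemObject("&o92", 133))

o29 = OemObject("&o29")
o29.add("author", OemObject("&o52", "Abiteboul"))
o29.add("author", vianu)
o29.add("title", OemObject("&o93", "Regular path queries with constraints"))
o29.add("references", o12)  # shared object: the data is a graph, not a tree
o29.add("references", o24)
o29.add("pages", pages)

bib = OemObject("&o1")
bib.add("paper", o12)
bib.add("book", o24)
bib.add("paper", o29)

last_names = [o.value for o in follow(bib, "paper", "author", "lastname")]
```

Note that the two authors have different shapes (one atomic, one complex), which is exactly the irregularity the semistructured model is meant to tolerate.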
Introduction(1/4)
 XML
 An acronym for ‘eXtensible Markup Language’
 A meta-language that describes other languages
 A data format for storing structured and semi-structured
text for dissemination and ultimate publication, perhaps on
a variety of media
Introduction(2/4)
 Properties
 Tags enclose identifiable parts of the document
 Self-describing
 Physical/logical structure
 Physical structure : allows components of the document, called entities, to be named and stored separately
 Logical structure : allows a document to be divided into named units and sub-units, called elements
Introduction(3/4)
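[Figure: logical structure versus physical structure. Logically, a document is divided into units and sub-units (elements). Physically, it is divided into entities, which may be internal or stored separately.]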
Introduction(4/4)
XML markup
<warning>
<para> This substance is hazardous to health </para>
<para> See procedure 12A.7 for information on protective clothing required. </para>
<logo …/>
</warning>
<transaction>
<time date="19980509"/>
<amount>123</amount>
<currency type="pounds"/>
<from id="x98765"> J. Smith</from>
<to id="x56565">M. Jones</to>
</transaction>
XML document
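As a rough illustration of why self-describing markup is convenient for programs, the transaction document can be read with Python's standard `xml.etree.ElementTree` module. The function name `summarize` and the shape of the returned dictionary are choices made for this sketch only.

```python
import xml.etree.ElementTree as ET

# The transaction document from the slide, with straight quotes.
TRANSACTION = """\
<transaction>
<time date="19980509"/>
<amount>123</amount>
<currency type="pounds"/>
<from id="x98765"> J. Smith</from>
<to id="x56565">M. Jones</to>
</transaction>"""


def summarize(xml_text):
    """Pull the tagged parts of a transaction into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "date": root.find("time").get("date"),          # attribute value
        "amount": int(root.findtext("amount")),         # element content
        "currency": root.find("currency").get("type"),
        "from": (root.find("from").get("id"), root.findtext("from").strip()),
        "to": (root.find("to").get("id"), root.findtext("to").strip()),
    }


summary = summarize(TRANSACTION)
print(summary)
```

The element and attribute names carry the meaning, so the code never depends on character positions or column order.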
History(1/2)
[Figure: timeline. GM (Generalized Markup), 1960; SGML, 1986; HTML and the WWW, 1992; XML, 1997. The Internet is shown alongside.]
History(2/2)
 1960’s, IBM GML(Generalized Markup Language)
 1980’s, ISO 8879,
SGML(Standard Generalized Markup Language)
 Early 1990’s, HTML(HyperText Markup Language)
 1996, W3C’s XML
 1998, XML 1.0
 1999, RDF(Resource Description Framework)
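XML Application
[Figure: XML application architecture. Components shown: XML document, DTD, Parser, DOM (tree, accessed through the DOM API), SAX (events), XSL Processor, HTML, Browser, DBMS, and applications (ASP, Java, VB).]
 DOM (Document Object Model)
 SAX (Simple API for XML)
 XSL (eXtensible Stylesheet Language)
 ASP (Active Server Pages)
 Data exchange applications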
An XML Document
<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord><issue>
<volume>1</volume>
<number>1</number>
<articles><article>
<title> XML Research Issues</title>
<initPage> 1 </initPage>
<endPage> 5 </endPage>
<authors>
<author AuthorPosition="00"> Tom Hanks </author>
…
</authors></article></articles></issue>
</sigmodRecord>
DTD(1/2)
 DTD(Document Type Definition)
 An optional but powerful feature of XML
 Comprises a set of declarations that define a document
structure tree
 Some XML processors read the DTD and use it to build
the document model in memory
 Establishes formal document structure rules
 It defines the elements and dictates where they may be used in relation to each other
DTD(2/2)
 Declare vs. Define
 Declare  “This document is a concert poster”
 Define  “A concert poster must have the following features”
 A DTD defines
 Element types + Attributes + Entities
 Valid vs. Invalid
 Valid  conforms to the DTD
 Invalid  fails to conform to the DTD
[Figure: valid XML documents are a subset of well-formed XML documents.]
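The outer set in the figure, well-formedness, can be checked without any DTD. A minimal sketch using Python's standard library follows. The function name `is_well_formed` is invented here, and the standard-library parser used does not validate against a DTD, so validity is outside the scope of this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise.

    This checks syntax only (matching tags, quoted attributes, one root).
    It does not check conformance to a DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


GOOD = '<to id="x56565">M. Jones</to>'
BAD_QUOTE = '<to id="x56565>M. Jones</to>'  # attribute value is never closed
BAD_NESTING = "<a><b></a></b>"              # tags overlap instead of nesting

results = [is_well_formed(s) for s in (GOOD, BAD_QUOTE, BAD_NESTING)]
```

A document must pass this kind of check before the question of validity against a DTD even arises.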
XML Schema
 Schema
 W3C standard : specifies the structure of XML documents
 Data types for elements/attributes
 String, int, float
 Unordered sets are also allowed
 Derivation of types is allowed
 Intended as a successor to DTDs
 A schema is itself written in XML, removing the syntactic distinction between DTD and XML
 Richer types compared to DTD
XML Schema Example
<xsd:element name="article" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType><xsd:sequence>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="initPage" type="xsd:string"/>
<xsd:element name="endPage" type="xsd:string"/>
<xsd:element name="author" type="xsd:string"/>
</xsd:sequence></xsd:complexType>
</xsd:element>
DTD
<!ELEMENT article (title,initPage,endPage,author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT initPage (#PCDATA)>
<!ELEMENT endPage (#PCDATA)>
<!ELEMENT author (#PCDATA)>
DOM(1)
 Characteristics
 Hierarchical (tree) object model for XML documents
 Associates a list of children with every node
 Preserves the sequence of the elements in the XML document
[Figure: DOM tree of the XML document, with nodes sigmodRecord, issue, volume, number, articles, title, initPage and endPage.]
DOM(2)
 DOM interfaces
 Node : The base data type of the DOM.
 Element : The vast majority of the objects you’ll deal with
are Elements.
 Attr : Represents an attribute of an element.
 Text : The actual content of an Element or Attr.
 Document : Represents the entire XML document
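The interfaces listed above can be seen in a short sketch using `xml.dom.minidom`, the DOM implementation in Python's standard library. The article fragment is adapted from the earlier sigmodRecord example, and the variable names are illustrative only.

```python
from xml.dom import minidom, Node

ARTICLE = """<article>
<title> XML Research Issues</title>
<initPage> 1 </initPage>
<endPage> 5 </endPage>
<authors><author AuthorPosition="00"> Tom Hanks </author></authors>
</article>"""

doc = minidom.parseString(ARTICLE)                # Document: the whole tree
root = doc.documentElement                        # Element: <article>
author = doc.getElementsByTagName("author")[0]    # Element: <author>
attr = author.getAttributeNode("AuthorPosition")  # Attr: AuthorPosition="00"
text = author.firstChild                          # Text: " Tom Hanks "

# Every node keeps an ordered list of children. Whitespace between
# elements shows up as Text nodes, so filter on nodeType.
child_elements = [
    n.tagName for n in root.childNodes if n.nodeType == Node.ELEMENT_NODE
]
print(child_elements)
```

All five objects are Nodes, which is why the same `nodeType` and `childNodes` attributes work on each of them.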
SAX(1)
 DOM : expensive to materialize for a large XML collection
 Characteristics
 Event-driven : fires an event for every open tag/end tag
 Does not require the whole document to be parsed or held in memory
 Enables custom object model building
[Figure: event-driven parsing. The application creates a document handler and gives it to the parser. While parsing, the parser calls back startDocument(), startElement(), characters(), endElement() and endDocument().]
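The callback sequence can be observed with Python's standard `xml.sax` module, whose `ContentHandler` uses the same method names as the slide. The class name `EventLogger` is made up for this sketch, which simply records each event in order.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record every SAX callback so the event order can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument", None))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.events.append(("characters", content))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def endDocument(self):
        self.events.append(("endDocument", None))


handler = EventLogger()
xml.sax.parseString(b"<amount>123</amount>", handler)

for event in handler.events:
    print(event)
```

No tree is ever built: the application sees each event once and decides for itself what, if anything, to keep.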
SAX(2)
 The SAX 1.0 API defines four interfaces for handling events
 EntityResolver
 DTDHandler
 DocumentHandler
 ErrorHandler
 All of these interfaces are implemented by HandlerBase.
DOM vs SAX(1/3)
 Why use DOM?
 Need to know a lot about the
structure of a document
 Need to move parts of the document
around
 Need to use the information in the
document more than once
 Why use SAX?
 Only need to extract a few elements
from an XML document
DOM vs SAX(2/3)
<book id="1">
<verse>
Sing, O goddess, the anger of Achilles son of Peleus, that brought countless
ills upon the Achaeans. Many a brave soul did it send hurrying down to
Hades, and many a hero did it yield a prey to dogs and vultures, for so were
the counsels of Jove fulfilled from the day on which the son of
Atreus, king of men, and great Achilles, first fell out with one another.
</verse>
<verse>
And which of the gods was it that set them on to quarrel? It was the son of
Jove and Leto; for he was angry with the king and sent a pestilence upon
...
• Doing this with the DOM would take a lot of memory
• SAX API would be much more efficient
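The slide does not say which task it has in mind, so the sketch below assumes the goal is to pull out only the first verse of a long text. A SAX handler can keep just that verse and abandon the parse as soon as it closes. The names `VerseFound`, `FirstVerse` and `first_verse` are invented for this example, and the sample text is shortened from the slide.

```python
import xml.sax


class VerseFound(Exception):
    """Raised by the handler to stop parsing once the verse is complete."""


class FirstVerse(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_verse = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "verse":
            self.in_verse = True

    def characters(self, content):
        if self.in_verse:
            self.chunks.append(content)  # keep only text inside the verse

    def endElement(self, name):
        if name == "verse":
            raise VerseFound  # nothing after this point is parsed


def first_verse(xml_bytes):
    handler = FirstVerse()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except VerseFound:
        pass
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join("".join(handler.chunks).split())


BOOK = b"""<book id="1">
<verse>
Sing, O goddess, the anger of Achilles son of Peleus
</verse>
<verse>
And which of the gods was it that set them on to quarrel?
</verse>
</book>"""

verse = first_verse(BOOK)
```

Memory use depends on the size of one verse, not on the size of the book, which is the efficiency argument the slide makes for SAX.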
DOM vs SAX(3/3)
...
<address>
<name> <first-name>Mary</first-name> <last-name>McGoon</last-name> </name>
<street>1401 Main Street</street> <city>Anytown</city> <state>NC</state> <zip>34829</zip>
</address>
<address>
<name>…..
<street> …..
</address>
<address>
<name>…..
<street> …..
</address>
What if we were parsing an XML document containing 10,000 addresses and wanted to sort them by last name?
DOM would automatically store all of the data.
We could use DOM functions to move the nodes in the DOM tree.
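A small sketch of that idea with `xml.dom.minidom` follows. The wrapper element `addresses`, the two extra names and the helper functions are assumptions made for illustration. Only Mary McGoon comes from the slide.

```python
from xml.dom import minidom


def make_address(first, last):
    return (
        "<address><name><first-name>%s</first-name>"
        "<last-name>%s</last-name></name></address>" % (first, last)
    )


# Hypothetical document: three addresses under an assumed <addresses> root.
ADDRESSES = (
    "<addresses>"
    + make_address("Mary", "McGoon")
    + make_address("Zed", "Zimmer")
    + make_address("Ann", "Abbott")
    + "</addresses>"
)


def last_name(address):
    """Return the text of the <last-name> element inside one <address>."""
    return address.getElementsByTagName("last-name")[0].firstChild.data


def sort_by_last_name(doc):
    parent = doc.documentElement
    addresses = [
        n for n in parent.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "address"
    ]
    for address in sorted(addresses, key=last_name):
        # appendChild on a node already in the tree moves it to the end,
        # so appending in sorted order reorders the children in place.
        parent.appendChild(address)
    return doc


doc = sort_by_last_name(minidom.parseString(ADDRESSES))
ordered = [last_name(a) for a in doc.getElementsByTagName("address")]
```

Sorting needs every address at once, so holding the whole tree in memory is a benefit here, not a cost.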
Summary
 XML
 eXtensible Markup Language
 A data format for storing structured and semi-structured
text
 physical/logical structure
 DTD & XML Schema
 Establish formal document structure rules
 DOM & SAX API
 DOM: Need to know a lot about the structure of a document
 SAX: Need to extract a few elements from an XML document
Online Resources
 XML tutorial
http://www.xml.com
http://www.w3c.org
http://www.w3schools.com/
http://www.xmltraining.com/course-searchxml+online+tutorials
http://xmlfiles.com/



