XML and Data Management
Mary Fernández
Michael Benedikt
Juliana Freire
Arnaud Sahuguet
AT&T Research
Bell Labs - Lucent Technologies
Large-Scale Programming Research
Network Data and Services Research
© 2002 by AT&T and Lucent
Who We Are
• Researchers in industrial labs
• Our companies
  • Make critical use of XML & database technology
  • Do not sell XML products
• Members of W3C XQuery & XPath Groups
• XML users, just like you
We are not here to sell you any products
WWW2002 - Hawaii
Goals of Tutorial
• Help you understand issues related to management of XML
  • Querying and Data Access
  • Publishing
  • Storage
• Help you articulate questions/answers when you talk to vendor/customer
• Help understand issues if you decide to “roll your own” solution
More Goals of Tutorial
For each topic, we will try to answer the following questions:
• What would be ideal solution?
• What are issues you should be aware of?
• What are commercial offerings?
• What are emerging solutions from research?
What This Tutorial Is NOT
• Not detailed study of commercial products
  • Commercial tools are immature, rapidly evolving
  • Product summary will be obsolete tomorrow
• Not research presentation
• Not about latest W3C proposals
  • See W3C track for detailed information
Roadmap
• Introduction (Mary for Arnaud)
  • Why XML?
  • Examples of XML in action
  • Application scenarios
  • What is the XML Data Management Problem?
<Coffee-break/>
• Interfaces and APIs (Mary)
  • Key concepts and processing models
  • Programmatic interfaces: DOM, SAX
  • Query languages: XPath 2.0, XQuery 1.0
<Lunch/>
Roadmap
• Publishing Relational Data in XML (Michael)
  • Goals & problems of publishing
  • Publishing languages
  • Exporting & querying documents
<Coffee-break/>
• Storage (Juliana)
  • Storage strategies
  • Issues, systems, & techniques
Question time = any time!
Thanks to Our Colleagues
• Sihem Amer-Yahia, AT&T Labs
• Jerome Simeon, Lucent – Bell Labs
• Philip Wadler, Avaya Labs
• W3C XSLT & XQuery Working Groups
Introduction
Why XML?
• Lingua franca of the Web
  • Simple, open, widely accepted
• Web’s secret sauce
• Next silver bullet
• <Your favorite motto here/>
XML In Action
SwissProt flat file:
ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (REL. 12, CREATED)
DT   01-OCT-1989 (REL. 12, LAST SEQUENCE UPDATE)
DT   15-DEC-1998 (REL. 37, LAST ANNOTATION UPDATE)
DE   GRANZYME A PRECURSOR (EC 3.4.21.78) (CYTOTOXIC T-LYMPHOCYTE PROTEINASE
DE   1) (HANUKKAH FACTOR) (H FACTOR) (HF) (GRANZYME 1) (CTL TRYPTASE)
DE   (FRAGMENTIN 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   HOMO SAPIENS (HUMAN).
OC   Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
OC   Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=T-CELL;
RX   MEDLINE; 88125000.
RA   GERSHENFELD H.K., HERSHBERGER R.J., SHOWS T.B., WEISSMAN I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T
RT   cell- and natural killer cell-specific trypsin-like serine
RT   protease.";
RL   PROC. NATL. ACAD. SCI. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   SEQUENCE OF 29-53.
RX   MEDLINE; 88330824.
RA   POE M., BENNETT C.D., BIDDISON W.E., BLAKE J.T., NORTON G.P.,
RA   RODKEY J.A., SIGAL N.H., TURNER R.V., WU J.K., ZWEERINK H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. BIOL. CHEM. 263:13215-13222(1988).
RN   [3]
RP   SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   MEDLINE; 89009866.
RA   HAMEED A., LOWREY D.M., LICHTENHELD M., PODACK E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. IMMUNOL. 141:3142-3147(1988).
RN   [4]
RP   SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   MEDLINE; 89035468.
RA   KRAEHENBUHL O., REY C., JENNE D.E., LANZAVECCHIA A., GROSCURTH P.,
RA   CARREL S., TSCHOPP J.;
[…]
The same record in CML (long lines are cut off at the slide edge):
<?xml version="1.0"?>
<cml title="SwissProtTree">
 <molecule title="SwissProt file" scheme="SWISSPROT">
  <string title="ID" scheme="SWISSPROT">GRAA_HUMAN</string>
  <integer title="Length">262</integer>
  <string dictname="AC" title="Accession Number(s)" scheme="SWISSPROT">
  <list title="History">
   <list dictname="DT" title="Revision" scheme="SWISSPROT">
    <string title="Comments">REL. 12, CREATED</string>
    <date>1989-10-01</date>
   </list>
   <list dictname="DT" title="Revision" scheme="SWISSPROT">
    <string title="Comments">REL. 12, LAST SEQUENCE UPDATE</string>
    <date>1989-10-01</date>
   </list>
   <list dictname="DT" title="Revision" scheme="SWISSPROT">
    <string title="Comments">REL. 37, LAST ANNOTATION UPDATE</string>
    <date>1998-12-15</date>
   </list>
  </list>
  <string dictname="DE" title="Description" scheme="SWISSPROT"> GRANZYME
  <string dictname="GN" title="Gene Name(s)" scheme="SWISSPROT">GZMA OR
  <string dictname="OS" title="Organism Species" scheme="SWISSPROT"> HOMO
  <string dictname="OC" title="Organism Classification" scheme="SWISSPROT
  <citation number="1" title="SEQUENCE FROM N.A.">
   <list title="AUTHORS">
    <person>
     <initials>H.K.</initials>
     <surname>GERSHENFELD</surname>
    </person>
    <person>
     <initials>R.J.</initials>
     <surname>HERSHBERGER</surname>
    </person>
    <person>
     <initials>T.B.</initials>
     <surname>SHOWS</surname>
    </person>
    <person>
     <initials>I.L.</initials>
     <surname>WEISSMAN</surname>
    </person>
   </list>
[…]
Detail: the flat-file line
RA   GERSHENFELD H.K., HERSHBERGER R.J., SHOWS T.B., WEISSMAN I.L.;
becomes the <list title="AUTHORS"> element, with one <person> per author.
What Are Benefits?
• Tags
  • Easier for machine & human to parse
• Tree structure
  • Nodes with parent/child relationships
  • Easier to understand
  • Easier to enforce
  • Easier to navigate
What Is XML?
• Just a syntax
  • But a standardized, extensible syntax
• Allows specification of new dialects
• Power comes from related technologies
  • Parsers, schemas, query languages, protocols, application-specific dialects, etc.
XML Standards Landscape
• Schema languages
  • DTDs
  • XML Schemas
• Programming APIs
  • DOM
  • SAX, SAX2
• Query languages
  • XPath
  • XSLT
  • XQuery
• Standards organizations
  • W3C
  • OASIS
Examples of Application Domains
Example documents & queries
XML Application Domains
http://xml.coverpages.org/gen-apps.html
http://www.xml.org/xml/industry_industrysectors.jsp
XML Dialect “pot-pourri”
Extensible Financial Reporting Markup Language (XFRML),
eXtensible Business Reporting Language (XBRL),
MusicXML,
Spacecraft Markup Language (SML),
Bank Internet Payment System (BIPS),
Bioinformatic Sequence Markup Language (BSML),
Biopolymer Markup Language (BIOML),
Open Catalog Format (OCF),
Chemical Markup Language (CML),
Electronic Business XML Initiative (ebXML),
Open Trading Protocol (OTP),
FinXML, Financial Information eXchange protocol (FIX),
RecipeML, CVML,
XML Bookmark Exchange Language (XBEL),
Scalable Vector Graphics (SVG),
NewsML,
DocBook,
Real Estate Listing Markup Language (RELML), . . .
FpML (Finance)
• Complex nesting
• Transaction processing
• Example queries
  • Who are the parties involved in a given contract?
  • When does the contract expire?
  • What are the various components of a contract?
  • What is the total amount of a contract?
BioML (Bioinformatics)
• Large size + complex nesting
  • Nature of genetic sequences
• Annotations (meta-data) describe gene sequence (data)
• Free text for annotations
• Full-text queries for sequence matching
• Hierarchical queries for gene finding & visualization
• Example queries
  • Who sequenced the data?
  • Is there a motif similar to “GTTACCTGGCCAGT” in an intron of the sequence?
HL7 (Healthcare)
• Stream data, e.g., from medical devices
• Patient data in legacy systems
• Transaction processing
• Temporal queries
• Example queries
  • Who is the patient?
  • When did the measurement take place?
  • For the duration of the measurement, what were the max and min value for the following vital signs (…)?
SOAP
• XML envelope for messages
• Headers & bodies are XML documents
<soap:Envelope xmlns:soap='http://www.w3.org/2001/10/soap-envelope'>
  <soap:Header>
    Header goes here
  </soap:Header>
  <soap:Body>
    Request goes here
  </soap:Body>
</soap:Envelope>
• Example queries
  • Extract the header
  • Extract the body
  • Extract messages that expire before 07-May-2002
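The three example queries on the SOAP slide can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration: the `<expires>` header entry, the `<getQuote>` request, and the helper names `split_envelope` and `expires_before` are assumptions made for the example, not part of SOAP or of the tutorial.

```python
from datetime import date
import xml.etree.ElementTree as ET

SOAP_NS = {"soap": "http://www.w3.org/2001/10/soap-envelope"}

# Hypothetical message: <expires> and <getQuote> are illustrative assumptions.
MESSAGE = """\
<soap:Envelope xmlns:soap='http://www.w3.org/2001/10/soap-envelope'>
  <soap:Header><expires>2002-05-06</expires></soap:Header>
  <soap:Body><getQuote symbol='T'/></soap:Body>
</soap:Envelope>"""


def split_envelope(text):
    """Return the (header, body) elements of a SOAP envelope."""
    envelope = ET.fromstring(text)
    return (envelope.find("soap:Header", SOAP_NS),
            envelope.find("soap:Body", SOAP_NS))


def expires_before(text, cutoff):
    """True if the assumed <expires> header holds an ISO date before cutoff."""
    header, _ = split_envelope(text)
    expires = header.findtext("expires") if header is not None else None
    return expires is not None and date.fromisoformat(expires) < cutoff


header, body = split_envelope(MESSAGE)
print(header.tag, body.tag, expires_before(MESSAGE, date(2002, 5, 7)))
```

Even this small case needs namespace handling and a typed comparison on a date, which is why the later slides spend time on typed values.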
Bill (Library of Congress)
• Documents may have arbitrary nesting (section, subsection, etc.)
• Large text fragments
• Complex full-text search
  • Localized, order, proximity, fuzziness, stemming, synonyms, relevance (document scoring)
• Example queries
  • Find bills that contain “water quality” within 6 terms of “EPA”
  • Find bills that contain “homeland security” within <amendment> element
Importance of Schemas
• Standardized vocabulary for application domain
  • Requires deep understanding of application
• Necessary for validation
• Useful for human editing
• Useful for storage
• Useful for query optimization
• Useful for mapping to programming languages (e.g., Java, C#)
Where Does XML fit?
Some use-cases from application vendors and some XML technologies
Web Publishing
Export content as XML
XML to device-specific language
[from Oracle 9i brochures]
XML in Business Automation
Export data as XML
XML to XML transformation
Transport data as XML
Import XML data
[from Oracle 9i brochures]
What is the XML Data Management Problem?
XML Data Management
[Figure: an XML Producer and an XML Consumer exchange XML Documents & Schemas through API or Query Interfaces; Publish turns a Legacy Database into XML; Store puts XML into a Persistent Database]
Challenges
• Flexibility, expressiveness of XML data model
• Diverse applications
  • No one-size-fits-all solution
• Immature technologies
  • No off-the-shelf solution
Problem Dimensions
[Figure: problem dimensions along Data and Application axes]
• Data
  • Structure
  • Schema
  • Extensibility
  • Size
• Application
• Query structure
• Query access patterns
• Stored vs stream-based
• Transactional
<Coffee/>
XML & Data Management
Part I: XML Interfaces
APIs and Languages
IMDB Example: Data
<!ENTITY hollywood "Hollywood">
<imdb>
  <show year="1993"> <!-- Example Movie -->
    <title>Fugitive, The</title>
    <review>
      <suntimes>
        <reviewer>Roger Ebert</reviewer> gives <rating>two thumbs
        up</rating>! A fun action movie, Harrison Ford at his best.
      </suntimes>
    </review>
    <review>
      <nyt>The standard &hollywood; summer movie strikes back.</nyt>
    </review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994"> <!-- Example Television Show -->
    <title>X Files, The</title>
    <seasons>4</seasons>
  </show> . . .
</imdb>
IMDB Example: Schema
<element name="show">
  <complexType>
    <sequence>
      <element name="title" type="xs:string"/>
      <sequence minOccurs="0" maxOccurs="unbounded">
        <element name="review" mixed="true"/>
      </sequence>
      <choice>
        <element name="box_office" type="xs:integer"/>
        <element name="seasons" type="xs:integer"/>
      </choice>
    </sequence>
    <attribute name="year" type="xs:integer" use="optional"/>
  </complexType>
</element>
Key Concepts
• Data Model
  • What features of document are important?
    • Individual characters, synthesized strings
    • Entity references, CDATA sections
    • Comments, processing instructions, namespace nodes
    • Typed values, un-typed (well-formed) values, mixed content
• Schema (a.k.a. Type)
  • Specifies contract between data producers & consumers
    • Types of literal (terminal) data
    • Names of elements & attributes
    • “Vertical” & “horizontal” structure of elements
    • Regular expressions (a la Unix grep) over XML tags
  • Impacts querying
Interface Characteristics
• Expressiveness & Ease-of-use
  • What XML content is accessible?
  • Access method
    • Navigational, streams, declarative
    • Linguistic vs. programmatic
  • What are appropriate applications?
• Flexibility & Completeness
  • Support typed values, un-typed text, mixed content?
  • Support for update?
• Safety
  • Respect schema/type of input document(s)?
  • Guarantee/enforce expected schema/type of output?
Interface Landscape
• Document-Object Model (DOM)
• Simple API for XML (SAX)
• XPath 2.0
• XQuery 1.0
  • Brief comparison of XSLT 2.0 with XQuery 1.0
• Focus: Impact of interfaces on data management
Relational Analogs
XML                              Relational
XSLT 2.0/XQuery 1.0, XPath 2.0   SQL
XPath 2.0 Data Model             Relational Data Model
DOM API, SAX API                 JDBC/ODBC
XML Document                     Relational Database
Generic XML Processing Model
• XML Information Set: per-character, per-entity model of XML document
[Figure: processing pipeline]
XML Document
  => Document Parser (expand entity references, check well-formedness)
  => XML Infoset
  => Document Validator, given DTD or XML Schema (validate data, add type annotations, insert default values)
  => XML Infoset (+ Types)
  => Application/Storage System
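The parser stage of this pipeline can be tried out with Python's standard `xml.etree.ElementTree`, which sits on top of the expat parser. The sketch below covers only the parser stage (entity expansion and well-formedness checking); the standard library has no DTD or XML Schema validator, so the validator stage is left out. The helper name `is_well_formed` and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The entity is declared in an internal DTD subset, so the parser can expand it.
DOC = """<!DOCTYPE imdb [ <!ENTITY hollywood "Hollywood"> ]>
<imdb><nyt>The standard &hollywood; summer movie strikes back.</nyt></imdb>"""


def is_well_formed(text):
    """Parser stage only: True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
expanded = root.findtext("nyt")   # the entity reference has been replaced
print(expanded)
print(is_well_formed("<a><b></a>"))   # mismatched tags are rejected
```

The application sees only the expanded text, not the entity reference. The feature summary later in Part I uses this kind of distinction to separate DOM and SAX from the query languages.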
Navigational Access: DOM
• Language-independent, programmatic API
• Un-typed, object model of document content
• Application requirements
  • Full navigational access to document
  • Dynamic update, add, & delete document content
  Ex: Client-side browser apps; Plumbing of Dynamic HTML
[Figure: XML Document + DTD or XML Schema => DOM Parser, Validator => DOM Instance => Application]
DOM Example
Document
  childNodes
    Element("imdb")
      childNodes
        Element("show")
          attributes
            Attr("year", "1993")
          childNodes (firstChild ... lastChild)
            Element("title")
              childNodes
                Text("Fugitive, The")
            Element("review")
            Element("review")
            Element("box_office")
              childNodes
                Text("183,752,965")
Siblings are linked by next/previousSibling; every node links back to its parent by parentNode.
DOM Characteristics
• Data Model
  • No access to type information
  • Access to everything else, e.g., entity references, CDATA sections, all node kinds
• Query Access
  Ex: Reviews of shows with title “Fugitive, The” in IMDB
  • Implement programmatically by hand
      for s in documentElement.getElementsByTagName("show")
        if (some t in s.getElementsByTagName("title")
            satisfies (t.characters() = "Fugitive, The"))
        then s.getElementsByTagName("review")
  • Use DOM Level 3 (XPath interface)
      /imdb/show[title="Fugitive, The"]/review
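The hand-written DOM query can be made concrete with Python's standard `xml.dom.minidom`. This is a minimal sketch: the sample document is a shortened copy of the IMDB example, and the helpers `text_of` and `reviews_of` are names introduced here for illustration.

```python
from xml.dom.minidom import parseString

IMDB = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review><nyt>The standard Hollywood summer movie strikes back.</nyt></review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994"><title>X Files, The</title><seasons>4</seasons></show>
</imdb>"""


def text_of(node):
    """Concatenate the text-node descendants of a DOM node."""
    return "".join(
        child.data if child.nodeType == child.TEXT_NODE else text_of(child)
        for child in node.childNodes
    )


def reviews_of(document, wanted_title):
    """Return <review> elements of every show whose title equals wanted_title."""
    reviews = []
    for show in document.documentElement.getElementsByTagName("show"):
        titles = show.getElementsByTagName("title")
        if any(text_of(title) == wanted_title for title in titles):
            reviews.extend(show.getElementsByTagName("review"))
    return reviews


doc = parseString(IMDB)
found = reviews_of(doc, "Fugitive, The")
print([text_of(review) for review in found])
```

Every value is plain text here: the DOM gives no hint that `box_office` is meant to be an integer.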
More DOM Characteristics
• No access to schema information
• No type-safety guarantees
  • Atomic values (integers, dates, etc) modeled as un-typed text
  • No guarantee that processing produces valid output
Validate input => Process un-typed objects => Validate output
Streams Access: SAX
• Language-independent, programmatic API
• Stream of un-typed elements, attributes, text
  • Call-backs into application
• Applications
  • Content-based routing of XML messages
    Ex: filter stock quotes, network alerts, …
  • Read-once processing of large documents
    Ex: load XML document into storage system
[Figure: XML Document + DTD or XML Schema => SAX Parser, Validator => SAX Events => Application]
SAX Example
startElement("imdb", null)
startElement("show", ("year", "1993"))
comment("Example Movie")
startElement("title", null)
characters("Fugitive, The")
endElement("title")
startElement("review", null)
startElement("suntimes", null)
startElement("reviewer", null)
characters("Roger Ebert")
endElement("reviewer")
characters(" gives ") ...
startElement("rating", null)
characters("two thumbs up")
endElement("rating")
endElement("suntimes")
endElement("review")
...
SAX Characteristics
• Data Model
  • Supports “document order” access to content
  • Read-only access to un-typed nodes
    • No update-in-place
    • Stream transformation
• Query Access
  • XPath expressions in descendant/following-sibling axes
  • Use automata to preserve state
      startElement(tag, attributes){
        if (tag = "imdb") push("imdb")
        else if (peek() = "imdb" && tag = "show" &&
                 attributes["title"] = "Fugitive, The") then
          push("show")
        else if (peek() = "show" && tag = "review") then
          writeElement("review", attributes)...
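A runnable version of the automaton idea, using Python's standard `xml.sax`, is sketched below. Unlike the slide's pseudocode, it treats the title as a child element (as in the IMDB sample data) and relies on `<title>` preceding `<review>` in document order, which the example schema's sequence implies. The class name `ReviewFilter` and the two sample reviews are assumptions for illustration.

```python
import xml.sax

IMDB = """<imdb>
  <show year="1993"><title>Fugitive, The</title>
    <review>A fun action movie.</review></show>
  <show year="1994"><title>X Files, The</title>
    <review>Spooky.</review></show>
</imdb>"""


class ReviewFilter(xml.sax.ContentHandler):
    """Collect the text of <review> elements in shows with the wanted title."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.stack = []        # open element names: the automaton's state
        self.title = []        # text chunks of the current <title>
        self.matched = False   # has the current <show> matched?
        self.current = None    # text chunks of the review being collected
        self.reviews = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == "show":
            self.matched = False
        elif name == "title":
            self.title = []
        elif name == "review" and self.matched:
            self.current = []

    def characters(self, content):
        if self.stack and self.stack[-1] == "title":
            self.title.append(content)
        if self.current is not None:
            self.current.append(content)

    def endElement(self, name):
        self.stack.pop()
        if name == "title" and self.stack[-1:] == ["show"]:
            self.matched = "".join(self.title) == self.wanted
        elif name == "review" and self.current is not None:
            self.reviews.append("".join(self.current).strip())
            self.current = None


handler = ReviewFilter("Fugitive, The")
xml.sax.parseString(IMDB.encode("utf-8"), handler)
print(handler.reviews)
```

The handler never holds more than the current title and review in memory, which is the point of read-once processing. The cost is that any query needing to look backwards in the document has to buffer by hand.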
More SAX Characteristics
• No access to schema information
• No type-safety guarantees
  • Atomic values (e.g., integers) modeled as un-typed text
  • No guarantee that processing produces valid output
Validate input => Stream of un-typed node events => Validate output
Common Querying Tasks
• Filter, select XML values
  • Navigation, selection, extraction
• Merge, integrate values from multiple XML sources
  • Joins, aggregation
• Transform XML values from one schema to another
  • XML construction
• Programmatic interfaces specify how
• Query languages specify what, not how
  • Provide abstractions for common tasks
  • Easier than programmatic interfaces
Query Languages
• XPath 2.0
  • Common language for navigation, selection, extraction
  • Used in XSLT, XQuery, XPointer, XML Schema, XForms, et al
• XSLT 2.0: XML ⇒ XML, HTML, Text
  • Loosely-typed scripting language
  • Format XML in HTML for display in browser
  • Must be highly tolerant of variability/errors in data
• XQuery 1.0: XML ⇒ XML
  • Strongly-typed query language
  • Large-scale database access
  • Must guarantee safety/correctness of operations on data
• Over time, XSLT & XQuery may both serve needs of many application domains
Query Processing Model
[Figure: XML Document(s) => Parser => Validator, given XML Schema(ta) => Data Model Instance (XPath 2.0 Data Model) => Query Evaluator, given Query => Data Model Instance => Application]
• Query Evaluator
  • (May) type check query
  • Evaluates query on data model instance
• Other models possible
XPath 2.0
Functionality & Features
XPath 2.0 Functionality
• Language
  • Uniform semantics & syntax in XSLT & XQuery
  • Guarantees same syntactic expression has same semantics
  • Navigation, selection, value extraction
  • Arithmetic, logical, comparison expressions
• Data Model
  • Minimal interface necessary to express semantics
  • Sequences of element, attribute, comment, PI, & text nodes & atomic values
  • Access to type information
XPath 2.0 Data Model
Document
  children
    Element("imdb")
      children
        Element("show")
          attributes
            Attribute("year", "1993")
          children
            Element("title")
              children
                xs:string("Fugitive, The")
            Element("review")
            Element("review")
            Element("box_office")
              children
                xs:integer(183,752,965)
Every node links back to its parent by parent.
Filtering
• Simple (forward) navigation & extraction
  • Constraints only on self, children, descendants
  • Return all titles (at any level) in IMDB
      /imdb//title
    Syntactic sugar for:
      root()/child::imdb/descendant-or-self::node()/child::title
• Selection
      //show[year >= 2000]
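For a rough feel of these two expressions, here is a sketch with Python's standard `xml.etree.ElementTree`, which supports only a small XPath subset. Two assumptions to note: the second show is an invented sample record, and the selection tests the `year` attribute (as in the IMDB sample data) where the slide's expression tests a `year` child.

```python
import xml.etree.ElementTree as ET

# Sample data: the second <show> is an illustrative record, not from the slides.
IMDB = ET.fromstring("""<imdb>
  <show year="1993"><title>Fugitive, The</title></show>
  <show year="2000"><title>Gladiator</title></show>
</imdb>""")

# /imdb//title : every <title> descendant of the root <imdb> element.
titles = [title.text for title in IMDB.iter("title")]

# //show[@year >= 2000] : ElementTree paths have no numeric comparison,
# so the predicate is written as ordinary Python.
recent = [
    show.findtext("title")
    for show in IMDB.iter("show")
    if int(show.get("year", 0)) >= 2000
]

print(titles)
print(recent)
```

The explicit `int(...)` conversion is the kind of step XPath 2.0 aims to drive from type information instead.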
Complex Filtering
• Navigate & select siblings, ancestors
  • Constraints on following/preceding siblings, ancestors
  • Used to identify context of a value
      //show/reviewer[following-sibling::rating]
• Document order
  • Often used in document-processing applications
      //surgery[//anesthesia[1] before //incision[1]]
Text Processing
• Full-text operators
      //*[xf:text-contains(., "Russell Crowe")]
  • Span structured content
• Myriad other text functions proposed:
  • Phrase search
  • Support for stop words
  • Proximity searching
    Ex: “Tom Cruise” within two words of “Penelope Cruz”
  • Boolean combinations
  • Ranking, relevance
Variability in XML Data
• Replication, absence of XML values
  • Demands flexible semantics for selection
• Selection
      //show[year >= 2000]
  • Explicit expression:
      //show[some $v in ./child::year satisfies data($v) ge 2000]
• Existence/absence of value
      //show/reviewer[following-sibling::rating]
  • Explicit expression:
      //show/reviewer[not empty(./following-sibling::rating)]
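The existential reading of `//show[year >= 2000]` can be spelled out in a few lines of Python. This is a sketch on invented sample data: show A repeats `year`, show B has none, and the helper name `year_at_least` is an assumption.

```python
import xml.etree.ElementTree as ET

# Illustrative data: A has two <year> children, B has none, C has one.
SHOWS = ET.fromstring("""<shows>
  <show><title>A</title><year>1999</year><year>2001</year></show>
  <show><title>B</title></show>
  <show><title>C</title><year>1993</year></show>
</shows>""")


def year_at_least(show, bound):
    """Mirror of: some $v in ./child::year satisfies data($v) ge bound."""
    return any(int(year.text) >= bound for year in show.findall("year"))


selected = [
    show.findtext("title")
    for show in SHOWS.iter("show")
    if year_at_least(show, 2000)
]
print(selected)
```

A show with no `year` is simply not selected (the quantifier over an empty sequence is false), and a show with several is selected if any one of them qualifies.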
Variability in Schemas
• Documents may contain fragments with strongly typed values & un-typed text
  • Demands flexible, but consistent, semantics
<book isbn="ISBN 10-111">
  <price>45.50</price>
</book>
• Un-typed text
  • Permissive conversion from PCDATA to typed values
      /book/price * 0.07
• Typed values
  • Strict interpretation of typed values
  • Type error is fatal
      /book/@isbn * 0.07
Beyond XPath 2.0
• Limitations
  • Constructing new XML
  • Recursive processing of recursive XML data
• Supported by XSLT & XQuery
• Differences between XSLT & XQuery
  • Safety: XQuery enforces input & output types
  • Compositionality: XQuery maps XML to XML, XSLT maps XML to anything
    • Important feature for XML publishing
• Focus on XQuery
  • XSLT covered elsewhere
XQuery 1.0
Functionality & Features
XQuery 1.0
• Functional, strongly typed query language
• XQuery 1.0 = XPath 2.0 + …
  + A few more expressions
    • for-let-where-return (FLWR) ~ SQL’s SELECT-FROM-WHERE
    • Sort-by
    • XML construction (Transformation)
    • Operators on types (Compile & run-time type tests)
  + User-defined functions
    • Modularize large queries
    • Process recursive data
  + Strong typing
    • Guarantees result value conforms to output type
    • Enforced statically or dynamically
Joins & XML Construction
• Arbitrary nesting of expressions & literal XML
For each actor, return box office receipts of films in which they starred in past 2 years

let $imdb := document("www.imdb.com/imdb.xml")
for $actor in $imdb//actor
let $films :=
  $imdb//show[box_office and @year >= 2000
              and $actor/name = .//actor[@role="star"]/name]
return
  <receipts>
    { $actor }
    <total> { sum($films/box_office) } </total>
  </receipts>

Callouts: Iteration (for), Join (comparison on name), XML Construction (<receipts>), Aggregation (sum)
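To see what the query leaves implicit, the same join and aggregation can be written by hand in Python. This is a sketch, not a translation of XQuery semantics: the sample data and the `<results>` wrapper are invented, and it iterates only over top-level `<actor>` elements, whereas `$imdb//actor` would also visit the copies nested inside shows.

```python
import copy
import xml.etree.ElementTree as ET

# Illustrative data; names and amounts are invented for the example.
IMDB = ET.fromstring("""<imdb>
  <actor><name>Actor One</name></actor>
  <actor><name>Actor Two</name></actor>
  <show year="2000"><box_office>100</box_office>
    <actor role="star"><name>Actor One</name></actor></show>
  <show year="2001"><box_office>250</box_office>
    <actor role="star"><name>Actor One</name></actor>
    <actor role="extra"><name>Actor Two</name></actor></show>
  <show year="1993"><box_office>900</box_office>
    <actor role="star"><name>Actor Two</name></actor></show>
</imdb>""")


def receipts(imdb, since=2000):
    result = ET.Element("results")
    for actor in imdb.findall("actor"):                       # iteration
        name = actor.findtext("name")
        films = [
            show for show in imdb.iter("show")
            if show.find("box_office") is not None
            and int(show.get("year")) >= since
            and any(a.findtext("name") == name                 # join on name
                    for a in show.iter("actor")
                    if a.get("role") == "star")
        ]
        entry = ET.SubElement(result, "receipts")              # construction
        entry.append(copy.deepcopy(actor))
        total = sum(int(f.findtext("box_office")) for f in films)  # aggregation
        ET.SubElement(entry, "total").text = str(total)
    return result


out = receipts(IMDB)
print(ET.tostring(out, encoding="unicode"))
```

The nested loops fix one evaluation strategy; the declarative query leaves the choice of join order and access path to the query processor.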
XML Transformation
• User-defined functions
  • Same expressiveness as XSLT templates + parameters
  • Signatures specify types of arguments & return values
  • Types enforced statically or dynamically

define function show2movie(element show $show)
  returns element movie?
{ // Convert a show (that is a movie) to a movie
  if ($show/box_office) then <movie> { $show/* } </movie>
  else ()
}
let $imdb := document("www.imdb.com/imdb.xml")
return <movies>
  { for $show in $imdb//show return show2movie($show) }
</movies>
Recursive XML Data
• Recursive functions support recursive data
Input:
<Part id="001">
  <Part id="002">
    <Part id="003"/>
  </Part>
  <Part id="004"/>
</Part>
Output:
<PartCt count="2" id="001">
  <PartCt count="1" id="002">
    <PartCt count="0" id="003"/>
  </PartCt>
  <PartCt count="0" id="004"/>
</PartCt>

define function partCount(element Part $p1)
  returns element PartCt
{ <PartCt count="{ count($p1/Part) }" { $p1/@id }> {
    for $p2 in $p1/Part return partCount($p2)
  } </PartCt>
}
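The same recursion can be tried in Python with the standard `xml.etree.ElementTree`. This is a sketch that mirrors the XQuery function; `part_count` is simply a Python spelling of `partCount`.

```python
import xml.etree.ElementTree as ET

PARTS = ET.fromstring(
    '<Part id="001"><Part id="002"><Part id="003"/></Part><Part id="004"/></Part>'
)


def part_count(part):
    """Build <PartCt> for a <Part>: count its direct sub-parts, then recurse."""
    children = part.findall("Part")
    counted = ET.Element("PartCt", {"count": str(len(children)), "id": part.get("id")})
    for child in children:
        counted.append(part_count(child))
    return counted


tree = part_count(PARTS)
print(ET.tostring(tree, encoding="unicode"))
```

The function's shape follows the data's shape: one call per `<Part>`, one `<PartCt>` per call.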
Safety
• Shared schema (S_shared) is contract between producers & consumers
• Producer writes query to transform input data into output data
    D_input : S_input ⇒ Q_producer ⇒ D_output : S_output
• Static Type Checking takes S_input & Q_producer
  • Infers S_output : schema of output data
  • Checks that S_output is “subtype” of S_shared
  • Guarantees D_output : S_shared
Inferring Type of Expression
• Expression
    <titles> { $imdb//title } </titles>
• Static type derived from expression
    <element name="titles">
      <complexType>
        <sequence minOccurs="0" maxOccurs="unbounded">
          <element name="title" type="xs:string"/>
        </sequence>
      </complexType>
    </element>
• Value
    <titles>
      <title>Fugitive, The</title>
      <title>X Files, The</title>
    </titles>
Inferring Type of Expression
• Expression
  • Cannot determine statically that constraint is satisfied
    //show[contains(title, "Fugitive")]
• Inferred type is conservative
    <sequence minOccurs="0" maxOccurs="unbounded">
      <elementref name="show"/>
    </sequence>
• Value
    <show year="1993"> <!-- Example Movie -->
      <title>Fugitive, The</title>
      <review>
        <suntimes>
          <reviewer>Roger Ebert</reviewer> gives <rating>two thumbs
          up</rating>! A fun action movie, Harrison Ford at his best.
        </suntimes>
      </review>
      ...
    </show>
Type-Safe Composition
• Expression
    //show[contains(title, "Fugitive")]
• Inferred type
    <sequence minOccurs="0" maxOccurs="unbounded">
      <elementref name="show"/>
    </sequence>
• Required type
    define function show(element show+ $show) returns …
    <sequence minOccurs="1" maxOccurs="unbounded">
      <elementref name="show"/>
    </sequence>
• Type mismatch raises an error
  • If static typing, error raised when analyzing query
  • If dynamic typing, error raised when evaluating query
Feature Summary
Interface    XML content (what)         How            Update      Safety: input   Safety: output
DOM          Entity refs, string data   Navigational   In-place    Not preserved   Not enforced
SAX          Entity refs, string data   Streams        Transform   Not preserved   Not enforced
XPath 2.0    Typed values               Declarative                Preserved
XSLT 2.0     Typed values               Declarative    Transform   Preserved       Not enforced
XQuery 1.0   Typed values               Declarative    Transform   Preserved       Enforced
Implementor’s Perspective
• Interface: multiple implementation strategies
[Figure: layer stack, top to bottom: XSLT 2.0/XQuery 1.0; XPath 2.0; XPath Data Model; DOM API and SAX API; XML Parser and XML Information Set; XML Document]
• Implementation strategies shown in the figure
  • Implement from scratch
  • Build on existing storage system
  • Special-purpose streams processor
  • Translate into SQL/OQL/LDAP
  • Custom query engine
User’s Perspective
• Appropriate interface depends on
  • Processing behavior of application
  • Requirements of application
    • Safety, update, …
  • Capabilities of underlying storage & query system
• Solutions for Publishing & Storing XML data
  • APIs & Languages complete, yet complex
  • Vendors implement features that make “80/20” for their user base
    • Ex: Support read-only subset of DOM API
  • As customer, must ask whether vendor’s choices appropriate for your applications
References
• DOM
  http://www.w3.org/TR/REC-DOM-Level-1/
• SAX
  http://www.saxproject.org/
• XPath 2.0
  http://www.w3.org/TR/query-datamodel/
  http://www.w3.org/TR/xpath20/
  http://www.w3.org/TR/query-operators/
• XQuery 1.0
  http://www.w3.org/TR/xquery/
XPath 2.0 vs. XPath 1.0
• Consistent semantics for XPath 2.0
  • XML documents with schema
  • Expression raises error or evaluates to unique value
• Ordered sequences of typed node & atomic values
    Expr , union | intersect except Expr
    for Var in Expr return Expr
• Conditional expression
    if Expr then Expr else Expr
• Quantified expressions
    some/every Var in Expr satisfies Expr
<Lunch/>
Extra Slides
XML Test Drive
CARS                  XML
Speed                 Performance
Color                 User Interface
Comfort               Flexibility
Gas consumption       Storage overhead
Safety                Maturity
Fashion/Hip factor    Hype factor
Price                 Price
Just like for cars, it’s hard to get all the features
There are trade-offs: Tutorial will help you make informed decisions about trade-offs
XML & Data Management
Part II: Relational Publishing in XML
• Agenda:
  • Goals and problems of publishing
  • Publishing languages
  • Exporting documents
  • Querying documents
XML: it’s only an interchange format
[Figure: databases feed a “Magical Publishing Box!” whose outputs are labelled XML!, HTML, and Database!]
• Most data is stored in pre-existing databases
  • Will continue to be updated through these interfaces
• Need to provide XML wrappers to export data
• Focus here on relational databases
The “easy” publishing problem
• Export relational data in a canonical format
Actors
LastName    FirstName
Viterelli   Joe
....
...
<Row> <LastName> Viterelli </LastName>
      <FirstName> Joe</FirstName>
</Row>
...
Useful for
• web publishing
• data integration
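The canonical export is mechanical enough to sketch in a few lines of Python with the standard `sqlite3` and `xml.etree.ElementTree` modules. Assumptions in this sketch: an in-memory SQLite table stands in for the relational source, the root element is named after the table (the slide shows only the `<Row>` level), and `export_canonical` is a name made up here.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_canonical(connection, table):
    """Export each row of `table` as <Row> with one child element per column."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        row_element = ET.SubElement(root, "Row")
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column).text = "" if value is None else str(value)
    return root


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Actors (LastName TEXT, FirstName TEXT)")
conn.execute("INSERT INTO Actors VALUES ('Viterelli', 'Joe')")
xml_text = ET.tostring(export_canonical(conn, "Actors"), encoding="unicode")
print(xml_text)
```

The output mirrors the table exactly: flat rows, column names as tags. That is what makes this the “easy” problem; publishing into a predefined nested format is where the mapping becomes non-trivial.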
Example: Oracle 9i XML SQL Utility
• Export
  • Embed single SQL query in XSL stylesheet
  • Emit result in canonical, flat XML
<rowset>
  <row> <lastname> Viterelli </lastname>
        <firstname> Joe </firstname> </row>
  <row> <lastname> Winter </lastname>
        <firstname> Alex </firstname> </row> ...
</rowset>
• Standardization of format proceeding as part of SQLX
• Similar facility available in DB2, SQL Server
  SQL Server:
  SELECT Lname, Fname FROM Actor FOR XML AUTO
Harder problem: Publish This!
LastName    FirstName
Viterelli   Joe
....
Typical Situation: publish in predefined format
<actor>
  <familyname> Viterelli </familyname>
  <firstname> Joseph </firstname>
  <roletype> slow-witted gangsters who are knowledgeable
    about veal</roletype>
  <movies>
    <movie title="Analyze This">
      <character name="Jelly"/>
    </movie>
    <movie title="See Spot Run">
      <character name="Gino Valente"/>
    </movie>
  </movies>
</actor>
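What makes this harder than the canonical export is the nesting: one `<actor>` element pulls together rows from more than one table. The sketch below hand-codes that mapping in Python. The `ROLES` rows, their column layout, and the function name `publish_actor` are assumptions for illustration; the slide shows only the target XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational rows: (LastName, FirstName) and (LastName, Title, Character).
ACTORS = [("Viterelli", "Joseph")]
ROLES = [
    ("Viterelli", "Analyze This", "Jelly"),
    ("Viterelli", "See Spot Run", "Gino Valente"),
]


def publish_actor(last, first, roles):
    """Nest an actor's role rows under one <actor> element in the target format."""
    actor = ET.Element("actor")
    ET.SubElement(actor, "familyname").text = last
    ET.SubElement(actor, "firstname").text = first
    movies = ET.SubElement(actor, "movies")
    for owner, title, character in roles:
        if owner == last:                      # join actor rows to role rows
            movie = ET.SubElement(movies, "movie", {"title": title})
            ET.SubElement(movie, "character", {"name": character})
    return actor


published = [publish_actor(last, first, ROLES) for last, first in ACTORS]
print(ET.tostring(published[0], encoding="unicode"))
```

Hard-coding the mapping like this does not scale beyond one format, which is the motivation for the view definition languages discussed later in this part.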
Publishing XML to a Fixed Interface
• Compared to storing XML data in a relational database:
  • easier: just worry about read access to XML view
  • harder: one less degree of freedom
[Figure: Process XML requests (DOM, etc.) => Vendor Extension or Middleware => Supplier RDBMS; the Create Schema and Store/Update paths drop out of the publishing case]
Perspectives on Publishing
• Application developer:
  - How to describe the map from relational data to XML
  - How much transparency: how much additional effort does querying the document require?
• Relational vendor implementer:
  - Performance of document retrieval
  - Performance of queries
  - Exploitation of underlying relational engine
• Middleware developer perspective (e.g., you):
  - Above, plus how to remain vendor-independent

Ultimate Goal
[Figure: the application issues DOM calls and XQuery requests against a document <Movies> ... </Movies>; a vendor extension or middleware layer answers them from the supplier RDBMS]
• Maintain "common illusion" of storing XML document

View Definition Languages
• Specification
  - How user describes desired translation
  - XML View = mapping from tables to XML
• No standard yet: several languages, even for one vendor!
• Main challenge:
  - accommodate enormous variation in mappings
• In this tutorial, give:
  - A flavor of commercial languages
  - A general framework based on XQuery

Describing Views
Actor(aid, lname, fname) :
  { <001, "Viterelli", "Joe">, … }
Movie(mid, title, year) :
  { <011, "Analyze This", 1999>,
    <032, "See Spot Run", 2001> }
Appearance(aid, mid) :
  { <001, 011>, <001, 032> }

?

<Actor>
  <Lname>Viterelli</Lname>
  <Fname>Joe</Fname>
  <Movie year="1999"> Analyze This </Movie>
  <Movie year="2001"> See Spot Run </Movie>
</Actor>
<Actor>
  <Lname>Winter</Lname>
  <Fname>Alex</Fname>
  <Movie year="1988"> Bill and Ted's Excellent Adventure </Movie>
</Actor>

Approach 1: Universal Relation
IBM DB2: SQL statement giving one big relation
<SQL_stmt>
  SELECT A.lname, A.fname, M.year, M.title
  FROM Movie M, Actor A, Appearance Ap
  WHERE M.mid = Ap.mid AND Ap.aid = A.aid
  ORDER BY A.aid
</SQL_stmt>

A.lname    A.fname  M.year  M.title
Viterelli  Joe      1999    Analyze This
Viterelli  Joe      2001    See Spot Run

+ Formatting template annotated with columns of the universal relation
<element_node Actor>
  <element_node Lname>
    <text_node>
      <Column name="A.lname"/>
    </text_node>
…

Result:
<Actor>
  <Lname>Viterelli</Lname>
  <Fname>Joe</Fname>
  <Movie year="1999"> Analyze This </Movie>
  <Movie year="2001"> See Spot Run </Movie>
</Actor>
<Actor>
  <Lname>Winter</Lname>
  <Fname>Alex</Fname>
  …

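A minimal sketch of the universal-relation idea, assuming SQLite and Python's ElementTree in place of DB2 and its formatting template. The join produces one wide, sorted table; the tagger groups consecutive rows by actor and nests the movies. The wrapper element `Actors`, the extra `A.aid` column and the secondary sort on year are assumptions added so the example is deterministic.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Actor(aid TEXT, lname TEXT, fname TEXT);
CREATE TABLE Movie(mid TEXT, title TEXT, year INTEGER);
CREATE TABLE Appearance(aid TEXT, mid TEXT);
INSERT INTO Actor VALUES ('001', 'Viterelli', 'Joe');
INSERT INTO Movie VALUES ('011', 'Analyze This', 1999), ('032', 'See Spot Run', 2001);
INSERT INTO Appearance VALUES ('001', '011'), ('001', '032');
""")

# The "big relation": one row per (actor, movie) pair, sorted by actor.
UNIVERSAL = """
SELECT A.aid, A.lname, A.fname, M.year, M.title
FROM Movie M, Actor A, Appearance Ap
WHERE M.mid = Ap.mid AND Ap.aid = A.aid
ORDER BY A.aid, M.year
"""

def tag(rows):
    """Turn the sorted universal relation into nested <Actor>/<Movie> elements."""
    root = ET.Element("Actors")
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        _, lname, fname, _, _ = group[0]
        actor = ET.SubElement(root, "Actor")
        ET.SubElement(actor, "Lname").text = lname
        ET.SubElement(actor, "Fname").text = fname
        for _, _, _, year, title in group:
            movie = ET.SubElement(actor, "Movie", year=str(year))
            movie.text = title
    return root

doc = tag(conn.execute(UNIVERSAL).fetchall())
print(ET.tostring(doc, encoding="unicode"))
```

Note that an actor without movies would not appear at all with this inner join, which is the situation that calls for an outer join.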
Universal Relation
• Vendors:
  - IBM DB2 XML Extender
  - SQL Server 2000 Universal XML
• Variability in XML documents → complex universal relations
  - Show actors with no movies → add outer join
  - Merge television actors with movie actors → outer union
• Verbose and cumbersome
  - small documents → tedious
  - large documents → unthinkable
• Close to a particular implementation
• No safety guarantees on output

Approach 2: Annotated Schema
Start with desired output schema, and sprinkle with annotations saying where the data comes from

<xs:element name="Actor">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Lname" type="xs:string" />
      <xs:element name="Fname" type="xs:string" />
      <xs:element ref="Movie" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="Movie">
  <xs:attribute name="Year" type="xs:dateTime" />
</xs:element>

Approach 2: Annotated Schema
Start with desired output schema, and sprinkle with annotations saying where the data comes from

<xs:annotation>
  <xs:appinfo>
    <sql:relationship name="ActorAppear"
                      parent="Actor" parent-key="aid"
                      child="Appearance" child-key="aid" />
    <sql:relationship name="AppearMovie"
                      parent="Appearance" parent-key="mid"
                      child="Movie" child-key="mid" />
  </xs:appinfo>
</xs:annotation>

<xs:element name="Actor" sql:relation="Actor">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Lname" type="xs:string" sql:field="lname" />
      <xs:element name="Fname" type="xs:string" sql:field="fname" />
      <xs:element ref="Movie" sql:relationship="ActorAppear AppearMovie" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="Movie" sql:relation="Movie" sql:field="title">
  <xs:attribute name="Year" type="xs:dateTime" sql:field="year" />
</xs:element>

Schema-driven mapping in SQL Server 2000
The slide build walks through the annotated schema on the previous slide, one callout at a time:
• sql:relation="Actor" on the Actor element: this element is associated with an Actor tuple
• sql:field="lname" and sql:field="fname": text content is filled in with these fields of the tuple
• sql:relationship on the Movie reference: the condition defined in the relationship annotations tells how many Movie elements appear inside this Actor
• sql:relation="Movie" sql:field="title" on the Movie element: Movie has text content taken from this table and field of the associated tuple

Annotated Schemas
• Variations from several vendors:
  - SQL Server 2000 shown previously
  - IBM DB2 DAD RDB_node format
• Pros and cons
  - Compared to universal relation approach:
      Much more modular
      Enable integration of validation and publishing
  - Current versions limited in expressive power:
      Relationships are key/foreign key, not arbitrary SQL
      Lack of support for unions, full outer joins

Emerging Approach
• Transforming tables to XML is similar to transforming XML to XML → should be easier, not harder!
• Why learn two languages?
  - Use canonical XML
  - Plus XQuery
<DB>
  <Actor aid="01" lname="Viterelli" fname="Joe"/>
  <Actor aid="02" lname="Winter" fname="Alex"/>
  …
  <Appearance aid="01" mid="011"/>
  …
  <Movie mid="011" title="Analyze This" year="1999"/>
  …
</DB>

Describing Views with XQuery
for $actor in //Actor
return
  <Actor>
    <Fname> {data($actor/@fname)} </Fname>
    <Lname> {data($actor/@lname)} </Lname>
    { for $actorappearance in //Appearance[@aid=$actor/@aid]
      return
        for $movie in //Movie[@mid=$actorappearance/@mid]
        return
          <Movie year="{$movie/@year}"> {data($movie/@title)} </Movie> }
  </Actor>
• Leverage power of XQuery
  - built-in functions
  - dynamic attribute creation
  - ordering
• Does not imply naïve implementation!

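To make the semantics of the view concrete, here is a rough Python equivalent over the canonical XML, using the standard ElementTree module. It is exactly the naive nested-loop evaluation that the slide says a real system need not use; the function name `actor_view` and the wrapper element `Actors` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

CANONICAL = """
<DB>
  <Actor aid="01" lname="Viterelli" fname="Joe"/>
  <Actor aid="02" lname="Winter" fname="Alex"/>
  <Appearance aid="01" mid="011"/>
  <Movie mid="011" title="Analyze This" year="1999"/>
</DB>
"""

def actor_view(db):
    """Nested-loop evaluation of the Actor view over the canonical <DB> document."""
    out = ET.Element("Actors")
    for actor in db.iter("Actor"):
        a = ET.SubElement(out, "Actor")
        ET.SubElement(a, "Fname").text = actor.get("fname")
        ET.SubElement(a, "Lname").text = actor.get("lname")
        for app in db.iter("Appearance"):
            if app.get("aid") != actor.get("aid"):
                continue
            for movie in db.iter("Movie"):
                if movie.get("mid") == app.get("mid"):
                    m = ET.SubElement(a, "Movie", year=movie.get("year"))
                    m.text = movie.get("title")
    return out

view = actor_view(ET.fromstring(CANONICAL))
print(ET.tostring(view, encoding="unicode"))
```

Unlike the inner-join universal relation, this formulation keeps actors that have no movies, because the outer loop ranges over actors only.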
Remember the Dream!
[Figure: the application issues DOM calls and XQuery requests against a document <Movies> ... </Movies>; a vendor extension or middleware layer answers them from the supplier RDBMS]
• Maintain "common illusion" of storing XML document

Uncoupled Approach
• Use vendor-provided canonical XML plus application-developer-provided XSLT/XQuery
[Figure: Canonical XML → XQuery → Final XML]
• Performance

Fully-Coupled Approach
• Use vendor-provided XML template or postprocessing language
  - Vendor languages currently lack flexibility
  - Support for querying document limited or absent

                Oracle 9i   IBM DB2   MS SQL Server
Flex. Mapping   N           Y-        Y-
Export View     Y           Y         Y
Query View      N           N         XPath

Middleware Approach
• Write general wrapper layer that responds to requests
  - Generating SQL queries at runtime
  - Tagging results
• Focus of rest of this section
  - Gives insight into vendor implementations as well
[Figure: the middleware layer consists of a Query Generator and a Tagger]

Middleware Approach
[Figure: the Query Generator takes a Request and the View Definition, consults query-cost estimates and source capabilities, and emits SQL queries; the Tagger turns the results into XML. The slide marks the generated SQL queries with "?".]

Middleware Systems
• Research systems
  - SilkRoute (AT&T Research)
  - Xperanto (IBM Research)
  - PRATA (Bell Labs)
  - ROLEX (Bell Labs)
• Component within data integration systems
  - E-XMLMedia
  - Enosys
  - Tibco

Generating SQL
• Running example: find actors with last name beginning L-Z, their movies, their agent, and awards
• Goal: push work inside relational engine
• Avoid: wrapper repeatedly interleaving querying and merging of result sets:
    Find $A = SELECT * FROM Actor WHERE lname >= 'L'
    Then for each $a in $A find:
      SELECT M.year, M.title FROM Movie M, Appearance AP
      WHERE AP.aid = $a.aid AND M.mid = AP.mid
    And for each $a in $A find:
      SELECT AW.name, AWD.date FROM Awards AW, Awarded AWD
      WHERE AWD.aid = $a.aid AND AWD.awdid = AW.awdid

Exploring Solution Space
• Write down 'building block' queries
• Show dependencies: attach to tree based on output schema

View tree (edge label, element, query):
<Actor>                    Q1
  1  <Agent>               Q2
  *  <Award> awardname     Q3
  *  <Movie> mtitle        Q4

Q1(actorid, fname, lname) =
  SELECT actorid, fname, lname
  FROM Actor WHERE lname >= 'L'
Q2(agentid, name, actorid) =
  …
Q3(awardyear, awardname, actorid) =
  SELECT AWD.awardyear, AW.awardname, A.actorid
  FROM Awarded AWD, Award AW, Actor A
  WHERE …
Q4(mtitle, actorid) =
  SELECT …
  FROM Actor, Appearance, Movie
  WHERE …

Example Solution
Same view tree and building-block queries Q1 to Q4 as on the previous slide.
"Unified Strategy" =
  (Q1 leftjoin Q2) ∪ (Q1 leftjoin Q3) ∪ (Q1 leftjoin Q4)

Unified Strategies
[Figure: view tree Q1(u) with children Q2(u,v) (edge 1), Q3(u,w) (edge *) and Q4(u,x) (edge *), evaluated as a single result table with columns u, v, w, x]
• Trivial work to tag into XML in one pass down table
• Maximum leverage of query optimizer
• Universal relation views (e.g. SQL Server Universal XML) translate naturally to this
• Output table is wide, deep, and sparse
  - Depending on DB implementation, may be costly in space and time

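A minimal sketch of a unified strategy, assuming SQLite and a simplified, hypothetical schema (`Agent`, `Awarded` and `Appears` each carry an `actorid`, standing in for Q2 to Q4). The three left joins are padded with NULLs and combined with UNION ALL, which plays the role of the outer union; a `branch` column, also an assumption, keeps the rows of one actor in a fixed order so that the tagger needs only one pass.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Actor(actorid INTEGER, fname TEXT, lname TEXT);
CREATE TABLE Agent(agentid INTEGER, name TEXT, actorid INTEGER);
CREATE TABLE Awarded(actorid INTEGER, awardname TEXT);
CREATE TABLE Appears(actorid INTEGER, mtitle TEXT);
INSERT INTO Actor VALUES (1, 'Joe', 'Viterelli'), (2, 'Alex', 'Winter'), (3, 'Al', 'Abel');
INSERT INTO Agent VALUES (10, 'Ann Agent', 1);
INSERT INTO Awarded VALUES (2, 'Best Cameo');
INSERT INTO Appears VALUES (1, 'Analyze This'), (1, 'See Spot Run');
""")

# One wide, sparse table: columns v, w, x are filled by exactly one branch each.
UNIFIED = """
WITH Q1 AS (SELECT actorid, fname, lname FROM Actor WHERE lname >= 'L')
SELECT Q1.actorid, Q1.fname, Q1.lname, 2 AS branch, G.name AS v, NULL AS w, NULL AS x
  FROM Q1 LEFT JOIN Agent G ON G.actorid = Q1.actorid
UNION ALL
SELECT Q1.actorid, Q1.fname, Q1.lname, 3, NULL, W.awardname, NULL
  FROM Q1 LEFT JOIN Awarded W ON W.actorid = Q1.actorid
UNION ALL
SELECT Q1.actorid, Q1.fname, Q1.lname, 4, NULL, NULL, P.mtitle
  FROM Q1 LEFT JOIN Appears P ON P.actorid = Q1.actorid
ORDER BY 1, 4
"""

def tag(rows):
    """Single pass over the sorted table, opening a new <Actor> when the key changes."""
    out, current = [], None
    for actorid, fname, lname, _branch, v, w, x in rows:
        if actorid != current:
            if current is not None:
                out.append("</Actor>")
            out.append(f"<Actor fname={quoteattr(fname)} lname={quoteattr(lname)}>")
            current = actorid
        if v is not None:
            out.append(f"  <Agent>{escape(v)}</Agent>")
        if w is not None:
            out.append(f"  <Award>{escape(w)}</Award>")
        if x is not None:
            out.append(f"  <Movie>{escape(x)}</Movie>")
    if current is not None:
        out.append("</Actor>")
    return out

rows = conn.execute(UNIFIED).fetchall()
lines = tag(rows)
print("\n".join(lines))
```

The sparseness is visible in `rows`: every row carries the actor columns again, and at most one of `v`, `w`, `x` is non-NULL.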
Example Solution
Same view tree and building-block queries Q1 to Q4 as before.
"Fully Partitioned Strategy" =
  Do Q1 to Q4 separately, merge while tagging

Fully-Partitioned Strategies
[Figure: Q1(u), Q2(u,v), Q3(u,w) and Q4(u,x) are run as separate queries, giving four result tables with columns (u), (u, v), (u, w) and (u, x)]
• No outer joins or unions
• Merge-join in one pass in XML generation

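The merge step can be sketched as follows, assuming each query result arrives as a list already sorted on the shared key u (the actor id). The sample rows and the function names are hypothetical, and XML escaping is omitted for brevity.

```python
from itertools import groupby

# Hypothetical result streams for Q1..Q4, each sorted by the key u (first field).
q1 = [(1, "Viterelli"), (2, "Winter")]
q2 = [(1, "Ann Agent")]
q3 = [(2, "Best Cameo")]
q4 = [(1, "Analyze This"), (1, "See Spot Run")]

def children(stream):
    """Group a sorted (u, value) stream by u, yielding (u, [values])."""
    return ((u, [v for _, v in grp]) for u, grp in groupby(stream, key=lambda r: r[0]))

def merge_tag(parents, *tagged_streams):
    """Merge-join the child streams into the parent stream in one pass."""
    cursors = []
    for tag, stream in tagged_streams:
        it = children(stream)
        cursors.append([tag, it, next(it, None)])
    out = []
    for u, name in parents:
        out.append(f"<Actor name='{name}'>")
        for cursor in cursors:
            tag, it, head = cursor
            # Skip child groups whose key is smaller than the current parent.
            while head is not None and head[0] < u:
                head = next(it, None)
            if head is not None and head[0] == u:
                out.extend(f"  <{tag}>{v}</{tag}>" for v in head[1])
                head = next(it, None)
            cursor[2] = head
        out.append("</Actor>")
    return out

result = merge_tag(q1, ("Agent", q2), ("Award", q3), ("Movie", q4))
print("\n".join(result))
```

Each stream is read once and never rewound, which is what makes the approach a one-pass merge-join.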
Searching for Solutions
• Identify solution with partition of the view tree.
• Find 'best' partition: number of partitionings is exponential
[Figure: view tree rooted at <Actor> with nodes <Manager>, <Award>, <Movie>, <Director>, <Gross>, <VideoSales> and <TheatreSales>; each edge is labeled 1 or *]

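Why the number of partitionings is exponential can be seen by brute force: each edge of the view tree is either kept (the two nodes are computed by the same SQL query) or cut (separate queries), so a tree with n edges has 2^n plans. The sketch below enumerates them for a small hypothetical tree; the edge list is an assumption loosely based on the node names in the figure.

```python
from itertools import combinations

# Hypothetical view tree, written as (child, parent) edges.
EDGES = [("Manager", "Actor"), ("Award", "Actor"), ("Movie", "Actor"),
         ("Director", "Movie"), ("Gross", "Movie")]

def partitions(edges):
    """Yield one partition of the tree nodes per subset of kept edges."""
    nodes = {n for e in edges for n in e}
    for k in range(len(edges) + 1):
        for kept in combinations(edges, k):
            parent = {n: n for n in nodes}

            def find(n):
                while parent[n] != n:
                    n = parent[n]
                return n

            # Union the endpoints of every kept edge.
            for child, par in kept:
                parent[find(child)] = find(par)
            groups = {}
            for n in nodes:
                groups.setdefault(find(n), set()).add(n)
            yield sorted(sorted(g) for g in groups.values())

plans = list(partitions(EDGES))
print(len(plans))
```

With all edges cut the plan is fully partitioned; with all edges kept it is the unified strategy. Everything in between is a candidate that a heuristic search has to rank.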
Optimization Algorithms
• Research systems attempt to heuristically solve optimization problem of best partition
  - SilkRoute
  - PRATA
• Use RDBMS as 'oracle' of query cost in evaluating partition
• Schema impact
  - Shallow, no recursion: fixed set of SQL queries to be produced
  - With recursion or large nesting: optimization is an open research problem

Querying a View
• Wrapper layer must respond to requests by generating SQL requests at runtime, tagging results
[Figure: Request (XQuery/XPath) and View Definition → Query Composer → Composed XQuery → SQL Generator (using query-cost estimates and source capabilities) → SQL Queries → Result Tables → Tagger. The slide marks the SQL queries with "?".]

Querying a View
• Advantage of XQuery view specs: easy to compose query with view
[Figure: Request (XQuery/XPath) → Query Composer → Query Generator]

Query Composition
[Figure: query ∘ view = composed query]

View: Movies by Viterelli
for $aid in DB//ActorRow[@lname="Viterelli"]/@aid
return
  <Actor><Fname> Joe </Fname> <Lname> Viterelli </Lname>
    { for $actapp in DB//AppearRow[@aid=$aid]
      for $movie in DB//MovieRow[@mid=$actapp/@mid]
      return <Movie year="{$movie/@year}"> {data($movie/@title)} </Movie> }
  </Actor>

Query: Get each Actor + Movies in 1999
for $act in //Actor return
  <Actor> {$act/Lname}
    { for $movie in $act/Movie[@year=1999]
      return <Movie>{$movie/text()}</Movie> }
  </Actor>

Composed Query: Movies by Viterelli in 1999

Query Composition
[Figure: query ∘ view = composed query]
Composed Query on Canonical XML:
for $aid in DB//ActorRow[@lname="Viterelli"]/@aid
return
  <Actor> <Lname> Viterelli </Lname>
    { for $actapp in DB//AppearRow[@aid=$aid]
      for $movie in DB//MovieRow[@mid=$actapp/@mid and @year=1999]
      return <Movie> {data($movie/@title)} </Movie> }
  </Actor>
• Efficient query composition involves:
  - substitution
  - filtering
  - pattern matching

Issues
• Query composition can generate much smaller queries than a pure export
  - Applications where users generally want only a small portion of the document
• Query composition can generate even larger queries than a pure export
  - complex documents → enormous SQL queries
  - DB optimizers crash
• For XSLT, not clear how to do composition

Publishing as DOM
• Focus of most publishing systems
  - fulfill queries from persistent relational store to serialized XML output.
• One research system, ROLEX, fulfills DOM or query requests against view by returning DOM
[Figure: the application sends a user query/DOM call to the ROLEX middleware, which sits on the DB and returns a virtual DOM]

ROLEX
• Needn't return entire tree:
  - Optimize based on user navigation profile
[Figure: the application sends a user query/DOM call to the ROLEX middleware over the DB; behind the virtual DOM only the local part of the result DOM is present]

Publishing Summary
• Goal: transparent access
• Huge variety of mappings
  - need flexible view definition language
• Application dependence
  - How much transparency do you need?
  - What interfaces do you need supported?
  - vendor support still evolving
  - hand wrapping still common
• Schema dependence
  - consider size of SQL queries generated by wrapper/vendor
  - beware: deep XML views over normalized tables tend to require large joins

XML & Data Management
Part III: Storage

XML Data Management
[Figure: XML producers and XML consumers exchange XML documents and schemas through API or query interfaces. "Publish" exposes a legacy (non-XML) database as XML with a schema; "Store" puts XML and its schema into a persistent (non-XML) database.]

XML Storage Architecture
Logical layer:   XML, accessed through XPath, XQuery, DOM
Physical layer:  relational database, OO database, LDAP, file system, native storage
• XML is storage agnostic

XML Storage Issues
• Data layout
  - Many alternatives!
  - Flexible and dynamic: both values and structure may change
  - Schema may not be available
• Query support
  - FLWR queries: keyword-based searches, SELECT/PROJECT/JOIN, recursion and document construction
• Indexing
  - Support fast access for full-text, value-based and navigation queries
  - Need full-text, value and structural indexes
• General requirements: scalability, recovery, concurrency control, etc.

Storing XML
• XML is flexible: used in different applications
• There is no one-size-fits-all solution
• The best choice depends on application!
• Important questions
  - What is the data like? Flat vs. structured vs. mixed; large vs. small; schema vs. schemaless; ordered vs. unordered
  - What are the queries like? Read-only vs. updates; full-text vs. relational vs. navigation
  - What are the application requirements? Support for transactions; concurrency control; replication; etc.

Agenda
• Storage choices: Overview
• Native XML Databases
  - Issues
  - Systems and Techniques
• Colonial Strategies
  - Issues
  - Systems and Techniques
• XML storage in commercial relational databases
• Summary and remarks

Storage Choices
• Flat streams
• Native
• Colonial

Flat Streams
• Store XML documents as is in text files or CLOBs
  + Fast for storing and retrieving whole documents
  - Query support: limited
      Navigational queries require parsing
      Full-text queries require indexes
  - No localized updates

Colonial
• Re-use existing storage systems
  + Leverage mature systems
  + Simple integration with legacy data
  - Map XML document into underlying structures
      E.g., shred document into flat tables
  - Slow reconstruction of textual representation
  - Query language mismatch
  - Mapping overheads

Native
• New databases designed specifically for XML
  + XML documents stored as is
  + Efficient support for XML queries
  - May need to build new systems from the ground up or adapt existing systems
      Re-design features for XML (isolation, recovery, etc.)
      May have incomplete support for some general data management tasks

Native Storage
• Goal: Build high-performance systems specifically designed to manage XML data
[Figure: XML documents and XML queries/APIs enter an access layer with indexes, which sits on a physical design that maps to disk pages]

Native Approaches
• Re-think the data management problem in light of XML
  - Retool existing systems to handle XML
  - Build systems from scratch
• Problems addressed:
  - XML-specific: data layout, query support and indexing
  - General: access control; transactions; recovery; …

Native Issues: Data Layout
• Requirements
  - Concise representation of documents
  - Efficient support for XML APIs and query languages
  - Ability to update values and structure
• Map trees into physical disk pages
  - Lots of choices
[Figure: "Cluster subtrees". The imdb tree has two show subtrees (title "Fugitive, The", year 1993, box_office 183,752,965; title "Seinfeld", year 1993, seasons 13); each subtree is placed on its own page.]

Data Layout (cont.)
[Figure: "Cluster similar elements". The same imdb tree is laid out so that all title values ("Fugitive, The", "Seinfeld") share one page ("cluster titles") and all year values (1993, 1993) share another ("cluster years"), with the remaining show content (box_office 183,752,965; seasons 13) stored separately.]

Native Issues: Indexing
• A physical layout cannot be optimal for all possible access patterns
  - List the title and year of shows:
      Clustering elements: too many disk accesses
      Clustering shows is best
  - List the titles of shows released after 1994:
      Neither strategy is optimal
• Create additional structures to provide fast access to data
• XML requires different kinds of indexes:
  - Values, structure (navigation), full-text (keyword-based)

Full-Text Indexing
<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review>…</review>
    …
  </show> …
</imdb>

Term      Refs
1993      …
Fugitive  …

• Find all documents where "Fugitive" occurs
• Find the shows where title contains "Fugitive"

XML-Aware Full-Text Indexing
Structure index:
Element  Child of  Refs
show     imdb      &s1, &s2
title    show      &t1, &t2
year     show      &y1, &y2

Value index:
Term      Element
Fugitive  &t1
1993      &y1

[Figure: imdb tree stored across pages, with show &s1 (title &t1 "Fugitive, The", year &y1 1993, box_office 183,752,965) and show &s2 (title &t2 "Seinfeld", year &y2 1994, seasons 13)]

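The difference between a plain and an XML-aware full-text index can be sketched with an inverted index whose postings record the enclosing element as well as the show. This is an illustration only, not any product's index format; the review texts in the sample document are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """
<imdb>
  <show year="1993"><title>Fugitive, The</title><review>Ford is on the run</review></show>
  <show year="1994"><title>Seinfeld</title><review>Not The Fugitive, but funny</review></show>
</imdb>
"""

def build_index(root):
    """Map each lower-cased term to the set of (element tag, show number) where it occurs."""
    index = defaultdict(set)
    for show_no, show in enumerate(root.iter("show"), start=1):
        for elem in show.iter():
            for term in re.findall(r"\w+", elem.text or ""):
                index[term.lower()].add((elem.tag, show_no))
    return index

index = build_index(ET.fromstring(DOC))

# Plain full-text question: where does "Fugitive" occur at all?
anywhere = sorted({show for _, show in index["fugitive"]})
# XML-aware question: which shows have "Fugitive" inside <title>?
in_title = sorted({show for tag, show in index["fugitive"] if tag == "title"})
print(anywhere, in_title)
```

Only the element-qualified postings can separate the two questions: the second show mentions the term in a review but not in its title.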
Native Systems: Summary
                                                    Query support
System    Origin              APIs                  Full Text  XPath  XQuery
Xyleme    Built from scratch  ?                     Yes        Yes    Yes
Natix     Built from scratch  Low-level primitives  N/a        N/a    N/a
Xindice   Built from scratch  XML:DB, XML-RPC       No         Yes    No
eXcelon   OODB                DOM/XSLT              Yes        Yes    Yes
Tamino    Adabas              DOM/SAX               Yes        Yes    Partial
GoXML     Built from scratch  ?                     Yes        Yes    Yes
• Wide variation in supported features

NatiX
• Focus: data layout
  - Efficient storage for trees
  - Minimize I/O for direct access and scanning
  - Support updates
• Low-level storage primitives
  - Primitives to control layout of related elements on disk
  - Support for read/write/insert/delete operations of elements

Xyleme
• Data layout: based on NatiX
• Indexing: sophisticated indexing of text and elements
• Query support: XPath, XQuery, updates
• More than just storage: a data warehouse for XML content
  - Document classification
  - Data/schema integration
  - Web crawling
  - Document monitoring

eXcelon XIS
• Extends ObjectStore, an object-oriented database
• Data layout: stores parsed nodes (accessible via DOM)
• Indexing: value, text, structural
• Query support: DOM, XSLT, XPath, XQuery, updates
• Other features:
  - Arbitrary XML documents: no schema or DTD required
  - Can enforce schemas
  - Triggers; transactions; distributed caching mechanism

Software AG Tamino
• Extends Adabas (nested relations)
• Indexing: full-text, value, structure
• Query support:
  - Full-text search operators
  - Queries return entire document or projection of document: no construction of new XML values
  - DOM and SAX
• Other features:
  - Transactions; triggers; backup/restore; compression
  - Multimedia documents, e.g., graphics, video

Other Native Systems
• Xindice http://xml.apache.org/xindice/
  - Query support: XPath for its query language and XML:DB XUpdate for its update language
  - APIs: XML:DB API for Java development; other languages using an available XML-RPC plugin
• GoXML
  - Query support: XQuery, full text searching; tree insert, replace and delete

Colonial Storage
[Figure: a mapping layer loads XML documents into a colonial store (RDBMS, LDAP, object-oriented) and maps XML queries to colonial queries for access]

Colonial Issues
• Storage design: map XML data model onto storage model
  - XML data model → graph, relations, objects
• Data loading: load XML document into mapped structure
  - XML document → edges, tuples, objects
• Query translation: queries over XML document into queries over mapped document
  - XQuery, XPath → SQL, LDAP, OQL
• Result translation: results into XML

Storing XML in RDBMSs
[Figure: a translation layer on top of a commercial RDBMS.
  Storage design: XML Schema → relational schema (the mapping)
  Data loading: XML docs → tuples
  Query translation: XQuery query → SQL query
  Result translation: relational result → XML results]

Example: Storage Design
[Figure: schema tree. imdb contains show (*) and actor (*); show contains title, year (?), reviews (*) with a wildcard child ("tilde"), and a choice (|) of box_office or seasons]

TABLE Movies
  (show1_id INT,
   title STRING,
   year INT,
   box_office INT)
TABLE TVShows
  (show2_id INT,
   title STRING,
   year INT,
   seasons INT)
TABLE Reviews
  (review_id INT,
   tilde STRING,
   review STRING,
   parent_Show INT)

Example: Data Loading
<imdb>
  <show year="1993"> <!-- Example Movie -->
    <title>Fugitive, The</title>
    <review>
      <suntimes>
        <reviewer>Roger Ebert</reviewer> gives <rating>two thumbs
        up</rating>! A fun action movie, Harrison Ford at his best.
      </suntimes>
    </review>
    <review>
      <nyt>The standard Hollywood summer movie strikes back.</nyt>
    </review>
    <box_office>183,752,965</box_office>
  </show>
  …
</imdb>

INSERT INTO Movies (year, title, show1_id, box_office)
VALUES (1993, 'Fugitive, The', 10927, 183752965)
INSERT INTO Reviews (tilde, reviewer, rating, review, parent_Show)
VALUES ('suntimes', 'Roger Ebert', 'two thumbs up',
        'A fun action movie, Harrison Ford at his best.', 10927)
INSERT INTO Reviews (tilde, review, parent_Show)
VALUES ('nyt', 'The standard Hollywood summer movie strikes back.', 10927)

Example: Query Translation
Find the title, year, box office proceeds and reviews for all 2001 movies

XQuery
for $v in document("imdbdata")/imdb/show
where $v/year = 2001
return ($v/title, $v/year, $v/box_office, $v/reviews)

SQL
SELECT title, year, box_office, review
FROM Movies, Reviews
WHERE show1_id = Reviews.parent_Show
  AND year = 2001
ORDER BY show1_id

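Loading and query translation can be tried end to end with SQLite. The sketch below is an assumption-laden illustration: it shreds a small document into the Movies and Reviews tables of the storage-design example (TVShows is left out), stores the wildcard element name in the `tilde` column, and then runs the translated SQL. The 2001 show and its review are invented sample data.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<imdb>
  <show year="2001"><title>Example Movie</title>
    <review><nyt>Worth seeing.</nyt></review>
    <box_office>1000</box_office></show>
  <show year="1993"><title>Fugitive, The</title>
    <review><nyt>The standard Hollywood summer movie strikes back.</nyt></review>
    <box_office>183,752,965</box_office></show>
</imdb>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies(show1_id INTEGER, title TEXT, year INTEGER, box_office INTEGER);
CREATE TABLE Reviews(review_id INTEGER, tilde TEXT, review TEXT, parent_Show INTEGER);
""")

# Data loading: one Movies row per show, one Reviews row per review.
review_id = 0
for show_id, show in enumerate(ET.fromstring(DOC).iter("show"), start=1):
    conn.execute("INSERT INTO Movies VALUES (?, ?, ?, ?)",
                 (show_id, show.findtext("title"), int(show.get("year")),
                  int(show.findtext("box_office").replace(",", ""))))
    for review in show.iter("review"):
        source = review[0]  # wildcard child, e.g. <nyt>
        review_id += 1
        conn.execute("INSERT INTO Reviews VALUES (?, ?, ?, ?)",
                     (review_id, source.tag, "".join(source.itertext()).strip(), show_id))

# Query translation: the XQuery over /imdb/show becomes a join over the two tables.
rows = conn.execute("""
SELECT title, year, box_office, review
FROM Movies, Reviews
WHERE show1_id = Reviews.parent_Show AND year = 2001
ORDER BY show1_id
""").fetchall()
print(rows)
```

The result is still a flat table; turning it back into nested XML is the separate result-translation step.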
Relational Storage Design
• There are different classes of mappings
  - User-defined: user specifies mapping
  - Generic: fixed mapping, independent of schema and data
  - Data-driven: mapping inferred from data
  - Schema/DTD driven: mapping inferred from DTD or schema
  - Cost-based: mapping inferred from schema, query workload and data

User-Defined Mappings
• Supported by most commercial RDBMSs
• User specifies how to map elements to tables
• Flexible mapping but…
• There are drawbacks:
  - Requires knowledge of XML and relational technology
  - Many different mappings
      Hard to choose the best for an application
  - Data changes → need to update mapping

Generic Mapping: Edge
[Figure: tree of the sample document with node ids &0 to &11 and edge labels show, @year, title, review, box office, suntimes, nytimes and rating]
Edge Table
source  child no.  tag       target
&0      1          show      &1
&0      2          show      &2
&1      1          year      &3
&1      2          title     &4
&1      3          review    &5
&1      4          review    &6
&5      1          suntimes  &7
Value Table
node  value
&3    1994
&4    Fugitive, The
Find titles for all shows
SELECT Value.value
FROM Value, Edge AS E1, Edge AS E2
WHERE E1.tag = 'show' AND E1.target = E2.source
  AND E2.tag = 'title' AND E2.target = Value.node
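A minimal sketch of the Edge mapping, assuming SQLite (via Python's sqlite3) as the relational store and xml.etree.ElementTree as the parser. The shred helper, the integer node ids and the second show ("Example Show") are hypothetical; the table layout and the title query follow the slide, with attributes stored as edges just like child elements.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """
<imdb>
  <show year="1994">
    <title>Fugitive, The</title>
    <review><suntimes>A fun action movie.</suntimes></review>
  </show>
  <show year="2001">
    <title>Example Show</title>
  </show>
</imdb>
"""

def shred(xml_text):
    """Return (edges, values): one edge row per parent/child link, one value row per leaf."""
    edges, values = [], []
    counter = 0

    def new_id():
        nonlocal counter
        counter += 1
        return counter

    def visit(elem, node_id):
        ordinal = 0
        for name, val in elem.attrib.items():   # attributes become edges too
            ordinal += 1
            target = new_id()
            edges.append((node_id, ordinal, name, target))
            values.append((target, val))
        for child in elem:                       # child elements, in document order
            ordinal += 1
            target = new_id()
            edges.append((node_id, ordinal, child.tag, target))
            visit(child, target)
        text = (elem.text or "").strip()
        if text and len(elem) == 0:              # text-only element: store its value
            values.append((node_id, text))

    visit(ET.fromstring(xml_text), 0)            # the root element is node 0
    return edges, values

edges, values = shred(DOC)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edge (source INTEGER, ordinal INTEGER, tag TEXT, target INTEGER)")
conn.execute("CREATE TABLE Value (node INTEGER, value TEXT)")
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?)", edges)
conn.executemany("INSERT INTO Value VALUES (?, ?)", values)

# "Find titles for all shows": a self-join on Edge plus a join with Value.
TITLES_SQL = """
SELECT Value.value
FROM Value, Edge AS E1, Edge AS E2
WHERE E1.tag = 'show' AND E1.target = E2.source
  AND E2.tag = 'title' AND E2.target = Value.node
ORDER BY E1.ordinal
"""
titles = [row[0] for row in conn.execute(TITLES_SQL)]
print(titles)
```

Each extra step in a path expression adds another self-join on Edge, which is the cost the generic mappings pay for ignoring structure.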
Generic Mapping: Tag-Based
[Figure: tree of the sample document with node ids &0 to &11 and edge labels show, @year, title, review, box office, suntimes, nytimes and rating]
Show Table
source  ordinal  target
&0      1        &1
&0      2        &2
Title Table
source  ordinal  target
&1      2        Fugitive, The
Review Table
source  ordinal  target
&1      3        &5
&1      4        &6
Find titles for all shows
SELECT Title.target
FROM Title, Show
WHERE Show.target = Title.source
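A minimal sketch of the tag-based variant, again assuming SQLite via Python's sqlite3. It horizontally partitions a hard-coded list of edge rows into one table per tag and runs the title query from the slide. The integer ids stand in for the &n node labels, and storing leaf values directly in the target column is an assumption read off the Title table above.

```python
import sqlite3

# Edge rows (source, ordinal, tag, target); leaf targets hold the value inline.
EDGE_ROWS = [
    (0, 1, "show", 1),
    (0, 2, "show", 2),
    (1, 1, "year", "1994"),
    (1, 2, "title", "Fugitive, The"),
    (1, 3, "review", 5),
    (1, 4, "review", 6),
]

conn = sqlite3.connect(":memory:")
tables = set()
for source, ordinal, tag, target in EDGE_ROWS:
    table = tag.capitalize()  # show -> Show, title -> Title, ...
    if table not in tables:
        # Tags come from the fixed list above, so formatting them into the DDL
        # is acceptable in this sketch; real code would validate the names.
        conn.execute(f'CREATE TABLE "{table}" (source, ordinal, target)')
        tables.add(table)
    conn.execute(f'INSERT INTO "{table}" VALUES (?, ?, ?)', (source, ordinal, target))

# "Find titles for all shows": the tag tests disappear, the join remains.
titles = [row[0] for row in conn.execute(
    "SELECT Title.target FROM Title, Show WHERE Show.target = Title.source"
)]
print(titles)
```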
Generic Mappings: Summary
• Ignore regularity in structure
• Canonical relational schema
  - Edge: store all edges in one table
  - Attribute: horizontal partition of Edge relation on element tag
• Querying: Requires multi-table joins or self joins for element reconstruction
Schema-Driven: Shared Inlining
<!ELEMENT imdb (show*, …)>
<!ELEMENT show (title, year?, reviews*, (box_office | (episode*, seasons)))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT review (#PCDATA)> …
[Figure: DTD graph with nodes show, title, year, reviews, box_office, episode and seasons, and edge annotations *, ? and |]
Show (ID: Int, year: Str, box_office: Str, seasons: Str)
Reviews (ID: Int, parentID: Int, parentCODE: Str, review: Str)
Episode (ID: Int, parentID: Int, parentCODE: Str)
Title (ID: Int, parentID: Int, parentCODE: Str, title: Str)
Find titles for all shows
SELECT title FROM Show, Title
WHERE Title.parentID = Show.ID
Schema-Driven: Hybrid Inlining
<!ELEMENT imdb (show*, …)>
<!ELEMENT show (title, year?, reviews*, (box_office | (episode*, seasons)))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT review (#PCDATA)> …
[Figure: DTD graph with nodes show, title, year, reviews, box_office, episode and seasons, and edge annotations *, ? and |]
Show (ID: Int, title: Str, year: Str, box_office: Str, seasons: Str)
Reviews (ID: Int, parentID: Int, parentCODE: Str, review: Str)
Episode (ID: Int, parentID: Int, parentCODE: Str, title: Str)
Find titles for all shows
SELECT title FROM Show
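A minimal sketch contrasting the two inlining schemas on the same data, assuming SQLite via Python's sqlite3. The Shared and Hybrid table-name prefixes are hypothetical (both schemas live in one database here), as are the "Example Series" and "Example Episode" rows. The parentCODE = 'show' filter is an addition to the slide's query: it keeps episode titles out of the result when a Show and an Episode share an ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared inlining: title is shared by show and episode, so it is not inlined.
CREATE TABLE SharedShow (ID INTEGER PRIMARY KEY, year TEXT,
                         box_office TEXT, seasons TEXT);
CREATE TABLE SharedTitle (ID INTEGER PRIMARY KEY, parentID INTEGER,
                          parentCODE TEXT, title TEXT);

-- Hybrid inlining: title is shared but neither set-valued nor recursive,
-- so it is inlined into its parent table.
CREATE TABLE HybridShow (ID INTEGER PRIMARY KEY, title TEXT, year TEXT,
                         box_office TEXT, seasons TEXT);

INSERT INTO SharedShow VALUES
  (1, '1993', '183752965', NULL),
  (2, '2001', NULL, '3');
INSERT INTO SharedTitle VALUES
  (1, 1, 'show', 'Fugitive, The'),
  (2, 2, 'show', 'Example Series'),
  (3, 1, 'episode', 'Example Episode');
INSERT INTO HybridShow VALUES
  (1, 'Fugitive, The', '1993', '183752965', NULL),
  (2, 'Example Series', '2001', NULL, '3');
""")

# Shared inlining needs a join to reach the title of each show.
shared_titles = [row[0] for row in conn.execute("""
    SELECT title FROM SharedShow, SharedTitle
    WHERE SharedTitle.parentID = SharedShow.ID
      AND SharedTitle.parentCODE = 'show'
    ORDER BY SharedShow.ID
""")]

# Hybrid inlining answers the same question with a single-table scan.
hybrid_titles = [row[0] for row in conn.execute(
    "SELECT title FROM HybridShow ORDER BY ID"
)]

print(shared_titles, hybrid_titles)
```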
Schema-Driven: Summary
• Use DTD/XML Schema to decompose document
• Shared/Hybrid
  - Rule of thumb: Inline as much as possible to minimize number of joins
  - Shared: do not inline if shared, set-valued, recursive
  - Hybrid: also inline if shared but not set-valued or recursive
• Querying:
  - (+) Fast lookup & reconstruction of inlined elements
  - (-) Reconstruction may require multi-table joins and unions
Data-Driven: STORED
• Schemaless data
• Analyze data and try to infer a schema graph: "mine" data for common (regular) patterns with high support
• Example:
  - Discover from IMDB data that every show has year and title
  - Create a table for show that contains year and title
  - Use generic mapping for patterns that are irregular and have low support
• Querying: use derived mapping definition to automatically translate queries
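A rough sketch of the mining step, assuming that "support" means the fraction of show elements containing a given child tag. This is a deliberate simplification for illustration, not the STORED algorithm itself; the high_support_tags helper, the 0.9 threshold and the two example shows are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """
<imdb>
  <show><title>Fugitive, The</title><year>1993</year>
        <box_office>183,752,965</box_office><review>First example review.</review></show>
  <show><title>Example Series</title><year>2001</year><seasons>3</seasons></show>
  <show><title>Example Movie</title><year>2001</year>
        <review>Second example review.</review><review>Third example review.</review></show>
</imdb>
"""

def high_support_tags(shows, min_support):
    """Return the child tags that occur in at least min_support of the shows."""
    if not shows:
        return []
    counts = Counter()
    for show in shows:
        # Count each tag at most once per show, however often it repeats.
        counts.update({child.tag for child in show})
    total = len(shows)
    return sorted(tag for tag, n in counts.items() if n / total >= min_support)

shows = ET.fromstring(DOC).findall("show")

# Regular part: becomes columns of a Show table.
regular = high_support_tags(shows, min_support=0.9)

# Irregular part: left to a generic mapping such as the Edge table.
all_tags = {child.tag for show in shows for child in show}
irregular = sorted(all_tags - set(regular))

print("Show table columns:", regular)
print("generic mapping for:", irregular)
```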
More mappings…
There are many alternative mappings!
[Figure: schema tree for show with nodes title, year, reviews, tilde, box_office and seasons]
(I) Inline as many elements as possible
TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT)
TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)
(II) Partition reviews table: one for NYT, one for rest
TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT)
TABLE NYTReview (review_id INT, review STRING, parent_Show INT)
TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)
(III) Split Show table into TV and Movies
TABLE Show1 (show1_id INT, title STRING, year INT, box_office INT)
TABLE Show2 (show2_id INT, title STRING, year INT, seasons INT)
TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT)
Mappings and Performance Implications
[Figure: bar chart comparing mappings (I), (II) and (III) on Q1, Q2, Q3, Q4, W1 and W2; vertical axis from 0 to 1.4]
• Performance depends on data, schema and query workload
• A fixed mapping is unlikely to be the best for all applications
The LegoDB Storage Mapping Engine
• Application-driven shredding
• Automatically generates and explores a space of possible mappings
  - Uses information from schema, data statistics and query workload
  - Selects the mapping which has the lowest cost for a given application
• Uses a standard relational optimizer to evaluate cost of mappings
• XQuery automatically translated at runtime
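A toy sketch of the selection step only, assuming each candidate mapping already has an estimated cost per query. The cost figures, query names and workload weights below are hypothetical placeholders; in LegoDB the estimates come from a relational optimizer, not from a hand-written table.

```python
# Hypothetical per-query cost estimates for three candidate mappings.
COSTS = {
    "I":   {"lookup_show": 1.0, "nyt_reviews": 4.0},
    "II":  {"lookup_show": 1.0, "nyt_reviews": 1.5},
    "III": {"lookup_show": 0.6, "nyt_reviews": 4.0},
}

def workload_cost(mapping_costs, workload):
    """Weighted sum of per-query costs; workload maps query name -> weight."""
    return sum(weight * mapping_costs[query] for query, weight in workload.items())

def pick_mapping(costs, workload):
    """Return the name of the mapping with the lowest total workload cost."""
    return min(costs, key=lambda name: workload_cost(costs[name], workload))

# Two hypothetical applications with different query mixes.
lookup_heavy = {"lookup_show": 0.9, "nyt_reviews": 0.1}
review_heavy = {"lookup_show": 0.1, "nyt_reviews": 0.9}

print(pick_mapping(COSTS, lookup_heavy))
print(pick_mapping(COSTS, review_heavy))
```

With these made-up numbers the two workloads select different mappings, which is the reason for searching the space per application instead of fixing one mapping in advance.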
 2002 by AT&T and Lucent
Colonial Techniques: Summary
Strategy                Mapping                  DTD/Schema  Data  Query workload
User defined            Manual                   no          no    no
Generic                 Automatic/fixed          no          no    no
STORED                  Automatic/data-oriented  no          yes   no
Shared/Hybrid Inlining  Automatic/DTD-based      yes         no    no
LegoDB                  Automatic/cost-based     yes         yes   yes
Agenda
• Storage choices: Overview
• Native XML Databases
  - Issues
  - Systems and Techniques
• Colonial Strategies
  - Issues
  - Systems and Techniques
• XML storage in commercial relational databases
• Summary and remarks
Commercial Systems
• Oracle 9i
• IBM DB2
• Microsoft SQL Server
Not a comprehensive survey!
Oracle 9i: Schema Design
• Store XML documents in CLOB (character large objects) or BFILEs
  - Indexing: Full-text
• Canonical mapping into object-relational tables
  - tag names are mapped to column names
  - elements with text-only map to scalar columns
  - elements with sub-elements map to object types
  - list of elements maps to collections
  - Indexing: standard relational
• Hybrid: user-defined
  - Canonical for structured fragments; CLOBs/BFILEs for unstructured fragments
Oracle 9i (cont.)
• Data Loading: multiple mechanisms
  - PL/SQL, custom code (Java, C++), SQL*Loader,…
• Query support:
  - CLOBs: SQL + Oracle Text; XPath
  - Canonical: SQL
  - XSU for publishing results in XML format
  - No support for XQuery
IBM DB2
• User-defined mapping through DAD (Document Access Definition)
• XML Collections: Declarative decomposition of XML into multiple tables
  - Data loading: follows DAD mapping
  - Query support: SQL
• XML Columns: CLOBs + side tables for indexing individual elements
  - Documents stored in XML columns
  - Side tables used for hot elements and attributes
  - Query support: full-text search (DB2 Text Extender); extract, search, update elements and attributes
MS SQL Server
• Edge Table: Edge + inlined scalar values
• Shredded Rowset: Programmatic decomposition of XML into multiple tables
• Annotated schema
• CLOB
• Data loading:
  - Combine INSERT and OPENXML
  - OPENXML: access to XML data as a relational rowset
• Query support:
  - SQL
  - SQL extensions for publishing results as XML (FOR XML clause)
Commercial Systems: Summary
• Data design:
  - CLOBs
  - Fixed canonical mappings
  - User-defined mappings
• Querying:
  - SQL as the main access method to XML documents; no support for XQuery
  - "XML-aware" extensions to SQL, e.g., limited XPath navigation syntax
• Publish results in XML
Commercial Systems: Summary
System      Data Design                                             Loading                                            Query Support
Oracle 9i   CLOB / Canonical OR / User-defined Hybrid               PL/SQL, Java, C++, SQL*Loader                      Full-text, XPath / SQL
DB2         CLOB + side tables / User-defined DAD                   DAD-driven                                         SQL + Full Text / SQL
SQL Server  EDGE / User-defined shredded rowset / Annotated Schema  OpenXML / Annotated schema: bulk load, updategrams  SQL + Full Text / XPath
No support for XQuery, or updates via DOM
Agenda
• Storage choices: Overview
• Native XML Databases
  - Issues
  - Systems and Techniques
• Colonial Strategies
  - Issues
  - Systems and Techniques
• Commercial Solutions
• Summary and remarks
Update Support
• XQuery does not support updates (yet…)
• How to update?
  - Flat streams: overwrite document
  - Colonial: SQL
  - Native: DOM, proprietary APIs
• But how do you know you have not violated schema?
  - Flat streams: re-parse document
  - Colonial: need to understand the mapping and maintain integrity constraints
  - Native: supported in some systems (e.g., eXcelon)
Summary of Storage Techniques
Native                                   Colonial
Support XQuery                           SQL + extension
Performance?                             Overheads for translation - not in commercial systems; supporting order can be expensive
Store and query docs as is               Data/query conversion layer is required
Built from the ground up - less mature   Mature systems
Extensible - no schema or DTD needed     May require changes to schema in order to support new tags
Need translation to interoperate         Easy to interoperate with legacy data
How do I choose the best storage solution for my XML application?
Match application requirements with vendors' supported features
  - Updates via DOM?
  - XPath/XQuery support?
  - Relational interfaces?
  - Distribution, concurrency control, archiving, application-development tools….
References
Native
• Xindice: http://xml.apache.org/xindice
• http://www.xmldb.org/index.html
• Natix: http://www.dataexmachina.de/natix.html
• Carl-Christian Kanne, Guido Moerkotte: Efficient Storage of XML Data. ICDE 2000: 198
• Xyleme: http://www.xyleme.com
• Excelon: http://www.exceloncorp.com
• Tamino: http://www.softwareag.com/tamino
• http://www.xyzfind.com
• GoXML: http://www.xmlglobal.com/prod/db
• XUpdate: http://www.xmldb.org/xupdate/
• XML:DB: http://www.xmldb.org/xapi/
Colonial
• TOX, The Toronto XML engine: http://www.cs.toronto.edu/tox
• Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34 (1999)
• Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314
• Alin Deutsch, Mary F. Fernandez, Dan Suciu: Storing Semistructured Data with STORED. SIGMOD Conference 1999: 431-442
• Philip Bohannon, Juliana Freire, Prasan Roy, Jérôme Siméon: From XML Schema to Relations: A Cost-based Approach to XML Storage. ICDE 2002
• IBM DB2 XML Extender: http://www3.ibm.com/software/data/db2/extenders/xmlext/library.html
• Oracle XML DB: http://technet.oracle.com/tech/xml/content.html
• Informix: http://www.informix.com/xml/
• SQL Server: http://www.microsoft.com/sql/techinfo/xml/default.asp