CSE 636 Data Integration XML Semistructured Data

CSE 636
Data Integration
XML
Semistructured Data
Document Type Definitions
Semistructured Data
• Another data model, based on trees
• Motivation: flexible representation of data
– Often, data comes from multiple sources with
differences in notation, meaning, etc.
• Motivation: sharing of documents among
systems and databases
2
Graphs of Semistructured Data
• Nodes = objects
• Labels on arcs (attributes, relationships)
• Atomic values at leaf nodes (nodes with no
arcs out)
• Flexibility: no restriction on:
– Labels out of a node
– Number of successors with a given label
3
Example: Data Graph
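[Figure: data graph. The root has arcs labeled beer, bar, and beer.
Arc labels in the graph: name, manf, prize, servedAt, addr, year, award.
Leaf values: Bud, A.B., M’lob, Joe’s, Maple, 1995, Gold.
Callouts mark the beer object for Bud and the bar object for Joe’s Bar.]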
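The graph model can be made concrete as a list of labeled arcs. The following is only a sketch: the node identifiers (`bud`, `mlob`, `joes`, `prize1`), the direction of the `servedAt` arc, and the shared `A.B.` manufacturer leaf are assumptions, because the figure survives only as a set of labels.

```python
# Labeled arcs (source, label, target). Node identifiers such as "bud" or
# "joes" are hypothetical; a target with no outgoing arcs is an atomic value.
ARCS = [
    ("root", "beer", "bud"),
    ("root", "beer", "mlob"),
    ("root", "bar", "joes"),
    ("bud", "name", "Bud"),
    ("bud", "manf", "A.B."),
    ("bud", "prize", "prize1"),
    ("bud", "servedAt", "joes"),    # assumed direction: beer -> bar
    ("prize1", "year", "1995"),
    ("prize1", "award", "Gold"),
    ("mlob", "name", "M'lob"),
    ("mlob", "manf", "A.B."),       # assumed: same manufacturer as Bud
    ("joes", "name", "Joe's"),
    ("joes", "addr", "Maple"),
]


def successors(node, label):
    """Return every target reached from `node` by an arc labeled `label`."""
    return [t for (s, l, t) in ARCS if s == node and l == label]


def is_leaf(node):
    """A leaf has no outgoing arcs, so it carries an atomic value."""
    return all(s != node for (s, _, _) in ARCS)


print(successors("root", "beer"))   # two successors with the same label
print(successors("mlob", "prize"))  # a label may simply be absent
```

Nothing in the structure fixes which labels leave a node or how many successors share a label, which is the flexibility the model promises.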
4
XML
HTML
• Uses tags for formatting the presentation
(e.g., “italic”)
• Hard for applications to process
XML = Extensible Markup Language
• Uses tags for semantics
(e.g., “this is an address”)
– Similar to labels in semistructured data
• Allows you to invent your own tags
• Easy for applications to process
5
HTML vs. XML
<html>
<body>
<h1> Bibliography </h1>
<p>
<i>Foundations of Databases</i>
Abiteboul, Hull, Vianu
<br/> Addison Wesley, 1995
</p>
<p>
<i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br/> Morgan Kaufmann, 1999
</p>
</body>
</html>
<?xml version = "1.0" standalone = "yes" ?>
<bibliography>
<book>
<title>Foundations of Databases</title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
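To illustrate why semantic tags are easy for applications to process, here is a minimal sketch that reads the bibliography with Python's standard `xml.etree.ElementTree` module. Only the first book is included, since the rest of the slide's document is elided; the dictionary layout is an arbitrary choice for illustration.

```python
import xml.etree.ElementTree as ET

BIB = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(BIB)

# The tags say what each piece of text means, so no layout guessing is needed.
books = [
    {
        "title": book.findtext("title").strip(),
        "authors": [a.text.strip() for a in book.findall("author")],
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
print(books)
```

Extracting the same fields from the HTML version would require guessing that italic text is a title and that the year follows a comma.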
6
Why XML is of Interest to Us
• XML is just syntax for data
– Note: we have no syntax for relational data
– But XML is not relational: semistructured
• This is exciting because:
– Can translate any data to XML
– Can ship XML over the Web (HTTP, SOAP)
– Can input XML into any application
– Thus: data sharing and exchange on the Web
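As a sketch of "translate any data to XML", the function below turns relational rows into an XML tree. The function name `rows_to_xml`, the wrapper tag `people`, and the choice to drop NULL values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(root_tag, row_tag, columns, rows):
    """Build one <row_tag> element per row, with one child per column."""
    root = ET.Element(root_tag)
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            if value is not None:  # a NULL simply becomes a missing element
                ET.SubElement(elem, column).text = str(value)
    return root


people = rows_to_xml("people", "person", ["name", "phone"],
                     [("John", 1234), ("Joe", None)])
xml_text = ET.tostring(people, encoding="unicode")
print(xml_text)
```

The resulting string can be shipped over HTTP as is; the receiver needs no knowledge of the relational schema to parse it.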
7
XML Data Sharing and Exchange

[Figure: XML data is exchanged over the Web (HTTP, SOAP), where it is
transformed and integrated. Boxes in the diagram: Applications (twice),
XML DB, Warehouse, Relational DB, Web Site, Web Service.]
8
XML Tags & Elements
• Tags: book, title, author, …
– XML tags are case sensitive
• Tags, as in HTML, are normally matched pairs
– <book> … </book>
– Start tag: <book>, End tag: </book>
• Elements: everything between tags
– Example 1: <title>Foundations of Databases</title>
– Example 2: <book>
<title>Foundations of Databases</title>
</book>
• Elements may be nested arbitrarily
• Empty element: <book></book>
– Abbreviation <book/>
9
XML Attributes
<book price = "55" currency = "USD">
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
• Attributes are alternative ways to represent data
10
Replacing Attributes with Elements
<book>
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
<price> 55 </price>
<currency> USD </currency>
</book>
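The rewrite from attributes to elements is mechanical, as the following sketch suggests. The helper name `attributes_to_elements` is hypothetical, and the book is shortened to a title and a year.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(elem):
    """Move every attribute of `elem` into a child element of the same name."""
    for name, value in list(elem.attrib.items()):
        ET.SubElement(elem, name).text = value
        del elem.attrib[name]
    return elem


book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title><year>1995</year></book>"
)
attributes_to_elements(book)
print(ET.tostring(book, encoding="unicode"))
```

The reverse direction is not always possible: an attribute holds a single string, so nested or repeated elements cannot be folded back into attributes.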
11
Elements vs. Attributes
• Too many attributes make documents hard to
read
• Attributes do not specify document structure
• Attributes are good for simple information
12
More XML: CDATA Section
• Syntax: <![CDATA[ .....any text here...]]>
• Example:
<example>
<![CDATA[ some text here </notAtag> <>]]>
</example>
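A quick check with Python's `xml.etree.ElementTree` (used here only as a convenient parser) suggests how a CDATA section is treated: everything inside is plain character data, even text that looks like markup.

```python
import xml.etree.ElementTree as ET

doc = "<example><![CDATA[ some text here </notAtag> <>]]></example>"
elem = ET.fromstring(doc)

# The parser hands back the raw characters; </notAtag> is not a tag.
text = elem.text
print(repr(text))
print(len(elem))  # number of child elements
```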
13
More XML: Entity References
• Syntax: &entityname;
• Example:
<element> this is less than &lt; </element>
• Some entities:
&lt;    <
&gt;    >
&amp;   &
&apos;  '
&quot;  "
&#38;   Unicode char
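In practice, entity references are usually produced and resolved by library code rather than by hand. A small sketch using Python's `xml.sax.saxutils` and `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "this is less than <"
escaped = escape(raw)                # replaces &, < and > by entity references
elem = ET.fromstring("<element>" + escaped + "</element>")

# The parser resolves predefined entities and numeric character references.
resolved = ET.fromstring("<e>&apos;&quot;&#38;</e>").text
print(escaped, "|", elem.text, "|", resolved)
```

`unescape` performs the inverse of `escape` for the three entities `&lt;`, `&gt;` and `&amp;`.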
14
More XML: Comments
• Syntax <!-- .... Comment text... -->
• Yes, they are part of the data model !!!
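One way to see comments in the data model is to parse with a DOM implementation, which exposes them as nodes. The sketch below uses Python's `xml.dom.minidom`; the sample document is made up for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<data><!-- a comment --><name>Mary</name></data>")
children = doc.documentElement.childNodes

# The comment is a sibling of <name>, with its own node type.
kinds = [child.nodeName for child in children]
comment_text = children[0].data
print(kinds, repr(comment_text))
```

Not every API keeps them: some tree builders discard comments by default, so code that must preserve them should check its parser's settings.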
15
XML Semantics: a Tree !
<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>
[Figure: the same document drawn as a tree.
Element nodes: data, person, name, address, street, no, city, phone.
Attribute node: age, with value 25.
Text nodes: Mary, Maple, 345, Seattle, John, Thai, 23456.]
• Order matters!!!
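The tree reading of the document can be sketched with a short traversal that lists element, attribute, and text nodes in document order. The function name `outline` is hypothetical, and whitespace-only text is skipped for readability.

```python
import xml.etree.ElementTree as ET

DATA = """<data>
  <person age="25">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address>Thailand</address>
    <phone> 23456 </phone>
  </person>
</data>"""


def outline(elem, depth=0):
    """Yield (depth, kind, label) for each node, in document order."""
    yield depth, "element", elem.tag
    for name, value in elem.attrib.items():
        yield depth + 1, "attribute", f"{name}={value}"
    if elem.text and elem.text.strip():
        yield depth + 1, "text", elem.text.strip()
    for child in elem:
        yield from outline(child, depth + 1)


nodes = list(outline(ET.fromstring(DATA)))
for depth, kind, label in nodes:
    print("  " * depth + f"{kind}: {label}")
```

Because children are visited in the order they appear, swapping the two person elements yields a different tree.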
16
Well-Formed XML
• Start the document with a declaration,
surrounded by <?xml … ?>
• Normal declaration is:
<?xml version = "1.0" standalone = "yes" ?>
– "Standalone" = "no DTD provided"
• Has single root element surrounding nested
elements
• Has matching tags
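Any XML parser checks these rules, so a well-formedness test can be sketched as "try to parse and see whether it fails". The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed('<?xml version="1.0" standalone="yes"?><a><b/></a>'))
print(is_well_formed("<a><b></a></b>"))   # tags do not nest properly
print(is_well_formed("<a/><b/>"))         # more than one root element
```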
17
XML Data
• XML is self-describing
• Schema elements become part of the data
– Relational schema: person(name, phone)
– In XML <person>, <name>, <phone> are part of
the data, and are repeated many times
• Consequence: XML is much more flexible
• XML = semistructured data
– Well-Formed XML with nested tags is exactly the
same idea as trees of semistructured data
– XML also enables nontree structures, as does the
semistructured data model
18
XML is Semistructured Data
• Missing attributes:
<person>
<name> John</name>
<phone>1234</phone>
</person>
<person>
<name>Joe</name> <!-- no phone! -->
</person>
• Could represent in a table with nulls:
name   phone
John   1234
Joe    (null)
XML is Semistructured Data
• Repeated attributes
<person> <name>Mary</name>
<phone>2345</phone>
<phone>3456</phone> <!-- two phones! -->
</person>
• Impossible in tables:
name   phone
Mary   2345   3456   ???
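The mismatch with tables shows up as soon as one tries to flatten such data. The sketch below, with a hypothetical `people` wrapper element and helper `to_rows`, emits one row per phone and a `None` where the phone is missing.

```python
import xml.etree.ElementTree as ET

PEOPLE = """<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>"""


def to_rows(root):
    """One (name, phone) row per phone; a person without a phone gets None."""
    rows = []
    for person in root.findall("person"):
        name = person.findtext("name")
        phones = [p.text for p in person.findall("phone")] or [None]
        rows.extend((name, phone) for phone in phones)
    return rows


rows = to_rows(ET.fromstring(PEOPLE))
print(rows)
```

Mary now occupies two rows, so the single-table version has to repeat her name or move phones to a separate table; the XML form needs neither.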
20
XML is Semistructured Data
• Attributes with different types in different
objects
<person>
<name> <!-- structured name! -->
<first>John</first>
<last>Smith</last>
</name>
<phone>1234</phone>
</person>
• Nested collections (no 1NF)
• Heterogeneous collections:
– <db> contains both <book>s and
<publisher>s
21
Document Type Definition (DTD)
• Part of the original XML specification
• An XML document may have a DTD
• Valid XML: if it has a DTD and conforms to it
• Validation is useful in data exchange
22
Very Simple DTD
<!DOCTYPE db [
<!ELEMENT db        ((book|publisher)*)>
<!ELEMENT book      (title,author*,year?)>
<!ELEMENT title     (#PCDATA)>
<!ELEMENT author    (#PCDATA)>
<!ELEMENT year      (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
]>
23
DTD: The Content Model
• Content model: <!ELEMENT tag (CONTENT)>
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*
24
DTD: Regular Expressions
• sequence
DTD: <!ELEMENT name (firstName, lastName)>
XML: <name> <firstName>…</firstName> <lastName>…</lastName> </name>
• optional
DTD: <!ELEMENT name (firstName?, lastName)>
XML: <name> <lastName>…</lastName> </name>
• zero or more
DTD: <!ELEMENT person (name, phone*)>
XML: <person> <name>…</name> <phone>…</phone> <phone>…</phone> </person>
• one or more
DTD: <!ELEMENT person (name, phone+)>
XML: <person> <name>…</name> <phone>…</phone> </person>
• alternation
DTD: <!ELEMENT person (name, (phone|email))>
XML: <person> <name>…</name> <email>…</email> </person>
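Since an element-content model is a regular expression over child names, it can be checked with an ordinary regex engine. The sketch below is one possible translation, not how real validators are implemented; the helper names are hypothetical, and it handles only element content (not #PCDATA, EMPTY, ANY, or mixed content).

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate a DTD element-content model into a compiled regex."""
    compact = re.sub(r"\s+", "", model)            # drop layout whitespace
    # Each child name becomes a token ending in one space; ?, *, + and |
    # already mean the same thing in regex syntax.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    return re.compile(pattern.replace(",", ""))    # sequence = concatenation


def children_match(model, child_names):
    """True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(children_match("(firstName?, lastName)", ["lastName"]))
print(children_match("(name, phone+)", ["name"]))
print(children_match("(name, (phone|email))", ["name", "email"]))
```

Terminating every name with a space keeps names such as `name` and `firstName` from being confused when the sequence is matched as one string.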
25
DTD: Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED
height CDATA #IMPLIED>
<person age="25" height="6">
<name> ...</name>
...
</person>
26
DTD: Attributes
<!ATTLIST tag (name type kind)+>
Types:
• CDATA = string
• (Mon | Wed | Fri) = enumeration
• ID = key
• IDREF = foreign key
• IDREFS = foreign keys separated by space
• others = rarely used
Kind:
• #REQUIRED
• #IMPLIED = optional
• "value" = default value
• #FIXED "value" = the only value allowed
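The four kinds can be read as rules a validating parser applies to each start tag. The sketch below mimics those rules over a hypothetical in-memory form of an ATTLIST; the `day` and `version` attributes are invented to exercise the default and #FIXED cases, and the string `"default"` is just a marker used by this sketch.

```python
# name -> (type, kind, value); an enumeration type is written as a tuple.
PERSON_ATTLIST = {
    "age": ("CDATA", "#REQUIRED", None),
    "height": ("CDATA", "#IMPLIED", None),
    "day": (("Mon", "Wed", "Fri"), "default", "Mon"),   # hypothetical
    "version": ("CDATA", "#FIXED", "1.0"),              # hypothetical
}


def apply_attlist(attlist, attrs):
    """Return `attrs` with declared defaults filled in, or raise ValueError."""
    result = dict(attrs)
    undeclared = sorted(set(result) - set(attlist))
    if undeclared:
        raise ValueError(f"undeclared attributes: {undeclared}")
    for name, (atype, kind, value) in attlist.items():
        if name not in result:
            if kind == "#REQUIRED":
                raise ValueError(f"missing required attribute {name!r}")
            if kind in ("#FIXED", "default"):
                result[name] = value        # the declared value is supplied
            continue
        if kind == "#FIXED" and result[name] != value:
            raise ValueError(f"{name!r} must equal {value!r}")
        if isinstance(atype, tuple) and result[name] not in atype:
            raise ValueError(f"{name!r} must be one of {atype}")
    return result


filled = apply_attlist(PERSON_ATTLIST, {"age": "25"})
print(filled)
```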
27
XML: IDs and References
• Attributes can be pointers from one object to
another
– Compare to HTML's NAME = "foo" and HREF = "#foo"
• Allows the structure of an XML document to
be a general graph, rather than just a tree
28
XML: Creating ID’s
• Give an element E an attribute A of type ID
• When using tag <E> in an XML document,
give its attribute A a unique value
• Example:
<E A = "xyz">
29
XML: Creating References
• To allow objects of type F to refer to another
object with an ID attribute, give F an
attribute of type IDREF
• Or, let the attribute have type IDREFS, so the
F-object can refer to any number of other
objects
30
XML: IDs and References
<person id="o555">
<name>Jane</name>
</person>
<person id="o456">
<name> Mary </name>
<children idref="o123 o555"/>
</person>
<person id="o123" mother="o456">
<name>John</name>
</person>
• IDs and references in XML are just syntax
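Because IDs and references are just syntax, it is the application that follows them. A minimal sketch, assuming a hypothetical `family` wrapper element (the slide shows only a fragment) and an index built by hand:

```python
import xml.etree.ElementTree as ET

FAMILY = """<family>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</family>"""

root = ET.fromstring(FAMILY)

# Index every person by its ID so that references can be followed.
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(object_id):
    """Follow a reference and return the referenced person's name."""
    return by_id[object_id].findtext("name")


mary = by_id["o456"]
children = [name_of(ref) for ref in mary.find("children").get("idref").split()]
mother_of_john = name_of(by_id["o123"].get("mother"))
print(children, mother_of_john)
```

With the index in place the document behaves like a graph: John points to Mary and Mary points back to John.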
31
DTD: ID and IDREF(S) Attributes
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>
<person age="25"
        id="p29432"
        manager="p48293"
        manages="p34982 p423234">
<name> ....</name>
...
</person>
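A validating parser checks two things for these attribute types: ID values are unique, and every IDREF or IDREFS value names an existing ID. The sketch below performs only those two checks; the function name, the `staff` wrapper, and the second person are assumptions, and the required child elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def check_references(root, id_attr="id", idref_attrs=("manager",),
                     idrefs_attrs=("manages",)):
    """Return a list of problems: duplicate IDs and dangling references."""
    problems, seen = [], set()
    for elem in root.iter():
        object_id = elem.get(id_attr)
        if object_id is not None:
            if object_id in seen:
                problems.append(f"duplicate ID {object_id}")
            seen.add(object_id)
    for elem in root.iter():
        refs = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for attr in idrefs_attrs:
            refs.extend((elem.get(attr) or "").split())
        problems.extend(f"dangling reference {r}" for r in refs if r not in seen)
    return problems


STAFF = """<staff>
  <person age="25" id="p29432" manager="p48293" manages="p34982 p423234"/>
  <person age="40" id="p48293" manager="p48293" manages="p29432"/>
</staff>"""

problems = check_references(ET.fromstring(STAFF))
print(problems)
```

The check cannot require that `manager` point to a person rather than to any other element with an ID, which is the weakness of untyped references.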
32
Use of DTDs
1. Set standalone = "no"
2. Either:
a) Include the DTD as a preamble of the XML
document, or
b) Follow DOCTYPE and the <root tag> by SYSTEM and
a path to the file where the DTD can be found, or
c) Mix the two... (e.g. to override the external
definition)
33
Example (a)
<?xml version = "1.0" standalone = "no" ?>
<!DOCTYPE BARS [
<!-- The DTD -->
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
<!-- The document -->
<BARS>
<BAR><NAME>Joe’s Bar</NAME>
<BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> …
</BARS>
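A document with an internal DTD can be read by any XML parser, but not every parser validates against it. The sketch below uses Python's `xml.etree.ElementTree`, which parses the document without validating, so the BAR rule is checked by hand; the helper `bar_ok` is hypothetical and the elided second bar is left out.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

bars = ET.fromstring(DOC)


def bar_ok(bar):
    """Hand-written check of the rule BAR (NAME, BEER+)."""
    tags = [child.tag for child in bar]
    return len(tags) >= 2 and tags[0] == "NAME" and set(tags[1:]) == {"BEER"}


prices = {
    (bar.findtext("NAME"), beer.findtext("NAME")): float(beer.findtext("PRICE"))
    for bar in bars.findall("BAR")
    for beer in bar.findall("BEER")
}
print(all(bar_ok(bar) for bar in bars.findall("BAR")), prices)
```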
34
Example (b)
• Assume the BARS DTD is in file bar.dtd
<?xml version = "1.0" standalone = "no" ?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<!-- Get the DTD from the file bar.dtd -->
<BARS>
<BAR><NAME>Joe’s Bar</NAME>
<BEER><NAME>Bud</NAME>
<PRICE>2.50</PRICE></BEER>
<BEER><NAME>Miller</NAME>
<PRICE>3.00</PRICE></BEER>
</BAR>
<BAR> …
</BARS>
35
DTDs as Grammars
<!DOCTYPE db [
<!ELEMENT db        ((book|publisher)*)>
<!ELEMENT book      (title,author*,year?)>
<!ELEMENT title     (#PCDATA)>
<!ELEMENT author    (#PCDATA)>
<!ELEMENT year      (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
]>
36
DTDs as Grammars
Same thing as:
db        ::= (book|publisher)*
book      ::= (title,author*,year?)
title     ::= string
author    ::= string
year      ::= string
publisher ::= string
• A DTD is an EBNF (Extended BNF) grammar
• An XML tree is precisely a derivation tree
• A valid XML document = a parse tree for that
grammar
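The grammar view suggests a recursive check: an element is derivable if its children match its production and each child is derivable in turn. In this sketch the right-hand sides are written by hand as regexes over space-terminated child names, with `None` standing for "string"; the names `GRAMMAR` and `derives` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One production per element; None means the element holds text only.
GRAMMAR = {
    "db": r"(book |publisher )*",
    "book": r"title (author )*(year )?",
    "title": None,
    "author": None,
    "year": None,
    "publisher": None,
}


def derives(elem):
    """True if the subtree rooted at `elem` is a derivation tree of GRAMMAR."""
    if elem.tag not in GRAMMAR:
        return False
    rule = GRAMMAR[elem.tag]
    if rule is None:
        return len(elem) == 0               # text-only: no child elements
    children = "".join(child.tag + " " for child in elem)
    return (re.fullmatch(rule, children) is not None
            and all(derives(child) for child in elem))


good = ET.fromstring("<db><book><title>T</title><author>A</author></book>"
                     "<publisher>P</publisher></db>")
bad = ET.fromstring("<db><book><author>A</author><title>T</title></book></db>")
print(derives(good), derives(bad))
```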
37
DTDs as Grammars
<!DOCTYPE paper [
<!ELEMENT paper   (section*)>
<!ELEMENT section ((title,section*) | text)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT text    (#PCDATA)>
]>
<paper> <section> <text> </text> </section>
<section> <title> </title>
<section> … </section>
<section> … </section>
</section>
</paper>
• XML documents can be nested arbitrarily deep
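Because section is defined in terms of itself, no fixed depth bounds a valid paper. The sketch below builds papers of arbitrary depth and measures them; the helpers `nested` and `section_depth` are hypothetical, and empty title and text elements stand in for real content.

```python
import xml.etree.ElementTree as ET


def nested(n):
    """Return a paper whose sections are nested n levels deep (n >= 1)."""
    return ("<paper>"
            + "<section><title/>" * (n - 1)
            + "<section><text/></section>"
            + "</section>" * (n - 1)
            + "</paper>")


def section_depth(elem):
    """Longest chain of nested <section> elements at or below `elem`."""
    deepest_child = max((section_depth(child) for child in elem), default=0)
    return deepest_child + (1 if elem.tag == "section" else 0)


depths = [section_depth(ET.fromstring(nested(n))) for n in (1, 2, 5)]
print(depths)
```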
38
DTDs as Schemas
Not so well suited:
• impose unwanted constraints on order:
– <!ELEMENT person (name,phone)>
• references cannot be constrained
– ID/IDREFS can reference any ID
• can be too vague:
– <!ELEMENT person ((name|phone|email)*)>
39
DTDs as Schemas
No context-dependent typing
[Figure: a dealer element with children UsedCars and NewCars, each
containing an ad element; the labels below the ads are model, year, and year.]
• Cannot distinguish between used car ads and
new car ads
– Different structure in different contexts
40
XML APIs
• Document Object Model - DOM
– Manipulation of XML Data
– Provides a representation of an XML Document as a tree
– Reads XML Document into memory
– http://www.w3.org/DOM
– Many implementations (Sun JAXP, Apache Xerces, …)
• Simple API for XML - SAX
– Event-based framework for parsing XML data
– http://www.saxproject.org/
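The difference between the two styles can be sketched with Python's standard-library counterparts, `xml.dom.minidom` and `xml.sax`, used here in place of the Java implementations named above. The sample document and the `TitleHandler` class are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = ("<db><book><title>Foundations of Databases</title></book>"
       "<book><title>Data on the Web</title></book></db>")

# DOM: the whole document is read into memory as a tree we can navigate.
dom = minidom.parseString(DOC)
dom_titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]


# SAX: the parser reports events; the handler keeps only what it needs.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self.titles.append("")

    def characters(self, content):
        if self._in_title:
            self.titles[-1] += content   # text may arrive in several chunks

    def endElement(self, name):
        if name == "title":
            self._in_title = False


handler = TitleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(dom_titles, handler.titles)
```

DOM is convenient when the document fits in memory and is visited more than once; SAX suits large inputs that need only a single pass.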
41
References
• Lecture Slides
– Jeffrey D. Ullman
– http://www-db.stanford.edu/~ullman/dscb/pslides/pslides.html
– Dan Suciu
– http://www.cs.washington.edu/homes/suciu/COURSES/590DS/02xmlsyntax.htm
– http://www.cs.washington.edu/homes/suciu/COURSES/590DS/11dtd.htm
– Alon Levy
– http://www.cs.washington.edu/education/courses/csep544/02sp/lectures/lecture5cut.ppt
• BRICS XML Tutorial
– A. Moeller, M. Schwartzbach
– http://www.brics.dk/~amoeller/XML/index.html
• W3C's XML homepage
– http://www.w3.org/XML
• XML School: an XML tutorial
– http://www.w3schools.com/xml
42