Lecture 15 : Introduction to XML

advertisement
Introduction to Semistructured
Data and XML
Chapter 27
Database Management Systems, R. Ramakrishnan
1
How the Web is Today

HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations

No application interoperability:
• HTML not understood by applications
• Database technology: client-server
Database Management Systems, R. Ramakrishnan
2
New Universal Data Exchange
Format: XML
A recommendation from the W3C
 XML = data
 XML generated by applications
 XML consumed by applications
 Easy access: across platforms, organizations
Database Management Systems, R. Ramakrishnan
3
Paradigm Shift on the Web
From documents (HTML) to data (XML)
 From information retrieval to data
management
 For databases, also a paradigm shift:

• from relational model to semistructured data
• from data processing to data/query translation
• from storage to transport
Database Management Systems, R. Ramakrishnan
4
HTML




HTML is widely used for formatting and structuring
Web documents.
Designed to describe how a Web browser should
arrange text, images and push-buttons on a page.
Easy to learn, but does not convey structure and
meaning of data in the Web pages.
Fixed tag set.
Text (PCDATA)
Opening tag
<HTML>
<HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC=”dragon.jpeg" WIDTH="200" HEIGHT="150” >
Closing tag
</BODY>
</HTML>
“Bachelor” tag
Attribute name
Database Management Systems, R. Ramakrishnan
Attribute value
5
Semistructure data
1.
2.
3.
4.
Information integration: important new application
that motivates what follows.
Semistructured data: a new data model designed to
cope with problems of information integration.
XML (Extensible Markup Language) : a new Web
standard that is essentially semistructured data.
XQUERY: an emerging standard query language
for XML data.
Database Management Systems, R. Ramakrishnan
6
Information Integration
Problem: related data exists in many places. They talk
about the same things, but differ in model, schema,
conventions (e.g., terminology).
Example: In the real world, every bar has its own
database.
 Some may have relations like beer-price; others have
an Microsoft Word file from which the menu is
printed.
 Some keep phones of manufacturers but not
addresses.
 Some distinguish beers and ales; others do not.
Database Management Systems, R. Ramakrishnan
7
The Semistructured Data Model
Bib
Object Exchange
Model (OEM)
&o1
complex object
paper
paper
book
references
&o12
&o24
references
author
title
year
&o29
references
author
http
page
author
title publisher
title
author
author
author
&o43
&25
&96
1997
last
firstname
firstname
lastname
&243
“Serge”
“Abiteboul”
“Victor”
lastname
first
&206
“Vianu”
122
133
atomic object
Database Management Systems, R. Ramakrishnan
8
Characteristics of Semistructured
Data
Missing or additional attributes
 Multiple attributes
 Different types in different objects
 Heterogeneous collections

Self-describing, irregular data, no a priori structure
Database Management Systems, R. Ramakrishnan
9
Comparison with Relational Data
row
nam e
phone
John
3634
Sue
6343
D ic k
6363
Database Management Systems, R. Ramakrishnan
row
row
name phone name phone name phone
“John” 3634 “Sue” 6343 “Dick”
6363
{ row: { name: “John”, phone: 3634 },
row: { name: “Sue”, phone: 6343 },
row: { name: “Dick”, phone: 6363 }
}
10
XML (Extensible Markup Language)
A W3C standard to complement HTML
 Origins: Structured text SGML

• Large-scale electronic publishing
• Data exchange on the web

Motivation:
• HTML describes presentation
• XML describes content
Database Management Systems, R. Ramakrishnan
11
From HTML to XML
HTML describes the presentation
Database Management Systems, R. Ramakrishnan
12
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
Database Management Systems, R. Ramakrishnan
13
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
Database Management Systems, R. Ramakrishnan
14
Why are we DB’ers interested?
It’s data. That’s us.
 Database issues:

• How are we going to model XML? (graphs).
• How are we going to query XML? (XQuery)
• How are we going to store XML (in a relational
database? object-oriented? native?)
• How are we going to process XML efficiently?
(many interesting research questions!)
Database Management Systems, R. Ramakrishnan
15
XML Terminology

Tags: book, title, author, …
• start tag: <book>, end tag: </book>

Elements: <book>…<book>,<author>…</author>
• elements can be nested
• empty element: <red></red> (Can be abbrv. <red/>)



XML document: Has a single root element
Well-formed XML document: Has matching tags
Valid XML document: conforms to a schema
Database Management Systems, R. Ramakrishnan
16
Well-Formed XML
1. Declaration = <? ... ?> .
• Normal declaration is
<? XML VERSION = "1.0" STANDALONE = "yes"
?>
• “Standalone” means that there is no DTD specified.
2. Root tag surrounds the entire balance of the document.
 <FOO> is balanced by </FOO>, as in HTML.
3. Any balanced structure of tags OK.
• Option of tags that don’t require balance, like <P> in
HTML.
Database Management Systems, R. Ramakrishnan
17
XML: An Example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<BOOKLIST>
<BOOK genre="Science" format="Hardcover">
<AUTHOR>
<FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME>
</AUTHOR>
<TITLE>The Character of Physical Law</TITLE>
<PUBLISHED>1980</PUBLISHED>
</BOOK>
<BOOK genre="Fiction">
<AUTHOR>
<FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
</AUTHOR>
<TITLE>Waiting for the Mahatma</TITLE>
<PUBLISHED>1981</PUBLISHED>
</BOOK>
<BOOK genre="Fiction">
<AUTHOR>
<FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
</AUTHOR>
<TITLE>The English Teacher</TITLE>
<PUBLISHED>1980</PUBLISHED>
</BOOK>
</BOOKLIST>
Database Management Systems, R. Ramakrishnan
18
XML – Elements
<BOOK genre="Science" format="Hardcover">…</BOOK>
attribute
open tag
element name





attribute value
data
closing tag
Xml is case and space sensitive
Element opening and closing tag names must be identical
Opening tags: “<” + element name + “>”
Closing tags: “</” + element name + “>”
Empty Elements have no data and no closing tag:
• They begin with a “<“ and end with a “/>”
<BOOK/>
Database Management Systems, R. Ramakrishnan
19
XML – Attributes
<BOOK genre="Science" format="Hardcover">…</BOOK>
attribute
open tag
attribute value
element name


data
closing tag
Attributes provide additional information for element tags.
There can be zero or more attributes in every element; each one
has the the form:
attribute_name=‘attribute_value’
- There is no space between the name and the “=‘”
- Attribute values must be surrounded by “ or ‘ characters

Multiple attributes are separated by white space (one or more
spaces or tabs).
Database Management Systems, R. Ramakrishnan
20
Elements
The segment of an XML document between an opening
and a corresponding closing tag is called an element.
element
<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
element, a sub-element
of
Database Management Systems, R. Ramakrishnan
not an element
21
XML – Data and Comments
<BOOK genre="Science" format="Hardcover">…</BOOK>
attribute
open tag
attribute value
element name



closing tag
data
Xml data is any information between an opening and closing
tag
Xml data must not contain the ‘<‘ or ‘>’ characters
Comments:
<!- comment ->
Database Management Systems, R. Ramakrishnan
22
XML text
XML has only one “basic” type -- text.
It is bounded by tags, e.g.
<title> The Big Sleep </title>
<year> 1935 </ year> --- 1935 is still text
XML text is called PCDATA (for parsed
character data). It uses a 16-bit encoding.
Database Management Systems, R. Ramakrishnan
23
XML – Nesting & Hierarchy



Xml tags can be nested in a tree hierarchy
Xml documents can have only one root tag
Between an opening and closing tag you can insert:
1. Data
2. More Elements
3. A combination of data and elements
<root>
<tag1>
Some Text
<tag2>More</tag2>
</tag1>
</root>
Database Management Systems, R. Ramakrishnan
24
Representing relational DBs:
Two ways
projects:
title
employees:
name
budget
ssn
Database Management Systems, R. Ramakrishnan
managedBy
age
25
Project and Employee relations in XML
Projects and employees are intermixed
<db>
<project>
<employee>
<name> Sandra </name>
<title> Pattern recognition </title>
<ssn> 2234 </ssn>
<budget> 10000 </budget>
<age> 35 </age>
<managedBy> Joe</managedBy>
</employee>
</project>
<project>
<employee>
<title> Auto guided vehicle </title>
<budget> 70000 </budget>
<name> Joe </name>
<managedBy> Sandra </managedBy>
<ssn> 344556 </ssn>
</project>
<age> 34 < /age>
:
</employee>
</db>
Database Management Systems, R. Ramakrishnan
26
Project and Employee relations in XML (cont’d)
Employees follows projects
<db>
<projects>
<project>
<title> Pattern recognition </title>
<budget> 10000 </budget>
<managedBy>Joe </managedBy>
</project>
<project>
<title>Auto guided vehicles</title>
<budget> 70000 </budget>
<managedBy>Sandra</managedBy>
</project>
:
</projects>
Database Management Systems, R. Ramakrishnan
<employees>
<employee>
<name> Joe </name>
<ssn> 344556 </ssn>
<age> 34 </age>
</employee>
<employee>
<name> Sandra </name>
<ssn> 2234 </ssn>
<age>35 </age>
</employee>
:
<employees>
</db>
27
More XML: Oids and References
<person id=“o555”> <name> Jane </name> </person>
<person id=“o456”> <name> Mary </name>
<children idref=“o123 o555”/>
</person>
<person id=“o123” mother=“o456”><name>John</name>
</person>
oids and references in XML are just syntax
Database Management Systems, R. Ramakrishnan
28
XML Data Model (Graph)
db
#0
book
book
publisher
b1
b2
pub
title
#1
pcdata
mkp
author
#2
pcdata
title
#3
pcdata
pub
author
#5
#4
pcdata
pcdata
Complete... Chamberlin Principles... Bernstein
Database Management Systems, R. Ramakrishnan
author
Newcomer
name
#6
pcdata
state
#7
pcdata
Morgan... CA
29
Document Type Descriptors

Sort of like a schema but not really.
<!ELEMENT Book (title, author*) >
<!ELEMENT title #PCDATA>
<!ELEMENT author (name, address,age?)>
<!ATTLIST Book id ID #REQUIRED>
<!ATTLIST Book pub IDREF #IMPLIED>

Inherited from SGML DTD standard
BNF grammar establishing constraints on element
structure and content


Definitions of entities
Database Management Systems, R. Ramakrishnan
30
DTD – An Example
<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
<!ELEMENT Apple EMPTY>
<!ATTLIST Apple color CDATA #REQUIRED>
<!ELEMENT Orange EMPTY>
<!ATTLIST Orange location ‘Florida’>
--------------------------------------------------------------------------------
<Basket>
<Cherry flavor=‘good’/>
<Apple color=‘red’/>
<Apple color=‘green’/>
</Basket>
Database Management Systems, R. Ramakrishnan
<Basket>
<Apple/>
<Cherry flavor=‘good’/>
<Orange/>
</Basket>
31
DTD - !ELEMENT
<!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
Name
Children
!ELEMENT declares an element name, and
what children elements it should have
 Content types:

•
•
•
•
•
Other elements
#PCDATA (parsed character data)
EMPTY (no content)
ANY (no checking inside this structure)
A regular expression
Database Management Systems, R. Ramakrishnan
32
DTD - !ELEMENT (Contd.)

A regular expression has the following
structure:
• exp1, exp2, exp3, …, expk: A list of regular
expressions
• exp*: An optional expression with zero or more
occurrences
• exp+: An optional expression with one or more
occurrences
• exp1 | exp2 | … | expk: A disjunction of
expressions
Database Management Systems, R. Ramakrishnan
33
DTD - !ATTLIST
<!ATTLIST Cherry flavor CDATA #REQUIRED>
Element Attribute
Type
Flag
<!ATTLIST Orange location CDATA #REQUIRED
color ‘orange’>
!ATTLIST defines a list of attributes for an
element
 Attributes can be of different types, can be
required or not required, and they can have
default values.

Database Management Systems, R. Ramakrishnan
34
DTD – Well-Formed and Valid
<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+)>
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
--------------------------------------------------------------------------------
Not Well-Formed
Well-Formed but Invalid
<basket>
<Job>
<Cherry flavor=good>
<Location>Home</Location>
</Basket>
</Job>
Well-Formed and Valid
<Basket>
<Cherry flavor=‘good’/>
</Basket>
Database Management Systems, R. Ramakrishnan
35
Example: An Address Book
<person>
<name> MacNiel, John </name>
<greet> Dr. John MacNiel </greet>
<addr>1234 Huron Street </addr>
<addr> Rome, OH 98765 </addr>
<tel> (321) 786 2543 </tel>
<fax> (321) 786 2543 </fax>
<tel> (321) 786 2543 </tel>
<email> jm@abc.com </email>
</person>
Database Management Systems, R. Ramakrishnan
Exactly one name
At most one greeting
As many address lines
as needed (in order)
Mixed telephones
and faxes
As many
as needed
36
Specifying the structure

name
to specify a name element

greet?
to specify an optional
(0 or 1) greet elements

name,greet?
to specify a name followed by
an optional greet
Database Management Systems, R. Ramakrishnan
37
Specifying the structure (cont)

addr*
to specify 0 or more address lines

tel | fax
a tel or a fax element

(tel | fax)*
0 or more repeats of tel or fax

email*
0 or more email elements
Database Management Systems, R. Ramakrishnan
38
A DTD for the address book
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person
(name, greet?, address*, (fax | tel)*, email*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel
(#PCDATA)>
<!ELEMENT fax
(#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
Database Management Systems, R. Ramakrishnan
39
DTD for the example relational DB
<!DOCTYPE db [
<!ELEMENT db
<!ELEMENT projects
<!ELEMENT employees
<!ELEMENT project
<!ELEMENT employee
...
]>
Database Management Systems, R. Ramakrishnan
(projects,employees)>
(project*)>
(employee*)>
(title, budget, managedBy)>
(name, ssn, age)>
40
Summary of XML regular
expressions









Each element name is a tag.
Its components are the tags that appear nested
within, in the order specified.
A
The tag A occurs
e1,e2
The expression e1 followed by e2
e*
0 or more occurrences of e
e?
Optional -- 0 or 1 occurrences
e+
1 or more occurrences
e1 | e2
either e1 or e2
(e)
grouping
Database Management Systems, R. Ramakrishnan
41
XML Querying
Path Expressions :
 Bib.paper
 Bib.book.publisher
 Bib.paper.author.lastname
Given an OEM instance, the value of a path
expression p is a set of objects
Database Management Systems, R. Ramakrishnan
42
Path Expressions
Bib
&o1
Examples:
paper
paper
book
references
&o12
&o24
references
DB =
author
title
&o43
year
&o44
&o29
references
author
http
author
title publisher
title
author
author
author
&o45 &o46
&o52
page
&25
&96
1997
firstname
lastname
&o70
“Serge”
&o47 &o48 &o49 &o50 &o51
firstname
&o71
“Abiteboul”
Bib.paper={&o12,&o29}
Bib.book.publisher={&o51}
Bib.paper.author.lastname={&o71,&206}
Database Management Systems, R. Ramakrishnan
&243
“Victor”
last
lastname
first
&206
“Vianu”
122
133
43
XQuery
Emerging standard for querying XML documents. Basic form:
FOR <variables ranging over sets of elements>
WHERE <condition>
RETURN <set of elements>;

Sets of elements described by paths, consisting of:
1. URL, if necessary.
2. Element names forming a path in the semistructured data
graph, e.g., //BAR/NAME =
“start at any BAR node and go to a NAME child.”
3. Ending condition of the form
[<condition about subelements, @attributes, and values>]
Database Management Systems, R. Ramakrishnan
44
XQuery
Overview:

FOR-LET-WHERE-ORDERBY-RETURN = FLWOR
FOR/LET Clauses
List of tuples
WHERE Clause
List of tuples
ORDERBY/RETURN Clause
Database Management Systems, R. Ramakrishnan
Instance of Xquery data model
45
XQuery

FOR $x in expr -- binds $x to each value in
the list expr

LET $x = expr -- binds $x to the entire list
expr
• Useful for common subexpressions and for
aggregations
Database Management Systems, R. Ramakrishnan
46
FOR v.s. LET
FOR $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>
LET $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>
Database Management Systems, R. Ramakrishnan
Returns:
<result> <book>...</book></result>
<result> <book>...</book></result>
<result> <book>...</book></result>
...
Returns:
<result> <book>...</book>
<book>...</book>
<book>...</book>
...
</result>
47
XQuery
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN $x/title
Result:
<title> abc </title>
<title> def </title>
<title> ghi </title>
Database Management Systems, R. Ramakrishnan
48
XQuery
For each author of a book by Morgan
Kaufmann, list all books s/he published:
FOR $a IN distinct(document("bib.xml")
/bib/book[publisher=“Morgan Kaufmann”]/author)
RETURN <result>
$a,
FOR $t IN /bib/book[author=$a]/title
RETURN $t
</result>
distinct = a function that eliminates duplicates
Database Management Systems, R. Ramakrishnan
49
XQuery
Result:
<result>
<author>Jones</author>
<title> abc </title>
<title> def </title>
</result>
<result>
<author> Smith </author>
<title> ghi </title>
</result>
Database Management Systems, R. Ramakrishnan
50
XQuery
<big_publishers>
FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")/book[publisher = $p]
WHERE count($b) > 100
RETURN $p
</big_publishers>
count = a (aggregate) function that returns the number of elms
Database Management Systems, R. Ramakrishnan
51
XQuery
Find books whose price is larger than average:
LET $a=avg(document("bib.xml")/bib/book/price)
FOR $b in document("bib.xml")/bib/book
WHERE $b/price > $a
RETURN $b
Database Management Systems, R. Ramakrishnan
52
Examples for XQuery queries



FOR $x IN
doc(www.company.com/info.xml)
//employee [employeeSalary gt 70000]/employeeName
RETURN <res> $x/firstName, $x/lastName </res>
FOR $x IN
doc(www.company.com/info.xml)/company/employee
WHERE $x/employeeSalary gt 70000
RETURN <res> $x/employeeName/firstName,
$x/employeeName/lastName </res>
FOR $x IN
doc(www.company.com/info.xml)/company
/project [projectNumber = 5]/projectWorker,
$y IN
doc(www.company.com/info.xml)/company/employee
WHERE $x/hours gt 20.0 AND $y.ssn = $x.ssn
RETURN <res> $x/EmployeeName/firstName,
$y/employeeName/lastName, $x/hours </res>
Database Management Systems, R. Ramakrishnan
53
Download