XML Introduction

XML and Databases
1
Outline (ambitious)
• Background: documents (SGML/HTML) and
databases (structured and semistructured data)
• XML Basics and Document Type Descriptors
• XML query languages: XML-QL and XSL.
• XML additions: XLink, XPointer, RDF, SOX, XMLData
• Document Object Model (XML APIs)
2
Some Useful Articles
XML, Java, and the future of the web
http://webreview.com/wr/pub/97/12/19/xml/index.html
XML and the Second-Generation Web
http://www.sciam.com/1999/0599issue/0599bosak.html
Articles/standards for XML, XSL, XML-QL
http://www.w3c.org/
http://www.w3.org/TR/REC-xml
3
Background
What’s the difference between the world
of documents and information retrieval and
the world of databases and query interfaces?
4
Documents vs Databases
Document world
> plenty of small documents
> usually static
> implicit structure
> tagging
> human friendly
> content: section, paragraph, toc, form/layout, annotation
> Paradigms: “Save as”, wysiwyg
> meta-data: author name, date, subject
Database world
> a few large databases
> usually dynamic
> explicit structure (schema)
> records
> machine friendly
> content: schema, data, methods
> Paradigms: Atomicity, Consistency, Isolation, Durability
> meta-data: schema description
5
What to do with them
Documents
• editing
• printing
• spell-checking
• counting words
• retrieving (IR)
• searching
Database
• updating
• cleaning
• querying
• composing/transforming
6
HTML
• Publishing hypertext on the World Wide Web
• Designed to describe how a Web browser should
arrange text, images and push-buttons on a page.
• Easy to learn, but does not convey structure.
• Fixed tag set.
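<HTML>
<HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
<BODY>
<H1>Introduction</H1>
<IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>
Callouts:
• Opening tag: e.g. <H1>
• Closing tag: e.g. </H1>
• Text (PCDATA): e.g. Introduction
• “Bachelor” tag (no closing tag): <IMG ...>
• Attribute name: e.g. SRC
• Attribute value: e.g. "dragon.jpeg"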
7
The Structure of XML
• XML consists of tags and text
• Tags come in pairs <date> ...</date>
• They must be properly nested
<date> <day> ... </day> ... </date> --- good
<date> <day> ... </date>... </day> --- bad
(You can’t do <i> ... <b> ... </i> ...</b> in HTML)
8
XML text
XML has only one “basic” type -- text.
It is bounded by tags e.g.
<title> The Big Sleep </title>
<year> 1935 </year> --- 1935 is still text
XML text is called PCDATA (for parsed
character data). It uses the Unicode character set; a character
can be written as a character reference, e.g. &#x05DE; for the Hebrew letter Mem
Later we shall see how new types are specified by
XML-data
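A small sketch of the two points above, using Python's standard xml.etree.ElementTree (Python is our choice for illustration, not something the slides prescribe). It shows that element content always comes back as a string, and that the parser expands a character reference into the character it names. The <letter> element is a made-up example.

```python
import xml.etree.ElementTree as ET

# Element content is always text, even when it looks like a number.
year = ET.fromstring("<year> 1935 </year>")
year_text = year.text            # a str, surrounding whitespace preserved
year_value = int(year_text)      # the application must convert it itself

# A character reference names a Unicode code point; the parser expands it.
# <letter> is a hypothetical element used only for this illustration.
mem = ET.fromstring("<letter>&#x05DE;</letter>").text   # Hebrew letter Mem
```

Any typing beyond "text" is therefore the job of the application or of a schema layer on top of XML.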
9
XML structure
Nesting tags can be used to express various
structures. E.g. A tuple (record) :
<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
10
XML structure (cont.)
• We can represent a list by using the same
tag repeatedly:
<addresses>
<person> ... </person>
<person> ... </person>
<person> ... </person>
...
</addresses>
11
Terminology
The segment of an XML document between an opening
and a corresponding closing tag is called an element.
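<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
• <person> ... </person> is an element
• each of <name>, <tel> and <email> is an element, a sub-element of <person>
• a fragment that does not run from an opening tag to its matching closing tag is not an element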
12
XML is tree-like
person
  name: Malcolm Atchison
  tel: (215) 898 4321
  tel: (215) 898 4321
  email: mp@dcs.gla.ac.sc
Semistructured data models typically put the
labels on the edges
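To make the tree view concrete, here is a minimal sketch in Python (standard library only; the helper name outline is ours). It parses the person record and lists tags as inner nodes and text as leaves, in document order.

```python
import xml.etree.ElementTree as ET

PERSON = """<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>"""

def outline(elem, depth=0):
    """Return one line per node: tags as inner nodes, text as leaves."""
    lines = ["  " * depth + elem.tag]
    text = (elem.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in elem:            # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(PERSON))
print("\n".join(tree_lines))
```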
13
Mixed Content
An element may contain a mixture of sub-elements and
PCDATA
<airline>
<name> British Airways </name>
<motto>
World’s <dubious> favorite</dubious> airline
</motto>
</airline>
Data of this form is not typically generated from databases. It is
needed for consistency with HTML
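Mixed content is where tree APIs differ most from a database view. As an illustration only, Python's xml.etree.ElementTree splits the text of the motto into the parent's text, the sub-element's text, and the sub-element's tail:

```python
import xml.etree.ElementTree as ET

motto = ET.fromstring(
    "<motto>World's <dubious>favorite</dubious> airline</motto>"
)
dubious = motto[0]

before = motto.text      # text before the first sub-element
inside = dubious.text    # text inside the sub-element
after = dubious.tail     # text following the sub-element's closing tag
full = "".join(motto.itertext())   # all text, in document order
```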
14
A Complete XML Document
<?xml version="1.0"?>
<person>
<name> Malcolm Atchison </name>
<tel> (215) 898 4321 </tel>
<email> mp@dcs.gla.ac.sc </email>
</person>
15
Two ways of representing a DB
projects: title, budget, managedBy
employees: name, ssn, age
16
Project and Employee relations in XML
Projects and employees are intermixed
<db>
<project>
<title> Pattern recognition </title>
<budget> 10000 </budget>
<managedBy> Joe </managedBy>
</project>
<employee>
<name> Joe </name>
<ssn> 344556 </ssn>
<age> 34 </age>
</employee>
<employee>
<name> Sandra </name>
<ssn> 2234 </ssn>
<age> 35 </age>
</employee>
<project>
<title> Auto guided vehicle </title>
<budget> 70000 </budget>
<managedBy> Sandra </managedBy>
</project>
:
</db>
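Going back from the intermixed document to the two relations is a matter of grouping children by tag. A minimal sketch in Python, assuming the flat layout above where each field is a direct child holding only text (the helper name relation is ours):

```python
import xml.etree.ElementTree as ET

DB = """<db>
  <project><title>Pattern recognition</title><budget>10000</budget>
           <managedBy>Joe</managedBy></project>
  <employee><name>Joe</name><ssn>344556</ssn><age>34</age></employee>
  <employee><name>Sandra</name><ssn>2234</ssn><age>35</age></employee>
  <project><title>Auto guided vehicle</title><budget>70000</budget>
           <managedBy>Sandra</managedBy></project>
</db>"""

def relation(root, tag):
    """Collect every <tag> child of root as a dict of field name -> text."""
    return [{field.tag: field.text.strip() for field in row}
            for row in root.findall(tag)]

root = ET.fromstring(DB)
projects = relation(root, "project")
employees = relation(root, "employee")
```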
17
Project and Employee relations in XML (cont’d)
Employees follow projects
<db>
<projects>
<project>
<title> Pattern recognition </title>
<budget> 10000 </budget>
<managedBy> Joe </managedBy>
</project>
<project>
<title> Auto guided vehicles </title>
<budget> 70000 </budget>
<managedBy> Sandra </managedBy>
</project>
:
</projects>
<employees>
<employee>
<name> Joe </name>
<ssn> 344556 </ssn>
<age> 34 </age>
</employee>
<employee>
<name> Sandra </name>
<ssn> 2234 </ssn>
<age>35 </age>
</employee>
:
</employees>
</db>
18
Project and Employee relations in XML (cont’d)
Or without “separator” tags …
<db>
<projects>
<title> Pattern recognition </title>
<budget> 10000 </budget>
<managedBy> Joe </managedBy>
<title> Auto guided vehicles </title>
<budget> 70000 </budget>
<managedBy> Sandra </managedBy>
:
</projects>
<employees>
<name> Joe </name>
<ssn> 344556 </ssn>
<age> 34 </age>
<name> Sandra </name>
<ssn> 2234 </ssn>
<age> 35 </age>
:
</employees>
</db>
19
Attributes
An (opening) tag may contain attributes. These are
typically used to describe the content of an element
<entry>
<word language="en"> cheese </word>
<word language="fr"> fromage </word>
<word language="ro"> branza </word>
<meaning> A food made … </meaning>
</entry>
Order of attributes in an element does not matter
XML elements are ordered
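The last two statements can be observed directly in a parser. A small sketch using Python's standard library (the two-attribute <data> element is just an illustration): attributes are exposed as a mapping, so their written order does not affect equality, while child elements keep document order.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<data encoding="gif" compression="zip"/>')
b = ET.fromstring('<data compression="zip" encoding="gif"/>')
same_attributes = a.attrib == b.attrib      # attribute order is irrelevant

entry = ET.fromstring(
    '<entry><word language="en">cheese</word>'
    '<word language="fr">fromage</word></entry>'
)
languages = [w.get("language") for w in entry]   # element order is kept
```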
20
Attributes (cont’d)
Another common use for attributes is to express
dimension or type
<picture>
<height dim="cm"> 2400 </height>
<width dim="in"> 96 </width>
<data encoding="gif" compression="zip">
M05-.+C$@02!G96YE<FEC ...
</data>
</picture>
A document that obeys the “nested tags” rule and
does not repeat an attribute within a tag is said to
be well-formed .
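Any XML parser enforces well-formedness. As a hedged illustration, the sketch below uses Python's xml.etree.ElementTree and treats "parses without error" as "well-formed"; the helper name is_well_formed is ours.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if text parses as XML, i.e. passes the parser's well-formedness checks."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = is_well_formed("<date><day>1</day></date>")
bad_nesting = is_well_formed("<date><day>1</date></day>")
repeated_attr = is_well_formed('<height dim="cm" dim="in">1</height>')
```

Note that this says nothing about validity against a DTD; it only checks the nesting and attribute rules.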
21
When to use attributes
It’s not always clear when to use attributes
<person ssno="123 45 6789">
<name> F. MacNiel </name>
<email>
fmacn@dcs.barra.ac.sc
</email>
...
</person>
<person>
<ssno> 123 45 6789 </ssno>
<name> F. MacNiel </name>
<email>
fmacn@dcs.barra.ac.sc
</email>
...
</person>
22
XML Misc.
Apart from elements and attributes, XML allows
processing instructions and comments. A processing
instruction is a statement of the form <?target ... ?>.
The XML declaration at the top of a document uses the same syntax:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
A comment is enclosed between <!-- and -->
<!-- A Comment -->
23
Document Type Descriptors
Imposing structure on XML documents
24
Document Type Descriptors
• Document Type Descriptors (DTDs) impose
structure on an XML document.
• There is some relationship between a DTD and a
schema, but it is not close -- hence the need for
additional “typing” systems.
• The DTD is a syntactic specification.
25
Example: The Address Book
<person>
<name> MacNiel, John </name>          Exactly one name
<greet> Dr. John MacNiel </greet>     At most one greeting
<addr>1234 Huron Street </addr>       As many address lines
<addr> Rome, OH 98765 </addr>         as needed (in order)
<tel> (321) 786 2543 </tel>           Mixed telephones
<fax> (321) 786 2543 </fax>           and faxes
<tel> (321) 786 2543 </tel>
<email> jm@abc.com </email>           As many as needed
</person>
26
Specifying the structure
• name           to specify a name element
• greet?         to specify an optional (0 or 1) greet element
• name,greet?    to specify a name followed by an optional greet
27
Specifying the structure (cont)
• addr*          to specify 0 or more address lines
• tel | fax      a tel or a fax element
• (tel | fax)*   0 or more repeats of tel or fax
• email*         0 or more email elements
28
Specifying the structure (cont)
So the whole structure of a person entry is specified
by
name, greet?, addr*, (tel | fax)*, email*
This is known as a regular expression. Why is it
important?
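One practical answer: the sequence of child tags can be checked with ordinary regular-expression machinery. The sketch below is an assumption-laden illustration in Python, not how a real validating parser is built. It writes each child tag followed by a space and matches the resulting string against a hand-translated version of the expression above (PERSON_MODEL and matches_model are names we made up).

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  with each tag written as "tag "
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def matches_model(elem, model=PERSON_MODEL):
    """Check the sequence of child tags of elem against a content model."""
    child_tags = "".join(child.tag + " " for child in elem)
    return model.fullmatch(child_tags) is not None

ok = ET.fromstring(
    "<person><name>n</name><addr>a</addr><tel>t</tel>"
    "<fax>f</fax><tel>t</tel><email>e</email></person>"
)
wrong_order = ET.fromstring(
    "<person><addr>a</addr><name>n</name></person>"
)
ok_result = matches_model(ok)
wrong_result = matches_model(wrong_order)
```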
29
Regular Expressions
Each regular expression determines a corresponding finite
state automaton. Let’s start with a simpler example:
name, addr*, email
This suggests a simple parsing program
[Figure: finite state automaton with transitions labeled name, addr (loop) and email]
30
Another example
name,address*,(tel | fax)*,email*
[Figure: finite state automaton with transitions labeled name, address, tel, fax and email]
Adding in the optional greet further
complicates things
31
A DTD for the address book
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
32
Our relational DB revisited
projects: title, budget, managedBy
employees: name, ssn, age
33
Two DTDs for the relational DB
<!DOCTYPE db [
<!ELEMENT db (projects,employees)>
<!ELEMENT projects (project*)>
<!ELEMENT employees (employee*)>
<!ELEMENT project (title, budget, managedBy)>
<!ELEMENT employee (name, ssn, age)>
...
]>
<!DOCTYPE db [
<!ELEMENT db (project | employee)*>
<!ELEMENT project (title, budget, managedBy)>
<!ELEMENT employee (name, ssn, age)>
...
]>
34
Some things are hard to specify
Each employee element is to contain name, age and ssn elements
in some order.
<!ELEMENT employee
( (name, age, ssn) | (age, ssn, name) |
(ssn, name, age) | ...
)>
Suppose there were many more fields !
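The blow-up is factorial: n fields in any order need n! alternatives. A short Python sketch that spells out the content model for a given field list (the helper name any_order_model is ours):

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Spell out every ordering of fields as one DTD content model."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_model(["name", "age", "ssn"])
alternatives_for_6 = factorial(6)   # alternatives needed for six fields
```

With three fields the model already has six alternatives; with six fields it would need several hundred.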
35
Summary of XML regular expressions
• A          The tag A occurs
• e1,e2      The expression e1 followed by e2
• e*         0 or more occurrences of e
• e?         Optional -- 0 or 1 occurrences
• e+         1 or more occurrences
• e1 | e2    either e1 or e2
• (e)        grouping
36
Specifying attributes in the DTD
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
dimension CDATA #REQUIRED
accuracy CDATA #IMPLIED >
The dimension attribute is required; the accuracy
attribute is optional.
CDATA is the “type” of the attribute -- it means
string.
37
The DTD Language
• Default modifiers in DTD attributes:
Modifier     Description
#REQUIRED    The attribute's value must be specified with the element.
#IMPLIED     The attribute value can remain unspecified.
#FIXED       The attribute value is fixed and cannot be changed by the user.
38
The DTD Language
• Datatypes in DTD attributes:
Type         Description
CDATA        Character data
enumerated   A series of values of which only 1 can be chosen
ENTITY       An entity declared in the DTD
ENTITIES     Multiple whitespace separated entities declared in the DTD
ID           A unique element identifier
IDREF        The value of a unique ID type attribute
IDREFS       Multiple whitespace separated IDREFs of elements
NMTOKEN      An XML name token
NMTOKENS     Multiple whitespace separated XML name tokens
NOTATION     A notation declared in the DTD
39
Consistency of ID and IDREF attribute values
• If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
– ID is a poor cousin of a key in relational databases.
• If an attribute is declared as IDREF
– the associated value must exist as the value of some ID
attribute (no dangling “pointers”)
– IDREF is a poor cousin of a foreign key in relational
databases.
• Similarly for all the values of an IDREFS attribute
– An attribute of type IDREFS represents a space-separated list of references to valid IDs.
• ID and IDREF attributes are not typed
40
Specifying ID and IDREF attributes
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST person
id       ID     #REQUIRED
mother   IDREF  #IMPLIED
father   IDREF  #IMPLIED
children IDREFS #IMPLIED>
]>
41
Some conforming data
<family>
<person id="jane" mother="mary" father="john">
<name> Jane Doe </name>
</person>
<person id="john" children="jane jack">
<name> John Doe </name>
</person>
<person id="mary" children="jane jack">
<name> Mary Doe </name>
</person>
<person id="jack" mother="mary" father="john">
<name> Jack Doe </name>
</person>
</family>
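The two consistency rules (distinct IDs, no dangling references) are easy to check by hand. The sketch below is a minimal Python illustration, not a DTD validator: it does not read the DTD, so the attribute names id, mother, father and children are hard-coded assumptions taken from the family example, and reference_errors is a name we made up.

```python
import xml.etree.ElementTree as ET

FAMILY = """<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Doe</name></person>
  <person id="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>"""

def reference_errors(root):
    """Return problems: repeated ID values and IDREF(S) values with no target."""
    ids, errors = set(), []
    for person in root.iter("person"):
        pid = person.get("id")
        if pid in ids:
            errors.append("duplicate id: " + pid)
        ids.add(pid)
    for person in root.iter("person"):
        refs = [person.get("mother"), person.get("father")]
        refs += (person.get("children") or "").split()   # IDREFS: split on whitespace
        for ref in refs:
            if ref is not None and ref not in ids:
                errors.append("dangling reference: " + ref)
    return errors

errors = reference_errors(ET.fromstring(FAMILY))
broken = reference_errors(ET.fromstring(
    '<family><person id="a" mother="zed"><name>A</name></person></family>'
))
```

Note what the check cannot express: nothing stops a mother attribute from pointing at an element that is not a person at all.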
42
An alternative specification
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name, mother?, father?, children?)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT mother EMPTY>
<!ATTLIST mother idref IDREF #REQUIRED>
<!ELEMENT father EMPTY>
<!ATTLIST father idref IDREF #REQUIRED>
<!ELEMENT children EMPTY>
<!ATTLIST children idrefs IDREFS #REQUIRED>
]>
43
The revised data
<family>
<person id = "jane">
<name> Jane Doe </name>
<mother idref = "mary"></mother>
<father idref = "john"></father>
</person>
<person id = "john">
<name> John Doe </name>
<children idrefs = "jane jack"> </children>
</person>
...
</family>
44
The DTD Language
• Example: Sales Order Document
“An order document is comprised of several sales orders.
Each individual order has a number and it contains the
customer information, the date when the order was
received, and the items ordered. Each customer has a
number, a name, street, city, state, and ZIP code. Each item
has an item number, parts information and a quantity. The
parts information contains a number, a description of the
product and its unit price.
The numbers should be treated as attributes.”
45
The DTD Language
• Example: Sales Order Document DTD
<!-- DTD for example sales order document -->
<!ELEMENT Orders (SalesOrder+)>
<!ELEMENT SalesOrder (Customer,OrderDate,Item+)>
<!ELEMENT Customer (CustName,Street,City,State,ZIP)>
<!ELEMENT OrderDate (#PCDATA)>
<!ELEMENT Item (Part,Quantity)>
<!ELEMENT Part (Description,Price)>
<!ELEMENT CustName (#PCDATA)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT ... (#PCDATA)>
<!ATTLIST SalesOrder SONumber CDATA #REQUIRED>
<!ATTLIST Customer CustNumber CDATA #REQUIRED>
<!ATTLIST Part PartNumber CDATA #REQUIRED>
<!ATTLIST Item ItemNumber CDATA #REQUIRED>
46
The DTD Language
• Example: Sales Order XML Document
<Orders>
<SalesOrder SONumber="12345">
<Customer CustNumber="543">
<CustName>ABC Industries</CustName>
<Street>123 Main St.</Street>
<City>Chicago</City>
<State>IL</State>
<ZIP>60609</ZIP>
</Customer>
<OrderDate>10222000</OrderDate>
<Item ItemNumber="1">
<Part PartNumber="234">
<Description>Turkey wrench</Description>
<Price>9.95</Price>
</Part>
<Quantity>10</Quantity>
</Item>
</SalesOrder>
</Orders>
47
A useful abbreviation
When an element has empty content we can use
<tag blahblahbla/>
for
<tag blahblahbla></tag>
For example:
<family>
<person id = "jane">
<name> Jane Doe </name>
<mother idref = "mary"/>
<father idref = "john"/>
</person>
...
</family>
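That the two spellings are interchangeable can be confirmed with a parser. A tiny sketch using Python's standard library, offered as an illustration only: both forms yield an element with the same tag, the same attributes, no text and no children.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<mother idref="mary"/>')
long_form = ET.fromstring('<mother idref="mary"></mother>')

same = (short_form.tag == long_form.tag
        and short_form.attrib == long_form.attrib
        and short_form.text == long_form.text
        and len(short_form) == len(long_form) == 0)
```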
48
Schema.dtd
<!DOCTYPE db [
<!ELEMENT db (movie+, actor+)>
<!ELEMENT movie (title,director,casts,budget)>
<!ATTLIST movie id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT director (#PCDATA)>
<!ELEMENT casts EMPTY>
<!ATTLIST casts idrefs IDREFS #REQUIRED>
<!ELEMENT budget (#PCDATA)>
49
Schema.dtd (cont’d)
<!ELEMENT actor (name, acted_In,age?, directed*)>
<!ATTLIST actor id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT acted_In EMPTY>
<!ATTLIST acted_In idrefs IDREFS #REQUIRED>
<!ELEMENT age (#PCDATA)>
<!ELEMENT directed (#PCDATA)>
]>
50
Connecting the document with its DTD
In line:
<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>
Another file:
<!DOCTYPE db SYSTEM "schema.dtd">
A URL:
<!DOCTYPE db SYSTEM
"http://www.schemaauthority.com/schema.dtd">
51
Well-formed and Valid Documents
• Well-formed applies to any document (with or
without a DTD): proper nesting of tags and unique
attributes
• Valid specifies that the document conforms to the
DTD: conforms to regular expression grammar,
types of attributes correct, and constraints on
references satisfied
52
DTDs vs. Schemas (or Types)
• By database (or programming language) standards
DTDs are rather weak specifications.
– Only one base type -- PCDATA
– No useful “abstractions” e.g., sets
– IDREFs are untyped. You point to something, but you
don’t know what!
– No constraints e.g., child is inverse of parent
– No methods
– Tag definitions are global
• Some of the XML extensions impose something
like a schema or type on an XML document. We’ll
see these later
53
Lots of possibilities for schemas
• XML Schema (under W3C’s spotlight)
• XDR (Microsoft’s BizTalk)
• SOX (Schema for Object-Oriented XML)
• Schematron
• DSD (AT&T Labs and BRICS)
• and more.
54
Some tools
• XML Authority
http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
• XML Spy
http://www.xmlspy.com/download.html
55
Summary
• XML is a new data format. Its main virtues are
widespread acceptance and the (important) ability
to handle semistructured data (data without
schema)
• DTDs provide some useful syntactic constraints on
documents. As schemas they are weak
• How to store large XML documents?
• How to query them?
• How to map between XML and other
representations?
56