XML

Web Data Management
XML and its Syntax
Why XML is of Interest to Us
• XML is just syntax for data
– Note: we have no syntax for relational data
– But XML is not relational: semistructured
• This is exciting because:
– Can translate any legacy data to XML
– Can ship XML over the Web (HTTP)
– Can input XML into any application
– Thus: data sharing and exchange on the Web
XML Data Sharing and Exchange
[Figure: applications and data sources (object-relational data, relational data, legacy data) exchange XML data over the Web (HTTP). Specific data management tasks shown: integrate, transform, warehouse.]
From HTML to XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
Relevance of XML
• Database schemas are essentially metadata about the actual data stored in tables
• Describing and understanding that data is itself a problem
• XML binds each piece of data to a tag that says what the data represents
• The information is both human- and machine-readable
• XML is comparatively easy to understand and implement
Relevance of XML

An example describing directions from one location to another
<?xml version="1.0"?>
<map>
<start>
<addr1>100 West Morgan St</addr1>
<city>Raleigh</city>
<state>NC</state>
<zip>27603</zip>
</start>
<directions>
<left distance="0.11 miles">W MORGAN ST</left>
<left distance="0.11 miles">S WILMINGTON ST</left>
<left distance="0.44 miles">E EDENTON ST</left>
<right distance="0.09 miles">N WEST ST</right>
<left distance="0.02 miles">W JONES ST</left>
</directions>
<destination>
<addr1>508 W Jones St</addr1>
<city>Raleigh</city>
<state>NC</state>
<zip>27603</zip>
</destination>
</map>
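Because the directions are tagged rather than formatted, a program can read them as easily as a person can. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and an abbreviated copy of the map document; the function name route_steps is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the map document shown above.
MAP_XML = """<map>
  <start><addr1>100 West Morgan St</addr1><city>Raleigh</city></start>
  <directions>
    <left distance="0.11 miles">W MORGAN ST</left>
    <left distance="0.11 miles">S WILMINGTON ST</left>
    <left distance="0.44 miles">E EDENTON ST</left>
    <right distance="0.09 miles">N WEST ST</right>
    <left distance="0.02 miles">W JONES ST</left>
  </directions>
  <destination><addr1>508 W Jones St</addr1><city>Raleigh</city></destination>
</map>"""


def route_steps(xml_text):
    """Return (turn, street, miles) for each child of <directions>, in order."""
    root = ET.fromstring(xml_text)
    steps = []
    for step in root.find("directions"):
        miles = float(step.get("distance").split()[0])  # "0.11 miles" -> 0.11
        steps.append((step.tag, step.text, miles))
    return steps


steps = route_steps(MAP_XML)
total_miles = round(sum(miles for _, _, miles in steps), 2)
```

The element name carries the turn direction and the attribute carries the distance, so no text scraping is needed to total the route.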
Working with Objects
• An object-oriented approach allows greater flexibility
• XML has object-like implementations and uses
• One example is the ‘Schema for Object-Oriented XML’ (SOX)
• SOX enforces a valid Uniform Resource Identifier (URI): the uri attribute of the <schema> element must include the file:// portion
Working with Objects
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE schema SYSTEM
"urn:x-commerceone:document:com:commerceone:xdk:xml:schema.dtd$1.0">
<schema uri = "file:///S:/home_computer.sox" soxlang-version = "V0.2.2">
<elementtype name = "home_computer">
<model>
<sequence>
<element type = "monitor"/>
<element type = "housing"/>
<element type = "speakers"/>
<element type = "keyboard"/>
<element type = "mouse"/>
</sequence>
</model>
</elementtype>
<elementtype name = "monitor">
<model>
<string/>
</model>
</elementtype>
…
</schema>
Working with Objects
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE schema SYSTEM "urn:x-commerceone:document:com:commerceone:xdk:xml:schema.dtd$1.0">
<schema uri = "file:///S:/home_computer.sox" soxlang-version = "V0.2.2">
<join system = "file:///S:/home_computer.sox"/>
<elementtype name = "work_computer">
<extends type = "home_computer">
<append>
<sequence>
<element type = "scanner"/>
<element type = "zip_drive"/>
<element type = "printer"/>
</sequence>
</append>
</extends>
</elementtype>
<elementtype name = "scanner">
<model>
<string/>
</model>
</elementtype>
<elementtype name = "zip_drive">
<model>
<string/>
</model>
</elementtype>
<elementtype name = "printer">
<model>
<string/>
</model>
</elementtype>
</schema>
Working with Objects
[Figure: the home_computer.sox schema and the work_computer.sox schema that extends it]
Working with Objects
• SOX-compliant parsers pull in the required parent schemas
<?xml version = "1.0" encoding = "UTF-8"?>
<?soxtype file:///S:/work_computer.sox?>
<work_computer>
<monitor>15 inch</monitor>
<housing>
<cpu>1 GHz</cpu>
<ram>256 MB</ram>
<disk_space>60 GB</disk_space>
<modem>56k</modem>
</housing>
<speakers>JBL</speakers>
<keyboard>Microsoft</keyboard>
<mouse>Microsoft</mouse>
<scanner>Microtek</scanner>
<zip_drive>100 MB</zip_drive>
<printer>HP</printer>
</work_computer>
Application Messaging
• XML and SOAP enable ‘application messaging’
• XML is used to describe the API of a Web Service
• Software applications should be able to communicate in a nearly automated fashion
• XML describes the Web Services themselves
• Applications can query other applications to judge their capabilities before making a request
Application Messaging
• NFQuery.dtd
<?xml version='1.0' encoding='UTF-8' ?>
<!ELEMENT query (news)>
<!ELEMENT news EMPTY>
<!ATTLIST news date CDATA #REQUIRED
type (global | local | financial |
sports | travel | weather ) #REQUIRED
what (count | headers | all ) #REQUIRED
limit CDATA #IMPLIED >
• Joe_query.xml
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE query SYSTEM "NFQuery.dtd">
<query>
<news date = "2001-10-10" type = "sports" what = "count"/>
</query>
Application Messaging
• NFResponse.dtd
<?xml version='1.0' encoding='UTF-8' ?>
<!ELEMENT results (count | headers | all)>
<!ATTLIST results date CDATA #REQUIRED
type (global | local | financial |
sports | travel | weather ) #REQUIRED >
<!ELEMENT count (#PCDATA)>
<!ELEMENT headers (#PCDATA)>
<!ELEMENT all (headline+)>
<!ELEMENT headline (#PCDATA)>
• Joe_query_response.xml
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE results SYSTEM "NFResponse.dtd">
<results date = "2001-10-10" type = "sports">
<count>53</count>
</results>
Application Messaging
• Joe_get_query.xml
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE query SYSTEM "NFQuery.dtd">
<query>
<news date="2001-10-10"
type="sports" what="all" limit="10"/>
</query>
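The query and response documents above can be produced and consumed by ordinary application code. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper names build_query and read_count are hypothetical, and the enumerated values are copied by hand from NFQuery.dtd rather than read from it (ElementTree does not validate against a DTD).

```python
import xml.etree.ElementTree as ET

# Enumerated values copied from the ATTLIST in NFQuery.dtd.
NEWS_TYPES = {"global", "local", "financial", "sports", "travel", "weather"}
WHAT_VALUES = {"count", "headers", "all"}


def build_query(date, news_type, what, limit=None):
    """Build a <query> document; raise ValueError for values the DTD would reject."""
    if news_type not in NEWS_TYPES:
        raise ValueError(f"type must be one of {sorted(NEWS_TYPES)}")
    if what not in WHAT_VALUES:
        raise ValueError(f"what must be one of {sorted(WHAT_VALUES)}")
    query = ET.Element("query")
    attrs = {"date": date, "type": news_type, "what": what}
    if limit is not None:  # limit is #IMPLIED, so it may be omitted
        attrs["limit"] = str(limit)
    ET.SubElement(query, "news", attrs)
    return ET.tostring(query, encoding="unicode")


def read_count(response_xml):
    """Extract the integer from a <results><count>..</count></results> reply."""
    return int(ET.fromstring(response_xml).findtext("count"))


joe_query = build_query("2001-10-10", "sports", "all", limit=10)
```

A client could first send a what="count" query, call read_count on the reply, and only then decide whether to request what="all".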
Process Modeling
• XML is used to model processes and workflows
<map>
<start>
<addr1>100 West Morgan St</addr1>
<city>Raleigh</city>
<state>NC</state>
<zip>27603</zip>
</start>
<directions>
<left distance="0.11 miles">W MORGAN ST</left>
<left distance="0.11 miles">S WILMINGTON ST</left>
<left distance="0.44 miles">E EDENTON ST</left>
<right distance="0.09 miles">N WEST ST</right>
<left distance="0.02 miles">W JONES ST</left>
</directions>
<destination>
<addr1>508 W Jones St</addr1>
<city>Raleigh</city>
<state>NC</state>
<zip>27603</zip>
</destination>
</map>
The .NET Framework
• A shift from individual Web sites or devices to an integrated cluster of devices and services
• People will control what information is delivered to them, when, and how
• HTML-based presentation is augmented by XML-based information
• The XML-based .NET programming model builds on the idea of XML-based Web Services
• Web services build on standards such as HTTP and XML
XML within .NET
• XML is the obvious choice for representing commands and typed data
• XML is the standard metalanguage for describing data
• SOAP is an industry-standard protocol for exchanging XML messages
• Service Contract Language (SCL) is an XML grammar for documenting Web Service contracts
• Businesses can create a variety of value-added applications by combining Web Services
Required Knowledge
• XML is a metalanguage used to define markup languages
• Knowledge of XPath or SOAP is helpful
• You may use a text editor or an IDE
• You can even write your own parser for validating instance documents
• Refer to help resources on the web:
– http://www.w3.org
– http://www.ietf.org
– http://www.oasis-open.org
Goals of XML
• The goals behind creating the XML language:
– XML should be compatible with SGML
– XML should support a wide variety of applications
– XML should be easily usable over the Internet
– The XML design should be prepared quickly
– The design of XML should be formal and concise
– XML documents should be human-legible and reasonably clear
– XML documents should be easy to create
– Programs that process XML documents should be easy to write
– Terseness in XML markup is of minimal importance
– Optional features should be kept to a bare minimum
The XML Language
• Elements are the tags, that is the vocabulary, that you create with XML; each is declared with a name and a content model
<!ELEMENT name content-model>
• A customer data model...
The XML Language
• We first define our customer element
<!ELEMENT customer (name , contact)>
• You can impose other rules on this, such as requiring name OR contact: (name | contact)
• Defining <name> and <contact> is similar, as they are also parents of child elements:
<!ELEMENT name (first , middle , last)>
<!ELEMENT contact (address , phone)>
<!ELEMENT address (street , city , state , zip)>
<!ELEMENT phone (home , work , mobile)>
XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated as <red/>
• an XML document has a single root element
XML Syntax
• Another example:
<db>
<book>
<title>Complete Guide to DB2</title>
<author>Chamberlin</author>
</book>
<book>
<title>Transaction Processing</title>
<author>Bernstein</author>
<author>Newcomer</author>
</book>
<publisher>
<name>Morgan Kaufman</name>
<state>CA</state>
</publisher>
</db>
The XML Tree
db
    book
        title: “Complete Guide to DB2”
        author: “Chamberlin”
    book
        title: “Transaction Processing”
        author: “Bernstein”
        author: “Newcomer”
    publisher
        name: “Morgan Kaufman”
        state: “CA”
Tags on nodes
Data values on leaves
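The tree view can be reproduced by walking a parsed document. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the generator name walk is hypothetical.

```python
import xml.etree.ElementTree as ET

DB_XML = """<db>
  <book>
    <title>Complete Guide to DB2</title>
    <author>Chamberlin</author>
  </book>
  <book>
    <title>Transaction Processing</title>
    <author>Bernstein</author>
    <author>Newcomer</author>
  </book>
  <publisher>
    <name>Morgan Kaufman</name>
    <state>CA</state>
  </publisher>
</db>"""


def walk(element, depth=0):
    """Yield (depth, tag, value) for every node; value is None for inner nodes."""
    has_children = len(element) > 0
    text = (element.text or "").strip()
    yield depth, element.tag, None if has_children else text
    for child in element:
        yield from walk(child, depth + 1)


nodes = list(walk(ET.fromstring(DB_XML)))
for depth, tag, value in nodes:
    # Inner nodes print only their tag; leaves print tag and data value.
    print("  " * depth + tag + (f": {value}" if value is not None else ""))
```

Inner nodes carry only a tag, while each leaf carries a tag plus its data value, which matches the picture above.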
XML Components
• An XML file normally consists of three types of
markup, the first two of which are optional:
1. An XML declaration, written like a processing instruction (PI), identifying the version of XML being used, the way in which it is encoded, and whether it references other files or not, e.g.
<?xml version="1.0" encoding="ISO-10646-UCS-2" standalone="yes"?>
XML Components
2. A document type declaration (<!DOCTYPE …>), which
– either contains the formal markup declarations in its internal subset (between square brackets) or
– references a file containing the relevant markup declarations (the external subset), e.g.:
<!DOCTYPE memo SYSTEM "http://www.myco.com/dtds/memo.dtd">
XML Components
3. A fully tagged document instance, which consists of a root element whose element type name must match the document type name given in the document type declaration, and within which all other markup is nested.
XML Characteristics
• Valid: a document is valid
– if all three components are present, and
– the document instance conforms to the rules defined in the document type definition
• Well-formed: a document is well-formed
– if it has a single root element,
– if each element is properly nested within its parent elements,
– if each start tag has a matching end tag, and
– if each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string.
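Well-formedness can be checked without any DTD, since it depends only on the syntax rules listed above. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, whose parser is non-validating; the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    "nested": is_well_formed("<a><b></b></a>"),          # properly nested
    "crossed": is_well_formed("<a><b></a></b>"),         # tags overlap
    "unquoted": is_well_formed("<a x=1></a>"),           # attribute value not quoted
    "two_roots": is_well_formed("<a></a><b></b>"),       # more than one root element
}
```

Checking validity would additionally require a validating parser and the DTD, which this sketch does not attempt.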
XML Components
• Six kinds of markup that can occur in an XML
document: elements, entity references, comments,
processing instructions, marked sections, and
document type declarations.
• Elements
– An XML document primarily consists of a strictly nested hierarchy of elements with a single root.
– Elements can contain character data, child elements, or a mixture of both. In addition, they can have attributes.
XML Components
• Child character data and child elements are
strictly ordered; attributes are not. For example:
<?xml version="1.0" ?>
<Book Author="Anonymous">
<Title>Sample Book</Title>
<Chapter id="1">
This is chapter 1. It is not very long or interesting.
</Chapter>
<Chapter id="2">
This is chapter 2. Although it is longer than chapter 1,
it is not any more interesting.
</Chapter>
<comments/>
</Book>
“Types” (or “Schemas”) for XML
• Document Type Definition – DTD
• Defines a grammar for the XML document
– we use it as a substitute for types/schemas
• Intended to be replaced by XML Schema
Document Type Definition (DTD)
• The Document Type Definition (DTD) is contained in a <!DOCTYPE> declaration, contained in an external file and referenced from a <!DOCTYPE> declaration, or both.
<!DOCTYPE Book [
<!ELEMENT Book (Title, Chapter+,comments?)>
<!ATTLIST Book Author CDATA #REQUIRED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Chapter (#PCDATA)>
<!ATTLIST Chapter id ID #REQUIRED>
<!ELEMENT comments EMPTY>
]>
• PCDATA means Parsed Character Data (a mouthful for string)
An Example DTD
<!DOCTYPE db [
<!ELEMENT db ((book|publisher)*)>
<!ELEMENT book (title,author*,year?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
]>
• PCDATA means Parsed Character Data (a mouthful
for string)
DTDs as Grammars
db ::= (book|publisher)*
book ::= (title,author*,year?)
title ::= string
author ::= string
year ::= string
publisher ::= string
• A DTD is an EBNF (Extended BNF) grammar
• An XML tree is precisely a derivation tree
XML documents that have a DTD and conform to it are called valid
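Since element content models are regular expressions over child element names, one way to make the grammar view concrete is to translate a content model into a regex and match it against an element's children. The following is a minimal sketch, assuming Python's standard re and xml.etree.ElementTree modules; the function names are hypothetical, and the translation covers only element-only models (it does not handle #PCDATA, EMPTY, or ANY).

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")


def content_model_to_regex(model):
    """Translate e.g. "(title,author*,year?)" into a regex over "tag " tokens."""
    compact = re.sub(r"\s+", "", model)  # drop layout whitespace
    # Each element name becomes a group matching the name plus a trailing space.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # A comma means sequence, which is plain concatenation in a regex;
    # |, ?, + and * already have the same meaning in both notations.
    return re.compile(pattern.replace(",", ""))


def children_match(element, model):
    """Check the order and count of an element's children against a content model."""
    sequence = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(sequence) is not None


BOOK_MODEL = "(title,author*,year?)"
good = ET.fromstring("<book><title>T</title><author>A</author><author>B</author></book>")
bad = ET.fromstring("<book><author>A</author><title>T</title></book>")
```

Here children_match(good, BOOK_MODEL) holds, while bad is rejected because its author precedes its title.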
DTD Vs XML Schema
• DTD: old-style typing, still widely used
• XML schema: more modern, used e.g. in Web
services
• DTD:
<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD Vs XML Schema
• The same structure in XML schema (an XML
dialect)
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string" minOccurs="1" maxOccurs="1"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Elements
1) An element is defined as a group of one or more
subelements/subgroups, character data, EMPTY, or
ANY. For example:
Group:
<!ELEMENT A (B, C)>
Character data:
<!ELEMENT A (#PCDATA)>
Empty:
<!ELEMENT A EMPTY>
Any:
<!ELEMENT A ANY>
Elements
2) Elements defined as groups of subelements/
subgroups constitute non-terminals in the
language. Elements defined as character data,
EMPTY, or ANY constitute terminals. For example:
<!-- Element A is a non-terminal. -->
<!ELEMENT A (B)>
<!-- Element B is a terminal. -->
<!ELEMENT B (#PCDATA)>
• It is legal to define a language containing non-terminals
that never resolve to terminals, such as one with purely
circular definitions
• It is generally impossible and/or useless to create any valid
documents for such languages.
Elements
3) Groups can be either a sequence or choice of
subelements and/or subgroups. For example:
Sequence:
<!-- Element A consists of a single element B. -->
<!ELEMENT A (B)>
<!-- Element A consists of element B followed by element C. -->
<!ELEMENT A (B, C)>
<!-- Element A consists of a sequence, including a choice subgroup. -->
<!ELEMENT A (B, (C | D), E)>
Choice:
<!-- Element A consists of either element B or element C. -->
<!ELEMENT A (B | C)>
<!-- Element A consists of a choice, including a sequence subgroup. -->
<!ELEMENT A (B | C | (D, E))>
Elements
4) Optional (?), one-or-more (+), and zero-or-more (*)
operators can be applied to groups, subgroups, and
subelements. For example:
Optional:
<!-- Subelement B is optional. -->
<!ELEMENT A (B?, C)>
One or more:
<!-- Subgroup (C | D) occurs one or more times. -->
<!ELEMENT A (B, (C | D)+, E)>
Zero or more:
<!-- Group (B, C) occurs zero or more times, i.e. A can be
empty. -->
<!ELEMENT A (B, C)*>
Elements
5) Elements containing character data can be declared
as containing only character data:
<!ELEMENT A (#PCDATA)>
or as containing a mixture of character data and
elements:
<!ELEMENT A (#PCDATA | B | C)*>
• The latter case is an example of “mixed content”
• "PCDATA" in the declarations is short for "Parsed
Character DATA".
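Mixed content shows up in a parsed tree as text interleaved with child elements. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which stores the text before the first child in .text and the text after each child in that child's .tail; the instance document and the function name mixed_content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of <!ELEMENT A (#PCDATA | B | C)*>
a = ET.fromstring("<A>one <B>two</B> three <C/> four</A>")


def mixed_content(element):
    """List an element's direct content in document order:
    character data as plain strings, child elements as "<tag>" markers."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        parts.append(f"<{child.tag}>")
        if child.tail:  # character data that follows the child's end tag
            parts.append(child.tail)
    return parts


parts = mixed_content(a)
```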
Elements
6) EMPTY means that the element has no child
elements or character data.
7) ANY means that the element can contain zero or
more child elements of any declared type, as well as
character data.
– It is therefore a shorthand for mixed content containing all
declared elements.
Attributes
1) Elements can have zero or more attributes. For
example:
<!ELEMENT A (#PCDATA)>
<!-- Declare an attribute a for element A -->
<!ATTLIST A a CDATA #IMPLIED>
• Attributes are name-value pairs that occur inside
tags after the element name.
<div class="preface">
• In XML, all attribute values must be quoted.
• Attributes are alternative ways to represent data
Attributes
<book price = "55" currency = "USD">
<title> Complete Guide to DB2 </title>
<author> Chamberlin </author>
<year> 1998 </year>
</book>
price, currency are called attributes
Replacing Attributes with Elements
<book>
<title> Complete Guide to DB2 </title>
<author> Chamberlin </author>
<year> 1998 </year>
<price> 55 </price>
<currency> USD </currency>
</book>
attributes are alternative ways to represent data
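For the consuming program, the two representations differ only in how a value is looked up. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name price_of is hypothetical.

```python
import xml.etree.ElementTree as ET

with_attributes = ET.fromstring(
    '<book price="55" currency="USD"><title> Complete Guide to DB2 </title></book>')
with_elements = ET.fromstring(
    "<book><title> Complete Guide to DB2 </title>"
    "<price> 55 </price><currency> USD </currency></book>")


def price_of(book):
    """Return (amount, currency) whichever representation the book uses."""
    # Attributes are read with get(); child elements with findtext().
    amount = book.get("price") or book.findtext("price")
    currency = book.get("currency") or book.findtext("currency")
    return float(amount.strip()), currency.strip()
```

One practical difference is that an attribute can appear at most once per element and holds only a string, whereas a child element can repeat and can itself have structure.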
Attributes
2) A single ATTLIST statement can declare multiple
attributes for the same element. Multiple ATTLIST
statements can declare attributes for the same
element. That is, the following are equivalent:
Single ATTLIST statement declaring multiple attributes for an element:
<!-- Element A has attributes a and b -->
<!ATTLIST A
a CDATA #IMPLIED
b CDATA #IMPLIED>
Multiple ATTLIST statements declaring attributes for the
same element:
<!-- Element A has attributes a and b -->
<!ATTLIST A a CDATA #IMPLIED>
<!ATTLIST A b CDATA #IMPLIED>
Attributes
3) Attributes can be optional, required, or have a
fixed value. Optional attributes can have a
default; fixed attributes must have a default. For
example:
Optional without a default:
<!-- Element A has an attribute a. #IMPLIED = "optional, no default" -->
<!ATTLIST A a CDATA #IMPLIED>
Optional with a default:
<!-- If attribute a is not provided, a default of "aaa" will be used. -->
<!ATTLIST A a CDATA "aaa">
Required:
<!ATTLIST A a CDATA #REQUIRED>
Fixed:
<!-- The value of attribute a is always "aaa" -->
<!ATTLIST A a CDATA #FIXED "aaa">
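The four declaration forms above amount to a small decision rule for the value an application sees. The following is a minimal sketch of that rule in plain Python, not a parser; the function name effective_value and the label "default" for the "optional with a default" case are hypothetical.

```python
def effective_value(kind, declared_default, supplied):
    """Resolve one attribute according to its declaration.

    kind is "#IMPLIED", "#REQUIRED", "#FIXED", or "default" (a plain default
    value); supplied is the value written in the document, or None if absent.
    """
    if kind == "#REQUIRED":
        if supplied is None:
            raise ValueError("required attribute is missing")
        return supplied
    if kind == "#FIXED":
        if supplied is not None and supplied != declared_default:
            raise ValueError("fixed attribute must equal its declared value")
        return declared_default
    if kind == "default":
        return declared_default if supplied is None else supplied
    if kind == "#IMPLIED":
        return supplied  # may be None: optional, no default
    raise ValueError(f"unknown declaration kind: {kind}")
```

For example, effective_value("default", "aaa", None) gives "aaa", while effective_value("#IMPLIED", None, None) gives None.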
Attributes
4) Each attribute has a type:
– Character data:
<!ATTLIST A a CDATA #IMPLIED>
– A user-defined enumerated type
<!-- Attribute a uses a simple enumeration. -->
<!ATTLIST A a (yes | no) #IMPLIED>
<!-- Attribute a uses an enumeration of notation types.-->
<!ATTLIST A a NOTATION (ps | pdf) #IMPLIED>
Attributes
• ID, IDREF: These attributes point from one element
to another. The value of the IDREF attribute on the
pointing element is the same as the value of the ID
attribute on the pointed-to element.
<!-- Attribute id gives the ID of element A -->
<!ATTLIST A id ID #IMPLIED>
<!-- Attribute ref points to the ID of another element -->
<!ATTLIST A ref IDREF #IMPLIED>
Oids and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name>
<children idref="o123 o555"/>
</person>
<person id="o123" mother="o456"> <name> John </name>
</person>
oids and references in XML are just syntax
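Because the references are just syntax, it is the application that follows them. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the enclosing <people> root and the helper names are hypothetical additions, since the slide shows the three <person> elements without a root.

```python
import xml.etree.ElementTree as ET

# The three <person> elements from the slide, wrapped in a hypothetical root.
PEOPLE = ET.fromstring("""<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>""")

# Index every person by its id so that references can be resolved.
by_id = {person.get("id"): person for person in PEOPLE.iter("person")}


def name_of(oid):
    """Return the name of the person whose id attribute equals oid."""
    return by_id[oid].findtext("name")


def children_of(oid):
    """Follow the space-separated references in <children idref="...">."""
    refs = by_id[oid].find("children")
    return [] if refs is None else [name_of(r) for r in refs.get("idref").split()]


mother_of_john = name_of(by_id["o123"].get("mother"))
```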
Attributes
• ENTITY, ENTITIES: These attributes point to external data in the form of unparsed entities.
<!-- Attribute a points to a single unparsed entity -->
<!ATTLIST A a ENTITY #IMPLIED>
<!-- Attribute b points to multiple unparsed entities -->
<!ATTLIST A b ENTITIES #IMPLIED>
• NMTOKEN, NMTOKENS: These attributes have single/multiple name tokens as values.
<!ATTLIST A a NMTOKEN #IMPLIED>
<!ATTLIST A b NMTOKENS #IMPLIED>
Entity Declarations
• Entity declarations allow you to associate a name with some other fragment of content.
• That fragment can be a chunk of regular text, a chunk of the document type declaration, or a reference to an external file containing either text or binary data.
<!ENTITY ATI "ArborText, Inc.">
<!ENTITY boilerplate SYSTEM "/standard/legalnotice.xml">
<!ENTITY ATIlogo SYSTEM "/standard/logo.gif" NDATA GIF87A>
Entity Declarations
• There are three kinds of entities: internal, external, and parameter.
• Internal Entities
– The replacement text is stored in the declaration.
– Using &ATI; anywhere in the document inserts “ArborText, Inc.” at that location.
– A character reference can be used to insert arbitrary Unicode characters.
– Character references take one of two forms: decimal references, &#8478;, and hexadecimal references, &#x211E;. Both of these refer to character number U+211E from Unicode.
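The equivalence of the two character reference forms can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name sym and the helper char_reference are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same character written as a decimal and as a hexadecimal reference.
from_decimal = ET.fromstring("<sym>&#8478;</sym>").text
from_hex = ET.fromstring("<sym>&#x211E;</sym>").text


def char_reference(ch, use_hex=False):
    """Write a single character as a decimal or hexadecimal character reference."""
    code = ord(ch)
    return f"&#x{code:X};" if use_hex else f"&#{code};"
```

Both parses yield the one-character string U+211E, and char_reference turns that character back into either form.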
Entity Declarations
• Internal entities can include references to other internal entities, but it is an error for them to be recursive.
• Example:
<element> this is less than &lt; </element>
• The XML specification predefines five internal entities:
Declaration                   Reference   Symbol
<!ENTITY lt "&#38;#60;">      &lt;        <
<!ENTITY gt "&#62;">          &gt;        >
<!ENTITY amp "&#38;#38;">     &amp;       &
<!ENTITY apos "&#39;">        &apos;      '
<!ENTITY quot "&#34;">        &quot;      "
• A character reference (decimal or hexadecimal) stands for a Unicode character
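Programs that generate XML need to replace markup characters with these predefined entity references. The following is a minimal sketch, assuming the escape and unescape helpers from Python's standard xml.sax.saxutils module, which handle &, < and > by default and accept a dictionary for additional replacements; the wrapper names escape_all and unescape_all are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# The two quote characters are not escaped by default, so they are passed in.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}


def escape_all(text):
    """Replace all five special characters with their predefined references."""
    return escape(text, QUOTES)


def unescape_all(text):
    """Inverse of escape_all."""
    return unescape(text, UNQUOTES)


escaped = escape_all('this is less than < & "quoted"')
```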
Entity Declarations
• External Entities
– Using &boilerplate; will insert the contents of
the file /standard/legalnotice.xml
– The XML processor will parse the content of
that file as if its content had been typed at the
location of the entity reference.
– The entity ATIlogo is also an external entity, but
its content is binary.
– The ATIlogo entity can only be used as the
value of an ENTITY (or ENTITIES) attribute (on a
graphic element, perhaps).
Entity Declarations
• Parameter Entities
– Parameter entities can only occur in the document
type declaration.
– A parameter entity is identified by placing “% ”
(percent-space) in front of its name in the
declaration.
– The percent sign is also used in references to
parameter entities, instead of the ampersand.
– Parameter entity references are immediately expanded in the document type declaration and their replacement text is part of the declaration, whereas normal entity references are not expanded.
Notation Declarations
• Notation declarations identify specific types of external binary data.
• This information is passed to the processing application, which may make whatever use of it that it wishes. A typical notation declaration is:
<!NOTATION GIF87A SYSTEM "GIF">
• Comments begin with <!-- and end with -->
Processing Instructions
• Processing instructions (PIs) are an escape hatch to
provide information to an application.
• An XML processor is required to pass them to the application.
• Syntax: <?target argument?>
• Example:
<product> <name> Alarm Clock </name>
<?ringBell 20?>
<price> 19.99 </price>
</product>
• The names used in PIs may be declared as notations in
order to formally identify them.
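An application receives each processing instruction as a target plus an argument string. The following is a minimal sketch, assuming Python's standard xml.dom.minidom module, which keeps processing instructions as nodes in the tree; the function name processing_instructions is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<product><name>Alarm Clock</name><?ringBell 20?><price>19.99</price></product>")


def processing_instructions(node):
    """Yield (target, data) for every processing instruction below node."""
    for child in node.childNodes:
        if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
            yield child.target, child.data
        else:
            # Recurse into elements; text nodes simply have no children.
            yield from processing_instructions(child)


pis = list(processing_instructions(doc))
```

What the application does with the target ringBell and the argument 20 is entirely up to it; the parser only hands them over.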
CDATA Sections
• In a document, a CDATA section instructs the parser
to ignore most markup characters.
• Consider a source code listing in an XML document.
• It might contain characters that the XML parser
would ordinarily recognize as markup (< and &, for
example).
<![CDATA[
*p = &q;
b = (i <= 3);
]]>
• Comments are not recognized in a CDATA section.
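To the application, a CDATA section is just character data; only the way it is written differs. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name listing is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same source code fragment, once in a CDATA section and once escaped.
in_cdata = ET.fromstring("<listing><![CDATA[*p = &q; b = (i <= 3);]]></listing>")
escaped = ET.fromstring("<listing>*p = &amp;q; b = (i &lt;= 3);</listing>")

# Text that looks like a comment is kept verbatim inside a CDATA section.
not_a_comment = ET.fromstring("<listing><![CDATA[<!-- kept -->]]></listing>")
```

Both in_cdata.text and escaped.text contain the same string, so the CDATA form is purely a convenience for the author.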
XML Namespaces
• http://www.w3.org/TR/REC-xml-names (1/99)
• A particular label, e.g., number, may denote
different notions in different contexts
• name ::= [prefix:]localpart
<book xmlns:isbn="www.isbn-org.org/def">
<title> … </title>
<number> 15 </number>
<isbn:number> …. </isbn:number>
</book>
XML Namespaces
• syntactic: <number> , <isbn:number>
• semantic: provide URL for schema
<tag xmlns:mystyle = "http://…">
<!-- the prefix mystyle is defined here -->
…
<mystyle:title> … </mystyle:title>
<mystyle:number> … </mystyle:number>
</tag>
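A namespace-aware parser replaces each prefix with the URI it is bound to, so the two number elements end up with different names. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which writes qualified names as {uri}localpart; the element contents are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Same shape as the <book> example above; the text values are placeholders.
book = ET.fromstring(
    '<book xmlns:isbn="www.isbn-org.org/def">'
    "<title>placeholder title</title>"
    "<number>15</number>"
    "<isbn:number>placeholder isbn</isbn:number>"
    "</book>")

tags = [child.tag for child in book]

# Lookups use a prefix-to-URI mapping; the prefix chosen here need not match the document's.
NS = {"i": "www.isbn-org.org/def"}
plain_number = book.findtext("number")
isbn_number = book.findtext("i:number", namespaces=NS)
```

Only the URI matters for identity: the prefix is a local abbreviation, which is why the lookup can use i where the document uses isbn.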