SEMISTRUCTURED DATA AND XML
HOW THE WEB IS TODAY

• HTML documents
  - often generated by applications
  - consumed by humans only
  - easy access: across platforms, across organizations
  - only layout, no semantic information
• No application interoperability:
  - HTML not understood by applications
  - screen scraping is brittle
  - database technology: client-server, still vendor-specific
XML DATA EXCHANGE FORMAT
• A standard from the W3C (World Wide Web Consortium, http://www.w3.org).
• The mission of the W3C: "... developing common protocols that promote its evolution and ensure its interoperability ..." (where "its" refers to the Web).
• Basic ideas
  - XML = data
  - XML generated by applications
  - XML consumed by applications
  - Easy access: across platforms, organizations.
PARADIGM SHIFT ON THE WEB

• For web search engines:
  - From documents (HTML) to data (XML)
  - From document management to document understanding (e.g., question answering)
  - From information retrieval to data management
• For database systems:
  - From the relational (structured) model to semistructured data
  - From data processing to data/query translation
  - From storage to transport
THE SEMISTRUCTURED DATA MODEL
[Figure: Object Exchange Model (OEM) graph of a bibliography database. The root &o1 (entry point Bib) is a complex object with paper and book children (&o12, &o24, &o29). Arcs carry labels such as references, author, title, year, publisher, http, page, first, last, firstname and lastname. Leaves are atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122 and 133.]
THE SEMISTRUCTURED DATA MODEL
• Data is self-describing, i.e. the data description is integrated with the data itself rather than kept in a separate schema.
• The database is a collection of nodes and arcs (a directed graph).
• Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings).
• Interior nodes represent complex objects consisting of components (child nodes), which are connected to the node by arcs.
• Arcs are directed and connect two nodes.
THE SEMISTRUCTURED DATA MODEL
• Arc labels indicate the relationship between the two corresponding nodes.
• The root node is the only interior node without incoming arcs; it represents the entire database.
• All database objects are descendants of the root node.
• Every node must be reachable from the root.
• A general graph structure is possible, i.e. the graph need not be a tree.
SYNTAX FOR SEMISTRUCTURED DATA
Bib: &o1 { paper: &o12 { … },
           book: &o24 { … },
           paper: &o29
             { author: &o52 "Abiteboul",
               author: &o96 { firstname: &o243 "Victor",
                              lastname: &o206 "Vianu" },
               title: &o93 "Regular path queries with constraints",
               references: &o12,
               references: &o24,
               page: &o25 { first: &o64 122, last: &o92 133 }
             }
         }
Observe: Nested tuples, set-values, oids!
SYNTAX FOR SEMISTRUCTURED DATA
May omit oids:
{ paper: { author: "Abiteboul",
           author: { firstname: "Victor",
                     lastname: "Vianu" },
           title: "Regular path queries …",
           page: { first: 122, last: 133 }
         }
}
VS. RELATIONAL MODEL
Characteristics of semistructured data:
• Missing attributes
• Additional attributes
• Multiple attribute values (set-valued attributes)
• Objects as attribute values
• No global schema

Only the first characteristic (missing attributes, via NULL values) is supported by the relational model; the others are not.
VS. RELATIONAL MODEL

Semistructured data        Relational DB
Self-describing            Separate schema
Irregular data             Regular data
No a-priori structure      A-priori structure
XML
IMPORTANT XML STANDARDS
• XSL/XSLT: presentation and transformation standards
• RDF: Resource Description Framework (meta-information such as ratings, categorizations, etc.)
• XPath/XPointer/XLink: standards for addressing and linking to documents and elements within them
• Namespaces: for resolving name clashes
• DOM: Document Object Model for manipulating XML documents
• SAX: Simple API for XML (event-based parsing)
• XQuery: query language
XML
• A W3C standard to complement HTML
• Origins: structured text (SGML)
  - Large-scale electronic publishing
  - Data exchange on the web
• Motivation:
  - HTML describes presentation
  - XML describes content
• http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
XML ⊂ SGML (XML is a simplified subset of SGML; HTML 4.0 is itself an SGML application)
FROM HTML TO XML
HTML describes the presentation
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
HTML describes the presentation
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
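Because the tags describe content, an application can pull fields out of the document directly instead of scraping a layout. A minimal sketch using Python's standard `xml.etree.ElementTree`; the sample document follows the slide, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample document in the style of the slide above.
BIB_XML = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(BIB_XML)

# Turn every <book> element into a plain dictionary.
books = [
    {
        "title": book.findtext("title"),
        "authors": [a.text for a in book.findall("author")],
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
```

The same extraction from the HTML version would have to guess which italic text is a title and which line holds the authors.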
WHY ARE WE DB’ERS INTERESTED?

• It's data. That's us.
• Database issues:
  - How are we going to model XML? (graphs)
  - How are we going to query XML? (XQuery)
  - How are we going to store XML? (in a relational database? object-oriented? native?)
  - How are we going to process XML efficiently? (many interesting research questions!)
ELEMENTS

• Tags
  - book, title, author, …
  - start tag: <book>, end tag: </book>
  - defined by the user / programmer (different from HTML!)
• Elements: <book>…</book>, <author>…</author>
  - An element consists of a matching start and end tag and the enclosed content.
  - Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.
ATTRIBUTES

• Attributes can be associated with any element.
• They provide additional information about elements.
• Each attribute can have only one value.
• Example
  <book price="55" currency="USD">
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    …
    <year> 1995 </year>
  </book>
• Attributes can also be used to connect elements.
NON-TREE-LIKE XML
• So far: only tree-like XML documents, i.e. each element is nested within at most one other element.
• Attributes can also be used to create non-tree XML documents.
• Attributes of type ID serve as primary keys of elements.
• Attributes of type IDREF serve as foreign keys referencing the ID of another element.

NON-TREE-LIKE XML
• Example of a non-tree structure
<persons>
  <person personid="o555">
    <name> Jane </name>
  </person>
  <person personid="o456">
    <name> Mary </name>
    <children refs="o123 o555"/>
  </person>
  <person personid="o123" mother="o456">
    <name> John </name>
  </person>
</persons>
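A short Python sketch of how an application might follow such references. Without a DTD the parser does not know which attributes are of type ID or IDREF, so the code below simply assumes that `personid` plays the ID role and that `mother` and `refs` hold references; the helper names are illustrative.

```python
import xml.etree.ElementTree as ET

PERSONS_XML = """
<persons>
  <person personid="o555"><name>Jane</name></person>
  <person personid="o456">
    <name>Mary</name>
    <children refs="o123 o555"/>
  </person>
  <person personid="o123" mother="o456"><name>John</name></person>
</persons>
"""

root = ET.fromstring(PERSONS_XML)

# Index every person by its ID-style attribute (the "primary key").
by_id = {p.get("personid"): p for p in root.findall("person")}


def name_of(person):
    return person.findtext("name").strip()


# Follow the IDREF-style attributes (the "foreign keys").
mother_of = {
    name_of(p): name_of(by_id[p.get("mother")])
    for p in by_id.values()
    if p.get("mother")
}
children_of = {
    name_of(p): [name_of(by_id[ref]) for ref in p.find("children").get("refs").split()]
    for p in by_id.values()
    if p.find("children") is not None
}
```

The element tree itself stays a tree; the graph structure only appears once the references are resolved through the `by_id` index.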
NAMESPACES


• An XML document can involve tags that come from multiple sources.
• One and the same tag can appear in more than one source.
<table> <tr>
<td>Apples</td>
<td>Bananas</td>
</tr> </table>
<table>
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>
NAMESPACES

Name conflicts can be resolved by prefixing tag
names according to their source.
<h:table>
<h:tr> <h:td>Apples</h:td>
<h:td>Bananas</h:td> </h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>


• When using prefixes in XML, a namespace for the prefix must be defined.
• The namespace must be declared (via a URI, using an xmlns:prefix attribute) in the start tag of an enclosing element.
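As a sketch of what such a declaration looks like and how a parser treats it, the Python snippet below wraps the two tables in a common root that declares both prefixes. The root element name and both namespace URIs are made-up placeholders.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs and the <root> wrapper are placeholders for illustration.
DOC = """
<root xmlns:h="http://example.org/html-tables"
      xmlns:f="http://example.org/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

NS = {"h": "http://example.org/html-tables", "f": "http://example.org/furniture"}
root = ET.fromstring(DOC)

# The prefix-to-URI mapping lets us address each kind of table separately.
cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.findtext("f:table/f:name", namespaces=NS)

# After parsing, the tag carries the URI rather than the prefix.
html_table_tag = root.find("h:table", NS).tag
```

The prefix is only shorthand: two documents may use different prefixes for the same URI and still refer to the same names.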
WELL-FORMED XML



• A well-formed XML document satisfies the following conditions:
  - Begins with a declaration that it is XML (the XML declaration is recommended, though XML 1.0 does not strictly require it).
  - Has a single root element that encloses the whole document.
  - Consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element.
• standalone="yes" states that the document does not rely on an external DTD.
• In this mode, you can invent your own tags, as in the semistructured data model.
WELL-FORMED XML
<?xml version="1.0" standalone="yes"?>
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
<book> <title> … </title>
...
</book>
…
</bibliography>
WELL-FORMED XML




• HTML browsers will display documents with errors (like missing end tags).
• The W3C XML specification states that a program should stop processing an XML document if it finds an error.
• The main reason is that XML is consumed by programs rather than by humans (as HTML is).
• W3C provides a validator that checks whether an XML document is well-formed.
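The "stop on error" behaviour is easy to observe with any conforming parser. A minimal sketch with Python's standard library; `is_well_formed` is an illustrative helper name, not a W3C tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = "<book><title>Foundations of Databases</title></book>"
missing_end_tag = "<book><title>Foundations of Databases</book>"
two_roots = "<book/><book/>"
```

Here `good` is accepted, while `missing_end_tag` and `two_roots` are rejected outright rather than repaired the way a browser would repair HTML.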
VALID XML




• The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD).
• A DTD specifies the allowable tags and how they can be nested.
• XML with a DTD is no longer semistructured (self-describing).
• However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.
DTD
DOCUMENT TYPE DEFINITIONS



• Document Type Definition (DTD): a set of rules (a grammar) specifying elements, attributes and all other aspects of XML documents.
• For each element, specify its name and content type.
• The content type can be, e.g.,
  - #PCDATA (character string),
  - other elements,
  - a regular expression made of the above content types:
      * = zero or more occurrences
      ? = zero or one occurrence
      + = one or more occurrences
      , = sequence of elements.
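Since a content type is a regular expression over element names, one way to get a feel for it is to translate it into an ordinary regex and match it against the sequence of child tags. The Python sketch below does this for a tiny subset of content models (names, commas, `*`, `?`, `+`, parentheses). It is an illustration only, not a DTD validator; it ignores #PCDATA and attributes, and the function names are assumptions.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Compile a small subset of DTD content models into a regex that is
    matched against strings of the form 'tag;tag;...'."""
    compact = re.sub(r"\s+", "", model)
    # Every element name becomes a group that consumes "name;".
    body = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + ";)",
        compact,
    )
    # A DTD comma means "followed by", which is plain concatenation in a regex.
    return re.compile(body.replace(",", ""))


def children_match(element, model):
    """Check the sequence of child tags of element against the content model."""
    tags = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(tags) is not None


BOOK_MODEL = "(title, author*)"
ok = ET.fromstring("<Book><title/><author/><author/></Book>")
no_title = ET.fromstring("<Book><author/></Book>")
wrong_order = ET.fromstring("<Book><author/><title/></Book>")
```

With this model, `ok` matches, while `no_title` and `wrong_order` do not, because the sequence operator fixes both presence and order.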

DOCUMENT TYPE DEFINITIONS
• Sort of like a schema, but not really.
<!ELEMENT Book (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, address, age?)>
<!ATTLIST Book id ID #REQUIRED>
<!ATTLIST Book pub IDREF #IMPLIED>
• Inherited from the SGML DTD standard
• BNF grammar establishing constraints on element structure and content
• Definitions of entities
EXAMPLE DTD: PRODUCT CATALOG
<!DOCTYPE CATALOG [
<!ELEMENT CATALOG (PRODUCT+)>
<!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)>
<!ATTLIST PRODUCT NAME CDATA #IMPLIED
CATEGORY (HandTool|Table|Shop-Professional) "HandTool"
PARTNUM CDATA #IMPLIED
PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"
INVENTORY (InStock|Backordered|Discontinued) "InStock">
<!ELEMENT SPECIFICATIONS (#PCDATA)>
<!ATTLIST SPECIFICATIONS WEIGHT CDATA #IMPLIED
POWER CDATA #IMPLIED>
<!ELEMENT OPTIONS (#PCDATA)>
<!ATTLIST OPTIONS FINISH (Metal|Polished|Matte) "Matte"
ADAPTER (Included|Optional|NotApplicable) "Included"
CASE (HardShell|Soft|NotApplicable) "HardShell">
<!ELEMENT PRICE (#PCDATA)>
<!ATTLIST PRICE MSRP CDATA #IMPLIED
WHOLESALE CDATA #IMPLIED
STREET CDATA #IMPLIED
SHIPPING CDATA #IMPLIED>
<!ELEMENT NOTES (#PCDATA)> ]>
SHORTCOMINGS OF DTDS
Useful for documents, but not so good for data:
• Element name and type are associated globally
• No support for structural re-use
  - Object-oriented-like structures aren't supported
• No support for data types
  - Can't do data validation
• Can have a single key item (ID), but:
  - No support for multi-attribute keys
  - No support for foreign keys (references to other keys)
  - No constraints on IDREFs (e.g., "reference only a Section")
XML SCHEMA
XML SCHEMA

• The successor of DTDs for specifying a schema for XML documents.
• A W3C standard.
• Includes and extends the functionality of DTDs.
• In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database.
• XML Schemas are written in XML. You don't have to learn a new language and can use your XML parser to parse your Schema files.
EXAMPLE XML SCHEMA
<xs:schema version="1.0"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="author" type="xs:string"/>
  <xs:element name="date" type="xs:date"/>
  <xs:element name="abstract">
    <xs:complexType> … </xs:complexType>
  </xs:element>
  <xs:element name="paper">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="date"/>
        <xs:element ref="abstract" minOccurs="0" maxOccurs="1"/>
        <xs:element ref="body"/>
      </xs:sequence>
      <xs:attribute name="keywords" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
SIMPLE ELEMENTS

• Simple elements contain only text.
• They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time.
• Example
  <xs:element name="lastname" type="xs:string"/>
  <xs:element name="age" type="xs:integer"/>
  <xs:element name="dateborn" type="xs:date"/>
SIMPLE ELEMENTS

Restrictions allow you to further constrain the
content of simple elements.
<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="120"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
ATTRIBUTES

• Attributes can be specified using the attribute element:
  <xs:attribute name="xxx" type="yyy"/>
• Attribute declarations are nested within the complex type of the element with which they are associated.
• By default, attributes are optional.
• To make an attribute mandatory, use
  <xs:attribute name="lang" type="xs:string" use="required"/>
• Attributes can have the same built-in datatypes as simple elements.
COMPLEX ELEMENTS




• Complex elements can contain other elements and can have attributes.
• Nested elements declared in an xs:sequence need to occur in the order specified.
• The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default for both is one.
• A complex element with an attribute:
<xs:element name="product">
<xs:complexType>
<xs:attribute name="prodid" type="xs:positiveInteger"/>
</xs:complexType>
</xs:element>
COMPLEX ELEMENTS

A complex element containing a sequence of
nested (simple) elements:
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
COMPLEX ELEMENTS

If you name the complex element, other elements
can reference and include it:
<xs:complexType name="persontype">
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:element name="person" type="persontype"/>
XML VS. SEMISTRUCTURED DATA
• Both are best described by a graph.
• Both are schema-less and self-describing (XML without DTD / XML Schema).
• XML is ordered, semistructured data is not.
• XML can mix text and elements:
  <talk> Making Java easier to type and easier to type
    <speaker> Phil Wadler </speaker>
  </talk>
• XML has lots of other stuff: attributes, entities, processing instructions, comments.
XML-PATH = XPATH
QUERY LANGUAGES FOR XML
• XPath is a simple query language based on describing sets of similar paths in XML documents.
• XQuery extends XPath in a style similar to SQL, introducing iterations, subqueries, etc.
• XPath and XQuery expressions are applied to an XML document and return a sequence of qualifying items.
• Items can be primitive values or nodes (elements, attributes, documents).
• The items returned do not need to be of the same type.
XPATH
• A path expression returns the sequence of all qualifying items that are reachable from the input item by following the specified path.
• A path expression is a sequence of tags or attributes, separated by special characters such as slashes ("/").
• Absolute path expressions are applied to an XML document and return all elements that are reachable from the document's root by following the specified path.
• Relative path expressions are applied to an arbitrary node.
XPATH
<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book bookID="b100"> <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year> </book>
  …
</bibliography>

Applied to the above document, the XPath expression
/bibliography/book/author returns the sequence
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author> . . .
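The same path can be tried out with the limited XPath subset in Python's `xml.etree.ElementTree`. One caveat of this sketch: ElementTree evaluates paths relative to an element, so the absolute path /bibliography/book/author is written as `book/author` relative to the `<bibliography>` root.

```python
import xml.etree.ElementTree as ET

BIB_XML = """
<bibliography>
  <book bookID="b100">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(BIB_XML)   # root is the <bibliography> element

# /bibliography/book/author, written relative to the root element
authors = root.findall("book/author")
author_names = [a.text for a in authors]

# A relative path can start from any node, here the first <book>.
first_book = root.find("book")
title = first_book.findtext("title")
```

The result of `findall` is a list of element nodes in document order, which mirrors the "sequence of qualifying items" described above.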
ATTRIBUTES

If we do not want to return the qualifying elements themselves, but the value of one of their attributes, we end the path expression with @attribute.
<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book bookID="b100"> <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year> </book>
  …
</bibliography>
Applied to the above document, the XPath expression
/bibliography/book/@bookID
returns the sequence
"b100" . . .
WILDCARDS
• We can use wildcards instead of actual tags and attributes:
  - * means any tag, and
  - @* means any attribute.
• Examples (for a document whose author elements carry attributes)
  /bibliography/*/author returns the sequence
    <author> Abiteboul </author>
    <author> Hull </author>.
  /bibliography//author/@* returns the sequence
    "IBM"
    "a739".
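A small Python sketch of attribute selection and wildcards. ElementTree's XPath subset cannot return attribute values directly, so the `@attribute` step is emulated with `.get()` and `.attrib`. The `<paper>` element in the sample document is an assumption added to make the wildcard visible.

```python
import xml.etree.ElementTree as ET

# The <paper> element is an invented addition so that * matches two tags.
BIB_XML = """
<bibliography>
  <book bookID="b100">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
  </book>
  <paper>
    <author>Vianu</author>
  </paper>
</bibliography>
"""

root = ET.fromstring(BIB_XML)

# /bibliography/book/@bookID
book_ids = [b.get("bookID") for b in root.findall("book")]

# /bibliography/*/author  (any tag between the root and author)
any_child_authors = [a.text for a in root.findall("*/author")]

# /bibliography/*/@*  (every attribute value of every child of the root)
all_child_attributes = [v for e in root.findall("*") for v in e.attrib.values()]
```

Note that `book/author` alone would miss Vianu, since that author sits under `<paper>`; the wildcard step is what picks it up.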
PATH EXPRESSIONS
Examples:
 Bib.paper
 Bib.book.publisher
 Bib.paper.author.lastname
Given an OEM instance, the value of a path
expression p is a set of objects
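A minimal Python sketch of how such a path expression could be evaluated over an OEM graph stored as labelled arcs. The arc list is a hand-copied fragment of the bibliography example; the entry node `root` and the oid `o47` of the first author node are assumptions, since they cannot be recovered from the figure.

```python
# Arcs as (source oid, label, target oid). "root" is a hypothetical entry
# node whose single arc is labelled Bib; "o47" is an assumed oid.
ARCS = [
    ("root", "Bib", "o1"),
    ("o1", "paper", "o12"),
    ("o1", "book", "o24"),
    ("o1", "paper", "o29"),
    ("o24", "publisher", "o51"),
    ("o12", "author", "o47"),
    ("o47", "firstname", "o70"),
    ("o47", "lastname", "o71"),
    ("o29", "author", "o96"),
    ("o96", "firstname", "o243"),
    ("o96", "lastname", "o206"),
]


def eval_path(arcs, start, path):
    """Return the set of oids reached from start by following the labels in
    the dot-separated path, one arc per label."""
    current = {start}
    for label in path.split("."):
        current = {t for s, l, t in arcs if s in current and l == label}
    return current


papers = eval_path(ARCS, "root", "Bib.paper")
publishers = eval_path(ARCS, "root", "Bib.book.publisher")
lastnames = eval_path(ARCS, "root", "Bib.paper.author.lastname")
```

Each step keeps a set of current nodes, so a label that occurs several times (such as paper) naturally fans out, and a path with no matching arcs evaluates to the empty set.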
PATH EXPRESSIONS
Examples:
[Figure: the OEM bibliography graph DB, rooted at &o1 (entry point Bib), with oids attached to its nodes.]
Bib.paper = {&o12, &o29}
Bib.book.publisher = {&o51}
Bib.paper.author.lastname = {&o71, &o206}
XML-QUERY = XQUERY
XQUERY
Summary: FOR-LET-WHERE-ORDERBY-RETURN = FLWOR

FOR/LET clauses
  -> list of tuples
WHERE clause
  -> list of tuples
ORDER BY/RETURN clauses
  -> instance of the XQuery data model
XQUERY
• FLWOR expressions are similar to SQL select . . . from . . . where . . . queries.
• A FLWOR expression starts with one or more for and let clauses, in any order.
• The where clause is optional.
• The order by clause is optional as well.
• Finally, there is exactly one return clause.
• XQuery is case-sensitive.
• XQuery (like XPath) is a W3C standard.
XQUERY CLAUSES

• for $x in expr
  - Defines the node variable $x.
  - The expression expr evaluates to a sequence of items.
  - The variable $x is bound to each item in turn, and the rest of the FLWOR expression is evaluated once for each binding.
• let $x := expr
  - Defines the collection variable $x.
  - The expression expr evaluates to a sequence of items.
  - The variable is bound to the entire sequence of items.
  - Useful for common subexpressions and for aggregations.
XQUERY CLAUSES

• where condition
  - The condition is a boolean expression.
  - It is evaluated once for each item bound by the preceding for clause.
  - If and only if the condition evaluates to true, the following return clause is executed for that item.
• return expression
  - The result of a FLWOR expression is a sequence of items.
  - The expression defines the result format for the current (qualifying) item.
  - The sequence of items produced by expression is appended to the sequence of items produced so far.
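The clause semantics map closely onto an ordinary loop. The Python sketch below mimics a FLWOR expression by hand: it is an analogy, not an XQuery engine, and it returns title strings where XQuery would return `<title>` elements.

```python
import xml.etree.ElementTree as ET

BIB_XML = """
<bibliography>
  <book><title>Foundations of Databases</title><year>1995</year></book>
  <book><title>Data on the Web</title><year>1999</year></book>
</bibliography>
"""
root = ET.fromstring(BIB_XML)

# let $books := /bibliography/book     (bound once, to the whole sequence)
books = root.findall("book")

# for $b in $books                     (bound to each item in turn)
# where $b/year > 1995
# return $b/title
result = []
for b in books:
    if int(b.findtext("year")) > 1995:
        result.append(b.findtext("title"))   # appended to the result so far

# Aggregation over the let-bound sequence as a whole.
book_count = len(books)
```

The difference between the two bindings shows up in `book_count`: it only makes sense on the let-style variable, which holds the entire sequence rather than one item at a time.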

INTERPRETATION AS XQUERY
• XQuery expressions can be used wherever an XML expression of any kind is permitted.
• Any text string is acceptable as content of a tag or value of an attribute.
• If a string contains an XQuery expression that should be evaluated, this substring must be surrounded by curly brackets {}.
• Example
  for $b in doc("bib.xml")/bibliography/book
  return <result id="{$b/@bookID}">{$b/title}</result>
FOR VS. LET

"Find all books" with for:
for $x in doc("bib.xml")/bib/book
return <result> {$x} </result>
Returns:
<result> <book>...</book> </result>
<result> <book>...</book> </result>
<result> <book>...</book> </result>
...

"Find all books" with let:
let $x := doc("bib.xml")/bib/book
return <result> {$x} </result>
Returns:
<result> <book>...</book>
         <book>...</book>
         <book>...</book>
         ...
</result>
XQUERY
Find all book titles published after 1995:
for $x in doc("bib.xml")/bib/book
where $x/year > 1995
return $x/title
Result:
<title> abc </title>
<title> def </title>
<title> ghi </title>
ORDERING THE QUERY RESULT
• The order by clause allows you to order the results of an XQuery expression:
  order by list of expressions
• The sort order is based on the value of the first expression. Ties are broken based on the value of the second (if necessary the third, etc.) expression.
• By default, the order is ascending.
• A descending sort order can be specified using the keyword descending.
ELIMINATION OF DUPLICATES
• The built-in function distinct-values eliminates duplicates from a sequence of result items.
• In principle, it applies only to primitive (atomic) types.
• It can also be applied to elements, but then it strips their tags and returns only their content as strings.
• Example
  If $b/title produces
    <title> aaa </title> <title> bbb </title> <title> aaa </title>
  then distinct-values($b/title) produces
    "aaa" "bbb".
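Ordering with tie-breaking and duplicate elimination can both be sketched in a few lines of Python. This is an analogy under the assumption that the books have already been extracted into dictionaries; the two books are the ones from the bibliography example.

```python
BOOKS = [
    {"title": "Foundations of Databases", "year": 1995,
     "authors": ["Abiteboul", "Hull", "Vianu"]},
    {"title": "Data on the Web", "year": 1999,
     "authors": ["Abiteboul", "Buneman", "Suciu"]},
]

# order by $b/year descending, $b/title
# The key is a tuple: the first component decides, the second breaks ties.
by_year_desc = sorted(BOOKS, key=lambda b: (-b["year"], b["title"]))
titles_newest_first = [b["title"] for b in by_year_desc]

# distinct-values(//author): atomic values with duplicates removed.
# dict.fromkeys keeps the first occurrence of each value, in order.
all_authors = [a for b in BOOKS for a in b["authors"]]
distinct_authors = list(dict.fromkeys(all_authors))
```

Like distinct-values, the deduplication works on plain strings, not on element nodes, so any markup around the names is already gone at that point.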

XQUERY
For each author of a book published by Morgan Kaufmann, list all books she published:
for $a in distinct-values(doc("bib.xml")/bib/book[publisher = "Morgan Kaufmann"]/author)
return <result>
         <author>{ $a }</author>
         { for $t in doc("bib.xml")/bib/book[author = $a]/title
           return $t }
       </result>
distinct-values = a function that eliminates duplicates
Result:
<result>
  <author> Jones </author>
  <title> abc </title>
  <title> def </title>
</result>
<result>
  <author> Smith </author>
  <title> ghi </title>
</result>
JOINS





• We can join two or more documents by using one variable for each of the documents.
• We let each variable range over the elements of the corresponding document, within a for clause.
• We need to be careful when comparing elements for equality: comparing by element identity (the is operator) is not the same as comparing by element content.
• Typically, we want to compare the element content.
• The built-in function data(E) returns the content of an element E.
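A hedged Python sketch of such a join. The second document (`PRICES`) and its structure are invented for illustration, and `data` is a rough stand-in for the XQuery function of the same name, returning trimmed text content.

```python
import xml.etree.ElementTree as ET

BIB = ET.fromstring("""
<bibliography>
  <book><title>Foundations of Databases</title><year>1995</year></book>
  <book><title>Data on the Web</title><year>1999</year></book>
</bibliography>""")

# Hypothetical second document to join with.
PRICES = ET.fromstring("""
<prices>
  <entry><title>Foundations of Databases</title><price>55</price></entry>
</prices>""")


def data(element):
    """Rough stand-in for data(E): the element's text content, trimmed."""
    return (element.text or "").strip()


# One variable per document, joined on the content of <title>.
joined = [
    (data(b.find("title")), int(data(e.find("price"))))
    for b in BIB.findall("book")
    for e in PRICES.findall("entry")
    if data(b.find("title")) == data(e.find("title"))
]

# The two <title> elements have equal content but are distinct nodes.
t1 = BIB.findall("book")[0].find("title")
t2 = PRICES.find("entry/title")
same_content = data(t1) == data(t2)
same_node = t1 is t2
```

Joining on node identity would return nothing here, since no node belongs to both documents; joining on content finds the matching book.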
XQUERY
Find books whose price is larger than the average:
let $a := avg(doc("bib.xml")/bib/book/price)
for $b in doc("bib.xml")/bib/book
where $b/price > $a
return $b
SORTING IN XQUERY
<publisher_list>
{ for $p in distinct-values(doc("bib.xml")//publisher)
  order by $p
  return <publisher>
           <name>{ $p }</name>
           { for $b in doc("bib.xml")//book[publisher = $p]
             order by $b/price descending
             return <book>{ $b/title, $b/price }</book> }
         </publisher> }
</publisher_list>
IF-THEN-ELSE
for $h in //holding
order by $h/title
return <holding>
         { $h/title,
           if ($h/@type = "Journal")
           then $h/editor
           else $h/author }
       </holding>
EXISTENTIAL QUANTIFIERS
for $b in //book
where some $p in $b//para satisfies
      (contains($p, "sailing") and contains($p, "windsurfing"))
return $b/title
QUANTIFICATION




• XQuery supports the existential and the universal quantifier.
• Universal quantifier:
  every $v in expression1 satisfies expression2
• Existential quantifier:
  some $v in expression1 satisfies expression2
• expression1 evaluates to a sequence of items; expression2 is a boolean expression.
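The two quantifiers correspond to `any` and `all` in Python. A minimal sketch with invented book titles and paragraphs, assuming the paragraphs of each book are already available as strings:

```python
# Invented sample data: book title -> list of paragraph texts.
BOOKS = {
    "Book A": ["sailing and windsurfing tips", "harbour guide"],
    "Book B": ["sailing basics", "sailing knots"],
}


def mentions_both(paragraph):
    return "sailing" in paragraph and "windsurfing" in paragraph


# some $p in $b//para satisfies (contains sailing and contains windsurfing)
some_para_mentions_both = [
    title for title, paras in BOOKS.items()
    if any(mentions_both(p) for p in paras)
]

# every $p in $b//para satisfies contains($p, "sailing")
every_para_mentions_sailing = [
    title for title, paras in BOOKS.items()
    if all("sailing" in p for p in paras)
]
```

As in XQuery, the existential test needs a single paragraph that satisfies the whole condition, so a book that mentions the two terms in different paragraphs does not qualify.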
AGGREGATION

• XQuery provides built-in functions for the standard aggregations, such as sum, min, count and avg (function names are lowercase, since XQuery is case-sensitive).
• They can be applied to the result of any XQuery expression, i.e. to any sequence of items.
• Example
  avg(doc("bib.xml")/bibliography/book/price)
  count(doc("bib.xml")/bibliography/book/price)
  These compute the average book price and the number of book prices (one per book, if every book has a price), respectively.
XQUERY EXAMPLES

Find books whose price is larger than the average
price.
let $a:=avg(doc("bib.xml")/bibliography/book/price)
for $b in doc("bib.xml")/bibliography/book
where $b/price > $a
return $b

Uses aggregate operator (avg), applied to the result of
a path expression.
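For comparison, the same aggregation and the above-average query can be sketched in Python. The titles and prices in the sample document are invented for illustration.

```python
from statistics import mean
import xml.etree.ElementTree as ET

# Titles and prices are invented for illustration.
root = ET.fromstring("""
<bibliography>
  <book><title>A</title><price>30</price></book>
  <book><title>B</title><price>50</price></book>
  <book><title>C</title><price>70</price></book>
</bibliography>""")

# The path expression yields a sequence; the aggregates consume all of it.
prices = [float(p.text) for p in root.findall("book/price")]
average_price = mean(prices)     # avg(...)
price_count = len(prices)        # count(...)

# let $a := avg(...) for $b in ... where $b/price > $a return $b
above_average = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > average_price
]
```

The average is computed once, outside the loop, which is exactly the role of the let clause in the query.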
XQUERY EXAMPLES

Find the titles of books with a paragraph containing both terms, "sailing" and "windsurfing".
for $b in doc("bib.xml")//book
where some $p in $b//para satisfies
contains($p, "sailing") and contains($p, "windsurfing")
return $b/title

Uses existential quantifier (some) and string
matching (contains).