PPT

advertisement
C20.0046: Database
Management Systems
Lecture #25
M.P. Johnson
Stern School of Business, NYU
Spring, 2005
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
1
Agenda

Querying XML

Data Warehousing

Next week:



Data Mining
Websearch
Etc.
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
2
Goals after today:
1.
Be aware of some of the important XML
standards
2.
Know how to write some DW queries in
Oracle
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
3
XML: Semi-structured data

Not too random



Data organized into entities
Similar/related grouped to form other entities
Not too structured


Some attributes may be missing
Size of attributes may vary


Support of lists/sets
Juuust Right

Data is self-describing
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
4
<movieinfo>
<movie id="o111">
<title>Lost in Translation</title>
<year>2003</year>
<stars idref="o333 o444"/>
</movie>
<movie id="o222">
<title>Hamlet</title>
<year>1999</year>
<stars idref="o333"/>
</movie>
<person id="o111">
<name>Bill Murray</name>
<movies idref="o111 o222"/>
</person>
</movieinfo>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
5
New topic: Querying XML

XPath



XQuery


Simple protocol for accessing node
Will use in XQuery and conversion from relations
SQL : relations :: XQuery : XML
XSLT


sophisticated transformations
Sometimes for presentation
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
6
XQuery

Queries are FLWR expressions

Based on Quilt and XML-QL
FOR/LET...
WHERE...
RETURN...
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan
Kaufmann" AND $b/year = "1998"
RETURN $b/title
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
7
XQuery

Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN { $x/title }
Result:
<title>abc</title>
<title>def</title>
<title>ghi</title>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
8
SQL v. XQuery
Product(pid, name, maker)
Find all products made in NYC
Company(cid, name, city)
SELECT x.name
FROM Product x, Company y
WHERE x.maker=y.cid and
y.city="NYC"
XQuery
SQL
FOR $r in document("db.xml")/db,
$x in $r/Product/row,
$y in $r/Company/row
WHERE $x/maker/text()=$y/cid/text() and
$y/city/text() = "NYC"
RETURN { $x/name }
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
9
SQL v. XQuery
For each company with revenues < 1M count the products over $100
SELECT y.name, count(*)
FROM Product x, Company y
WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000
GROUP BY y.cid, y.name
FOR $r in document("db.xml")/db,
$y in $r/Company/row[revenue/text()<1000000]
RETURN
<Company>
<companyName>{ $y/name/text() }</companyName>
<numberOfExpensiveProducts> { count(
$r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) }
</numberOfExpensiveProducts>
</Company>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
10
XSLT: XSL Transformations

Converts XML docs to other XML docs


Or to HTML, PDF, etc.
E.g.: Have data in XML, want to display to all users



Users view web with IE, Firefox, Treo…
Have XSLT convert to HTML that looks good on each
XSLT processor takes XML doc and XSL template for view
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
11
XSLT v. XQuery
<xsl:transform
version="1.0"
 FLWR expressions:
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 Often much
simpler than XSLT
<xsl:template
match="/">
<xsl:for-each select="document('bib.xml')//book">
FOR<xsl:if
$b IN test="publisher='Morgan
document("bib.xml")//book
Kaufmann' and
WHERE $b/publisher
= "Morgan Kaufmann" AND
year='1998'">
<xsl:copy-of
$b/year = select="title"/>
"1998“
</xsl:if>
RETURN
$b/title
</xsl:for-each>
</xsl:template>
</xsl:transform>

XSLT v. XQuery:

http://www.xmlportfolio.com/xquery.html
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
12
Displaying XML with XSL/XSLT

XSL: style sheet language for XML


Menu in XML:


http://www.w3schools.com/xml/simple.xsl
XSL applied to the XML:


http://www.w3schools.com/xml/simple.xml
XSL file for displaying it:


XSL : XML :: CSS : HTML
http://www.w3schools.com/xml/simplexsl.xml
More info on Java with XSLT and XPath:

http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
13
From XML to relations (Oracle)

To move single values from XML to tables, can
simply use extractvalue in UPDATE statements:
SQL> UPDATE purchase_order
SET order_nbr = 7101,
customer_po_nbr =
extractvalue(purchase_order_doc,
'/purchase_order/po_number'),
customer_inception_date =
to_date(extractvalue(purchase_order_doc,
'/purchase_order/po_date'), 'yyyy-mm-dd');
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
14
From relations to XML (Oracle)


Saw how to put XML in a table
Conversely, can convert ordinary rel data to XML


XMLElement() generates an XML node
Now can call XMLElement ftn to wrap vals in tags:
SELECT XMLElement("supplier_id", s.supplier_id) ||
XMLElement("name", s.name) xml_fragment
FROM supplier s;

And can build it up recursively:
SELECT XMLElement("supplier",
XMLElement("supplier_id", s.supplier_id),
XMLElement("name", s.name))
FROM supplier s;
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
15
Why XML matters

Hugely popular


To past few years what Java was to mid-90s
Buzzword-compliant

XML databases won’t likely replace RDBMSs
(remember OODBMSs?), but:

Allows for comm. between DBMSs disparate
architectures, tools, languages, etc.


Basis for Web Services
DBMS vendors are adding XML support

MS, Oracle, et al.
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
16
For more info





APIs: SAX, JAXP
Editors: XML Spy, MS XML Notepad:
http://www.webattack.com/get/xmlnotepad.shtml
Parsers: Saxon, Xalan, MS XML Parser
Lectures drew on resources from:
Nine-week course on XML:


W3C XML Tutorial:


http://www.cs.rpi.edu/~puninj/XMLJ/classes.html
http://www.w3schools.com/xml/default.asp
http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
17
Recent XML news/etc.

Group at Sun planning “binary XML”


XML is “simple and sloppy”


http://www.adambosworth.net/archives/000031.html
RDF: Resource Definition Framework




http://developers.slashdot.org/article.pl?sid=05/01/14/1650206&tid=156
Metadata for the web  “Semantic web”
Content, authors, relations to other content
http://www.w3.org/DesignIssues/RDFnot.html
Web + XML = the “global mind”

http://novaspivack.typepad.com/nova_spivacks_weblog/2004/06/minding_the_pla.html
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
18
New topic: Data Warehousing

Physical warehouse: stores different kinds of items





combined from different sources in supply chain
access items as a combined package
“Synergy”
DW is the sys containing the data from many DBs
OLAP is the system for easily querying the DW


Online analytical processing
front-end to DW & stats
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
19
Integrating Data

Ad hoc combination of DBs from different sources
can be problematic

Data may be spread across many systems



geographically
by division
different systems from before mergers…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
20
Conversion/scrubbing/merging

Lots of issues…

different types of data


Different values for data


‘GREEN’/’GR/’2
Semantic differences


Varchar(255) v. char(30)
Cars v. Automobiles
Missing values

Handle with nulls or XML
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
21
Federated DBs

Situ: n different DBs must work together

One idea: write programs for each to talk to each
other one


How many programs required?
Like ambassadors for each country
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
22
Federated DBs

Better idea: introduce another DB


Now how many programs?


write programs for it to talk to each other DB
English in business, French in diplomacy
 Warehousing

Refreshed nightly
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
23
OLTP v. OLAP

DWs usually not updated in real-time




data is usually not live
but care about higher-level, longer-term patterns
For “knowledge workers”/decision-makers
Live data is in system used by OLTP



online transaction processing
E.g., airline reservations
OLTP data loaded into DW periodically, say nightly
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
24
Utilizing Data

Situ: each time manager has hunch




 requests custom reports
 direct programmers to write/modify SQL app to produce
these results
on higher or lower levels, for different specifics
Problem: too difficult/expensive/slow

too great a time lag
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
25
EISs

Could just write queries at command-prompt

But decision makes aren’t (all) SQL programmers

Soln: create an executive information system




provides friendly front-end to common, important queries
basically a simple DB front-end
your project part 5
GROUP BY queries are particularly applicable…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
26
EISs v. OLAP

Okay for fixed set of queries
But what if queries are open-ended?

Q: What’s driving sales in the Northeast?




What’s the source cause?
Result from one query influences next query tried
OLAP systems are interactive:




run query
analyze results
think of new query
repeat
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
27
Star Schemas

Popular schema for DW data

One central DB surrounded by specific DBs

Center: fact table

Extremities: data tables

Fields in fact table are foreign keys to data tables

Normalization  Snowflake Schema

May not be worthwhile…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
28
Dates and star schemas

OLAP behaves as though you had a Days table,
with every possible row



Dates(day, week, month, year, DID)
(5, 27, 7, 2000)
Can join on Days like any other table
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
29
Dates and star schemas

E.g.: products x salesperson x region x date


Regular dim tables:




Products sold by salespeople in regions on dates
Product(PID, name, color)
Emp(name, SSN, sal)
Region(name, RID)
Fact table:


Sales(PID, DID, SSN, RID)
Interpret as a cube (cross product of all dimensions)

Can have both data and stats
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
30
Drill-down & roll-up


Imagine: notice some region’s sales way up
Why? Good salesperson? Some popular product
there?

Maybe need to search by month, or month and
product, abstract back up to just product…

“slicing & dicing”
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
31
OLAP and data warehousing

Could write GROUP BY queries for each

OLAP systems provide simpler, non-SQL interface
for this sort of thing

Vendors: MicroStrategy, SAP, etc.

Otoh: DW-style operators have been added to SQL
and some DBMSs…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
32
DW extensions in SQL: ROLLUP (Oracle)

Suppose have orders table (from two years), with
region and date info:
SQL> column month format a10
SQL> @mosql2_data
SQL> describe all_orders;

Can select total sales:
SELECT sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id;


Examples derived/from Mastering Oracle SQL, 2e (O’Reilly)
Get data here: http://examples.oreilly.com/mastorasql2/mosql2_data.sql
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
33
DW extensions in SQL: ROLLUP (Oracle)

Can write GROUP BY queries for year or region or
both:
SELECT r.name region, o.year, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY (r.name, o.year);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
34
DW extensions in SQL: ROLLUP (Oracle)

ROLLUP operator




Extension of GROUP BY
Does GROUP BY on several levels, simultaneously
Order matters
Get sales totals for each region/year pair each region,
and the grand total:
SELECT r.name region, o.year, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY ROLLUP (r.name, o.year);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
35
DW extensions in SQL: ROLLUP (Oracle)


Change the order of the group fields to get a
different sequence of groups
To get totals for each year/region pair, each year,
and the grand total, and just reverse group-by order:
SELECT o.year, r.name region, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY ROLLUP (o.year, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
36
DW extensions in SQL: ROLLUP (Oracle)

Adding more dimensions, like month, is easy (apart
from formatting):
SELECT o.year,
to_char(to_date(o.month, 'MM'),'Month') month,
r.name region, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY ROLLUP (o.year, o.month, r.name);

NB: summing happens on each level
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
37
DW extensions in SQL: ROLLUP (Oracle)

If desired, can combine fields for the sake of
grouping:
SELECT o.year,
to_char(to_date(o.month, 'MM'),'Month') month,
r.name region, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY ROLLUP ((o.year, o.month), r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
38
DW extensions in SQL: CUBE (Oracle)

Another GROUP BY extension: CUBE




Subtotals all possible combins of group-by fields (powerset)
Syntax: “ROLLUP”  “CUBE”
Order of fields doesn’t matter (apart from ordering)
To get subtotals for each region/month pair, each region, each
month, and the grand total:
SELECT
to_char(to_date(o.month, 'MM'),'Month') month,
r.name region,
sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY CUBE (o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
39
DW extensions in SQL: CUBE (Oracle)

Again, can easily add more dimensions:
SELECT o.year,
to_char(to_date(o.month, 'MM'),'Month') month,
r.name region, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY CUBE (o.year, o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
40
DW SQL exts: GROUPING SETS (Oracle)


That’s a lot of rows
Instead of a cube of all combinations, maybe we just
want the totals for each individual field:
SELECT o.year,
to_char(to_date(o.month, 'MM'),'Month') month,
r.name region, sum(o.tot_sales)
FROM all_orders o join region r
ON r.region_id = o.region_id
GROUP BY GROUPING SETS (o.year, o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
41
Next time

Overview of data mining

Some other odds & ends…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
42
Download