PPT - NYU Stern School of Business

C20.0046: Database
Management Systems
Lecture #24
Matthew P. Johnson
Stern School of Business, NYU
Spring, 2004
M.P. Johnson, DBMS, Stern/NYU, Sp2004
Agenda


Previously: XML
Next:
  Finish XML & related technologies
  Hardware
  Indices
  Hw3 up soon
  1-minute responses
  Grading
XML Applications/dialects

Copy from: http://pages.stern.nyu.edu/~mjohnson/dbms/eg/xml.txt

MathML: Mathematical Markup Language
  http://wwwasdoc.web.cern.ch/wwwasdoc/WWW/publications/ictp99/ictp99N8059.html
ChemML: Chemical Markup Language
X4ML: XML for Merrill Lynch
XHTML: HTML retrofitted as an XML application
  Validation: http://pages.stern.nyu.edu/~mjohnson/dbms/
XML Applications/dialects

VoiceXML:



http://newmedia.purchase.edu/~Jeanine/interfaces/rps.xml
AT&T Directory Assistance
http://phone.yahoo.com/
Image from http://www.voicexml.org/tutorials/intro2.html
More XML Apps

FIXML
  XML equiv. of FIX: Financial Information eXchange
swiftML
  XML equiv. of SWIFT: Society for Worldwide Interbank Financial Telecommunications message format
Apache’s Ant
  Scripting language for Java build management
  http://ant.apache.org/manual/using.html
Many more:
  http://www-106.ibm.com/developerworks/xml/library/x-stand4/
More XML Applications/Protocols

RSS: Rich Site Summary/Really Simple Syndication
  http://slate.msn.com/rss/
  http://slashdot.org/index.rss
  Screenshot: http://paulboutin.weblogger.com/pictures/viewer$673
  More info: http://slate.msn.com/id/2096660/
<channel>
  <title>my channel</title>
  <item>
    <title>story 1</title>
    <link>…</link>
  </item>
  <!-- other items -->
</channel>
More XML Applications/Protocols

SOAP: Simple Object Access Protocol
XML-based messaging format
  Used by Google API: http://www.google.com/apis/
  Amazon API: http://amazon.com/gp/aws/landing.html
  Amazon light: http://kokogiak.com/amazon/
  Other examples: http://www.wired.com/wired/archive/12.03/google.html?pg=10&topic=&topic_set=
Example:
  SOAP envelope with header and body
  Request sales tax for total
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
  <SOAP:Header></SOAP:Header>
  <SOAP:Body>
    <GetSalesTax>
      <SalesTotal>100</SalesTotal>
    </GetSalesTax>
  </SOAP:Body>
</SOAP:Envelope>
More XML Applications/Protocols
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<gs:doGoogleSearch xmlns:gs="urn:GoogleSearch">
<key>%(key)s</key>
<start>0</start>
<maxResults>10</maxResults>
<filter>true</filter>
<restrict/>
<safeSearch>false</safeSearch>
<lr/>
</gs:doGoogleSearch>
</soap:Body>
</soap:Envelope>
RDF

RDF: Resource Description Framework





Describe info on web
Metadata for the web
Content, authors, relations to other content
“Semantic web”
See http://www.w3.org/DesignIssues/RDFnot.html
New topic: Querying XML

XPath


Simple protocol for accessing node
Won’t discuss

XQuery: SQL of XML

XSLT: sophisticated transformations
XQuery

XQuery: FLWR expressions

Based on Quilt and XML-QL
FOR/LET...
WHERE...
RETURN...
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
RETURN $b/title
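For readers more at home in a general-purpose language, the FOR/WHERE/RETURN query above can be mimicked in Python. This is only a sketch under stated assumptions: the inline `BIB_XML` document and the `flwr_titles` helper are invented for illustration, and the standard library's ElementTree is not an XQuery engine, so the loop spells out each clause by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; element names follow the query above.
BIB_XML = """
<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2001</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>
"""


def flwr_titles(xml_text, publisher, year):
    """Return the title text of every book matching publisher and year."""
    root = ET.fromstring(xml_text)
    titles = []
    for b in root.iter("book"):                       # FOR $b IN ...//book
        if (b.findtext("publisher") == publisher      # WHERE $b/publisher = ...
                and b.findtext("year") == year):      #   AND $b/year = ...
            titles.append(b.findtext("title"))        # RETURN (text of) $b/title
    return titles


print(flwr_titles(BIB_XML, "Morgan Kaufmann", "1998"))  # ['Book A']
```

Unlike the XQuery version, this returns the title text rather than the `<title>` elements themselves.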
XQuery
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN { $x/title }
Result:
<title> abc </title>
<title> def </title>
<title> ghi </title>
SQL and XQuery Side-by-side
Find all products made in Seattle
Product(pid, name, maker)
Company(cid, name, city)
SQL:
SELECT x.name
FROM Product x, Company y
WHERE x.maker=y.cid
  and y.city='Seattle'
XQuery:
FOR $r in document("db.xml")/db,
    $x in $r/Product/row,
    $y in $r/Company/row
WHERE $x/maker/text()=$y/cid/text()
  and $y/city/text() = "Seattle"
RETURN { $x/name }
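The SQL side can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the rows (Acme, Globex, Widget, and so on) and the `products_made_in` helper are made up for illustration, and an `ORDER BY` is added purely to make the output order predictable.

```python
import sqlite3


def products_made_in(city):
    """Names of products whose maker is located in the given city."""
    conn = sqlite3.connect(":memory:")
    # Schema from the slide; the sample rows are invented.
    conn.executescript("""
        CREATE TABLE Company(cid INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE Product(pid INTEGER PRIMARY KEY, name TEXT, maker INTEGER);
        INSERT INTO Company VALUES (1, 'Acme', 'Seattle');
        INSERT INTO Company VALUES (2, 'Globex', 'Boston');
        INSERT INTO Product VALUES (10, 'Widget', 1);
        INSERT INTO Product VALUES (11, 'Gadget', 2);
        INSERT INTO Product VALUES (12, 'Gizmo', 1);
    """)
    rows = conn.execute(
        "SELECT x.name FROM Product x, Company y "
        "WHERE x.maker = y.cid AND y.city = ? "
        "ORDER BY x.name",
        (city,),                      # parameter binding instead of string pasting
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(products_made_in("Seattle"))  # ['Gizmo', 'Widget']
```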
SQL and XQuery Side-by-side
For each company with revenues < 1M count the products over $100
SQL:
SELECT y.name, count(*)
FROM Product x, Company y
WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000
GROUP BY y.cid, y.name
XQuery:
FOR $r in document("db.xml")/db,
    $y in $r/Company/row[revenue/text()<1000000]
RETURN
  <proudCompany>
    <companyName> { $y/name/text() } </companyName>
    <numberOfExpensiveProducts>
      { count($r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) }
    </numberOfExpensiveProducts>
  </proudCompany>
XSLT: XSL Transformations
Converts XML docs to other XML docs
  Or to HTML, PDF, etc.
E.g.: Have data in XML, want to display to all users
  Users view web with IE, Netscape, Palm…
  Have XSLT convert to HTML that looks good on each
XSLT processor takes XML doc and XSL template for view
Querying XML with XQuery
FLWR expressions:
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
RETURN $b/title
Often much simpler than XSLT:
<xsl:transform version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="document('bib.xml')//book">
      <xsl:if test="publisher='Morgan Kaufmann' and year='1998'">
        <xsl:copy-of select="title"/>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>
XSLT v. XQuery:
  http://www.xmlportfolio.com/xquery.html
Displaying XML with XSL/XSLT

XSL: style sheet language for XML
  As CSS is for HTML
Menu in XML:
  http://www.w3schools.com/xml/simple.xml
XSL file for displaying it:
  http://www.w3schools.com/xml/simple.xsl
XSL applied to the XML:
  http://www.w3schools.com/xml/simplexsl.xml
More info on Java with XSLT and XPath:
  http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
Why XML matters

Hugely popular
  To millennium what Java was to mid-90s
  Buzzword compliant
XML databases won’t likely replace RDBMSs (remember OODBMSs?), but:
  Allows for comm. between DBMSs with disparate architectures, tools, languages, etc.
  Basis for Web Services
DBMS vendors are adding XML support
  MS, Oracle, et al.
For more info





APIs: SAX, JAXP
Editors: XML Spy, MS XML Notepad:
  http://www.webattack.com/get/xmlnotepad.shtml
Parsers: Saxon, Xalan, MS XML Parser
Lecture drew on resources from:
  Nine-week course on XML:
    http://www.cs.rpi.edu/~puninj/XMLJ/classes.html
  W3Schools XML Tutorial:
    http://www.w3schools.com/xml/default.asp
  http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt

Next topic: Hardware



Types of memory
Disks
Mergesort/TPMMS
What should a DBMS do?

Store large amounts of data
Process queries efficiently
Allow multiple users to access the database
concurrently and safely.
Provide durability of the data.

How will we do all this?



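Let’s get physical
[Figure: DBMS architecture. The user/application sends queries and updates to the query compiler/optimizer, which hands a query execution plan to the execution engine. The execution engine issues record and index requests to the index/record manager, which issues page commands to the buffer manager; the buffer manager reads and writes pages through the storage manager to storage. Transaction commands go to the transaction manager (concurrency control, logging/recovery).]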
Types of memory
Cache:
  • access time 10 ns
Main Memory:
  • Volatile
  • limited address spaces
  • expensive
  • average access time: 10-100 ns
Disk:
  • 5-10 MB/S transmission rates
  • 100s GB storage
  • average time to access a block: 10-15 msecs.
  • Need to consider seek, rotation, transfer times.
  • Keep records “close” to each other.
Tape:
  • 1.5 MB/S transfer rate
  • 280 GB typical capacity
  • Only sequential access
  • Not for operational data
Main Memory



Fastest, most expensive
Today: O(1 GB) are common on PCs
Some databases could fit in memory

New industry trend: Main Memory Database

But many cannot

RAM is volatile and small

Still need to store on disk
Secondary Storage




Disks
Slower, cheaper than main memory
Persistent!
Used with a main memory buffer
$200 worth of disk space
Buffer Management in a DBMS
[Figure: page requests from higher levels arrive at the buffer pool in main memory, whose frames either hold disk pages or are free; pages are read from and written to the DB on disk; the choice of frame is dictated by the replacement policy.]
Data must be in RAM for DBMS to operate on it!
Table of <frame#, pageid> pairs is maintained.
LRU is not always good.
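To make the replacement-policy point concrete, here is a toy buffer pool with LRU eviction. It is a sketch under heavy simplifications: the `BufferPool` class and its `read_page` callback are hypothetical, and there are no pin counts, dirty bits, or write-back, all of which a real buffer manager needs.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: maps page ids to in-memory frames, evicting by LRU."""

    def __init__(self, num_frames, read_page):
        self.num_frames = num_frames
        self.read_page = read_page      # hypothetical callback: page_id -> contents
        self.frames = OrderedDict()     # page_id -> contents, least recent first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)    # hit: mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:
            self.frames.popitem(last=False)     # miss with full pool: evict LRU page
        self.disk_reads += 1
        self.frames[page_id] = self.read_page(page_id)
        return self.frames[page_id]


# Scan 4 pages twice with only 3 frames: LRU always evicts the page
# that the scan will need next, so every single request goes to disk.
pool = BufferPool(3, lambda pid: f"page-{pid}")
for _ in range(2):
    for pid in range(4):
        pool.get_page(pid)
print(pool.disk_reads)  # 8
```

The repeated scan is one case where LRU is a poor choice: a policy that kept any three of the four pages resident would have saved reads on the second scan.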

Buffer Manager

Why not just use the OS?
  DBMS may be able to anticipate access patterns
  Hence, may also be able to perform prefetching
  DBMS needs the ability to force pages to disk.


Tertiary Storage

CDs, DVDs, jukeboxes
  ROM
Tapes, tape silos
  sequential access
  Big but very slow
    long term archiving only
The Mechanics of Disk

Mechanical characteristics:
  Rotation speed (5400RPM)
  Number of platters (1-30)
  Number of tracks (<=10000)
  Number of bytes/track (10^5)
[Figure: disk anatomy, with labels for the spindle, platters, tracks, sectors, cylinder, disk head, arm assembly, and arm movement.]
Disk Access Characteristics

Disk latency = time between when command is issued and when data is in memory
Disk latency = seek time + rotational latency
  Seek time = time for the head to reach cylinder
    10ms – 40ms
  Rotational latency = time for the sector to rotate
    Rotation time = 10ms
    Average latency = 10ms/2
Transfer rate = typically 40MB/s
Disks read/write one block at a time (typically 4kB)
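Plugging the figures above into the latency formula shows how little of the time goes to the transfer itself. This is a back-of-the-envelope sketch: the function name is made up, 4kB is taken as 4,000 bytes, and 40MB/s as 40,000,000 bytes per second.

```python
def block_access_ms(seek_ms, rotation_ms, block_bytes, transfer_bytes_per_s):
    """Estimated time in ms to read one block: seek + half a rotation + transfer."""
    rotational_latency_ms = rotation_ms / 2                   # on average, half a turn
    transfer_ms = block_bytes / transfer_bytes_per_s * 1000   # seconds -> ms
    return seek_ms + rotational_latency_ms + transfer_ms


best = block_access_ms(10, 10, 4_000, 40_000_000)    # shortest seek from the slide
worst = block_access_ms(40, 10, 4_000, 40_000_000)   # longest seek from the slide
print(f"{best:.1f} ms to {worst:.1f} ms per block")  # 15.1 ms to 45.1 ms per block
```

The transfer accounts for only about 0.1 ms of each access; seek and rotation dominate, which is why keeping related records close together on disk pays off.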
A little CS…

In main memory: CPU time
  Big O notation!
In databases time is dominated by I/O cost
  Big O too, but for I/O’s
  Often big O becomes a constant
The I/O Model of Computation
Consequence: need to redesign certain algorithms
Mergesort

Alg

E.g.

Complexity
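The slide leaves the algorithm, example, and complexity to the lecture. As a reminder, here is a minimal in-memory mergesort in Python; it is a textbook sketch, not necessarily the version presented in class.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:      # <= keeps equal items in input order (stable)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])          # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


def mergesort(items):
    """Return a sorted copy of items."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(mergesort(items[:mid]), mergesort(items[mid:]))


print(mergesort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

Each level of recursion merges n items in total and there are about log2(n) levels, giving O(n log n) comparisons.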
Sorting


Problem: sort 1 GB of data with 1MB of RAM.
Where we need this:





Data requested in sorted order (ORDER BY)
Needed for grouping operations
First step in sort-merge join algorithm
Duplicate removal
Bulk loading of B+-tree indexes.
Two-Way Merge-sort



Requires 3 Buffers in RAM
Pass 1: Read a page, sort it, write it.
Pass 2, 3, …, etc.: merge two runs, write them
  Runs of length L become runs of length 2L
[Figure: runs on disk stream through two input buffers (INPUT 1, INPUT 2) and one OUTPUT buffer in main memory, and the merged run is written back to disk.]
Two-Way External Merge Sort

Assume block size is B = 4KB
Step 1 → runs of length L = 4KB
Step 2 → runs of length L = 8KB
Step 3 → runs of length L = 16KB = 2^(3-1) * 4KB
…
Step 9 → runs of length L = 1MB
…
Step 19 → runs of length L = 1GB (why?)
Need 19 iterations over the disk data to sort 1GB
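The count of 19 can be checked with a few lines of Python. This is a sketch: the function name is invented, and binary units (1GB = 2^30 bytes, 4KB = 2^12 bytes) are assumed so that the run lengths double exactly as on the slide.

```python
def two_way_passes(data_bytes, initial_run_bytes):
    """Passes over the data for two-way merge sort: one pass to create the
    initial runs, then one merge pass per doubling of the run length."""
    runs = -(-data_bytes // initial_run_bytes)   # ceiling division
    merge_passes = (runs - 1).bit_length()       # ceil(log2(runs)) for runs >= 1
    return 1 + merge_passes


print(two_way_passes(2**30, 2**12))  # 19: 2^18 initial runs need 18 merge passes
```

Step k produces runs of length 2^(k-1) * 4KB, so reaching 1GB = 2^18 * 4KB takes k = 19.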
Can we do better?
Large Two-Way External Merge Sort

We've got a meg!


Divide RAM into thirds
Read, write in blocks of 333KB

How much improvement?
Can we do better?
Cost Model for Our Analysis




B: Block size ( = 4KB)
M: Size of main memory ( = 1MB)
N: Number of records in the file
R: Size of one record
External Merge-Sort

Phase one: load M bytes in memory, sort
  Result: SIZE/M lists of length M bytes (1MB)
[Figure: data is read from disk into the M bytes of main memory (M/R records at a time), sorted, and written back to disk.]
Phase Two

Merge M/B – 1 lists into a new list
  M/B – 1 = 1MB/4KB – 1 = 249, or roughly 250
Result: lists of size M * (M/B – 1) bytes
  249 * 1MB ~= 250 MB
[Figure: input buffers labeled Input 1, Input 2, ..., Input M/B and an Output buffer share the M bytes of main memory; sorted lists stream in from disk and the merged list is written back to disk.]
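The first two phases can be sketched in Python as follows. This is an illustration under stated assumptions, not a DBMS-grade implementation: `external_sort` is a made-up name, memory is measured in number of values rather than bytes, runs are stored as text files of integers, and all runs are merged in a single step (which assumes there are at most M/B – 1 of them).

```python
import heapq
import os
import tempfile


def external_sort(values, memory_limit):
    """Yield the integers in `values` in sorted order, holding at most
    `memory_limit` of them in RAM while building each sorted run."""
    run_paths = []
    chunk = []

    def flush():
        # Phase one: sort what fits in memory and write it out as one run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= memory_limit:
            flush()
    if chunk:
        flush()

    # Phase two: stream all runs through one multiway merge.
    files = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*(map(int, f) for f in files))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


print(list(external_sort([5, 3, 9, 1, 7, 2, 8], memory_limit=3)))
# [1, 2, 3, 5, 7, 8, 9]
```

`heapq.merge` reads its inputs lazily, so only one pending value per run is held in memory during the merge, mirroring the one-buffer-per-input layout in the figure.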
Phase Three


Merge M/B – 1 lists into a new list
Result: lists of size M * (M/B – 1)^2 bytes
  249 * 250 MB ~= 62,500 MB = 62.5 GB
[Figure: input buffers labeled Input 1, Input 2, ..., Input M/B and an Output buffer share the M bytes of main memory; sorted lists stream in from disk and the merged list is written back to disk.]
Cost of External Merge Sort

Number of passes: $1 + \lceil \log_{M/B-1} \lceil \mathrm{Size}/M \rceil \rceil$
How much data can we sort with 1MB RAM?
  1 pass → 1MB
  2 passes → 250MB (M/B = 250)
  3 passes → 62.5GB
Time:
  assume read/write block ~ 10 ms = .01 s
  each pass: read, write all data
  each pass: 2 * 62.5GB/4KB * .01s = 2 * 156,250s ≈ 2 * 2,604 min ≈ 2 * 43.4 h ≈ 2 * 1.8 days ≈ 3.6 days
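The capacity and timing arithmetic can be reproduced with a short script. This is a sketch with invented function names, assuming decimal units (1MB = 10^6 bytes, 4KB = 4,000 bytes) and the 10 ms per block I/O cost stated above; it uses the exact fan-in M/B – 1 = 249 rather than the rounded 250.

```python
def ceil_div(a, b):
    return -(-a // b)


def fan_in(memory_bytes, block_bytes):
    """Number of lists merged at once: one buffer is reserved for output."""
    return memory_bytes // block_bytes - 1


def max_sortable_bytes(memory_bytes, block_bytes, passes):
    """Largest input that can be sorted in the given number of passes."""
    return memory_bytes * fan_in(memory_bytes, block_bytes) ** (passes - 1)


def passes_needed(size_bytes, memory_bytes, block_bytes):
    """1 + ceil(log_{M/B-1}(ceil(Size/M))), computed with integer arithmetic."""
    k = fan_in(memory_bytes, block_bytes)
    if k < 2:
        raise ValueError("need at least two input buffers to merge")
    runs = ceil_div(size_bytes, memory_bytes)    # lists produced by phase one
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, k)                 # each merge pass shrinks the count
        passes += 1
    return passes


def seconds_per_pass(size_bytes, block_bytes, io_seconds=0.01):
    """Each pass reads and writes every block once."""
    return 2 * ceil_div(size_bytes, block_bytes) * io_seconds


M, B = 1_000_000, 4_000
for p in (1, 2, 3):
    print(p, "pass(es):", max_sortable_bytes(M, B, p), "bytes")
print(passes_needed(10**9, M, B))                # 3 passes for 1GB
print(f"{seconds_per_pass(62_500_000_000, B) / 86_400:.1f} days per pass")  # 3.6
```

With the exact fan-in, three passes cover 249 * 249 MB, about 62GB, in line with the rounded figure on the slide.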
Cost of External Merge Sort

Number of passes: $1 + \lceil \log_{M/B-1} \lceil \mathrm{Size}/M \rceil \rceil$
How much data can we sort with 10MB RAM (M/B = 2500)?
  1 pass → 10MB
  2 passes → 10MB * 2500 = 25,000MB = 25GB
  3 passes → 2500 * 25GB = 62,500GB
Cost of External Merge Sort

Number of passes: $1 + \lceil \log_{M/B-1} \lceil \mathrm{Size}/M \rceil \rceil$
How much data can we sort with 100MB RAM (M/B = 25,000)?
  1 pass → 100MB
  2 passes → 100MB * 25,000 = 2,500,000MB = 2,500GB = 2.5TB
  3 passes → 25,000 * 2.5TB = 62,500TB = 62.5PB
Next time

Next: Indices
For next time: reading from chapter 13 posted today

Hw3 up soon

Now: one-minute responses
