C20.0046: Database Management Systems
Lecture #24
Matthew P. Johnson
Stern School of Business, NYU
Spring, 2004
M.P. Johnson, DBMS, Stern/NYU, Sp2004

Agenda
• Previously: XML
• Next:
  • Finish XML & related technologies
  • Hardware
  • Indices
• Hw3 up soon
• 1-minute responses
• Grading

XML Applications/dialects
• Copy from: http://pages.stern.nyu.edu/~mjohnson/dbms/eg/xml.txt
• MathML: Mathematical Markup Language
  • http://wwwasdoc.web.cern.ch/wwwasdoc/WWW/publications/ictp99/ictp99N8059.html
• ChemML: Chemical Markup Language
• X4ML: XML for Merrill Lynch
• XHTML: HTML retrofitted as an XML application
  • Validation: http://pages.stern.nyu.edu/~mjohnson/dbms/

XML Applications/dialects
• VoiceXML: http://newmedia.purchase.edu/~Jeanine/interfaces/rps.xml
  • AT&T Directory Assistance
  • http://phone.yahoo.com/
• Image from http://www.voicexml.org/tutorials/intro2.html

More XML Apps
• FIXML
  • XML equiv. of FIX: Financial Information eXchange
• swiftML
  • XML equiv. of SWIFT: Society for Worldwide Interbank Financial Telecommunication message format
• Apache’s Ant
  • Scripting language for Java build management
  • http://ant.apache.org/manual/using.html
• Many more: http://www-106.ibm.com/developerworks/xml/library/x-stand4/

More XML Applications/Protocols
• RSS: Rich Site Summary/Really Simple Syndication
  • http://slate.msn.com/rss/
  • http://slashdot.org/index.rss
  • Screenshot: http://paulboutin.weblogger.com/pictures/viewer$673
  • More info: http://slate.msn.com/id/2096660/

    <channel>
      <title>my channel</title>
      <item>
        <title>story 1</title>
        <link>…</link>
      </item>
      <!-- other items -->
    </channel>
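To make the RSS structure above concrete, here is a minimal Python sketch that reads a channel shaped like the slide's example and lists its stories. The sample feed text, the example.com links, and the function name `list_stories` are assumptions made for illustration; real feeds wrap the channel in an `<rss>` element and carry more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed in the same shape as the slide's <channel> example.
RSS_XML = """<channel>
  <title>my channel</title>
  <item><title>story 1</title><link>http://example.com/story1</link></item>
  <item><title>story 2</title><link>http://example.com/story2</link></item>
</channel>"""


def list_stories(channel_xml):
    """Return the channel title and a list of (story title, link) pairs."""
    channel = ET.fromstring(channel_xml)
    stories = [(item.findtext("title"), item.findtext("link"))
               for item in channel.findall("item")]
    return channel.findtext("title"), stories


channel_title, stories = list_stories(RSS_XML)
print(channel_title)
for title, link in stories:
    print(title, link)
```

An aggregator would do roughly this for every feed it polls, which is why the format only needs a title and a list of items.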

More XML Applications/Protocols
• SOAP: Simple Object Access Protocol
  • XML-based messaging format
• Used by:
  • Google API: http://www.google.com/apis/
  • Amazon API: http://amazon.com/gp/aws/landing.html
  • Amazon light: http://kokogiak.com/amazon/
  • Other examples: http://www.wired.com/wired/archive/12.03/google.html?pg=10&topic=&topic_set=
• Example:
  • SOAP envelope with header and body
  • Request sales tax for total

    <SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
      <SOAP:Header></SOAP:Header>
      <SOAP:Body>
        <GetSalesTax>
          <SalesTotal>100</SalesTotal>
        </GetSalesTax>
      </SOAP:Body>
    </SOAP:Envelope>

More XML Applications/Protocols

    <?xml version="1.0" encoding="UTF-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <gs:doGoogleSearch xmlns:gs="urn:GoogleSearch">
          <key>%(key)s</key>
          <start>0</start>
          <maxResults>10</maxResults>
          <filter>true</filter>
          <restrict/>
          <safeSearch>false</safeSearch>
          <lr/>
        </gs:doGoogleSearch>
      </soap:Body>
    </soap:Envelope>

RDF
• RDF: Resource Description Framework
• Describe info on web
  • Metadata for the web
  • Content, authors, relations to other content
• “Semantic web”
• See http://www.w3.org/DesignIssues/RDFnot.html

New topic: Querying XML
• XPath
  • Simple language for accessing nodes
  • Won’t discuss
• XQuery: SQL of XML
• XSLT: sophisticated transformations

XQuery
• XQuery: FLWR expressions
  • Based on Quilt and XML-QL
  • FOR/LET... WHERE... RETURN...

    FOR $b IN document("bib.xml")//book
    WHERE $b/publisher = "Morgan Kaufmann"
      AND $b/year = "1998"
    RETURN $b/title

XQuery
• Find all book titles published after 1995:

    FOR $x IN document("bib.xml")/bib/book
    WHERE $x/year > 1995
    RETURN { $x/title }

• Result:

    <title> abc </title>
    <title> def </title>
    <title> ghi </title>
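For readers who have not seen FLWR expressions before, the two XQuery examples above can be mimicked in Python with the standard library's ElementTree. This is only a sketch of what the queries mean, not how an XQuery engine evaluates them; the sample `bib.xml` contents and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = """<bib>
  <book><title>abc</title><year>1996</year><publisher>Morgan Kaufmann</publisher></book>
  <book><title>old</title><year>1990</year><publisher>Addison-Wesley</publisher></book>
  <book><title>def</title><year>1998</year><publisher>Morgan Kaufmann</publisher></book>
</bib>"""


def titles_after(bib, year):
    """FOR $x IN /bib/book WHERE $x/year > year RETURN $x/title"""
    return [book.findtext("title")
            for book in bib.findall("book")          # FOR: children of <bib>
            if int(book.findtext("year")) > year]    # WHERE


def titles_by(bib, publisher, year):
    """FOR $b IN //book WHERE publisher and year match RETURN $b/title"""
    return [book.findtext("title")
            for book in bib.iter("book")             # FOR: any depth, like //book
            if book.findtext("publisher") == publisher
            and book.findtext("year") == year]


bib = ET.fromstring(BIB_XML)
print(titles_after(bib, 1995))
print(titles_by(bib, "Morgan Kaufmann", "1998"))
```

The FOR clause becomes the loop, WHERE becomes the filter, and RETURN becomes the value collected for each binding.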

SQL and XQuery Side-by-side
• Product(pid, name, maker)
• Company(cid, name, city)
• Find all products made in Seattle
• SQL:

    SELECT x.name
    FROM Product x, Company y
    WHERE x.maker=y.cid
      and y.city="Seattle"

• XQuery:

    FOR $r in document("db.xml")/db,
        $x in $r/Product/row,
        $y in $r/Company/row
    WHERE $x/maker/text()=$y/cid/text()
      and $y/city/text() = "Seattle"
    RETURN { $x/name }

SQL and XQuery Side-by-side
• For each company with revenues < 1M, count the products over $100
• SQL:

    SELECT y.name, count(*)
    FROM Product x, Company y
    WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000
    GROUP BY y.cid, y.name

• XQuery:

    FOR $r in document("db.xml")/db,
        $y in $r/Company/row[revenue/text()<1000000]
    RETURN
      <proudCompany>
        <companyName> { $y/name/text() } </companyName>
        <numberOfExpensiveProducts>
          { count($r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) }
        </numberOfExpensiveProducts>
      </proudCompany>

XSLT: XSL Transformations
• Converts XML docs to other XML docs
  • Or to HTML, PDF, etc.
• E.g.: Have data in XML, want to display to all users
  • Users view web with IE, Netscape, Palm…
  • Have XSLT convert to HTML that looks good on each
• XSLT processor takes XML doc and XSL template for view

Querying XML with XQuery
• FLWR expressions:
  • Often much simpler than XSLT
• XQuery:

    FOR $b IN document("bib.xml")//book
    WHERE $b/publisher = "Morgan Kaufmann"
      AND $b/year = "1998"
    RETURN $b/title

• XSLT:

    <xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <xsl:for-each select="document('bib.xml')//book">
          <xsl:if test="publisher='Morgan Kaufmann' and year='1998'">
            <xsl:copy-of select="title"/>
          </xsl:if>
        </xsl:for-each>
      </xsl:template>
    </xsl:transform>

• XSLT v. XQuery: http://www.xmlportfolio.com/xquery.html
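The two SQL/XQuery side-by-side queries above (products made in Seattle, and expensive-product counts per low-revenue company) can be traced by hand with a small Python sketch. The contents of `db.xml` below, including the `price` and `revenue` fields and all company and product names, are invented for illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical db.xml with one <row> per tuple, as the XQuery examples assume.
DB_XML = """<db>
  <Product>
    <row><pid>1</pid><name>gizmo</name><maker>10</maker><price>150</price></row>
    <row><pid>2</pid><name>widget</name><maker>10</maker><price>20</price></row>
    <row><pid>3</pid><name>gadget</name><maker>20</maker><price>300</price></row>
  </Product>
  <Company>
    <row><cid>10</cid><name>Acme</name><city>Seattle</city><revenue>500000</revenue></row>
    <row><cid>20</cid><name>Globex</name><city>Boston</city><revenue>5000000</revenue></row>
  </Company>
</db>"""


def products_made_in(db, city):
    """Nested-loop join of Product and Company on maker = cid."""
    return [x.findtext("name")
            for x in db.findall("Product/row")
            for y in db.findall("Company/row")
            if x.findtext("maker") == y.findtext("cid")
            and y.findtext("city") == city]


def expensive_product_counts(db, max_revenue, min_price):
    """For each company under max_revenue, count its products over min_price."""
    result = []
    for y in db.findall("Company/row"):
        if float(y.findtext("revenue")) < max_revenue:
            count = sum(1 for x in db.findall("Product/row")
                        if x.findtext("maker") == y.findtext("cid")
                        and float(x.findtext("price")) > min_price)
            result.append((y.findtext("name"), count))
    return result


db = ET.fromstring(DB_XML)
print(products_made_in(db, "Seattle"))
print(expensive_product_counts(db, 1_000_000, 100))
```

The second function follows the XQuery version, so a qualifying company with no expensive products is reported with a count of 0. The SQL version joins first and therefore omits such a company.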

Displaying XML with XSL/XSLT
• XSL: style sheet language for XML
  • As CSS is for HTML
• Menu in XML: http://www.w3schools.com/xml/simple.xml
• XSL file for displaying it: http://www.w3schools.com/xml/simple.xsl
• XSL applied to the XML: http://www.w3schools.com/xml/simplexsl.xml
• More info on Java with XSLT and XPath: http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html

Why XML matters
• Hugely popular
  • To millennium what Java was to mid-90s
  • Buzzword compliant
• XML databases won’t likely replace RDBMSs (remember OODBMSs?), but:
  • Allows for comm. between DBMSs with disparate architectures, tools, languages, etc.
  • Basis for Web Services
• DBMS vendors are adding XML support
  • MS, Oracle, et al.

For more info
• APIs: SAX, JAXP
• Editors: XML Spy, MS XML Notepad: http://www.webattack.com/get/xmlnotepad.shtml
• Parsers: Saxon, Xalan, MS XML Parser
• Lecture drew on resources from:
  • Nine-week course on XML: http://www.cs.rpi.edu/~puninj/XMLJ/classes.html
  • W3C XML Tutorial: http://www.w3schools.com/xml/default.asp
  • http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt

Next topic: Hardware
• Types of memory
• Disks
• Mergesort/TPMMS

What should a DBMS do?
• Store large amounts of data
• Process queries efficiently
• Allow multiple users to access the database concurrently and safely.
• Provide durability of the data.
• How will we do all this?

Let’s get physical
[Figure: DBMS architecture. The user/application issues queries, updates and transaction commands. Query compiler/optimizer -> (query execution plan) -> Execution engine -> (record, index requests) -> Index/record mgr. -> (page commands) -> Buffer manager -> (read/write pages) -> Storage manager -> storage. Transaction commands go to the Transaction manager: • Concurrency control • Logging/recovery]

Types of memory
• Cache:
  • access time 10 nano’s
• Main Memory:
  • Volatile
  • limited address spaces
  • expensive
  • average access time: 10-100 ns
• Disk:
  • 5-10 MB/S transmission rates
  • 100s GB storage
  • average time to access a block: 10-15 msecs.
  • Need to consider seek, rotation, transfer times.
  • Keep records “close” to each other.
• Tape:
  • 1.5 MB/S transfer rate
  • 280 GB typical capacity
  • Only sequential access
  • Not for operational data

Main Memory
• Fastest, most expensive
• Today: on the order of 1 GB is common on PCs
• Some databases could fit in memory
  • New industry trend: Main Memory Database
  • But many cannot
• RAM is volatile and small
  • Still need to store on disk

Secondary Storage
• Disks
• Slower, cheaper than main memory
• Persistent!
• Used with a main memory buffer

[Image: $200 worth of disk space]

Buffer Management in a DBMS
[Figure: page requests from higher levels arrive at the buffer pool in main memory; disk pages from the DB on disk are read into free frames; choice of frame dictated by replacement policy]
• Data must be in RAM for DBMS to operate on it!
• Table of <frame#, pageid> pairs is maintained.
• LRU is not always good.

Buffer Manager
• Why not just use the OS?
  • DBMS may be able to anticipate access patterns
  • Hence, may also be able to perform prefetching
  • DBMS needs the ability to force pages to disk.

Tertiary Storage
• CDs, DVDs, jukeboxes
• Tapes, tape silos
• ROM
• sequential access
• Big but very slow
• long term archiving only

The Mechanics of Disk
• Mechanical characteristics:
  • Rotation speed (5400RPM)
  • Number of platters (1-30)
  • Number of tracks (<=10000)
  • Number of bytes/track ($10^5$)
[Figure labels: cylinder, disk head, spindle, tracks, sector, arm movement, platters, arm assembly]
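The buffer management slides above say that a table from pages to frames is maintained and that LRU is not always good. The toy Python sketch below illustrates both points. It is an assumption-laden simplification, not how any particular DBMS implements its buffer manager: the class name `BufferPool` is hypothetical, the "disk" is a plain dict, and pinning, dirty pages and write-back are left out.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: keeps at most num_frames pages in memory, evicts by LRU."""

    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk              # dict: page id -> bytes, standing in for the disk
        self.frames = OrderedDict()   # page id -> contents, least recently used first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)   # hit: mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:
            self.frames.popitem(last=False)    # miss with full pool: evict LRU page
        self.disk_reads += 1                   # miss: fetch the page from "disk"
        self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]


# Scan a 4-page file twice through a 3-frame pool.
disk = {pid: f"page {pid}".encode() for pid in range(4)}
pool = BufferPool(num_frames=3, disk=disk)
for _ in range(2):
    for pid in range(4):
        pool.get_page(pid)
print(pool.disk_reads)  # 8: every access misses
```

With one frame fewer than the file has pages, LRU always evicts exactly the page that the repeated scan needs next, so the second scan gets no benefit from the pool. A DBMS that knows it is scanning can pick a different replacement policy or prefetch, which is one reason not to rely on the OS.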

Disk Access Characteristics
• Disk latency = time between when command is issued and when data is in memory
• Disk latency = seek time + rotational latency
  • Seek time = time for the head to reach cylinder
    • 10ms to 40ms
  • Rotational latency = time for the sector to rotate under the head
    • Rotation time = 10ms
    • Average latency = 10ms/2
• Transfer rate = typically 40MB/s
• Disks read/write one block at a time (typically 4KB)

A little CS…
• In main memory: CPU time
  • Big O notation!
• In databases time is dominated by I/O cost
  • Big O too, but for I/O’s
  • Often big O becomes a constant
• The I/O Model of Computation
• Consequence: need to redesign certain algorithms

Mergesort
• Alg
• E.g.
• Complexity

Sorting
• Problem: sort 1 GB of data with 1MB of RAM.
• Where we need this:
  • Data requested in sorted order (ORDER BY)
  • Needed for grouping operations
  • First step in sort-merge join algorithm
  • Duplicate removal
  • Bulk loading of B+-tree indexes.

Two-Way Merge-sort
• Requires 3 Buffers in RAM
• Pass 1: Read a page, sort it, write it.
• Pass 2, 3, …, etc.: merge two runs, write them
[Figure: runs of length L are read from disk into the INPUT 1 and INPUT 2 main memory buffers, merged into the OUTPUT buffer, and written back to disk as runs of length 2L]

Two-Way External Merge Sort
• Assume block size is B = 4KB
• Step 1: runs of length L = 4KB
• Step 2: runs of length L = 8KB
• Step 3: runs of length L = 16KB = $2^{3-1} \times 4$KB
• …
• Step 9: runs of length L = 1MB
• …
• Step 19: runs of length L = 1GB (why?)
• Need 19 iterations over the disk data to sort 1GB

Can we do better?

Large Two-Way External Merge Sort
• We've got a meg!
• Divide RAM into thirds
• Read, write in blocks of 333KB
• How much improvement?

Can we do better?
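The "why?" and "How much improvement?" questions on the two-way merge sort slides above come down to counting how often the number of runs can be halved. A minimal Python sketch of that count follows, assuming 1 KB = 1024 bytes and that each initial run is exactly one block; the function name `two_way_passes` is hypothetical.

```python
def two_way_passes(data_bytes, block_bytes):
    """Passes over the data for two-way external merge sort.

    Pass 1 sorts each block on its own; every later pass merges runs
    pairwise, halving the number of runs.
    """
    runs = -(-data_bytes // block_bytes)   # ceiling division: initial sorted runs
    passes = 1
    while runs > 1:
        runs = -(-runs // 2)               # merge runs two at a time
        passes += 1
    return passes


GB = 2 ** 30
KB = 2 ** 10
print(two_way_passes(GB, 4 * KB))     # 19
print(two_way_passes(GB, 333 * KB))   # 13
```

With 4 KB blocks, 1 GB is $2^{18}$ runs, so 18 merge passes follow the initial sorting pass. Using 333 KB blocks cuts the initial run count to 3,149 and the total to 13 passes, which is better but still merges only two runs at a time.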

Cost Model for Our Analysis
• B: Block size ( = 4KB)
• M: Size of main memory ( = 1MB)
• N: Number of records in the file
• R: Size of one record

External Merge-Sort
• Phase one: load M bytes in memory, sort
• Result: SIZE/M lists of length M bytes (1MB)
[Figure: data is read from disk into M bytes of main memory (M/R records), sorted, and written back to disk]

Phase Two
• Merge M/B - 1 lists into a new list
  • M/B - 1 = 1MB / 4KB - 1 = 249 ~= 250
• Result: lists of size M * (M/B - 1) bytes
  • 249 * 1MB ~= 250 MB
[Figure: Input 1, Input 2, …, Input M/B - 1 buffers and one Output buffer in M bytes of main memory, between disk and disk]

Phase Three
• Merge M/B - 1 lists into a new list
• Result: lists of size $M \times (M/B - 1)^2$ bytes
  • 249 * 250 MB ~= 62,500 MB = 62.5 GB
[Figure: Input 1, Input 2, …, Input M/B - 1 buffers and one Output buffer in M bytes of main memory, between disk and disk]

Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{M/B-1}(\text{Size}/M) \rceil$
• How much data can we sort with 1MB RAM?
  • 1 pass: 1MB
  • 2 passes: 250MB (M/B = 250)
  • 3 passes: 62.5GB
• Time: assume read/write block ~ 10 ms = .01 s
  • each pass: read, write all data
  • each pass: 2 * 62.5GB/4KB * .01s = 2 * 156,250s ~= 2 * 2,604m ~= 2 * 43.4h ~= 2 * 1.8 days = 3.6 days

Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{M/B-1}(\text{Size}/M) \rceil$
• How much data can we sort with 10MB RAM (M/B = 2500)?
  • 1 pass: 10MB
  • 2 passes: 10MB * 2500 = 25,000MB = 25GB
  • 3 passes: 2500 * 25GB = 62,500GB = 62.5TB

Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{M/B-1}(\text{Size}/M) \rceil$
• How much data can we sort with 100MB RAM (M/B = 25,000)?
  • 1 pass: 100MB
  • 2 passes: 100MB * 25,000 = 2,500,000MB = 2,500GB = 2.5TB
  • 3 passes: 25,000 * 2.5TB = 62,500TB = 62.5PB

Next time
• Next: Indices
• For next time: reading from chapter 13
  • posted today
• Hw3 up soon
• Now: one-minute responses
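As a check on the cost-of-external-merge-sort slides above, the capacity and time figures can be recomputed with a short Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 4 KB = 4,000 bytes, so M/B = 250), a fan-in of M/B - 1, and 10 ms per block read or write as on the slides. The function names are hypothetical.

```python
def max_sortable_bytes(mem_bytes, block_bytes, passes):
    """Largest input that multiway external merge sort finishes in `passes` passes."""
    fan_in = mem_bytes // block_bytes - 1          # M/B - 1 runs merged at once
    return mem_bytes * fan_in ** (passes - 1)


def passes_needed(data_bytes, mem_bytes, block_bytes):
    """Passes over the data: one run-forming pass, then merges until one run remains."""
    fan_in = mem_bytes // block_bytes - 1
    runs = -(-data_bytes // mem_bytes)             # ceiling division: runs of M bytes
    passes = 1
    while runs > 1:
        runs = -(-runs // fan_in)
        passes += 1
    return passes


def sort_seconds(data_bytes, mem_bytes, block_bytes, io_seconds=0.01):
    """Each pass reads and writes every block once."""
    blocks = -(-data_bytes // block_bytes)
    return passes_needed(data_bytes, mem_bytes, block_bytes) * 2 * blocks * io_seconds


MB, KB, GB = 10 ** 6, 10 ** 3, 10 ** 9
print(max_sortable_bytes(1 * MB, 4 * KB, 2) / MB)    # 249.0 MB in 2 passes
print(max_sortable_bytes(1 * MB, 4 * KB, 3) / GB)    # about 62 GB in 3 passes
print(passes_needed(1 * GB, 1 * MB, 4 * KB))         # 3
print(sort_seconds(1 * GB, 1 * MB, 4 * KB) / 3600)   # about 4.2 hours
```

Under these assumptions the original problem, 1 GB of data with 1 MB of RAM, needs 3 passes instead of the 19 that two-way merging takes, at about 15,000 seconds of I/O in total.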