Scaleable Systems Research at Microsoft (really: what we do at BARC) • Jim Gray Microsoft Research Gray@Microsoft.com http://research.Microsoft.com/~Gray Presented to DARPA WindowsNT workshop 5 Aug 1998, Seattle WA. 1 Outline • PowerCast, FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • WolfPack Failover • NTFS IO measurements • NT-Cluster-Sort • AlwaysUp 2 Telepresence • The next killer app • Space shifting: » Reduce travel • Time shifting: » Retrospective » Offer condensations » Just in time meetings. • Example: ACM 97 » NetShow and Web site. » More web visitors than attendees • People-to-People communication 3 Telepresence Prototypes • PowerCast: multicast PowerPoint » Streaming - pre-sends next anticipated slide » Send slides and voice rather than talking head and voice » Uses ECSRM for reliable multicast » 1000’s of receivers can join and leave any time. » No server needed; no pre-load of slides. » Cooperating with NetShow • FileCast: multicast file transfer. » Erasure encodes all packets » Receivers only need to receive as many bytes as the length of the file » Multicast IE to solve Midnight-Madness problem • NT SRM: reliable IP multicast library for NT • Spatialized Teleconference Station » Texture map faces onto spheres » Space map voices 4 RAGS: RAndom SQL test Generator • Microsoft spends a LOT of money on testing. (60% of development according to one source). • Idea: test SQL by » generating random correct queries » executing queries against database » compare results with SQL 6.5, DB2, Oracle, Sybase • Being used in SQL 7.0 testing. » 375 unique bugs found (since 2/97) » Very productive test tool 5 Sample Rags Generated Statement This Statement yields an error: SQLState=37000, Error=8623 Internal Query Processor Error: Query processor could not produce a query plan. 
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND 
(-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) 6
Automation • Simpler statement with the same error: SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1) • Control statement attributes » complexity, kind, depth, ... • Multi-user stress tests » test concurrency, allocation, recovery 7
One 4-Vendor Rags Test (3 of them vs us) • 60 k selects on MSS, DB2, Oracle, Sybase • 17 SQL Server Beta 2 suspects » 1 suspect per 3,350 statements • Examined 10 suspects, filed 4 bugs (one duplicate); assume 3/10 are new • Note: this is the SQL Server Beta 2 product; quality rising fast (and RAGS sees that) 8
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort 9
Billions Of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers • All clients networked to servers » may be nomadic or on-demand • Fast clients want faster servers • Servers provide » shared data » control » coordination » communication [diagram: mobile and fixed clients connect to servers and super-servers]
Thesis: Many Little Beat Few Big [chart: price vs form factor, mainframe ($1 million, 14") down through mini, micro, nano, pico processor (the smoking, hairy golf ball); memory hierarchy tiers of 1 MB (10-picosecond RAM), 100 MB (10-nanosecond RAM), 10 GB to 1 TB (10-microsecond RAM), 100 TB (10-millisecond disc), and 10-second tape archive] • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?
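Circling back to RAGS: its core trick is walking a SQL grammar at random. Here is a minimal sketch, assuming nothing about the real RAGS internals — the miniature grammar and the pubs-style schema fragment below are illustrative only, far simpler than what produced the statement above:

```python
import random

# Toy RAGS-style generator: recursively expand a tiny SQL grammar at
# random, emitting syntactically valid (if bizarre) SELECT statements.
# The table/column names echo the pubs sample database seen in the
# talk's generated statement; the grammar itself is a made-up miniature.
SCHEMA = {"titles": ["title_id", "price", "advance", "notes"],
          "sales": ["ord_num", "qty", "payterms"]}

def rand_expr(rng, table, depth=0):
    """A column, possibly wrapped in randomly chosen string functions."""
    col = rng.choice(SCHEMA[table])
    if depth >= 2 or rng.random() < 0.5:
        return col
    fn = rng.choice(["UPPER", "LOWER", "LTRIM", "RTRIM"])
    return f"{fn}({rand_expr(rng, table, depth + 1)})"

def rand_select(rng):
    """One random SELECT, sometimes with a nested EXISTS subquery."""
    table = rng.choice(sorted(SCHEMA))
    cols = ", ".join(rand_expr(rng, table)
                     for _ in range(rng.randint(1, 3)))
    stmt = f"SELECT TOP {rng.randint(1, 9)} {cols} FROM {table}"
    if rng.random() < 0.5:
        inner = rng.choice(sorted(SCHEMA))
        stmt += (f" WHERE EXISTS (SELECT DISTINCT TOP 1"
                 f" {rng.choice(SCHEMA[inner])} FROM {inner})")
    return stmt

print(rand_select(random.Random(0)))
```

Running the same generated statement against several engines and flagging any disagreement (or an internal error like the 8623 above) is what turns this into a test oracle.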
1 M SPECmarks, 1 TFLOP • 10^6 clocks to bulk RAM • Event-horizon on chip • VM reincarnated • Multiprogram cache, on-chip SMP
Performance = Storage Accesses, not Instructions Executed • In the “old days” we counted instructions and IOs • Now we count memory references • Processors wait most of the time [chart: where the time goes, clock ticks for AlphaSort components: disc wait, sort, OS, memory wait, B-cache data miss, I-cache miss, D-cache miss] • 70 MIPS; “real” apps have worse I-cache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
Scale Up and Scale Out • Grow UP with SMP: 4xP6 is now standard (personal system, departmental server, SMP super server) • Grow OUT with a cluster: clusters are built from inexpensive parts (cluster of PCs)
Microsoft TerraServer: Scaleup to Big Databases • Build a 1 TB SQL Server database • Data must be » 1 TB » unencumbered » interesting to everyone everywhere » and not offensive to anyone anywhere • Loaded » 1.5 M place names from Encarta World Atlas » 3 M sq km from USGS (1-meter resolution) » 1 M sq km from the Russian Space Agency (2 m) • On the web (world’s largest atlas) • Sell images with commerce server. 15
Microsoft TerraServer Background • Earth is 500 tera-meters square » USA is 10 tm² » 100 tm² of land between 70ºN and 70ºS • We have pictures of 6% of it » 3 tsm from USGS » 2 tsm from the Russian Space Agency • Someday » multi-spectral image » of everywhere » once a day / hour • Image pyramid: 1.8x1.2 km² tile, 10x15 km² thumbnail, 20x30 km² browse image, 40x60 km² jump image • Compress 5:1 (JPEG) to 1.5 TB • Slice into 10 KB chunks; store chunks in DB • Navigate with » Encarta™ Atlas (globe, gazetteer) » StreetsPlus™ in the USA 16
USGS Digital Ortho Quads (DOQ) • US Geologic Survey • 4 Tera Bytes • Most data not yet published • Based on a CRADA » Microsoft TerraServer makes data available.
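The tile → thumbnail → browse → jump pyramid can be illustrated with a small sketch. The 2x2 box-average resampling here is my assumption — the talk only says the coarser levels exist, not how they were built:

```python
# Sketch of a TerraServer-style image pyramid: each level above the base
# covers more ground at half the resolution. The 2x2 box-average filter
# is an illustrative assumption, not the project's actual resampler.
def downsample(img):
    """Halve an image (list of rows of pixel values) in each dimension
    by averaging 2x2 blocks. Assumes even dimensions."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] +
              img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def pyramid(base, levels):
    """Base tile plus successively coarser images (thumbnail-,
    browse-, jump-like levels), stopping when the image gets tiny."""
    out = [base]
    for _ in range(levels):
        if len(out[-1]) < 2 or len(out[-1][0]) < 2:
            break
        out.append(downsample(out[-1]))
    return out
```

Each coarser level is then sliced into the same ~10 KB chunks and stored in the database alongside the base tiles.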
[map: continental US coverage, 1x1 meter, 4 TB, new data coming] USGS “DOQ” 17
Russian Space Agency (SovInfomSputnik) SPIN-2 (Aerial Images is worldwide distributor) • 1.5-meter geo-rectified imagery of (almost) anywhere • Almost equal-area projection • De-classified satellite photos (from 200 km) • More data coming (1 m) • Selling imagery on the Internet • Putting 2 tm² onto Microsoft TerraServer. 18
Demo • Navigate by coverage map to White House • Download image • Buy imagery from USGS • Navigate by name to Venice • Buy SPIN-2 image & Kodak photo • Pop out to Expedia street map of Venice • Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2) 19
Hardware [diagram: Internet via DS3 to a 100 Mbps Ethernet switch; web servers, map site servers, and the SPIN-2 site in front of a 1 TB database server: AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM), 324 StorageWorks disks in racks of 48 x 9 GB drives, and an STK 9710 tape library with 10 DLT7000 drives] 20
The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8x400Mhz Alpha cpus • 10 GB DRAM • 324 9.2 GB StorageWorks disks » 3 TB raw, 2.4 TB of RAID5 • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0 21
Software [diagram: web clients (HTML browser and Java viewer) reach the TerraServer web site over the Internet; Internet Information Server 4.0 with Active Server Pages, Microsoft Site Server EE, and MTS front the image-delivery application on SQL Server 7, the TerraServer DB, the Microsoft Automap ActiveX server, and image provider site(s)] 22
System Management & Maintenance • Backup and recovery » STK 9710 tape robot » Legato NetWorker™ » SQL Server 7 Backup & Restore » clocked at 80 MBps (peak) (~200 GB/hr) • SQL Server Enterprise Mgr » DBA maintenance » SQL Performance Monitor 23
Microsoft TerraServer File Group Layout • Convert
324 disks to 28 RAID5 sets plus 28 spare drives • Make 4 WinNT volumes (RAID 50), 595 GB per volume • Build 30 20-GB files on each volume • DB is a file group of 120 files (E: F: G: H:) 24
Image Delivery and Load • Incremental load of 4 more TB in the next 18 months [diagram: DLT tape is “tar”-read by an ImgCutter on an AlphaServer 4100 into \Drop’N’\Images; LoadMgr moves work over a 100-Mbit EtherSwitch to the AlphaServer 8400, its enterprise storage array (racks of 108 x 9.1 GB drives), and the STK DLT tape library; load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place] 25
Technical Challenge • Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server) • Key idea / solution, a geo-spatial search key: » divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) » Z-transform X & Y into a single Z value; build a B-tree on Z » adjacent images are stored next to each other • Search method: » latitude and longitude => X, Y, then Z » select on matching Z value 26
Sloan Digital Sky Survey • Digital Sky » 30 TB raw » 3 TB cooked (1 billion 3 KB objects) » want to scan it frequently • Using CyberBricks • Current status: » 175 MBps per node » 24 nodes => 4 GBps » 5 minutes to scan the whole archive 27
Some Tera-Byte Databases [scale ladder: kilo, mega, giga, tera, peta, exa, zetta, yotta] • The Web: 1 TB of HTML • TerraServer: 1 TB of images • Several other 1 TB (file) servers • Hotmail: 7 TB of email • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked • EOS/DIS (picture of the planet each week) » 15 PB by 2007 • Federal clearing house: images of checks » 15 PB by 2006 (7-year history) • Nuclear Stockpile Stewardship Program » 10 exabytes (???!!) 28
Info Capture • You can record everything you see or hear or read [scale: kilo = a letter, mega = a novel, giga = a movie, tera = Library of Congress (text), peta = LoC (image)] • What would you do with it?
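The TerraServer Z-transform described above is bit interleaving (a Morton / Z-order curve). A minimal sketch, where the rounding of latitude and longitude to grid cells is an assumed convention the slides do not spell out:

```python
# Sketch of the TerraServer geo-spatial search key: map (lon, lat) to
# grid cells of 1/48th degree longitude by 1/96th degree latitude, then
# bit-interleave the cell numbers into a single Z value.
def cell(lon, lat):
    """Lat/long to grid cell; the offset-and-truncate convention here
    is an assumption, not taken from the talk."""
    x = int((lon + 180) * 48)   # 1/48th-degree longitude cells
    y = int((lat + 90) * 96)    # 1/96th-degree latitude cells
    return x, y

def z_key(x, y, bits=16):
    """Interleave the bits of x and y into one Z value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)   # x bits in odd positions
        z |= ((y >> i) & 1) << (2 * i)       # y bits in even positions
    return z
```

Because X and Y bits alternate in the key, nearby cells get nearby Z values, so a plain B-tree range scan on Z finds adjacent imagery without any geo-spatial access method — which is exactly the point of the scheme.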
How would you organize & analyze it? • Per lifetime: 8 PB of video (10 GBph), 30 TB of audio (10 KBps), 8 GB read or written (words) • See: http://www.lesk.com/mlesk/ksg97/ksg.html 29
Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html • Soon everything can be recorded and kept • Most data will never be seen by humans • Precious resource: human attention • Auto-summarization and auto-search will be key enabling technologies. 30
[scale chart: kilo = a letter, mega = a novel, giga = a movie, tera = Library of Congress (text), peta = LoC (image), LoC (sound + cinema), all photos; exa = all disks, zetta = all tapes, yotta = all information!] 31
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort 32
Scalability • Scale up: to large SMP nodes • Scale out: to clusters of SMP nodes • Targets: 1 billion transactions per day, 100 million web hits, 4 terabytes of data, 1.8 million mail messages 33
Billion Transactions per Day Project • Built a 45-node Windows NT Cluster (with help from Intel & Compaq) • > 900 disks, all off-the-shelf parts • Using SQL Server & DTC distributed transactions • DebitCredit transaction • Each node has 1/20th of the DB • Each node does 1/20th of the work • 15% of the transactions are “distributed”
Billion Transactions Per Day Hardware • 45 nodes (Compaq Proliant) • Clustered with 100 Mbps switched Ethernet • 140 cpu, 13 GB DRAM, 3 TB of disk

Type (nodes)                       CPUs   DRAM        Ctlrs  Disks                      RAID space
Workflow (MTS): 20 Proliant 2500   20x2   20x128 MB   20x1   20x1                       20x2 GB
SQL Server: 20 Proliant 5000       20x4   20x512 MB   20x4   20x(36x4.2GB + 7x9.1GB)    20x130 GB
DTC: 5 Proliant 5000               5x4    5x256 MB    5x1    5x3                        5x8 GB
TOTAL: 45 nodes                    140    13 GB       105    895                        3 TB  35

1.2 B tpd • 1 B tpd ran for 24 hrs. • Out-of-the-box software • Off-the-shelf hardware • AMAZING!
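The partitioning behind the billion-tpd run (each SQL node owns 1/20th of the accounts; about 15% of transactions cross nodes) can be sketched as a routing function. Hash-by-modulo and the 15% remote draw below are illustrative assumptions, not the project's actual code:

```python
import random

# Sketch of the billion-tpd partitioning: 20 SQL Server nodes each own
# 1/20th of the accounts. A DebitCredit transaction is "distributed"
# when the account lives on a different node than the teller; the
# workload was tuned so ~15% of transactions cross nodes.
NODES = 20

def node_for(account_id):
    """Hypothetical placement rule: account id modulo node count."""
    return account_id % NODES

def debit_credit(rng, teller_node):
    """Pick an account: 85% local to the teller's node, 15% on some
    other node. Returns (account_id, is_distributed)."""
    if rng.random() < 0.85:
        acct = rng.randrange(10**6) * NODES + teller_node
    else:
        other = (teller_node + rng.randrange(1, NODES)) % NODES
        acct = rng.randrange(10**6) * NODES + other
    return acct, node_for(acct) != teller_node

rng = random.Random(42)
runs = [debit_credit(rng, t % NODES)[1] for t in range(100_000)]
print(f"distributed fraction: {sum(runs) / len(runs):.3f}")
```

Local transactions commit on one node; the distributed 15% go through DTC's two-phase commit, which is what made the 45-node cluster behave as one database.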
• Sized for 30 days • Linear growth • 5 micro-dollars per transaction 36
How Much Is 1 Billion Tpd? • 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute) [chart: millions of transactions per day, log scale, comparing 1 Btpd with Visa, ATT, BofA, NYSE] • ATT: 185 million calls per peak day (worldwide) • Visa: ~20 million tpd » 400 million customers » 250 K ATMs worldwide » 7 billion transactions (card + cheque) in 1994 • New York Stock Exchange: 600,000 tpd • Bank of America » 20 million tpd checks cleared (more than any other bank) » 1.4 million tpd ATM transactions • Worldwide airline reservations: 250 Mtpd 37
NCSA Super Cluster http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP + Myricom + WindowsNT • A super computer for 3 M$ • Classic Fortran/MPI programming • DCOM programming model 38
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort 39
NT Clusters (Wolfpack) • Scale DOWN to PDA: WindowsCE • Scale UP an SMP: TerraServer • Scale OUT with a cluster of machines • Single-system image » naming » protection/security » management/load balance • Fault tolerance » “Wolfpack” • Hot-pluggable hardware & software 40
Symmetric Virtual Server Failover Example [diagram: a browser reaches two servers; Server 1 hosts the web site, Server 2 the database, and each can fail over to host the other’s virtual server, with web site files and database files on shared storage] 41
Clusters & BackOffice • Research: instant & transparent failover • Making BackOffice PlugNPlay on Wolfpack » automatic install & configure • The Virtual Server concept makes it easy » simpler management concept » simpler context/state migration » transparent to applications • SQL 6.5E & 7.0 failover • MSMQ (queues), MTS
(transactions). 42
Next Steps in Availability • Study the causes of outages • Build the AlwaysUp system: » two geographically remote sites » users get instant and transparent failover to the 2nd site » working with the WindowsNT and SQL Server groups on this. 43
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort 44
Storage Latency: How Far Away is the Data? [chart, access time in clock ticks with travel-time analogies: registers = 1 (my head, 1 min); on-chip cache = 2 (this room); on-board cache = 10 (this campus, 10 min); memory = 100 (Sacramento, 1.5 hr); disk = 10^6 (Pluto, 2 years); tape/optical robot = 10^9 (Andromeda, 2,000 years)] 45
The Memory Hierarchy • Measuring & modeling sequential IO • Where is the bottleneck? • How does it scale with SMP, RAID, new interconnects? • Goals: » balanced bottlenecks » low overhead » scale to many processors (10s) » scale to many disks (100s) [diagram: app address space and file cache in memory, over the mem bus, PCI, adapter, and SCSI controller, to the disks] 46
PAP (Peak Advertised Performance) vs RAP (Real Application Performance) • Goal: RAP = PAP / 2 (the half-power point) [diagram: rated speeds (system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk 10-15 MBps) against the 7.2 MB/s of application data that actually flows through the file system buffers at every stage] 47
The Best Case: Temp File, NO IO • Temp-file read/write hits the file system cache • Program uses a small (in-cpu-cache) buffer • So write/read time is bus move time (3x better than copy) • Paradox: the fastest way to move data is to write then read it [chart: temp file read/write, bars 148, 136, 54 MBps; this hardware is limited to 150 MBps per processor] 48
Memcopy() Bottleneck Analysis • Drawn to linear scale: » disk R/W ~9 MBps » MemCopy ~50 MBps » memory read/write ~150 MBps » theoretical bus bandwidth 422 MBps = 66 Mhz x 64 bits 49
3 Stripes and You’re Out!
• 3 disks can saturate the adapter • CPU time goes down with request size • Similar story with UltraWide • Ftdisk (striping) is cheap [charts: read and write throughput (MB/s) vs request size (KB) for 1 to 4 disks, 3-deep, Fast SCSI; CPU cost (ms/MB) vs request size] 50
Parallel SCSI Busses Help • A second SCSI bus nearly doubles read and WCE throughput • Write needs deeper buffers • Experiment is unbuffered (3-deep + WCE) [chart: one vs two SCSI busses, read / write / WCE throughput (MB/s) vs request size (KB); roughly 2x with two busses] 51
File System Buffering & Stripes (UltraWide Drives) • FS buffering helps small reads • FS buffered writes peak at 12 MBps • 3-deep async helps • Read peaks at 30 MBps • Write peaks at 20 MBps [charts: three disks, 1-deep vs 3-deep; FS read, read, FS write, WCE, and write throughput (MB/s) vs request size (KB)] 52
PAP vs RAP • Reads are easy, writes are hard • Async write can match WCE.
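The striping and bus measurements above follow one rule: realized throughput is the minimum over the stages an IO crosses. A sketch using the measured (RAP) figures from these slides, assuming rates add linearly across disks and busses, which the slides show is only approximately true:

```python
# Min-over-stages model of sequential IO: an IO pipeline runs at the
# speed of its slowest stage. Rates are the measured numbers from the
# slides: ~9 MBps per disk, ~31 MBps per SCSI bus, ~72 MBps per PCI
# bus, ~150 MBps memory read/write.
def pipeline_rap(n_disks, n_scsi=1, n_pci=1):
    """Return (limit MBps, name of bottleneck stage)."""
    stages = {
        "disks": 9 * n_disks,
        "scsi": 31 * n_scsi,
        "pci": 72 * n_pci,
        "memory": 150,
    }
    limit = min(stages, key=stages.get)
    return stages[limit], limit

print(pipeline_rap(3))        # three disks, just shy of one SCSI bus
print(pipeline_rap(9, 2, 1))  # the 9-disk, 2-bus, 1-PCI configuration
```

The second call predicts a SCSI-bus limit of 62 MBps, close to the ~65 MBps unbuffered read the bottleneck-analysis slide reports, which is why adding the second bus (and, hypothetically, a second PCI) pays off.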
[diagram, PAP vs RAP: system bus 422 vs 142 MBps; PCI 133 vs 72 MBps; SCSI 40 vs 31 MBps; disks 10-15 vs 9 MBps of application data through the file system] 53
Bottleneck Analysis • NTFS read/write, 9 disks, 2 SCSI busses, 1 PCI: » ~65 MBps unbuffered read » ~43 MBps unbuffered write » ~40 MBps buffered read » ~35 MBps buffered write [diagram: adapters ~30 MBps each, PCI ~70 MBps, memory read/write ~150 MBps] 54
Hypothetical Bottleneck Analysis • NTFS read/write, 12 disks, 4 SCSI, 2 PCI (not measured; we had only one PCI bus available, the 2nd one was “internal”): » ~120 MBps unbuffered read » ~80 MBps unbuffered write » ~40 MBps buffered read » ~35 MBps buffered write [diagram: four adapters ~30 MBps each, two PCI busses ~70 MBps each, memory read/write ~150 MBps] 55
Year 2002 Disks • Big disk (10 $/GB) » 3” » 100 GB » 150 kaps (k accesses per second) » 20 MBps sequential • Small disk (20 $/GB) » 3” » 4 GB » 100 kaps » 10 MBps sequential • Both running Windows NT™ 7.0? (see below for why) 56
How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation. • Each node does not completely trust the others. • Nodes use RPC to talk to each other » RMI? CORBA? DCOM? IIOP? » One or all of the above. • Huge leverage in high-level interfaces. • Same old distributed-system story. [diagram: applications on each node talk via RPC, streams, or datagrams over VIAL/VIPL across the wire(s)] 57
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort 58
Penny Sort Ground Rules http://research.microsoft.com/barc/SortBenchmark • How much can you sort for a penny? » hardware and software cost » depreciated over 3 years » 1 M$ system gets about 1 second » 1 K$ system gets about 1,000 seconds » Time (seconds) = 946,080 / SystemPrice ($) • Input and output are disk resident • Input is » 100-byte records (random data) » key is first 10 bytes.
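The penny budget arithmetic works out as follows: three years is about 94.6 million seconds, and a penny buys 1/100th of a dollar's share of that lifetime:

```python
# Penny Sort budget: depreciate the system over 3 years; one penny buys
# (1/100 dollar) / (price in dollars) of the machine's lifetime.
LIFETIME_SECONDS = 3 * 365 * 24 * 3600   # 94,608,000

def penny_seconds(system_price_dollars):
    """Seconds of machine time one penny buys."""
    return LIFETIME_SECONDS / 100 / system_price_dollars

print(penny_seconds(1_000_000))  # ~0.95 s for a 1 M$ system
print(penny_seconds(1_000))      # ~946 s for a 1 K$ system
print(penny_seconds(1_107))      # ~855 s for the 1107 $ PennySort box
```

The 946,080 constant in the rule is just 94,608,000 / 100, and the 1107 $ machine's ~855-second budget comfortably covers the 820-second elapsed sort reported on the next slide.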
• Must create output file and fill it with a sorted version of the input file • Daytona (product) and Indy (special) categories 59
PennySort • Hardware » 266 Mhz Intel PPro » 64 MB SDRAM (10 ns) » dual Fujitsu DMA 3.2 GB EIDE disks • Software » NT workstation 4.3 » NT 5 sort • Performance » sort 15 M 100-byte records (~1.5 GB) » disk to disk » elapsed time 820 sec » cpu time = 404 sec [pie chart, PennySort machine (1107 $): cpu 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%, other 22%] 60
Cluster Sort (conceptual model) • Multiple data sources • Multiple data destinations • Multiple nodes • Disks -> sockets -> disk -> disk [diagram: nodes A, B, C each scatter their local AAA/BBB/CCC records so all A records land on node A, B records on node B, C records on node C] 61
Cluster Install & Execute • If this is to be used by others, it must be » easy to install » easy to execute • Installations of distributed systems take time and can be tedious (AM2, GluGuard) • Parallel remote execution is non-trivial (GLUnix, LSF) • How do we keep this “simple” and “built-in” to NTClusterSort? 62
Remote Install • Add a registry entry to each remote node: RegConnectRegistry(), then RegCreateKeyEx() 63
Cluster Execution • Setup: fill in MULTI_QI and COSERVERINFO structs • CoCreateInstanceEx() • Retrieve the remote object handle from the MULTI_QI struct • Invoke methods (Sort()) as usual 64
SAN: Standard Interconnect • LAN faster than memory bus? • 1 GBps links in the lab • 300 $ port cost soon • Port is the computer [speed ladder: Gbps Ethernet 110 MBps; PCI-32 70 MBps; UW SCSI 40 MBps; FW SCSI 20 MBps; SCSI 5 MBps] 65
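The cluster sort data flow (disks -> sockets -> disk -> disk) can be sketched as scatter-then-local-sort. Range partitioning on the first key byte stands in for the real record router, and plain lists stand in for the sockets:

```python
# Sketch of the one-pass cluster sort pictured above: every node
# scatters each record to the node that owns its key range, then each
# node sorts what it received. Concatenating node 0's output, node 1's,
# and so on yields the globally sorted order.
def owner(key, nodes):
    """Which node owns this key? Equal ranges over the first key byte
    (an illustrative partitioning rule, not the real router)."""
    return min(key[0] * nodes // 256, nodes - 1)

def cluster_sort(per_node_input, nodes):
    """per_node_input: one list of byte-string records per source node.
    Returns one sorted run per destination node."""
    inboxes = [[] for _ in range(nodes)]
    for records in per_node_input:           # phase 1: scatter
        for rec in records:
            inboxes[owner(rec, nodes)].append(rec)
    return [sorted(box) for box in inboxes]  # phase 2: local sorts
```

With random keys each node gets roughly equal work, so both phases scale with the node count — the property the penny-sort-per-node numbers are meant to multiply.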