Tandem Daytona TeraByte Sort: Tsort 1 TB in 47.5 Minutes
David Cossock, Sam Fineberg, Pankaj Mehra, John Peck
Trophy presentation by Jim Gray

Benchmark History
[Timeline figure, 1970 to 2000. Labels: IBM TP 1-7; CA and Tony Lukes; Debit Credit (Gray); Wisconsin (Bitton, Boral, DeWitt, Turbyfill); Sort; Datamation (Anon et al); TPC-A; TPC-B; TPC-C; TPC-D; TPC-W; MCC (Boral &...); Teradata (Bollinger &...); PennySort; MinuteSort; ?]

A Short History of Sort
• April Fools 1985: Datamation Sort
  – Sort 1M 100-byte records
  – An IO benchmark: 15 min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/

Ground Rules
• How much can you sort for a penny (in a minute).
  – Hardware and software cost
  – Depreciated over 3 years
  – 1M$ system gets about 1 second,
  – 1K$ system gets about 1,000 seconds.
  – Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
  – 100-byte records (random data)
  – key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories

Bottleneck Analysis
• Drawn to linear scale
[Figure: Theoretical Bus Bandwidth 422 MBps = 66 MHz x 64 bits; Memory Read/Write ~150 MBps; MemCopy ~50 MBps; Disk R/W ~15 MBps]

Bottleneck Analysis
• NTFS Read/Write
• 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5), 3 PCI 64
  – ~155 MBps unbuffered read (175 raw)
  – ~95 MBps unbuffered write
• Recently: SQL Server on Xeon: 190 MBps scan.
  Good, but 10x down from S390/SGI/UE10k
[Figure: Adapter ~70 MBps; PCI ~110 MBps; Memory Read/Write ~250 MBps]

PennySort
• Hardware
  – 266 MHz Intel PPro
  – 64 MB SDRAM (10 ns)
  – Dual Fujitsu DMA 3.2 GB EIDE disks
• Software
  – NT workstation 4.3
  – NT 5 sort
• Performance
  – sort 15 M 100-byte records (~1.5 GB)
  – Disk to disk
  – elapsed time 820 sec
    • cpu time = 404 sec
[Pie chart: PennySort Machine (1107$). cpu 32%; Disk 25%; board 13%; Memory 8%; Other 22% (Network, Video, floppy 9%; Cabinet + Assembly 7%; Software 6%)]

Recent Results
• NCSAsort: 10.3 GB in .9 minute, 60 Intel/NT/Myrinet nodes
• MillenniumSort: 16x Dell NT cluster: 100 MB in 1.08 sec (Datamation)

1999 PennySort
• Daytona & Indy: 2.58 GB in 917 sec
• HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Intel 400 MHz, 2 IDE disks

1998 TB Sort
• Chris Nyberg, Nsort, SGI 32x Origin2000, 151 minutes

1999 Terabyte Sort
• Daytona: David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, Tandem/Sandia TSort: 68 CPU ServerNet, 47 minutes
• Indy: IBM SPsort, 488 nodes, 1952 cpu, 2168 disks, 17.6 minutes = 1057 sec (all for 1/3 of 94M$; slice price is 64k$ for 4 cpu, 2 GB RAM, 6 9GB disks + interconnect)

Sandia/Compaq/ServerNet/NT Sort
• Sort 1.1 Terabyte (13 Billion records) in 47 minutes
• 68 nodes (dual 450 MHz processors), 543 disks, 1.5 M$
• 1.2 GBps network rap (2.8 GBps pap)
• 5.2 GBps of disk rap (same as pap)
• (rap = real application performance, pap = peak advertised performance)
[Figure: node diagram. Compaq Proliant 1850R Server; 2 400 MHz CPUs; 512 MB SDRAM; PCI Bus; ServerNet I dual-ported PCI NIC; 4 SCSI busses, each with 2 data disks; links "To X Fabric" and "To Y Fabric", each through a 6-port ServerNet I crossbar switch]
[Figure: The 72-Node 48-Switch ServerNet-I Topology Deployed at Sandia National Labs. Bisection Line (each switch on this line adds 3 links to bisection width); X Fabric (10 bidirectional bisection links); Y Fabric (14 bidirectional bisection links)]

SP sort
• 2 – 4 GBps!
[Figure: GPFS read, GPFS write, Local read, and Local write throughput (GB/s, 0.0 to 4.0) vs. elapsed time (seconds, 0 to 900)]
• 56 nodes, 18 racks: storage
• 432 nodes, 37 racks: compute
• 488 nodes, 55 racks: 1952 processors, 732 GB RAM, 2168 disks
• 56 storage nodes manage 1680 4GB disks: 336 4+P twin-tail RAID5 arrays (30/node)
• Compute rack: 16 nodes, each has
  – 4x332 MHz PowerPC604e
  – 1.5 GB RAM
  – 1 32x33 PCI bus
  – 9 GB SCSI disk
  – 150 MBps full duplex SP switch
• Storage rack: 8 nodes, each has
  – 4x332 MHz PowerPC604e
  – 1.5 GB RAM
  – 3 32x33 PCI bus
  – 30x4 GB SCSI disk (4+P RAID5)
  – 150 MBps full duplex SP switch

1999 Sort Records
• Penny
  – Daytona: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
  – Indy: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Minute
  – Daytona: 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin, IRIX
  – Indy: 10.3 GB in 56.51 sec. NOW+MPI HPVMsort, Luis Rivera (UIUC) & Andrew Chien (UCSD)
• TeraByte
  – Daytona: 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, 68x2 Compaq & Sandia Labs
  – Indy: 1057 seconds. SPsort, 1952 SP cluster, 2168 disks, Jim Wyllie
• Datamation: 1.18 seconds. Phillip Buonadonna, Spencer Low, Josh Coates, UC Berkeley Millennium Sort, 16x2 Dell NT Myrinet

2x/year!
• Partly hardware
• Partly software
• Partly economics
[Charts: "Records Sorted per Second Doubles Every Year" and "GB Sorted per Dollar Doubles Every Year", log scale, 1985 to 2000]

Progress on Sorting
• Speedup comes from Moore's law: 40%/year
• Processor/Disk/Network arrays: 60%/year (this is a software speedup).
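The two rates above compound, which is roughly where the "2x/year" headline comes from. A minimal sketch, assuming the 40% and 60% annual rates act as independent multiplicative factors; the names below are illustrative and not from the talk:

```python
# Annual improvement factors as stated on the slide.
MOORE_FACTOR = 1.40   # Moore's law: 40%/year
ARRAY_FACTOR = 1.60   # processor/disk/network arrays: 60%/year


def combined_annual_factor(*factors: float) -> float:
    """Multiply independent annual speedup factors into one yearly factor."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


def cumulative_speedup(annual_factor: float, years: int) -> float:
    """Total speedup after compounding an annual factor for `years` years."""
    return annual_factor ** years


yearly = combined_annual_factor(MOORE_FACTOR, ARRAY_FACTOR)
print(f"combined factor per year: {yearly:.2f}")
print(f"doubling every year for 15 years: {cumulative_speedup(2.0, 15):,.0f}x")
```

The combined factor is about 2.2 per year, slightly better than doubling, and doubling annually from 1985 to 2000 amounts to more than four orders of magnitude.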
[Chart: "Sort Records/second vs Time" (Records Sorted per Second Doubles Every Year) and "GB Sorted per Dollar Doubles Every Year", log scale, 1985 to 2000. Data labels across the two charts: Bitton M68000; Tandem; Kitsuregawa Hardware Sorter; Intel HyperCube; Sequent; Cray YMP; IBM 3090; IBM RS6000; NOW; Alpha; Ordinal+SGI; Sandia/Compaq/NT; SPsort/ IB; SPsort; Penny NT sort; Compaq/NT; NT/PennySort]

Musings: PennySort=TBsort
• Sorts 1TB in 1Minute
• 2 pass so 3TB of disk
• = 10 disks if 330GB/disk
• = 5 GBps (if each disk is 50 MBps)
• So, 600 seconds (3TB/5GBps)
• So, node costs 1.5k$
• Costs 100x that today
• maybe in 10 years?

Data Gravity: Processing Moves to Transducers
• Move processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
  – Modem
  – Display
  – Microphones (speech recognition) & cameras (vision)
  – Storage: data storage and analysis
• System is "distributed" (a cluster/mob)

Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Figure: software stack. Applications; Services; DBMS; RPC, ...; File System; SAN driver; Disk driver; OS Kernel]

SAN: Standard Interconnect
[Figure: Gbps SAN: 110 MBps; PCI: 70 MBps; UW SCSI: 40 MBps; FW SCSI: 20 MBps; SCSI: 5 MBps]
• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps (10% cpu utilization at each end)
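The PennySort time budget from the Ground Rules slide can be checked with a few lines of arithmetic. This is a sketch only: it takes the budget as 946,080 divided by the system price, which is what the 1M$ and 1K$ examples imply, and it assumes a 365-day year. The function names are hypothetical, not part of the benchmark rules.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 (assumes a 365-day year)
DEPRECIATION_YEARS = 3               # from the Ground Rules slide
PENNIES_PER_DOLLAR = 100

# Seconds of machine time that one penny buys on a 1 $ system.
BUDGET_CONSTANT = SECONDS_PER_YEAR * DEPRECIATION_YEARS // PENNIES_PER_DOLLAR


def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of run time one penny buys on a system of the given price."""
    return BUDGET_CONSTANT / system_price_dollars


def price_for_budget(seconds: float) -> float:
    """System price (in dollars) at which one penny buys `seconds` of time."""
    return BUDGET_CONSTANT / seconds


for price in (1_000_000, 1_000, 1_107):
    print(f"{price:>9,} $ system: {penny_seconds(price):8.1f} s per penny")
print(f"600 s budget: {price_for_budget(600):,.0f} $ system")
```

For the 1107$ PennySort machine the budget comes to about 855 seconds, so the 820-second run fits within one penny. A 600-second sort corresponds to a system price near 1.6k$, in line with the 1.5k$ node on the Musings slide.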