What I have been Doing
• Peta Bumps
• 10k$ TB Scaleable Computing
• Sloan Digital Sky Survey

Sense of scale
• How fat is your pipe?
• The fattest pipe on the MS campus is the WAN!
  – 300 MBps: OC48 = G2, or memcpy()
  – 94 MBps: coast to coast
  – 90 MBps: PCI
  – 20 MBps: disk / ATM / OC3

[Map: Redmond/Seattle, WA to Arlington, VA via New York and San Francisco, CA; 5626 km, 10 hops]
Partners: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (High Speed Connectivity Consortium), DARPA.

The Path DC -> SEA
C:\> tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops
 0                            (source)         Arlington, Virginia   ISI
 1   16 ms  <10 ms  <10 ms   140.173.170.65    Arlington, Virginia   ISI, interface ISIe
 2  <10 ms  <10 ms  <10 ms   205.171.40.61     Arlington, Virginia   Qwest DC Edge
 3  <10 ms  <10 ms  <10 ms   205.171.24.85     Arlington, Virginia   Qwest DC Core
 4  <10 ms  <10 ms   16 ms   205.171.5.233     New York, New York    Qwest NYC Core
 5   62 ms   63 ms   62 ms   205.171.5.115     San Francisco, CA     Qwest SF Core
 6   78 ms   78 ms   78 ms   205.171.5.108     Seattle, Washington   Qwest Sea Core
 7   78 ms   78 ms   94 ms   205.171.26.42     Seattle, Washington   Qwest Sea Edge
 8   78 ms   79 ms   78 ms   208.46.239.90     Seattle, Washington   PNW Gigapop
 9   78 ms   78 ms   94 ms   198.48.91.30      Redmond, Washington   Microsoft
10   78 ms   78 ms   94 ms   131.107.151.194   Redmond, Washington   Microsoft

Equipment along the path (hop 0 to hop 10): DELL 4400 Win2K WKS, Alteon GbE; Juniper M40, GbE; Cisco GSR, OC48; Cisco GSR, OC48; Cisco GSR, OC48; Cisco GSR, OC48; Cisco GSR, OC48; Juniper M40, OC48; Juniper M40, OC48; Cisco GSR, OC48; Compaq SP750 Win2K WKS, SysKonnect GbE.

Single-stream TCP/IP throughput
• 750 Mbps over 5000 km (957 Mbps multi-stream): ~4e15 bit meters per second = 4 Peta bmps ("peta bumps")

5 Peta bmps multi-stream: "PetaBumps"
• 751 Mbps for 300 seconds (~28 GB): single-thread, single-stream TCP/IP, desktop-to-desktop, out-of-the-box performance*
• 5626 km x 751 Mbps ≈ 4.2e15 bit meters per second ≈ 4.2 Peta bmps
• Multi-stream is 952 Mbps ≈ 5.2 Peta bmps
• *4470-byte MTUs were enabled on all routers; 20 MB window size

Pointers
• The single-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
• The multi-stream submission: http://research.Microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
• The code: http://research.Microsoft.com/~gray/papers/speedy.htm (speedy.h, speedy.c)
• A PowerPoint presentation about it: http://research.Microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt

What I have been Doing
• Peta Bumps
• 10k$ TB Scaleable Computing
• Sloan Digital Sky Survey

TPC-C: the standard transaction processing benchmark
• Mix of 5 simple transaction types
• Database scales with workload
• Measures a balanced system

High performance clusters: single-site clusters
• Billions of transactions per day
• Tera-ops & peta-bytes (10 k-node clusters)
• Micro-dollar/transaction

Scalability Successes
[Chart: tpmC vs. time, Jan-95 to Jan-99, 0 to 90,000 tpmC]
• Hardware + software advances
  – TPC & Sort examples (2x/year; see the compounding sketch below)
  – Many other examples
[Charts: "Records Sorted per Second Doubles Every Year" and "GB Sorted per Dollar Doubles Every Year", 1985 to 2000]
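The 2x-per-year claim for the TPC and sort results compounds quickly. A minimal sketch of that compounding in C; the starting value and year span below are illustrative assumptions, not measurements taken from the charts:

  /* Compounding a fixed yearly improvement rate, e.g. the "2x per year"
   * claimed for records sorted per second and GB sorted per dollar.
   * The starting value and span are illustrative, not measured data. */
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double factor_per_year = 2.0;   /* 2x/year from the slide        */
      int    years           = 15;    /* e.g. the 1985 to 2000 span    */
      double start           = 1.0;   /* illustrative starting value   */

      double total = start * pow(factor_per_year, years);
      printf("After %d years at %.0fx/year: %.0fx overall\n",
             years, factor_per_year, total);   /* 2^15 = 32768x */
      return 0;
  }

At 2x per year, fifteen years of progress is a factor of 2^15, roughly 33,000x.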
Progress since Jan 99: Running out of gas?
• 50% better peak performance (not 2x)
• 2x better price/performance
• At a cost ceiling: systems cost 7 M$ to 13 M$
• The June 98 result was a "hero" effort (Compaq/Alpha/Oracle: 96 CPUs, 8-node cluster, 102,542 tpmC @ 139 $/tpmC, 5/5/98)
[Charts: tpmC vs. time, Jan-95 to Jan-00, annotated "Out'a gas?" and "TpmC (off-scale good!)"]

2/17/00: Back on schedule!
• First proof point of commoditized scale-out
• 1.7x better performance, 3x better price/performance
• 4 M$ vs. 7 M$ to 13 M$
• Much more to do, but a great start!
[Chart: tpmC vs. time, Jan-95 to Jan-00, back on schedule]

Year 2000 Sort Results (Daytona and Indy classes)
• Penny (Daytona and Indy): HMsort, 4.5 GB (45 M records) in 886 seconds on a $1010 Win2K/Intel system. Brad Helmkamp, Keith McCready, Stenograph LLC. doc (74 KB), pdf (32 KB).
• Minute (Daytona): Ordinal Nsort, 7.6 GB in 60 seconds on a 32-CPU SGI Origin, IRIX.
• Minute (Indy): NOW+HPVMsort, 21.8 GB (218 M records) in 56.51 seconds on 64 WinNT nodes. Luis Rivera, Xianan Zhang, Andrew Chien, UCSD. pdf (170 KB).
• TeraByte: 49 minutes on a 68x2 Compaq Tandem at Sandia Labs, David Cossock, Sam Fineberg, Pankaj Mehra, John Peck; and SPsort, 1057 seconds on a 1952-node SP cluster with 2168 disks, Jim Wyllie. SPsort.pdf (80 KB).
• Datamation: 1 M records in 0.998 seconds, Mitsubishi DIAPRISM hardware sorter with an HP 4 x 550 MHz Xeon PC server + 32 SCSI disks, Windows NT4. Shinsuke Azuma, Takao Sakuma, Tetsuya Takeo, Takaaki Ando, Kenji Shirai, Mitsubishi Electric Corp. doc (703 KB), pdf (50 KB).

What's a Balanced System?
[Diagram: system bus feeding two PCI buses]

Rules of Thumb in Data Engineering
• Moore's law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!).
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade, so we need parallelism.
• RAM:disk:tape price is 1:10:30, going to 1:10:10.
• Amdahl's speedup law: speedup is at most (S+P)/S for serial part S and parallel part P.
• Amdahl's IO law: a bit of IO per instruction per second (~1 TBps per 10 teraOPs; 50,000 disks per 10 teraOPs: 100 M$); see the sketch below.
• Amdahl's memory law: a byte of memory per instruction per second (going to 10) (1 TB RAM per teraOP: 1 TeraDollars)
• PetaOps anyone?
• Gilder's law: aggregate bandwidth doubles every 8 months.
• 5-minute rule: cache disk data that is reused within 5 minutes.
• Web rule: cache everything!
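A back-of-the-envelope sketch of the Amdahl IO-law arithmetic above. The per-disk bandwidth and the per-disk (packaged and served) price are my assumptions, chosen to be consistent with the slide's "50,000 disks per 10 teraOPs: 100 M$" figure; they are not stated by the rule itself:

  /* Amdahl's IO law: ~1 bit of IO per instruction per second.
   * disk_MBps and disk_price are assumptions chosen to reproduce the
   * slide's "50,000 disks / 10 teraOPs ~ 100 M$" figure. */
  #include <stdio.h>

  int main(void) {
      double ops_per_sec      = 10e12;                 /* 10 teraOPs          */
      double io_bits_per_sec  = ops_per_sec * 1.0;     /* 1 bit per op/sec    */
      double io_bytes_per_sec = io_bits_per_sec / 8.0; /* ~1.25 TB/s          */

      double disk_MBps  = 25.0;     /* assumed sequential MBps per disk       */
      double disk_price = 2000.0;   /* assumed $ per disk, packaged + served  */

      double disks = io_bytes_per_sec / (disk_MBps * 1e6);
      printf("IO needed : %.2f TB/s\n", io_bytes_per_sec / 1e12);
      printf("Disks     : %.0f\n", disks);                      /* ~50,000 */
      printf("Disk cost : %.0f M$\n", disks * disk_price / 1e6); /* ~100   */
      return 0;
  }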
Reference: http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc

Cheap Storage
• Disks are getting cheap: ~7 k$/TB (25 x 40 GB disks @ 230 $ each)
[Charts: price vs. disk capacity, $ vs. raw disk unit size in GB, with linear fits SCSI y = 15.895x + 13.446 and IDE y = 5.7156x + 47.857; and raw k$/TB vs. disk unit size in GB]

Cheap Storage or Balanced System
• Low-cost storage (2 x 1.5 k$ servers): 10 K$/TB
  – 2 x (1 k$ system + 8 x 70 GB disks + 100 Mb Ethernet)
• Balanced server (9 k$ / 0.5 TB)
  – 2 x 800 MHz CPUs (2 k$)
  – 256 MB RAM (500 $)
  – 8 x 73 GB drives (4 k$)
  – Gbps Ethernet + switch (1.5 k$)
  – 18 k$/TB, 36 K$/RAIDed TB

2 x 800 MHz, 256 MB, 160 GB for 2 k$ (now); 300 GB by year end:
• 4 x 40 GB IDE drives (2 hot-pluggable): 1,100 $
• SCSI-IDE bridge: 200 $
• Box (500 MHz CPU, 256 MB SRAM, fan, power, Ethernet): 700 $
• Or 8 disks/box: 600 GB for ~3 K$ (or 300 GB RAIDed)

Hot Swap Drives for Archive or Data Interchange
• 25 MBps write (so can write N x 74 GB in 3 hours)
• 74 GB overnight ≈ N x 2 MB/second @ 19.95 $/night

Doing Studies of IO Bandwidth
• SCSI & IDE bandwidth
  – ~15-30 MBps sequential
  – SCSI 10k rpm: ~110 kaps @ 600 $
  – IDE 7.2k rpm: ~80 kaps @ 250 $
• Get 2 disks for the price of 1
  – More bandwidth for reads
  – RAID
  – 10 K$ RAIDed TB by 2001

What I have been Doing
• Peta Bumps
• 10k$ TB Scaleable Computing
• Sloan Digital Sky Survey

The Sloan Digital Sky Survey
A project run by the Astrophysical Research Consortium (ARC): The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study. Funded by the Sloan Foundation, NSF, DOE, and NASA.
Goal: to create a detailed multicolor map of the Northern Sky over 5 years, with a budget of approximately $80M.
Data size: 40 TB raw, 1 TB processed (see the cost sketch below).

Scientific Motivation
• Create the ultimate map of the Universe: the Cosmic Genome Project!
• Study the distribution of galaxies: What is the origin of fluctuations? What is the topology of the distribution?
• Measure the global properties of the Universe: How much dark matter is there?
• Local census of the galaxy population: How did galaxies form?
• Find the most distant objects in the Universe: What are the highest quasar redshifts?
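A rough cost sketch tying the earlier $/TB price points (10 K$/TB low-cost, 18 k$/TB balanced, 36 K$/RAIDed TB) to the SDSS volumes above. Applying those prices to 40 TB raw and 1 TB processed is my extrapolation, not a figure from the talk:

  /* Apply the talk's $/TB price points to the SDSS volumes
   * (40 TB raw, 1 TB processed).  The products are my extrapolation. */
  #include <stdio.h>

  int main(void) {
      double raw_TB = 40.0, processed_TB = 1.0;
      double price_per_TB[] = { 10e3, 18e3, 36e3 };
      const char *label[]   = { "10 k$/TB low-cost",
                                "18 k$/TB balanced",
                                "36 k$/TB RAIDed" };

      for (int i = 0; i < 3; i++)
          printf("%-20s : raw %4.0f k$, processed %3.0f k$\n",
                 label[i],
                 raw_TB       * price_per_TB[i] / 1e3,   /* e.g. 400 k$ */
                 processed_TB * price_per_TB[i] / 1e3);  /* e.g.  10 k$ */
      return 0;
  }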
First Light Images
• Telescope: first light May 9th, 1998; equatorial scans
• Camera: the first stripes – 5-color imaging of >100 square degrees, multiple scans across the same fields, photometric limits as expected

SDSS Data Flow
[Diagram: SDSS data flow]

SDSS Data Products
• Object catalog: parameters of >10^8 objects – 400 GB
• Redshift catalog: parameters of 10^6 objects – 1 GB
• Atlas images: 5-color cutouts of >10^8 objects – 1.5 TB
• Spectra: in a one-dimensional form – 60 GB
• Derived catalogs: clusters, QSO absorption lines – 20 GB
• 4x4-pixel all-sky map: heavily compressed – 60 GB
All raw data is saved in a tape vault at Fermilab.

Distributed Implementation
[Diagram: user interface and analysis engine talk to a master SX engine over an Objectivity federation; slave nodes each run Objectivity over RAID storage]

What We Have Been Doing
[Chart: color magnitude diff/ratio distribution – counts vs. magnitude diff/ratio for u-g, g-r, r-i, i-z]
• Helping move the data to SQL
  – Database design
  – Data loading
• Experimenting with queries on a 4 M object DB
  – 20 questions like "find gravitational lens candidates"
  – Queries use parallelism; most run in a few seconds (auto-parallel)
  – Some run in hours (neighbors within 1 arcsec; see the sketch at the end of this section)
  – EASY to ask questions
• Helping with an "outreach" website: SkyServer
• Personal goal: try datamining techniques to "re-discover" Astronomy

What I have been Doing
• Peta Bumps
• 10k$ TB Scaleable Computing
• Sloan Digital Sky Survey
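The "neighbors within 1 arcsec" query mentioned above reduces to many pairwise angular-separation tests. A minimal sketch of one such test in C, using a generic haversine great-circle formula; this is illustrative only, not the SX engine or SQL code the project actually runs:

  /* Angular separation between two sky positions (RA, Dec in degrees)
   * and a test against a 1-arcsecond neighbour threshold.
   * Generic spherical geometry, not the actual SDSS/SX query code. */
  #include <math.h>
  #include <stdio.h>

  #ifndef M_PI
  #define M_PI 3.14159265358979323846
  #endif
  #define DEG2RAD (M_PI / 180.0)

  /* Great-circle separation in arcseconds, via the haversine formula. */
  static double separation_arcsec(double ra1, double dec1,
                                  double ra2, double dec2) {
      double dra  = (ra2  - ra1)  * DEG2RAD;
      double ddec = (dec2 - dec1) * DEG2RAD;
      double a = sin(ddec / 2) * sin(ddec / 2) +
                 cos(dec1 * DEG2RAD) * cos(dec2 * DEG2RAD) *
                 sin(dra / 2) * sin(dra / 2);
      return 2.0 * asin(sqrt(a)) / DEG2RAD * 3600.0;
  }

  int main(void) {
      /* Two made-up positions about half an arcsecond apart in RA. */
      double sep = separation_arcsec(180.0, 30.0, 180.000150, 30.0);
      printf("separation = %.3f arcsec, neighbours: %s\n",
             sep, sep < 1.0 ? "yes" : "no");
      return 0;
  }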