Computer Technology Forecast
Jim Gray, Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray

Reality Check
• Good news
  – In the limit, processing, storage, and networking are free
  – Processing and networking are infinitely fast
• Bad news
  – Most of us live in the present.
  – People are getting more expensive: management/programming cost exceeds hardware cost.
  – The speed of light is not improving.
  – WAN prices have not changed much in the last 8 years.

Interesting Topics
• I'll talk about server-side hardware
• What about client hardware?
  – Displays, cameras, speech, …
• What about software?
  – Databases, data mining, PDB, OODB
  – Objects / class libraries …
  – Visualization
  – The Open Source movement

How Much Information Is There?
• Soon everything can be recorded and indexed
• Most data will never be seen by humans
• Precious resource: human attention. Auto-summarization and auto-search are the key technologies.
• [Chart: the scale from kilobytes to yottabytes: a book, a photo, a movie, all Library of Congress books (as words), all books as multimedia, and everything ever recorded. Source: www.lesk.com/mlesk/ksg97/ksg.html]
• Prefixes: kilo 10^3, mega 10^6, giga 10^9, tera 10^12, peta 10^15, exa 10^18, zetta 10^21, yotta 10^24 (small counterparts: milli, micro, nano, pico, femto, atto, zepto, yocto)

Moore's Law
• Performance/price doubles every 18 months
• 100x per decade
• Progress in the next 18 months = ALL previous progress
  – New storage = sum of all old storage (ever)
  – New processing = sum of all old processing
• E. coli doubles every 20 minutes!

Trends: ops/s/$ Had Three Growth Phases
• 1890-1945: mechanical and relay, 7-year doubling
• 1945-1985: tube, transistor, …, 2.3-year doubling
• 1985-2000: microprocessor, 1.0-year doubling
• [Chart: ops per second per $ (log scale, 1.E-06 to 1.E+09) vs. year, 1880-2000, showing doubling times of 7.5, 2.3, and 1.0 years]

What's a Balanced System?
• [Diagram: a balanced system built around the system bus and two PCI buses]

Storage Capacity Is Beating Moore's Law
• 5 k$/TB today (raw disk)
• Disk TB growth: 112%/year vs. Moore's Law at 58.7%/year
• [Chart (1998 Disk Trend, Jim Porter, http://www.disktrend.com/pdf/portrpkg.pdf): disk TB shipped per year, 1988-2000, heading toward an exabyte. Since 1993: TB growth 112.3%/year, price decline 50.7%/year, revenue growth 7.47%/year, vs. Moore's Law 58.7%/year]

Cheap Storage
• Disks are getting cheap: ~7 k$/TB (25 x 40 GB disks @ $230 each)
• [Charts: price ($) vs. disk capacity (GB) and raw k$/TB vs. disk unit size (GB), for IDE and SCSI; linear fits y = 15.895x + 13.446 and y = 5.7156x + 47.857]

Cheap Storage or Balanced System
• Low-cost storage (2 x 1.5 k$ servers): 7 k$/TB
  – 2 x (1 k$ system + 8 x 60 GB disks + 100 Mb Ethernet)
• Balanced server (7 k$ per 0.5 TB):
  – 2 x 800 MHz CPUs (2 k$)
  – 256 MB RAM ($400)
  – 8 x 60 GB drives (3 k$)
  – Gbps Ethernet + switch (1.5 k$)
  – 14 k$/TB; 28 k$ per RAIDed TB

The "Absurd" Disk
• [Diagram: a 1 TB disk with 100 MB/s bandwidth and ~200 accesses per second]
• 2.5 hr scan time (poor sequential access)
• 1 access per second per 5 GB (VERY cold data)
• It's a tape!

Hot-Swap Drives for Archive or Data Interchange
• 25 MBps write (so N x 60 GB can be written in 40 minutes)
• 60 GB overnight = ~N x 2 MB/second @ $19.95/night
• [Photos, labeled $17 and $260]
• 240 GB for 2 k$ (now), 300 GB by year end:
  – 4 x 60 GB IDE drives (2 hot-pluggable) ($1,100)
  – SCSI-IDE bridge ($200)
  – Box: 500 MHz CPU, 256 MB SRAM, fan, power, Ethernet ($700)
• Or 8 disks/box: 600 GB for ~3 k$ (or 300 GB RAIDed)

Hot-Swap Drives for Archive or Data Interchange
• 25 MBps write (so N x 74 GB can be written in 3 hours)
• 74 GB overnight = ~N x 2 MB/second @ $19.95/night
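Aside: the disk arithmetic on the slides above (the 1 TB scan and the 60 GB hot-swap write) is easy to reproduce. A minimal Python sketch follows; the capacities and rates are the slides' numbers, while the helper function and the assumed 8-hour overnight window are illustrative additions, not part of the talk.

```python
# Back-of-envelope disk arithmetic from the slides above.
# The capacities and rates (1 TB, 100 MB/s, 60 GB, 25 MB/s) come from the
# slides; the helper and the 8-hour "overnight" window are assumptions.

def transfer_seconds(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Seconds to read or write capacity_gb gigabytes at rate_mb_per_s MB/s."""
    return capacity_gb * 1000.0 / rate_mb_per_s   # decimal GB -> MB

if __name__ == "__main__":
    scan = transfer_seconds(1000, 100)            # the 1 TB "absurd" disk
    print(f"1 TB at 100 MB/s: {scan / 3600:.1f} hour scan")    # ~2.8 hours

    fill = transfer_seconds(60, 25)               # one 60 GB hot-swap drive
    print(f"60 GB at 25 MB/s: {fill / 60:.0f} minute write")   # ~40 minutes

    # Shipping a full drive overnight (assume ~8 hours in transit) delivers
    # roughly the ~2 MB/s per drive that the slide quotes.
    print(f"60 GB overnight: {60 * 1000 / (8 * 3600):.1f} MB/s per drive")
```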
It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a geo-plex
• Scrub it continuously (look for errors)
• On failure:
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• The two copies can be organized differently (e.g., one by time, one by space)

Disk vs. Tape
• Disk: 60 GB, 30 MBps, 5 ms seek time, 3 ms rotational latency, $7/GB for the drive plus $3/GB for controllers/cabinet, 4 TB/rack, 1-hour scan
• Tape: 40 GB, 10 MBps, 10 s pick time, 30-120 s seek time, $2/GB for media plus $8/GB for drive + library, 10 TB/rack, 1-week scan
• [Photo: tape silo. Guesstimates: CERN: 200 TB; 3480 tapes; 2 columns = 50 GB; rack = 1 TB = 20 drives]
• The price advantage of tape is narrowing, and the performance advantage of disk is growing
• At 10 k$/TB, disk is competitive with nearline tape

Trends: Gilder's Law: 3x Bandwidth/Year for 25 More Years
• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber (1 fiber = 25 Tbps)
• 1 Tbps = USA 1996 WAN bisection bandwidth
• Aggregate bandwidth doubles every 8 months!

Sense of Scale
• How fat is your pipe?
• The fattest pipe on the Microsoft campus is the WAN!
  – 300 MBps: OC48 = G2 (or memcpy())
  – 94 MBps: coast to coast
  – 90 MBps: PCI
  – 20 MBps: disk / ATM / OC3

[Map: the HSCC (High Speed Connectivity Consortium, DARPA) route: Arlington, VA (Information Sciences Institute) over the Qwest backbone through New York and San Francisco to the Pacific Northwest Gigapop, the University of Washington, and Microsoft in Redmond/Seattle, WA; 5626 km, 10 hops]

The Path DC -> SEA
C:\> tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops

hop  latency (3 probes)   address          location / role                        equipment
 0                                         Arlington, VA: ISI                     DELL 4400 Win2K WKS, Alteon GbE
 1   16  <10 <10 ms       140.173.170.65   Arlington, VA: ISI interface ISIe      Juniper M40, GbE
 2   <10 <10 <10 ms       205.171.40.61    Arlington, VA: Qwest DC edge           Cisco GSR, OC48
 3   <10 <10 <10 ms       205.171.24.85    Arlington, VA: Qwest DC core           Cisco GSR, OC48
 4   <10 <10  16 ms       205.171.5.233    New York, NY: Qwest NYC core           Cisco GSR, OC48
 5   62   63  62 ms       205.171.5.115    San Francisco, CA: Qwest SF core       Cisco GSR, OC48
 6   78   78  78 ms       205.171.5.108    Seattle, WA: Qwest Sea core            Cisco GSR, OC48
 7   78   78  94 ms       205.171.26.42    Seattle, WA: Qwest Sea edge            Juniper M40, OC48
 8   78   79  78 ms       208.46.239.90    Seattle, WA: PNW Gigapop               Juniper M40, OC48
 9   78   78  94 ms       198.48.91.30     Redmond, WA: Microsoft                 Cisco GSR, OC48
10   78   78  94 ms       131.107.151.194  Redmond, WA: Microsoft                 Compaq SP750 Win2K WKS, SysKonnect GbE

"PetaBumps"
• 751 Mbps for 300 seconds (~28 GB): single-thread, single-stream TCP/IP, desktop-to-desktop, out-of-the-box performance*
• 5626 km x 751 Mbps = ~4.2e15 bit-meters/second = ~4.2 peta-bmps
• Multi-stream is 952 Mbps, ~5.2 peta-bmps
• *4470-byte MTUs were enabled on all routers; 20 MB window size
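Aside: the peta-bmps figures above are just distance times throughput. A small sketch that reproduces them; the distance, rates, and 300-second run are the slide's numbers, while the function names are invented for illustration.

```python
# Reproduce the "PetaBumps" arithmetic above. The distance (5626 km), the
# single- and multi-stream rates (751 and 952 Mbps), and the 300-second run
# are the slide's numbers; the helper functions are illustrative only.

def bit_meters_per_second(distance_km: float, throughput_mbps: float) -> float:
    """Distance x throughput: the Internet2 Land Speed Record metric."""
    return (distance_km * 1e3) * (throughput_mbps * 1e6)

def gigabytes_moved(throughput_mbps: float, seconds: float) -> float:
    """Bytes delivered at a given rate, expressed in decimal gigabytes."""
    return throughput_mbps * 1e6 * seconds / 8 / 1e9

if __name__ == "__main__":
    print(f"single stream: {bit_meters_per_second(5626, 751):.1e} bit-m/s")   # ~4.2e15 (~4.2 peta-bmps)
    print(f"multi stream:  {bit_meters_per_second(5626, 952):.1e} bit-m/s")   # ~5.4e15 (the slide quotes ~5.2)
    print(f"moved in 300 s at 751 Mbps: {gigabytes_moved(751, 300):.0f} GB")  # ~28 GB
```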
The Promise of SAN/VIA: 10x in 2 Years (http://www.ViArch.org/)
• Yesterday:
  – 10 MBps (100 Mbps Ethernet)
  – ~20 MBps TCP/IP saturates 2 CPUs
  – round-trip latency ~250 µs
• Now:
  – Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
  – Fast user-level communication:
    TCP/IP at ~100 MBps with 10% CPU; round-trip latency is 15 µs; 1.6 Gbps demoed on a WAN
• [Chart: time (µs) to send 1 KB, split into sender CPU, receiver CPU, and transmit time, for 100 Mbps Ethernet vs. a Gbps SAN]

Pointers
• The single-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
• The multi-stream submission: http://research.Microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
• The code: http://research.Microsoft.com/~gray/papers/speedy.htm (speedy.h, speedy.c)
• A PowerPoint presentation about it: http://research.Microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt

Networking
• WANs are getting faster than LANs: OC192 (roughly 10 Gbps) is "standard"
• Link bandwidth improves 4x per 3 years
• Speed of light (60 ms round trip in the US)
• Software stacks have always been the problem:
  Time = sender CPU + receiver CPU + bytes/bandwidth
  The two CPU terms have been the problem.

Rules of Thumb in Data Engineering
• Moore's law: an address bit every 18 months.
• Storage grows 100x/decade (except 1000x last decade!)
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade, so we need parallelism.
• RAM:disk:tape price is 1:10:30, going to 1:10:10.
• Amdahl's speedup law: S/(S+P).
• Amdahl's IO law: a bit of IO per instruction per second (about a TBps per 10 teraops: 50,000 disks per 10-teraop system, roughly 100 M$).
• Amdahl's memory law: a byte of RAM per instruction per second, going to 10 (1 TB RAM per teraop: 1 TeraDollar).
• PetaOps anyone?
• Gilder's law: aggregate bandwidth doubles every 8 months.
• 5-minute rule: cache disk data that is reused within 5 minutes.
• Web rule: cache everything!
• http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc

Dealing With Terabytes (Petabytes) Requires Parallelism
• Parallelism: use many little devices in parallel.
• At 10 MB/s, one disk takes 1.2 days to scan 1 terabyte.
• 1,000-way parallel: a 100-second scan; use 100 processors and 1,000 disks.

Parallelism Must Be Automatic
• There are thousands of MPI programmers.
• There are hundreds of millions of people using parallel database search.
• Parallel programming is HARD!
• Find design patterns and automate them.
• Data search/mining has parallel design patterns.

Scalability: Up and Out
• "Scale up":
  – Use "big iron" (SMP)
  – Cluster into packs for availability
• "Scale out" with clones and partitions:
  – Use commodity servers
  – Add clones and partitions as needed
• Everyone scales out

What's the Brick?
• 1 M$/slice: IBM S390? Sun E10000?
• 100 k$/slice: HPUX/AIX/Solaris/IRIX/EMC
• 10 k$/slice: Utel / Wintel 4x
• 1 k$/slice: Beowulf / Wintel 1x

Terminology for Scaleability
• Farms of servers:
  – Clones: identical replicas (scaleability + availability)
  – Partitions (scaleability)
  – Packs: partition availability via fail-over
• GeoPlex for disaster tolerance
• [Diagram: a farm of clones (shared-nothing or shared-disk) and packed partitions (shared-nothing; active-active or active-passive)]
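Aside: the clone/partition/pack vocabulary above can be made concrete with a small routing sketch: a cloned service can be served by any live clone, while a partitioned service hashes the key to its partition and fails over within that partition's pack. This is only a schematic illustration; the class and function names are invented here and are not part of any product.

```python
# A schematic model of the terminology above: a farm of clones plus packed
# partitions, with the two routing rules the slides describe. Illustrative
# only; all names are invented for the sketch.
import random
from zlib import crc32

class Node:
    def __init__(self, name: str):
        self.name, self.up = name, True

class Pack:
    """The nodes that can serve one partition (availability via fail-over)."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
    def live_node(self) -> Node:
        for n in self.nodes:              # primary first, then fail over
            if n.up:
                return n
        raise RuntimeError("entire pack is down")

class Farm:
    def __init__(self, clones, packs):
        self.clones, self.packs = clones, packs
    def route_cloned(self) -> Node:
        """Spray: any live clone will do (read-mostly, replicated state)."""
        return random.choice([c for c in self.clones if c.up])
    def route_partitioned(self, key: str) -> Node:
        """Hash the key to its partition, then use a live member of its pack."""
        pack = self.packs[crc32(key.encode()) % len(self.packs)]
        return pack.live_node()

if __name__ == "__main__":
    farm = Farm(clones=[Node(f"web{i}") for i in range(4)],
                packs=[Pack([Node(f"db{i}a"), Node(f"db{i}b")]) for i in range(2)])
    farm.clones[0].up = False                         # a dead clone is simply skipped
    print(farm.route_cloned().name)                   # any live web clone
    print(farm.route_partitioned("mailbox:42").name)  # always that key's pack
```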
Unpredictable Growth
• The TerraServer story:
  – We expected 5 M hits per day
  – We got 50 M hits on day 1
  – We peak at 15-20 M hits/day on a "hot" day
  – Average 5 M hits/day after 1 year
• Most of us cannot predict demand
  – Must be able to deal with NO demand
  – Must be able to deal with HUGE demand

An Architecture for Internet Services?
• Need to be able to add capacity
  – New processing
  – New storage
  – New networking
• Need continuous service
  – Online change of all components (hardware and software)
  – Multiple service sites
  – Multiple network providers
• Need great development tools
  – Change the application several times per year.
  – Add new services several times per year.

Premise: Each Site Is a Farm
• Buy computing by the slice (brick): a rack of servers + disks.
• Grow by adding slices: spread data and computation to the new slices.
• Two styles:
  – Clones: anonymous servers
  – Parts+Packs: partitions fail over within a pack
• In both cases, a remote farm for disaster recovery.
• [Diagram: the Microsoft.Com site, FY98 (\\Tweeks\Statistics\LAN and Server Name Info\Cluster Process Flow\MidYear98a.vsd, 12/15/97). Building 11 staging, internal WWW, and log-processing servers (all accessible from corpnet); the MOSWest admin LAN, live SQL servers, SQL consolidators, reporting, and DMZ/IDC staging servers; FDDI rings MIS1-MIS4 and switched Ethernet segments carrying clone groups for www.microsoft.com, home.microsoft.com, register.microsoft.com, register.msn.com, premium.microsoft.com, search.microsoft.com, support.microsoft.com, msid.msn.com, activex.microsoft.com, cdm.microsoft.com, and FTP/HTTP download servers, plus their SQL servers; primary and secondary Gigaswitches and routers to the Internet over 13 DS3s (45 Mb/s each), 2 OC3s (100 Mb/s each), and 2 Ethernets (100 Mb/s each); European and Japan data centers. Typical slice: 4 x P5/P6, 256 MB-1 GB RAM, 12-180 GB disk, average cost $24K-$128K, with FY98 growth forecasts per group.]
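Aside: the TerraServer numbers above show why a farm has to grow by adding slices. A toy capacity estimate follows; the hit counts are from the slide, but the per-clone capacity, peak-to-average ratio, and spare count are assumptions invented for the sketch.

```python
# How many clone slices does a farm need? The 5 M and 50 M hits/day figures
# are the TerraServer numbers above; the per-clone capacity, burstiness
# factor, and spare count are assumptions made for this sketch.
from math import ceil

def clones_needed(hits_per_day: float,
                  peak_to_average: float = 4.0,          # assumed burstiness
                  hits_per_sec_per_clone: float = 200.0,  # assumed clone capacity
                  spares: int = 1) -> int:
    """Slices needed to carry the estimated peak, plus spare capacity."""
    peak_hits_per_sec = hits_per_day / 86_400 * peak_to_average
    return ceil(peak_hits_per_sec / hits_per_sec_per_clone) + spares

if __name__ == "__main__":
    print("planned load,  5 M hits/day:", clones_needed(5e6))    # a few slices
    print("day-1 actual, 50 M hits/day:", clones_needed(50e6))   # ~10x the slices
```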
Clones: Availability + Scalability
• Some applications are
  – Read-mostly
  – Low consistency requirements
  – Modest storage requirement (less than 1 TB)
• Examples:
  – HTML web servers (IP sprayer/sieve + replication)
  – LDAP servers (replication via gossip)
• Replicate the app at all nodes (clones)
• Spray requests across the nodes.
• Growth: add a clone.
• Fault tolerance: stop sending to a failed clone.

Two Clone Geometries
• Shared-nothing: exact replicas
• Shared-disk (state stored in the server)
• [Diagram: shared-nothing clones vs. shared-disk clones]

Facilities Clones Need
• Automatic replication
  – Applications (and system software)
  – Data
• Automatic request routing
  – Spray or sieve
• Management:
  – Who is up?
  – Update management & propagation
  – Application monitoring.
• Clones are very easy to manage:
  – Rule of thumb: 100s of clones per administrator

Partitions for Scalability
• Clones are not appropriate for some apps:
  – Stateful apps do not replicate well
  – High update rates do not replicate well
• Examples
  – Email / chat / …
  – Databases
• Partition state among servers
• Scalability (online):
  – Partition split/merge
  – Partitioning must be transparent to the client.

Partitioned/Clustered Apps
• Mail servers
  – Perfectly partitionable
• Business object servers
  – Partition by set of objects.
• Parallel databases
  – Transparent access to partitioned tables
  – Parallel query

Packs for Availability
• Each partition may fail (independently of the others)
• Partitions migrate to a new node via fail-over
  – Fail-over in seconds
• Pack: the nodes supporting a partition
  – VMS Cluster, Tandem process pairs, SP2 HACMP, Sysplex™, WinNT MSCS (Wolfpack)
• Cluster-in-a-box is now a commodity
• Partitions typically grow in packs.

What Parts+Packs Need
• Automatic partitioning (in DBMS, mail, files, …)
  – Location transparent
  – Partition split/merge
  – Grow without limits (100 x 10 TB)
• Simple fail-over model
  – Partition migration is transparent
  – MSCS-like model for services
• Application-centric request routing
• Management:
  – Who is up?
  – Automatic partition management (split/merge)
  – Application monitoring.

Partitions and Packs
• Packs for availability
• [Diagram: partitions vs. packed partitions]

GeoPlex: Farm Pairs
• Two farms
• Changes at one are sent to the other
• When one farm fails, the other provides service
• Masks:
  – Hardware/software faults
  – Operations tasks (reorganize, upgrade, move)
  – Environmental faults (power failure)

Services on Clones & Partitions
• The application provides a set of services
• If cloned:
  – Services run on a subset of the clones
• If partitioned:
  – Services run at each partition
• System load balancing routes each request to
  – Any clone, or
  – The correct partition,
  – and routes around failures.
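Aside: the geo-plex discipline described earlier (two online copies, continuous scrubbing, refresh the damaged copy from the safe one) fits in a few lines. A toy sketch follows, with invented names and an in-memory block store standing in for a real farm pair.

```python
# Toy illustration of the geo-plex scrubbing idea: keep two online copies,
# continuously compare block checksums, and refresh a damaged block from the
# surviving copy. The dict-of-blocks layout and all names are invented here.
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def scrub(copy_a: dict, copy_b: dict, expected: dict) -> list:
    """Check every block of both copies; repair from whichever copy is good."""
    repairs = []
    for block_id, want in expected.items():
        a_ok = checksum(copy_a[block_id]) == want
        b_ok = checksum(copy_b[block_id]) == want
        if a_ok and not b_ok:
            copy_b[block_id] = copy_a[block_id]
            repairs.append((block_id, "refreshed B from A"))
        elif b_ok and not a_ok:
            copy_a[block_id] = copy_b[block_id]
            repairs.append((block_id, "refreshed A from B"))
        elif not (a_ok or b_ok):
            repairs.append((block_id, "both copies bad: data loss"))
    return repairs

if __name__ == "__main__":
    blocks = {1: b"alpha", 2: b"beta"}
    expected = {k: checksum(v) for k, v in blocks.items()}
    site_a, site_b = dict(blocks), dict(blocks)
    site_b[2] = b"bit rot"                    # simulate a latent error at site B
    print(scrub(site_a, site_b, expected))    # [(2, 'refreshed B from A')]
```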
Cluster Scenarios: Three-Tier Systems
• A simple web site
• [Diagram: web clients -> load balance -> cloned web front ends (clones for availability) -> SQL database pack (packs for availability), with a web file store and SQL temp state]

Cluster Scale-Out Scenarios
• The FARM: clones and packs of partitions
• [Diagram: web clients -> load balance -> cloned front ends (firewall, sprayer, web server) -> packed SQL partitions 1-3 (database transparency) and cloned, packed file servers (web file stores A and B, with replication), plus SQL temp state]

Terminology (recap)
• Farms of servers:
  – Clones: identical replicas (scaleability + availability)
  – Partitions (scaleability)
  – Packs: partition availability via fail-over
• GeoPlex for disaster tolerance
• [Diagram: shared-nothing and shared-disk clones; active-active and active-passive packs]

What We Have Been Doing with SDSS
• Helping move the data to SQL
  – Database design
  – Data loading
• [Chart: color magnitude difference/ratio distribution: counts (log scale) vs. magnitude diff/ratio for the u-g, g-r, r-i, and i-z colors]
• Experimenting with queries on a 4 M object DB
  – 20 questions like "find gravitational lens candidates"
  – Queries use parallelism; most run in a few seconds (auto-parallel)
  – Some run in hours (neighbors within 1 arcsec)
  – EASY to ask questions.
• Helping with an "outreach" website: SkyServer
• Personal goal: try data-mining techniques to "re-discover" astronomy

References (.doc or .pdf)
• Technology forecast: http://research.microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
• Gbps experiments: http://research.microsoft.com/~gray/
• Disk experiments (10 k$/TB): http://research.microsoft.com/~gray/papers/Win2K_IO_MSTR_2000_55.doc
• Scaleability terminology: http://research.microsoft.com/~gray/papers/MS_TR_99_85_Scalability_Terminology.doc