How Computer Architecture Trends May Affect Future Distributed Systems

Mark D. Hill
Computer Sciences Department, University of Wisconsin--Madison
http://www.cs.wisc.edu/~markhill
PODC '00 Invited Talk
(C) 2000 Mark D. Hill, University of Wisconsin-Madison

Three Questions
• What is a System Area Network (SAN) and how will it affect clusters?
  – E.g., InfiniBand
• How fat will multiprocessor servers be and how do we build larger ones?
  – E.g., Wisconsin Multifacet's Multicast & Timestamp Snooping
• Future of multiprocessor servers & clusters?
  – A merging of both?

Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends

Technology Push: Moore's Law
• What do the following intervals have in common?
  – Prehistory to 2000
  – 2001 to 2002
• Answer: Equal progress in absolute processor speed (and more doublings in 2003-4, 2005-6, etc.)
  – Consider salary doubling
• Corollary: Cost halves every two years
  – Jim Gray: In a decade you can buy a computer for less than its sales tax today

Application Pull
• Should use computers in currently wasteful ways
  – Already computers in electric razors & greeting cards
• New business models
  – B2C, B2B, C2B, C2C
  – Mass customization
• More proactive (beyond interactive) [Tennenhouse]
  – Today: P2C, where P == Person & C == Computer
  – More C2P: mattress adjusts to save your back
  – More C2C: agents surf the web for the optimal deal
  – More sensors (physical/logical worlds coupled)
  – More hidden computers (cf. electric motors)
• Furthermore, I am wrong

The Internet Iceberg
• Internet Components
  – Clients -- mobile, wireless
  – "On Ramp" -- LANs/DSL/Cable Modems
  – WAN Backbone -- IPv6, massive BW
  – and ...
• SERVICES
  – Scale Storage
  – Scale Bandwidth
  – Scale Computation
  – High Availability

Outline
• Motivation
• System Area Networks
  – What is a SAN?
  – InfiniBand
  – Virtualizing I/O with Queue Pairs
  – Predictions
• Designing Multiprocessor Servers
• Server & Cluster Trends

Regarding Storage/Bandwidth
• Currently resides on the I/O bus (PCI)
  – HW & SW protocol stacks
  – Must add hosts to add storage/bandwidth
[Figure: processors and memory on a memory interconnect; a bridge connects the interconnect to an I/O bus with I/O slots 0 to n-1]

Want System Area Network (SAN)
• SAN vs. Local Area Network (LAN)
  – Higher bandwidth (10 Gbps)
  – Lower latency (few microseconds or less)
  – More limited size
  – Other (e.g., single administrative domain, short distance)
  – Examples: Tandem ServerNet & Myricom Myrinet
• Emerging Standard: InfiniBand
  – www.infinibandta.org w/ spec 1.0 Summer 2000
  – Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, & others
  – 2.5 Gbits/s times 1, 4, or 12 wires (arithmetic sketched below)

InfiniBand Model (from website)
[Figure: processors and memory on a memory interconnect attach through an HCA (host channel adapter) to a switched link fabric; a TCA (target channel adapter) attaches a target (disks); a router connects to other networks; other switches, hosts, targets, etc. attach to the same fabric]
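As a quick back-of-the-envelope check of the link widths quoted above, the sketch below multiplies the 2.5 Gbit/s per-wire rate by the 1x, 4x, and 12x widths. This is a minimal illustration only: it uses the raw signaling rate from the slide and ignores any line-coding or protocol overhead. Note that the 4x width is what yields the ~10 Gbps SAN bandwidth mentioned earlier.

```cpp
#include <cstdio>

// Raw link rates implied by "2.5 Gbit/s times 1, 4, or 12 wires".
// Line-coding and protocol overheads are deliberately ignored here.
int main() {
    const double gbits_per_wire = 2.5;
    for (int wires : {1, 4, 12}) {
        double gbits = gbits_per_wire * wires;
        std::printf("%2dx link: %4.1f Gbit/s raw (~%4.2f GB/s)\n",
                    wires, gbits, gbits / 8.0);
    }
    return 0;
}
```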
InfiniBand Advantages
• Storage/network made orthogonal to computation
• Reduces the "hardware" stack -- no I/O bridge
• Reduces the "software" stack; hardware support for
  – Connected Reliable
  – Connected Unreliable
  – Datagram Reliable
  – Datagram
  – Raw Datagram
• Can eliminate the system call for SAN use (next slide)

Virtualizing InfiniBand
• I/O traditionally virtualized with system calls
  – System enforces isolation
  – System permits authorized sharing
• Memory virtualized
  – System trap/call for setup
  – Virtual memory hardware for common-case translation
• InfiniBand exploits "queue pairs" (QPs) in memory
  – Cf. Intel Virtual Interface Architecture (VIA) [IEEE Micro, Mar/Apr '98]
  – Users issue sends, receives, & remote DMA reads/writes

Queue Pair
• QP setup system call
  – Connect with process
  – Connect with remote QP (not shown here)
• QP placed in "pinned" virtual memory
• User directly accesses QP
  – E.g., sends, receives & remote DMA reads/writes
[Figure: a queue pair in pinned main memory holding send and receive entries (send1, send2, receive1, receive2) and DMA operations (dma-R3, dma-W4); the processor posts work to the QP and the HCA services it directly]

InfiniBand, cont.
• Roadmap
  – NGIO/FIO merger in '99
  – Spec in '00
  – Products in '03-'10
• My Assessment
  – PCI needs a successor
  – InfiniBand has the necessary features (but also many others)
  – InfiniBand has considerable industry buy-in (but it is recent)
  – Gigabit Ethernet will be the only competitor
    • Good name with backing from Cisco et al.
    • But TCP/IP is a killer
  – InfiniBand for storage will be key

InfiniBand Research Issues
• Software Wide Open
  – Industry will do local optimization (e.g., still have the device driver virtualized with system calls)
  – But what is the "right" way to do software?
  – Is there a theoretical model for this software?
• Other SAN Issues
  – A theoretical model of a service provider's site?
  – How to trade performance and availability?
  – Utility of broadcast or multicast support?
  – Obtaining quasi-real-time performance?

Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
  – How Fat?
  – Coherence for Servers
  – E.g., Multicast Snooping
  – E.g., Timestamp Snooping
• Server & Cluster Trends

How Fat Should Servers Be?
• Use
  – PCs -- cheap but small
  – Workgroup servers -- medium cost; medium size
  – Large servers -- premium cost & size
• One answer: "yes"
  – PCs w/ "soft" state
  – Servers running databases for "hard" state

How Do We Build the Big Servers?
• (Industry knows how to build the small ones)
• A key problem is the memory system
  – Memory Wall: e.g., a 100 ns memory access = 400 instruction opportunities for a 4-way, 1 GHz processor
• Use per-processor caches to reduce
  – Effective latency
  – Effective bandwidth used
• But the cache coherence problem ...

Coherence
[Figure: processors P0 through Pn-1, each with a private cache, over an interconnection network and memory; m[100] is initially 4, P0 writes m[100] <- 5, while P1 ... Pn-1 read m[100] and may see the stale value 4 from their caches]
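To make the stale-value problem in the figure concrete, here is a minimal sketch of an MSI-style invalidate protocol over a totally ordered broadcast "bus", in the spirit of the snooping schemes discussed next. It is illustrative only and not any machine's actual protocol: every cache observes every transaction in bus order, a remote GETX invalidates local copies, and a dirty copy is written back when another processor misses.

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Minimal MSI-style invalidate sketch over a totally ordered broadcast bus.
// Names and structure are illustrative; this is not any machine's protocol.
enum class State { Invalid, Shared, Modified };
struct Txn { int requester; bool is_getx; int addr; };  // GETS (read) or GETX (write)

struct Cache {
    int id;
    std::unordered_map<int, State> state;   // per-block coherence state
    std::unordered_map<int, int>  data;     // cached values

    // Every cache snoops every bus transaction, in the single bus order.
    void snoop(const Txn& t, int& memory_word) {
        if (t.requester == id) return;                    // our own request
        auto it = state.find(t.addr);
        if (it == state.end() || it->second == State::Invalid) return;
        if (it->second == State::Modified)
            memory_word = data[t.addr];                   // write back dirty data
        it->second = t.is_getx ? State::Invalid           // remote write: invalidate
                               : State::Shared;           // remote read: downgrade
    }
    int read(int addr, std::vector<Cache>& all, int& memory_word) {
        if (state[addr] == State::Invalid) {              // miss: broadcast GETS
            for (auto& c : all) c.snoop({id, false, addr}, memory_word);
            data[addr] = memory_word;
            state[addr] = State::Shared;
        }
        return data[addr];
    }
    void write(int addr, int value, std::vector<Cache>& all, int& memory_word) {
        if (state[addr] != State::Modified) {             // upgrade: broadcast GETX
            for (auto& c : all) c.snoop({id, true, addr}, memory_word);
            state[addr] = State::Modified;
        }
        data[addr] = value;
    }
};

int main() {
    int memory_100 = 4;   // a single memory word stands in for m[100], initially 4
    std::vector<Cache> caches = {Cache{0}, Cache{1}, Cache{2}};
    std::printf("P1 reads m[100]: %d\n", caches[1].read(100, caches, memory_100));
    caches[0].write(100, 5, caches, memory_100);          // P0's GETX invalidates P1's copy
    std::printf("P1 reads m[100]: %d\n", caches[1].read(100, caches, memory_100));
    return 0;
}
```

With this in place, P1's second read of m[100] returns 5 rather than the stale 4: P0's broadcast GETX invalidated P1's copy, and P0 supplies the dirty data when P1 re-fetches the block.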
Broadcast Snooping
[Figure: P1 and P2 each issue a GETX; the ordered address network broadcasts every transaction, in a single total order, to memory and to all processors P0, P1, and P2; data moves on a separate data network]

Broadcast Snooping
• Symmetric Multiprocessor (SMP)
  – Most commercially successful parallel computer architecture
  – Performs well by finding data directly
  – Scales poorly
• Improvements, e.g., Sun E10000
  – Split address & data transactions
  – Split address & data networks (e.g., bus & crossbar)
  – Multiple address buses (e.g., four, multiplexed by address)
  – Address bus is a broadcast tree (not shared wires)
• But ...
  – Broadcast all address transactions (expensive)
  – All processors must snoop all transactions

Directories
[Figure: P1 and P2 send their GETX requests over the address network to the directory/memory, which forwards each request only to the relevant processors; data moves on a separate data network]

Directories
• Directory-Based Cache Coherence
  – E.g., SGI/Cray Origin2000
  – Allows an arbitrary point-to-point interconnection network
  – Scales up well
• But
  – Cache-to-cache transfers are common in demanding apps (55-62% sharing misses for OLTP [Barroso ISCA '98])
  – Many applications can't use 100s of processors
  – Must also "scale down" well

Wisconsin Multifacet: Big Picture
• Build servers for the Internet economy
  – Moderate multiprocessor sizes: 2-8 then 16-64, but not 1K
  – Optimize for these workloads (e.g., cache-to-cache transfers)
• Key Tool: Multiprocessor Prediction & Speculation
  – Make a guess ... verify it later
  – Uniprocessor predecessors: branch & set predictors
  – Recent multiprocessor work: [Mukherjee/Hill ISCA98], [Kaxiras/Goodman HPCA99] & [Lai/Falsafi ISCA99]
  – Multicast Snooping
  – Timestamp Snooping

Comparison of Coherence Methods

  Coherence Attribute             Snooping   Directories   Multicast Snooping
  Find previous owner directly?   Yes        Sometimes     Usually (good)
  Always broadcast?               Yes        No            No (good)
  Ordering w/o acks?              Yes        No            Yes (good)
  Stateless at memory?            Yes        No            No, but simpler
  Ordered network?                Yes        No            Yes, a challenge

  Use prediction to improve on both?

Multicast Snooping
• On a cache miss
  – Predict a "multicast mask" (e.g., a bit vector of processors)
  – Issue the transaction on the multicast address network
• Networks
  – Address network that totally orders address multicasts
  – Separate point-to-point data network
• Processors snoop all incoming transactions
  – If it's your own, it "occurs" now
  – If another's, then invalidate and/or respond
• Simplified directory (at memory)
  – Purpose: allows masks to be wrong (explained later)

Predicting Masks
[Figure: a mask predictor takes a block address (plus feedback) and produces a predicted mask]
• Performed at the Requesting Processor
  – Include owner (GETS/GETX) & all sharers (GETX only)
  – Exclude most other processors
• Techniques
  – Many straightforward cases (e.g., stack, code, space-sharing)
  – Many options (network load, PC, software, local/global)
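The slides leave the predictor design open, so the following is only an illustrative mask predictor, not one of the ISCA '99 designs; the table organization and training policy are assumptions. It keeps, per block, the last observed owner plus a "sticky" union of observed sharers, so a GETS mask includes the predicted owner while a GETX mask additionally includes the predicted sharers. Because the simplified directory at memory allows masks to be wrong, a misprediction here costs performance, not correctness.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Illustrative multicast-mask predictor in the spirit of the "Predicting
// Masks" slide: given a block address, predict a bit vector of processors
// (owner + likely sharers) to multicast to. The real ISCA '99 predictors
// differ; the names and training policy below are assumptions.
constexpr int kProcs = 32;
using Mask = std::bitset<kProcs>;

class MaskPredictor {
public:
    explicit MaskPredictor(int self) : self_(self) {}

    // Predict a mask for a coherence request: always include ourselves, and
    // union in the previously observed owner (GETS/GETX) and sharers (GETX only).
    Mask predict(uint64_t block_addr, bool is_getx) const {
        Mask m;
        m.set(self_);
        auto it = table_.find(index(block_addr));
        if (it != table_.end())
            m |= is_getx ? it->second.sharers : it->second.owner_only;
        return m;
    }

    // Feedback after the transaction completes: remember who actually owned
    // or shared the block so later predictions include them ("sticky" union).
    void train(uint64_t block_addr, int owner, const Mask& actual_sharers) {
        Entry& e = table_[index(block_addr)];
        e.owner_only.reset();
        e.owner_only.set(owner);
        e.sharers |= actual_sharers;
        e.sharers.set(owner);
    }

private:
    struct Entry { Mask owner_only, sharers; };
    static uint64_t index(uint64_t a) { return a >> 6; }  // 64-byte blocks
    int self_;
    std::unordered_map<uint64_t, Entry> table_;
};

int main() {
    MaskPredictor pred(/*self=*/0);
    Mask sharers; sharers.set(3); sharers.set(7);
    pred.train(0x1000, /*owner=*/5, sharers);
    std::printf("GETS mask: %s\n", pred.predict(0x1000, false).to_string().c_str());
    std::printf("GETX mask: %s\n", pred.predict(0x1000, true).to_string().c_str());
    return 0;
}
```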
Implementing an Ordered Multicast Network
• Address Network
  – Must create the illusion of a total order of multicasts
  – May deliver a multicast to its destinations at different times
• Wish List
  – High throughput for multicasts
  – No centralized bottlenecks
  – Low latency and cost (~ pipelined broadcast tree)
  – ...
• Sample Solutions
  – Isotach networks [Reynolds et al., IEEE TPDS 4/97]
  – Indirect fat tree [ISCA '99]
  – Direct torus

Indirect Fat Tree [ISCA '99]
[Figure: an indirect fat-tree address network connecting P$ (processor/cache) and DM (directory/memory) nodes through stages of switches]

Indirect Fat Tree, cont.
• Basic Idea
  – Processors send transactions up to the roots
  – Roots send transactions down with logical timestamps
  – Switches stall transactions to keep them in order
  – Null transactions are sent to avoid deadlock
• Assessment
  – Viable & high cross-section bandwidth
  – Many "backplane" ASICs mean higher cost
  – Often stalls transactions
• Want
  – Lower cost of direct connections
  – Always deliver transactions as soon as possible (ASAP)
  – Sacrifice some cross-section bandwidth

Direct 2-D Torus (work in progress)
• Features
  – Each processor is a switch
  – Switches are directly connected
  – E.g., network of Compaq 21364
[Figure: a 2-D torus of 16 nodes, numbered 0-15; each node is a processor that also acts as a switch]
• Network order?
  – Broadcasts unordered
  – Snooping needs a total order
• Solution
  – Create order with logical timestamps instead of network delivery order
  – Called Timestamp Snooping [ASPLOS '00]

Timestamp Snooping
• Timestamp Snooping
  – Snooping with order determined by logical timestamps
  – Broadcast (not multicast) in ASPLOS '00
• Basic Idea
  – Assign a timestamp to each coherence transaction at the sender
  – Broadcast transactions over the unordered network ASAP
  – Transactions carry their timestamps (2 bits)
  – Processors process transactions in timestamp order

Timestamp Snooping Issues
• More address bandwidth
  – For 16 processors, a 4-ary butterfly, 64-byte blocks
  – Directory: 3*8 + 3*72 + more = 240 + more
  – Timestamp snooping: 21*8 + 3*72 = 384 (< 60% more)
• Network must guarantee timestamps
  – Assert that future transactions will have greater timestamps (so processors can process older transactions)
  – Isotach [Reynolds, IEEE TPDS 4/97] does this more aggressively
• Other
  – Priority queue at each processor to order transactions (sketched below)
  – Flow control and buffering issues

Initial Multifacet Results
• Multicast Snooping [ISCA '99]
  – Ordered multicast of coherence transactions
  – Find data directly from memory or caches
  – Reduce bandwidth to permit some scaling
  – 32-processor results show 2-6 destinations per multicast
• Timestamp Snooping [ASPLOS '00]
  – Broadcast snooping with "order" determined by logical timestamps carried by coherence transactions
  – No bus: allows arbitrary memory interconnects
  – No directory or directory indirection
  – 16-processor results show 25% faster for 25% more traffic

Selected Issues
• Multicast Snooping
  – What program property are mask predictors exploiting?
  – Why is there no good model of locality or the "90-10" rule in general?
  – How does one build multicast networks?
  – What about fault tolerance?
• Timestamp Snooping
  – What is an optimal network topology?
  – What about buffering, deadlock, etc.?
  – Implementing switches and priority queues?
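As a concrete sketch of the per-processor ordering step behind timestamp snooping, the code below buffers arriving coherence transactions in a priority queue and retires them in logical-timestamp order, but only once every incoming link has promised (via a monotone per-link timestamp bound) not to deliver anything older. The field names and the promise mechanism are assumptions for illustration, not the ASPLOS '00 design, which carries only small 2-bit timestamps and handles real flow control.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>

// Illustrative timestamp-ordering logic at one processor: deliver ASAP,
// process in logical-timestamp order once the network guarantees that no
// older transaction can still arrive on any incoming link.
struct Txn {
    uint64_t timestamp;   // logical timestamp assigned at the sender
    int      sender;
    uint64_t block_addr;
    bool operator>(const Txn& o) const {   // min-heap ordering, ties broken by sender
        return timestamp > o.timestamp ||
               (timestamp == o.timestamp && sender > o.sender);
    }
};

class TimestampOrderer {
public:
    explicit TimestampOrderer(int num_links) : link_bound_(num_links, 0) {}

    // Transaction delivered ASAP; assumes per-link deliveries carry
    // nondecreasing timestamps, so arrival also raises that link's bound.
    void arrive(int link, const Txn& t) {
        link_bound_[link] = std::max(link_bound_[link], t.timestamp);
        pending_.push(t);
    }
    // Link asserts: every future transaction it delivers has timestamp > ts.
    void promise(int link, uint64_t ts) {
        link_bound_[link] = std::max(link_bound_[link], ts);
    }
    // Process (here: just print) every buffered transaction whose timestamp
    // is no greater than the minimum bound promised across all links.
    void drain() {
        uint64_t safe = *std::min_element(link_bound_.begin(), link_bound_.end());
        while (!pending_.empty() && pending_.top().timestamp <= safe) {
            const Txn& t = pending_.top();
            std::printf("process ts=%llu from P%d addr=0x%llx\n",
                        (unsigned long long)t.timestamp, t.sender,
                        (unsigned long long)t.block_addr);
            pending_.pop();
        }
    }

private:
    std::vector<uint64_t> link_bound_;   // per-link timestamp lower bound
    std::priority_queue<Txn, std::vector<Txn>, std::greater<Txn>> pending_;
};

int main() {
    TimestampOrderer node(/*num_links=*/2);
    node.arrive(0, {7, /*sender=*/1, 0x2040});   // arrives early, must wait
    node.arrive(1, {5, /*sender=*/2, 0x1000});
    node.drain();                                // processes ts=5 only (link 1 might still send ts=6)
    node.promise(0, 10); node.promise(1, 10);    // both links promise nothing older than ts=10
    node.drain();                                // now ts=7 is safe to process
    return 0;
}
```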
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends
  – Out-of-box and highly-available servers
  – High-performance communication for clusters

Multiprocessor Servers
• High-performance communication "within box"
  – SMPs (e.g., Intel PentiumPro Quads)
  – Directory-based (SGI Origin2000)
• Trend toward hierarchical "out of box" solutions
  – Build bigger servers from smaller ones
  – Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)
[Figure: four SMP nodes connected together to form a larger server]

Multiprocessor Servers, cont.
• Traditionally had poor error isolation
  – Double-bit ECC error crashes everything
  – Kernel error crashes everything
  – Poor match for highly available Internet infrastructure
• Improve error isolation
  – IBM 370 "virtual machines"
  – Stanford HIVE "cells"

Clusters
• Traditionally
  – Good error isolation
  – Poor communication performance (especially latency)
  – LANs are not optimized for clusters
• Enter Early SANs
  – Berkeley NOW w/ Myricom Myrinet
  – IBM SP w/ proprietary network
• What now with InfiniBand SAN (or alternatives)?

A Prediction
• Blurring of cluster & server boundaries
• Clusters
  – High communication performance
• Servers
  – Better error isolation
  – Multi-box solutions
• Use same hardware & configure in the field
• Issues
  – How do we model these hybrids?
  – Should PODC & SPAA also converge?

Three Questions
• What is a System Area Network (SAN) and how will it affect clusters?
  – E.g., InfiniBand
  – Make computation, storage, & network orthogonal
• How fat will multiprocessor servers be and how do we build larger ones?
  – Varying sizes for soft & hard state
  – E.g., Multicast Snooping & Timestamp Snooping
• Future of multiprocessor servers & clusters?
  – Servers will support higher availability & extra-box solutions
  – Clusters will get server communication performance