How Computer Architecture Trends
May Affect Future Distributed Systems
Mark D. Hill
Computer Sciences Department
University of Wisconsin--Madison
http://www.cs.wisc.edu/~markhill
PODC ‘00 Invited Talk
(C) 2000 Mark D. Hill
University of Wisconsin-Madison
Three Questions
• What is a System Area Network (SAN)
and how will it affect clusters?
– E.g., InfiniBand
• How fat will multiprocessor servers be
and how do we build larger ones?
– E.g., Wisconsin Multifacet’s Multicast & Timestamp Snooping
• Future of multiprocessor servers & clusters?
– A merging of both?
PODC00: Computer Architecture Trends
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends
Technology Push: Moore’s Law
• What do following intervals have in common?
– Prehistory to 2000
– 2001 to 2002
• Answer: Equal progress in absolute processor speed
(and more doubling 2003-4, 2005-6, etc.)
– Consider salary doubling
• Corollary: Cost halves every two years
– Jim Gray: In a decade you can buy a computer
for less than its sales tax today
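A back-of-the-envelope check of the corollary (the $2000 machine price and 5% sales-tax rate are illustrative assumptions, not figures from the talk):

```python
# If absolute speed doubles every ~2 years, the cost of a fixed amount
# of performance halves every ~2 years. The $2000 price and 5% tax
# rate below are illustrative assumptions.

def cost_after(years, cost_today=1.0, halving_period=2.0):
    """Cost of a fixed amount of performance after `years` of halving."""
    return cost_today * 0.5 ** (years / halving_period)

price_today = 2000.0
sales_tax_today = 0.05 * price_today            # $100 at an assumed 5%

price_in_a_decade = cost_after(10, cost_today=price_today)
assert price_in_a_decade == 2000.0 / 32         # five halvings -> $62.50
assert price_in_a_decade < sales_tax_today      # Gray's quip holds
```

Five halvings in a decade is a 32x cost reduction, which drops the price below a typical sales tax on today's machine.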
Application Pull
• Should use computers in currently wasteful ways
– Already computers in electric razors & greeting cards
• New business models
– B2C, B2B, C2B, C2C
– Mass customization
• More proactive (beyond interactive) [Tennenhouse]
– Today: P2C where P==Person & C==Computer
– More C2P: mattress adjusts to save your back
– More C2C: agents surf the web for optimal deal
– More sensors (physical/logical worlds coupled)
– More hidden computers (c.f., electric motors)
• Furthermore, I am wrong
The Internet Iceberg
• Internet Components
– Clients -- mobile, wireless
– “On Ramp” -- LANs/DSL/Cable Modems
– WAN Backbone -- IPv6, massive BW
– and ...
• SERVICES
– Scale Storage
– Scale Bandwidth
– Scale Computation
– High Availability
Outline
• Motivation
• System Area Networks
– What is a SAN?
– InfiniBand
– Virtualizing I/O with Queue Pairs
– Predictions
• Designing Multiprocessor Servers
• Server & Cluster Trends
Regarding Storage/Bandwidth
• Currently resides on I/O Bus (PCI)
– HW & SW protocol stacks
– Must add hosts to add storage/bandwidth
[Diagram: two processors on a memory interconnect with memory; a bridge connects the memory interconnect to the I/O bus, which holds I/O slots 0 through n-1]
Want System Area Network (SAN)
• SAN vs. Local Area Network (LAN)
– Higher bandwidth (10 Gbps)
– Lower latency (few microseconds or less)
– More limited size
– Other (e.g., single administrative domain, short distance)
– Examples: Tandem ServerNet & Myricom Myrinet
• Emerging Standard: InfiniBand
– www.infinibandta.org w/ spec 1.0 Summer 2000
– Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, & others
– 2.5 Gbits/s times 1, 4, or 12 wires
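The stated link widths work out to the following raw rates (raw signaling rates only; encoding overhead, which the slide does not discuss, is ignored):

```python
# Raw link rates implied by the slide's figures: 2.5 Gbit/s per wire
# times 1x, 4x, or 12x link widths.

PER_WIRE_GBPS = 2.5
rates = {width: PER_WIRE_GBPS * width for width in (1, 4, 12)}
assert rates == {1: 2.5, 4: 10.0, 12: 30.0}
```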
InfiniBand Model (from website)
[Diagram: processors and memory on a memory interconnect; an HCA (host channel adapter) links the host over switches to targets (disks) behind a TCA (target channel adapter); a router connects to other networks, switches, hosts, and targets]
InfiniBand Advantages
• Storage/Network made orthogonal from Computation
• Reduce “hardware” stack -- no i/o bridge
• Reduce “software” stack; hardware support for
– Connected Reliable
– Connected Unreliable
– Datagram
– Reliable Datagram
– Raw Datagram
• Can eliminate system call for SAN use (next slide)
Virtualizing InfiniBand
• I/O traditionally virtualized with system call
– System enforces isolation
– System permits authorized sharing
• Memory virtualized
– System trap/call for setup
– Virtual memory hardware for common-case translation
• InfiniBand exploits “queue pairs” (QPs) in memory
– C.f., Intel Virtual Interface Architecture (VIA)
[IEEE Micro, Mar/Apr ‘98]
– Users issue sends, receives, & remote DMA reads/writes
Queue Pair
• QP setup system call
– Connect with process
– Connect with remote QP (not shown here)
• QP placed in “pinned” virtual memory
• Users directly access the QP
– E.g., sends, receives & remote DMA reads/writes
[Diagram: send and receive queues (send1/send2, receive1/receive2) live in main memory; the processor posts entries while the HCA services them via DMA reads and writes]
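The queue-pair flow above can be caricatured in a few lines: one setup step stands in for the system call, after which the user posts work to plain in-memory queues and a simulated HCA drains them with no per-operation trap. All names here are illustrative; this is not the real InfiniBand verbs API.

```python
from collections import deque

class QueuePair:
    """Toy model of an InfiniBand-style queue pair (illustrative only)."""
    def __init__(self):
        self.send_q = deque()    # work requests posted by the user
        self.recv_q = deque()    # buffers posted for incoming messages

def qp_setup():
    # Stands in for the one-time system call: allocate and "pin" the QP.
    return QueuePair()

def hca_poll(src: QueuePair, dst: QueuePair):
    # Stands in for the HCA: move posted sends into the peer's posted
    # receive buffers, entirely outside the "kernel".
    completions = []
    while src.send_q and dst.recv_q:
        msg = src.send_q.popleft()
        buf = dst.recv_q.popleft()
        buf.append(msg)
        completions.append(msg)
    return completions

a, b = qp_setup(), qp_setup()   # one system call each, then no more
b.recv_q.append([])             # user posts a receive buffer
a.send_q.append("hello")        # user posts a send -- no trap
assert hca_poll(a, b) == ["hello"]
```

The point of the design is visible even in the toy: after setup, both sides touch only ordinary memory, so the per-operation system call disappears from the data path.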
InfiniBand, cont.
• Roadmap
– NGIO/FIO merger in ‘99
– Spec in ‘00
– Products in ‘03-’10
• My Assessment
– PCI needs successor
– InfiniBand has the necessary features (but also many others)
– InfiniBand has considerable industry buy-in (but it is recent)
– Gigabit Ethernet will be only competitor
• Good name with backing from Cisco et al.
• But TCP/IP is a killer
– InfiniBand for storage will be key
InfiniBand Research Issues
• Software Wide Open
– Industry will do local optimization
(e.g., still have device driver virtualized with system calls)
– But what is the “right” way to do software?
– Is there a theoretical model for this software?
• Other SAN Issues
– A theoretical model of a service-provider’s site?
– How to trade performance and availability?
– Utility of broadcast or multicast support?
– Obtaining quasi-real-time performance?
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
– How Fat?
– Coherence for Servers
– E.g., Multicast Snooping
– E.g., Timestamp Snooping
• Server & Cluster Trends
How Fat Should Servers Be?
• Use
– PCs -- cheap but small
– Workgroup servers -- medium cost; medium size
– Large servers -- premium cost & size
• One answer: “yes”
[Diagram: a spectrum from PCs holding “soft” state to servers running databases for “hard” state]
How Do We Build the Big Servers?
• (Industry knows how to build the small ones)
• A key problem is the memory system
– Memory Wall: E.g., 100ns memory access =
400 instruction opportunities for 4-way 1GHz processor
• Use per-processor caches to reduce
– Effective Latency
– Effective Bandwidth Used
• But cache coherence problem ...
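The memory-wall arithmetic on this slide, spelled out:

```python
# A 100 ns memory access on a 4-wide 1 GHz processor: at 1 GHz each
# cycle is 1 ns, so the miss costs 100 cycles, i.e. 400 instruction
# issue opportunities.

freq_ghz = 1            # cycles per nanosecond
width = 4               # instructions issued per cycle
miss_ns = 100

cycles_lost = miss_ns * freq_ghz    # 100 cycles
slots_lost = cycles_lost * width    # 400 instruction opportunities
assert slots_lost == 400
```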
Coherence 101
[Diagram: processors P0 through Pn-1, each with a private cache, share memory through an interconnection network; memory holds 4 at address 100. P0 reads “4” (r0<-m[100]) and then writes m[100]<-5 into its own cache, but P1’s cache still holds 100: 4, so later reads (r1, r2, r3) may return a stale “4” or “?”]
Broadcast Snooping
[Diagram: an ordered address network delivers P1:GETX and then P2:GETX to memory and to processors P0, P1, and P2 in the same total order; data blocks move on a separate data network]
Broadcast Snooping
• Symmetric Multiprocessor (SMP)
– Most commercially-successful parallel computer architecture
– Performs well by finding data directly
– Scales poorly
• Improvements, e.g., Sun E10000
– Split address & data transactions
– Split address & data network (e.g., bus & crossbar)
– Multiple address buses (e.g., four multiplexed by address)
– Address bus is broadcast tree (not shared wires)
• But…
– Broadcast all address transactions (expensive)
– All processors must snoop all transactions
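A minimal sketch of the snooping idea, assuming a totally ordered broadcast and a toy two-state (modified/shared) protocol rather than any real product's protocol:

```python
# The ordered address network is modeled as a single list of
# transactions that every cache observes in the same order -- the
# "broadcast all, snoop all" cost the slide calls out. Each GETX
# invalidates every other copy.

class Cache:
    def __init__(self, pid):
        self.pid = pid
        self.lines = {}                  # addr -> "M" (modified) or "S" (shared)

    def snoop(self, requester, op, addr):
        if requester == self.pid:
            self.lines[addr] = "M" if op == "GETX" else "S"
        elif addr in self.lines:
            if op == "GETX":
                del self.lines[addr]     # invalidate our copy
            else:                        # GETS: downgrade to shared
                self.lines[addr] = "S"

caches = [Cache(p) for p in range(3)]
# Every transaction is broadcast to ALL caches in one total order.
for requester, op, addr in [(1, "GETX", 100), (2, "GETX", 100)]:
    for c in caches:
        c.snoop(requester, op, addr)

assert 100 not in caches[1].lines        # P1's copy invalidated by P2
assert caches[2].lines[100] == "M"       # P2 ends up as the sole owner
```

Because every cache sees the same total order, no acknowledgments are needed -- but every cache must snoop every transaction, which is exactly the scaling problem the slide notes.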
Directories
[Diagram: P1:GETX and P2:GETX travel point-to-point over the address network to the directory/memory, which forwards (“send”) each request to the appropriate processors; data blocks move on a separate data network]
Directories
• Directory Based Cache Coherence
– E.g., SGI/Cray Origin2000
– Allows arbitrary point-to-point interconnection network
– Scales up well
• But
– Cache-to-cache transfers common in demanding apps
(55-62% sharing misses for OLTP [Barroso ISCA ‘98])
– Many applications can’t use 100s of processors
– Must also “scale down” well
Wisconsin Multifacet: Big Picture
• Build Servers For Internet economy
– Moderate multiprocessor sizes: 2-8 then 16-64, but not 1K
– Optimize for these workloads (e.g. cache-to-cache transfers)
• Key Tool: Multiprocessor Prediction & Speculation
– Make a guess... verify it later
– Uniprocessor predecessors: branch & set predictors
– Recent multiprocessor work: [Mukherjee/Hill ISCA98],
[Kaxiras/Goodman HPCA99] & [Lai/Falsafi ISCA99]
– Multicast Snooping
– Timestamp Snooping
Comparison of Coherence Methods
Coherence Attribute             Snooping           Multicast Snooping   Directories
Find previous owner directly?   Yes                Usually (good)       Sometimes
Always broadcast?               Yes                No (good)            No
Ordering w/o acks?              Yes                Yes (good)           No
Stateless at memory?            Yes                No, but simpler      No
Ordered network?                Yes, a challenge   Yes, a challenge     No
Use prediction to improve on both?
Multicast Snooping
• On cache miss
– Predict "multicast mask" (e.g., bit vector of processors)
– Issue transaction on multicast address network
• Networks
– Address network that totally-orders address multicasts
– Separate point-to-point data network
• Processors snoop all incoming transactions
– If it's your own, it "occurs" now
– If another's, then invalidate and/or respond
• Simplified directory (at memory)
– Purpose: Allows masks to be wrong (explained later)
Predicting Masks
[Diagram: a Mask Predictor maps a block address to a predicted mask and is updated by feedback]
• Performed at Requesting Processor
– Include owner (GETS/GETX) & all sharers (GETX only)
– Exclude most other processors
• Techniques
– Many straightforward cases (e.g., stack, code, space-sharing)
– Many options (network load, PC, software, local/global)
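A sketch of one plausible predictor in this mold: remember the last observed sharers per block and always include yourself. This "sticky last-sharers" policy is an illustrative assumption, not the ISCA '99 design:

```python
# Indexed by block address; predicts a bit-vector multicast mask and
# is trained by feedback about who actually needed the transaction.

class MaskPredictor:
    def __init__(self, n_procs, me):
        self.n_procs = n_procs
        self.me = me
        self.table = {}                      # block addr -> bit-vector mask

    def predict(self, addr):
        # Always include ourselves; start from the last observed
        # sharers if we have history for this block.
        return self.table.get(addr, 0) | (1 << self.me)

    def feedback(self, addr, actual_sharers):
        # Training: remember who actually had to see the transaction.
        self.table[addr] = actual_sharers

p = MaskPredictor(n_procs=32, me=0)
assert p.predict(0x100) == 0b1               # no history: just ourselves
p.feedback(0x100, 0b0110)                    # audit says P1, P2 shared it
assert p.predict(0x100) == 0b0111            # next multicast includes them
```

Overpredicting costs bandwidth; underpredicting is what the simplified directory at memory exists to catch, so the predictor can err in either direction safely.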
Implementing an Ordered Multicast Network
• Address Network
– Must create the illusion of total order of multicasts
– May deliver a multicast to destinations at different times
• Wish List
– High throughput for multicasts
– No centralized bottlenecks
– Low latency and cost (~ pipelined broadcast tree)
– ...
• Sample Solutions
– Isotach Networks [Reynolds et al., IEEE TPDS 4/97]
– Indirect Fat Tree [ISCA `99]
– Direct Torus
Indirect Fat Tree [ISCA ‘99]
[Diagram: an indirect fat tree of switches connecting processor/cache (P$) nodes and directory/memory (DM) nodes]
Indirect Fat Tree, cont.
• Basic Idea
– Processors send transactions up to roots
– Roots send transactions down with logical timestamp
– Switches stall transactions to keep in order
– Null transactions sent to avoid deadlock
• Assessment
– Viable & high cross-section bandwidth
– Many "backplane" ASICs means higher cost
– Often stalls transactions
• Want
– Lower cost of direct connections
– Always deliver transactions as soon as possible (ASAP)
– Sacrifice some cross-section bandwidth
Direct 2-D Torus (work in progress)
• Features
– Each processor is switch
– Switches directly connected
– E.g., network of Compaq 21364
[Diagram: 16 switches, numbered 0 through 15, arranged as a 2-D torus]
• Network order?
– Broadcasts unordered
– Snooping needs total order
• Solution
– Create order with logical timestamps
instead of network delivery order
– Called Timestamp Snooping [ASPLOS ‘00]
Timestamp Snooping
• Timestamp Snooping
– Snooping with order determined by logical timestamps
– Broadcast (not multicast) in ASPLOS ‘00
• Basic Idea
– Assign timestamp to coherence transactions at sender
– Broadcast transactions over unordered network ASAP
– Transactions carry timestamps (2 bits)
– Processors process transactions in timestamp order
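The basic idea can be sketched with a priority queue; the tiny, wrap-free timestamps here are a simplification of the 2-bit wrapping scheme the slide mentions:

```python
import heapq

# Transactions arrive over an unordered network in any delivery order,
# but each carries a logical timestamp assigned at the sender, and
# every processor processes them in timestamp order.

def process_in_ts_order(arrivals):
    """arrivals: (timestamp, transaction) pairs in network delivery order."""
    pq = []
    for ts, txn in arrivals:
        heapq.heappush(pq, (ts, txn))
    # In the real protocol a processor may only pop once the network
    # guarantees no smaller timestamp is still in flight; here we
    # assume all transactions have already arrived.
    return [heapq.heappop(pq)[1] for _ in range(len(pq))]

# Two processors see the transactions in different delivery orders...
order_at_p0 = [(2, "P2:GETX"), (1, "P1:GETX")]
order_at_p1 = [(1, "P1:GETX"), (2, "P2:GETX")]
# ...but both process them in the same logical (timestamp) order.
assert process_in_ts_order(order_at_p0) == process_in_ts_order(order_at_p1)
assert process_in_ts_order(order_at_p0) == ["P1:GETX", "P2:GETX"]
```

This is how snooping's total order survives without a bus: the order lives in the timestamps, not in the network's delivery order.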
Timestamp Snooping Issues
• More address bandwidth
– For 16-processors, 4-ary butterfly, 64-byte blocks
– Directory: 3*8 + 3*72 + more = 240 + more
– Timestamp Snooping 21*8 + 3*72 = 384 (< 60% more)
• Network must guarantee timestamps
– Assert future transactions will have greater timestamps
(so processors can process older transactions)
– Isotach networks [Reynolds IEEE TPDS 4/97] do this more aggressively
• Other
– Priority queue at processor to order transactions
– Flow control and buffering issues
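The address-bandwidth comparison above checks out; the per-message byte counts are the slide's own figures (8-byte address messages, 72-byte data messages):

```python
# 16 processors, 4-ary butterfly, 64-byte blocks.

directory = 3 * 8 + 3 * 72      # 240 bytes, "+ more" for extra hops
timestamp = 21 * 8 + 3 * 72     # broadcast needs 21 address messages

extra = (timestamp - directory) / directory
assert directory == 240 and timestamp == 384
assert abs(extra - 0.60) < 1e-12   # 60% over 240, so < 60% over "240 + more"
```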
Initial Multifacet Results
• Multicast Snooping [ISCA ‘99]
– Ordered multicast of coherence transactions
– Find data directly from memory or caches
– Reduce bandwidth to permit some scaling
– 32-processor results show 2-6 destinations per multicast
• Timestamp Snooping [ASPLOS ‘00]
– Broadcast snooping with “order” determined by
logical timestamps carried by coherence transactions
– No bus: Allows arbitrary memory interconnects
– No directory or directory indirection
– 16-processor results show 25% faster for 25% more traffic
Selected Issues
• Multicast Snooping
– What program property are mask predictors exploiting?
– Why is there no good model of locality
or the “90-10” rule in general?
– How does one build multicast networks?
– What about fault tolerance?
• Timestamp Snooping
– What is an optimal network topology?
– What about buffering, deadlock, etc.?
– Implementing switches and priority queues?
Outline
• Motivation
• System Area Networks
• Designing Multiprocessor Servers
• Server & Cluster Trends
– Out-of-box and highly-available servers
– High-performance communication for clusters
Multiprocessor Servers
• High-Performance Communication “within box”
– SMPs (e.g., Intel PentiumPro Quads)
– Directory-based (SGI Origin2000)
• Trend toward hierarchical “out of box” solutions
– Build bigger servers from smaller ones
– Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)
[Diagram: four SMPs joined hierarchically into one larger server]
Multiprocessor Servers, cont.
• Traditionally had poor error isolation
– Double-bit ECC error crashes everything
– Kernel error crashes everything
– Poor match for highly available Internet infrastructure
• Improve error isolation
– IBM 370 “virtual machines”
– Stanford HIVE “cells”
Clusters
• Traditionally
– Good error isolation
– Poor communication performance (especially latency)
– LANs are not optimized for clusters
• Enter Early SANs
– Berkeley NOW w/ Myricom Myrinet
– IBM SP w/ proprietary network
• What now with InfiniBand SAN (or alternatives)?
A Prediction
• Blurring of cluster & server boundaries
• Clusters
– High communication performance
• Servers
– Better error isolation
– Multi-box solutions
• Use same hardware & configure in the field
• Issues
– How do we model these hybrids?
– Should PODC & SPAA also converge?
Three Questions
• What is a System Area Network (SAN)
and how will it affect clusters?
– E.g., InfiniBand
– Make computation, storage, & network orthogonal
• How fat will multiprocessor servers be
and how do we build larger ones?
– Varying sizes for soft & hard state
– E.g., Multicast Snooping & Timestamp Snooping
• Future of multiprocessor servers & clusters?
– Servers will support higher availability & extra-box solutions
– Clusters will get server communication performance