Network Connected Multiprocessors (.ppt)

Computer Architecture
Fall 2008
Lecture 26: Network Connected
Multiprocessors
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Plus, slides from Chapter 18 Parallel Processing by Stallings
Review: Bus Connected SMPs (UMAs)
[Diagram: four processors, each with its own cache, connected by a single bus to shared memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process synchronization
• Bus traffic and bandwidth limit scalability (to roughly 36 processors)
Network Connected Multiprocessors
• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors?
                                              # of Proc
Communication model    Message passing        8 to 2048
                       Shared address, NUMA   8 to 256
                       Shared address, UMA    2 to 64
Physical connection    Network                8 to 256
                       Bus                    2 to 36
Network Connected Multiprocessors
[Diagram: processors, each with a cache and local memory, connected through an Interconnection Network (IN)]
• Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message passing communication via sends and receives
• The interconnection network supports interprocessor communication
Summing 100,000 Numbers on 10 Processors
• Start by distributing 10,000 elements of vector A to each of the local memories and summing each subset in parallel

sum = 0;
for (i = 0; i < 10000; i = i + 1)
    sum = sum + Al[i];                    /* sum local array subset */
• The processors then coordinate in adding together the sub-sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value)

half = 10;
limit = 10;
repeat
    half = (half+1)/2;                            /* dividing line between senders and receivers */
    if (Pn >= half && Pn < limit) send(Pn-half, sum);
    if (Pn < (limit/2)) sum = sum + receive();
    limit = half;
until (half == 1);                                /* final sum in P0's sum */
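
For reference, a minimal sketch of the same reduction written with MPI (MPI, the rank/communicator boilerplate, and the A_local array name are illustrative assumptions, not part of the original slides); each of the 10 ranks is assumed to already hold its 10,000-element subset:

/* Sketch only: an MPI version of the tree reduction above. MPI_Send and
 * MPI_Recv stand in for the send()/receive() pseudocode primitives. */
#include <mpi.h>

double tree_sum(const double *A_local, int n, MPI_Comm comm)
{
    int Pn, limit, half;
    double sum = 0.0, partial;

    MPI_Comm_rank(comm, &Pn);        /* this processor's number */
    MPI_Comm_size(comm, &limit);     /* 10 processors => limit = 10 */

    for (int i = 0; i < n; i = i + 1)         /* sum local array subset */
        sum = sum + A_local[i];

    half = limit;
    do {
        half = (half + 1) / 2;                /* dividing line */
        if (Pn >= half && Pn < limit)         /* upper half sends ...      */
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, comm);
        if (Pn < limit / 2) {                 /* ... lower half receives and adds */
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, comm,
                     MPI_STATUS_IGNORE);
            sum = sum + partial;
        }
        limit = half;
    } while (half != 1);                      /* final sum ends up in P0's sum */

    return sum;
}

In practice the whole loop could be replaced by a single MPI_Reduce call; the explicit sends and receives are kept only to mirror the slide's pseudocode.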
An Example with 10 Processors
[Diagram: the reduction tree for 10 processors. Initially each of P0–P9 holds a partial sum (half = 10, limit = 10). With half = 5, P5–P9 send their sums to P0–P4, which receive and add them (limit = 5). With half = 3, P3 and P4 send to P0 and P1 (limit = 3). With half = 2, P2 sends to P0 (limit = 2). With half = 1, P1 sends to P0, which then holds the final sum.]
Communication in Network Connected Multi's
• Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to use an address to fetch remote data when it is demanded rather than to send the data in case it might be used (such a machine has distributed shared memory (DSM))
• Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
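
To make the contrast concrete, here is a minimal sketch of the implicit model (the explicit model is the send/receive code shown earlier). Two C11 threads stand in for two processors sharing one address space; the threads, the atomics, and all names here are illustrative assumptions, not from the slides:

/* Sketch: implicit communication through a single address space.
 * The producer stores, the consumer loads; on a real shared-address
 * machine the coherence hardware moves the data between caches. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static double shared_value;               /* data communicated implicitly */
static atomic_int ready;                  /* synchronization flag */

static int producer(void *arg)
{
    (void)arg;
    shared_value = 42.0;                                /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                               /* spin until published */
    printf("received %f via a load\n", shared_value);   /* ordinary load */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}

The data itself moves through ordinary loads and stores; only the ready flag is explicit synchronization, which is why this model has lower communication overhead but needs coherence and synchronization support in hardware.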
Cache Coherency in NUMAs
• For performance reasons we want to allow the shared data to be stored in caches
• Once again we have multiple copies of the same data with the same address in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
• Directory-based protocols
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (sharing status of a block always in a single known location) to reduce contention
  - directory controller sends explicit commands over the IN to each processor that has a copy of the data
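
As a rough illustration of the directory state described above, a minimal sketch in C of one directory entry; the field names, the 64-processor limit, and the handler are hypothetical, not the format of any particular machine:

/* Sketch of a directory entry for one block of main memory (all names
 * and sizes here are illustrative assumptions). */
#include <stdint.h>

#define MAX_PROCS 64                           /* assumed machine size */

typedef enum { UNCACHED, SHARED, EXCLUSIVE } block_state_t;

typedef struct {
    block_state_t state;      /* uncached, shared (clean copies), or exclusive (dirty) */
    uint64_t      presence;   /* bit i set => processor i's cache has a copy */
} dir_entry_t;

/* Hypothetical stub: the directory controller would send this command
 * over the IN to one processor's cache. */
static void send_invalidate(int proc) { (void)proc; }

/* On a write miss by processor p: invalidate every other cached copy,
 * then record p as the exclusive (dirty) owner of the block. */
static void handle_write_miss(dir_entry_t *e, int p)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (i != p && (e->presence & (1ULL << i)))
            send_invalidate(i);
    e->presence = 1ULL << p;
    e->state    = EXCLUSIVE;
}

Distributing these entries across the nodes (each block's home node holds its entry) keeps the sharing status in a single known location while avoiding a central bottleneck.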
IN (Interconnection Network) Performance Metrics
• Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
• Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
• Bisection bandwidth (BB) – represents the worst case
  - divide the machine in two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
• Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum # of messages transmitted per unit time
  - worst-case # of routing hops, congestion control and delay
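
The two bandwidth definitions are easy to state in code. A small sketch (not from the slides; the link-list and half-marking representations are assumptions) that computes NB and, for one chosen split, the bandwidth crossing the dividing line:

/* Sketch: the NB and BB definitions above, for an arbitrary topology
 * described as a list of bidirectional links. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int a, b; double bw; } link_t;    /* bidirectional link a<->b */

/* NB (best case): sum of the bandwidth of every link. */
double network_bandwidth(const link_t *links, size_t nlinks)
{
    double nb = 0.0;
    for (size_t i = 0; i < nlinks; i++)
        nb += links[i].bw;
    return nb;
}

/* Bandwidth crossing one dividing line: in_half_a[node] is true for the
 * nodes in one half; sum the links whose endpoints fall on opposite sides. */
double cut_bandwidth(const link_t *links, size_t nlinks, const bool *in_half_a)
{
    double bb = 0.0;
    for (size_t i = 0; i < nlinks; i++)
        if (in_half_a[links[i].a] != in_half_a[links[i].b])
            bb += links[i].bw;
    return bb;
}

The bisection bandwidth is the minimum of cut_bandwidth() over all splits into equal halves; for a ring of N nodes with unit-bandwidth links that minimum is 2, matching the Ring IN slide.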
Bus IN
[Diagram: bus IN; legend distinguishes bidirectional network switches from processor nodes]
• N processors, 1 switch, 1 link (the bus)
• Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
Ring IN
• N processors, N switches, 2 links/switch, N links
• N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2
• If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
Fully Connected IN
• N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
• N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2
Crossbar (Xbar) Connected IN
• N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
• N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2
Hypercube (Binary N-cube) Connected IN
[Diagram: 2-cube and 3-cube examples]
• N processors, N switches, logN links/switch, (NlogN)/2 links
• N simultaneous transfers
  - NB = link bandwidth * (NlogN)/2
  - BB = link bandwidth * N/2
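
A handy property behind the logN links/switch figure: in a binary n-cube, a node is linked to exactly the nodes whose labels differ from its own in one bit, so its neighbors can be enumerated by flipping each bit in turn. A minimal sketch (not from the slides):

/* Sketch: enumerating a hypercube node's neighbors.
 * In an n-cube (N = 2^n nodes), node k is linked to the n nodes whose
 * labels differ from k in exactly one bit position. */
#include <stdio.h>

static void print_neighbors(unsigned node, unsigned n /* = log2(N) */)
{
    for (unsigned bit = 0; bit < n; bit++)
        printf("node %u <-> node %u\n", node, node ^ (1u << bit));
}

int main(void)
{
    print_neighbors(5, 3);   /* node 5 (binary 101) in a 3-cube: neighbors 4, 7, 1 */
    return 0;
}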
2D and 3D Mesh/Torus Connected IN
• N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links
• N simultaneous transfers
  - NB = link bandwidth * 4N   or   link bandwidth * 6N
  - BB = link bandwidth * 2N^(1/2)   or   link bandwidth * 2N^(2/3)
Fat Tree
• Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
[Diagram: a binary tree network with leaf nodes A, B, C, and D]
• Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth on a tree is horrible – 1 link, at all times
• The solution is to "thicken" the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth
  - Rather than design a bunch of N-port switches, use pairs
Fat Tree
• N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
• N simultaneous transfers
  - NB = link bandwidth * NlogN
  - BB = link bandwidth * 4
SGI NUMAlink Fat Tree
www.embedded-computing.com/articles/woodacre
IN Comparison
• For a 64 processor system (the blank cells can be filled in from the formulas on the preceding slides; see the sketch below)

                      Bus   Ring   Torus   6-cube   Fully connected
Network bandwidth      1
Bisection bandwidth    1
Total # of switches    1
Links per switch
Total # of links       1
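
As a check on the blank columns, a small sketch (not from the slides) that plugs N = 64 into the NB and BB formulas from the preceding topology slides, in units of a single link's bandwidth:

/* Sketch: NB and BB for a 64-processor system, using the formulas from
 * the topology slides above (results are in units of link bandwidth). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N = 64.0, logN = log2(N);

    printf("%-16s %10s %10s\n", "Topology", "NB", "BB");
    printf("%-16s %10.0f %10.0f\n", "Bus",             1.0,               1.0);
    printf("%-16s %10.0f %10.0f\n", "Ring",            N,                 2.0);
    printf("%-16s %10.0f %10.0f\n", "2D torus",        4.0 * N,           2.0 * sqrt(N));
    printf("%-16s %10.0f %10.0f\n", "6-cube",          N * logN / 2.0,    N / 2.0);
    printf("%-16s %10.0f %10.0f\n", "Fully connected", N * (N - 1) / 2.0, (N / 2.0) * (N / 2.0));
    return 0;
}

For N = 64 this gives NB/BB of 64/2 for the ring, 256/16 for the 2D torus, 192/32 for the 6-cube, and 2016/1024 for the fully connected network (the bus stays at 1/1).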
Network Connected Multiprocessors

                 Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
SGI Origin       R16000                        128        fat tree                      800
Cray T3E         Alpha 21164      300MHz       2,048      3D torus                      600
Intel ASCI Red   Intel            333MHz       9,632      mesh                          800
IBM ASCI White   Power3           375MHz       8,192      multistage Omega              500
NEC ES           SX-5             500MHz       640*8      640-xbar                      16000
NASA Columbia    Intel Itanium2   1.5GHz       512*20     fat tree, Infiniband
IBM BG/L         Power PC 440     0.7GHz       65,536*2   3D torus, fat tree, barrier
IBM BlueGene

                512-node proto        BlueGene/L
Peak Perf       1.0 / 2.0 TFlops/s    180 / 360 TFlops/s
Memory Size     128 GByte             16 / 32 TByte
Foot Print      9 sq feet             2500 sq feet
Total Power     9 KW                  1.5 MW
# Processors    512 dual proc         65,536 dual proc
Networks        3D Torus, Tree,       3D Torus, Tree,
                Barrier               Barrier
Torus BW        3 B/cycle             3 B/cycle
A BlueGene/L Chip
[Block diagram: two 700 MHz PowerPC 440 CPUs, each with 32K/32K L1 caches, a double FPU, and a 2KB L2, sharing a 4MB ECC eDRAM L3 (128B lines, 8-way associative) and a 16KB multiport SRAM buffer; on-chip interfaces to the 3D torus (6 in, 6 out, 1.4Gb/s links), the fat tree (3 in, 3 out, 2.8Gb/s links), the barrier network (4 global barriers), Gbit Ethernet, and a 144b DDR controller to 256MB at 5.5GB/s]
Networks of Workstations (NOWs) Clusters
• Clusters of off-the-shelf, whole computers with multiple private address spaces
• Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
• Clusters of N processors have N copies of the OS, limiting the memory available for applications
• Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
• Economy-of-scale advantages with respect to costs
Commercial (NOW) Clusters

                 Proc             Proc Speed   # Proc    Network
Dell PowerEdge   P4 Xeon          3.06GHz      2,500     Myrinet
eServer IBM SP   Power4           1.7GHz       2,944
VPI BigMac       Apple G5         2.3GHz       2,200     Mellanox Infiniband
HP ASCI Q        Alpha 21264      1.25GHz      8,192     Quadrics
LLNL Thunder     Intel Itanium2   1.4GHz       1,024*4   Quadrics
Barcelona        PowerPC 970      2.2GHz       4,536     Myrinet
Summary
• Flynn's classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
• Shared address multis – UMAs and NUMAs
  - Scalability of bus connected UMAs limited (< ~36 processors)
  - Network connected NUMAs more scalable
• Interconnection Networks (INs)
  - fully connected, xbar
  - ring
  - mesh
  - n-cube, fat tree
• Message passing multis
• Cluster connected (NOWs) multis
Next Lecture and Reminders
• Next lecture
  - CMPs (Chip Multiprocessors) & SMTs (Simultaneous Multithreading)
• Reminders
  - Final is Thursday, December 18 from 10-11:50 AM in ITT 328