CSE 431
Computer Architecture
Fall 2005
Lecture 27. Network Connected Multi’s
Mary Jane Irwin ( www.cse.psu.edu/~mji )
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Review: Bus Connected SMPs (UMAs)
[Diagram: four processors, each with its own cache, sharing a single bus to memory and I/O]
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware for cache coherence and process synchronization
- Bus traffic and bandwidth limit scalability (< ~36 processors)
Review: Multiprocessor Basics
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?
                                              # of Proc
  Communication model   Message passing       8 to 2048
                        Shared address NUMA   8 to 256
                        Shared address UMA    2 to 64
  Physical connection   Network               8 to 256
                        Bus                   2 to 36
Network Connected Multiprocessors
[Diagram: processors, each with a cache and a local memory, connected through an Interconnection Network (IN)]
- Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message passing communication via sends and receives
- Interconnection network supports interprocessor communication
Summing 100,000 Numbers on 100 Processors
- Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel

  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + Al[i];                     /* sum local array subset */
- The processors then coordinate in adding together the sub sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value); a runnable simulation sketch follows below

  half = 100;
  limit = 100;
  repeat
    half = (half+1)/2;                              /* dividing line */
    if (Pn >= half && Pn < limit) send(Pn-half, sum);
    if (Pn < (limit/2)) sum = sum + receive();
    limit = half;
  until (half == 1);                                /* final sum in P0's sum */
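The loop above relies on message-passing primitives that the slides only name. Below is a minimal sequential C sketch, not from the slides, that simulates the same send/receive reduction for 100 logical processors, using a per-processor mailbox array in place of real messages; the names NPROC and mailbox and the Pn+1 starting values are illustrative assumptions.

  #include <stdio.h>

  #define NPROC 100   /* assumed number of logical processors */

  int main(void) {
      long sum[NPROC], mailbox[NPROC];
      int half, limit;

      /* each "processor" Pn starts with an illustrative local partial sum */
      for (int Pn = 0; Pn < NPROC; Pn++)
          sum[Pn] = Pn + 1;

      half = NPROC;
      limit = NPROC;
      do {
          half = (half + 1) / 2;                    /* dividing line */
          for (int Pn = 0; Pn < NPROC; Pn++)        /* senders "send" their sums */
              if (Pn >= half && Pn < limit)
                  mailbox[Pn - half] = sum[Pn];
          for (int Pn = 0; Pn < limit / 2; Pn++)    /* receivers accumulate */
              sum[Pn] = sum[Pn] + mailbox[Pn];
          limit = half;
      } while (half != 1);

      /* final sum ends up in P0's sum, as the slide notes */
      printf("P0 sum = %ld (expected %d)\n", sum[0], NPROC * (NPROC + 1) / 2);
      return 0;
  }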
An Example with 10 Processors
[Diagram: ten processors P0 through P9, each holding a local partial sum; the first reduction step uses half = 10]
Communication in Network Connected Multi’s
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data by its address when it is demanded than to send it ahead in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
Cache Coherency in NUMAs
- For performance reasons we want to allow the shared data to be stored in caches
- Once again we have multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols (a sketch of a directory entry follows below)
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (sharing status of a block always in a single known location) to reduce contention
  - directory controller sends explicit commands over the IN to each processor that has a copy of the data
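As a concrete illustration of the directory idea, here is a minimal C sketch, not from the slides, of one directory entry: a coherence state plus a sharer bit vector for up to 64 processors. The names (dir_entry, sharers, the three states) and the 64-node size are illustrative assumptions.

  #include <stdio.h>
  #include <stdint.h>

  enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

  struct dir_entry {
      enum dir_state state;   /* coherence state of the memory block */
      uint64_t sharers;       /* bit i set => processor i has a cached copy */
  };

  int main(void) {
      struct dir_entry e = { SHARED, 0 };

      /* processors 3 and 17 read the block: record them as sharers */
      e.sharers |= (1ULL << 3) | (1ULL << 17);

      /* processor 3 wants to write: the directory sends explicit invalidate
         commands over the IN to every other sharer, then hands the block
         over in the EXCLUSIVE state */
      for (int p = 0; p < 64; p++)
          if ((e.sharers & (1ULL << p)) && p != 3)
              printf("send invalidate to processor %d\n", p);
      e.sharers = 1ULL << 3;
      e.state   = EXCLUSIVE;

      printf("block now EXCLUSIVE at processor 3\n");
      return 0;
  }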
IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) – represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum # of messages transmitted per unit time
  - # routing hops worst case, congestion control and delay
Bus IN
[Diagram legend: bidirectional network switch, processor node]
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
- NB = link (bus) bandwidth * 1
- BB = link (bus) bandwidth * 1
Ring IN
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * 2 (a sketch that counts the crossing links follows below)
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
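To see where the BB = 2 figure comes from, here is a small C sketch, not from the slides, that cuts an N-node ring into two equal halves and counts the links that cross the cut; the ring size N = 8 is an arbitrary illustrative choice.

  #include <stdio.h>

  #define N 8   /* illustrative ring size */

  int main(void) {
      /* cut the ring into nodes 0..N/2-1 and nodes N/2..N-1 */
      int crossing = 0;
      for (int i = 0; i < N; i++) {
          int j = (i + 1) % N;                   /* ring link i -- j */
          int side_i = (i < N / 2), side_j = (j < N / 2);
          if (side_i != side_j) crossing++;      /* link crosses the dividing line */
      }
      printf("links crossing the bisection of an %d-node ring: %d\n", N, crossing);
      return 0;   /* prints 2, so BB = 2 * link bandwidth */
  }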
Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
- NB = link bandwidth * (N*(N-1))/2
- BB = link bandwidth * (N/2)^2
Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
- N simultaneous transfers
- NB = link bandwidth * N
- BB = link bandwidth * N/2
Hypercube (Binary N-cube) Connected IN
[Diagrams: 2-cube and 3-cube]
- N processors, N switches, logN links/switch, (N logN)/2 links (see the link-count sketch below)
- N simultaneous transfers
- NB = link bandwidth * (N logN)/2
- BB = link bandwidth * N/2
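As a quick check of the link count, here is a small C sketch, not from the slides, that builds a 3-cube by linking node IDs that differ in exactly one bit and compares the total against the (N logN)/2 formula above; the choice of a 3-cube is illustrative.

  #include <stdio.h>

  int main(void) {
      int n = 3, N = 1 << n;                         /* 3-cube: N = 8 nodes */
      int links = 0;
      for (int a = 0; a < N; a++)
          for (int b = a + 1; b < N; b++) {
              int diff = a ^ b;                      /* bit positions where a and b differ */
              if (diff && (diff & (diff - 1)) == 0)  /* exactly one differing bit => a link */
                  links++;
          }
      printf("3-cube links: %d (formula (N logN)/2 = %d)\n", links, N * n / 2);
      return 0;   /* both print 12 */
  }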
2D and 3D Mesh/Torus Connected IN
- N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links
- N simultaneous transfers
- NB = link bandwidth * 4N or link bandwidth * 6N
- BB = link bandwidth * 2 N^(1/2) or link bandwidth * 2 N^(2/3)
Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
- NB = link bandwidth * N logN
- BB = link bandwidth * 4
Fat Tree
- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
[Diagram: binary tree network with leaf nodes A, B, C, and D]
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth on a tree is horrible - 1 link, at all times
- The solution is to 'thicken' the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth
  - Rather than design a bunch of N-port switches, use pairs
SGI NUMAlink Fat Tree
www.embedded-computing.com/articles/woodacre
IN Comparison
- For a 64 processor system (the sketch below fills in the remaining rows using the preceding formulas)

                     Network     Bisection   Total # of   Links per   Total # of
                     bandwidth   bandwidth   switches     switch      links
  Bus                1           1           1                        1
  Ring
  Torus
  6-cube
  Fully connected
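One way to fill in the remaining rows is to plug N = 64 into the NB and BB formulas from the preceding topology slides. The C sketch below, not from the slides, does exactly that (all bandwidths are in multiples of the link bandwidth); treating the Torus row as an 8x8 2D torus is an assumption, and the links-per-switch column is omitted.

  #include <stdio.h>

  int main(void) {
      const int N = 64, logN = 6, rootN = 8;   /* 64 processors: 6-cube, 8x8 torus */

      /* NB, BB, switch and link counts taken from the formulas on the slides above */
      struct { const char *name; int nb, bb, switches, links; } t[] = {
          { "Bus",             1,                1,              1, 1               },
          { "Ring",            N,                2,              N, N               },
          { "2D torus",        4 * N,            2 * rootN,      N, 4 * N / 2       },
          { "6-cube",          N * logN / 2,     N / 2,          N, N * logN / 2    },
          { "Fully connected", N * (N - 1) / 2,  (N/2) * (N/2),  N, N * (N - 1) / 2 },
      };

      printf("%-16s %8s %8s %12s %10s\n", "Topology", "NB", "BB", "# switches", "# links");
      for (int i = 0; i < 5; i++)
          printf("%-16s %8d %8d %12d %10d\n",
                 t[i].name, t[i].nb, t[i].bb, t[i].switches, t[i].links);
      return 0;
  }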
Network Connected Multiprocessors
                   Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
  SGI Origin       R16000                        128        fat tree                      800
  Cray 3TE         Alpha 21164      300MHz       2,048      3D torus                      600
  Intel ASCI Red   Intel            333MHz       9,632      mesh                          800
  IBM ASCI White   Power3           375MHz       8,192      multistage Omega              500
  NEC ES           SX-5             500MHz       640*8      640-xbar                      16000
  NASA Columbia    Intel Itanium2   1.5GHz       512*20     fat tree, Infiniband
  IBM BG/L         Power PC 440     0.7GHz       65,536*2   3D torus, fat tree, barrier
IBM BlueGene
                   512-node proto            BlueGene/L
  Peak Perf        1.0 / 2.0 TFlops/s        180 / 360 TFlops/s
  Memory Size      128 GByte                 16 / 32 TByte
  Foot Print       9 sq feet                 2500 sq feet
  Total Power      9 KW                      1.5 MW
  # Processors     512 dual proc             65,536 dual proc
  Networks         3D Torus, Tree, Barrier   3D Torus, Tree, Barrier
  Torus BW         3 B/cycle                 3 B/cycle
A BlueGene/L Chip
[Chip block diagram: two 700 MHz PowerPC 440 CPUs, each with a Double FPU and 32K/32K L1 caches; 2KB L2s; a shared 4MB ECC eDRAM L3 (128B lines, 8-way associative); a 16KB multiport SRAM buffer; link interfaces for the 3D torus (6 in, 6 out, 1.4Gb/s links, 1.6GHz), the fat tree (3 in, 3 out, 2.8Gb/s links, 350MHz), 4 global barriers, and Gbit Ethernet; and a 144b DDR controller to 256MB of external DDR at 5.5GB/s. On-chip datapath bandwidths of 5.5GB/s and 11GB/s are annotated in the figure.]
Networks of Workstations (NOWs) Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces
- Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications
- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs
Commercial (NOW) Clusters
                   Proc             Proc Speed   # Proc     Network
  Dell PowerEdge   P4 Xeon          3.06GHz      2,500      Myrinet
  eServer IBM SP   Power4           1.7GHz       2,944
  VPI BigMac       Apple G5         2.3GHz       2,200      Mellanox Infiniband
  HP ASCI Q        Alpha 21264      1.25GHz      8,192      Quadrics
  LLNL Thunder     Intel Itanium2   1.4GHz       1,024*4    Quadrics
  Barcelona        PowerPC 970      2.2GHz       4,536      Myrinet
Summary
- Flynn's classification of processors - SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis – UMAs and NUMAs
  - Scalability of bus connected UMAs limited (< ~36 processors)
  - Network connected NUMAs more scalable
- Interconnection Networks (INs)
  - fully connected, xbar
  - ring
  - mesh
  - n-cube, fat tree
- Message passing multis
- Cluster connected (NOWs) multis
Next Lecture and Reminders
- Next lecture
  - Reading assignment – PH 9.7
- Reminders
  - HW5 (and last) due Dec 6th (Part 1)
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam schedule (tentative)
    - Tuesday, December 13th, 2:30-4:20, 22 Deike